Lining up to detect text reuse

People say that being copied is one of the greatest honours, but in the world of journalism copying is hard to spot. Researchers say they can now identify which articles are really just rewritten press releases and wire reports. Using a combination of genetic and linguistic analysis, computer-based comparisons will identify plagiarists, and help press agencies track how much their services are used.

“In an age where text is passed around freely, it is important to track originality,” states Dr Rob Gaizauskas from the University of Sheffield. “Examiners and teachers like to know if their students have lifted essays from textbooks; disappointed authors would welcome the chance to prove that the scripts of films are actually taken from the ones they had sent in years before.”

In the context of press journalism, agencies like the Press Association (PA) supply thousands of stories a week to publishers as source material. “Press agencies have a very strong commercial interest in knowing whether or not any given story X that appears in a newspaper is derived directly from a story Y that they put out,” Dr Gaizauskas explains. “If an agency could know this they would be able to plan and charge for their services in a more rational way. They could also focus their journalists’ efforts to cover what the papers are likely to use.”

Leading a team from the Department of Computer Science, Dr Gaizauskas has employed a variety of techniques to compare newspaper articles with PA wire stories. In one approach, the researchers take ‘tiles’ – sequences of words cut from one text – and overlay them onto matching sequences in the comparison text. The amount of unmatched text and the average size of the tiles give clues as to the likelihood of derivation.
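The tiling idea can be illustrated with a minimal sketch. It assumes a simple greedy strategy (repeatedly take the longest run of matching words that has not already been used) and is not the researchers' own implementation; the function names and the `min_len` cut-off are hypothetical choices made for illustration.

```python
def greedy_tiles(source, candidate, min_len=3):
    """Find non-overlapping runs of words shared by both texts.

    Returns a list of (candidate_start, source_start, length) tiles
    and the number of words in the candidate text.
    min_len is an assumed cut-off that ignores very short matches.
    """
    src = source.lower().split()
    cand = candidate.lower().split()
    src_used = [False] * len(src)
    cand_used = [False] * len(cand)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, candidate start, source start)
        for i in range(len(cand)):
            for j in range(len(src)):
                k = 0
                # Extend the match while words agree and are still unused.
                while (i + k < len(cand) and j + k < len(src)
                       and not cand_used[i + k] and not src_used[j + k]
                       and cand[i + k] == src[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_len:
            break
        # Lay the tile down so its words cannot be matched again.
        for k in range(length):
            cand_used[i + k] = True
            src_used[j + k] = True
        tiles.append((i, j, length))
    return tiles, len(cand)


def tiling_stats(source, candidate, min_len=3):
    """Summarise the two clues: unmatched text and average tile size."""
    tiles, n_words = greedy_tiles(source, candidate, min_len)
    covered = sum(length for _, _, length in tiles)
    return {
        "tiles": len(tiles),
        "unmatched_fraction": 1 - covered / n_words if n_words else 1.0,
        "mean_tile_length": covered / len(tiles) if tiles else 0.0,
    }


if __name__ == "__main__":
    wire = "the minister said the new policy would take effect next year"
    paper = "yesterday the minister said the new policy would begin soon"
    print(tiling_stats(wire, paper))
```

A low unmatched fraction combined with long tiles would suggest heavy reuse; many short tiles are more likely to be coincidental overlap. This brute-force search is slow on long texts and is meant only to show the principle.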

Another technique involves ‘lining up’ the two texts using a method originally intended to align words in translated passages with those in the source language copy. The number of aligned sentences and the degree of overlap between aligned sentences again indicate potential reuse.
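A rough sketch of the two alignment measures follows. It does not reproduce the translation-alignment method the researchers used; as a stand-in, it pairs each newspaper sentence with the wire sentence that shares the most words, and the word-set overlap score, the `threshold` value and all function names are assumptions made for illustration.

```python
import re


def sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    """Return the set of lower-cased word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def align(candidate, source, threshold=0.5):
    """Pair each candidate sentence with its best-matching source sentence.

    The score is shared words divided by all distinct words in the pair.
    Pairs scoring below the assumed threshold are left unaligned.
    """
    src = [(s, words(s)) for s in sentences(source)]
    pairs = []
    for c in sentences(candidate):
        cw = words(c)
        best_score, best_sentence = 0.0, None
        for s, sw in src:
            union = cw | sw
            score = len(cw & sw) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_sentence = score, s
        if best_sentence is not None and best_score >= threshold:
            pairs.append((c, best_sentence, best_score))
    return pairs


def alignment_stats(candidate, source, threshold=0.5):
    """Report how many sentences aligned and how strongly they overlap."""
    pairs = align(candidate, source, threshold)
    total = len(sentences(candidate))
    return {
        "aligned_fraction": len(pairs) / total if total else 0.0,
        "mean_overlap": sum(p[2] for p in pairs) / len(pairs) if pairs else 0.0,
    }


if __name__ == "__main__":
    wire = ("The council approved the budget on Monday. "
            "Protesters gathered outside the hall.")
    paper = ("The council approved the budget on Monday night. "
             "Rain is expected later this week.")
    print(alignment_stats(paper, wire))
```

Here the first newspaper sentence aligns with the first wire sentence and the second aligns with nothing, so half of the newspaper text counts as aligned. A real aligner would also take sentence order and length into account, which this sketch ignores.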

Analysing newspaper stories, Dr Gaizauskas classifies each as wholly derived, partially derived or not derived from PA material. The best results to date show that wholly or partially derived stories can be distinguished from non-derived texts with greater than 90% accuracy; the three-way classification is more than 70% accurate.

“To improve these results we will require more sophisticated modelling of journalists’ rewriting techniques,” says Dr Gaizauskas. “For example, we need to take into consideration the use of newspaper house style and vocabulary, and allow more for distractors such as quoted speech, which may occur identically in independently written texts.

“Our work will have commercial significance for our collaborator, the PA, and help it to monitor the penetration of its stories in the press. In the long term, however, this could be developed by publishers to spot plagiarism. With a bit more understanding, the same techniques may even help newspapers write their stories. It could show how original materials could be automatically rewritten to fit house styles and other criteria.” Perhaps journalists will leave the rewrites to computers while they go out and find a scoop.