Wikipedia:New histmerge list

I (Mikaey) have written a bot (which I call "AarghBot") to compile a list of potential cut-and-paste moves. How do I have a bot determine whether or not a cut-and-paste move has occurred? Here's how:

  1. The bot starts by going through the most recent database dump and looking for redirects. At this time, it is only looking at article space.
  2. The bot looks through the page history, starting with the most recent edit, to see if the article was ever not a redirect. If the page history shows that it has only ever been a redirect, it skips the page. However, if the article was not a redirect at some point in its history, the bot makes a note of the timestamp when the article was turned into a redirect (we'll call this the predate), as well as the text of the page before being turned into a redirect (the pretext).
  3. The bot looks at the target page of the redirect (the target). If the target is also a redirect, the bot skips the page and moves on to the next redirect in the list.
  4. The bot looks at the page history of the target (starting with the first edit), and searches until it finds a version of the page that is not a redirect. The bot makes a note of the timestamp of this version (the postdate), and the text of this version of the page (the posttext). If the predate and the postdate are not within 24 hours of each other, the bot skips the page and moves on to the next redirect in the list.
  5. The bot performs a diff of the pretext and the posttext. If the number of lines that changed is more than 10% of the number of lines in the pretext, the bot skips the page and moves on to the next redirect in the list.
  6. At this point, the bot records the information to a log file on my computer. The contents of the log file are then uploaded to the list here.
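
Steps 2 through 4 can be sketched roughly as follows. This is a minimal illustration, not AarghBot's actual code: the `Revision` type, the function names, and the simplified `#REDIRECT` pattern are all assumptions made for the example, and each page history is assumed to be available as a list ordered oldest first.

```python
import re
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Tuple


class Revision(NamedTuple):
    """Hypothetical container for one revision of a page."""
    timestamp: datetime
    text: str


# Simplified redirect pattern (an assumption); real wikitext allows more variants.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(text: str) -> Optional[str]:
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(text)
    return match.group(1).strip() if match else None


def find_pre(history: List[Revision]) -> Optional[Tuple[datetime, str]]:
    """Step 2: walk the history newest first and return (predate, pretext).

    predate is the timestamp of the edit that turned the page into a redirect;
    pretext is the text of the revision just before it. Returns None if the
    page has only ever been a redirect or is not currently a redirect.
    """
    newer = None
    for rev in reversed(history):
        if redirect_target(rev.text) is None:
            if newer is None:
                return None  # the latest revision is not a redirect
            return newer.timestamp, rev.text
        newer = rev
    return None  # every revision was a redirect


def find_post(history: List[Revision]) -> Optional[Tuple[datetime, str]]:
    """Step 4: walk the target's history oldest first and return
    (postdate, posttext) for the first revision that is not a redirect."""
    for rev in history:
        if redirect_target(rev.text) is None:
            return rev.timestamp, rev.text
    return None


def within_window(predate: datetime, postdate: datetime,
                  window: timedelta = timedelta(hours=24)) -> bool:
    """Step 4 filter: keep the pair only if the two timestamps are close."""
    return abs(predate - postdate) <= window
```

A pair that passes `within_window` would then go on to the diff comparison in step 5.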

In these lists, you will see a diff score for each entry. The diff score is computed as the number of lines that changed between the pretext and the posttext, divided by the total number of lines in the pretext. The diff score is a measure of uncertainty that the two articles in question are based on the same text: a higher diff score means it is less likely that the two articles share the same text, while a lower diff score means it is more likely. A diff score of zero indicates that the two texts are identical apart from differences in whitespace and letter case. Note that, when the diff is performed, empty lines are stripped from both texts, and the comparison is case-insensitive and whitespace-insensitive.
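
As an illustration, a diff score along these lines could be computed as below. This is only a sketch under stated assumptions: the page does not say which diff algorithm the bot uses or exactly how "lines that changed" are counted, so the example uses Python's `difflib.SequenceMatcher`, counts the larger side of each non-matching block, and treats "whitespace-insensitive" as removing all whitespace within a line. The names `normalize` and `diff_score` are hypothetical.

```python
import difflib
from typing import List


def normalize(text: str) -> List[str]:
    """Drop empty lines, and make each line case- and whitespace-insensitive."""
    lines = []
    for line in text.splitlines():
        squashed = "".join(line.split()).lower()  # remove all whitespace
        if squashed:
            lines.append(squashed)
    return lines


def diff_score(pretext: str, posttext: str) -> float:
    """Changed lines divided by the number of (non-empty) lines in pretext."""
    pre = normalize(pretext)
    post = normalize(posttext)
    if not pre:
        # Assumption: an empty pretext only matches an empty posttext.
        return 0.0 if not post else float("inf")
    matcher = difflib.SequenceMatcher(None, pre, post, autojunk=False)
    changed = sum(
        max(i2 - i1, j2 - j1)  # assumed way of counting a changed block
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )
    return changed / len(pre)


def is_candidate(pretext: str, posttext: str, threshold: float = 0.10) -> bool:
    """Step 5 filter: skip pairs where more than 10% of the lines changed."""
    return diff_score(pretext, posttext) <= threshold
```

For example, two texts that differ only in blank lines, spacing, and capitalization score zero, while changing one line of a ten-line pretext gives a score of 0.1, which is still within the 10% cutoff.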

To admins who work on this list: Please feel free to remove any items from the list that you take care of.

To anyone who works on this list: If you come across a false positive, tag the source page with {{nahmc|<destination page>}}, where <destination page> is the page in the destination column of the report. This will cause the bot to ignore that particular match on the next run. ("nahmc" = "Not A HistMerge Candidate".)
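
For illustration only, a check for this tag might look something like the sketch below. How the bot actually detects the tag is not described here, so the regular expression, the title normalization, and the function names are all assumptions.

```python
import re
from typing import Set

# Matches {{nahmc|Destination page}}, ignoring case and surrounding spaces.
NAHMC_RE = re.compile(r"\{\{\s*nahmc\s*\|\s*([^{}|]+?)\s*\}\}", re.IGNORECASE)


def normalize_title(title: str) -> str:
    """Treat underscores as spaces, collapse spacing, capitalize first letter."""
    title = " ".join(title.replace("_", " ").split())
    return title[:1].upper() + title[1:]


def nahmc_targets(wikitext: str) -> Set[str]:
    """Collect every destination page named in a nahmc tag on the source page."""
    return {normalize_title(t) for t in NAHMC_RE.findall(wikitext)}


def is_marked_false_positive(source_wikitext: str, destination: str) -> bool:
    """True if the source page is tagged as not matching this destination."""
    return normalize_title(destination) in nahmc_targets(source_wikitext)
```

Because the tag names a specific destination, only that one source and destination pairing would be skipped; the same source page could still be reported against a different destination.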

The Lists

Each list contains 500 items.



