User:Alanyst/Edit collision research

This is the location for an account of some research I did related to Requests for arbitration/Mantanmoreland.

Aim of the research
The research aims to shed light on the question of whether two Wikipedia user accounts, 'MM' and 'SH', are independent or are controlled by the same individual (sockpuppets). It uses the timestamps of the accounts' edits as a metric for independence.

Initial assumptions
If account A makes an edit at the same time as account B, it is quite likely (though not absolutely certain) that A and B are not controlled by the same individual, because of the relative difficulty of one person making simultaneous edits under two different accounts. The more occurrences of simultaneous edits by A and B, the greater the likelihood that the accounts are independent, as the effort to contrive such occurrences would be significant.

If edit counts for A and B are low, then the chances of independent simultaneous edits (a collision) are likewise low. The odds of a collision grow with the edit counts. Another positive influence on the odds of a collision is a similar pattern of activity between users A and B. If A only edits on Mondays and B only edits on Thursdays, or A only during 3:00-6:00 UTC and B only during 14:00-19:00 UTC, they will never collide no matter how high their edit counts; conversely, shared periods of editing make collisions more likely if edits are independently made.
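As a rough baseline for the reasoning above, the sketch below estimates how many time slots two independent accounts would be expected to share if each spread its edits uniformly at random. The function names, the one-minute slot size, and the edit counts in the example are all illustrative assumptions, and the uniform model deliberately ignores real editing habits, so the numbers are only a reference point.

```python
import math

MINUTES_2007 = 365 * 24 * 60  # one-minute slots in 2007 (not a leap year)

def expected_shared_slots(edits_a, edits_b, slots=MINUTES_2007):
    """Expected number of colliding edit pairs for two independent accounts
    whose edits fall uniformly at random over `slots` time slots."""
    return edits_a * edits_b / slots

def prob_no_collision(edits_a, edits_b, slots=MINUTES_2007):
    """Poisson approximation to the chance that the accounts never collide."""
    return math.exp(-expected_shared_slots(edits_a, edits_b, slots))

# Hypothetical edit counts: collisions become likelier as counts grow.
for n in (100, 500, 1500):
    print(n, round(expected_shared_slots(n, n), 3), round(prob_no_collision(n, n), 3))

# If both accounts only ever edit during the same 4 hours of each day,
# the pool of usable slots shrinks and collisions become likelier still.
shared_window = 365 * 4 * 60
print(round(expected_shared_slots(1500, 1500, shared_window), 3))
```

Shrinking `slots` is a crude stand-in for shared periods of editing. Accounts whose active periods never overlap fall outside this model entirely, since they cannot collide at any edit count.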

Simultaneity
One core notion underpinning this research is the meaning of "simultaneous edits". Our data file has timestamp precision in seconds. With 77443453 edits in 2007, under a uniform distribution this averages to nearly 2.5 edits per second. So, we could define "simultaneous" as "occurring during the same clock second".
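The per-second average can be checked with a few lines of arithmetic. This is only a sketch of the calculation described above: the variable names are mine, and it assumes the uniform spread over the year that the text already stipulates.

```python
# Average edit rate for 2007, assuming edits were spread uniformly over the year.
EDITS_2007 = 77_443_453             # revision count quoted in the text
SECONDS_2007 = 365 * 24 * 60 * 60   # 2007 was not a leap year

edits_per_second = EDITS_2007 / SECONDS_2007
edits_per_minute = edits_per_second * 60

print(f"{edits_per_second:.2f} edits per second")
print(f"{edits_per_minute:.0f} edits per minute")
```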

However, this standard is too strict to yield much useful information. A lapse of five seconds between edits could still be too small to be easily faked by a sockpuppet master. What, then, is a reasonable standard?

I chose to define "simultaneous" to mean "within the same clock minute". This has the merit of easy computation (just ignore the seconds part of the timestamp) but makes the standard less precise. Edits that are nearly two minutes apart (12:00:00 and 12:01:59, for example) and edits that are a mere second apart (12:00:59 and 12:01:00) both truncate to 12:00 and 12:01, so the two pairs look identical and neither counts as simultaneous. And, although 12:00:59 is close to 12:01:00, it is equated with the much more distant 12:00:00 under this scheme.
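The effect of dropping the seconds can be seen by running the example times through a small helper. The function names here are hypothetical and the sketch handles times of day only, not full dates.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def minute_bucket(ts: str) -> str:
    """Drop the seconds from an HH:MM:SS time, keeping HH:MM."""
    return datetime.strptime(ts, FMT).strftime("%H:%M")

def same_minute(a: str, b: str) -> bool:
    """True if both times fall within the same clock minute."""
    return minute_bucket(a) == minute_bucket(b)

def gap_seconds(a: str, b: str) -> float:
    """Actual distance between the two times, in seconds."""
    return abs((datetime.strptime(b, FMT) - datetime.strptime(a, FMT)).total_seconds())

pairs = [
    ("12:00:00", "12:01:59"),  # nearly two minutes apart
    ("12:00:59", "12:01:00"),  # one second apart
    ("12:00:00", "12:00:59"),  # 59 seconds apart, same clock minute
]
for a, b in pairs:
    print(a, b, gap_seconds(a, b), same_minute(a, b))
```

Only the third pair counts as simultaneous under the clock-minute definition, even though the second pair is by far the closest in time.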

Still, this standard is sufficient. Prior work in this case determined that the two accounts in question never edited within 2 minutes of each other, so this standard of simultaneity is stricter than necessary to examine the likelihood that two accounts of similar edit count and similar edit times will collide. (If collisions with other accounts are common under the stricter 1-minute standard, a lack of collisions under a 2-minute standard would be even more remarkable. If collisions are rare under the 1-minute standard, nothing can be concluded about how likely a lack of collisions would be under the 2-minute threshold.)

Methodology
I began by downloading the set of all revisions (sans article text) of Wikipedia as of 2008-01-03 (6.2GB of compressed XML). The file's timestamp is 2008-01-06 05:21, and it contains at least all revisions up through 2007-12-31, so far as I can tell.

I then ran a filter to extract all revisions with a timestamp that started with '2007' (a period in which both MM and SH were active), and reformatted the results as newline-delimited records containing pipe-delimited fields: username (or IP, for anonymous edits), timestamp, and edit summary. I also truncated the seconds portion of the timestamp (see discussion of simultaneity above).
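The filter-and-reformat step might look something like the sketch below. It is not the filter actually used: the function name is hypothetical, the XML parsing is left out, and it assumes the dump's timestamps are ISO 8601 strings such as "2007-03-14T09:26:53Z".

```python
from typing import Optional

def to_record(username: str, timestamp: str, summary: str) -> Optional[str]:
    """Return a pipe-delimited record for a 2007 revision, or None otherwise.

    Assumes timestamps look like '2007-03-14T09:26:53Z' (an assumption
    about the dump format, not something stated in the text).
    """
    if not timestamp.startswith("2007"):
        return None
    minute = timestamp[:16]              # keep 'YYYY-MM-DDTHH:MM', drop the seconds
    one_line = summary.replace("\n", " ")  # keep each record on a single line
    return "|".join([username, minute, one_line])

print(to_record("ExampleUser", "2007-03-14T09:26:53Z", "fix typo"))
print(to_record("ExampleUser", "2006-12-31T23:59:59Z", "older edit"))
```

Because the edit summary is the last field, a reader of these records can split on the first two pipes only, so a pipe inside a summary does no harm as long as the username contains none.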

Then I ran a script that counted each editor's revisions and stored those results. I then filtered those results to editors who had between 1000 and 2000 edits, as a best comparison against users MM and SH, who both had edit counts within that range.
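A counting pass of this kind can be sketched as follows. The function names are illustrative, the records are assumed to be "username|timestamp|summary" lines, and treating both bounds as inclusive is a guess, since the text does not say how the endpoints were handled.

```python
from collections import Counter

def count_edits(records):
    """Count revisions per editor from 'username|timestamp|summary' lines."""
    counts = Counter()
    for line in records:
        username = line.split("|", 1)[0]
        counts[username] += 1
    return counts

def editors_in_range(counts, low=1000, high=2000):
    """Keep editors whose edit count lies between low and high (inclusive here)."""
    return {user: n for user, n in counts.items() if low <= n <= high}

# Tiny made-up sample, with the thresholds lowered so the filter has an effect.
sample = [
    "Alice|2007-01-01T10:00|edit one",
    "Alice|2007-01-01T10:05|edit two",
    "Bob|2007-01-02T11:00|only edit",
]
counts = count_edits(sample)
print(counts)
print(editors_in_range(counts, low=2, high=5))
```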

Finally, I compared the edit timestamps by editors in that 1K-2K list to the edits by users MM and SH, to see how frequent simultaneous edits were. All comparisons involved either SH or MM; I did not do a full cross-compare between all editors in that set.
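One way to do this comparison is to collect each editor's set of edit minutes and intersect it with the target account's set. The sketch below is an assumption about how such a comparison could be written, not the script that was actually run. The helper names and sample data are made up, and it assumes records of the form "username|minute|summary" with seconds already removed.

```python
from collections import defaultdict

def minutes_by_editor(records):
    """Map each username to the set of clock minutes in which it edited."""
    minutes = defaultdict(set)
    for line in records:
        username, minute, _summary = line.split("|", 2)
        minutes[username].add(minute)
    return minutes

def collision_counts(minutes, target, candidates):
    """For each candidate, count the clock minutes it shares with the target."""
    target_minutes = minutes.get(target, set())
    return {
        c: len(minutes.get(c, set()) & target_minutes)
        for c in candidates
        if c != target
    }

def never_collided(minutes, target, candidates):
    """Candidates that share no clock minute with the target."""
    counts = collision_counts(minutes, target, candidates)
    return sorted(c for c, n in counts.items() if n == 0)

sample = [
    "Target|2007-05-01T12:00|a",
    "Target|2007-05-01T12:07|b",
    "Alice|2007-05-01T12:00|c",
    "Bob|2007-05-01T12:01|d",
]
minutes = minutes_by_editor(sample)
print(collision_counts(minutes, "Target", ["Alice", "Bob"]))
print(never_collided(minutes, "Target", ["Alice", "Bob"]))
```

Comparing every candidate against one target at a time, as here, matches the described scope: no full cross-comparison between all pairs of editors is needed.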

Results

 * There were 77443453 revisions in 2007.
 * There were 3629 editors who had between 1K and 2K edits in 2007, including MM (1680) and SH (1201).
 * 343 (9.5%) of the 1K-2K editors never had edits that collided with MM, including SH.
 * 610 (16.8%) of the 1K-2K editors never had edits that collided with SH, including MM.

Analysis

 * MM had nearly 500 more edits than SH in 2007, which presumably explains the larger number of collisions between MM and other editors.
 * SH did not start editing until 2007-01-31, so SH had comparatively fewer opportunities to collide with other editors. This probably also contributes to SH's lower collision count.
 * A lack of collisions does not necessarily mean a lack of overlap in editing attention. User A might be working on an article at the same time as User B, but A might save changes 5 minutes before B does.
 * The data, viewed in isolation, is inconclusive under this standard of simultaneity.
 * Answers to the open questions posed below might clear things up.

Open questions

 * Of those that never collided with the accounts in question, how many have edit times that resemble the patterns of MM and SH?
 * How do these numbers change if we vary the strictness of the simultaneity standard?
 * Of those who did collide with the accounts in question, how does the number of collisions correlate with the degree of similarity in editing pattern?
 * Of those that never collided with the accounts in question, how often do single edits interleave?
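The question about varying the strictness of the simultaneity standard could be explored with a window measured in seconds rather than clock minutes. The following is a minimal sketch under stated assumptions: the function names are hypothetical, it needs timestamps that still carry their seconds, and it assumes an ISO 8601 format like "2007-03-14T09:26:53Z".

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def parse(ts: str) -> datetime:
    """Parse a timestamp such as '2007-03-14T09:26:53Z' (assumed format)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def count_near_edits(times_a, times_b, window_seconds):
    """Count edits in times_a that have at least one edit in times_b
    no more than window_seconds away, before or after."""
    sorted_b = sorted(times_b)
    window = timedelta(seconds=window_seconds)
    hits = 0
    for t in times_a:
        # First edit by B at or after the start of the window around t.
        i = bisect_left(sorted_b, t - window)
        if i < len(sorted_b) and sorted_b[i] <= t + window:
            hits += 1
    return hits

a = [parse("2007-01-01T12:00:00Z"), parse("2007-01-01T15:00:59Z")]
b = [parse("2007-01-01T12:01:59Z"), parse("2007-01-01T15:01:00Z")]
for window in (1, 60, 120):
    print(window, count_near_edits(a, b, window))
```

Unlike truncation to the clock minute, a window like this treats 12:00:59 and 12:01:00 as one second apart, at the cost of a sort and a binary search per edit.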

Unresolved questions

 * I worked a little bit on your first and third questions, as noted here.


 * Correlation does increase the likelihood of collisions, as shown in the graph at right. Here is a breakdown of the number of accounts, and of accounts with collisions, for any given correlation coefficient bar. For example, the first entry, 0.865, is unfairly placed under these accounts' own correlation for this period; very few editing patterns correlate this well, and few non-collision accounts exist at higher correlation values.


 * Roughly speaking, it looks like the top third in terms of correlation accounts for only about one sixth of the no-collision comparisons, while the bottom third accounts for half of them. This makes sense: users who edit at the same times of day would be more likely to have collisions. Cool Hand Luke 05:15, 19 February 2008 (UTC)
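The comment above does not say how the correlation coefficients were computed. One plausible approach, shown here purely as an assumption, is to build an edits-per-hour-of-day histogram for each account and take the Pearson correlation between two histograms. The function names and the "YYYY-MM-DDTHH:MM" timestamp format are hypothetical.

```python
import math

def hourly_histogram(minutes):
    """Count edits per UTC hour of day from 'YYYY-MM-DDTHH:MM' strings."""
    hist = [0] * 24
    for m in minutes:
        hist[int(m[11:13])] += 1
    return hist

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up edit minutes: two early-morning editors and one afternoon editor.
early_1 = ["2007-01-01T03:10", "2007-01-02T03:40", "2007-01-02T04:00"]
early_2 = ["2007-02-01T03:05", "2007-02-03T04:30"]
afternoon = ["2007-01-01T15:00", "2007-01-01T16:00"]

print(pearson(hourly_histogram(early_1), hourly_histogram(early_2)))
print(pearson(hourly_histogram(early_1), hourly_histogram(afternoon)))
```

Editors active at the same hours produce a coefficient near 1, while editors active at different hours produce a low or negative one, which is the sense in which similar patterns would make collisions more likely.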