User:Alanyst/Vector space research

This is another page that documents research pertaining to Requests for arbitration/Mantanmoreland. See also User:Alanyst/Edit collision research.

Aim of the research
The research aims to provide insight into the question of whether two user accounts 'MM' and 'SH' on Wikipedia are independent, or else are controlled by the same individual (sockpuppets). This research focuses on the similarity of edit summaries as a metric of independence.

Initial assumptions
If two Wikipedia accounts are controlled by the same individual, we can expect a higher degree of similarity in the edit summaries made by those accounts than for a typical pair of unrelated accounts. Even if the individual is consciously avoiding using similar language, there is still likely to be significant overlap in terms used simply because people's speech patterns are deeply ingrained.

Similarity
The notion of similarity needs to be defined. It is not enough to show that two accounts share similar terms, since unrelated accounts might also employ those terms. It is also important to show that terms that one account employs and the other does not are less of a factor than the similar terms. Also, the more commonly used the term is among the general population of editors, the less weight the term should have in the analysis.

Edit summary similarity as an information retrieval problem
To gauge similarity between different editors' edit summaries, we can view the problem in terms of information retrieval. Let A be the set (more strictly, the bag) of all terms used by editor A in A's edit summaries. We define B similarly for editor B, and so forth for all editors who have contributed edit summaries.

We can view A, B, and so forth as individual documents, much like web pages (though without HTML markup and full of vaguely Wikipedia-related gibberish). If we want to know whose edit summaries are most similar to a particular editor's (say, editor 'MM'), we treat the document MM as a search query over the set of all documents in our corpus, and find the closest match. Just as a search engine ranks the results by relevance, so we can assess the relevance of MM to each other editor's combined edit summaries.

The vector space model is a well-known algorithm in information retrieval that can provide this kind of analysis. Term frequencies are calculated for each document and for the entire corpus. Each term is treated as one dimension of a vector, with the frequency of that term in that document (normalized by the overall frequency of the term in the corpus) supplying the magnitude of that component of the vector. Then the similarity of the documents is calculated by measuring the angle (actually, the cosine) between the vectors that represent each document. The cosine is maximized when two vectors are colinear (complete similarity), and minimized when they are orthogonal (zero similarity).
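In symbols, the paragraph above amounts to the following. The notation is introduced here for illustration and does not appear in the original analysis: w is the weight of term t for editor (document) d, tf is the term's count in that editor's summaries, and idf is the corpus-level normalization.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t,
\qquad
\cos\theta_{A,B} =
  \frac{\sum_t w_{t,A}\, w_{t,B}}
       {\sqrt{\sum_t w_{t,A}^2}\;\sqrt{\sum_t w_{t,B}^2}}
```

When all weights are non-negative, the cosine lies between 0 (no shared terms) and 1 (vectors pointing in the same direction). Because the denominator divides out each vector's length, an editor with twice as many edits but the same proportions of terms gets the same score.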

Methodology
I began with the set of all revisions made in 2007. Each record contained the editor name (or IP address for anonymous editors), a timestamp, and the full edit summary for the revision. The timestamp was dropped for this analysis.

The set of all 2007 revisions is too large for a vector space analysis on my hardware, so I reduced the set to all revisions made by editors who had between 1000 and 2000 edits (inclusive) during that time. These include edits made by MM and SH.

I then combined each editor's edit summaries to produce a set of tf-idf scores as follows:

 // first gather raw counts of terms per user (=document) and for entire corpus
 for each revision do
   if no edit summary then skip
   remove commonly seen automated edit summaries:
     * automatic section comments (/* like this */)
     * revert-tool messages ("Reverted P by Q to R by S")
     * undo-tool messages ("Undo/Undid...by User" with wiki markup)
     * Twinkle ("...using TW")
   remove HTML entities (&quot;, &gt;, etc.)
   remove all non-alphabetic and non-whitespace sequences (punctuation, digits, symbols)
   condense all whitespace sequences to single space characters
   trim whitespace from the start and end of the text
   if nothing left in edit summary then skip
   split edit summary into tokens using space character as delimiter
   // note: token case was preserved in order to capture similar styles
   // of capitalization, so "Foo" is treated as a different token than "foo".
   increment token count for the user and for the entire corpus
 end loop on each revision

 // next calculate inverse document frequency for each term (see vector space model)
 for each term do
   term.idf = log(number of users/number of times term appears in corpus)
 end loop on each term

 // finally, calculate term frequency for each term and user
 // and multiply by idf to get vector of term weights for each user
 for each user do
   for each term appearing in the user's edit summaries do
     user.term.weight = (number of times term appears in user's edit summaries) * idf
   end loop on each term
   write out the vector of term=>weight pairs for the user
 end loop on each user
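The steps above can be sketched in Python as follows. This is a minimal illustration, not the code actually used for the analysis: the function names, the regular expressions, and the input shape (an iterable of `(user, summary)` pairs) are all assumptions. The revert, undo and Twinkle filters are left out because their exact patterns are not given on this page.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical patterns; the original filters are not shown on this page.
SECTION_COMMENT = re.compile(r"/\*.*?\*/")   # automatic /* section */ comments
HTML_ENTITY = re.compile(r"&[A-Za-z]+;")     # &quot;, &gt;, etc.
NON_ALPHA = re.compile(r"[^A-Za-z\s]+")      # punctuation, digits, symbols


def tokenize(summary):
    """Clean one edit summary and split it into case-preserving tokens."""
    text = SECTION_COMMENT.sub(" ", summary)
    text = HTML_ENTITY.sub(" ", text)
    text = NON_ALPHA.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []


def tfidf_vectors(revisions):
    """Map each user to a {term: weight} vector, following the pseudocode."""
    user_counts = defaultdict(Counter)
    corpus_counts = Counter()
    for user, summary in revisions:
        tokens = tokenize(summary or "")
        if not tokens:
            continue  # no edit summary, or nothing left after cleaning
        user_counts[user].update(tokens)
        corpus_counts.update(tokens)

    n_users = len(user_counts)
    # idf exactly as written above: log(users / total occurrences in corpus).
    # This is negative for a term that occurs more often than there are users.
    idf = {t: math.log(n_users / c) for t, c in corpus_counts.items()}
    return {
        user: {t: n * idf[t] for t, n in counts.items()}
        for user, counts in user_counts.items()
    }


# Toy input, made up for illustration.
toy = [("A", "/* Intro */ rply"), ("A", "rply Foo"), ("B", "foo 123"), ("B", None)]
vectors = tfidf_vectors(toy)
```

Note that the idf here divides by the total number of occurrences of a term, as the pseudocode states; the more common textbook variant divides by the number of documents that contain the term.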

I calculated tf-idf vectors for all 3629 editors who have between 1000 and 2000 edits in 2007.

Finally, I calculated similarity rankings for all 3629 editors with respect to MM and SH individually. In other words, I treated MM's tf-idf vector as the query and ranked all other editors' vectors by similarity, and then did the same for SH's vector.
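As a rough sketch of this ranking step, assuming each editor's vector is held as a Python dict of term to weight (the function names and the toy weights below are invented for illustration):

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_user, vectors):
    """Rank every other editor by similarity to the query, highest first."""
    query = vectors[query_user]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != query_user  # skip the trivial self-similarity
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy weights, made up for illustration only.
toy_vectors = {
    "MM": {"rply": 2.0, "expanding": 1.0},
    "SH": {"rply": 1.0, "expanding": 0.5},
    "X": {"fix": 3.0},
}
ranking = rank_by_similarity("MM", toy_vectors)
```

In the toy data, SH uses the same terms as MM in the same proportions, so it ranks first even though its weights are half the size; X shares no terms with MM and scores zero.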

Artifacts of the process
The following table details the artifacts involved in the analysis, which can be found on my server (link posted on arbitration case's Evidence page).

Results
Note: These results are erroneous. Please see the "Correction" section below.

The 20 lowest-similarity editors are (lowest first):

The 20 highest-similarity editors are (lowest first):

Analysis

 * MM and SH are in each other's top two similarity rankings, disregarding the trivial self-similarity.
 * Interestingly, User:Piperdown is also in both editors' top two, and in fact is the most similar with respect to SH, over MM.
 * Piperdown's strong similarity ranking challenges the hypothesis of collusion somewhat, as it is well known that Piperdown is on the opposite side of the Overstock issue from MM and SH. However, the terms Piperdown has most strongly in common with MM and SH are mostly connected to the Overstock battle, in which SH participated more than MM did during 2007:
 ** SEC
 ** Forbes
 ** Bloomberg
 ** hedge
 ** shorting
 ** Byrne
 ** Weiss
 ** SHO
 ** piperdown
 ** material
 ** DOB
 ** RS
 * While MM and SH also share strong correlations with some Overstock-related terms, there are also some distinctive terms of habit that strongly correlate:
 ** SEC
 ** rply
 ** expanding
 ** clarifying
 ** distort
 ** regulatory
 ** duplicative
 ** NPA
 ** RS
 ** naked
 * The technique of stripping out non-alphabetic characters and tokenizing on whitespace means that phrases, numbers, and punctuation do not factor into the results. This means that the "as per" and " -- " tics do not influence these findings.
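This page does not say how the "most strongly in common" terms listed above were picked out. One plausible approach, assumed here purely for illustration, is to rank the shared terms by their contribution to the dot product of the two tf-idf vectors, that is, by the product of the two weights. The function name and toy weights are hypothetical.

```python
def top_shared_terms(a, b, k=10):
    """Return up to k shared terms, largest weight product first."""
    shared = a.keys() & b.keys()
    contribution = {t: a[t] * b[t] for t in shared}
    # Sort by descending contribution; break ties alphabetically so the
    # result is deterministic.
    return sorted(contribution, key=lambda t: (-contribution[t], t))[:k]


# Toy weights, made up for illustration only.
editor_a = {"SEC": 3.0, "rply": 2.0, "the": 0.1}
editor_b = {"SEC": 2.0, "rply": 4.0, "fix": 5.0, "the": 0.1}
strongest = top_shared_terms(editor_a, editor_b, k=2)
```

A term that only one editor uses contributes nothing to the dot product, so it never appears in this list, however heavily that editor uses it.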

Variations
At User:Noroton's suggestion, I re-ran the vector space algorithm for Samiharris and Mantanmoreland with a set of words excluded: SEC, Forbes, Bloomberg, naked, shorting, Byrne, Weiss, hedge, SHO, piperdown, material, DOB, and RS. The results:

Comments
(please move to discussion page if you think that's the best place)

Deleting the Topic references (financial words in this case) seems like a very good idea. It was also mentioned on the workshop page by User:Avruch. I'd like to see the Mantanmoreland results as well (if possible). This does suggest that the topic words are quite important in determining these results, but that there is more, i.e. Piperdown is still highly "correlated." I was wondering what else might be driving this, so checked out AniMate's edit summaries - it's pretty clear, he uses "verb"-ing quite a bit.

I'm not familiar with this method (yes, I'll try to check it out) but it seems good, if it's not too sensitive to a couple of things like same articles edited, and (maybe) "verb"-ing. I'm afraid I don't see the "timing correlations" as anything but proof that they edit from the same time zone - which I think we already knew. Keep up the good work. Smallbones (talk) 20:03, 21 February 2008 (UTC)


 * "Verbing" will only matter if they are using the same verbs repeatedly in that fashion. Both "revising" and "extending" are "verbing" forms, but they are completely different tokens for this algorithm.


 * As to the timing correlations, it is just one piece of the puzzle. Sort of like the Blind Men and an Elephant story, you mislead yourself if you only look for proof from individual pieces.  All of the evidence needs to be looked at as a complete pattern.  The timing correlation combined with the lack of interleaving edits while working in the same topic areas to me indicates two accounts that were actively working to keep their activities from looking at first glance coordinated.  The punctuation similarity is independent of the method described on this page, because this algorithm treats all punctuation as whitespace.  So we have multiple strong threads of evidence that in my mind form a coherent pattern - timing data as a whole, topics chosen and POV, punctuation and word choice.  GRBerry 21:20, 21 February 2008 (UTC)

Okay, I've added the results for Mantanmoreland with the same topic filter as I used for the Samiharris variation above. alanyst /talk/ 05:04, 22 February 2008 (UTC)


 * Remove 13 finance-related terms and Piperdown's similarity recedes, most dramatically in the Mantanmoreland table. This would seem to indicate that subject matter is not the underlying reason for the similarity between these two user accounts; rather, they share a deeper, more personal style independent of the article subjects they edited. Another strand in the rope. Noroton (talk) 06:52, 22 February 2008 (UTC)

Impressive. A few comments:

It might be worth noting that there is a long tradition of statistical analysis for determining text authorship, cf. forensic linguistics and stylometry. (For example, this thesis has a detailed overview.) This includes the application of concepts from information retrieval, like above. I don't say this to diminish your achievement, but to stress that in such a situation, it is an entirely reasonable decision to use a careful application of statistical methods, contrary to some "with statistics you can prove anything" comments in the debate about this case (I don't mean Smallbones, who seems to have taken a more nuanced stance).

For a paper specifically about detecting sock puppets in online communities, whose conclusions could be of value here, see:


 * Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins: Anti-Aliasing on the Web. In: Proceedings of the 13th international conference on World Wide Web (2004) online version

They used a real-life data set to evaluate the accuracy of different similarity measures. (More concretely: They looked at 100 posters on a board of the web forum of CourtTV.com, with at least 100 postings each, and split the 100 accounts artificially into 200, such that each account a had one "artificial sock puppet" a´. The accuracy of a similarity measure is then the probability with which a´ is the account most similar to a in this measure among the 199 others. See Chapter 4. The number of users is smaller than in your data set, but the size of the text corpus for each user should be comparable.)

They compared tf–idf (in its variant without the log) against the Kullback–Leibler divergence and found that the KL divergence yielded a similarity measure with much better accuracy.

They also state that smoothing the distributions (weights) improved the accuracy greatly. Smoothing here means replacing the term frequency vector for one user by a linear combination of it with the overall (whole corpus) frequency vector. They say this is because the unsmoothed distribution over-emphasizes highly infrequent terms, which could be the reason for the Piperdown result (the highly infrequent terms coming from an external issue which affected several users - here: the Overstock battle -, rather than from each person's default preferred vocabulary). In other words: Smoothing might achieve automatically and less arbitrarily what has been done above by selecting and removing these terms by hand.

Quote from their conclusion:
 * In this paper, we have shown that matching aliases to authors [i.e. accounts to real life persons] with accuracy in excess of 90% is practically feasible in online environments.

(And they did not use time stamps and other features which have been analyzed elsewhere in this case.)

Regards, High on a tree (talk) 12:15, 23 February 2008 (UTC)
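For illustration, the smoothing and KL divergence idea described in the comment above might look like the following Python sketch. It is not taken from the cited paper: the mixing weight `lam`, the direction in which the divergence is computed, and the toy counts are all assumptions made here.

```python
import math


def smooth(user_counts, corpus_counts, lam=0.5):
    """Mix a user's term distribution with the whole-corpus distribution.

    Assumes every term the user wrote is also counted in corpus_counts.
    """
    user_total = sum(user_counts.values())
    corpus_total = sum(corpus_counts.values())
    return {
        term: lam * user_counts.get(term, 0) / user_total
        + (1.0 - lam) * count / corpus_total
        for term, count in corpus_counts.items()
    }


def kl_divergence(p, q):
    """KL divergence D(p || q); lower means the distributions are closer."""
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0)


# Toy counts, made up for illustration only.
corpus = {"rply": 4, "fix": 4, "SEC": 2}
p_a = smooth({"rply": 3, "SEC": 1}, corpus)
p_b = smooth({"rply": 2, "SEC": 1}, corpus)
p_c = smooth({"fix": 4}, corpus)
```

Because every smoothed distribution gives each corpus term a non-zero probability, the logarithm is always defined, which is one practical reason smoothing is needed before a KL divergence can be used as a similarity measure.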


 * I've had another idea. Would it be possible to extract the Mantanmoreland 2006 contributions and use them to query the data set?  Since Mantanmoreland has more than 2K contributions in 2006, maybe just the last 2K of them?  It would be interesting to see how Mantanmoreland 2006 compares to Mantanmoreland 2007 and to Samiharris.  (I think we all believe that Mantanmoreland 2006 == Mantanmoreland 2007.)  However, I'm not certain if the method allows this.  If it doesn't allow that simple an approach, would it be worth adding Mantanmoreland 2006 to the complete data set then running with MM 2006, MM 2007, and SH as the three queries?  GRBerry 22:54, 26 February 2008 (UTC)

Correction
I have detected an error in my original VSM work that affects the results given above. Those results should be considered unreliable.

I have corrected the error and computed new results, which I give below. I will also detail what went wrong.

New results
These new results show, for Mantanmoreland and Samiharris, the top 20 editors in terms of edit summary similarity, for two different datasets: (Note that the second is a superset of the first.)
 * editors with edit counts between 1000 and 2000 in 2007
 * editors with edit counts between 500 and 3500 in 2007

Overview of results:
 * Samiharris is not in Mantanmoreland's top 20 similar editors for either dataset, contrary to the previous results. In fact, Samiharris ranks at about #98 of 3628 in the 1000-2000 dataset, and at #188 of 11377 in the 500-3500 dataset.
 * However, Mantanmoreland is #1 in Samiharris's rankings of similar editors, for both datasets.
 * Note that Piperdown no longer appears in any of these rankings. He ranks at #288 in the 1000-2000 dataset for Mantanmoreland, and #151 for Samiharris in the same dataset.  In the 500-3500 dataset, he ranks at #688 for Mantanmoreland and #351 for Samiharris.

What went wrong
In my original code, I attempted to filter out automated parts of edit summaries, since these do not reflect a writer's word choices. Unfortunately, my filter caught some but not all automated edit summaries. Most importantly, it missed the automatic section headings (enclosed in C-style comments /* like this */).

When I adjusted the filter to include those and re-ran the code, the results were quite different, as can be observed above. I believe these new results to be much more reflective of a true similarity measure.
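To show why the missed filter mattered, here is a small Python sketch with two made-up edit summaries. The regular expression and function name are assumptions, not the original code. The two editors write entirely different comments, but because both edited the same section, the automatic heading gives them shared tokens unless it is stripped first.

```python
import re

# Hypothetical pattern for the automatic /* section */ comment.
SECTION_COMMENT = re.compile(r"/\*.*?\*/")


def summary_tokens(summary, strip_sections):
    """Tokenize a summary, optionally removing the automatic heading first."""
    text = SECTION_COMMENT.sub(" ", summary) if strip_sections else summary
    text = re.sub(r"[^A-Za-z\s]+", "", text)  # drop punctuation and digits
    return text.split()


# Made-up summaries from two different editors on the same article section.
first = "/* Naked short selling */ rply"
second = "/* Naked short selling */ fixed typo"

shared_unfiltered = set(summary_tokens(first, False)) & set(summary_tokens(second, False))
shared_filtered = set(summary_tokens(first, True)) & set(summary_tokens(second, True))
```

Without the filter the two summaries share the three heading words; with it they share nothing, which matches the observation that editing the same articles was enough to inflate similarity in the erroneous results.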

Analysis of new results

 * With section headings being filtered out of the edit summaries, Piperdown disappears from these lists. This shows that Piperdown's high ranking in the erroneous results was almost wholly due to having edited in the same articles as MM or SH, which caused the edit summaries to have similar terms taken from the section headings.
 * Samiharris also drops in Mantanmoreland's rankings, but Mantanmoreland continues to be at the top of Samiharris's rankings. This suggests that Mantanmoreland uses a higher number of distinctive terms that Samiharris does not, but that there are still distinctive terms that the two accounts do share that relatively few others do.
 * The new results seem to lend some credence to Mantanmoreland's argument that Samiharris is a separate individual who has adopted the same habits of phrasing that Mantanmoreland has used. This is a plausible hypothesis under these new results, but the sockpuppet hypothesis is IMO not debunked by these results.  Other evidence available needs to be considered in deciding between these hypotheses.
 * It may be profitable to examine the terms that correlate best between Samiharris and Mantanmoreland, to see if it's plausible that Samiharris could have picked up terminology from Mantanmoreland. Are the unusual terms that they share used by Mantanmoreland where Samiharris is likely to have seen them and picked them up?  Or, are the timestamps and articles corresponding to Mantanmoreland's use of those terms so distant from Samiharris's editing times and areas of interest that the mimicry hypothesis is implausible?

Mea culpa
I sincerely apologize to the arbitrators, involved parties, and interested observers for this error. I urge anyone who has based their conclusions on the erroneous results to reexamine them. I also welcome any additional scrutiny of my work, as well as inquiries into my methods if there are further doubts as to the reliability of my work. alanyst /talk/ 00:08, 27 February 2008 (UTC)

Comments

 * Alanyst, I hope you don't mind my adding this "Comments" section. The possibility of error in any one set of data is one of the reasons why many of us are trying to rely on a range of different sets of data and different types of data. Also, this data is weaker in terms of showing a link between the two accounts, although it does show similarities.
 * SlimVirgin had asked at Wikback about whether or not one editor might pick up edit-summary styles from another editor. My assumption is that this would diverge over time, with similarities most evident earlier on. At this point, I don't know if it would be worth your time to do this kind of research, but if you're curious (and if it isn't too difficult) you might want to try comparing earlier edits (say, the first half of the year) with later edits (the second half of the year). I have to admit, I'm not sure that a divergence would prove anything. Also, is it possible that accounts with more edits but still within your range (say, 1,900 edits) would be more likely to show up as similar than would accounts with a smaller number of edits (say, 1,001)? The accounts with more edits would have more opportunities to use similar words, and with a range of 1,000 to 2,000 the accounts with the most edits could be almost double the size of the accounts with the fewest edits. Or maybe I'm missing something. I'm asking more because I'm curious than because I think any new results would matter much at this point. Anyway, thanks for the effort. I think you've given us all some valuable information. Noroton (talk) 06:23, 28 February 2008 (UTC)