User:TerryE/Supplemental SPI analysis

Introduction
From comments on the SPI discussion page and on individual SPIs, the standard Wikipedia check-user process seems to be based on analysis of two data elements (User IP address and user agent details) mined from the Apache/Wikipedia logs. Unfortunately this process can be defeated by counter-measures adopted by sophisticated sockpuppeteers: simple techniques to frustrate such analytic techniques, such as the use different browsers and different ISPs when logging on to different accounts to prevent IP address and User Agent continuity between them; the creation of deliberate stylist differences for any edit content of the different accounts to defeat linguistic and content analysis / comparison. I became interested in the use of time series analysis as an approach to investigate such sockpuppets and have based my approach on a message analysis technique that I have previously used professionally. The core of this approach is to construct a testable hypothesis for analysis by a two-sample Kolmogorov–Smirnov test.

This article discusses the issues of what to analyse and how to construct to two suitable sample datasets. It frames such a test and includes a Perl script to evaluate it. It then uses a test case of a sample user, U (who has been the subject of an SPI investigation) to explore its effectiveness, and carries out an analysis against four other putative sockpuppets: A, B, C and D. One of these four was proposed as the sockpuppet though the check-user yielded a possible but overall inconclusive result. The other three are independent users who are used purely as controls in the analysis, and have no known association with U'. These tests confirm that D is a definite sockpuppet of U, whilst A, B and C (the controls) are acting independent of U. This result supports the possibility that such an analytic tool might prove a useful addition for certain categories of sophisticated SPI.

Constructing the two samples

 * Wikipedia provides full access to any user's posting history both interactively through the Special:Contributions transaction and through the corresponding Mediawiki API query. This functionality enables a comparative analysis the posting history for any two users to be carried out.


 * Any post to a Wikipedia page can be thought of as a message issued by the posting user at a given date and time. When analysing the posts from two such users, such posts can be collated into ascending time order to create a series (e.g. ... MA MA MB MB MB MB MA MA MB MA MA MB ...).


 * As editors will often make a set of edits in quick succession when working on pages, it is therefore dangerous to make any assumptions about such repeated edit sequences (e.g. MA MA MA). However, editors take editing breaks between such edit sessions, and what we can meaningfully consider is the distribution characteristics of the time intervals between such sequences (that is the time interval between the first post of user B in such a sequence and the immediately preceding post from user A, or visa-versa).  Can we construct a suitable KS test base on this frequency distribution of the A↔B transition intervals?


 * As the posting patterns can vary uniquely for any given individual according to time of day, day of week, home time zone, occupation, other interests, etc., it is unsafe to make any assumptions about any parametric distribution (e.g negative exponential distribution) for such transition times. However, if the two sets of posts are effectively independent, then it is still possible to construct a testable hypothesis based on this assumption.


 * If the two users posting sets are truly independent (but broadly following some weekly pattern), then time shifting one of the sets (say the MA time stamps) by an arbitrary number of weeks (that is a multiple of 168 hours) will not materially change any measured distribution. So the time collated sets for the time series of these A↔B transition intervals, where the offset is 7k days and where k = (-3,-2,-1,0,1,2,3) say, can be used as set of samples on which to base this analysis.


 * These observations can be partitioned into two subsets, with these subsets forming the basis for the 2-sample test:
 * The base set is where k=0.
 * The reference set is where k≠0,


 * The base set is what actually happened. With the hypothesis that the posting patterns of the two users are time-independent (other than this week relative pattern), if this is true. then time shifting should not materially change the measured cumulative frequency distribution (CFD) for the reference set; hence this reference distribution should reflect the same underlying distribution as the base.   This can be constructed as a two-sided KS test.


 * The reason for the ±3 range for the reference set is somewhat arbitrary, but seems a reasonable compromise: on the one hand posting patterns will evolve over time so I want to avoid making k too large; on the other hand a larger k helps to approximate a one-sample test as
 * $$\sqrt{\frac{n n'}{n + n'}}D_{n,n'}\xrightarrow{n'\to\infty}\sqrt{n}$$

I have coded this up in Perl in the section Example code below.

Real example of performance
I have identified five users to use for this example, but have removed identity information for the purposes of this analysis, so let's call them U, A, B, C, D. U has been the subject of a SPI investigation in the past, where D was suggested as a sockpuppet with inconclusive results. A, B and C are just a control set of three other users who happen to post on some of the same articles as U. I wanted use this approach to validate that the hypothesis (that the users are independent) would succeed for A, B and C but potentially fail for D.

The following table gives the results as reported by the script. NB and NR are the number of samples in the base and reference sample datasets respectively. The next column is the 2-sample KS weighting factor given in the KS-test reference. The next column is the maximum absolute frequency difference used in the KS-test and lastly the factored term used to match the 2-sided KS tables.

I have a program which calculates the exact Kα terms for the KS statistic, but I don't really need need to go into this detail with these results, as the first three are so low that we can accept the hypothesis for these; the last is so high that we must reject it with high confidence (e.g. K0.0001 = 1.95). In other words:
 * In the case of the three controls, the tests comparing U to A, B and C work well with the base and reference are extremely close. These controls were not specifically chosen other than they had a comparable posting interaction to D.  These control results support the assumptions underlying the tests and that is it is effect in showing that the posting distributions for A, B and C are independent of U.
 * However, we must reject the hypothesis for U↔R. In other words, U's posting patterns are measurably and definitely not independent of Ds.

I have also uploaded a couple of graphical plots generated from the script output, which underline this overall result in more intuitive terms:


 * [[Image:KStest-UvsABCD-CFD.png|thumb|tight|top|CFD of U vs A,B,C,D]] The CFD on the right shows the four pairs of plots of the cumulative frequency distribution(CFD) against elapsed time. Even though there are eight plots on this figure, it is easy to identify the plot-pairs for U vs A, B and C as these almost overlap pairwise and you can see that they are almost identical within sampling noise.  Note that C lives in a time zone near U and they have similar time-of-day patterns; A and B post from different timezones and this gives rise to a bimodal probability distribution that in turn creates a double bump in the CFD. (This underlines why a two-sample analysis is necessary.)  Even so the base and reference CFDs are still extremely close.  Hovever, the base and reference plots for U vs D are clearly very different in nature and are clearly separated.


 * [[Image:KStest-UvsABCD-KSS.png|thumb|tight|top|KSS plot for U vs A,B,C,D]] The next plot shows the absolute difference in base vs reference CSD (normalised for sample sizes) again for the same four user comparisons. The KS-test is known to have poorer convergence in areas of low probability density (e.g. the tails in the case of the normal distributions). In the cases of A and B, the dips cause by time zone offsets cause corresponding rises in the delta function.  The sample sizes used in these tests mean that the asymptotic approximation to the Kα is valid hence I've already normalised these plots by the √[(NA+NB)/NANB] factor.  The "marginal to confident rejection zone" perhaps runs from (α=0.1) at 1.20 to (α=0.01) at 1.63.  The U vs D figure is off the end of scales!  Again this shows just how the plot for U vs D is fundamentally different than the three controls.

What this graphical analysis does is to highlight the abnormal interaction of U's and D's posting patterns. This sort of separation (with a minimum of 52 mins between posts in the base case and roughly a 90 minute offset) just will not happen by chance. There has to be some causal interaction. The reference CFD shows that we should expect ~23% of intervals between the posts of U and R should occur within 52 mins. In reality, this never happens even once in 268 such transitions.

The reference and base CFDs for the three controls are approximately negative exponential and in all six roughly 40% of intervals are less than 1-hour. The reference distribution for U vs D has 23% of intervals less than 1-hour. If they were independent then one would expect roughly 60 intervals to be less than 1-hour, whilst in fact there are only 5. The chances of this happening by chance are infinitesimal (doing the math, pN≤5 ≈ 4×10-22). Looking at a time-of-day analysis (which I have not included here), the most plausible explanation to me is that U and D are the same person, one who is using a geographic and time separation e.g. using a work PC and internet access to post as U and a home PC to post as D, and in this way frustrating detection by check-user.

The analysis script
This script uses the standard libraries Time::Piece and MediaWiki::API (Version 0.30 min to work with Wikipedia). It is invoked using at the command line, e.g.

perl SPIdistAnal.pl en.wikipedia.org/w $WIKIUSER $WIKIPWD U A /tmp/A

This will output the result summary to stdout, and the tab-separated CFD and KSS files for loading into Excel/Calc. I used my aggregation script to combine the output files for A-D and then imported them into Openoffice.org calc to produce the referenced graphs. Note that running this script does not require Bot approval since it is (i) read-only; (ii) manually initiated; (iii) throttled to minimise load on WP; and (iv) based on a standard API implementation. Even so, I developed and tested this on a local Wikimedia instance, and only used it on a one-shot against en.wikipedia.org to generate these test results.

I also have scripts (not included) for aggregating multiple TSV files for easy X-Y plotting in Calc/Excel and for generating Time-of-Day histograms.