Wikipedia:Wikipedia Signpost/2019-11-29/Special report

Let's say you are interested in how many active editors from France are editing the English-language Wikipedia; or conversely, you'd like to know how many editors from the UK are editing the French-language Wikipedia. All the information needed to calculate these numbers is recorded, at least temporarily, by the Wikimedia Foundation (WMF), but unless you worked for the WMF and had access to the Geoeditors Monthly database, you could never find those numbers. The WMF did not wish to disclose this data out of concern that the numbers were precise enough for governments or others to extract information that might lead to the identification of individual editors.

This month the Wikimedia Foundation made a new dataset public: Geoeditors/Public, or more informally, Active Editors by country. It allows the public to see, more or less, how many active editors (5–99 edits in a month) and very active editors (100+ edits) from about 180 individual countries contribute to active Wikipedia versions, each month from January 2019 onward. For example, if you want to know how many people editing from the UK made more than 99 edits to the French version of Wikipedia in September, you can look it up in this dataset. The answer is somewhere between 11 and 20.

Because of privacy concerns, exact numbers are not given, and data from 30 countries are excluded, e.g. China, Kazakhstan, Russia, Saudi Arabia and Venezuela. The number of editors in each category (editors from country x who edited Wikipedia version y) is given only in “buckets” of ten: 1–10, 11–20, 21–30, 31–40, etc. Technical information is available here. The data are available here.
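The bucketing described above can be illustrated with a short sketch. The function names `bucket_bounds` and `bucket_label` are made up for this illustration and are not part of the dataset or of any WMF tooling; the sketch only assumes buckets of width ten starting at 1, as described in the text.

```python
from typing import Tuple


def bucket_bounds(count: int, width: int = 10) -> Tuple[int, int]:
    """Return the (lower, upper) bounds of the bucket containing `count`.

    Assumes buckets of the form 1-10, 11-20, 21-30, ... as described in
    the text; `width` is an illustrative parameter, not a dataset field.
    """
    if count < 1:
        raise ValueError("count must be a positive integer")
    lower = ((count - 1) // width) * width + 1
    return lower, lower + width - 1


def bucket_label(count: int, width: int = 10) -> str:
    """Format the bucket containing `count` the way the text writes it."""
    lower, upper = bucket_bounds(count, width)
    return f"{lower}–{upper}"
```

Under these assumptions, any exact count from 11 through 20 lands in the same "11–20" bucket, which is why the dataset can only say "somewhere between 11 and 20".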

But enough of the preliminaries! Which questions that I’ve been dying to ask can the dataset answer? The following analysis is only a brief, quickly done overview of data from one month, September. It’s not in any sense academic research, but it should help people understand what type of data the dataset contains and what types of questions it can be used to address.

My main questions – of personal interest – are:
 * What countries contribute most to the English-language Wikipedia (enwiki)? Are they the richer, or the more populous English-speaking countries? Or perhaps those countries where English is widely spoken as a second language?
 * Do these relations differ across different Wikipedia language versions? Answering the above questions for the Spanish-language Wikipedia (eswiki) allows a simple comparison.
 * And finally, how do contributions across countries to different language versions compare? Edits from the US and UK are examined here.

Who edits enwiki?
Table 1 shows the 11 countries with the most active editors and the 11 with the most very active editors to enwiki (14 countries total), plus two other large English-speaking countries, Ireland and South Africa. Numbers marked * are not in the largest 11.

The countries with the most very active editors in enwiki are the US (43%) and the UK (17%), or almost 60% of the total reported editors between them. The two large rich countries predominate. Two rich but less populous countries, Canada and Australia, are also well-represented with almost 12% of the total very active editors between them.
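Because the public data come in buckets rather than exact counts, a share like those above can only be pinned down to a range. The following is a minimal sketch of that arithmetic. It uses made-up country codes and made-up buckets, not figures from the dataset, and it is not necessarily how the percentages in this report were computed; `share_bounds` is a hypothetical helper name.

```python
from typing import Dict, Tuple

Bucket = Tuple[int, int]  # (lower bound, upper bound) of a reported bucket


def share_bounds(buckets: Dict[str, Bucket], country: str) -> Tuple[float, float]:
    """Return the smallest and largest share `country` could have.

    The share is smallest when the country sits at the bottom of its
    bucket and every other country sits at the top of theirs, and
    largest in the opposite case.
    """
    lo, hi = buckets[country]
    others_lo = sum(b[0] for c, b in buckets.items() if c != country)
    others_hi = sum(b[1] for c, b in buckets.items() if c != country)
    return lo / (lo + others_hi), hi / (hi + others_lo)


# Hypothetical buckets for three made-up countries (NOT real data).
example = {"A": (41, 50), "B": (11, 20), "C": (1, 10)}
low, high = share_bounds(example, "A")
print(f"Country A's share lies between {low:.1%} and {high:.1%}")
```

The range narrows as the counts grow relative to the bucket width, so shares for the largest contributors are the most reliable.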

The much smaller but still relatively rich New Zealand and Ireland, with about 1% of the total reported very active editors each, trail among those countries where English is the predominant first language.

The proportion of native English speakers by country is shown at English language. The four countries with the largest native English-speaking populations are also the largest four contributors to enwiki – in the same order: USA, UK, Canada, and Australia.

India, which has the fifth-largest group of very active editors (4%) and the third-largest group of active editors (9%), has a very large population for which English is an important medium of instruction but the first language of only a small fraction. The Philippines, with nearly 2% of the reported very active editors, may be affected by factors similar to those at work in India. The percentages of reported active editors (5–99 edits) appear to be similar to the percentages for very active editors.

Six rich European Union countries where English is not the mother tongue, Germany, the Netherlands, Italy, Sweden, France and Spain, together account for 8.4% of the reported very active editors. Of the countries in this table, only the rankings of Brazil and perhaps South Africa do not appear to be directly explained by the three factors of mother tongue, population, and wealth.

Who edits eswiki?
Table 2 shows analogous rankings for the Spanish-language Wikipedia. While Spain and Argentina combine for slightly over half of the reported very active editors, the very active editors are distributed more evenly across the reported countries than on enwiki. Only one country without Spanish as its predominant language, the United States, has a fairly large proportion of the very active editors. The same three factors that seem to explain the rankings for enwiki editors (mother tongue, population, and wealth) may very well explain the rankings for eswiki as well.

Nevertheless, wealth – or perhaps dialect – may be playing a stronger role in eswiki than it does in enwiki. The 12 largest countries by native Spanish-speaking population are, in order, Mexico, Colombia, Spain, Argentina, the United States, Venezuela, Peru, Chile, Ecuador, Cuba, Guatemala, and the Dominican Republic. Note that Venezuela and Cuba are excluded by the WMF from the dataset. The population rankings for native English-speaking countries are almost identical to the rankings in Wikipedia contributions of the same countries. But the population rankings for native Spanish-speaking countries are much less similar to their rankings in Wikipedia Spanish-language contributions.

US and UK editors editing on non-English Wikipedias
Table 3 shows how very active editors from the US and the UK edit the non-English Wikipedias. Altogether, very active editors from the US edit in 44 different Wikipedia versions; those from the UK edit in 29. The versions with 11–20 very active editors from the US form an interesting mix: the Chinese, Spanish, Farsi (Persian), Japanese, and Russian Wikipedias. For UK editors, the corresponding list includes only the French Wikipedia.

So what else can you do with this dataset?
Time is the main variable of interest that was left out of the above examinations. Right now we can see how edit contributions from different countries change over the nine months from January through September 2019. As time goes by, more months of data will be released, and the effect of time will likely be of greater interest. For example, let's say that a new program was introduced to increase the number of editors from country Y. The full effects of the program might not be visible after nine months, but after two or three years any effects would hopefully show up in the data.
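One simple way to look at change over time is to list the months in which a country's bucket for a given Wikipedia version differs from the previous month's. The sketch below assumes the monthly data have already been loaded into a dictionary keyed by "YYYY-MM" strings; that layout, the helper name `bucket_changes`, and the sample values are all assumptions made for illustration, not the dataset's actual file format or contents.

```python
from typing import Dict, List, Tuple

Bucket = Tuple[int, int]  # (lower bound, upper bound) of a reported bucket


def bucket_changes(series: Dict[str, Bucket]) -> List[Tuple[str, Bucket, Bucket]]:
    """Return (month, previous_bucket, new_bucket) for every month whose
    bucket differs from that of the preceding month in the series."""
    months = sorted(series)  # "YYYY-MM" strings sort chronologically
    changes = []
    for prev, cur in zip(months, months[1:]):
        if series[prev] != series[cur]:
            changes.append((cur, series[prev], series[cur]))
    return changes


# Hypothetical monthly buckets for one country/wiki pair (NOT real data).
monthly = {
    "2019-01": (11, 20),
    "2019-02": (11, 20),
    "2019-03": (21, 30),
    "2019-04": (21, 30),
    "2019-05": (11, 20),
}
changes = bucket_changes(monthly)
for month, old, new in changes:
    print(f"{month}: {old[0]}–{old[1]} -> {new[0]}–{new[1]}")
```

A caveat on interpretation: a move from one bucket to the next can reflect a change of a single editor (say, 20 to 21), while an unchanged bucket can hide a swing of up to nine editors, so small month-to-month movements should be read cautiously.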

Another area of interest might involve combining this dataset with other datasets. For example, say a program is undertaken to increase the quality – rather than the quantity – of articles about country Z. Using this data in conjunction with data on readership might give a more complete understanding of the effects of the program.