User:Visviva/Dumpster dive

Data on user and article stratification over time, gleaned from the 2022-07-01 stub-meta-history dump of EN wiki.

General numbers
As of the 2022-07-01 stub-meta-history dump:
 * Total pages in mainspace: 16666393
 * Total users who have contributed at least one non-deleted revision of a page currently in mainspace: 8783518
 * Total edits associated with a mainspace page: 722794550
 * Total edits to a mainspace page that are associated with a user: 722772635
 * Difference (revisions with attribution removed): 21915 (0.003%)

General annual stats
Mainspace only. Excludes bots. Excludes redirects from page count but not from edit count.

By first-year edits
Cf. m:Research:Editor classes (which uses all-namespace edits). Does not currently exclude bots. Note that 2021 and 2022 will be substantially incomplete (since 12 months from first edit will not be up until December 2022 for some 2021 starters, and not until December 2023 for some 2022 starters).
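A minimal sketch of how a first-year edit count might be computed, and of why the most recent cohorts are incomplete. The function names and the toy timestamps are illustrative assumptions, not taken from the actual code; only the dump date comes from the text above.

```python
from datetime import datetime

DUMP_DATE = datetime(2022, 7, 1)  # date of the dump named in the text


def one_year_after(moment):
    """Return the same calendar moment one year later."""
    try:
        return moment.replace(year=moment.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year
        return moment.replace(year=moment.year + 1, day=28)


def first_year_edits(timestamps):
    """Return (edits in the first 12 months, whether that window is complete)."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0, False
    cutoff = one_year_after(ordered[0])
    count = sum(1 for t in ordered if t < cutoff)
    # The window is only complete if it closed on or before the dump date.
    return count, cutoff <= DUMP_DATE


# Toy editors (made-up timestamps, for illustration only)
early_starter = [datetime(2021, 3, 1), datetime(2021, 6, 10),
                 datetime(2022, 2, 28), datetime(2022, 4, 1)]
late_starter = [datetime(2021, 12, 15), datetime(2022, 1, 5)]

early_result = first_year_edits(early_starter)
late_result = first_year_edits(late_starter)
```

In this toy case the March 2021 starter has a complete first year, while the December 2021 starter's window does not close until December 2022, so that editor's count can still grow.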

2021 edits by first-year edits
Excludes editors joining during or after 2021. See results for other calendar years.

By calendar-year edits
Excludes bots. Counts edits to pages currently in mainspace only. IP edits not included in percentages.

Reflections
Comparing the above to my priors, two things are apparent: (1) I did not express my expectations with sufficient precision, but also (2) while the general trend toward stratification of the editor community is apparent, when the data is sliced as above, the trend is neither as pronounced nor as consistent as I expected. Since I phrased my predictions in "prove me wrong" terms, it seems fair to say that the data has proven me at least partially wrong. (But not as wrong as I would have liked to be.)

That said, there is an unmistakable upward trend in the proportion of edits made by users with 1000 or more mainspace edits per year, from 55.4% in 2006 to 60.8% in 2011 to 69.3% in 2021, albeit with various hiccups on the year-to-year level. And if we include IP edits in the denominator (which seems reasonable since anons are Wikipedians too), the trend becomes much more pronounced, and almost consistent with what I expected to see -- although there is one year in this period in which stratification fell by 0.1%:

(Limited to 16 years by template constraints -- 2007 represented a historic low in stratification by this metric, so this may misrepresent the overall historical picture.)

Of course, including IPs in this way is as problematic as excluding them: many IP edits are actually by logged-out registered users (although notably this would suggest that stratification is under-counted), and some individual IP users may actually make more than 1000 edits per year (which might be distributed among multiple IPs or combined with other users of a shared IP). It will be instructive to see if, as per my expectations, the above trend holds when using this metric of (edits by bands 4 and 5) / (edits by all bands and all IPs) against different ways of slicing the data, such as looking at edits across all namespaces.
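The metric described above can be sketched roughly as follows. This assumes that an editor's band is the number of digits in their yearly mainspace edit count (so bands 4 and 5 together mean 1000 or more edits), which matches the usage on this page but may not match the real code exactly. All names and numbers below are made up for illustration.

```python
def edit_band(annual_edits: int) -> int:
    """Band = number of decimal digits in a yearly edit count, capped at 5.

    Assumption: band 4 is 1000-9999 edits and band 5 is 10000 or more.
    """
    return min(len(str(annual_edits)), 5)


def stratification_share(edits_by_user, ip_edits=0):
    """(edits by bands 4 and 5) / (edits by all bands, plus IP edits if given)."""
    top = sum(n for n in edits_by_user.values() if edit_band(n) >= 4)
    total = sum(edits_by_user.values()) + ip_edits
    return top / total if total else 0.0


# Toy year: five registered editors plus a pool of IP edits (invented numbers)
toy_year = {"A": 12000, "B": 1500, "C": 400, "D": 90, "E": 10}
toy_ip_edits = 6000

share_registered_only = stratification_share(toy_year)
share_with_ips = stratification_share(toy_year, toy_ip_edits)
```

Passing `ip_edits` only changes the denominator, which is the whole difference between the two ways of slicing discussed above.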

It will also be interesting to see if the seeming sharp reduction in stratification in the first half of 2022 holds true for the year as a whole. It seems likely to be a seasonal artifact of some kind, or simply an effect of the year being only half complete. But it is also possible that the uptick in stratification in 2020-2021 was related to the pandemic, in which case a drop would not be entirely unexpected.

Article classes
Mainspace only, including IPs, excluding bots and redirects.

Editor age
Excludes bots. Mainspace only.

Reflections on editor age and retention
The above table likely undercounts first-year repulsion, since it goes by calendar years rather than editor years. Yet even by this metric, editor repulsion after the first year has exceeded 90% every year since 2012. Even as recently as 2020, a higher proportion of the editors who joined in 2001 were still active on the project than of those who joined in 2019.

Of course, editor numbers don't tell us much about edit numbers. Maybe the older editors are just kibitzers who come back to fix a typo once in a while. One simple metric for evaluating this would be the average age of editors weighted by number of edits. That is, for any given edit, how old is the editor making it likely to be?

In the steady-state case, this weighted average editor age would be constant over time. In the case of a small core of editors aging in place with nobody leaving or joining, it would grow at one year per calendar year.
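As a rough illustration of the metric and its two limiting cases, here is a small sketch. It assumes that editor age is measured in whole calendar years since the year of first edit, which may differ from how the real code measures it; the cohort numbers are invented.

```python
def weighted_editor_age(rows, year):
    """Edit-weighted mean editor age for one calendar year.

    rows: iterable of (first_edit_year, edits_made_in_year) pairs.
    Age is taken as year - first_edit_year (an assumption of this sketch).
    """
    rows = list(rows)
    total_edits = sum(edits for _, edits in rows)
    if total_edits == 0:
        return None
    return sum((year - first) * edits for first, edits in rows) / total_edits


# Toy community in 2020: (year of first edit, edits made in 2020)
toy_2020 = [(2005, 100), (2015, 300), (2020, 600)]
age_2020 = weighted_editor_age(toy_2020, 2020)

# Aging in place: the same editors make the same edits a year later.
age_aging = weighted_editor_age(toy_2020, 2021)

# Steady state: every cohort is replaced by one that is a year younger.
toy_2021_steady = [(2006, 100), (2016, 300), (2021, 600)]
age_steady = weighted_editor_age(toy_2021_steady, 2021)
```

Here `age_aging` is exactly one year higher than `age_2020`, while `age_steady` is unchanged, matching the two cases described above.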

The growth in weighted editor age was comparatively flat in the project's early years (likely due to the project's explosive growth in that period). But since 2006, weighted editor age has grown at a grimly consistent pace of approximately six months per year:

This is not as bad as it could be, but is definitely not the mark of a healthy community. Interestingly, 2021 showed the lowest absolute increase in weighted editor age of any year since 2006, going from 7.62 to just 7.77 years. This seems likely to be a pandemic anomaly; I hereby predict that 2022 will return to trend and will be the first year in Wikipedia history in which the weighted editor age exceeds 8 years.

It is perhaps noteworthy that from 2015 to 2017 the increase was reduced to approximately 4 months per year, before skyrocketing to more than 8 months in 2018. The cause of that shift would be interesting to know -- was it due to improved new-editor retention, or were older editors editing less?

Using the metric of two-year retention (i.e. the share of editors who first made a mainspace edit in a given year, say 2001, who also made a mainspace edit two years later, in 2003, and so on for each later cohort), retention appears to have been uniformly terrible throughout the 2010s:

Accordingly, it seems fair to say that the 2015-2017 editor age anomaly was not due to improved retention, but to some other shift in community composition.

Turning back to the overall trend, the simplest explanation for the decline is that the community became much more hostile in the mid-to-late 2000s and has stayed that way ever since, with the "new normal" of hostility becoming self-perpetuating once it was established.

But the trend could also be explained by the community becoming better at filtering out bad or uncommitted users more quickly. In that case, we would expect that the five-year and ten-year retention rates would have improved, or would at least have become closer to the two-year retention rate: we would lose more users up front, but the ones we don't lose would stick around longer. Unfortunately, this expectation is not borne out:

Apart from some anomalies in the earliest years, retention has consistently dropped by roughly half from two years to five years, and again from five years to ten years. It appears that if someone is still around after two calendar years, changes in the Wikipedia environment no longer have much effect on whether they will stick around longer; the odds will be about 50/50 at year 5, and for those still around at year 5, they will be about 50/50 at year 10.
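The two-, five- and ten-year retention figures can all be read as one calculation with a different offset. A minimal sketch, assuming each user is reduced to the set of calendar years in which they made at least one mainspace edit (the data layout and the toy users are illustrative, not taken from the real datasets):

```python
def retention(active_years_by_user, cohort_year, k):
    """Share of the cohort_year entering class that also edited in cohort_year + k.

    active_years_by_user: dict mapping user -> set of calendar years with
    at least one mainspace edit. A user belongs to the cohort of their
    earliest active year. Activity in the intervening years is ignored.
    """
    cohort = [years for years in active_years_by_user.values()
              if years and min(years) == cohort_year]
    if not cohort:
        return None
    kept = sum(1 for years in cohort if cohort_year + k in years)
    return kept / len(cohort)


# Toy data (invented): four members of the class of 2010, one of 2009
toy_users = {
    "u1": {2010, 2011, 2012, 2015, 2020},
    "u2": {2010, 2012},
    "u3": {2010},
    "u4": {2010, 2015},
    "u5": {2009, 2010, 2012},
}

two_year = retention(toy_users, 2010, 2)
five_year = retention(toy_users, 2010, 5)
ten_year = retention(toy_users, 2010, 10)
```

Note that this counts anyone who edits in the target year, including editors who were absent in between, which is how the two-year metric is defined above.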

In view of that pattern, it appears that we are not actually doing a better job of finding dedicated Wikipedians. We are simply doing a more effective job of driving them away.

As a member of the anomalous entering class of 2004, I am tempted to take this a little further. Perhaps an editor's formative on-wiki experiences shape the editor's overall attitude toward the project so that editors who have had positive early experiences will not only be less likely to attrit in Y2, but also -- if they are still around in Y2 -- less likely to attrit in Y5 and Y10, even if the wiki has become a less friendly place in the meantime. Given that Wikipedia was a vastly friendlier place in its early years than it is today, this would at least partially account for the early anomalies. It would also suggest that the experience of the entering class of 2012 was uniquely bad even by 2010s standards. Of course, this is merely a hypothesis, but it appears plausible based on the data above.

In voice-exit-loyalty terms, this would mean the quality of a user's experiences during the formative period determines the user's degree of loyalty to the project going forward, i.e. how likely the user is to stick with the project when the going gets rough. Put that way, this result is scarcely surprising. But it also suggests a more damning inference: since Wikipedia is almost completely insensitive to the signal from users choosing to exit, it is precisely those users who had the best experience joining Wikipedia who have had the greatest voice in causing that experience to be so much worse for subsequent Wikipedians. We have, in essence, pulled the ladder up behind us.

This is not what good project stewardship looks like.

Age and edit band
The weighted age metric gives us only a crude idea of who is doing what. What if we break down the edit bands by age?

One thing that stands out here is that the bulk of activity from the oldest editors comes from a small number of power users, while the newest editors are responsible for most of the edits by long-tail users. Put differently, among the oldest users it is the smallest number who make the greatest number of edits, while among the youngest users it is the greatest number who make the greatest number of edits. It also stands out that there is no editor stratum in which either of the middle-aged bands predominates. (But perhaps this just reflects a poor choice of age bands.) What did these patterns look like in 2015?

That's somewhat different, but what's really happening here? Is this just the effect of the especially prolific class of 2006 not having reached year 10 yet?

On first look, at least, that seems to be the case.

How about the First Pandemic Year of 2020?

There are some mildly interesting changes here, but overall I suppose the most striking thing is how little even that surge in newcomer activity changed the overall power structure, being largely swamped by increased editing by older editors.

Project space stratification by mainspace band
Prior to processing any data, I expected that edits in project space (which for practical purposes I define as both namespaces 4 and 5, since their purposes are seldom clearly distinguished) would show a greater level of stratification than mainspace, and that this trend would have steadily increased over time.
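One way this breakdown might be computed is sketched below: classify each editor by their mainspace edit count for the year, then distribute their edits in namespaces 4 and 5 across those bands. The band definition (digit count, capped at 5, with band 0 for editors with no mainspace edits that year) and the tuple layout are assumptions of this sketch; the toy rows are invented.

```python
from collections import defaultdict

MAINSPACE = 0
PROJECT_NAMESPACES = {4, 5}  # Wikipedia: and Wikipedia talk:


def mainspace_band(mainspace_edits):
    """Digits in the yearly mainspace edit count, capped at 5; 0 if none."""
    if mainspace_edits == 0:
        return 0
    return min(len(str(mainspace_edits)), 5)


def project_share_by_mainspace_band(edits):
    """edits: iterable of (user, namespace, edit_count) for a single year."""
    edits = list(edits)
    mainspace_totals = defaultdict(int)
    for user, namespace, count in edits:
        if namespace == MAINSPACE:
            mainspace_totals[user] += count

    project_by_band = defaultdict(int)
    for user, namespace, count in edits:
        if namespace in PROJECT_NAMESPACES:
            band = mainspace_band(mainspace_totals.get(user, 0))
            project_by_band[band] += count

    total = sum(project_by_band.values())
    if total == 0:
        return {}
    return {band: n / total for band, n in sorted(project_by_band.items())}


# Toy year (invented numbers)
toy_edits = [
    ("A", 0, 15000), ("A", 4, 300), ("A", 5, 100),
    ("B", 0, 250), ("B", 4, 50),
    ("C", 4, 50), ("C", 1, 10),
]
toy_shares = project_share_by_mainspace_band(toy_edits)
```

In the toy data, editor A (five-digit mainspace band) accounts for 400 of the 500 project-space edits, so band 5 gets a share of 0.8.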

Reflections on project space stratification
These initial numbers show a noticeable upward trend in the percentage of project-space edits made by editors in the mainspace five-digit band, but the overall degree of stratification seems to be somewhat less (or at any rate not significantly greater) than in article space. I have once again been proven at least partially wrong. Hooray!

But the biggest takeaway for me is that there's more going on in project space than I would have anticipated (1.7 million edits a year?!). I suspect that a meaningful analysis of trends in project space would have to be more granular -- trends in participation on noticeboards, wikiprojects, policy pages, and the deletion zone may all be quite different.

How about editor age? If we run the same weighted-age metric on edits to namespaces 4 and 5 that we ran above for mainspace, we get a familiar but somewhat less regular trend:

The weighted average age for editors in project space rose by about 8 months a year from 2007 to 2019, jumped by 1.1 years to 9.1 in 2020, and rose even further to a staggering 9.3 in 2021. (While the number itself is alarming, it is perhaps noteworthy that this reflects the same, possibly pandemic-related flattening in 2021 that we saw in mainspace.)

I am not quite sure what to make of the apparent drop in weighted age in 2022. It seems like the weighted age should be overstated at the middle of the year -- but if that is the case in 2022, then when the year is complete, this will be the first year in all of Wikipedia history when the weighted age of participants in project space fell from the previous year (and perhaps by as much as two years).

Again, a more granular analysis of exactly what is going on in various sub-areas of project space would probably be necessary to make sense of these numbers.

Priors
Before processing any data, I anticipate the following results:
 * 1) User stratification, operationalized as the percentage of non-bot edits made by very active editors, will have become considerably more pronounced than in 2011.
   * This will be true regardless of whether the edit count is limited to mainspace, or whether users are grouped by their highest level of monthly activity, overall mean level of monthly activity since first edit, or in-year level of monthly activity.
 * 2) Indeed, there will be no year since 2006 in which user stratification did not increase.
   * As above, this will be true regardless of how the data is sliced.
 * 3) Article stratification, operationalized as the percentage of non-bot mainspace edits made to very active articles, will also have consistently increased year-on-year since at least 2006.
   * As above, this will be true regardless of how the data is sliced (highest level of monthly activity, overall mean level of monthly activity since creation, or in-year level of monthly activity).
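The "no year in which stratification did not increase" predictions are easy to test mechanically once a yearly series exists. A minimal sketch, with a made-up series that is not drawn from the dump:

```python
def years_without_increase(series):
    """Return the years whose value is not strictly above the previous year's.

    series: dict mapping year -> stratification share for that year.
    An empty result means the series rose every single year.
    """
    years = sorted(series)
    return [year for prev, year in zip(years, years[1:])
            if series[year] <= series[prev]]


# Invented values for illustration only
toy_series = {2006: 0.50, 2007: 0.52, 2008: 0.52, 2009: 0.55, 2010: 0.54}
failures = years_without_increase(toy_series)
```

In this toy series the prediction fails for 2008 (flat) and 2010 (a decline). A flat year counts as a failure because the prediction is phrased as a strict increase.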

Code
The current rough code can be found here.

Data
CSV datasets generated by the above code, in the form of 22 annual user-page-month CSV files and one CSV file each of user and page definitions, can be found here.