User talk:Mr.Z-man/newusers

This is really interesting research, but is there any chance of differentiating long-term editing rates between the different reasons why articles get deleted? I'm assuming that we never convert spammers and rarely get authors of non-notable articles to shift to more notable subjects, but it would be great to know.  Ϣere Spiel  Chequers  17:02, 21 March 2011 (UTC)


 * This is fantastic. I took the liberty of putting this into a visual.  Would you mind checking this chart to see if I've captured the information correctly?


 * A few other questions/comments:


 * I wanted to make sure I understood the numbers for the Article --> edit page --> deleted flow. Does this suggest that of Registered Users that created an account in Feb 2010 and subsequently edited a page in the main namespace as their first edit, 3.2% (1,223/38,404) of these first edits got deleted (e.g., reverted)?  I'm trying to see if this data is roughly consistent with the revert ratio data that we've captured elsewhere.  The flip-side of this, of course, is that 97% of the first edits to main namespace pages from registered users are actually kept, which is very interesting.
 * Would it be possible to repeat this analysis for the Februaries of 2004-2009? The most important years would be 2004-2007 -- I'd like to see if the data have changed along with the other trends we're witnessing (see Editor Trends Study for data and discussion of these trends).
 * Regarding the comment "Whether a user is still active was determined based on whether or not they had made any action in recentchanges." What's the timeframe of this? I'm assuming this would be from edit to date of script run?
 * Thanks again for this work. It really helps push forward our understanding of the initial experience for registered editors. Howief (talk) 21:21, 21 March 2011 (UTC)
 * The chart looks right. However, it only looks at actual deletion for both cases, as in the article they edited was subsequently deleted. Determining whether an edit was reverted is much more difficult and error-prone. Doing it with any degree of reliability requires using the full history DB dumps. The timeframe for recentchanges is simply the 30 days prior to when the script was run (22 September 2010). Doing it with older years wouldn't be an exact comparison because it only looks at users currently active. So for 2004 it would only capture users who are still active 7 years later. It would probably be possible to do it (using the revision/archive table and setting timestamp bounds) but I don't have time to do that now. I'll try to put the source of the script up somewhere soon. Mr.Z-man 20:54, 24 March 2011 (UTC)
 * I was wondering, of the 2,375 articles that weren't deleted, is there any way of telling how many were redirected to existing articles and how many are still articles?  Ϣere Spiel  Chequers  21:18, 24 March 2011 (UTC)
 * Thanks -- your explanation is very helpful. As you mention, determining reversion is difficult and we're actually getting started on a project that digs into reversion patterns and how they have changed over time.  And your point on the timeframe is understood -- for this data, we're looking at whether users who registered in Feb 2010 were active in the 30 days prior to 22 Sep 2010.  As an fyi, we're using your analysis as the starting point for developing a user model to help us understand where the trouble spots are with new users (e.g., where new editors are falling off).  I can share this with you if you're interested.  Howief (talk) 17:42, 25 March 2011 (UTC)


 * This is seriously great analysis going on, but I was just wondering... what was your methodology? Did you use the toolserver, the dumps, RecentChanges? It's sort of similar to not only Howie's study but to the stuff I'm working on for the Contribution Taxonomy Project, so I'm super curious which method you used to get the data and how you ran your script(s?). Steven Walling at work 21:35, 25 March 2011 (UTC)


 * Yeah, I love this data, though I don't totally agree with some of the conclusions (I'd say the number creating pages is still very significant), and I love the visual, WereSpielChequers! It would be really interesting to do this for more recent data as well. It would be great if you could post the source. Jalexander--WMF 22:27, 26 March 2011 (UTC)
 * Actually Howie made the visualization, though the data wasn't produced by him... Steven Walling at work 06:03, 27 March 2011 (UTC)
 * AHAH! Sorry Howie!! Jalexander--WMF 06:11, 27 March 2011 (UTC)

Query
In running some numbers, this report suggests we had about 1774 active editors in the 30 days prior to the report, of which roughly half would be admins. (This contradicts the official stats here, which say we should have about 3500 active editors.) Am I reading this wrong, or do we really have this few active editors? If so, current decline trends would put our project about a couple of years off from collapse.... Based on your average retention number, and looking at the number of editors we are projected to gain the next year, we will gain 160 editors (100+ edits) in the next 12 months, while during the same period we would lose 212 editors, based on the average 12% annual loss we've been having since 2007, for a net loss of 10% of editors annually, which would put us at zero editors in about 12 years, or half our current numbers in 6 years. I don't know a metric for the minimum number of editors needed to maintain the project, but I would take a stab and say we can't do with much less than 1000. I would love to explore this more. &mdash;Charles Edward (Talk 17:58, 1 April 2011 (UTC)

I think a good metric to judge our minimum requirements would be to count up all the administrative actions - rollbacks, blocks, page protections, page patrols, etc, over a certain period of time to determine the average number conducted daily. Then determine the average number of those actions done by existing editors. Determine how many average editors are needed to conduct that average amount of maintenance actions, and we will have a metric for the minimum number of editors required to keep the project from falling into disrepair. This would allow us to determine an approximate threshold at which the project will begin a gradual collapse. I am not sure how to compile those numbers though. :( &mdash;Charles Edward (Talk 18:30, 1 April 2011 (UTC)
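The metric proposed above comes down to one division. A minimal sketch, assuming both averages have already been measured; the function name and the example figures are hypothetical and are not taken from any report:

```python
import math


def min_editors_needed(actions_per_day: float, actions_per_editor_per_day: float) -> int:
    """Smallest whole number of editors whose combined output covers the daily workload."""
    if actions_per_editor_per_day <= 0:
        raise ValueError("per-editor rate must be positive")
    return math.ceil(actions_per_day / actions_per_editor_per_day)


# Hypothetical inputs for illustration only (not measured data):
# 5000 maintenance actions per day, 2.5 actions per average editor per day.
example_threshold = min_editors_needed(actions_per_day=5000, actions_per_editor_per_day=2.5)
print(example_threshold)
```

This treats every editor as average; since maintenance work is probably concentrated in a small group, a real estimate would likely need the distribution rather than the mean.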
 * Yes, you're reading it wrong. This only counts active editors who created their account in February 2010. That said, it also uses a fairly loose definition of "active" (1 edit in 30 days) while the official stats use a rather strict (and IMO questionably useful) definition of 100 edits per month (You can make 3 edits per day and not be active?). If they're still making an occasional edit several months after creating an account, then for all intents and purposes they're an active editor. Mr.Z-man 22:41, 1 April 2011 (UTC)
 * Phew! This really is an excellent study though, kudos! It seems to establish as fact some editors' long-held positions about the source of our retention problems.&mdash;Charles Edward (Talk 13:54, 2 April 2011 (UTC)

Article creations in first ten edits
Since the proposal is for requiring autoconfirmed, it is also of interest to have information on the number of users who have created articles/redirects as part of their first ten edits. Also taking registration date into consideration would be even more encompassing, but I'm not sure of the feasibility. At least for the ten edits, can this be done? Cenarium (talk) 21:40, 4 April 2011 (UTC)

Update: April 2011
I ran an updated version of the script a few days ago on February 2011 accounts (source code coming soon). It collected a lot more data, which is always nice, but unfortunately means it no longer lends itself to simple representations. Instead of looking at the first edit, it looked at the first 5 edits (I ran it before Cenarium's comment above). It recorded:
 * The user's ID#
 * The user's total edit count
 * The age of the account, in days
 * Whether they created an article
 * Whether they created something in their userspace
 * Whether they created something else
 * If they created an article, was it deleted
 * Did they edit a mainspace page (excluding ones that they created)
 * Did they edit the sandbox
 * Did they edit any other page (excluding ones that they created)
 * If they edited an article, was the edit reverted (based on an estimate looking at the edit summary of the next user)
 * If they edited an article, was their edit deleted
 * Whether they were welcomed
 * Whether they were warned
 * Whether they were given a non-indef block
 * Whether they were still active
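The revert estimate in the list above (checking the edit summary of the next user) could be sketched roughly as below. The keyword pattern is an assumption chosen for illustration; the script's actual rule is not shown in this thread, and `looks_like_revert` is a hypothetical name.

```python
import re

# Hypothetical keyword pattern (an assumption, not the script's real rule).
# It matches common revert wording such as "Undid revision ...",
# "Reverted edits by ...", "rv" and "rvv".
REVERT_PATTERN = re.compile(r"\b(undid|revert(?:ed|ing)?|rvv?)\b", re.IGNORECASE)


def looks_like_revert(next_edit_summary: str) -> bool:
    """Guess whether the following edit reverted the previous one, from its summary alone."""
    return bool(REVERT_PATTERN.search(next_edit_summary or ""))


print(looks_like_revert("Undid revision 12345 by Example (talk)"))
print(looks_like_revert("Fixed typo"))
```

A summary-based check like this is only an estimate: it misses reverts made with blank or custom summaries and can flag edits that merely mention a revert.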

Remember that this is looking at the 1-2 month retention rate, not 7 or so months as last time. A brief-ish summary of the results is below. If there are any other combinations you'd like to see, let me know. For those with a toolserver account, the data is available on the s1-user server in the u_alexz_p database. If others are interested, I could potentially provide a dump of the table. I'll write a GUI frontend for this eventually. I plan to re-run the script each month to see if there's any change in long-term retention. Mr.Z-man 03:21, 9 April 2011 (UTC)
 * 54677 users made at least one edit, 5608 were still active, for an overall retention rate of 10.26%
 * 9926 users created an article, 7320 (73.75%) had their article deleted.
 ** Of users whose article was deleted, 351 (4.8%) were still active.
 ** Of users whose article was not deleted, 683 (26.2%) were active.
 ** Of users who edited other articles in addition to creating an article (2320), 1460 (62.9%) had their article deleted
 ** The overall retention rate for article creators was 10.4%
 * 37195 users edited a mainspace page other than ones that they created
 ** 4231 (11.4%) were still active
 ** 34875 edited another page but did not create an article
 *** 3747 (10.7%) were still active
 ** 10149 had their edit reverted
 *** 950 (9.4%) were still active
 ** Of users whose edits were not reverted, 3281 (12.1%) were still active
 * 7572 created something in their userspace, 1036 (13.7%) were still active
 * 1173 edited the sandbox, 130 (11.1%) were still active
 * 18460 were welcomed, 2166 (11.7%) were still active
 * 14304 were warned, 1243 (8.7%) were still active
 * 139 were blocked, 40 (28.8%) were still active
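For anyone wanting to recheck the percentages, a short sketch that recomputes a few of them from the counts quoted in the summary. The only derived figure is the number of kept articles (created minus deleted); the variable and function names are illustrative.

```python
def pct(part: int, whole: int, digits: int = 1) -> float:
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    return round(100 * part / whole, digits)


# Counts quoted in the summary.
created = 9926          # users who created an article
deleted = 7320          # of those, users whose article was deleted
kept = created - deleted  # derived: users whose article was not deleted
active_deleted = 351
active_kept = 683

rates = {
    "overall": pct(5608, 54677, 2),
    "article deleted": pct(active_deleted, deleted),
    "article kept": pct(active_kept, kept),
    "all article creators": pct(active_deleted + active_kept, created),
}

for label, rate in rates.items():
    print(f"{label}: {rate}%")
```

The two creator subgroups add up to the 10.4% overall creator retention, so the deleted/kept split accounts for all article creators in the data.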


 * So our highest retention rate (at the 30–60 day time period) is... people who got blocked? WhatamIdoing (talk) 17:22, 9 April 2011 (UTC)
 * I'm hoping that this is a slightly different group we are talking about here, and that the forty active editors who've been blocked are mostly recent blocks rather than editors blocked in February who are still active. Welcoming seems to give a fairly small uplift overall, but I suspect there are dramatic skews according to welcome message. Many of the editors whose articles are deleted will get a welcome, often delivered with a deletion tag. It would be really useful to get some retention rates by welcome template, or at least a breakdown by whether or not they had an article deleted. Also, some of the welcome messages are actually warnings, such as Welcome vandal; were they counted as warnings, welcomes or both?  Ϣere Spiel  Chequers  18:02, 9 April 2011 (UTC)
 * Yes, the "active" check doesn't distinguish if the activity was before or after the block. Since the full results are recorded now instead of the aggregate, it will be possible to see how many of those blocked users are still active next month. Mr.Z-man 19:57, 13 April 2011 (UTC)


 * This is valuable research, Mr.Z-man - thanks! Is it possible to get a breakdown under new user first article creation of how many created unassisted, how many used Articles for Creation and how many used the wizard?  If so, is there data on how many are still active users in each case?  RedactionalOne (talk) 17:42, 13 April 2011 (UTC)
 * Not with the data I have, all it includes is the list at the top of this section. Mr.Z-man 19:57, 13 April 2011 (UTC)


 * There doesn't seem to be an easily discoverable method for creating an unassisted article, so I guess it's reasonable to assume most new users have been using one of the other two methods (unless there's been a recent change). The significance is that the data appears to show that of those whose articles were deleted, only 4.8% are still active, versus 10.4% retention for new users overall.  If the bulk are already using the Article wizard, making this a requirement for the first article - while still a good idea - might not significantly increase retention.  RedactionalOne (talk) 21:58, 13 April 2011 (UTC)

Statistical analysis
Rather than just presenting tables, interesting though they are, it's possible to do some statistical analysis on the data. It's important to do this, because there's a strong relationship between two of the variables which may predict whether new editors stay or go: creating a new article is much more likely to lead to a deletion than making an edit.

One way of analysing the data is via "logistic regression". (The article Logistic regression seems, like a lot of mathematical articles, to be written entirely for mathematicians and so useless to most readers.) In logistic regression on this example (the data in the table in the article) we ask whether we can predict the likelihood of Staying (i.e. not being classified as Gone) from three potential predictor variables: Create (was the first action to create a page?), Deleted (was the page deleted?) and Main (was the first edit to a mainspace page?).

Step 1: I applied logistic regression using all three predictor variables (via this web page). The result showed, as is obvious from the data, that collectively all three predictor variables do predict whether users Stay or not (p < 0.0001). However, the third variable, Main, did not have a statistically significant impact (p = 0.40), i.e. it did not add to the prediction by the other two.

Step 2: So I removed Main, and used only two predictor variables, Create and Deleted. Both were statistically significant predictors of Stay at the 5% level (Create with p = 0.02, Deleted with p < 0.0001). (For the statistically minded, I also checked the interaction and it was non-significant.)

One way of looking at the size of the effect of the two predictor variables is to look at the "odds ratio". The odds ratio for Create is 1.16, i.e. holding constant the effect of Deleted, users whose first action is Create are predicted to have 16% higher odds of staying (however the 95% confidence range on this is very wide, from 2% to 32% higher). The odds ratio for Deleted is 0.22, i.e. holding constant the effect of Create, users whose first edit is deleted are predicted to have 78% lower odds of staying (the 95% confidence range on this is 72% to 83% lower).
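To make the odds-ratio figures concrete, here is a minimal sketch of the arithmetic. The two odds ratios are the ones quoted above; the 10% baseline probability is a hypothetical value chosen only for illustration, and the function names are made up for this example.

```python
def pct_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1) * 100


def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Probability after multiplying the baseline odds by the odds ratio."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


print(round(pct_change_in_odds(1.16)))  # Create: change in odds, in percent
print(round(pct_change_in_odds(0.22)))  # Deleted: change in odds, in percent

# Hypothetical baseline: a user with a 10% chance of staying.
baseline = 0.10
print(apply_odds_ratio(baseline, 0.22))
```

An odds ratio scales the odds p/(1-p), not the probability itself, so the change in probability depends on the baseline; for small baseline probabilities the two are close.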

So, perhaps not surprisingly, although there is a smallish positive effect of being allowed to create new articles, it's far outweighed by the negative effect of deletion.

But what we need to know is why there was a deletion. If it was because the new editor was trying to advertise or push some POV, then we should not be concerned if they never came back. If it was because of some technical failings in understanding Wikipedia policies, then we should be concerned, and should do something about it. Peter coxhead (talk) 13:55, 12 April 2011 (UTC)

Cleanup of results page
I was reviewing comments I made in the "autoconfirm to create" discussion, and came across the results page for your newusers script. I'm normally hands-off when it comes to userpages, but I took the liberty of formatting the latter section, simply to make the presentation more uniform. There are other, more minor changes as well (e.g. arranging numbers in descending order; punctuation, case... all the things I fuss over). At first, I just wanted to create an anchor to the results matrix, but it devolved into 45 minutes of me playing around with the tables and such. I hope this doesn't breach etiquette, and more importantly, doesn't interfere with further analysis or updates to your dataset. I'll be happy to undo the changes if they aren't to your liking. 19:29, 12 April 2011 (UTC)

mainspace?
What does mainspace mean? 211.225.34.163 (talk) 12:22, 24 April 2011 (UTC)
 * It's where we keep the encyclopaedia articles. We have a number of other spaces including for categories, templates and also user pages - this page is a "user talk" page.  Ϣere Spiel  Chequers  19:19, 15 June 2011 (UTC)