User:Manning Bartlett/Moni3 ANI analysis

OK, this is a page for discussion of Moni3's proposed AN/I analysis.

Background discussions (copied & edited for relevance)
Note: The following collapsed discussions provide all the background to this project. They have been edited (by me) for relevance - please see the linked version for the complete discussions.

=== Excerpted from AN_talk discussion===

 Data collection to assess how effective ANI is at responding to complaints

I apologize right up front here. I will be unable to do this because I simply don't have the time, but I'm hoping this idea will spur on someone else who does.

To be better informed about how effective ANI is at responding to editor complaints, those who are participating in this discussion should be aware of how ANI operates on a daily basis. If I had the time, I would chart the success of each thread at ANI going back at least three months. I would rate each thread on a scale of 1 to 5, much like the DEFCON ratings:


 * 5: Thread is succinct. Admin action taken quickly (including the original poster blocked or warned per WP:BOOMERANG). Admins are polite and understanding to new editors.
 * 4: Thread is short. Original poster told to go elsewhere or not answered. Admin replies incomprehensible to new editors (using multiple/frequent acronyms or Wikipedia jargon--such as "Diffs?", NPA, BEANS, etc., without links/explanations). Thread is archived or collapsed without the original problem sufficiently addressed.
 * 3: Thread is longer than necessary because of admins or other editors arguing. Original poster's concern overcome by decreasing quality of communication by multiple participants in the thread. Admins are dismissive and/or rude to original poster and/or each other.
 * 2: Thread has more than one section; admins arguing with each other; multiple accusations of personal attacks and incivility; original complaint forgotten.
 * 1: Thread is absolute chaos, resulting in one or more of the following: multiple sections; discussion is sidetracked multiple times; an argument involving two or more editors moves from some other page to ANI, each of them accusing each other of the same behavior that brought them to ANI and it escalates; person performing this review cannot discern what the problem is or what solution is being offered; edit warring in thread; wheel warring in thread--or elsewhere because of thread; some unforeseen factor(s) that I cannot list but results in dissatisfaction by original complainant and multiple participants in the thread.

A: "Sidetrack" means any instance someone inserts a comment irrelevant to the original complaint, including attempts at humor, comments about responding admin(s), accusations that one or more responding admins should not be participating because they are involved, or something else that does not address a solution to the stated problem.

For whoever may take this on, and I really hope someone will, my DEFCON rating system here is based on my experiences at ANI. However, if you see fit to tweak or change the rating method, any changes should still rate each thread's success on:
 * 1) How well the original complaint was handled;
 * 2) The register of language in the thread: polite and professional, informal, rude, dismissive, or abusive (which I know is subjective, but I hope you get the idea); and
 * 3) The overall efficacy of how editors--and admins in particular--communicated and applied Wikipedia's standards to whatever the original problem was.

You may have to ping my talk page to get me to respond here if you have questions. Please consider taking this on, and again, I apologize for not being able to do this myself. --Moni3 (talk) 22:43, 10 February 2012 (UTC)


 * Moni - a very worthwhile exercise, but obviously not the briefest of tasks. I think I'll get started on it, though - I'll use an Access database here at home. I'll do one archive file, let you see the results and then we can tweak the approach before tackling any more archives. Manning (talk) 00:06, 11 February 2012 (UTC)
 * Wow, props to you, Manning - I didn't think anyone would go with this. One suggestion I have: could there also be an alternate way of scoring a 5: thread is lengthy and stays open for a comparatively long time, but evidence is gathered, a consensus is reached and a confusing situation becomes clear. Short and quick is not always the way to go and occasionally it's the right thing to let a discussion run, as we have seen in recent days. Kim Dent-Brown   (Talk)  00:12, 11 February 2012 (UTC)
 * Hey Kim - this is what I do for a living, so it's not entirely unfamiliar. I'm actually not going to try to apply the Defcon formula yet. The best approach is to gather the data first, then figure out how best to interpret it. I'll look at each thread, then each comment, and capture who posted it, what time, and mark it against criteria for relevance, tone, etc. While this last part is necessarily going to be subjective, it's better than trying to score an entire thread off the bat. As I said, I'll do one archive and then report back. You guys can also examine my scoring against that archive to see if you agree with how it's being done. If either of you have MS Access available, I can send the DB to you once I've built it. Manning (talk) 00:28, 11 February 2012 (UTC)

Excerpted from Moni3's talk page
Work proceeds

I finished the basic framework yesterday. The schema allows for diffs to be mapped to a user, to a specific thread and to an archive. These individual diffs can then be scored, and these scores can be rolled up to the thread level. I've also made room to expand the system to consider other noticeboards apart from AN/I, but I'm not planning to use that functionality in the short term.

Thread data includes who started the thread, how many comments it attracted, the open and close date, the general category (user complaint, sock, vandalism, page prot, etc) and whether it was resolved. User data includes whether they are an admin or an editor. (There are numerous other editor attributes we could include, but this was enough to get started). Diff data includes which user started it, the full text of the diff, the edit summary, and some analytic parameters.

The diff analytics consists of a set of 1-10 scores on various parameters. To get started I chose the following parameters: "Tone", "Relevance", "Sarcasm", "Hostility", "Constructiveness". I also have a flag for "Side Topic", and a measure for "Side topic relevance" - part of an attempt to analyze thread drift.
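To make the archive-to-thread-to-diff structure concrete, here is a minimal sketch of how such a schema might look, written in SQLite rather than Access. Every table and column name below is my own illustrative guess, not what the actual prototype uses.

```python
import sqlite3

# Hypothetical sketch of the archive -> thread -> diff hierarchy described
# above. Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE archive (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, is_admin INTEGER);
CREATE TABLE thread (
    id INTEGER PRIMARY KEY,
    archive_id INTEGER REFERENCES archive(id),
    started_by INTEGER REFERENCES user(id),
    category TEXT,          -- user complaint, sock, vandalism, page prot, ...
    resolved INTEGER
);
CREATE TABLE diff (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER REFERENCES thread(id),
    user_id INTEGER REFERENCES user(id),
    edit_summary TEXT,
    -- 1-10 analytic scores, per the parameters named above
    tone INTEGER, relevance INTEGER, sarcasm INTEGER,
    hostility INTEGER, constructiveness INTEGER,
    side_topic INTEGER      -- flag, part of the thread-drift analysis
);
""")

# "Rolling up" diff scores to the thread level is then a simple aggregate:
rollup = """
SELECT t.id, AVG(d.hostility), AVG(d.relevance), COUNT(*)
FROM thread t JOIN diff d ON d.thread_id = t.id
GROUP BY t.id
"""
```

The point of the sketch is only that per-diff scores plus foreign keys are enough to derive thread-level measures later, rather than scoring whole threads up front.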

Note - It is already apparent to me that these parameters need refinement, but whatever, that's why we have prototypes.

I have loaded about 1000 users into the system (based on a scraping of who edited AN/I) and the thread headers from AN/I Archives 729 and 730. I've loaded about 50 diffs as well, but I want to get to 1000 before sending it over to you for review (1000 diffs is only about 3 days' worth). It takes about a minute to enter a single diff (this will get faster), so there are quite a few hours of work ahead. I hope to have something to you by the end of the week.

Do you have MS Access available to use? I chose Access because it is great for "quick and dirty" development, as there are (no doubt) numerous fundamental changes yet to be made before we have a system we're all happy with. If you don't have Access I can dump the results out to a Google spreadsheet or similar resource. If the system seems useful we can look into making it more permanent in nature, but that's a long way off. Manning (talk) 03:47, 13 February 2012 (UTC)


 * Thanks for letting me know about this, Manning, and thanks for the work. I'd love to look at what you've done - I might even have something sensible to say - it's not unknown, just rare :-) My home computers all stop at Access97, but I have the later versions on my work machines. It might be a day or so before I can spend much time looking at it, though, so don't hold anything up on my account. Thanks again for the work, and the heads up. Begoon talk 04:52, 13 February 2012 (UTC)


 * I can downgrade to 97 no problem; I'm only using 2000 (I prefer it to the later versions). Since posting the above I've been looking into getting a SQL dump out of the mediawiki database. While extracting one diff at a time is fine for preliminary development, it would be a lot easier if I could do a bulk read to get a few thousand diffs loaded. (We'll still need to score each one manually, but at least having the basic data in place will save a massive amount of time.) Of course, I've never tried to get my head around the Mediawiki schema before, but I'm sure I'll figure it out. I might even take the 13 GB dump of the whole thing just for giggles. Manning (talk) 08:01, 13 February 2012 (UTC)


 * Cool - well if/when you get to a point where you think a second pair of eyes might help you, let me know and we'll work out a way for me to get hold of it. As I say, I'd be happy to help, but if it's all going along swimmingly, then maybe wait till you're ready for "reviews"? Whatever suits you, really, since you're doing the work. I've got a local MediaWiki install on my web server that I just use for a testbed really, but I haven't, like you, studied the schema in depth, just poked around a bit when I needed to modify something. I've never had a problem finding or extracting what I needed though. Thanks. Begoon talk 08:11, 13 February 2012 (UTC)


 * Nah, should be good for now. I just discovered this service - Query Service - which should give me everything I need in terms of raw data. I can then upload that and we'll have thousands of diffs to look at. The real challenge will be deciding how to go about scoring them, but that will come later. All of you will get to have some fun then. Manning (talk) 08:45, 13 February 2012 (UTC)
 * You could also ask User:CBM for help. He has toolserver access and seems very good with SQL. ASCIIn2Bme (talk) 10:35, 13 February 2012 (UTC)
 * Manning, thanks for the heads up on this and very glad you're still going. I have Access too if I can help at all, and some statistical skills if we come to start analysing data - my preferred tool is SPSS but I think it can take an Access flat database and read it. But maybe this is already something you have covered. In terms of actually gathering the ratings, it's probably completely far-fetched and over the top, but I was thinking about the crowdsourcing solution at Galaxy Zoo. But I think that's probably just silliness.
 * I was looking at the suggested variables of "Tone", "Relevance", "Sarcasm", "Hostility", "Constructiveness". Will you have anchor text to describe each end of the 10-point scale? Also, I suspect there'll be a lot of correlation between some, which will make some redundant - eg I suspect tone, hostility and constructiveness will all correlate. We could get rid of one or two of these, and maybe have a further variable on something like "Use of policy" - ie the degree to which WP policy is invoked, quoted or linked to in the diff? Does the thread data need a bit more detail on the resolution as well? Not just whether it was resolved but how - block, page protected, editor warned, complaint dismissed etc etc...
 * Sorry if that comes off like an academic research supervisor advising a PhD candidate - I AM an academic research supervisor and old habits die hard. Thanks very much for doing this, I think it will be good to have the data to go with all the speculation we've been having. Apart from anything, it'll give us a benchmark against which we can measure any changes we eventually decide on. Thanks Moni3 for your input and the loan of your talk page also!! Kim Dent-Brown   (Talk)  11:05, 13 February 2012 (UTC)
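Kim's point that correlated parameters become redundant could eventually be checked empirically, once real scores exist. As a hypothetical sketch, a pairwise Pearson correlation over the per-diff score columns would flag the redundant pairs (all the numbers below are invented for illustration):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example scores for a handful of diffs:
tone      = [8, 7, 3, 2, 9]
hostility = [2, 3, 8, 9, 1]

# If |r| is high between two parameters, one of them is probably redundant
# and could be dropped in favour of something like "Use of policy".
r_tone_hostility = pearson(tone, hostility)
```

In this invented example tone and hostility track each other almost perfectly (inversely), which is exactly the kind of result that would justify collapsing two parameters into one.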

Kim - I'm just using Access for prototyping of a data entry system - once I load all the diffs, we'll need to go through and score them and I can't use SPSS for that. Ultimately the best system would be a web front end over a MySql database, but I'm using Access for now because it is ideal for "quick and dirty". If we started with a web-based system then every time we change our mind about the params we'd have to recode the user interface (which gets really tedious). Once we have the data loaded and scored then SPSS would be ideal for analysing it (or Cognos which I work with).

I'm not real stressed about the params just yet; I think when you actually have diff data to work with your thinking might change anyway (it did for me). Once I get something out to you with diff data loaded you'll be in a much better position to make design choices. Cheers Manning (talk) 11:25, 13 February 2012 (UTC)
 * Heck yes, I didn't mean SPSS for data capture - just for analysis once the data are assembled, cleaned and locked down. Look forward to seeing what develops! Kim Dent-Brown   (Talk)  12:02, 13 February 2012 (UTC)

Manning, et al, when I said I was short on time...that wasn't an exaggeration. This is the earliest I've been able to reply to Manning's original post on my talk page, and the rest of my week(s) are going to go the same. I feel like a dingus suggesting stuff and then having others do it, but it hadn't been suggested and it seems Manning and company know what they're doing--far more than I would.

Feel free to use my talk page to discuss it. --Moni3 (talk) 21:54, 13 February 2012 (UTC)


 * Thanks Moni. No dramas about you being time-poor. For general info, there's now a parallel discussion with CBM going on here (Thanks to Ascii for the tip). Manning (talk) 22:46, 13 February 2012 (UTC)

See Also - this discussion with CBM (a tool server developer).

Prototype development
OK, the first phase of the Access DB prototype is complete. It's got about 500 diffs in it (chiefly related to AN/I Archive 729). Each diff is linked to a parent thread (stored in the thread table) and parent threads are linked to an archive (stored in the archive table). Anyone who wants a copy of this database in Access 97 format please send me an email and I'll return the DB to you in a zip format.

The next step is to decide on the methodology to assess these. I've actually removed all the assessment parameters from the prototype I will send you, as I'd like you guys to look at it without my preconceptions attached.

Things I have learned so far:
 * My original idea of "copying the diff into a text field for later evaluation" is not workable. For example there is no way to copy a deletion or rollback. Hence it is not possible to create an extract file which "contains all the diffs".
 * The process is not quite as simple as "evaluating a diff". Some diffs are mere copyedits (or link/indent/formatting fixes) and don't need to be assessed if the result has no influence on the final version. Some are redactions (toning down) which alter how the final version appears.
 * A great deal of "tweaking" goes on. Some editors (not naming any names) revisit their posts a large number of times.

That's it for now. Am looking forward to your feedback. Manning (talk) 02:24, 15 February 2012 (UTC)


 * Thanks for keeping everyone up to date. I mailed you for a copy. I'll look at it as soon as I can, and see if I have anything sensible to say about it. Cheers. Begoon talk 02:45, 15 February 2012 (UTC)

CBM's Extract and first database
CBM just came through with a kick-ass extract. I've got 15,000 URLs for diffs, plus the edit summary and user details. From here we'd need to build an interface which pulls in each diff and allows us to score it. Anyway, while I've brought the extracted diffs into the prototype, I haven't done anything else with them (other than format the revision number into a URL).

OK, the prototype database is getting sent out now. There is one form there which has example parameters (although they are disabled at the moment). That's the kind of thing that can be built for the prototype. Once we get a better idea of what we actually want, I can build a new form, and once we are happy with that, we can turn it into a web-based tool. Manning (talk) 11:10, 15 February 2012 (UTC)


 * Got the mail - thanks enormously. 5 minutes looking at that and every comment you made above makes sense.


 * One thing that strikes me is that every initial thought I had about how you might automatically exclude certain diffs from processing starts to fall apart, because it filters out vandalism too, which we need to see... And anyway, if people fiddling with prose is so prevalent it could be causing edit conflicts, that would be a relevant finding - so I'm back in my thinking to scoring everything, even if it needs to be scored as "prose fiddling". Of course, depending on sequence, people adjusting their own comments can mean other things too...


 * As you say, it's all about the "scoring" methodology. I'll spend some proper playtime on it in the next few days and see what happens. I need to seriously score some diffs in a session before I can say anything more useful than that right now, other than thanks again for putting this together. Begoon talk 12:05, 15 February 2012 (UTC)

Scoring
I'm no data analyst and I'm more than happy to leave that to those who are. However I'm not entirely clear on how it's intended to score/categorise/whatever the threads and what the end result will be. So far I think we have Moni's initial DEFCON score and Manning's list ("Tone", "Relevance", "Sarcasm", "Hostility" etc). Assuming we're aiming for an end result that tells us some basic stats like the ratio of resolved to unresolved threads, and then gives a more detailed analysis of what made for a successful or unsuccessful resolution, do we need an agreed set of scoring criteria and scores? EyeSerene talk 13:27, 15 February 2012 (UTC)
 * Reposting a bit of what I said in the hatted section above: "I was looking at the suggested variables of "Tone", "Relevance", "Sarcasm", "Hostility", "Constructiveness". Will you have anchor text to describe each end of the 10-point scale? Also, I suspect there'll be a lot of correlation between some, which will make some redundant - eg I suspect tone, hostility and constructiveness will all correlate. We could get rid of one or two of these, and maybe have a further variable on something like "Use of policy" - ie the degree to which WP policy is invoked, quoted or linked to in the diff? Does the thread data need a bit more detail on the resolution as well? Not just whether it was resolved but how - block, page protected, editor warned, complaint dismissed etc etc... "
 * I think the key thing is that our selection of variables needs to be driven by a hypothesis. Eg, we might have a hypothesis that incivility in a thread is associated with higher levels of off-topic chat; or that threads closed quickly are more likely to be rated as having a clear, agreed ending. In which case we need to record variables for levels of incivility; levels of off-topic chat; speed of closure; and clarity/consensus ending. What are our hypotheses here? Kim Dent-Brown   (Talk)  16:09, 15 February 2012 (UTC)
 * I agree that the key to selecting the variables will be having some idea of what we think the end product ought to tell us. Regarding the scale, subjectivity is going to be almost impossible to avoid but I think we might minimise it with a smaller range rather than a larger one. For example, a three-point scale (eg good/neutral/poor) is easier to score and more likely to get the same scores from a range of different people than a ten-point one. It inevitably sacrifices some of the nuance, but given that scoring is subjective would the difference between, say, a '2' and a '4' on a ten-point scale be meaningful anyway? EyeSerene talk 16:29, 15 February 2012 (UTC)


 * Kim and ES - send me an email and I'll send you the prototype DB. You'll probably get a much stronger sense of what's needed when you look at the actual data. (ES, if you don't have MS Access I can dump it to another format like Excel). Manning (talk) 21:21, 15 February 2012 (UTC)
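The scale-granularity question raised above (three-point vs ten-point) could also be tested empirically once two people have scored the same diffs, by comparing exact-agreement rates on each scale. A toy sketch, with entirely invented scores:

```python
def agreement_rate(rater_a, rater_b):
    """Fraction of items on which two raters give exactly the same score."""
    assert len(rater_a) == len(rater_b) and rater_a
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Invented scores: the same ten diffs rated by two people on a
# three-point scale (1 = poor .. 3 = good)...
a3 = [3, 2, 1, 3, 3, 2, 1, 2, 3, 1]
b3 = [3, 2, 1, 3, 2, 2, 1, 2, 3, 1]
# ...and on a ten-point scale, where small disagreements are common.
a10 = [9, 6, 2, 8, 9, 5, 1, 6, 10, 2]
b10 = [8, 5, 2, 9, 7, 6, 1, 5, 10, 3]
```

With numbers like these the coarse scale shows much higher exact agreement, which is the intuition behind EyeSerene's suggestion; a proper analysis would use a chance-corrected statistic such as Cohen's kappa rather than raw agreement.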

Thread and diff table design
The key tables in the DB are the thread and diff tables. The real value of this system will obviously be in the thread analysis. The thread table contains the following:
 * The archive the thread belongs to
 * The thread name
 * The thread number (based on the TOC on the archive page)
 * User starting the thread
 * Thread category

The last column currently has ten choices: Block|Unblock Review / User Conduct Complaint / Content Dispute / Page Prot Request / Current Event response / Sock related / Admin conduct complaint / Vandalism in progress / Legal Threat / Policy Interpretation.

There may be other categories that need to be created. Based on the discussion above it is clear that the thread table needs more criteria. Some initial ideas could be:
 * Resolved status
 * Total comments
 * Total "on-topic" comments
 * Total users
 * (other?)

I've got more ideas about the DIFF table, but sadly I have to go to work now. Damn capitalism. Manning (talk) 21:15, 15 February 2012 (UTC)
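Most of the extra criteria listed above could be derived from the diff records rather than entered by hand. A rough Python sketch under that assumption (the record layout and field names here are invented, not the prototype's):

```python
def thread_stats(diffs):
    """Derive the proposed thread-level criteria from a thread's diff records.

    Each diff is assumed to be a dict with (invented) fields:
    'user' (who posted it) and 'on_topic' (relevance flag).
    """
    return {
        "total_comments": len(diffs),
        "on_topic_comments": sum(1 for d in diffs if d["on_topic"]),
        "total_users": len({d["user"] for d in diffs}),
    }
```

Keeping these as derived values rather than hand-entered columns means they stay consistent with the diff data as scoring proceeds.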


 * That looks great :) I've got two initial questions:
 * Are we filtering out 'illegitimate' threads (vandalism, wrong board etc) or including them in the analysis? I guess there's a case for removing them, but their handling (prompt removal, correct redirection etc) could form part of the analysis too.
 * I'm slightly wary about personalising the analysis, so do we need to know the user who started the thread? I'm not sure what that would tell us beyond contributing towards a primary key, though possibly knowing whether the respondents are admins/non admins might be useful in analysing the diffs.
 * EyeSerene talk 08:37, 16 February 2012 (UTC)


 * - currently not filtering out anything. I think a thread that really belongs at AIV/WQA is still "resolved" in a sense. Vandalism threads get erased pretty quickly and don't appear as threads in the archives. The diffs involved have been recorded chiefly because it is best to capture everything at first and then prune (much harder to work in the other direction).
 * - Including that was just "data analyst paranoia". As with the diffs, it's much easier to remove things than try to shoehorn them into the data model later on. If we decide we don't need it then it's no hassle to excise it later on.
 * I got your email, sending DB as requested :) Manning (talk) 09:38, 16 February 2012 (UTC)


 * Makes sense, and thanks very much. I'll have a look at it tonight :) EyeSerene talk 10:10, 16 February 2012 (UTC)

Experimenting with a totally different approach
As the purpose of this analysis is to review the effectiveness of threads, I've tried a new approach for collecting raw data (again using the randomly chosen Archive 729). This time I've simply scraped the entire text contents of the Archive page, and then mapped it to users (OK that part's still in progress).

The drawback of this approach is that we can't directly link back to the diff involved (as the archives are posted by MiszaBot II), and linking to users is a bit trickier. I've also had to replace all the return characters with (BR) tags, and I stripped out unicode. (Also, the dates and times all end up being Australian EST, due to my user settings.) The advantage is that it makes looking at the thread as a whole easier, and there is no uncertainty about which comment goes with which thread. I'll have a prototype version out fairly soon for anyone interested. Manning (talk) 23:42, 15 February 2012 (UTC)
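For what it's worth, the split-and-map step can be sketched in Python, assuming the scraped archive wikitext uses `== Section ==` thread headings and standard en-wiki signatures. Both regexes below are illustrative guesses, not what the prototype actually does:

```python
import re

# A standard signature looks like:
# "[[User:Name|Name]] ... 23:42, 15 February 2012 (UTC)"
SIG_RE = re.compile(
    r"\[\[User(?:[ _]talk)?:([^|\]/]+)[|\]]"      # username from the user link
    r".*?(\d{2}:\d{2}, \d{1,2} \w+ \d{4} \(UTC\))"  # trailing UTC timestamp
)
HEADING_RE = re.compile(r"^==\s*(.*?)\s*==\s*$", re.MULTILINE)

def split_archive(wikitext):
    """Split archive wikitext into (thread_title, [(user, timestamp), ...])."""
    parts = HEADING_RE.split(wikitext)
    # parts alternates: preamble, title1, body1, title2, body2, ...
    return [(title, SIG_RE.findall(body))
            for title, body in zip(parts[1::2], parts[2::2])]
```

This is exactly the "no uncertainty about which comment goes with which thread" property: every signature found between two headings belongs to the preceding heading's thread.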

Coding the first few items from ANI729
Hello Manning and fellow proponents of ANI improvement. Here's a table with my analysis of the first eight sections from Administrators' noticeboard/IncidentArchive729. I've tried to assign Moni's DEFCON code numbers. She also gives the meaning of her codes in the green collapsed section near the top of this page, labeled 'please open to read the various discussions..'.

This group of cases is boringly straightforward and most of them seem to be well handled. (These are threads 1-11, neglecting some repeats). They all get code 5, except for one that is code 4 (since no answer was provided). I invite anybody to scan through the remainder of ANI729 to look for cases of Moni's 'bad' codes. If all codes are good, then ANI does not need improving. Perhaps the table should be extended to just include the 'bad' cases.

-- EdJohnston (talk) 22:59, 22 February 2012 (UTC)