User talk:WereSpielChequers/Archive 12

This is where I archive threads that are mainly about Poop patrol.

Happy coincidence
Hi there WSC. Yesterday I was thinking to myself that I'd like to have a go at writing a bot, but since I've not done so before (and I'm a little rusty at Python, the language I'd probably use), I wanted a nice simple task that wouldn't throw me too many surprises. I had a bit of a look through the bot requests page but didn't see anything too suitable. Then, lo and behold, just an hour later I saw the poop patrol link on your guestbook. This seems like exactly the simple task I wanted: searching for known text isn't hard, and since it only has to write to a sandbox the potential for disaster is low. So I think I'll give it a go (assuming you're still looking for such a tool?).

It might take me a while to actually start coding as I'm quite busy this couple of weeks and there'll be a fair amount of material to familiarise myself with - I wouldn't be offended if you found a quicker solution in the meantime. Let me know what you think.

Cheers, Olaf Davis (talk) 23:12, 6 January 2010 (UTC)
 * Thanks, yes I would be delighted to have someone write such a routine for me. I suspect that poop itself may be rarer in our articles now that the various bots screen it out, but I went through Posses again recently and the same principle applies. If I had a sandbox with a list of articles where a particular word is used legitimately, and I knew that simply copying an article where the word was legit would mean that article didn't turn up next time I looked for articles containing that word it would make patrolling them far more practical. Pubic is probably a more pressing priority than poop as the one time I tried to patrol every article containing that word I found an embarrassing number of "pubic transport" type mistakes - in fact I've just gone through a fifth of the 710 articles that contained Pubic and found two worth changing, including a pubic school. So it is worth checking for that sort of an error but frankly without a bot to exclude 700 articles that use the word correctly it is way too time consuming to keep doing manually. I also have an idea I floated at Wikipedia talk:Autoreviewer - though I was hoping to get a bit of feedback on it before requesting a bot be written.  Ϣere  Spiel  Chequers  23:55, 6 January 2010 (UTC)
 * I've done some more thinking about this, and I reckon I've taught myself most of the necessary wikipedia-related python. I've started noting down a few thoughts here - let me know if I've missed anything. Olaf Davis (talk) 18:00, 8 January 2010 (UTC)

Progress report
Hi Chequers. Hope the meetup went well this weekend. I'm still looking forward to the next one, despite having realised it's on Valentine's Day. Who needs a date when you've got Wikipedia, eh?

After initially making fast progress I've put the bot project on hold for a bit following some developments in my real-life work. Still, it has an account now and is close to fully functional: its first two contributions were done manually, but the third and fourth were entirely automated based on the content of User:Botlaf/Job requests. Hopefully its output looks reasonable? (I modified it to add bullet points before running the second job request.) The only input I gave it besides that was telling the script to run on my machine: I need to set it to run continuously on a server and check the request page every, say, couple of minutes. Apart from that I think I just need to address the near-infinite task of adding error-handling, and have it remove processed requests. I'll let you know when I get time to make more progress, but feel free to chip in with comments.

On a more important note: you won't have to retroactively withdraw your !vote at my RfA, since I've taken the appropriate steps. Olaf Davis (talk) 23:22, 19 January 2010 (UTC)
 * I've tabled a few requests, no need for it to run continually - I figure weekly would be quite a challenge to keep up with. But would it be possible to have a version run in user space?  Ϣere Spiel  Chequers  21:20, 6 March 2010 (UTC)
 * Once a week sounds good. A userspace version is eminently doable as well: I can add another parameter to the job request template, something like space=user/main/both. I'm really kicking myself about not making a backup of the source code before my machine broke - but cautiously optimistic about being able to get hold of the necessary component tomorrow. Once I do I'll set him to work on your real-life jobs, so watch this space! Olaf Davis (talk) 16:24, 7 March 2010 (UTC)
 * Great. Would it be ok to split the results into two or more results pages? It may all be the same to the bot but some of these are usually typos, some vandalism and once we start cruising userspace for badwords we are liable to find a lot of attack pages. I also need to track down a punk music fan who can tell which uses of the c word in mainspace are legit.  Ϣere Spiel  Chequers  16:33, 7 March 2010 (UTC)
 * Yep, no problem. When things are back up and running I'll make a directory of subpages on the main user page so we don't lose anything, but given that the results should go wherever's most useful. Can't say I'd know who to ask about punk, but I'm sure we can find someone to make our day... Olaf Davis (talk) 16:56, 7 March 2010 (UTC)

Botlaf rides!
He's finally online again, and has been processing the tasks you set up. Everything seems to be going pretty smoothly as far as I can see. A few comments: I've made User:Botlaf/Improvements, so if you have any suggestions feel free to list them. Olaf Davis (talk) 21:41, 16 March 2010 (UTC)
 * I've set 50 pages per job as a current maximum, partly so as not to overwhelm the user who's processing them and partly because the bot takes about a second per page to process them, so doing 500 pages on each of ten search terms would take a while. I can change this limit or maybe add an option for the user to control it.
 * He currently ignores capitalisation, but does not ignore punctuation: so "Insane Clown Posse's" is not deemed to contain "posses". This can be changed if we want.
 * Safe phrases only discount a page if the search phrase appears only in safe phrases. So, poop deck is flagged because it contains "poop cabin" as well as multiple instances of "poop deck". This is probably desirable at least for vandalism-prone words; for typo-prone words it may be less necessary. Also, at present the fact that "poop deck" is a safe phrase does not automatically make poop deck a safe page.
 * No support as yet for user-space searches, but it's easily doable.
 * Brilliant, but I was wondering whether the safe word process has a limit, as it doesn't seem to pick up that Sex Pistols song in the bollocks query. Is there a limit in the safe word process on bytes or words?  Ϣere Spiel  Chequers  00:44, 17 March 2010 (UTC)
 * You mean because Never Mind the Bollocks, Here's the Sex Pistols appears on the 'danger' list? I think that's because the lede refers to "the coarse slang word (in British English) 'bollocks'", which counts as a separate hit from the many instances of "never mind the bollocks". Safe phrases can be essentially as long as you like. Olaf Davis (talk) 08:51, 17 March 2010 (UTC)
 * Hmm, although I notice submission gets flagged even though the only incidence of the word is a safe one... I'll have to look into that tonight. Olaf Davis (talk) 08:53, 17 March 2010 (UTC)
 * Thanks, happy for userspace to wait its turn - plenty to do already. As for the other bits, punctuation would help as I'd like to have "dog's bollocks" on the safe list. But I could just put "dog s bollocks" or "dogs bollocks" depending on whether Botlaf blanks or strips out punctuation. Cheers  Ϣere  Spiel  Chequers  16:50, 17 March 2010 (UTC)
 * Any character except [\^$.|?*+( or ) is treated just like a letter: so putting "dog's bollocks" as a safe phrase is fine, but it will not cause "dogs bollocks" to be viewed as a safe phrase; likewise "bollock's" wouldn't be picked up as 'dangerous' at all. The reason those eleven characters are special is that they mess up the regular expressions. I can fix that without much trouble when I get some free time, but things have been conspiring to keep my evenings and weekends pretty full since I fixed my PC. Priority is also working out why submission is being flagged... Olaf Davis (talk) 13:25, 19 March 2010 (UTC)
 * Thanks Olaf. Whether you've fixed that or not it has already let me fix quite a few problems, so if you have time to rerun even with the existing code I'd be interested to see what happens - I think there are also some requests not yet run.  Ϣere Spiel  Chequers  13:45, 19 March 2010 (UTC)
 * Another run started just now. I'm off to bed, but let me know if the new run works smoothly. I've addressed your two points on the improvements page, but I'm still stumped as to why submission is coming up. Olaf Davis (talk) 23:44, 19 March 2010 (UTC)
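
The matching rules described in this thread (capitalisation ignored, punctuation kept, a page cleared only when every hit falls inside a safe phrase, and the eleven regex metacharacters needing special care) can be sketched roughly as below. This is not Botlaf's actual code, which is not shown here: the function names `unsafe_hits` and `is_flagged` are hypothetical, and plain substring matching is assumed. `re.escape` is one way to make the characters [ \ ^ $ . | ? * + ( ) match literally.

```python
import re


def unsafe_hits(text, term, safe_phrases):
    """Return start offsets of `term` that are not covered by a safe phrase.

    Hypothetical sketch: matching ignores case, keeps punctuation as-is,
    and treats regex metacharacters literally via re.escape.
    """
    remaining = text
    # Blank out safe phrases first, longest first so longer phrases win.
    # Replacing with spaces of equal length keeps offsets unchanged.
    for phrase in sorted(safe_phrases, key=len, reverse=True):
        remaining = re.sub(
            re.escape(phrase),
            lambda m: " " * len(m.group()),
            remaining,
            flags=re.IGNORECASE,
        )
    # Whatever still matches lies outside every safe phrase.
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    return [m.start() for m in pattern.finditer(remaining)]


def is_flagged(text, term, safe_phrases):
    """A page is flagged if at least one hit lies outside all safe phrases."""
    return bool(unsafe_hits(text, term, safe_phrases))


if __name__ == "__main__":
    page = "The poop deck sits above the poop cabin."
    print(is_flagged(page, "poop", ["poop deck"]))
    print(is_flagged("Never Mind the Bollocks", "bollocks",
                     ["never mind the bollocks"]))
```

Under these assumptions a page mentioning both "poop deck" and "poop cabin" stays flagged, while a page whose only hit sits inside a safe phrase is cleared, whatever its capitalisation.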

They should have sent a poet faster programmer...
Since the login script is still playing up I've given Botlaf a manual mode. He's normally 'fire and forget' but now I'm having to copy and paste results in by hand, so they'll probably come in a trickle over the next day or so. I may also send a few to the wrong results page by mistake, so be warned! Olaf Davis (talk) 22:32, 13 June 2010 (UTC)


 * Actually, that was less painful than I expected: I think it's all done. By the way, would it be helpful to add a feature to automatically collate the stats under the right headers as you've been doing manually on the results pages? I realise I still have some bug fixes to attend to which obviously take higher priority, but if you would like me to do that after I get time for those just say the word. Olaf Davis (talk) 22:52, 13 June 2010 (UTC)

finding problems in BLPs
First step is to find a count of BLPs with the key words "Alleged", "allegation", "reputed", "rumo(u)red", and "accused", to find non-proven claims. And particularly "acquitted" as indicating the allegation did not result in a conviction. Next check for BLPs where the subject is called "communist", "fascist", "neo-nazi", "anarchist", and "extremist" as indicating potential problems where sourcing ought to be rock solid. Third area is the one where "homophobia" and the like get bandied about without rock solid reliable sources. Any more likely indicator words to add? Thanks. Collect (talk) 18:47, 21 June 2010 (UTC)
 * Have a look at user:Botlaf/Abuse. I'll have a check of your suggestions and see which ones would work with the bot as it is.  Ϣere Spiel  Chequers  18:51, 21 June 2010 (UTC)
 * Will the bot allow specifying the category of BLP so we do not get too many false positives? This is not a matter of looking for "forbidden words" but one of finding the BLPs which are most likely to have substantial problems. Collect (talk) 20:06, 21 June 2010 (UTC)
 * I've posted a request on User:Botlaf/Improvements - though my view is that this sort of stuff can be all over the pedia - and some of the nastiest stuff I've come across has been in articles that are not BLPs. BTW you might want to check out the slides I've prepared for Wikimania for poop patrol.  Ϣere Spiel  Chequers  22:00, 21 June 2010 (UTC)
 * I've had a quick look at some of them and I don't think alleged or reputed would work on the current version of Botlaf. However "punched him" should. There were only 236 occurrences in mainspace, and I've just removed several as uncited.  Ϣere Spiel  Chequers  15:21, 22 June 2010 (UTC)

Rumours of my retirement may not be that exaggerated after all
Hi there WereSpiel. Thanks for fielding that user's question on my talk page. Although I thought this break would be relatively short when I embarked on it, as you've no doubt noticed it's become fairly long. It may even be permanent: I can't say for sure now. You know how it goes, I'm sure.

I'm afraid my work on Botlaf's code has been downgraded from 'sporadic' to 'non-existent'. I'll still keep up weekly runs of his current version if you like, and if you do find someone else to work on it I'd happily help by explaining the code if necessary. I'm sorry to abandon you on this.

I've very much enjoyed working with you here and there on the 'pedia, and meeting you in meatspace. If I do end up as an unfortunate statistic, best of luck for the future and always remember to keep things in perspective. Cheers, Olaf Davis (talk) 23:40, 23 October 2010 (UTC)
 * Hi Olaf, nice to hear from you, hope it is something good in real life that is dragging you away rather than something deterring you from here. Weekly runs of Botlaf would be great if you'd be happy to do so - the existing code works well for many useful searches. I may look into Python myself with a view to migrating Botlaf into a Bot I can run, and maybe even doing those tweaks. Presumably if I set up a bot account I could just run the code from either a Linux machine or a Windows netbook?  Ϣere Spiel  Chequers  00:19, 24 October 2010 (UTC)
 * I've filed a request at Bot_requests as it's obviously safer to have a Pythonista run it than me. But if you can continue in the meantime it would be appreciated.  Ϣere Spiel  Chequers  16:48, 28 October 2010 (UTC)

Bot
I have looked over the code, and at a glance (aka I haven't tested it yet, and that's usually when issues are found) it all looks good. I should get around to testing it and (if it all works) deployment. -- DQ  (t)   (e)  18:51, 18 February 2011 (UTC)
 * Thanks DeltaQuad, can you tell me what the new bot will be called? As I may have some new queries to add.  Ϣere Spiel  Chequers  19:02, 18 February 2011 (UTC)
 * It will just be added onto User:DeltaQuadBot. I will get to some work after dinner and hopefully all up later tomorrow. -- DQ  (t)   (e)  01:08, 21 February 2011 (UTC)
 * ✅ First bot run complete. Will be set on a weekly cycle soon. Please let me know you have seen this message since I see you can get buried here. :P -- DQ  (t)   (e)  00:23, 1 March 2011 (UTC)
 * Hi, thanks, I really appreciate that. I've fixed a few dozen of the articles on that list, but it looks like Olaf didn't update the code after he made some of his fixes at User:Botlaf/Improvements, so some of the searches have a very high proportion of records that should have been filtered out. I think one issue that Olaf had was that his bot was doing a case-sensitive search against the safe phrases; another issue, which he may never have fixed, was that the bot was thrown by square brackets and was unable to exclude articles where the safe phrase was in an internal link.  Ϣere Spiel  Chequers  21:08, 2 March 2011 (UTC)
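
One possible way round the two issues mentioned here (case-sensitive safe phrases, and safe phrases hidden inside internal links) is to reduce each internal link to its visible text and lower-case everything before comparing. The sketch below is only an illustration under those assumptions; it is not the code that either bot ran, the names `strip_wikilinks` and `is_flagged` are hypothetical, and only simple links of the form [[target]] or [[target|label]] are handled.

```python
import re

# Matches [[target]] or [[target|label]]. Nested markup, templates and
# file links are deliberately not handled in this sketch.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def strip_wikilinks(wikitext):
    """Replace each internal link with the text a reader would see."""
    def visible(match):
        label = match.group(2)
        # Use the label after the pipe when there is one, else the target.
        return label if label is not None else match.group(1)
    return WIKILINK.sub(visible, wikitext)


def is_flagged(wikitext, term, safe_phrases):
    """Flag a page if `term` survives after safe phrases are removed.

    Both the page text and the safe phrases are lower-cased, so the
    comparison ignores case (an assumption of this sketch).
    """
    text = strip_wikilinks(wikitext).lower()
    # Remove longer safe phrases first so they take precedence.
    for phrase in sorted(safe_phrases, key=len, reverse=True):
        text = text.replace(phrase.lower(), " ")
    return term.lower() in text


if __name__ == "__main__":
    print(strip_wikilinks("The [[poop deck]] and the [[Mast (sailing)|mast]]"))
    print(is_flagged("The [[Poop deck|Poop Deck]] is aft.", "poop", ["poop deck"]))
```

Stripping the brackets before matching means the search phrases themselves never need to contain square brackets, which sidesteps the regular-expression problem rather than solving it.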

Talkback
DeltaQuad.alt It's the very first section on the page. -- DQ on the road   (ʞlɐʇ)  15:23, 28 September 2012 (UTC)

Spell checking
I don't know if you still like spell checking, but if you do, here are three that require human input: "a only" (can be either 'only' or 'an only'), and "a another" and "a other", which are the same. Unfortunately the Wikipedia search finds many false positives, but you can use the search to find ones that require correction. Regards, Sun Creator (talk) 22:58, 20 October 2012 (UTC)
 * Thanks, yes I've had a look through those three and they all have the right mix of positives and false positives to go into poop patrol rather than say AWB.  Ϣere Spiel  Chequers  11:03, 22 October 2012 (UTC)