User:Botlaf/Improvements

Ignore redirects
Hi, any chance that redirects could be ignored? Currently there are various redirects that contain typos, for example Small Soldiers doesn't actually contain the typo solider but it comes up on the list because there is a redirect from the typo. Obviously I don't want to add this as an exception because it is entirely possible that in future the article could genuinely have that typo. Thanks.  Ϣere Spiel  Chequers  17:55, 17 March 2010 (UTC)


 * ✅ Sure, makes sense. I've set him to ignore redirects and a quick test seems to work: let me know if any are getting through. Olaf Davis (talk) 23:31, 19 March 2010 (UTC)

Totals
Hi, looking at "Typo Patrol results for "poop" 69 pages containing the phrase were processed in total. 85 were deemed safe using safe phrases and 5 using safe page list. Maximum page number was reached so additional hits may exist." & "Typo Patrol results for "preform" 58 pages containing the phrase were processed in total. 8 were deemed safe using safe phrases and 9 using safe page list. Maximum page number was reached so additional hits may exist. Warning: search phrase is not contained in safe phrase "pre-form"". 69 - 85 is less than 50.... as is 58 - 8 - 9. I suspect the safe phrases figure is the number of matches of which there could be several on a page, and have a suspicion that Safe pages are filtered out before the counter is started - so it is saying "9 safepages were ignored then a further 58 pages were screened with 8 hits against the safe phrases and 50 others reported."  Ϣere Spiel  Chequers  09:44, 18 March 2010 (UTC)


 * ✅ Yes, I'd noticed that discrepancy too but couldn't trace it - I now realise that it's pretty much what you describe. Not bad considering you didn't have the source code and I did! Olaf Davis (talk) 23:24, 19 March 2010 (UTC)


 * Actually it's still the case that the limit (currently 50) is applied to the count of (total - those discarded due to safe phrases) rather than (total) or (total - discarded due to phrases - discarded due to pages) which is the number actually output. Should perhaps change that just for least surprise if nothing else. Olaf Davis (talk) 23:49, 19 March 2010 (UTC)


 * Ok, threshold is now applied to the number of results actually returned. Olaf Davis (talk) 14:17, 21 March 2010 (UTC)
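
The counting discussed in this thread can be sketched as follows. This is a minimal illustration, not Botlaf's actual source: the function name `screen_pages`, the `(title, text)` tuples and the `stats` keys are all hypothetical. It counts each processed page exactly once (safe page, safe phrase, or reported) and applies the limit to the number of results actually returned, so the totals always add up.

```python
def screen_pages(pages, search_term, safe_phrases, safe_pages, limit=50):
    """Screen (title, text) pairs and return (reported_titles, stats).

    All names here are hypothetical; this only illustrates the counting.
    """
    reported = []
    stats = {"processed": 0, "safe_by_page": 0,
             "safe_by_phrase": 0, "limit_reached": False}
    for title, text in pages:
        # The limit is applied to the number of results actually reported.
        if len(reported) >= limit:
            stats["limit_reached"] = True
            break
        stats["processed"] += 1
        if title in safe_pages:
            stats["safe_by_page"] += 1
            continue
        # Remove every safe phrase, then see whether the term still occurs.
        remaining = text
        for phrase in safe_phrases:
            remaining = remaining.replace(phrase, "")
        if search_term not in remaining:
            # Counted once per page, however many safe phrases matched.
            stats["safe_by_phrase"] += 1
            continue
        reported.append(title)
    return reported, stats
```

With this bookkeeping, `processed` always equals `safe_by_page + safe_by_phrase + len(reported)`, which is the relationship the report headers above were expected to satisfy.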

Submission
✅

Why does submission show up in the bollocks search despite containing only the phrase "Never Mind the Bollocks"? Olaf Davis (talk) 23:41, 19 March 2010 (UTC)
 * The only thing I can think of is that the safe phrases might be ignoring stuff in  or  Ϣere  Spiel  Chequers  08:25, 20 March 2010 (UTC)
 * Got it! It had to do with capitalisation. I've fixed it by making safe phrases case-insensitive - that means you can't for example set "Never Mind the Bollocks" as a safe phrase without "Never mind thE BOLLocks" also being safe. I assume that's fine but if not let me know and I'll come up with an alternative. Olaf Davis (talk) 15:56, 22 July 2010 (UTC)
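
A minimal sketch of case-insensitive safe phrase matching, as described in the fix above. The function names are hypothetical and this is not taken from Botlaf's source; it also assumes the search term itself is matched case-insensitively.

```python
import re


def strip_safe_phrases(text, safe_phrases):
    """Remove every occurrence of each safe phrase, ignoring case."""
    for phrase in safe_phrases:
        # re.escape makes punctuation in the phrase match literally.
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return text


def contains_unsafe_hit(text, search_term, safe_phrases):
    """True if the term still occurs once safe phrases are removed."""
    remaining = strip_safe_phrases(text, safe_phrases)
    # Assumption: the search term is also compared case-insensitively.
    return search_term.lower() in remaining.lower()
```

The trade-off is the one noted above: "Never Mind the Bollocks" and "Never mind thE BOLLocks" are treated as the same safe phrase.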

Planing
Why does the search for "planing" stop after only 9 pages? Olaf Davis (talk) 00:00, 20 March 2010 (UTC)
 * I don't know. There are currently 439 matches in mainspace, and though most are ones that the safe phrases should filter out, there was at least 1 that I would have expected to be in the report. Looking at preform and crowed as well, is it possible that the maximum of fifty is being applied before the safe phrase filter? If so that would explain both of these.  Ϣere  Spiel  Chequers  08:17, 20 March 2010 (UTC)
 * ✅ Turns out that when it came across a redirect it was ignoring it and quitting the job, instead of taking the next page. Fixed now! Olaf Davis (talk) 17:53, 28 March 2010 (UTC)
 * Excellent. And the other problem was that I left a double space in a safe phrase.... planing, Discuss throw and Solider are all working well.  Ϣere Spiel  Chequers  16:18, 2 April 2010 (UTC)
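
The redirect bug described above amounts to the difference between skipping one item and leaving the loop. A minimal sketch, with a hypothetical function name and hypothetical dictionary keys (`title`, `is_redirect`) standing in for whatever page objects the bot really uses:

```python
def take_reportable(pages, limit):
    """Return up to `limit` titles, skipping redirects.

    `pages` is an iterable of dicts with hypothetical keys
    'title' and 'is_redirect'.
    """
    reportable = []
    for page in pages:
        if page["is_redirect"]:
            # Skip this hit and take the next page. A `break` or `return`
            # here would quit the whole job at the first redirect.
            continue
        reportable.append(page["title"])
        if len(reportable) >= limit:
            break
    return reportable
```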

Phrase in title
✅ Would it be useful to add an extra parameter which tells it whether to include pages which contain the search term in the title? I'm imagining you might want to do that for the typo words but not the vandalism one - the trough search for example produces lots of pages with 'trough' in the title. Olaf Davis (talk) 17:32, 28 March 2010 (UTC)
 * I'm not sure, hopefully for most of these it should be practical to get them into the safe page list once that's working. Both Faggot and trough included lots of safe pages in the 50, would you mind seeing if I mucked up the parameters or something?  Ϣere  Spiel  Chequers  17:58, 28 March 2010 (UTC)


 * Yeah, I suppose it's just a one-time setup cost getting them all onto the list and we can forget about it.
 * Re. the safe pages: whoops, that was me accidentally turning off the safe page checker. Re-running now. Olaf Davis (talk) 18:28, 28 March 2010 (UTC)

Completed requests
It might make sense to rephrase completed requests as "regular requests" once the bot settles down and runs them weekly. Incidentally, there are several where I've gone through the reports. Do you want me to remove the reports when I'm ready for a rerun, or shall we just get Botlaf to overwrite them weekly?  Ϣere Spiel  Chequers  17:58, 28 March 2010 (UTC)


 * Yes, that makes sense. As for removing them vs. automatic rewriting - either is fine really, depending on what feels easiest to you. Removing as you go might help to keep track of which pages have yet to be processed (especially if multiple people start working on the results) but if you want an auto-overwrite function I can add one. Olaf Davis (talk) 18:18, 28 March 2010 (UTC)

safe phrases
I think Forest of Bowland should have been screened out by the safe phrase trough of Bowland. The only things I can think of are the outstanding bug on "posse's" or that special characters like quotes or square brackets could be throwing the safe phrase screen, or that the safe phrase screen can only handle one occurrence per phrase per article.  Ϣere Spiel  Chequers  22:16, 28 March 2010 (UTC)


 * Thanks. The 'quotes and brackets' theory was a promising one, but: I couldn't really see why 'Trough of Bowland' was bolded in its first occurrence so I unbolded it, and it's still showing up on the search. Clearly some more investigation is called for... Olaf Davis (talk) 23:22, 28 March 2010 (UTC)


 * Thanks, George A. Strout House still shows up in Planing, despite planing Mill being a safe phrase. Could it be thrown by multiple occurrences in one line, or by capitalisation on the second occurrence? Tyrone, Pennsylvania is more extreme - I have no idea why that wasn't filtered out, except possibly that the safe phrase was capitalised on its second word - I've changed it to planing mill so let's see what happens.  Ϣere Spiel  Chequers  13:38, 2 April 2010 (UTC)
 * Any joy on this one? I've got some searches working by putting lots of extra pages into the safe pages list, but with trough, forth and some other searches we really need to be able to exclude the word or phrase when linked and/or piped.  Ϣere Spiel  Chequers  14:16, 23 June 2010 (UTC)
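
One possible approach to the linked and piped case, sketched under assumptions: flatten the wiki markup to its visible text before the safe phrase screen runs. The function name `flatten_wikitext` is hypothetical, this is not something Botlaf is known to do, and the regular expressions handle only simple `[[target]]` and `[[target|label]]` links plus bold/italic quote marks.

```python
import re

# [[target]] or [[target|label]]; the captured group is the visible text.
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Runs of two or more apostrophes are italic/bold markup in wikitext.
_QUOTES = re.compile(r"'{2,}")


def flatten_wikitext(text):
    """Replace simple wiki links with their visible text and drop bold/italic marks."""
    text = _LINK.sub(r"\1", text)
    return _QUOTES.sub("", text)
```

For example, `flatten_wikitext("the '''[[Trough of Bowland]]''' pass")` gives `"the Trough of Bowland pass"`, so a safe phrase such as "Trough of Bowland" would then match as plain text. For a piped link only the label survives, so a safe phrase that appears only in the link target would still be missed.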

Frequency
I was wondering if we could go live on a weekly basis? I think there are still some bugs to iron out, but several of the queries are running well.  Ϣere Spiel  Chequers  13:28, 16 April 2010 (UTC)

category
Would it be possible to exclude/include by category? I would like to have the ability to exclude articles in the category fictional characters, and User:Collect would like to be able to restrict the query to BLPs.  Ϣere Spiel  Chequers  21:49, 21 June 2010 (UTC)
 * Should be doable in principle, yes. I'll have a fiddle... Olaf Davis (talk) 22:00, 22 June 2010 (UTC)

10?
✅ The cutoff seems to have dropped from 50 to 10. Any chance of taking back to 50, or better raising it to 100?  Ϣere Spiel  Chequers  17:28, 27 June 2010 (UTC)
 * Fixed, and just raised to 100 for the current run. Note to self: the problem was due to the number parameter in site.search, which apparently now defaults to 10. Olaf Davis (talk) 08:32, 21 July 2010 (UTC)

Repeats
Some of the queries repeat examples; I'm assuming this means they check part of the database twice and other parts not at all. user:Botlaf/Results and user:Botlaf/Results both had examples this week.  Ϣere Spiel  Chequers  15:49, 13 October 2010 (UTC)

Solider
There are several articles such as Confederate States Army which appear in user:Botlaf/Results every week, but I can't work out why. It is almost as if the ignore redirects logic is run by every query except that one.  Ϣere Spiel  Chequers  15:49, 13 October 2010 (UTC)