Wikipedia:Bots/Requests for approval/Jakebot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.

Jakebot
Operator:

Time filed: 03:00, Sunday December 15, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Java

Source code available: Posted at http://pastebin.com/miRVsakD

Function overview: Jakebot searches Special:NewPages for unreferenced articles and adds the Unreferenced tag.

Links to relevant discussions (where appropriate): I think that User:MenoBot II is sufficiently close to the task of this bot that it can simply be approved by BAG.

Edit period(s): At least a few hours per day.

Estimated number of pages affected: Typically a few edits per hour, maybe.

Exclusion compliant (Yes/No): Yes.

Already has a bot flag (Yes/No):

Function details: Jakebot searches Special:NewPages to get a list of the 500 newest articles (using the  method). It then scans the markup of each new page (using the   method). If the article contains none of the following (as determined by the   method), it is tagged as unreferenced:
 * A Reflist or <references/> tag and <ref> tags
 * A references section
 * One or more external links
 * A further reading section
Additionally, an article is not tagged if it:
 * Contains "disambiguation"
 * Is tagged for speedy deletion

Discussion
While there was some objection on the Village Pump a while ago, I think I've addressed the concerns. — Preceding unsigned comment added by King jakob c 2 (talk • contribs) 14:04, 15 December 2013‎
 * Have you seen Creating a bot? There doesn't seem to be any brake on how rapidly new pages will get tagged - why? Josh Parris 07:43, 15 December 2013 (UTC)
 * Although it shouldn't be editing more than once every few minutes (like I said), I've added in a throttle to it so that it can't make more than 10 edits in 30 seconds. Please bear with me here, I'm not very familiar with bots.--Jakob (Scream about the things I've broken) 12:44, 15 December 2013 (UTC)
 * You'll find your code a lot less fragile if you use one of the frameworks I linked to.
 * Special:NewPages reminds us "Don't bite the newcomers: cleanup tagging within minutes of creation can discourage new users." Approximately how long could an article exist before being tagged by your bot? Josh Parris 08:34, 17 December 2013 (UTC)
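The throttle mentioned above (no more than 10 edits in 30 seconds) could be sketched as a sliding-window limiter. This is only an illustration under stated assumptions: the class name `EditThrottle` and the method `tryAcquire` are hypothetical and are not taken from the bot's posted source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window limiter: at most maxEdits edits in any windowMillis span. (Hypothetical helper.) */
final class EditThrottle {
    private final int maxEdits;
    private final long windowMillis;
    // Timestamps of the edits still inside the window, oldest first.
    private final Deque<Long> recentEdits = new ArrayDeque<>();

    EditThrottle(int maxEdits, long windowMillis) {
        this.maxEdits = maxEdits;
        this.windowMillis = windowMillis;
    }

    /** Returns true and records the edit if one is allowed at time nowMillis. */
    synchronized boolean tryAcquire(long nowMillis) {
        // Forget edits that have fallen out of the window.
        while (!recentEdits.isEmpty() && nowMillis - recentEdits.peekFirst() >= windowMillis) {
            recentEdits.removeFirst();
        }
        if (recentEdits.size() >= maxEdits) {
            return false; // window is full; the caller should wait and retry
        }
        recentEdits.addLast(nowMillis);
        return true;
    }
}
```

A caller would construct `new EditThrottle(10, 30_000L)` and pass `System.currentTimeMillis()` before each edit. A throttle of this kind limits the edit rate only; it does not address how soon after creation a page is tagged.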

I think the code is nowhere near ready to be run on Wikipedia: what if instead of ==References==, an article contains == References == or ==Notes==? What if instead of simple <ref> it has <ref name="foo">? What if an article title contains an ampersand? What if NewPages HTML changes slightly? The whole screen-scraping is horrible, horrible, horrible. Max Semenik (talk) 14:14, 18 December 2013 (UTC)
 * Pardon if I'm blind, but where does it say he's screen-scraping? Was that in VP? — HELL KNOWZ  ▎TALK 14:40, 18 December 2013 (UTC)
 * Look at the source code. Legoktm (talk) 16:09, 18 December 2013 (UTC)
 * Oh right, I didn't even think of that... Surely the API can do this? — HELL KNOWZ  ▎TALK 17:51, 18 December 2013 (UTC)
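The variations raised above (spacing inside headings, attributes on ref tags, ampersands in titles) can be handled with tolerant patterns and standard URL encoding. The following is a rough sketch, not the bot's actual code: the class `WikitextChecks` and its method names are hypothetical, and the list of heading names is illustrative and incomplete.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical helpers for tolerant wikitext checks. */
final class WikitextChecks {
    // Any heading level, optional spaces around the name, case-insensitive.
    // The heading names are an illustrative, incomplete list.
    private static final Pattern REF_HEADING = Pattern.compile(
        "^=+\\s*(references|notes|sources|citations|further reading)\\s*=+\\s*$",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Matches <ref>, <ref name="foo"> and <ref name="foo" />, but not <references/>.
    private static final Pattern REF_TAG = Pattern.compile(
        "<ref(\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    static boolean hasReferenceHeading(String wikitext) {
        return REF_HEADING.matcher(wikitext).find();
    }

    static boolean hasRefTag(String wikitext) {
        return REF_TAG.matcher(wikitext).find();
    }

    /** Encodes a page title for use as a query-string value, so "&" cannot split the parameter. */
    static String encodeTitle(String title) {
        return URLEncoder.encode(title, StandardCharsets.UTF_8);
    }
}
```

Pattern matching on raw wikitext remains fragile whichever patterns are chosen, which is part of why the API is suggested in this discussion.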

Right, this really needs to use the API for everything. JWBF and Wiki.java both seem to be decently maintained frameworks. Other comments (some of which would also be addressed by using a framework): -- Mr.Z-man 02:57, 20 December 2013 (UTC)
 * As written, this will get the most recent 500 new pages, which means it could be tagging seconds after creation. There needs to be some sort of delay.
 * It doesn't record which pages it has already checked or edited, which means if someone removes the tag, it could edit war.
 * Editing using cmd /c start is really hacky and won't work because edit requests need to be sent by POST (you also need an edit token).
 * cleanarticle appears to be the wikimarkup, not the title
 * You don't appear to actually check for  like you say
 * External links can be made without the http prefix. [//wikipedia.org //wikipedia.org] is a valid link. The easiest way to check for links is the API -
 * - bots alone means that all bots are allowed. nobots forbids all bots. And it can get more complicated with allow/deny lists.
 * - Not all dab pages will say this in the wikitext. The best way to do this is to check the list of templates (using the API ) for Template:Dmbox.
 * If the article is in Category:Living people it should use BLP unsourced
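Two of the points above, the delay before tagging and the record of pages already handled, could be combined in one small gate. This is a minimal sketch under assumptions: `TaggingGate` and `shouldConsider` are hypothetical names, the minimum age is whatever the operator chooses, and the in-memory set is lost on restart, so a real bot would need to persist it or check the page history instead.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical gate: skips pages that are too new or that were already handled. */
final class TaggingGate {
    private final Duration minAge;
    // Titles already offered once; in-memory only, so it does not survive a restart.
    private final Set<String> alreadyHandled = new HashSet<>();

    TaggingGate(Duration minAge) {
        this.minAge = minAge;
    }

    /** True only the first time a sufficiently old page is offered. */
    boolean shouldConsider(String title, Instant created, Instant now) {
        if (Duration.between(created, now).compareTo(minAge) < 0) {
            return false; // too new; it can be offered again on a later run
        }
        // Set.add returns false when the title was already present.
        return alreadyHandled.add(title);
    }
}
```

Because a page is remembered once it has been considered, a tag that someone later removes would not be re-added by the same run of the bot.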

you've become very quiet. Josh Parris 03:59, 20 December 2013 (UTC)
 * Hi. I apologize for not responding sooner. I'll be able to examine the concerns in more detail starting tomorrow. Thanks. --Jakob (talk) 12:46, 20 December 2013 (UTC)
 * Progress made:
 * Use the API to get a list of new pages. Will do, but if someone knows how to only include pages in namespace 0 in the query, that would be good.
 * Make the bot check the 50 to 550 newest pages instead of the 1 to 500 newest pages. ✅
 * Improve the isEligibleForTagging method with the suggestions of MaxSem and Mr.Z-man. ✅
 * {{xt|If the article is in Category:Living people it should use BLP unsourced.}} ✅
 * Curb edit warring by not tagging if the bot has already edited the page. I thought bots were allowed to edit war {{smiley}} Will look into this.
 * {{xt|Editing using cmd /c start  is really hacky and won't work because edit requests need to be sent by POST (you also need an edit token).}}. Will try to fix this, if I can.
 * {{fixed}} the &title=" + cleanarticle+" typo.
 * I'll do more work on this tomorrow. I'll also post an updated version to Pastebin then.--Jakob (talk) 00:49, 21 December 2013 (UTC)

Does the bot look for different kinds of reference templates, such as {{tl|sfn}}? GoingBatty (talk) 14:12, 21 December 2013 (UTC)
 * It now looks for {{tl|sfn}}. Revised code now posted at http://pastebin.com/KE48mJFV. --Jakob (talk) 15:22, 21 December 2013 (UTC)
 * Aren't Java string functions case sensitive, as in pretty much most languages? — HELL KNOWZ  ▎TALK 16:42, 21 December 2013 (UTC)
 * {{ping|Hellknowz}} Yes, but the string is converted to lowercase. --Jakob (talk) 16:46, 21 December 2013 (UTC)
 * I guess it's unlikely in articles and there are no false positives here, but that will also match everything uppercase or variously mixed case regardless if valid syntax like {{tl|NoBOTS}}. — HELL KNOWZ  ▎TALK
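The case issue above can be avoided by matching template names without lowercasing the whole page. The sketch below assumes the usual English Wikipedia title rules (only the first letter of a name ignores case, and spaces and underscores are interchangeable); `TemplateMatcher` and `forTemplate` are hypothetical names, and redirects or aliases of a template are not resolved here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that builds a pattern for one template name. */
final class TemplateMatcher {
    /** Matches {{name}} or {{name|...}}; only the first letter of the name ignores case. */
    static Pattern forTemplate(String name) {
        // (?i:...) makes just the first character case-insensitive.
        StringBuilder body = new StringBuilder("(?i:" + Pattern.quote(name.substring(0, 1)) + ")");
        String[] words = name.substring(1).split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                body.append("[ _]+"); // a space in the name may appear as an underscore
            }
            body.append(Pattern.quote(words[i]));
        }
        // The name must be followed by a parameter separator or the closing braces.
        return Pattern.compile("\\{\\{\\s*" + body + "\\s*(\\||\\}\\})");
    }
}
```

With this approach `forTemplate("nobots")` would accept `{{nobots}}` and `{{Nobots}}` while rejecting `{{NoBOTS}}`.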


 * You don't check if article has {{tl|Unreferenced}} aliases, like {{tl|No references}}, {{tl|Unref}}, etc. Also they can change, so you should ideally be regularly updating the list (automatically, if possible).
 * You don't check if article has {{tl|BLP unsourced}} (or its aliases)
 * What if the page already has {{tl|refimprove}}? or even {{tl|More footnotes}} or {{tl|One source}} and alike?
 * You are using {{tl|BLP Unsourced}} -- that doesn't exist
 * You don't include all the whitespace cases for section title syntax, like "==Notes ==". And what if it's a level 1 heading?
 * What about " " or " "? What about " " or " "?
 * Wouldn't your "appendtext" place the template at the very bottom instead of where maintenance tags are supposed to go?
 * What about " " (with a space) or "  ", in fact, what about the category being transcluded or otherwise not directly on the page? What about it being in a comment and so the page not actually in it (thus BLP unsourced is not applicable)? This really should be category API check.
 * I can think of many other viable alternatives to "notes", "references" and "further reading".
 * Also I see you are still screen-scraping. Here's api link |timestamp&rcnamespace=0&rclimit=500 (mw:API:Recentchanges) — HELL KNOWZ  ▎TALK 17:10, 21 December 2013 (UTC)
 * 1){{fixed}}. 2){{fixed}}. 3)If there are sources (even if there aren't enough for a decent article), the article shouldn't be tagged as unreferenced anyway. 4){{fixed}}. 5){{fixed}}. 6){{fixed}}. 7){{fixed}}. 8)Now uses an API query to search for the category. 9)"Citations" and "Sources" have been accounted for. Can't think of any others off the top of my head (and I doubt there would be many that don't use one of those). 10) I will fix that. --Jakob (talk) 20:25, 21 December 2013 (UTC)
 * 3) Well, yes, but you don't check for them, so it will be tagged by the bot as unreferenced as well. 9) "Bibliography", "Literature", "Footnotes", "Notes", etc. And don't forget all the variations thereof with possibly unrelated words, like "References and citations", "Works cited", etc. Also there are more ways than {{tl|reflist}}, like {{tl|Refbegin}}, {{tl|Refend}}, {{tl|Notelist}}. — HELL KNOWZ  ▎TALK 21:15, 21 December 2013 (UTC)
 * {{ping|Hellknowz}} 3)If there is a source (why else would there be a refimprove tag, etc?), then it isn't unreferenced, which the bot should be able to pick up. 9) {{tl|Refbegin}} and {{tl|Refend}} are accounted for. I already have "notes". The vast majority of articles don't exclusively use sections like "works cited" etc. for the references. --Jakob (talk) 21:55, 21 December 2013 (UTC)
 * They don't use them exclusively, but they can use them. If you don't account for them then that's false positives. What if the article has refimprove and the like, but your bot doesn't find any? What if that template is placed for other reasons, because you are assuming there will be syntax you are looking for, which is a big assumption for newbie articles. — HELL KNOWZ  ▎TALK 22:02, 21 December 2013 (UTC)
 * {{ping|Hellknowz}} I think you misunderstand. In order for there to be a false positive, an article would 1)Have to use entirely offline references 2)List them without using footnotes and 3)Not list them under any one of the five or ten most common reference section header names. This, I think is very unlikely. False positives are going to be rare. They could be reported at, say, User:Jakebot/False positives and looked over by a human. False positives would be inevitable anyway - someone could list references without making a separate section, or list them under a section like == rejhgierh   == . --Jakob (talk) 15:41, 25 December 2013 (UTC)
 * {{ping|Mr.Z-man}} All concerns have been addressed. Latest code at http://pastebin.com/KKDWUvGm. --Jakob (talk) 17:30, 30 December 2013 (UTC)
 * I still see glaring issues, most of them mentioned. Like "referneces" still being misspelled, bots not used, extra whitespace not considered, many potential reference section names not included with no way to expand the list, or any other syntax that could happen? Some of these we can't have as potential errors. Others can prove to be very unlikely. With that I can give you a trial of a large number of pages without actually editing them -- just a report of which pages were or weren't skipped. — HELL KNOWZ  ▎TALK 17:56, 30 December 2013 (UTC)
 * In addition to what Hellknowz mentioned:
 * It still relies on screen scraping to get the wikitext
 * Most of the API queries do not specify a format parameter, which means you're going to get data that looks like {{code|1= &lt;rc type=&quot;new&quot; ns=&quot;0&quot; title=&quot;Muhammad Akhtar&quot; timestamp=&quot;2013-12-30T18:40:17Z&quot; /&gt; }} (the HTML representation of the XML) instead of actual XML or JSON.
 * Note that with the actual XML or JSON, you won't be able to just iterate over the lines because there are no line breaks (they're just added to the HTML versions for readability), you should use an actual JSON or XML parser.
 * parseFromNewPages appears to still be based on screen scraping
 * There still appears to be no protection against tagging articles too soon after creation
 * The edit function still won't work because it doesn't use an edit token. And it's still just going to open a browser window, which means it won't do a POST request. See (there may be a simpler way, I'm not familiar enough with HTTP in Java)
 * noblpkeywords will not work, because you put the wikitext in lowercase, but use template names with upper-case Mr.Z-man 19:23, 30 December 2013 (UTC)
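The points above about the format parameter and using a real parser might look roughly like this in Java. It is a sketch only: `NewPagesQuery` is a hypothetical class, the query parameters follow the recent changes link given earlier in the discussion plus an explicit `format=xml`, and the elements wrapping each `rc` entry are assumed rather than quoted from this page. Fetching the URL, logging in, and the token-based POST needed for editing are not covered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for listing new mainspace pages through the API. */
final class NewPagesQuery {
    /** Builds the query URL; format=xml requests real XML instead of the HTML rendering. */
    static String buildUrl(String apiBase, int limit) {
        return apiBase + "?action=query&list=recentchanges&rctype=new"
            + "&rcnamespace=0&rcprop=title%7Ctimestamp&rclimit=" + limit + "&format=xml";
    }

    /** Returns {title, timestamp} pairs read from every rc element in the response. */
    static List<String[]> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // A DOM parser does not care whether the response contains line breaks.
            NodeList nodes = doc.getElementsByTagName("rc");
            List<String[]> pages = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element rc = (Element) nodes.item(i);
                pages.add(new String[] { rc.getAttribute("title"), rc.getAttribute("timestamp") });
            }
            return pages;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse API response", e);
        }
    }
}
```

The timestamp attribute returned here is also what a minimum-age check would need in order to avoid tagging pages too soon after creation.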


 * {{TakeNote}} This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT ⚡ 20:56, 31 December 2013 (UTC)
 * {{facepalm}} I was testing something and forgot to log back into my main account when I was done. --Jakob (talk) 20:58, 31 December 2013 (UTC)

{{tl|OperatorAssistanceNeeded}} you've become very quiet. Josh Parris 06:15, 7 January 2014 (UTC)
 * {{ping|Josh Parris}} Sorry, I've been kind of stuck here, see https://en.wikipedia.org/w/index.php?title=User_talk:Hellknowz&oldid=589322174#Jakebot. --Jakob (talk) 12:32, 7 January 2014 (UTC)

{{tl|OperatorAssistanceNeeded|D}} Any updates? Mr.Z-man 18:35, 23 January 2014 (UTC)
 * {{ping|Mr.Z-man}} I was told at VPT a while ago "Why don't you use a library/framework that will handle all the details of the HTTP protocol for you, including processing cookies correctly and such?". Any ideas on the best library/framework to use? Thanks, --Jakob (talk) 19:15, 23 January 2014 (UTC)
 * This was discussed above. See Creating a bot. JWBF seems more Java-esque (it's big with code divided in a dozen different packages) while Wiki.java looks a little simpler. MER-C is active and could probably give you some guidance with Wiki.java. I'm not sure who maintains JWBF, but both seem to be actively maintained. I've never actually used either one, so I can't offer much help beyond that. Mr.Z-man 16:43, 24 January 2014 (UTC)

{{OperatorAssistanceNeeded|D}} Any progress? — HELL KNOWZ  ▎TALK 13:28, 19 February 2014 (UTC)
 * {{ping|Hellknowz}} I have put together a brief report at User:Jakebot/Test report. --Jakob (talk) 16:20, 19 February 2014 (UTC)
 * Note that I am no longer considering the mere presence of external links as evidence that an article is referenced. --Jakob (talk) 16:21, 19 February 2014 (UTC)

I am doing the same thing on a fairly regular basis manually (through AWB), and fear that this task will be too difficult to get right 99.9% of the time for a bot (every bot is allowed a few mistakes :-) ). There are so many ways to add some reference or source to a page that checking for all of them with a bot seems impossible to me. I haven't checked your code, but do you e.g. exclude
 * Authority Control-VIAF
 * Bare external links
 * References through templates/infoboxes (e.g. some fight sports, some taxoboxes, I think chembox, ...)
 * Website in infobox
 * Attribution templates like DNB, CathEnc, 1911, ...

These are just the ones I think of right now, there are undoubtedly many other pitfalls. I would in general oppose using a bot for this. Fram (talk) 15:37, 20 February 2014 (UTC)
 * {{ping|Fram}} I don't believe the mere presence of external links should be considered a "reference". But I have added in code to check for the AuthorityControl template and most attribution templates. Maybe I can at least get approval for a trial to see if it works? --Jakob (talk) 16:13, 21 February 2014 (UTC)
 * {{ping|User:Jakec}} While I think that external links are sufficient to not tag something as unsourced, I meant to say "bare URLs" in the body of the text, not added as external links. The absence of "ref" tags around those URLs shouldn't disqualify them. Fram (talk) 14:06, 24 February 2014 (UTC)
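Whichever view of external links prevails, detecting bare URLs in the body text is a separate step from deciding what they count for. The pattern below is a rough sketch with a hypothetical class name, `UrlDetector`; it also accepts protocol-relative links such as //wikipedia.org, as mentioned earlier in the discussion, and it makes no attempt to cover every URL form. The API's external links list, also suggested above, would be more reliable.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that looks for URLs anywhere in the wikitext. */
final class UrlDetector {
    // Optional scheme, then "//", a host containing a dot, and an optional path.
    private static final Pattern URL = Pattern.compile(
        "(?i)(?:\\b(?:https?|ftp):)?//[a-z0-9][a-z0-9.-]*\\.[a-z]{2,}[^\\s\\[\\]<>\"]*");

    /** True if the text contains something that looks like a URL, with or without ref tags. */
    static boolean containsUrl(String wikitext) {
        return URL.matcher(wikitext).find();
    }
}
```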

You posted User:Jakebot/Test_report, but every time I check it, half the items are wrong and most are redirects. Can you make a larger list, preferably pointing to the actual revisions the bot saw and whether it decided to tag them? — HELL KNOWZ  ▎TALK 17:29, 21 February 2014 (UTC)
 * I apologize that it took this long to respond. I don't have the time or the interest to continue working on the bot. However, it would be great if someone more technically competent than I would pick up where I left off, since this seems like a useful task. Thus, I withdraw this request. --Jakob (talk) (Please comment on my editor review.) 00:56, 17 April 2014 (UTC)
 * Marking as {{BotWithdrawn}}. Max Semenik (talk) 03:08, 22 April 2014 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.