Wikipedia:Bots/Requests for approval/StanfordLinkBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Denied.

StanfordLinkBot
Operator:

Time filed: 23:39, Wednesday December 3, 2014 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s):  Python

Source code available:  GitHub. The repository is currently empty. Source code will be added after bot approval.

Function overview: Insert links between Wikipedia pages based on statistical inference on human navigational traces.

Links to relevant discussions (where appropriate): Research:Improving_link_coverage is an ongoing research effort and describes the underlying idea which is used to predict links.

Edit period(s):  One time run to test efficacy of a specific method.

Estimated number of pages affected: 7000

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No):

Function details:

The job of this bot is to insert a link between a source and a target page, given the mention in the source page which should link to the target page. The input is in the form of a tab-separated file. To make this bot version-agnostic, it provides a best-effort service when searching for the mention in the source article. If the mention exists then the link is added (at the first mention), otherwise it is not. It does not support specifying a location of the mention (in terms of number of words preceding it) because that location is subject to change due to edits.
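A minimal sketch of the "link the first mention" step described above, not the bot's actual code (the repository was empty at filing time). The function name `link_first_mention` and the rule of skipping mentions that already sit inside an existing `[[...]]` link are assumptions made for illustration. Real wikitext also contains templates, references, headings and file captions, which this sketch does not handle.

```python
import re

# Matches existing wikilinks such as [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[[^\[\]]*\]\]")


def link_first_mention(wikitext, mention, target):
    """Wrap the first unlinked occurrence of `mention` in a link to `target`.

    Returns the text unchanged when no suitable mention is found
    (the best-effort behaviour: no mention, no edit).
    """
    # Character spans already covered by an existing link.
    linked = [m.span() for m in WIKILINK.finditer(wikitext)]
    # Whole-word, case-sensitive match of the mention.
    pattern = re.compile(r"(?<!\w)" + re.escape(mention) + r"(?!\w)")
    for m in pattern.finditer(wikitext):
        if any(start <= m.start() < end for start, end in linked):
            continue  # this occurrence is part of an existing link
        if mention == target:
            link = "[[" + mention + "]]"
        else:
            link = "[[" + target + "|" + mention + "]]"
        return wikitext[:m.start()] + link + wikitext[m.end():]
    return wikitext


text = "See [[Ground-level ozone]]. Ozone absorbs UV; ozone is a gas."
print(link_first_mention(text, "ozone", "Ozone"))
# See [[Ground-level ozone]]. Ozone absorbs UV; [[Ozone|ozone]] is a gas.
```

Because the match is case-sensitive, the capitalised "Ozone" is passed over in this example. Whether the first textual match is also the most appropriate place for the link is a separate editorial question.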

The link prediction algorithm was developed in a research project that is part of a collaboration between Stanford University and the Wikimedia Foundation. The project page can be found here. A paper describing the algorithm and results is under submission to the World Wide Web Conference; if you would like a confidential preprint, please get in touch with Bob West.

Link Prediction Method
We propose a novel approach to identifying missing links on Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia’s navigability. We leverage a data set of navigation paths collected through a Wikipedia-based human-computation game called The Wiki Game in which users must find a short path from a start to a target article by only clicking links encountered along the way. We harness human navigational traces to identify a set of candidates for missing links and then rank these candidates according to various metrics. We further validate our prediction by recruiting human raters from Amazon Mechanical Turk and setting up a human evaluation task that asks them to guess which links should exist in Wikipedia, based on the Linking Guidelines. Our evaluation (see above for how to obtain a preprint of the paper) shows that the links predicted by our method are of higher quality than alternative methods.

Discussion

 * This request specifies the bot account as the operator. A bot may not operate itself; please update the "Operator" field to indicate the account of the human running this bot. AnomieBOT ⚡ 23:48, 3 December 2014 (UTC)
 * A couple starter questions to get the ball rolling:
 * Do you happen to have a couple of real-life examples of what goes in and what comes out? Like, let's say you ran this bot over a few pages. Can you give an example of some of the changes it'd actually make?
 * Does the bot simply handle article links? Does it parse, understand, and/or exclude category links or other anomalies?
 * Is there a "proposed-changes" or "log-only" mode of any sort? E.g., instead of editing, could it write to a log in its user space saying how it would have edited?
 * Is there any reference of where this has already been used, or are we the first? It's not a make-or-break sort of thing, just something that'd help us if it already has a track record or something. :P
 * I'm not seeing a substantial edit count from the bot operator. Normally, this would suggest a severe lack of experience needed when it comes to running a bot in an active community with its own quirks, but if someone at the Foundation is working with you on this, could you please get them to chime in here? That'd help establish what all's going on and possibly provide some background information for us and other editors reviewing this request.
 * Cheers =) -- slakr \ talk / 03:20, 10 December 2014 (UTC)


 * Hi, I'm Bob West, one of the developers of the bot. Sorry for the delay in answering your questions, Slakr. Here are my replies in the order of your questions. Please let me know if anything needs more clarification.
 * (1) The bot will take as input a text file with two columns: column 1 has the source of the link to be added, column 2 has the target. That is, an article listed in column 1 will be modified by adding a link to its corresponding article in column 2. Our algorithm, which produced the input file for the bot, is a source prediction algorithm, i.e., it is given a target and finds the top sources that should link to it (this makes our algorithm different from most other algorithms, which tend to take a source as input and produce a ranked list of target candidates). We found the 10 best sources for 700 target pages in the English Wikipedia, and it is these 700 × 10 = 7,000 links that we would like to add to Wikipedia through our bot.
 * As an example, here are the top 10 source articles that should link to the target Abortion:
 * Catholic_Church, Miscarriage, Ethics, Infant_mortality, Embryo, George_W._Bush, Women's_rights, Modern_liberalism_in_the_United_States, Obstetrics, Infanticide
 * As another example, here are the top 10 source articles that should link to the target Air_pollution:
 * Los_Angeles, Pollutant, Automobile, Asthma, Environmental_policy_of_the_United_States, Great_Smog, Ozone, Weather, Acid_rain, Environmental_law
 * At the time we ran our algorithm, none of these sources linked to the respective targets. If the link has been added since, we will not modify the source. Also, when we ran the algorithm, all these links could be added without modifying the words of the article, since the source article contained a mention of the target; the only modification that's necessary is to add double square brackets around the mention (which will be easy using a regex). If the target mention cannot be found in the article text anymore, our bot will not modify the source.
 * (2) Yes, the bot focuses on simple article links. It won't add any category links, and crucially, it will never remove any existing links but only add links that can be tied to a pre-existing anchor text.
 * (3) Since the modifications we are planning are so small (simply add "[[" and "]]" around phrases that already exist in the source article), we were planning on making the changes directly in the wikitext.
 * (4) As the bot is being developed specifically for Wikipedia, it hasn't been applied anywhere else yet.
 * (5) I have informed our collaborator at the Foundation about this discussion. Now that I've replied, I'll ping her again and will ask her if she wants to add a few words here.
 * Cervisiarius (talk) 19:53, 22 December 2014 (UTC)
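The two-column input and the skip rule from the reply above ("If the link has been added since, we will not modify the source") could be sketched roughly as follows. The helper names `read_pairs`, `canonical` and `already_links_to` are hypothetical, and the title normalisation (underscores treated as spaces, first letter case-insensitive) follows English Wikipedia's title conventions rather than anything stated by the operators.

```python
import csv
import io
import re

# Captures the page-title part of a wikilink, stopping at "|", "#" or "]".
LINK_TARGET = re.compile(r"\[\[([^\[\]|#]+)")


def canonical(title):
    """Normalise a title: underscores to spaces, trimmed, first letter upper-cased."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def already_links_to(wikitext, target):
    """True if the wikitext already contains a plain or piped link to `target`."""
    wanted = canonical(target)
    return any(canonical(m.group(1)) == wanted
               for m in LINK_TARGET.finditer(wikitext))


def read_pairs(tsv_text):
    """Parse tab-separated (source, target) rows, ignoring malformed lines."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


pairs = read_pairs("Miscarriage\tAbortion\nEmbryo\tAbortion\n")
for source, target in pairs:
    # A real run would fetch the current wikitext of `source` here and
    # skip the edit whenever already_links_to(wikitext, target) is True.
    print(source, "->", target)
```

Checking the live wikitext at edit time, rather than relying on the snapshot the predictions were computed from, is what makes the skip rule meaningful for articles that have changed since.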


 * Hi, my name is Leila and I work at the Foundation as part of the Research-and-Data team. Here is some background and context that I hope will help you and other editors move forward with a decision. Thanks in advance for your time.


 * Bob, his advisor, and I officially started doing research on improving link coverage in early December 2014. The research is being documented here. Our research builds on the ideas Bob and his colleagues developed before we started our collaboration, and that earlier work is under review at the World Wide Web Conference. That earlier research is what StanfordLinkBot will be operating based on. Given that I was not involved in the earlier research, I took the following steps to assess the performance of StanfordLinkBot:
 * I went over the paper where Bob et al. discuss the algorithm behind StanfordLinkBot. My conclusion was that the approach used to identify missing links is unique and neat. For example, unlike other algorithms suggested earlier in the literature that start from an article and find all the articles that should be linked from this article, they find all the articles that should link to the current article. This approach, along with the high precision reported, controls for overlinking, which is usually a concern in methods that recommend adding links to articles.
 * Bob shared the two-column list he mentions above with me. I randomly chose 100 link suggestions from that list and checked the articles in English Wikipedia to assess whether the suggestions made sense to me. They did, and I was impressed with the performance of the bot in practice in the subset I looked at.
 * Given that what the bot is supposed to do is very simple (adding links to already existing words in articles), and given the expertise of the folks who wrote the code and those who will review it, I have not reviewed the code for StanfordLinkBot. I'd be happy to do so and share my thoughts if needed.

LZia (WMF) (talk) 21:50, 24 December 2014 (UTC)


 * Not a BAG member, just a random guy with a question. Will disambiguation pages and/or pages with a parenthetical after the name, such as Georgia (U.S. state) vs. Georgia (country) vs. Georgia confuse the bot?  Also, I'm assuming that the bot will not add a second link to a page if there is already one on the page? – Philosopher Let us reason together. 02:47, 17 January 2015 (UTC)


 * Hi Philosopher! The basic principle behind the bot is that a source page S will be linked to a target page T if many users who were actively looking to reach T went through S on the way (and if S has no link to T yet; which is the answer to your second question: you're right that no second link will be added if one already exists). The case of disambiguation and parenthesized pages is no different from any other type of ambiguity. In the (contrived) scenario that many users who tried to reach the target Georgia (country) went through, say, Atlanta, then a link from Atlanta to Georgia (country) might be suggested. I'm calling this scenario "contrived", though, because we've never seen such an erroneous suggestion from our algorithm. Cervisiarius (talk) 06:14, 19 January 2015 (UTC)
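The principle stated in the reply above (suggest S → T when many users looking for T passed through S, and S does not yet link to T) can be illustrated with a simple count. This is only a sketch: the function name, the sample paths and the use of a raw pass-through count as the ranking score are assumptions, whereas the operators say they rank candidates "according to various metrics" described in their paper.

```python
from collections import Counter


def candidate_sources(paths, target, existing_links, top_k=10):
    """Rank pages that navigators visited on the way to `target`.

    paths          -- iterable of page-title lists, each one a navigation trace
    existing_links -- set of (source, target) pairs that already exist
    Returns up to `top_k` (source, count) pairs, highest count first.
    """
    counts = Counter()
    for path in paths:
        if not path or path[-1] != target:
            continue  # only traces that actually reached the target count
        # Count each page at most once per trace.
        for source in set(path[:-1]):
            if source == target or (source, target) in existing_links:
                continue
            counts[source] += 1
    # Sort by descending count, then by title, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical traces, invented for this example.
paths = [
    ["Los_Angeles", "Smog", "Air_pollution"],
    ["Asthma", "Los_Angeles", "Smog", "Air_pollution"],
    ["Asthma", "Lung", "Air_pollution"],
    ["Los_Angeles", "California", "Smog", "Air_pollution"],
    ["Asthma", "Lung"],
]
existing = {("Smog", "Air_pollution"), ("Lung", "Air_pollution")}
print(candidate_sources(paths, "Air_pollution", existing, top_k=2))
# [('Los_Angeles', 3), ('Asthma', 2)]
```

A count like this says nothing about which sense of an ambiguous title the navigators had in mind, so it does not by itself settle the disambiguation question raised above.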


 * So, to follow up, the bot will be capable of creating piped links? Obviously, no one will be writing "Georgia (U.S. state)" in the text of an article, so you can't just add brackets around it to create your link.  – Philosopher Let us reason together. 00:36, 20 January 2015 (UTC)
 * Yes. We are explicitly creating piped links. Ashwinpp (talk) 19:29, 21 January 2015 (UTC)
 * Excellent! – Philosopher Let us reason together. 23:22, 21 January 2015 (UTC)
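For illustration, building a piped link from a target title and the anchor text found in the article might look like the sketch below. `make_wikilink` is a hypothetical helper, not taken from the bot. It assumes the English Wikipedia conventions that underscores in titles stand for spaces and that the first letter of a title is case-insensitive, so a pipe is only needed when the anchor differs from the title in some other way.

```python
def make_wikilink(target_title, anchor_text):
    """Return wikitext linking `anchor_text` to `target_title`."""
    title = target_title.replace("_", " ")
    # The anchor can serve as the link target itself if it equals the
    # title, ignoring the case of the first letter only.
    same_page = title == anchor_text or (
        title[1:] == anchor_text[1:]
        and title[:1].lower() == anchor_text[:1].lower()
    )
    if same_page:
        return "[[" + anchor_text + "]]"
    return "[[" + title + "|" + anchor_text + "]]"


print(make_wikilink("Georgia_(U.S._state)", "Georgia"))
# [[Georgia (U.S. state)|Georgia]]
print(make_wikilink("Air_pollution", "air pollution"))
# [[air pollution]]
```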


 * (or so) - There seems to be no major objection; the trial is mainly a sanity check to make sure the bot makes edits sanely and to discover any unexpected issues. Once we have that small base of edits (assuming it looks/works fine and issues aren't encountered), we can likely just approve the full run. -- slakr \ talk / 04:05, 26 January 2015 (UTC)
 * Also, I noticed some edits to the sandbox. While this is fine to test automated editing, what we're really looking for in the trial are changes to the pages themselves (so that we can see the diffs of the changes the bot makes). Otherwise, it's difficult to have a point of reference and track down potential issues. -- slakr  \ talk / 20:29, 2 February 2015 (UTC)
 * Yes, we were making some automated edits to the Sandbox for testing purposes and to ensure we don't break wikimarkup. Appreciate your interest :)
 * We've made trial edits to 50 articles as can be seen in the history. Ashwinpp (talk) 09:22, 8 February 2015 (UTC)


 * Question: User:StanfordLinkBot's infobox calls it "StanfordLinkPredictor"; User:StanfordLinkPredictor has a different bot-req pending at Bots/Requests for approval/StanfordLinkPredictor. Is this here (...Bot) a clone of that (...Predictor) seemingly unattended request? DMacks (talk) 08:55, 8 February 2015 (UTC)
 * Sorry for the confusion. It was initially named StanfordLinkPredictor; however, we changed it to StanfordLinkBot to reflect the fact that it is a bot. The infobox has now been corrected. Ashwinpp (talk) 09:22, 8 February 2015 (UTC)
 * No worries. Could someone on BAG handle closing out that request and possibly block/redirecting its userpage? DMacks (talk) 20:16, 8 February 2015 (UTC)

Trial complete
Ashwinpp (talk) 21:45, 9 February 2015 (UTC)

BAGAssistanceNeeded Ashwinpp (talk) 01:05, 19 February 2015 (UTC)

I note that the bot account has been used by a human. Stop doing that.

The bot's edit to Charles I of England hasn't been reversed in the last ten-ish days, but I myself find it questionable. Country names are an example of overlinking, and in this particular article there's a link to England in the lede. Anyone confused by or curious about the term will have stumbled across it in the first seconds of reading the article. This is an example of where the manual of style on linking is going to provide some food for thought for the developers. Portuguese India bears similar considerations. I also note you said earlier "If the link has been added since, we will not modify the source".

Why was "European" included in the link text to classical music in Cello?

The edit to Aromatic hydrocarbon linked to carbon dioxide, but not carbon monoxide. Is the bot capable of applying multiple edits simultaneously, and was there supporting data to suggest a link to carbon monoxide? Same thing with Seasonal human migration and cattle; and Full breakfast and a bunch of ingredients.

The bot chose to bypass Harvard Graduate School of Education and instead pipe link Harvard to Harvard University; do you think this was an appropriate edit? Similarly, in editing Raphael, a potential link to High Renaissance was ignored and Renaissance was pipelinked to The Renaissance. What was the processing chain that led there? The term Computer scientist wasn't linked to the article Computer scientist, but instead pipelinked to Computer science; is this behaviour by design; is it optimal? In Mickey Mouse, an article with a couple of dozen mentions of the word animation, the term traditional animation was split to link to Animation. Why was this particular instance of animation chosen to be the one receiving the link? Similar issues occur in Parmigiano-Reggiano and Rock music.

In the Optics edit the bot doesn't link the first instance of Newton, but something like the second. How does the bot choose which instance of a term to link? Similar behaviour is observed in Dissection and Portuguese India.

The soybean edit links to Dairy, which is a dairy products factory, rather than Dairy product which is the meaning of word used in the sentence. What can be done to prevent subtly incorrect links like this? Or in sugar, where Protein was linked, whereas a more contextually precise link ought to go to Protein (nutrient)?

Did you check any of the edits the bot made, or do a dry run beforehand? Josh Parris 05:24, 20 February 2015 (UTC)


 * D Josh Parris 04:08, 26 February 2015 (UTC)


 * I echo Josh Parris' concerns: I'm seeing a lot of link choices that seem somewhat inexplicable. I would like to see explanations for the bot's linking choices thus far. The links added at Raphael and Internet are especially puzzling, as the precise terms used in the prose do exist as articles, yet alternate articles were linked instead.
 * I'm particularly concerned to see links being added to a Featured article. The link added to Charles I of England is of no added value as the word is already linked in the first sentence of the article; indeed, the names of major countries are often intentionally not wikilinked as they're extremely common terms. I feel strongly that articles in Category:Featured articles should be explicitly excluded from this bot's working list at this time. It's unlikely that the bot—especially at this early stage of its development—is going to make better linking choices than those made by the human editors who wrote and vetted Featured articles, which are specifically reviewed for compliance with the manual of style guidelines on (among other things) linking without overlinking. Maralia (talk) 00:57, 8 March 2015 (UTC)

So, on the basis of what could charitably be called a patchy trial run, an inexperienced and unresponsive op, an inexperienced supporter and an inexperienced WMF protagonist, I'm going to deny this. This trial has demonstrated why even with Big Data technologies, without strong AI behind editing you end up with shoddy edits. I suggest that, to take this project forward, you publish a list of suggestions and get experienced human editors to make the associated edits. If you want to come back with another attempt you're welcome, but please do a dry run beforehand to ensure that the linking is of a higher quality. A thousand or so edits under your belt would also help demonstrate some wiki-experience. Josh Parris 13:17, 9 March 2015 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.