User:Nickj/Link Suggester

Note: LinkBot has been superseded by the Can We Link It link suggesting web tool.

"Can We Link It" has the following benefits over LinkBot, or feature parity with it:
 * Uses almost the same internal logic as LinkBot.
 * It also syntax checks articles.
 * It works on the live version of articles (i.e. you can add a paragraph, and then see suggestions for the article including what you just added, whereas with LinkBot you might have had to wait months, because LinkBot used database dumps).
 * It doesn't pollute talk pages with lists of suggestions.
 * To some degree it learns from its mistakes.
 * Suggestions are only supplied on-demand.
 * It will make the links for you (you just say "yes" to a suggestion), whereas with LinkBot you had to manually make the links.
 * If you reject a suggestion, it won't make that suggestion again for that article, whereas LinkBot would (because it had no way of definitively knowing that a suggestion had been rejected).
 * As with LinkBot, the decision to make or not make a link is made by humans. The software only makes suggestions, and you decide whether you like the suggestions.

However there are some downsides as compared to LinkBot:
 * People mightn't know there are suggestions for an article unless they go looking for them, whereas with LinkBot the suggestions "came to you" to some extent. Also as part of this, "Can We Link It" is hosted on an external site (it's an external tool, not integrated into the Wikipedia).
 * It doesn't suggest reverse links (to deorphan an article) - i.e. LinkBot would suggest links to an article plus links from that article, whereas Can We Link It only suggests links from an article. In other words, LinkBot would actually suggest every link twice (once in the source article, once in the destination article). However, every link is also a reverse link - so if all the suggestions are processed, it'll have the same effect.
 * It doesn't know about disambiguation pages. LinkBot would not suggest links to disambiguation pages, whereas Can We Link It will. This is to a very small extent mitigated by the fact that Can We Link It learns from its mistakes, so saying "no" to a disambiguation link suggestion will stop that link from being suggested in future (but not suggesting the link in the first place would be better).

What is the Link Suggester, and what is LinkBot?
The Link Suggester is a bit of software that takes the text of a Wikipedia article and looks for links that could be made to other articles, but that have not been made yet. LinkBot is the bit of software that takes these suggestions and presents them in a way that people can easily read on the Wikipedia. Both bits of software are written and operated by Nickj.

Can you give an example?
Yes, using a real-world example. Consider the following snippet (wiki codes are shown):

Newtown lies partly in the electorate of Grayndler, currently represented by Anthony Albanese of the ALP.

The output of the link suggester might look like this:
 * Can link electorate: Newtown lies partly in the electorate of Grayndler, currently represented by Anthony Albanese of the ALP.
 * Can link Anthony Albanese: Newtown lies partly in the electorate of Grayndler, currently represented by Anthony Albanese of the ALP.

What happens with this suggested link information?
The plan is for it to be added to the article's talk page.

So the article itself would not be modified?
No. The best judges of what are good links are humans, not software. Software does not understand context or meaning, whereas a human does. Therefore, the link suggestions would be added to the article's talk page, and then a human editor can add the links to the article, or disregard or delete them if they are not appropriate.

So if humans are better at determining appropriate links, why suggest links?
4 reasons:
 * 1) Humans miss things. In the previous example, "Anthony Albanese" was a good link, but it was simply missed.
 * 2) Checking for links is tedious. The only way to know if the Wikipedia contains articles that are relevant is to search for them. Most people don't bother other than for a subset of the possible links.
 * 3) Links are content.
 * 4) The Wikipedia is expanding, and while there mightn't have been an appropriate link when the article was written, there may be one now.

The best approach is likely to be a fusion that combines the strengths of humans, and the strengths of software; A bit of software can perform the tedious and repetitive process of finding missing links; and a human is best able to take that information and apply it appropriately to the article.

Has anyone discussed this idea before?
Yes - look here for a discussion of auto-linking.

Note that most of the arguments against assumed that:
 * The changes would be automatically applied to the article, not to the talk page.
 * That the links would be on almost everything, and thus be really annoying. For example, the following assumptions were typically made:
 * Would have a preference for short links, not long links.
 * Would link on common single words.
 * No differentiation of proper nouns from other text.

An approach is being used that specifically tries to address these problems - please see below for more information on good links versus bad links.

Surely we don't want every possible link? That would be lots of links!
Exactly, and this is where it gets interesting. Making every single possible link is simply not conducive to the flow of an article. For example, consider the following real-world example wiki sentence:

A reorganisation of local government boundaries in 1968 saw part of Newtown placed under Marrickville council.

The link suggester, when showing every possible link, would show the following for this sentence:
 * 1) Can link local: A reorganisation of local government boundaries in 1968 saw p...
 * 2) Can link government: A reorganisation of local government boundaries in 1968 saw part of Newt...
 * 3) Can link boundaries: A reorganisation of local government boundaries in 1968 saw part of Newtown placed

This is a pretty exhaustive list, but these are probably bad things to link on. To avoid this, the Link Suggester applies some simple rules-of-thumb to try and improve the signal-to-noise ratio (i.e. keep the 'good links' and eliminate the 'bad links').

So what makes a good link, and what makes a bad link?
A good link is usually one of:
 * A complex phrase - e.g. There is no such thing as a free lunch. These are easy to recognise, because they are long (typically 3 or more words).
 * A proper noun - e.g. North Sydney, Anthony Albanese. These are easy to recognise, because they are capitalized, whilst not being at the start of a sentence.
 * An acronym - e.g. NAFTA is a free trade agreement. These are easy to recognise, because they are capitalized in multiple letters.

Then there are things that are sometimes worthwhile linking on:
 * Abstract nouns - refers to ideas or concepts - e.g. fraud can be a good link, whereas government is probably a bad link. Determining which abstract nouns make good links is hard, although we can make some broad generalizations (e.g. abstract noun words that end in "-ism" tend to be good links, such as fascism).
 * Adjectives - e.g. blue-collar is a good link, whereas the blue in the phrase "the sky is blue" would usually be a bad link. Determining which adjectives make good links is hard, as the best links are to infrequently used adjectives (some of which are on this list).

Things that are usually bad links:
 * Common nouns - e.g. restaurant, chair, bed.
 * Verbs - e.g. to be, suggesting.
 * As a rough rule, very short links tend to be bad. These are easy to recognise, by rejecting any links that are, say, 4 or fewer characters long.

Additionally, a link should only be suggested once per article, as per the Wikipedia style.

Also, an article shouldn't link to itself (even by linking to a page that redirects to the original article).
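
As a rough illustration only (this is not the actual Link Suggester code, which is written in PHP), the "good link" rules of thumb above could be sketched in Python along these lines. The function name `classify_candidate`, the `at_sentence_start` flag, and the order in which the rules are tried are all assumptions made for the sketch:

```python
def classify_candidate(phrase: str, at_sentence_start: bool = False) -> str:
    """Give a rough label to a candidate link phrase, using the rules of thumb above."""
    words = phrase.split()
    capitals = sum(1 for ch in phrase if ch.isupper())

    # Acronym: a single word capitalized in multiple letters (e.g. NAFTA).
    # Assumption: acronyms are checked before the length rule.
    if len(words) == 1 and capitals >= 2 and phrase.isupper():
        return "acronym"

    # Very short links (4 or fewer characters) tend to be bad.
    if len(phrase) <= 4:
        return "too short"

    # Complex phrase: typically 3 or more words.
    if len(words) >= 3:
        return "complex phrase"

    # Proper noun: capitalized, whilst not being at the start of a sentence.
    if phrase[0].isupper() and not at_sentence_start:
        return "proper noun"

    # Everything else (common nouns, verbs, etc.) is left unclassified here.
    return "unclassified"


for example in ["NAFTA", "Anthony Albanese", "There is no such thing as a free lunch", "bed"]:
    print(example, "->", classify_candidate(example))
```

Abstract nouns and adjectives are deliberately left out of this sketch, since (as noted above) deciding which of those make good links is hard.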

So what exactly will it link on?
The link suggester will suggest links that meet these criteria:
 * "good links" (as defined above)
 * Plus "-ism" abstract nouns, minus the few really common ones (like criticism).
 * Plus 2 word links, unless these links start with the word "the" (in which case they need to be 3 words or more, or have 2 or more capital letters).
 * Most dates are removed (e.g. "1 March", "13th January", "April 2003", "May 21st", "December 25").
 * Finally any links that I have manually marked as bad to link to are removed. Around 500 links are currently manually marked as bad, but if you find something that has been missed (and there are sure to be things) then please let me know.

What remains are generally quite safe links to suggest, with a good signal-to-noise ratio.
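
A minimal Python sketch of how criteria like these might be combined is shown below. It is an illustration under stated assumptions, not the real implementation: the names `passes_filters`, `DATE_RE`, `COMMON_ISMS` and `MANUAL_BAD` are hypothetical, the date pattern only covers the five example formats listed, and the two sets hold only a few entries taken from examples on this page rather than the real lists.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
_DAY = r"\d{1,2}(?:st|nd|rd|th)?"
# Matches forms like "1 March", "13th January", "May 21st", "December 25", "April 2003".
DATE_RE = re.compile(
    rf"^(?:{_DAY} (?:{MONTHS})|(?:{MONTHS}) {_DAY}|(?:{MONTHS}) \d{{4}})$"
)

COMMON_ISMS = {"criticism"}               # placeholder for "the few really common ones"
MANUAL_BAD = {"some more", "twenty-five"}  # placeholder for the ~500 manual exclusions


def passes_filters(phrase: str, is_good_link: bool) -> bool:
    """Decide whether a candidate should be suggested.

    `is_good_link` says whether the phrase already counts as a "good link"
    (complex phrase, proper noun or acronym) under the earlier definition.
    """
    lowered = phrase.lower()
    words = phrase.split()

    # Removals apply to everything, including otherwise good links.
    if lowered in MANUAL_BAD or DATE_RE.match(phrase):
        return False
    if is_good_link:
        return True

    # Plus "-ism" abstract nouns, minus the really common ones.
    if len(words) == 1 and lowered.endswith("ism") and lowered not in COMMON_ISMS:
        return True

    # Plus 2 word links; ones starting with "the" need 2 or more capital letters.
    if len(words) == 2:
        if words[0].lower() != "the":
            return True
        return sum(1 for ch in phrase if ch.isupper()) >= 2

    return False


print(passes_filters("fascism", False))     # an "-ism" abstract noun
print(passes_filters("13th January", True))  # a date
```

Checking the removals first means that a date or a manually excluded title is dropped even when it would otherwise qualify, which is one plausible reading of the list above.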

What is the current project status?

 * Phase 1 (Complete): Over the course of last week (18th October to 25th October), I manually cut and pasted the output of this script on to the talk pages of around 40 pages. This was a manual process, which allowed me to see what it does well and what it does wrong, and to gather feedback (good and bad) from page authors. This step was very useful.
 * Phase 2 (Complete): Currently running the script on a local copy of the wiki database, and fine-tuning the manual list of bad links; Fine-tuning the algorithm to reduce false-positives, cover corner cases, and increase speed of detection. Also waiting for the Wiki Syntax project to complete its first run, since the two projects are related.
 * Phase 3 (Complete): Run it on a small number of pages, as a supervised bot (probably do around 100 on the first night, and then if things go well gradually increase the size of each run). This would consist of the LinkBot running slowly (with somewhere between 10 and 30 seconds between saves), whilst I check that the suggestions added to the talk pages look reasonable. This first trial run is the step that I previously sought approval for, and which should commence soon (waiting on the next database dump).
 * Phase 4 (Complete): Run a larger trial, which incorporates the feedback from Phase 3 (such as not suggesting links to disambiguation pages, excluding more "bad" pages, and putting the suggestions on a subpage rather than filling up the talk page).
 * Phase 5 (pending): The main problem now is scalability: how to get all the suggestions out to people (there are roughly 1.4 million suggestions all up). Adding these as separate pages would add a huge number of pages to the Wikipedia: would the Wikipedia scale okay? What happens when these pages need to be deleted: isn't that going to take an administrator a long time? Uploading these suggestions using a bot will take over a month: is that too long? Also people want a way to mark which suggestions to implement, and which to not: doing this using a bot that reads a human-edited list is very messy and seems awkward - wouldn't it be better to have some kind of web-based form for this? I have a big list of suggested links for the entire English Wikipedia - the problem is getting it out there. I'm wondering whether it really needs a web server with several gigabytes of disk space, and a decent amount of bandwidth. (Dbroadwell had some ideas along these lines).

Would you like it to be run on the whole wiki in the future?
The Link Suggester has been run on a local copy of the whole English Wikipedia already for testing purposes, but thus far these results have not been uploaded to the Wikipedia. The aim (if the feedback is suitably positive) is eventually for the LinkBot to upload the suggestions for every article in the English Wikipedia.

Are there any spin-off projects that flow out of this?
Yes, there are two:
 * 1) When processing the page, the link suggester needs to look at the page's wikicode (so that it doesn't suggest linking something that is already a link, for example). When doing this it is easy to detect pages with broken wikicode (e.g. unclosed wiki-links, unopened wiki-links, malformed section headers, and so forth). This spin-off project evolved to become the Wiki Syntax Project.
 * 2) The Missing Redirects Project, which suggests new useful redirects and disambiguation pages, from the data that we already have in the Wikipedia.
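
To give a feel for the kind of check involved in the first spin-off, here is a small Python sketch that reports unclosed and unopened wiki-links. It is an assumption-laden illustration rather than the Wiki Syntax Project's actual logic: the function name `check_wiki_links` is made up, nested links are simply allowed, and other problems such as malformed section headers are not covered.

```python
def check_wiki_links(wikitext: str) -> list[str]:
    """Report "[[" without a matching "]]" and vice versa."""
    problems = []
    open_positions = []  # offsets of "[[" that have not been closed yet
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "[[":
            open_positions.append(i)
            i += 2
        elif pair == "]]":
            if open_positions:
                open_positions.pop()  # closes the most recent open link
            else:
                problems.append(f"unopened wiki-link at offset {i}")
            i += 2
        else:
            i += 1
    # Anything still open at the end of the text was never closed.
    problems.extend(f"unclosed wiki-link at offset {pos}" for pos in open_positions)
    return problems


print(check_wiki_links("in the [[electorate]] of [[Grayndler]]"))
print(check_wiki_links("see [[Newtown for details"))
```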

Will it suggest links multiple times? e.g. Will it try to link the same proper noun two or more times?
No.

Will it suggest links to articles with titles similar to, but not identical to, the text in my article?
No (apart from capitalization differences).

Why did you make this?
I kept finding that there were articles that I hadn't linked to yet, simply because I didn't know that they existed. I decided that there should be an automated way of suggesting links, based on the text of the article.

Can I get the source code?
Source code for LinkBot is now available. It's a bit of a mess, sorry!

Does it ever make bad suggestions?
Yes. Please take all the suggestions with a grain of salt. Bad suggestions are almost always because the same combination of words is used to mean different things to different people. I have endeavoured to eliminate bad suggestions, whilst keeping good suggestions, but getting a perfect automated link suggester is probably impossible without genuine Artificial Intelligence.

Also some suggestions can be tangential - that is they're not inherently wrong suggestions, they're just not appropriate to link to in the context that they're suggested. For this reason the suggestions are provided for your review, so that people can make the links they like and disregard the rest.

Will it ever modify the original article?
No, never. If the original article ever gets modified by the software, then that's a bug.

Is this an official Wikipedia project?
No - at the moment this is a personal project.

How does it work?

 * 1) The most recent available copy of the enwiki database is downloaded and stored.
 * 2) A large index of the names of every article, and the target (if it's a redirect) is built in memory (for speed).
 * 3) The text of every article is retrieved from the local database, and then processed to look for unlinked text that matches the title of an existing article. Those results are then filtered based on the rules of what makes a good or bad link, and the wiki syntax of the article is checked.
 * 4) The results are then saved to the local database.
 * 5) Once the above has finished, the saved suggestions can then be uploaded to the Wikipedia by the LinkBot.
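
Steps 2 and 3 can be sketched in Python as follows. This is only a toy under stated assumptions, not the real PHP code: `build_index` and `suggest_links` are hypothetical names, the page data is a tiny hand-made dictionary rather than a database dump, the wiki markup in the sample sentence is assumed, phrases are matched greedily with the longest first, and the good/bad link filtering and syntax checking are left out.

```python
import re
from typing import Dict, List, Optional, Tuple

LINK_RE = re.compile(r"\[\[.*?\]\]")


def build_index(pages: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Step 2: map each lower-cased title to its target (itself, unless it is a redirect)."""
    return {title.lower(): (target or title) for title, target in pages.items()}


def suggest_links(article_title: str, wikitext: str, index: Dict[str, str],
                  max_words: int = 5) -> List[Tuple[str, str]]:
    """Step 3: find unlinked text that matches the title of an existing article."""
    # Targets that must not be suggested: the article itself, plus anything already linked.
    seen = {article_title.lower()}
    for match in LINK_RE.finditer(wikitext):
        linked = match.group(0)[2:-2].split("|")[0].strip().lower()
        seen.add(index.get(linked, linked).lower())

    suggestions = []
    plain = LINK_RE.sub("\n", wikitext)  # existing links become barriers
    for segment in re.split(r"[\n.,;:!?()]", plain):
        words = re.findall(r"[\w'-]+", segment)
        i = 0
        while i < len(words):
            # Try the longest phrase starting at this word first.
            for n in range(min(max_words, len(words) - i), 0, -1):
                phrase = " ".join(words[i:i + n])
                target = index.get(phrase.lower())
                if target is not None:
                    if target.lower() not in seen:  # suggest each target only once
                        seen.add(target.lower())
                        suggestions.append((phrase, target))
                    i += n
                    break
            else:
                i += 1
    return suggestions


# Toy data (assumed for illustration): title -> redirect target, or None.
pages = {"Newtown": None, "Electorate": None, "Grayndler": None,
         "Anthony Albanese": None, "ALP": "Australian Labor Party",
         "Australian Labor Party": None}
text = ("Newtown lies partly in the electorate of [[Grayndler]], "
        "currently represented by Anthony Albanese of the [[ALP]].")
print(suggest_links("Newtown", text, build_index(pages)))
```

Because redirects are resolved through the index before the "already seen" check, a phrase that redirects back to the article being processed is skipped as well. Splitting on punctuation is a simplification that would break titles such as "St. Louis".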

How long does it take to suggest links for the whole of the English Wikipedia?
On the old Pentium-3 800 MHz which I am using for this:
 * Downloading, uncompressing, and storing the latest enwiki database dump takes 7 hours. (This step is I/O limited - the hard disk light never goes off).
 * Building an index of articles in memory takes several minutes. Building a really complete and useful index in memory that knows the target of every redirect takes five minutes and consumes around 200 MB of RAM. (This step is I/O limited - the hard disk light never goes off).
 * Then to check the syntax of and suggest links for every article in the wikipedia, and save the results, takes 51 hours. (This step looks I/O limited for short articles, and either CPU or memory-bandwidth or algorithm limited for medium to long articles - the hard disk light only goes off briefly when processing medium or long articles).

So total time taken = 7 + 0 (rounding down) + 51 = 58 hours.

Note: These times do not include any uploading of suggestions to the Wikipedia by the LinkBot. This is purely the time taken to generate the suggestions, which the LinkBot can then upload.

Note: These times are proportional to the number of articles in the Wikipedia (and were accurate as of 13-Dec-2004) - so as the Wikipedia continues to grow, the times taken to process the data will increase too.

What programming language is it written in?
In PHP, running on Debian Linux 3.0.

How long roughly might it take to upload the data to the Wikipedia?
Currently the main speed-limiting factor is political, and not technical. This is because there is a maximum allowed limit of 6 transactions per minute for bots (although there is some discussion on allowing exceptions to this rule where there is consensus on it).

At the 6 transactions per minute rate, uploading all the suggestions to the Wikipedia would take 24 days, which is a very long time (and runs a high risk of the suggestions becoming out of touch with the article if it has been edited in the intervening period).

The fastest the LinkBot could probably go is around 30 or 40 transactions per minute, at which rate uploading suggestions would take around 5 days (which is far more realistic). So, if the trial phase goes well, and if it still seems a good idea to run it on the whole Wikipedia, then I will see if there is consensus on raising the transactions per minute limit on the LinkBot.
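
The arithmetic behind these estimates can be reproduced with a few lines of Python. The sketch assumes one bot edit per page with suggestions and a bot running around the clock, and it uses the figure of 210931 pages with suggestions quoted elsewhere on this page; the function name `upload_days` is made up for the example.

```python
PAGES_WITH_SUGGESTIONS = 210_931  # figure quoted elsewhere on this page


def upload_days(pages: int, edits_per_minute: float) -> float:
    """Days needed to make one edit per page at a constant rate, running non-stop."""
    return pages / (edits_per_minute * 60 * 24)


for rate in (6, 30, 40):
    days = upload_days(PAGES_WITH_SUGGESTIONS, rate)
    print(f"{rate} edits per minute: about {days:.1f} days")
```

Under those assumptions, 6 edits per minute comes out a little over 24 days, and 30 to 40 edits per minute comes out at somewhere between roughly 3.7 and 4.9 days.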

Are there any examples I can see?
Yes - for current examples here is the edit log for LinkBot.

There are also some much older examples (these are from Phase 1, done with manual cutting and pasting, around late October 2004):
 * Talk:Snowy Mountains Scheme.
 * Talk:William Charles Wentworth.
 * Talk:Peter Reith.
 * Talk:Miss Australia.
 * Talk:Federal Court of Australia.
 * Talk:Ash Wednesday fires.
 * Talk:Agriculture in Australia.

Can I leave feedback, and if so where?
Absolutely - you can leave both positive feedback and negative feedback (if you're not sure which, pick whichever you think is most appropriate). You can also let me know about suggested links that probably should never be suggested.

Why were no links suggested pointing to my article?
If no links were suggested, then the reason was one of the following:
 * The article's title does not fulfil the criteria of a 'good link', as defined previously. These criteria were a 'best guess' at what qualifies as a good link.
 * The article's title does not occur in the text of any articles. The text must occur verbatim (excluding capitalisation) in order for it to be suggested.
 * The article was manually excluded by me (and as such, this reflects my personal biases, opinions and preferences). There are only a quite small number of manual exclusions (around 500, out of around 600000 articles/redirects, or 0.08%).

What are some of the reasons for manual exclusions?
Manual exclusions (never suggesting a link to a page, even though it qualifies as a "good link") are added because of one or more of the following reasons:
 * The article was a stub or too short to link to. Examples: Private sector, Far East, Spoken language, State-of-the-art.
 * The article was largely a disambiguation page - example: North Shore.
 * The article was about something other than the way those words are used in everyday English (this happened mostly for articles about books, albums, magazines, bands, songs, and movies) - an example was the movie "They Live", or the movie "In and Out", or the book My Life. Another example is Code for, which is about genetics, but the way this phrase is used in articles does not correspond with this. Another is Shift in, which is computer-specific, which is not how it is used in everyday English (e.g. "This shift in public perception ..."). Another is 15 minutes, and another is Ten Years Later.
 * The article may have been related to the topic, but not spot-on - an example is Baseball player - this redirects to list of baseball players, but it doesn't define or explain what a baseball player is. Another was State capital. Another was TV shows. Another was Jazz musician. Another was Court case. This mostly seemed to happen with pages that were redirects to lists of things.
 * There were simply too many links suggested to the article (examples: New York, St. Louis, Los Angeles). Living together, poverty line and per capita income were at the top of this category though, because those phrases are used in the RamBot articles (so the suggestions were found many times). Overall though, very few articles were in this category (maybe around 10 all up).
 * Too culturally biased. Example: State legislature - many places in the world have a State legislature, not just the US; example: Task force - phrase has entered common English-speaking lexicon, but article is quite US-military-specific; example: Minister of Education - many places in the world have a Minister of Education, not just the United Kingdom; example: Meteorological Office - many places in the world have a Meteorological Office, not just the UK.
 * Some links I thought were just plain silly - example: Some More.
 * I didn't think it was worth linking to in the majority of situations in which it was used - example: twenty-five.
 * Someone else requested it not be linked to on this page.

Why add the suggestions to a talk page, and not as some kind of big list?
The idea of adding suggestions to the talk pages is that:
 * 1) They're suggestions - a user may not like some of them - and so the actual article should never be automatically modified. People feel strongly about this, and I agree with them.
 * 2) By adding suggestions to the talk page, those suggestions are automatically associated with the page, and visible to anyone watching the page, all without the problems inherent in modifying the page.
 * 3) The link suggester shows suggested links from a page, but it will also show suggested links to a page (but only if there are outgoing links as well).

The problems with listing them separately as a big series of list pages are that:
 * 1) Point 2 above is lost.
 * 2) It would be a huge number of pages, much more than 100 or 200. With 210931 pages with suggestions, at around 3.8 suggestions on average per page, that's over 800,000 suggestions in total. Experience with this kind of process of making lists of changes (with creating the data for the Wiki Syntax Project) has shown that 140 suggestions per page (includes link to page, word that can be linked, and context of the change) is the perfect size to get the page size just under the 32K suggested maximum page size. 800000 / 140 means there would be 5715 pages.
 * 3) Then people have to process those 5715 pages. Experience with asking people for their help in processing changes listed on pages (as part of the Wiki Syntax Project), with a lot of effort by a lot of people, has shown that the processing rate is around 35 such pages per week. At that rate of progress, with the same consistent and concerted effort (which would probably be very hard to do over such a long-haul project), it would take around 165 weeks to process all of the suggestions, which equals 3.2 years.
 * 4) Point 3 above is lost unless you have twice as many pages (a list to, and a list from), which would be 11430 pages.
 * 5) Over such a long period of time, with the rapid pace of change in the Wikipedia, the data would age very rapidly, and quickly become irrelevant to the content that was actually on the pages at the time a human got around to looking at the suggestion.
 * 6) In other words, the most viable approach appears to be to distribute the problem of processing the links out to the page authors / maintainers by putting those suggestions on the talk pages (and this act of distributing the problem is one of the most fundamental reasons why the Wikipedia is so successful). Of course, page authors can ignore the suggestions (personally, I wouldn't, if I thought that they were good suggestions, and if I cared about the page) - and with the tests, sometimes people ignore the suggestions, and sometimes they use them. Of course if you ignore the suggestions, then you're no worse off than you were before.
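
For anyone who wants to check the list-page figures above, the calculation can be written out as a short Python sketch. The function name `list_page_estimate` is made up, and the inputs are simply the numbers quoted in the points above (140 suggestions per list page, 35 list pages processed per week).

```python
import math


def list_page_estimate(total_suggestions: int, per_page: int = 140,
                       pages_per_week: int = 35):
    """Return (number of list pages, weeks to process them, years to process them)."""
    list_pages = math.ceil(total_suggestions / per_page)  # a partial page still counts
    weeks = list_pages / pages_per_week
    return list_pages, weeks, weeks / 52


print(210_931 * 3.8)  # total suggestions: just over 800,000
list_pages, weeks, years = list_page_estimate(800_000)
print(list_pages, round(weeks), round(years, 1))
print(list_pages * 2)  # separate "to" and "from" lists
```

Computed this way the processing time comes out at roughly 163 weeks, or a little over 3 years, which is in line with the rounded figures of around 165 weeks and 3.2 years given above.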

Can I submit some text to look for linkify-ing?
I am looking for pages to contribute to. A friend suggested writing my own personal bio, then submitting it to this linkify bot as a way to find pages I might be interested in editing. dfrankow 17:02, 12 October 2005 (UTC)

What is the best way to annotate the suggestions?
If I create some of the links as suggested by the link bot, is it acceptable to delete that part of the link bot text on the talk page, or is that considered as bad as deleting the comments of a real person? I'd rather not have to put a comment after every suggestion saying that it is now taken care of, but if the text just stays, other people will waste time investigating every suggestion again.--ragesoss 15:24, 11 January 2006 (UTC)
 * It's no problem to delete them, and no offense whatsoever will be taken by deleting them. Alternatively, you can always strike the suggestions out to indicate that they have been done. -- All the best, Nickj (t) 04:14, 12 January 2006 (UTC)