User:BlevintronBot/Bot/Experiment

This page describes an experiment to evaluate the effectiveness of the bot at fixing broken references. You might want to first read a description of the Bot's normal operation.

The bot employs two tricks to improve broken links:
 * Editing the article by marking the link with a template; and,
 * Sending solicitation messages to the users who initially added the now-broken links to the article.

This page aims to answer questions such as: are these tricks sufficient to improve broken references, and are they annoying to human users of Wikipedia?

The tables in this page are updated automatically by the bot. The bot also records its raw data.

Preliminaries and Terminology
Every time the bot wants to edit an article, it chooses one of these behaviors with equal probability:
 * (-E-S) Do not mark the link, and do not send solicitation messages;
 * (-E+S) Do not mark the link, but send solicitation messages;
 * (+E-S) Mark the link, but do not send solicitation messages; or,
 * (+E+S) Mark the link and send solicitation messages.
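The random selection among the four cases can be sketched as follows. This is a minimal illustration, not the bot's actual source; the case labels and function name are assumptions.

```python
import random

# The four experimental behaviors described above (labels assumed).
CASES = ["-E-S", "-E+S", "+E-S", "+E+S"]

def choose_action(rng=random):
    """Pick one of the four behaviors with equal probability."""
    return rng.choice(CASES)
```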

The bot makes special efforts to ensure that every article and every user belongs to at most one of these cases. This lets us evaluate each case independently.

Note: case (-E-S) serves as the control, since it represents Wikipedia's innate ability to fix broken links in the absence of the bot.

We call the chosen behavior the action, even though case (-E-S) doesn't do anything. Similarly, we will refer to the solicited users, even though they receive no solicitations in cases (-E-S) and (+E-S). We will refer to the timeframe as a period of time after the action. The duration of the timeframe is controlled by the parameter CHECK_FOR_REVERTS_TIMEFRAME.

At the end of the timeframe, we measure changes that have happened to the article, and record statistics. We maintain separate statistics for each of the four cases (listed above).

Update: the bot no longer considers case (-E+S). Now that the bot automatically fixes some links and reports only the unfixed ones, it would be confusing for a user if a page had 100 broken links but only 3 were marked.

Excluded Data Points
Data points are excluded in these situations:
 * 1) An error occurred while trying to perform the action on Wikipedia, for instance an edit conflict or network error.
 * 2) Articles where the bot operator has made edits after the action.  I'm trying to remove my own interference from the stats.
 * 3) If the article has been deleted since the action.
 * 4) Bugs.  Sometimes, a bug causes the bot to crash before it records a data point.  So far, I'm aware of two instances of this in which a total of two data points were lost.

Contributing Links
When the bot tries to fix a page, how many total, distinct broken links are present in that article? This counts the links that the bot has identified over days of checking. It excludes links which have already been marked with the dead link template or which already carry an archive URL.
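The definition above can be sketched as a simple set difference. This is an illustrative sketch only; the function name and the representation of links as sets of URL strings are assumptions.

```python
def contributing_links(broken_urls, marked_dead, archived):
    """Count distinct broken URLs that are neither already marked with
    the dead link template nor already carrying an archive URL.
    All arguments are sets of URL strings (hypothetical representation)."""
    return len(set(broken_urls) - set(marked_dead) - set(archived))
```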

(as of Sun May 13 18:54:42 UTC 2012)

Archived (num places)
When fixing an article, how many places did the bot add an archive URL? Note, this is talking about occurrences of a URL in the article, not distinct URLs. If one URL appears several times in the article, this metric counts it several times.

(as of Sun May 13 18:54:42 UTC 2012)

Marked Dead (num places)
When fixing an article, how many places did the bot mark a URL with dead link? Note, this is talking about occurrences of a URL in the article, not distinct URLs. If one URL appears several times in the article, this metric counts it several times.

(as of Sun May 13 18:54:42 UTC 2012)

Unfixed (num places)
When fixing an article, in how many places was the bot unable or unwilling to either mark or archive the URL? This usually occurs because the URL appeared in an unusual location in the document, for instance inside a template the bot didn't understand. Note, this is talking about occurrences of a URL in the article, not distinct URLs. If one URL appears several times in the article, this metric counts it several times.
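The "places, not distinct URLs" counting used by the three metrics above can be sketched as follows. This is a hypothetical illustration; the real bot parses wikitext rather than doing plain substring counts.

```python
def count_places(article_text, urls):
    """Count occurrences (places) of each URL in the article text.
    A URL appearing three times contributes three to the total,
    unlike a distinct-URL count."""
    return sum(article_text.count(u) for u in urls)
```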

(as of Sun May 13 18:54:42 UTC 2012)

Distinct Editors who Introduced the now-dead Links
How many distinct editors added the now-dead links to this article?

(as of Sun May 13 18:54:42 UTC 2012)

Number of Solicitation Messages Sent
How many solicitation messages were sent for this article? Solicitations are sent to editors who introduced the now-dead links, but only if the link cannot be repaired automatically. Solicitations are never sent to IP-address users or to bots (accounts with the bot tag on the user page).
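The eligibility filter described above might look like the following sketch. The editor record's fields (`name`, `is_bot`) are assumptions; detecting an anonymous editor by parsing the username as an IP address is one plausible heuristic, not necessarily the bot's actual method.

```python
import ipaddress

def eligible_for_solicitation(editor):
    """Return True if this editor may receive a solicitation message.
    `editor` is a hypothetical dict with 'name' and 'is_bot' fields."""
    if editor.get("is_bot"):
        return False          # never message bot accounts
    try:
        ipaddress.ip_address(editor["name"])
        return False          # username parses as an IP: anonymous editor
    except ValueError:
        return True           # ordinary named account
```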

(as of Sun May 13 18:54:42 UTC 2012)

Community Response to the Bot's Actions
These measurements are taken ONE WEEK after the bot's actions. That is why the number of samples differs between this section and the previous.

Does the Bot Encourage Improvement?
The goal of this metric is to determine the overall improvement to the links. I count replacing a link or adding an archive link as an improvement; merely marking it with the dead link template is not an improvement.

Here, we say a broken link has been improved if it has been removed, replaced, or an archive has been specified via archiveurl, Wayback or WebCite. We report here the fraction of distinct links that have been improved.

Specifically, we take the text of the latest version of the document, remove dead link tags whose bot parameter contains BlevintronBot, and extract the set of URLs which are neither marked dead nor carry an archive URL. We then count the fraction of dead links (detected during our initial scraping and link checks) that are no longer present in the latest document. This metric counts distinct links.
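The fraction computed above can be sketched as follows, assuming the initial dead-link set and the latest unresolved-URL set have already been extracted (the function name and set representation are assumptions):

```python
def improvement_fraction(initial_dead, latest_unresolved):
    """Fraction of distinct dead links (from the initial scan) that no
    longer appear among the latest version's unresolved URLs."""
    initial_dead = set(initial_dead)
    if not initial_dead:
        return 0.0
    improved = initial_dead - set(latest_unresolved)
    return len(improved) / len(initial_dead)
```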

Limitations of this metric:
 * sometimes a broken link is better than no link, since it may help users find new citations;
 * blanking a page gives a perfect score for this metric.

(as of Sun May 13 18:54:42 UTC 2012)

Does the bot encourage Participation?
In this section, we try to measure the effect of the bot on the levels of participation.

Total Article Participation
How many edits are made to the article by humans (non-bot users) during the timeframe?

Limitations of this metric:
 * does not account for other causes of participation;
 * participation of several distinct users is not independent, since they are editing a shared article (e.g. maybe the first user fixed the problem before the second loaded it);
 * more participation is not necessarily better (the hypothetical perfect article would never need to be edited);
 * participation is probably bi-modal (some articles have many edits, some have few), and the average oversimplifies this distribution.

(as of Sun May 13 18:54:42 UTC 2012)

Average Participation among Solicited Users
How many edits does each solicited author contribute to the article during the timeframe?

Limitations of this metric:
 * only meaningful for comparing cases (-E+S) and (+E+S);
 * does not account for other causes of participation;
 * participation of several distinct users is not independent, since they are editing a shared article (e.g. maybe the first user fixed the problem before the second loaded it);
 * more participation is not necessarily better (the hypothetical perfect article would never need to be edited).

(as of Sun May 13 18:54:42 UTC 2012)

Fraction of Solicited Users who Participate
What fraction of solicited authors contribute to the article during the timeframe?

Limitations of this metric:
 * only meaningful for comparing cases (-E+S) and (+E+S);
 * does not account for other causes of participation;
 * participation of several distinct users is not independent, since they are editing a shared article (e.g. maybe the first user fixed the problem before the second loaded it);
 * more participation is not necessarily better (the hypothetical perfect article would never need to be edited).

(as of Sun May 13 18:54:42 UTC 2012)

Does the bot annoy Humans?
The bot cannot measure a human's emotional state, but it can look for evidence that it did the wrong thing, or that humans do not appreciate its help.

Revert Rate
What fraction of our edits are reverted by the end of the timeframe? We say an edit has been reverted if there is a later revision to the article whose edit summary mentions revert, undo, etc.
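The edit-summary check might be approximated with a keyword pattern like the one below. The keyword list is an assumption, not the bot's real parser; as the limitation below notes, any such heuristic yields false positives and false negatives.

```python
import re

# Hypothetical keyword list for detecting reverts in edit summaries.
REVERT_PATTERN = re.compile(r"\b(revert(ed)?|rv|undo|undid|rollback)\b",
                            re.IGNORECASE)

def looks_like_revert(edit_summary):
    """Heuristic: does a later revision's edit summary mention a revert?"""
    return bool(REVERT_PATTERN.search(edit_summary or ""))
```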

Limitations of this metric:
 * parsing the edit summary is less than perfect, and I expect some false-positives and false-negatives.

(as of Sun May 13 18:54:42 UTC 2012)

Banned from the Article
What fraction of the edited articles are modified by the end of the timeframe so as to exclude bots (e.g. via the bots exclusion template)? This is strong evidence that a human no longer wants the bot.

Limitations of this metric:
 * we don't know why the bot was excluded;
 * in some cases, we don't know if the exclusion is specific to this bot, or if another bot was causing trouble.

(as of Sun May 13 18:54:42 UTC 2012)

Banned from User_talk:
What fraction of the solicited users opt out of future communications from the bot (e.g. via an exclusion template on their user page)? This is strong evidence that the bot's solicitations are not appreciated, or are perceived as offering no value to human contributors.

Limitations of this metric:
 * we don't know why the bot was excluded;
 * in some cases, we don't know if the exclusion is specific to this bot, or if another bot was causing trouble.

(as of Sun May 13 18:54:42 UTC 2012)

Discussion
Based on the results as of 14 April:

 * Notification rate: the bot sends about 4 notifications per 10 edits on average.
 * Participation: about 1 in 5 notified users contribute to the article within a week.
 * Annoyance metrics: the bot was not blocked (via the bots exclusion template) from any article or any User talk page, though one user reverted a notification.

The link improvement metric shows a big difference between the three cases:
 * Control: 0% of the dead links improve after a week.
 * No notifications: 42% improve after a week.
 * With notifications: 58% improve after a week.

This is misleading: most of the improvement is due to the archive URLs that the bot automatically finds and adds to the articles. Comparing the archive rate with the mark-dead rate shows that about 0%, 42% and 56% of links were archived by the bot in those cases, respectively. So the improvement attributable to notifications is probably closer to 2 percentage points.
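The decomposition above amounts to subtracting the bot's own archiving from the total improvement, using the figures quoted in the text:

```python
# Week-one improvement rates for case (+E+S), from the figures above.
improved_with_notifications = 0.58  # total fraction of links improved
archived_by_bot             = 0.56  # fraction archived by the bot itself

attributable_to_notices = improved_with_notifications - archived_by_bot
print(f"{attributable_to_notices:.2f}")  # prints 0.02
```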

Conclusions: The false positive and broken edit rate is still too high for deployment. The experiment suggests that notifications do not annoy most users. Notifications have a small, positive effect on dead link remediation.

My initial hypothesis was that notifications would have a large effect. I have invalidated this hypothesis, and now see no benefit of this bot over other dead link bots.