User:GabrielF/NewsCitations


 * Note: This is a bit preliminary at the moment. I welcome any comments or suggestions that you may have, but please regard this as a draft. GabrielF (talk) 21:41, 12 July 2013 (UTC)

This project uses Big Data techniques to analyze the use of news media sources on Wikipedia. I selected approximately 250 news sources, including newspapers, magazines, TV and radio networks and websites. I then counted the number of times Wikipedia links to each source.

The results of this project demonstrate that a relatively small number of high-profile news sources are extremely prevalent on Wikipedia. The BBC is cited 337,000 times and The New York Times is cited 299,157 times. Other highly respected sources with a similar breadth of coverage are used far less frequently. For instance, the Wall Street Journal is cited only about 26,000 times. Certain newspapers are cited much more frequently than their circulation would suggest. For instance, the Israeli newspaper Ha’aretz, with a circulation of 65,000, is cited over 14,000 times. The Pittsburgh Post-Gazette, with a circulation of 187,000, is also cited about 14,000 times. By contrast, the St. Louis Post-Dispatch, with a circulation of 196,000, is cited fewer than 3,000 times. Factors that may affect how often a newspaper is used include: whether the newspaper specializes in a controversial subject (Ha’aretz), whether the newspaper’s archives are available free on Google News (The Pittsburgh Post-Gazette) and whether the newspaper has had a paywall for many years (The Wall Street Journal).

Methods
I analyzed the 3 May 2013 dump of English Wikipedia's external links table. I uploaded the table to Amazon S3 and used a Hadoop Streaming job with Amazon Elastic MapReduce to read through the file. I checked to see if each URL in the table corresponded to one of a list of news websites.
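The matching step described above can be illustrated with a minimal Python sketch in the mapper/reducer style that Hadoop Streaming expects. This is not the actual job used for this project. It assumes that the external links table has already been reduced to one URL per line, and the names NEWS_DOMAINS, match_source, map_lines and reduce_counts, along with the four sample domains, are hypothetical stand-ins for the real list of about 250 sources.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample of the news-site list (assumption: the real list
# had about 250 entries and may have been keyed differently).
NEWS_DOMAINS = {
    "bbc.co.uk": "BBC",
    "nytimes.com": "The New York Times",
    "wsj.com": "The Wall Street Journal",
    "haaretz.com": "Haaretz",
}


def match_source(url):
    """Return the news source for a URL, or None if its host is not listed.

    A host matches when it equals a listed domain or is a subdomain of it,
    so "www.nytimes.com" matches "nytimes.com" but "notnytimes.com" does not.
    """
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None  # malformed URL
    if not host:
        return None
    labels = host.lower().split(".")
    # Try progressively shorter suffixes: a.b.c.d, b.c.d, c.d
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in NEWS_DOMAINS:
            return NEWS_DOMAINS[candidate]
    return None


def map_lines(lines):
    """Mapper: yield (source, 1) for every line holding a matching URL."""
    for line in lines:
        source = match_source(line.strip())
        if source:
            yield source, 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted by the mapper for each source."""
    totals = Counter()
    for source, n in pairs:
        totals[source] += n
    return totals
```

In a real streaming job the mapper would read lines from standard input and write tab-separated "source, 1" pairs to standard output, and the reducer would run as a separate script. Here the two steps are plain functions so that they can be tried locally, for example with reduce_counts(map_lines(open("urls.txt"))), where urls.txt is a hypothetical file name.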

Results










Discussion
Despite the diverse, global nature of the Wikipedia community, Wikipedia editors rely disproportionately on a small number of sources. Although the sources most frequently used are generally well respected, the Wikipedia community should consider whether an over-reliance on a small number of sources limits our ability to provide our readers with a diverse range of viewpoints.

The presence of a paywall appears to be a significant factor in Wikipedia's overall use of sources. Wikipedia cites The New York Times nearly 300,000 times. By comparison, Wikipedia cites The Wall Street Journal only 26,000 times. Both newspapers are highly respected and are read worldwide. One possible explanation is that the Journal implemented a paywall in 1997, providing access to articles only to subscribers, while the Times implemented a paywall in 2011. The Guardian, cited 139,012 times, does not use a paywall. As more newspapers move to paywalls, it is possible that Wikipedia editors may rely more heavily on the remaining free sources.

The Pittsburgh Post-Gazette (PPG) and the Atlanta Journal-Constitution have similar circulations (about 187,000), but the PPG is cited about three times as frequently (14,382 vs. 4,841 times). The PPG makes its archives available for free on Google News, which means that PPG articles from decades ago can be used in Wikipedia articles.

The Israeli newspapers Haaretz and The Jerusalem Post are cited about 14,000 and 12,000 times respectively, despite each having a circulation of less than 75,000. This suggests that newspapers specializing in controversial subjects are cited far more often on Wikipedia than their circulation alone would predict.
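One way to make these comparisons concrete is to normalize link counts by circulation. The short Python sketch below uses only figures quoted on this page. The Haaretz figures are the rounded values given above (over 14,000 links, circulation 65,000), and the metric "links per 1,000 copies" and the name citations_per_thousand are introduced here purely for illustration.

```python
# (links, circulation) as reported on this page.
# Assumption: Haaretz values are rounded; the other link counts are exact.
SOURCES = {
    "Haaretz": (14_000, 65_000),
    "Pittsburgh Post-Gazette": (14_382, 187_000),
    "Atlanta Journal-Constitution": (4_841, 187_000),
}


def citations_per_thousand(citations, circulation):
    """Links on Wikipedia per 1,000 copies of circulation."""
    return 1000 * citations / circulation


rates = {
    name: citations_per_thousand(citations, circulation)
    for name, (citations, circulation) in SOURCES.items()
}

# Print the sources from most to least heavily cited relative to circulation.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1f} links per 1,000 copies")
```

By this rough measure Haaretz is linked about 215 times per 1,000 copies, the PPG about 77 times and the Atlanta Journal-Constitution about 26 times. Circulation is only a crude proxy for a newspaper's prominence, so these ratios are best read as a way of ranking outliers rather than as precise measurements.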

Wikipedia editors seem to prefer sources that are both popular and well respected. The New York Times is used far more widely than The New York Post or USA Today.

Policy Recommendations
Wikipedia editors have succeeded in the past at encouraging academic databases such as HighBeam and JSTOR to donate access to their services to Wikipedians. We might consider a similar approach for news sources such as the Wall Street Journal, the Financial Times or the Times of London, which are underrepresented on Wikipedia. Given Wikipedia's massive readership, I would imagine that these newspapers would be concerned that their competitors are being cited an order of magnitude more frequently on Wikipedia. Providing a relatively small number of highly active Wikipedia editors with free accounts on their websites might go a long way towards increasing a newspaper's influence on Wikipedia.

Limitations
I analyzed the MediaWiki external links table for English Wikipedia without examining whether the Wikipedia page containing the external link was a Wikipedia article or a supplementary page, such as a discussion page, a user page or a Wikipedia policy page. As a result, this analysis includes links from every area of Wikipedia, not just from articles. I don't consider this a major problem because my overall goal is to examine what sources Wikipedia editors are using, and the links on Wikipedia's policy, user and discussion pages reflect the choices that editors make. For instance, a Wikipedia editor might argue that an article should be deleted because the topic is not notable and another editor might provide links to newspaper articles to demonstrate that the subject is notable enough for inclusion in the encyclopedia. While these links are less visible to the average reader than links within an article, they are still part of the process of building the encyclopedia.

In some cases, Wikipedia editors cite a news source without linking to it. For instance, an editor might provide a textual citation to an article that appeared on page 3 of the New York Times in 1940. This analysis does not account for these purely textual citations.

In certain cases, Wikipedia editors link to a newspaper article that appears in an archiving service or aggregator. These include: Google News (124,416 links), ProQuest (30,414 links), HighBeam (27,653 links) and LexisNexis (1,591 links). In the case of Google News and ProQuest, I looked at each URL and determined the newspaper in which it was originally published. I then added the link to that newspaper's total. I did not do this for HighBeam or LexisNexis. There were too few LexisNexis links for this approach to be worthwhile, and HighBeam contains many academic journals and trade publications that are outside the scope of this project.

Directions for Future Research
It would be interesting to examine the history of Wikipedia articles to determine the rate at which different news sources were added. When newspapers such as the New York Times established paywalls, did Wikipedia editors start to use them less often?

I would like to use the geotags in Wikipedia articles to create maps of the world showing locations that are sourced to specific newspapers. This would help show whether Wikipedia uses particular newspapers as sources on a particular region, on a particular country or for topics worldwide.