Wikipedia:Data mining Wikipedia

Wikipedia's open, crowdsourced content can be data mined.

Its articles, their pageviews, WikiProject assessments, infoboxes, categorization information and a variety of metadata (such as data on page edits) can all be extracted and used for analysis, statistics and the creation of new insights in general.
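As a rough illustration of this kind of extraction, the following sketch pulls field/value pairs out of an infobox in an article's raw wikitext using only Python's standard library. The function name `parse_infobox` is hypothetical. The sketch assumes the template name starts with `{{Infobox` and that each field sits on its own line; real infoboxes vary, so a dedicated wikitext parser is likely a better choice for serious work.

```python
import re

def parse_infobox(wikitext):
    """Return `| key = value` pairs found on their own lines in the first infobox."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    # Find the "}}" that closes the infobox, skipping over nested templates.
    depth, i, end = 0, start, None
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                end = i
                break
        else:
            i += 1
    if end is None:
        return {}  # unbalanced braces: give up rather than guess
    fields = {}
    for line in wikitext[start:end].splitlines():
        # Assumption: one "| key = value" field per line.
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields
```

Values are returned as raw wikitext, so nested templates and links inside a value would still need further processing.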

Natural language processing may be used to process article contents. This page is not about the use of data mining with the intent to improve Wikipedia.

Types
Data mining involves six common classes of tasks:


 * Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
 * Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
 * Clustering – The task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data. For example, the Wikipedia Data Mining Project aims to discover the internal patterns in a Wikipedia data set and to explore various data mining algorithms; clustering algorithms can group Wikipedia articles by similarity and organize thousands of data objects into a tree that helps people view the content.
 * Classification – The task of generalizing known structures to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
 * Regression – attempts to find a function that models the data with the least error.
 * Summarization – providing a more compact representation of the data set, including visualization and report generation.
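The clustering task can be sketched for a handful of articles as follows. This is a minimal illustration, not the method of any project mentioned on this page: it assumes articles are compared by the Jaccard similarity of their word sets and uses a simple agglomerative scheme in which a merged cluster is represented by the union of its members' words. The function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_tree(articles):
    """Repeatedly merge the two most similar clusters into a binary tree.

    `articles` maps title -> text. Returns nested tuples of titles.
    """
    # Each cluster is (tree, word_set); merging unions the word sets.
    clusters = [(title, tokens(text)) for title, text in articles.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = jaccard(clusters[i][1], clusters[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0] if clusters else None
```

The pairwise search is quadratic in the number of clusters at every step, so this form only suits small samples; clustering thousands of articles would call for vectorized features and an optimized library.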

Instances

 * Watson is a question answering (QA) computing system that IBM built to apply advanced natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open-domain question answering. It also made use of the full text of Wikipedia.
 * Pantheon (2014) is a project developed by the Macro Connections group at the MIT Media Lab that collects, analyzes, and visualizes data on historical cultural popularity and production. The data used to create Pantheon 1.0 were the 11,341 Wikipedia biographies that existed in more than 25 languages in May 2013. Specifically, the number of Wikipedia language editions that have an article about each historical figure, and the pageviews of those articles, were used.
 * In November 2015, Jose Lages at the University of Franche-Comté in France and colleagues applied the PageRank, 2DRank and CheiRank algorithms to the way universities are mentioned on Wikipedia to produce a world ranking of the most influential universities.
 * Kylin is a system that extracted information from Wikipedia by using infoboxes to automatically create training data for learning relation-specific extractors.
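To give a feel for the link-based ranking mentioned in the university example, here is a minimal power-iteration sketch of PageRank over a small link graph. It is an illustration under stated assumptions, not the procedure used in that study: the damping factor of 0.85 and the fixed iteration count are conventional defaults chosen here, the function name is hypothetical, and 2DRank and CheiRank are not shown.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In a Wikipedia setting the `links` dictionary would be built from the hyperlinks between articles, and the pages of interest would then be sorted by their resulting score.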

Help and tools

 * Mining Wikipedia For Awesome Data, presentation
 * Wikipedia's API
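As a hedged sketch of how the API might be used for mining page-edit metadata, the following builds a MediaWiki Action API query for a page's recent revisions and flattens a JSON response into rows. The helper names are hypothetical, the choice of `rvprop` fields is only an example, and the parsing assumes the list-style response layout returned with `formatversion=2`. No network request is made here.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def revisions_query_url(title, limit=5):
    """Build an Action API URL asking for recent revision metadata of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return API_URL + "?" + urlencode(params)

def extract_revisions(response_text):
    """Flatten a formatversion=2 response into (title, timestamp, user) tuples."""
    data = json.loads(response_text)
    rows = []
    for page in data.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            rows.append((page.get("title"), rev.get("timestamp"), rev.get("user")))
    return rows
```

The URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, ideally with a descriptive User-Agent header and modest request rates. For bulk mining, the database dumps are generally a better fit than many individual API calls.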

Legal considerations
Wikipedia and its sister projects supported by the Wikimedia Foundation (e.g. Wikimedia Commons and Wikisource) are hosted on servers (see Wikimedia servers on Meta-Wiki) at a data center in the state of Virginia, with an emergency backup data center in the state of Texas; caching servers are located in the Netherlands and Singapore. The Wikimedia Foundation is a non-profit incorporated in Florida and based in California; the terms of service for all Wikimedia Foundation websites are governed by the laws of the state of California and U.S. federal law.

From within the U.S.
Data mining of information on Wikipedia performed from within the U.S. is, with one exception, unlikely to be unlawful or a tortious violation of others' rights. The information (text of pages, past revisions, IP addresses) is public, so mining it is unlikely to run afoul of U.S. privacy laws, and, at least when mining on Wikipedia, the use is likely to be considered fair use of copyrighted materials that does not infringe on the rights of the copyright holders (generally, the people who add content to the website). Additionally, privacy laws in the U.S. typically do not protect information for which there is no reasonable expectation of privacy. All contributors, including those who edit from an IP address without creating an account, agree to the terms of service and irrevocably release their contributions under the CC BY-SA 3.0 and GFDL licenses, and anonymous editors agree that their IP address will be recorded, so it is unlikely that contributors can claim a reasonable expectation of privacy.

The issue of jurisdiction on the internet is not well settled in the courts, so data miners could be subject to the jurisdiction of either the courts of California (based on the terms of service, especially in any disputes with the Wikimedia Foundation) or the courts of the location(s) of the servers accessed for data mining. The one exception is data mining from the U.S. on Wikimedia Foundation servers in the Netherlands or Singapore, in which case an injured party could claim protection under the laws of either country. Since the Wikimedia servers in the Netherlands and Singapore are for caching, this issue can be avoided by mining only from Wikimedia servers in the U.S.

From outside the U.S.
Data mining of information performed from outside the U.S. may violate local law or the rights of others (which can result in costly lawsuits if discovered). The main consideration is privacy law, which should be taken into account whenever any type of personal information (such as user names and IP addresses) is collected when mining. In the European Union, the General Data Protection Regulation (GDPR) strictly regulates the manner in which personal data may be processed, defining 'personal data' as:
 * "any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person." (Art. 4(1))

This appears to include any type of data mining that profiles edits by IP address. However, Art. 85 of the GDPR provides an important exception:
 * (1) "Member States shall by law reconcile the right to the protection of personal data pursuant to this Regulation with the right to freedom of expression and information, including processing for journalistic purposes and the purposes of academic, artistic or literary expression."
 * (2) "For processing carried out for journalistic purposes or the purpose of academic artistic or literary expression, Member States shall provide for exemptions or derogations from Chapter II (principles), Chapter III (rights of the data subject), Chapter IV (controller and processor), Chapter V (transfer of personal data to third countries or international organisations), Chapter VI (independent supervisory authorities), Chapter VII (cooperation and consistency) and Chapter IX (specific data processing situations) if they are necessary to reconcile the right to the protection of personal data with the freedom of expression and information."
 * (3) "Each Member State shall notify to the Commission the provisions of its law which it has adopted pursuant to paragraph 2 and, without delay, any subsequent amendment law or amendment affecting them."

Article 85 of the GDPR does not itself grant an exemption; it requires Member States to enact laws on the subject. As of May 2018, the editor who added this content could not find any guide to how Member States have enacted this provision into their national laws.