Wikipedia:Requests for comment/infobox template coherence

Achieving a global infobox scheme for Wikipedia is not a trivial task. As a first step toward such coherence, we propose an initial effort that already brings benefits to Wikipedia.

Description of the problem
Wikipedia uses hundreds of infobox templates to describe various entity types such as NFL teams, schools in Canada, train stations etc. These infoboxes are maintained separately and do not share a common vocabulary. Attributes with the same meaning appear under several different names and spellings (e.g. birth_place, birthPlace, origin). This makes it harder to check consistency within Wikipedia infoboxes and across different language editions, and it makes it hard for external tools to reuse the information in infoboxes.

Goals

 * Short term: Establish the currently missing links between synonymous template attributes. Wikipedia authors can then more easily reuse existing template attributes and thus improve coherence. External applications can parse and extract information from Wikipedia more effectively, thus helping to spread and utilize the data contained in Wikipedia.
 * Medium term: The template annotations can be used by Wikipedia authors to check hundreds of articles for factual inconsistencies, such as outdated population figures, conflicting birth dates etc. This will be a powerful tool for helping to improve the quality and consistency of Wikipedia.
 * Long term: Provide consensus about which properties should be used in templates and what data they should contain. This consensus can also serve as a guideline when designing a new template. Creating a common schema is difficult and needs discussion. Once the schema is stable, it can be merged into template definitions and/or implemented in the underlying software.

Suggested solution
MetaWiki could be used to align different infobox templates. Similar types of infoboxes can be grouped into classes, e.g. pages using Template:Infobox officeholder or Template:Infobox badminton player describe persons. Different spellings/names of attributes in infoboxes can be aligned, e.g. the use of birthPlace or birth_place in infoboxes could be mapped to a property birthplace. For each property and class, a page can be created in MetaWiki that contains its formal (RDF/OWL) and textual description. Since the method relies on the RDF and OWL Semantic Web standards, a variety of external extractors and tools are able to understand and use it.
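The alignment idea above can be sketched in a few lines of Python. This is only an illustration, not the proposed template syntax: the spellings birthPlace/birth_place, the property birthplace and the two template names come from the text, while the table layout, the class name "Person" and the function names are assumptions.

```python
# Hypothetical lookup tables (layout and names are assumptions for illustration).
TEMPLATE_CLASS = {
    "Infobox officeholder": "Person",
    "Infobox badminton player": "Person",
}
ATTRIBUTE_ALIASES = {
    "birthPlace": "birthplace",
    "birth_place": "birthplace",
}


def canonical_property(attribute: str) -> str:
    """Return the shared property name, or the attribute itself if unmapped."""
    return ATTRIBUTE_ALIASES.get(attribute, attribute)


def normalize(infobox: dict) -> dict:
    """Rename all attributes of one infobox to their shared property names."""
    return {canonical_property(name): value for name, value in infobox.items()}


print(normalize({"birth_place": "Berlin", "name": "Example"}))
```

With such a table, two infoboxes that spell the attribute differently end up with the same key, which is what makes cross-template comparison possible.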

Technical implementation
Infobox templates are usually documented in their doc subpages, e.g. Template:Infobox badminton player/doc or Template:Infobox musical artist/doc. The suggested coherence mappings could be added there in the form of a machine-readable template. Modularity is achieved by including other templates in the main template and by providing interpretation suggestions, here called parse hints. Initially there will be three basic kinds of subtemplate:
 * map one to one (e.g. Occupation can be mapped to occupation), realized by Template:Template attribute, see below.
 * split properties (born, containing birthplace and date, can be split into two); Template:Template attribute can be reused, see below.
 * merge properties (geocoordinates are often scattered across the whole infobox, sometimes using 6 properties), realized by Template:Template attribute merge.
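The three kinds of mapping listed above can be illustrated with a minimal Python sketch. It does not reproduce the subtemplates themselves; the function names, the regular expression used as a parse hint, and the attribute names lat_d, lat_m and lat_s are all assumptions made for the example.

```python
import re


def map_one_to_one(infobox, source, target):
    """Copy one attribute to one property, e.g. Occupation -> occupation."""
    return {target: infobox[source]} if source in infobox else {}


def split_property(infobox, source, pattern, targets):
    """Split one attribute into several properties (one regex group per target)."""
    match = re.fullmatch(pattern, infobox.get(source, ""))
    return dict(zip(targets, match.groups())) if match else {}


def merge_properties(infobox, sources, target, combine):
    """Merge several attributes into one property; skipped unless all are present."""
    if not all(name in infobox for name in sources):
        return {}
    return {target: combine([infobox[name] for name in sources])}


def to_decimal(parts):
    """Combine degrees, minutes and seconds into decimal degrees."""
    degrees, minutes, seconds = (float(p) for p in parts)
    return degrees + minutes / 60 + seconds / 3600


infobox = {"Occupation": "Singer", "born": "12 May 1970, Berlin",
           "lat_d": "52", "lat_m": "30", "lat_s": "0"}
print(map_one_to_one(infobox, "Occupation", "occupation"))
print(split_property(infobox, "born", r"(.+?),\s*(.+)", ["birthdate", "birthplace"]))
print(merge_properties(infobox, ["lat_d", "lat_m", "lat_s"], "latitude", to_decimal))
```

Real infobox values are far less regular than this sample, so the pattern passed to split_property stands in for whatever parse hint the mapping template would carry.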

Example of template source code
Example for: Template:Infobox musical artist

This says that:
 * articles having a "musical artist" infobox belong to the (OWL) class "Musician"
 * the attribute/parameter "born" in a "musical artist" infobox is mapped to the (OWL) property "birthdate", where the value is supposed to be a date
 * etc.

The classes and properties themselves are described on MetaWiki (one page for each class or property). For instance, the page for birthplace is such a description. The approach is language independent, and metadata needs to be provided only once per class/property.
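To make the meaning of such a mapping concrete, the following Python sketch turns one infobox into subject/predicate/object statements. The class "Musician", the attribute "born" and the property "birthdate" are taken from the description above; the dictionary layout, the "meta:" prefix and the sample article are assumptions, and the output is a simplified stand-in for real RDF.

```python
# Hypothetical in-memory form of the mapping described above.
MAPPING = {
    "template": "Infobox musical artist",
    "class": "Musician",
    "properties": {"born": ("birthdate", "date")},
}


def to_triples(article, infobox, mapping):
    """Return (subject, predicate, object) tuples for one article's infobox."""
    triples = [(article, "rdf:type", "meta:" + mapping["class"])]
    for attribute, (prop, datatype) in mapping["properties"].items():
        if attribute in infobox:
            # The object keeps the raw value together with its expected datatype.
            triples.append((article, "meta:" + prop, (infobox[attribute], datatype)))
    return triples


sample = {"born": "1970-05-12", "genre": "Jazz"}  # invented sample values
for triple in to_triples("Example Artist", sample, MAPPING):
    print(triple)
```

Attributes without a mapping (here "genre") are simply ignored, so a partial mapping is still usable.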

Corresponding template in Portuguese: Predefinição:Info artista musical

Note: This RFC is limited to the English Wikipedia, i.e. consensus on it does not mean that such templates can be published without prior discussion on other language editions. The example only serves as an illustration of future multi-language support.

Example for: Template:Infobox_high_school_marching_band

Example for: Template:Infobox badminton player

Example for: Template:Infobox settlement

Mockup of the template
(This is what the template looks like when viewed.)

Running Example
For clarification, we created a running example on User:SebastianHellmann/example, where you can see the template annotation in action.

Demonstration of benefits for Wikipedians

 * Improved access to Wikipedia: Keyword string search helps to find a single article, but does not help in identifying sets of articles. A DBpedia-based browser allows navigating Wikipedia with the help of clearly defined classes and an already existing mapping of infobox template properties. There are e.g. at least 181 famous musicians who play guitar and come from Canada (see example). This information is actually contained in Wikipedia, but can otherwise be found only by string search and by browsing many pages. Faceted Wikipedia/DBpedia search was selected as one of the 365 most innovative ideas in Germany by the German federal government. Other examples: browse persons born in the USA in 1970, exploring connections between Berlin and London, gFacet browser
 * Improved querying:
   * Famous German musicians born in Berlin
   * All soccer players who played as goalkeeper for a club whose stadium has more than 40,000 seats and who were born in a country with more than 10 million inhabitants


 * Infobox inconsistencies can be easily spotted with the help of SPARQL. The following query displays 10 articles that use Template:Infobox language and contain plain text in the language family field instead of a (possibly more desirable) link to the language article. See the result here. This query contacts the live and synchronized instance of DBpedia, so changes to Wikipedia will be reflected in the result of the query within minutes.
 * Semantic Web standards (RDF/OWL) make annotations usable by several extraction frameworks (with DBpedia being one implementation)
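The inconsistency check described in the list above (plain text where a link is expected) can also be illustrated without SPARQL. The Python sketch below is not the query referred to in the text; it works on raw wikitext values instead, and the field name "family" and the sample data are assumptions.

```python
import re

# Matches a wikilink such as [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")


def plain_text_values(articles, field):
    """Return titles whose `field` value is non-empty and contains no wikilink."""
    return [
        title
        for title, infobox in articles.items()
        if infobox.get(field, "").strip() and not WIKILINK.search(infobox[field])
    ]


articles = {  # invented sample data
    "Language A": {"family": "[[Indo-European languages|Indo-European]]"},
    "Language B": {"family": "Indo-European"},
    "Language C": {},
}
print(plain_text_values(articles, "family"))
```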

Drawbacks

 * For the initial infobox coherence mapping, 400 templates would need to be added (can be prepared by the DBpedia team). If doc subpages of infoboxes do not exist, new pages would need to be created.
 * Duplicate information: The mapping template contains some of the infobox attributes. If those infobox attributes change, then the mapping template may also have to be modified. If that is not done, the two can get out of sync. However, infobox template changes are rare, and most popular infoboxes have a syntax description and an example on their doc page, which also need to be changed. Moreover, mapping templates can be checked automatically to detect whether they are outdated.
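The automatic check for outdated mapping templates mentioned above could be as simple as a set comparison. The following Python sketch assumes that the parameter names of the infobox and of its mapping have already been extracted; the function name and the sample names are hypothetical.

```python
def mapping_drift(infobox_params, mapped_params):
    """Compare an infobox's parameters with the parameters its mapping refers to."""
    infobox_params, mapped_params = set(infobox_params), set(mapped_params)
    return {
        # Parameters the infobox defines but the mapping does not cover.
        "unmapped": sorted(infobox_params - mapped_params),
        # Parameters the mapping still mentions but the infobox no longer has.
        "stale": sorted(mapped_params - infobox_params),
    }


print(mapping_drift(["name", "born", "genre"], ["born", "origin"]))
```

A non-empty "stale" list is the out-of-sync case described above; "unmapped" entries are not errors, only candidates for future mappings.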

Alternatives

 * Fix different infobox attribute spellings: While this would be desirable, it requires consensus on which changes to make, administrative effort and millions of page changes. The RFC described here may well contribute to cleaner infobox templates, but this appears to be a medium- or long-term development. Currently, there are more than 49,000 infobox attributes (and a rough estimate indicates that they will be mapped to approx. 2000 RDF properties). What a cleanup of infoboxes does not achieve is a hierarchy of infoboxes: for instance, a Canadian senator naturally is a politician and a person as well. So a list of all persons should include Canadian senators, which is easily achieved by the vocabulary on MetaWiki (using rdfs:subClassOf), but not by cleaning up the different spellings of infobox attributes. Another problem that a cleanup would probably not solve is that a particular attribute can have different units, e.g. feet or meters, in different infobox templates; this is covered by the parse hints in the mapping template.
 * Include the mapping directly in the infobox template definition: An advantage of this is a closer integration of mapping and definition of the infobox. A drawback is that it requires technical skills and makes it hard for editors to modify the mappings (not every Wikipedian is able to, or even should, modify complex template definitions). The inclusion in the doc subpage is a compromise, which ensures that the mapping is close to the definition of the template and can still be maintained without much effort.
 * Microformat: Microformats are a way to describe specific content types like address and event information. It is not feasible to extend them to cover all domains, since each meaningful microformat needs community and tool support. It is also clear that the microformat community does not want to achieve this. Quoting : "Microformats are not [...] Infinitely extensible and open-ended [...] A panacea for all taxonomies, ontologies, and other such abstractions [...] Defining the whole world, or even just boiling the ocean".
 * Semantic MediaWiki: Semantic MediaWiki (SMW) is a more direct way of introducing Semantic Web knowledge representation standards in infobox templates. While the deployment of SMW on different Wikipedia editions can be seen as desirable, it is unlikely to be deployed in the short term. The approach described here could later be used to bootstrap a deployment of SMW. (Both approaches rely on the same standards.) Also, SMW and the infobox annotation approach described here would complement each other: once the usefulness of semantic representations in Wikipedia has been shown using the infobox annotations, more fine-grained semantic representations can be integrated directly in the article texts using SMW.
 * External Database: Currently, the DBpedia extraction framework has its own database with many mappings and its own ontology. This RFC is aimed at moving from a closed database to an open approach. One could reject the proposal and let the DBpedia team continue their work internally. Disadvantages: Long-term maintenance of mappings will be difficult when they are kept up to date in a closed, separate system accessible only by a few people. Wikipedians have no control over how the data is extracted, so they cannot influence how external tools use it. No convergence towards semantic support in Wikipedia can be achieved. It is harder to use the mappings directly in Wikipedia, for instance for checking dates etc. (It is unlikely that Wikipedia will let an external project determine what values in infoboxes should look like.)
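The hierarchy argument made in the first alternative above (a Canadian senator is also a politician and a person) can be sketched in Python as a walk along subclass links. The table below mimics rdfs:subClassOf statements; the class identifiers and the sample articles are assumptions.

```python
# Hypothetical subclass table: each class points to its direct superclass.
SUBCLASS_OF = {"CanadianSenator": "Politician", "Politician": "Person"}


def superclasses(cls):
    """Return all superclasses of `cls`, nearest first (cycles are cut off)."""
    chain, seen = [], set()
    while cls in SUBCLASS_OF and cls not in seen:
        seen.add(cls)
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def instances_of(target, typed_articles):
    """Return articles whose class is `target` or one of its subclasses."""
    return [
        article
        for article, cls in typed_articles.items()
        if cls == target or target in superclasses(cls)
    ]


typed = {"Senator X": "CanadianSenator", "Town Y": "Settlement"}  # invented data
print(instances_of("Person", typed))
```

Renaming attributes alone cannot provide this: the list of persons falls out of the class links, not out of any shared spelling.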

Supporting Data
In the English Wikipedia there are currently about 49,111 different properties/parameters used in templates. List of property usage for each template: excerpt, whole (5MB)

Discussion, ideas, remarks
Moved to the discussion page.