User:Zazpot/Charles Matthews session 2017 07 06

See: Charles's notes.

Wikidata

 * ~30m entities (of which ~0.5m were about scientific papers before Tom Arrow made Fatameh live in ~early 2017; ~1.3m as of now).
 * ~150m relations (of which ~50% are relations to external identifiers).
 * plus metadata.

Relations
We can use these to learn (see bubble chart) that there is a reasonable spread of subjects covered, but it could be more even still.

Entity disambiguation
To give a sense of the size of the problem: ~1.2m hand-curated disambiguation pages exist in Wikipedia. This is a tremendous effort, but not scalable given how (relatively) few active Wikipedians there are.

There is a tool outside MediaWiki called OpenRefine (spun off from Google Refine) whose reconciliation features are intended to assist with disambiguation.

Within Wikidata, one can perform a disambiguation search, e.g. d:Special:ItemDisambiguation?language=en&label=Brampton. These searches rely on finding matches between a given item's label/alias and other items' labels/aliases.
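The label/alias matching behind such a search can be sketched as follows; the item IDs and labels below are hypothetical stand-ins for Wikidata entries.

```python
# Minimal sketch of label/alias matching for disambiguation.
# The items here are invented for illustration.
ITEMS = {
    "Q1000001": {"label": "Brampton", "aliases": []},
    "Q1000002": {"label": "Brampton, Cumbria", "aliases": ["Brampton"]},
    "Q1000003": {"label": "Bramton", "aliases": []},
}

def disambiguation_candidates(label, items):
    """Return item IDs whose label or any alias exactly matches `label`."""
    label = label.casefold()
    return sorted(
        qid for qid, entry in items.items()
        if entry["label"].casefold() == label
        or any(a.casefold() == label for a in entry["aliases"])
    )

print(disambiguation_candidates("Brampton", ITEMS))  # ['Q1000001', 'Q1000002']
```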

Note: in the context of MediaWiki, a "Special page" is one that is dynamic, i.e. created on the fly each time it is fetched.

mix'n'match is a tool by Magnus Manske that is a great help with disambiguation via external identifiers.
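The core idea of identifier-based matching can be sketched like this: an entry in an external catalogue is linked to a Wikidata item when both carry the same external identifier. The items, names, and IDs here are all hypothetical.

```python
# Sketch of matching catalogue entries to Wikidata items via a shared
# external identifier (e.g. a VIAF ID). All data is invented.
wikidata_items = {
    "Q100": {"viaf": "12345"},
    "Q200": {"viaf": "67890"},
}
catalogue_entries = [
    {"name": "John Smith", "viaf": "12345"},
    {"name": "Jane Doe", "viaf": "99999"},  # no match: needs manual curation
]

def match_by_identifier(entries, items, key="viaf"):
    """Pair each catalogue entry with the item sharing its identifier, if any."""
    index = {props[key]: qid for qid, props in items.items()}
    return [(e["name"], index.get(e[key])) for e in entries]

matches = match_by_identifier(catalogue_entries, wikidata_items)
# [('John Smith', 'Q100'), ('Jane Doe', None)]
```

Entries with no match (`None`) are exactly the cases that still need human attention.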

Open knowledge
For open knowledge to be useful, it must be reliable and verifiable.

WikiFactMine is a tool created by Tom Arrow that downloads papers from Europe PMC every day to a set of virtual machines hosted on WMF Labs. It processes them with Elasticsearch and makes the results available via the WikiFactMine API.

WikiFactVis consumes this API to visualise the assertions produced by WikiFactMine. These assertions are candidates for "facts" that can then be inserted into Wikidata, etc.

These candidates are simply based on co-occurrence. The aim is for humans to decide which candidates are valid and which are not. So, we are now working on an interface to make that task user-friendly.
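Co-occurrence candidate generation can be sketched as counting term pairs that appear together in the same paper; the papers and terms below are invented for illustration, and the threshold is an assumption.

```python
from collections import Counter
from itertools import combinations

# Sketch of generating candidate "facts" from term co-occurrence
# within papers. All papers and terms are hypothetical.
papers = [
    {"zika virus", "microcephaly", "brazil"},
    {"zika virus", "microcephaly"},
    {"influenza", "brazil"},
]

def cooccurrence_candidates(papers, min_count=2):
    """Return term pairs that co-occur in at least `min_count` papers."""
    counts = Counter()
    for terms in papers:
        counts.update(combinations(sorted(terms), 2))
    return [pair for pair, n in counts.items() if n >= min_count]

print(cooccurrence_candidates(papers))  # [('microcephaly', 'zika virus')]
```

Note that a surviving pair is only a candidate; whether it expresses a valid relation is exactly what the human review interface is for.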

John Eyre
Charles took pages from CCED, Venn, and Foster for people named John Eyre, and asked us to try to disambiguate the individuals to whom they refer.

Using SPARQL to see around corners
Example: parents of Etonians.
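Such a query can be assembled as a string and sent to the Wikidata Query Service. A minimal sketch follows; the property and item IDs used (P69 "educated at", P22/P25 father/mother, and the QID assumed here for Eton College) should be double-checked against Wikidata before use.

```python
# Sketch of building a "parents of Etonians" SPARQL query as a string.
ETON = "Q192088"  # assumed QID for Eton College; verify before running

def parents_of_alumni(school_qid):
    """Build a SPARQL query for parents of people educated at a school."""
    return f"""
SELECT DISTINCT ?parent ?parentLabel WHERE {{
  ?pupil wdt:P69 wd:{school_qid} .      # pupil educated at the school
  ?pupil (wdt:P22|wdt:P25) ?parent .    # pupil's father or mother
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

query = parents_of_alumni(ETON)
```

The "seeing around corners" is in the property path: we never name any Etonian, yet reach their parents in one query.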

Adding "main subject" relations
E.g. find scientific papers whose titles end in "virus". Then click through to the Wikidata items for those papers and add "main subject" statements where they are not already present, and where you have enough data at hand to do so confidently.

Fact mining
There are many relations that are only true with qualification. These are, in a sense, not pure triples: they are quads or quints, etc. Look at some newspaper headlines. Are they factual statements? If so, are they simple relations (subject-predicate-object) - call these "R"? Or do they also require one qualifier ("RQ"), or two ("RQQ"), and so on?
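One way to make the R/RQ/RQQ distinction concrete is a data structure holding a triple plus zero or more qualifier pairs; the statements below are invented examples.

```python
from dataclasses import dataclass, field

# Sketch of plain relations ("R") versus qualified ones ("RQ", "RQQ", ...):
# a subject-predicate-object triple plus zero or more qualifiers.
@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    qualifiers: dict = field(default_factory=dict)

    def arity(self):
        """'R' for a bare triple, with one extra 'Q' per qualifier."""
        return "R" + "Q" * len(self.qualifiers)

plain = Statement("Douglas Adams", "author of", "Hitchhiker's Guide")
qualified = Statement("X", "mayor of", "Y",
                      {"start time": "2016", "end time": "2020"})
print(plain.arity(), qualified.arity())  # R RQQ
```

This mirrors how Wikidata itself models statements: a main snak plus optional qualifier snaks.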

Using Wikidata queries (via PetScan) to track Wikimedia projects' success rates
w:Wikipedia:WikiProject Dictionary of National Biography/Tracking