User:Yaron K./Future of Wikipedia

Wikidata will mean the end of Wikipedia
Note: though I have a tangential connection to the Wikidata group through my involvement in Semantic MediaWiki, these opinions are strictly my own.

Will Wikipedia still be around in 50 years? The rapid pace of technology tends to make a lot of things obsolete fairly quickly, but some technologies, like Unix and C, are still thriving after 40 or 50 years - and perhaps more relevantly, so are information services like LexisNexis, which just celebrated its 40th birthday this year and remains in widespread use. But for every one of these, there are hundreds of others that are no longer with us, or have faded into obscurity.

But what could replace Wikipedia? An alternative couldn't compete on price or accessibility, since Wikipedia is already free and available everywhere. More advanced options, like an improved interface or even a distributed wiki system, could all eventually be adopted by Wikipedia itself, so those are not necessarily the answer either. But I do think that Wikipedia will be replaced in the next 50 years, and I think that it will be Wikipedia that makes itself obsolete. And the instrument by which Wikipedia destroys itself (so to speak) will be Wikidata.

For those who don't know about Wikidata, it's an ongoing project that is meant to serve as a single repository of structured data that can be used to automatically populate some of the standard features of Wikipedia pages. These features are being replaced in order of how schematic they are. The first part to be replaced was the most machine-like: the interlanguage links at the bottom of pages. Until early 2013, that set of interlanguage links had to be maintained manually, or at least semi-manually, by either humans or "bots". Wikidata replaced most of this effort: it has a single page for each topic that lists the article's name in each language's Wikipedia, and that list is now used to display the set of interlanguage links automatically.
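To make that concrete, here is a minimal sketch (in Python, using the requests library) of how a site could pull that single list of sitelinks from Wikidata's public API. The item ID shown (Q64, Berlin) is just an example, and the response handling is simplified:

```python
# Minimal sketch: fetch the sitelinks ("interlanguage links") that Wikidata
# stores for a single item, via the public wbgetentities API.
# Q64 (Berlin) and the exact response layout are illustrative.
import requests

API_URL = "https://www.wikidata.org/w/api.php"

def get_sitelinks(item_id):
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "sitelinks",
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    return data["entities"][item_id]["sitelinks"]

for site, link in sorted(get_sitelinks("Q64").items()):
    # e.g. "enwiki -> Berlin", "frwiki -> Berlin", "hewiki -> ברלין"
    print(site, "->", link["title"])
```

One list, stored once, and every language's article simply renders it.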

The next step for Wikidata is taking care of infobox data. Infoboxes are the tables of high-level data found on the side, near the top, of most Wikipedia articles. For infoboxes, similarly, there's a lot of redundancy between different languages - the population of a city, for example, currently needs to be stored and maintained separately for the article about that city in each language.
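The same API exposes those facts as "claims". Here is a sketch of reading a city's population once from Wikidata (property P1082) instead of maintaining a copy in every language's infobox; the parsing is deliberately simplified and ignores qualifiers such as the date the figure refers to:

```python
# Sketch: read one infobox-style fact (a city's population) from Wikidata.
# P1082 is Wikidata's "population" property; taking the first statement and
# ignoring its qualifiers is a simplification for illustration only.
import requests

API_URL = "https://www.wikidata.org/w/api.php"

def get_population(item_id):
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "claims",
        "format": "json",
    }
    entity = requests.get(API_URL, params=params).json()["entities"][item_id]
    statement = entity["claims"]["P1082"][0]              # first population statement
    amount = statement["mainsnak"]["datavalue"]["value"]["amount"]
    return int(float(amount))                             # amounts arrive as strings like "+3644826"

print(get_population("Q64"))  # every language's infobox could render this one value
```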

After that, hopefully, will come categories. There's an explosion of categories on Wikipedia, especially in the larger languages, and all of these are manually maintained. To a large extent these categories mirror the information contained in the infobox - so that if a person's infobox indicates that they were born in 1665, died in 1722, had allegiance to Portugal and attained a rank of General (like this guy), chances are good that the article will end up in categories like "1665 births", "1722 deaths" and "Portuguese generals". Having the information stored in Wikidata will allow the aggregation that categories provide to be done automatically - and could potentially allow for much more flexible aggregation and querying. What if I want to see only the generals that came from a particular city in Portugal? Or only the ones who lived in the 18th century? Wikidata could in theory allow that sort of aggregation to be done on the fly, by users, without the need for editors to spend a lot of time creating a category structure that is, after all, only a guess as to what users want to see.
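Here is a sketch of what such an on-the-fly query might look like, written in SPARQL against the query service that now sits on top of Wikidata (query.wikidata.org). The property and item IDs are real, but the query itself is only illustrative - a real version would also narrow the ranks down to the various kinds of "general":

```python
# Sketch: the kind of on-the-fly aggregation that categories approximate today,
# expressed as a SPARQL query against Wikidata's query service.
# P27 = country of citizenship, P410 = military rank, P569 = date of birth,
# Q45 = Portugal. Illustrative only.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?personLabel ?rankLabel WHERE {
  ?person wdt:P27 wd:Q45 ;      # citizen of Portugal
          wdt:P410 ?rank ;      # holds some military rank
          wdt:P569 ?born .      # date of birth
  FILTER(YEAR(?born) >= 1700 && YEAR(?born) < 1800)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

rows = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}).json()
for row in rows["results"]["bindings"]:
    print(row["personLabel"]["value"], "-", row["rankLabel"]["value"])
```

No category tree needs to exist for this list to be produced; the reader, or the reading application, decides what to aggregate.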

Similarly, there are all the pages that hold lists of things - so many that the English Wikipedia has a page with a List of lists of lists. These pages, too, could be automatically aggregated to some extent.

Moving on, there is structured data within articles themselves. Pages about actors, for example, often include a table listing all of that actor's roles, along with the film/TV series/etc. and a year or years for each. Any such table could be generated automatically by a query of Wikidata's data.
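A sketch of that, again against a Wikidata-style query endpoint: P161 ("cast member") and P577 ("publication date") are real properties, while the actor's item ID below is just a placeholder for whichever actor the article is about:

```python
# Sketch: a filmography table generated by query rather than maintained by hand.
# ACTOR_ID is an illustrative stand-in for the actor's Wikidata item ID.
# A film with several release dates would appear more than once here - a real
# query would deduplicate and handle TV series separately.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
ACTOR_ID = "Q38111"   # placeholder item ID

QUERY = """
SELECT ?filmLabel (YEAR(?released) AS ?year) WHERE {
  ?film wdt:P161 wd:%s ;     # the actor appears as a cast member
        wdt:P577 ?released . # publication date of the work
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?year
""" % ACTOR_ID

rows = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}).json()
for row in rows["results"]["bindings"]:
    print(row["year"]["value"], row["filmLabel"]["value"])
```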

Do you see where we're going here? With each step, the amount of content that users have to maintain on each Wikipedia article keeps decreasing. And after interlanguage links, infoboxes, categories and tables of data like filmographies and discographies are all automated, that really just leaves one thing, though it's the big one: the body of each article - the text itself. I believe that, at some point in the future, the text of articles within Wikipedia will be generated automatically from Wikidata.

Talking about automatically-generated text would seem more hypothetical if this kind of thing weren't already happening. An article by Steven Levy in the April 2012 issue of Wired, "Can an Algorithm Write a Better News Story Than a Human Reporter?", covered the current state of the art: it mostly focused on a single company, Narrative Science, that automatically generates articles about sports games and stock market fluctuations, among other topics, based on the statistics fed into it. Here is an example of an article created by Narrative Science that summarizes a baseball game between two children's teams:

Friona fell 10-8 to Boys Ranch in five innings on Monday at Friona despite racking up seven hits and eight runs. Friona was led by a flawless day at the dish by Hunter Sundre, who went 2-2 against Boys Ranch pitching. Sundre singled in the third inning and tripled in the fourth inning...

Nothing really awkward about that text.

Now, sports games and market fluctuations are ideal test cases for this sort of thing because they involve a very finite set of rules and possibilities; you can literally create such an article from a set of names and numbers. On the other hand, perhaps a lot of Wikipedia articles are similarly constrained. An article about an animal species or a chemical compound will most likely have a predictable structure, as will, perhaps, articles about people with specific occupations - author, engineer, and so on.
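As a toy illustration - with entirely invented data and a single hard-coded template - here is how a handful of structured facts can turn into prose of the kind quoted above:

```python
# Toy illustration: once the facts are structured, a constrained domain can be
# rendered as prose from a template. The data and wording here are invented.
game = {
    "loser": "Friona", "winner": "Boys Ranch",
    "loser_runs": 8, "winner_runs": 10, "innings": 5,
    "star": "Hunter Sundre", "star_hits": 2, "star_at_bats": 2,
}

TEMPLATE = (
    "{loser} fell {winner_runs}-{loser_runs} to {winner} in {innings} innings. "
    "{loser} was led by {star}, who went {star_hits}-{star_at_bats} at the plate."
)

print(TEMPLATE.format(**game))
```

Real systems use far more sophisticated selection and phrasing, of course, but the underlying move is the same: names and numbers in, sentences out.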

To be sure, there are some major challenges here. Storing the information necessary to generate a full-sized article is different in both quantity and quality from storing the information needed for an infobox. It's one thing to store the fact that a certain actor played a certain role in a certain movie; it's another to store that that actor was cast in that role, then dropped out, then rejoined after the film experienced some production delays, etc. It would take perhaps 100 times the data, and a data structure that's quite a bit more flexible, to be able to generate that kind of text. On the other hand, perhaps Wikidata could end up holding a combination of data and text - data for the basic stuff, and snippets of text that serve as commentary around facts, or plot descriptions, etc.

I will admit that automatically-generated text will never beat a well-constructed article created by humans - or perhaps I should say, by the time that does happen, the creation of encyclopedia articles may be the least of our worries. But let's postulate that machine-generated articles will contain awkward grammar, strange transitions, and missing data. I think the concept is still superior enough that it will eventually triumph.

(This can already be simulated to some extent, by the way, by running English Wikipedia articles through an online translator. But those usually cover only a relatively small number of languages, and the results are quite a bit more awkward than directly-generated text would be.)

The most important advantage of machine-generated articles is that they can be created for every language. The English Wikipedia currently has around 4,300,000 articles. Even taking that as a rough proxy for the total number of noteworthy topics in the world, it means that the vast majority of languages in which Wikipedia is available cover only a tiny fraction of general knowledge. The next four of the five largest Wikipedias (Dutch, German, Swedish, French) have about 1.5 million articles each - about a third of that number of topics. By the time you get to #50 (Greek), you're down to 90,000 articles - about 2% coverage. The remaining 237 languages have even less. Of course, every Wikipedia is growing, but it seems far-fetched to imagine that "the sum of all human knowledge" will ever be available in most languages via Wikipedia other than by automation.

Will there be code to automatically generate text in, say, Saterland Frisian (#174)? Perhaps it will take a while before every language gets its own text-generation mechanism. What I think can be said is that even the most awkward, primitive script to create text in such a language will result in a body of content that far exceeds the current set, in both size and usefulness. If you were the speaker of an obscure or under-represented language, would you rather have access to thousands of hand-written articles of varying quality, or millions of articles with awkward syntax but up-to-date information? Neither one is ideal, but I think the better choice is obvious.

And automatically-generated articles offer some other advantages. You can tailor the size and content of the article to the current user, or the current context - instead of, say, the endless battles over how many "references in popular culture" are appropriate, if any, the users themselves can set their tolerance for The Simpsons or Family Guy quotes - or that can be set by the application displaying the data.

And for that matter, why should the information be displayed simply as text? Perhaps some users would prefer to see it all as a video, with auto-generated audio plus a slideshow of images - or as clickable maps and timelines, or a series of tables, or something even more unconventional.

It's at that point that the concept of "Wikipedia" starts to break down. After all, if Wikipedia starts to just serve as a viewer for Wikidata's massive data store, there's no reason why other sites, and other applications, and other [whatever these things will be called in the future] can't do the same - each with their own set of preferences, visualizations, and natural-language-generating algorithms.

So perhaps by then, Wikipedia itself will start to fade away, just one of many viewers for Wikidata's data store. And with it may go away a lot of the endless edit battles: whether the dairy product is spelled "yogurt" or "yoghurt" in English can be made a viewer setting, and the question of whether to acknowledge the fact that Abraham Lincoln and Charles Darwin were born on the same day can be up to the viewer application, which will have access to everyone's date of birth. Even more contentious issues - related to geopolitics and the rest - may become less of a big deal, simply because, again, they will be at the discretion of each viewing application.

And after Wikipedia fades, perhaps Wikidata will be the next to go, and with it the whole concept of a single, central knowledge repository. Perhaps various organizations will start to publish their own, reusable Wikidata-style data: so that FIFA will be the definitive source for information about soccer (or is it football?) players and games, the Metropolitan Museum of Art will be the definitive source for information about their artworks, and so forth. Or there may be multiple, competing sources for each area of subject matter, with each data viewer application able to choose which sources to rely on, with Wikidata (or whatever replaces it) always there to fill in the gaps.

Does this concept of hundreds of millions of articles mindlessly cranked out by machines seem sterile? Soulless? Frightening, even? That's a valid response - though the flexibility afforded by separating the data from the presentation could conversely make the display of information more open to creativity, and more "human", than simple text- and image-based articles. The translation to multiple languages is also a very humanistic outcome. And if you're still worried, I can pretty much assure you that none of this will happen in the next five years. But in the next 20 years? In 50?