Wikipedia:Wikipedia Signpost/2016-02-03/Op-ed

The removal of James Heilman from the Wikimedia Foundation Board of Trustees has brought the issue of the "knowledge engine", i.e. the work of the WMF's "Discovery" department, into focus for the volunteer community.

Ever since his dismissal, Heilman has maintained that disagreements about appropriate transparency related to the Discovery, or "Knowledge Engine", project, funded by a restricted grant from the Knight Foundation, were a key factor in the events that led to his removal. Jimmy Wales has referred to these claims as "utter fucking bullshit".

But what is "discovery" and the knowledge engine all about? This is an attempt to make sense of the patchy information that the Wikimedia Foundation has provided to the volunteer community and the public to date, and to extract some of the underlying ideas related to the project.

A statement of regret
A few days ago, Lila Tretikov posted a statement titled "Some background on the Knowledge Engine grant" on her talk page on Meta-Wiki. This is worth reading in full; parts of it are excerpted in what follows below. She begins by acknowledging that she should have communicated with the volunteer community sooner.

This kind of communication is potentially a good start towards mending fences with the community, and towards redressing some of the things that went wrong with how and when this project was started and communicated.

When did this project start?
In her statement, Lila Tretikov recalls her thoughts around the knowledge engine in June 2015. But to locate the actual beginning of this project, we have to look back a little further than that.

"Search & Discovery" first appeared as a department on the Wikimedia Foundation Staff & Contractors template on 30 April 2015. There would have been no point in creating a well-staffed, well-funded Search & Discovery department in April 2015 if the WMF leadership had had no practical idea of what this team was going to be working on.

Risker was perhaps the first to raise public questions about the project, which remained unanswered. Reviewing the WMF's draft 2015–2016 Annual Plan in her capacity as a member of the volunteer-staffed Funds Dissemination Committee in May 2015, she said on Meta:

This is in line with what James Heilman said in his Signpost op-ed dated January 13, 2016, describing the ideas around Search & Discovery as having been developed before the April to June 2015 quarter:

By September 2015 – more than four months before the WMF publicly announced the grant to the community and the world at large on January 6, 2016 – the Knight Foundation had clearly made a decision to support a WMF knowledge engine project. A still extant page on the Knight Foundation website mentions a grant for $250,000, with a grant period running from 1 September 2015 to 31 August 2016, funding work –

James Heilman's dating of the project matches the facts. And Risker's questions, posed in May 2015, indicate that the volunteer community – including the FDC – had been out of the loop well before June 2015.

Has the Foundation's grant transparency policy changed?
Lila Tretikov addresses another question in her statement:

This seems to differ from her predecessor Sue Gardner's statement of WMF policy in October 2011. In this statement, Gardner said,

Sue Gardner said publishing grant applications is standard policy unless the donor objects. Lila Tretikov says it is standard policy not to publish grant applications, unless "requested and agreed to by donors."

This seems like a subtle move away from the transparency which the WMF has traditionally emphasised as one of its core values.

"Actions speak volumes"
Volunteers have called for weeks for the Knight Foundation grant application and grant letter to be published. A month ago, for example, one volunteer addressed Jimmy Wales on his talk page, responding to statements by Wales that James Heilman's narrative was "misdirection", a "trap", "not true" and that he – Wales – was "a much stronger advocate of transparency than James":

Wales replied,

This sounded promising. Yet nothing happened on that front for the best part of a month, until Lila Tretikov's recent statement on her Meta talk page. In the relatively brief discussion that has ensued there to date, James Heilman has reiterated his call for the grant application to be published:

Yet the responses posted by Jimmy Wales and fellow Board member Denny Vrandečić on that page are anything but encouraging. The WMF board seems remarkably reluctant to publish both the grant application and the grant letter for community review.

Partial information
What Lila Tretikov has now provided on her Meta talk page is a list of expected outcomes of the Knight Foundation grant, deliverable at the conclusion of the first stage. The mention of a first stage raises the obvious question of how many stages are envisaged, and what the expected deliverables for the other stages are.

Liam Wyatt mentions in his most recent blog post that the WMF's original grant application seems to have been for a much larger amount than the relatively modest $250,000 the Knight Foundation has actually committed to. As others have pointed out, $250,000 is an amount the Wikimedia Foundation can raise and has raised in a few hours on a December afternoon. Liam Wyatt's assertion that the original application was for a much larger amount seems at least plausible.

Details of the grand vision for this multi-stage project, couched in approachable language understandable by anyone, are still lacking. Publication of the grant application would help volunteers and the public understand the WMF leadership's thinking, and the long-term goals of the Discovery project.

"Are you building a new search engine?"
Some related information is available in a presentation on MediaWiki – but when originally delivered, its slides would have been accompanied by spoken commentary. As it is, the slides are written in such a shorthand and jargon-laden style that few general readers are likely to be able to follow the content and fill in the gaps.

What's described on one page of that document, however, is clearly some form of search engine for open content on the Internet.

This is also reflected in the original Knight Foundation announcement from September 2015:

Of course, this is not the first time the idea of a search engine has been raised in the Wikimedia universe. Those with long memories will recall Wikia Search, a short-lived free and open-source Web search engine launched by Jimmy Wales' for-profit wiki-hosting company Wikia in 2008.

Wikia Search was conceived as a competitor to the established search engines. It was not a success, closing down in 2009 after failing to attract an audience. That there have been questions among WMF staff and volunteers whether the WMF is engaged in building a search engine is borne out by a corresponding section in the Discovery FAQ on MediaWiki:

At the same time, the Discovery FAQ on MediaWiki asks,

This sounds like an idea to re-purpose Wikipedia.org, having it function as a search engine for open content – including but not limited to all Wikimedia projects. And indeed, a document on "Conceptual directions for Discovery" shows versions of a Wikipedia.org page that bear more than a passing resemblance to Google's start page, being dominated by a Wikipedia wordmark and an empty search box.

It is perhaps significant that Wikipedia is now one of the search engine options for the search box in Firefox, alongside Google, Yahoo!, Bing and others.

Deliverables
Assessing whether users would "go to Wikipedia if it were an open channel beyond an encyclopedia" is also one of the key questions the work funded by the Knight Foundation is expected to answer.

The sections of the Knight Foundation grant documentation that Lila Tretikov has quoted on her Meta talk page read as follows:

Embedding Wikipedia in OEMs and other carriers
Responding to a question about the mention of Original Equipment Manufacturers in the above grant excerpt, Lila Tretikov has indicated on her talk page in Meta that the WMF is

There is no reason to assume that this type of arrangement will be restricted to mobile phones. Even today, the Amazon Kindle e-book reader, for example, has a Wikipedia look-up function pre-installed, as does the Amazon Echo, a household infotainment assistant that, like Apple's Siri, responds to voice commands.

One problem the Foundation appears to want to address is that Search often returns "zero results" – i.e. cases where a user query is not mapped to a Wikipedia article.

Kindle users trying to look up a word in a book they're reading will be familiar with this problem. To give an example, the other day I was faced with the term "unentailed" in a Victorian short story. Neither of the Kindle's built-in dictionaries knew the term. Querying Wikipedia on the Kindle yielded the disappointing message: "No Wikipedia results were found for your selection", along with an option to "open Wikipedia". When I did so, I found many Wikipedia articles containing the term, but only one was useful in helping me understand the word: Fee tail.

A better search function might have presented me with that article to begin with. Similarly, I might not have had a zero results message if the Wikipedia search function extended to other Wikimedia projects. Instead, my Kindle might have pointed me to the Wiktionary entry for the word unentailed.
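To make the idea concrete, here is a minimal sketch of a look-up that falls back to a sister project before reporting zero results. It is purely illustrative: the in-memory `INDEXES`, the `SEARCH_ORDER` list and the `lookup` function are hypothetical names invented for this example, and nothing here reflects how the WMF's search backend or the Kindle actually work.

```python
from typing import Optional, Tuple

# Toy in-memory indexes standing in for two Wikimedia projects.
# The structure and names are assumptions made for illustration only.
INDEXES = {
    "wikipedia": {"fee tail": "Fee tail"},
    "wiktionary": {"unentailed": "unentailed"},
}

# Projects are tried in this order; later entries act as fallbacks.
SEARCH_ORDER = ["wikipedia", "wiktionary"]


def lookup(query: str) -> Optional[Tuple[str, str]]:
    """Return (project, page title) for the first project that has an
    exact match for the query, or None if every project comes up empty."""
    key = query.strip().lower()
    for project in SEARCH_ORDER:
        title = INDEXES.get(project, {}).get(key)
        if title is not None:
            return project, title
    return None  # a genuine "zero results" case


if __name__ == "__main__":
    print(lookup("unentailed"))
```

With only the Wikipedia index consulted, the query "unentailed" would find nothing; with the fallback, the look-up ends at the Wiktionary entry instead. Pointing the reader to the Fee tail article would need something smarter than exact matching, such as ranking articles that merely contain the term.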

Search improvements like this would make locating information more convenient for Kindle users. Amazon would arguably profit from having a more desirable product.

Machine-generated content
The ongoing community consultation on Meta suggests that one possible approach the Wikimedia Foundation might take would be to

More than half of all language versions of Wikipedia suffer from a long-term dearth or indeed complete lack of human volunteer editors. Wikidata content could be used to have machines generate simple articles "on the fly", using a store of simple sentence templates that Wikidata values are then plugged into. This, too, would reduce the number of times users' searches come up empty.
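As a rough sketch of what plugging Wikidata values into sentence templates could look like, consider the following. Everything in it is an assumption made for illustration: the `TEMPLATES` table, the `generate_stub` function, the field names and the sample item (a fictitious town) are invented, and the code does not query Wikidata or reproduce any WMF tooling.

```python
from string import Template
from typing import Optional

# Hypothetical sentence templates, one per item type. The placeholders
# stand in for values that would come from a Wikidata-style record.
TEMPLATES = {
    "city": Template(
        "$label is a city in $country with a population of $population."
    ),
    "person": Template(
        "$label (born $birth_year) is a $occupation from $country."
    ),
}


def generate_stub(item: dict) -> Optional[str]:
    """Build a one-sentence stub from a record, or return None when no
    template fits or a required value is missing."""
    template = TEMPLATES.get(item.get("type"))
    if template is None:
        return None
    try:
        return template.substitute(item)
    except KeyError:
        # A placeholder had no matching value: better no article
        # than a sentence with a hole in it.
        return None


if __name__ == "__main__":
    sample = {
        "type": "city",
        "label": "Exampleville",   # fictitious sample data
        "country": "Ruritania",
        "population": 12345,
    }
    print(generate_stub(sample))
```

A real system would also need per-language templates that handle grammar (gender, case, plural forms), which is where most of the difficulty is likely to lie.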

Public curation of relevance
A key concern for search engines is which information to "surface" in response to a query, i.e. identifying which information is most "relevant" to a user query. (In the above example, for instance, this might have been the Wiktionary entry for the word "unentailed", or the Wikipedia article for "fee tail".)

Here, too, Wikidata and Discovery are apparently envisaged to play a key role in future. A little-visited Discovery RfC draft subpage on MediaWiki speaks of "public curation of relevance":

There are obvious alarm bells attached to that last caveat. Increasing or diminishing the visibility of information to search engine users is what search engine optimisation is all about. Public curation of relevance provides a mechanism that seems tailor-made for that purpose – hence the caveat.

Focus on open content
If Wikipedia.org is to become a search engine, all the information available to date suggests that it will be a search engine specialising in open content.

As discussed in a previous Signpost op-ed, search engines are increasingly re-publishing open content deemed relevant to users' queries on their own sites.

The purpose of a search engine used to be to provide the user with a directory of relevant links. This purpose has shifted: search engines are morphing into answer engines that aim to provide the answers to users' questions directly, obviating users' need to click through to any other site. This supports the search engines' business model: search engines derive a significant part of their income from the ads displayed on search engine results pages. Making sure that users stay on these pages and do not click through to other sites (including Wikipedia) improves the search engines' bottom line.

APIs
The Knight FAQ on MediaWiki emphasises the importance of APIs (application programming interfaces):

The availability of a Wikipedia knowledge engine would not just benefit users of Wikimedia sites. There are clear overlaps with commercial search engines. Google itself, for example, when introducing the Knowledge Graph to the public, referred to it as a "knowledge engine". As mentioned in Google's publicity video, a key concern when deciding what content to show in response to a query is what other users have judged relevant to their query.

Seen from the viewpoint of potential re-users like Google, Bing, Yandex and other search engines, public curation of relevance, as mentioned above, could help them identify free information sources that human users are likely to find useful.

Thus the API might deliver volunteer-derived data that those answer engines can use to optimise their product – for example through optimised page ranking for open content pages, or direct inclusion of user-preferred free content in answer boxes, knowledge panels and the like.

Taking a very negative view, bearing in mind search engines' soaring and astronomical profits from advertising, one might argue that such an arrangement would turn volunteers into unpaid hamsters driving the spinning cogs in the Knowledge Graph, Bing's Snapshot etc., in endless pursuit of the dangling carrot that they can try to affect how readily something is "surfaced" in search engines – and occasionally succeed in doing so, for better or worse.

Coexistence?
Given the information presently available, it seems reasonable to assume that the long-term goal of the Discovery project, including public curation of relevance, is to build a search engine for free information sources on the Internet, and to capture and define patterns of human user interaction with such content.

The failed Wikia Search project was designed to compete against Google. Not least because of this failure, it seems unlikely to me that the present Discovery project pursues similar long-term ambitions.

It simply doesn't seem very plausible that the Discovery project could compete, or is even intended to compete, against the likes of Google, Bing and Yandex, given that –


 * 1) Wikipedia's open-source nature and APIs would make whatever insights and data the Discovery project generates available to these competitors, who are already well established. Any competitive advantage these data might deliver to a WMF search engine would be instantly neutralised by the fact that its putative competitors would have access to them too.
 * 2) Google, Microsoft and Yandex are actually supporting Wikidata. It seems unlikely that they would be funding a competitor.
 * 3) Wikidata comes with a no-attribution CC-0 licence, which serves re-users' interests, but undermines the publicly professed rationale of reaching users so as to convert them into editors.

At most, the Wikimedia Foundation board appears to entertain the idea of charging re-users for API use. In other words, the work done by volunteers might be of sufficient economic value to re-users to open up another source of income for the Foundation.

The times, they are a-changing
I recall listening to Lila Tretikov's "Facing the Now" speech at Wikimania 2014, in which she stressed that Wikipedia, while apparently at the peak of its success at the time, was in danger of being left behind by technology developments, and would need to adapt to remain relevant. I wondered at the time what she was driving at, but it seems to become clearer now.

As people's infotainment needs and usage patterns move away from desktop computers to mobile phones, voice-controlled electronic assistants and other products that will be as commonplace in ten or twenty years' time as they are unimagined by most of us today, many of the people who used to visit Wikipedia pages may find their information needs more conveniently satisfied elsewhere than on the pages of an encyclopedia.

The march of technological progress can't be halted. And indeed, many of us find that progress inherently exciting. The idea of positioning the Wikimedia community as the central engine driving many different types of information products and services – or at least a major component of such an engine – is likely to appeal to many Wikimedians. It would certainly keep Wikimedia relevant.

And one might ask, in the absence of scalable alternatives, is there really a better process for generating and curating such content today? Google has long argued that volunteers outperform paid contributors when it comes to such work.

Yet there is also much to be disturbed about. Omnipresent snippets, delivered to a potential audience of billions, amplify the risk of manipulation, creating an information infrastructure that seems more vulnerable to activist influence, or indeed Gleichschaltung, than conventional media. It has been established that even today, search engines would have the power to sway elections if they put their minds to it.

To the extent that volunteer labour helps corporations like Apple, Google, Amazon, and Microsoft make billion-dollar profits, the potential labour patterns described here involve obvious and profound social and economic injustices. On Jimmy Wales' talk page, volunteers have begun to wonder whether they need to unionise to have any influence on the future of their movement.

The increasing importance of machine-generated and machine-read content in efforts to serve global information demand may be anathema to many Wikimedians committed to the idea of an encyclopedia written by people, for people.

And there are other concerns: to what extent do Silicon Valley-facing developments like those described here, efforts to build a technologically slicker product and achieve greater market penetration, detract from other efforts that volunteers might consider more relevant to the core goal of writing an encyclopedia? Making Kindle search functions easier and more satisfying to use is all well and good, but what should the relative priority of such efforts be?

James Heilman is a Wikipedian who has gained much acclaim for his efforts to make Wikipedia's medical content truly reliable. Bringing a Wikipedia article to a quality level that was good enough to make it eligible for inclusion in a peer-reviewed journal (see previous Signpost coverage) – the first Wikipedia article to qualify as a reliable source under Wikipedia's own rules – was a milestone in Wikipedia history, though one the Foundation made no great effort to publicise at the time.

In an alternative universe, the Wikimedia Foundation might put equal focus on supporting and expanding such efforts, believing that a quality product will always have a readership. Ubiquity is not the same as quality; Gresham's law could easily be applied to the world of information as well.

Conclusion
As stated above, this op-ed is my attempt to make sense of the patchy information that has been made public about Discovery, or the Knowledge Engine. I will be grateful to have my interpretations and conclusions confirmed or corrected as appropriate by WMF personnel – and, I am sure, so will the volunteer community.

But most importantly, there are matters here on which the community should have its say. The ongoing community consultation on strategy touches on many of the issues discussed here. It will close on February 15, 2016.

There is also an ongoing issue of transparency. The WMF board should make the Knight Foundation grant application and grant letter public, in line with WMF policy as stated by Sue Gardner in the past. If for some reason the Knight Foundation objects, that should be openly stated and an explanation provided.


Information sources on Discovery:

 * meta:User_talk:LilaTretikov_(WMF)
 * mw:Wikimedia Discovery/FAQ
 * mw:Wikimedia Discovery/Knight FAQ
 * mw:Wikimedia_Discovery/RFC
 * commons:File:Discovery_Year_0-1-2.pdf

For another perspective on these developments, see Liam Wyatt's blog post Strategy and controversy, part 2.

Update: The grant agreement has since been published, and some further documents have been leaked to the Signpost. See Signpost coverage.



Andreas Kolbe has been a Wikipedia contributor since 2006. He is a member of the Signpost's editorial board. The views expressed in this editorial are his alone and do not reflect any official opinions of this publication. Responses and critical commentary are invited in the comments section.