Wikipedia:Reliability of open government data

Wikipedia fundamentally relies on the use of what we call reliable sources. We are using more and more open data from government sources, as illustrated by the COVID-19 pandemic. But shouldn't we clearly distinguish between "reliable" data and "official" data? When can government agencies be trusted to provide reliable data? COVID-19 pandemic daily infection counts lack credibility for several countries around the world: how should Wikipedia readers be warned?

Sep 2021: Constructive editing of this essay is welcome, but it is not intended as a support/oppose survey. Please edit or insert arguments and counterarguments, preferably with sources, into prose and/or lists. Individual sections on the talk page could be used for support/oppose discussions, with summaries later inserted into the essay itself.

The COVID-19 pandemic case
During the COVID-19 pandemic that dominated world news from 2020 onwards, some of the key pieces of information that readers have sought and editors have provided are the daily counts of how many people have been infected or have died in countries around the world. Numerous media sources in specific countries point to particular worries about the data from several countries, and Wikipedia editing generally follows the usual pattern of judging the reliability of particular media sources, doctors' statements and citizens' groups' statements, rather than relying on government agencies' statements alone. However, the key diagrams, and the numbers that feed through to the global pandemic totals, are not qualified by the unreliability of some of the underlying data.

The WikiProject COVID-19/Case Count Task Force (WP C19CCTF) states that "COVID-19 confirmed cases, deaths and recovery counts" data are based on reliable sources. But these "reliable sources" are in fact open data provided by government health agencies around the world, whose methods of providing information differ fundamentally from those of peer-reviewed research and journalism. In addition to country-level claims of data fabrication covered in some article sections (Belarus, Russia, Nicaragua, Venezuela), the statistical properties of the numbers published by government agencies can be investigated for credibility without introducing political biases, such as the known systematic demographic biases in Wikipedia. Both Benford's law and the lack of noise in the officially stated COVID-19 daily data point to the unreliability of the data from several countries. Unsurprisingly, the worse a country's Reporters Without Borders Press Freedom Index, the more likely its official COVID-19 daily infection counts are to lack day-to-day random fluctuations (stochastic noise). Presumably, government agencies at less risk of press criticism are less worried about fabricating their official open data.
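As a rough illustration of the first of these statistical checks, a first-digit Benford test can be run on any series of daily counts. The sketch below is purely illustrative (the chi-square form and the example series are invented here, not taken from the cited analysis):

```python
import math
from collections import Counter

def benford_deviation(counts):
    """Chi-square deviation of the first-digit frequencies of the
    positive values in `counts` from Benford's law; a higher value
    means a larger departure from the Benford distribution."""
    digits = [int(str(c)[0]) for c in counts if c > 0]
    n = len(digits)
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        # Benford probability of leading digit d is log10(1 + 1/d)
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2

# A suspiciously regular series deviates far more than a varied one:
flat = [100, 101, 102, 103, 104, 105] * 30    # every count starts with "1"
varied = [3, 17, 45, 120, 8, 260, 33, 72, 510, 91] * 18
```

A real analysis would also need a significance threshold, since short series can deviate from Benford's law by chance.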

In this particular case, switching to WHO or Johns Hopkins University CSSE (JHU CSSE) data would not be a solution for finding unfabricated data: the WHO is restricted to providing official national data, and the JHU CSSE data shows suspiciously low-noise daily counts broadly similar to those in the WP C19CCTF data. In fact, the statistical significance of the relation between the Press Freedom Index and low noise is stronger with the JHU CSSE version of the data - see the appendices in the analysis, which aims to be fully reproducible from source data and source code.
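The "low noise" criterion can likewise be illustrated with a crude dispersion measure - for example, the spread of the daily counts around a centred moving average, relative to the mean level. This is only a sketch of the idea, not the estimator used in the analysis referred to above:

```python
import statistics

def noise_ratio(daily, window=7):
    """Standard deviation of residuals around a centred moving
    average, divided by the mean level of the series. Values near
    zero indicate a suspiciously smooth series. Illustrative only."""
    half = window // 2
    residuals = []
    for i in range(half, len(daily) - half):
        trend = sum(daily[i - half:i + half + 1]) / window
        residuals.append(daily[i] - trend)
    mean_level = sum(daily) / len(daily)
    return statistics.pstdev(residuals) / mean_level if mean_level else 0.0

# A perfectly linear (noiseless) series scores ~0; a fluctuating one does not:
smooth = list(range(100, 160))
noisy = [100, 140] * 30
```

Genuine epidemic counts typically show day-to-day reporting fluctuations, so a near-zero ratio over a long period is one warning sign among several, not proof of fabrication.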

Terminology: reliable vs official
Is it acceptable that we continue to use the term "reliable" when we really mean "official" (from a government or governmental agency), and we know that "official" in many cases may mean quite likely falsified? Are we contributing to disinformation if we fail to clearly warn readers that "official" information may be fictitious? Should we trust official open government data by default, or should we distrust it by default?

The COVID-19 pandemic is not the only example of government open data used in Wikipedia, and these questions are likely to become more relevant as citizens increasingly pressure governments to publish open data.

Templates
We could create a template with a mouseover, something like {{cn}} or {{fv}}, displaying a superscript tag such as [govt] and a longer mouseover message along the lines of "Official information from a governmental institution or agency; 'official' information may or may not be reliable."

Official sources noticeboard
Should we have a noticeboard to develop official-source ratings lists, something like WP:RSP? This would need enough volunteers willing to rate specific government agencies, or specific governments or countries, and enough information to warn Wikipedians of the potential personal and legal security risks involved in accusing their governments of fabricating data. The debates could become extremely controversial and would be subject to the usual risks of controversial Wikipedia topics.

Elections
The overall and detailed numbers of votes in elections for political office are a form of open government data for which electoral fraud is well known to occur, and election forensics is a small but emerging field of study. The current convention in the English-language Wikipedia is that the infoboxes show the official results even when the results are dubious (e.g. Iran 2009; Belarus 2015, 2020; Turkmenistan 2017). The implicit policy seems to be that the infobox reliably reports the government's point of view on the election results, even if these are false data, while the validity or invalidity of the open data is described in prose in the lead, based on reliable sources independent of the government.

Robots, search engines and websites that feed off machine-readable Wikipedia infoboxes process and propagate the numerical infobox data but, as of 2021, do not propagate the prose. Yet the prose is what contains the warnings that the information is (in some cases) highly unreliable (except in the sense that it is a reliable report of the government agency's claim about the data).

COVID-19 pandemic
It can reasonably be argued that the COVID-19 pandemic data currently (Sep 2021) in Wikipedia is reliable in the sense that it represents the governments' points of view on their pandemic statistics. However, would better terminology or some good templates be enough to warn users that the data may, in some cases, be nonsense, so that we do not contribute to official governmental disinformation?

Excluding COVID-19 pandemic data from the countries whose data is most suspicious would be aesthetically upsetting, and would risk accusations of pro-Western bias, even if the decisions were based purely on statistical properties of the official government data.

Bayesian option
A possible approach could be to associate a Bayesian probability of credibility with each source of open government data, where the individual probabilities are generated from peer-reviewed research, preprint research (itself with a lower Bayesian probability of being correct) and media articles (with Bayesian probabilities related to WP:RSP?). Would there be enough people from diverse backgrounds, with the editing capabilities and the enthusiasm, to get these data into Wikidata? Currently (Sep 2021), Wikidata elements are subject to much less editorial debate than Wikipedia articles.
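For concreteness, one way such probabilities could be updated is the standard Bayes rule in odds form, where each new source (peer-reviewed paper, preprint, media article) contributes a likelihood ratio reflecting its reliability class. All the numbers below are hypothetical, chosen only to show the mechanics:

```python
def posterior_credibility(prior, likelihood_ratio):
    """One Bayesian update in odds form:
    posterior odds = prior odds * likelihood ratio.
    A likelihood ratio above 1 means the new evidence supports the
    data being credible; below 1, that it is likely fabricated."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical: start neutral (0.5), then weigh in a peer-reviewed
# paper supporting credibility (LR = 3), then a preprint (weaker, LR = 1.5):
p = posterior_credibility(0.5, 3.0)
p = posterior_credibility(p, 1.5)
```

The hard editorial question is not the arithmetic but who assigns the likelihood ratios, which is exactly where a WP:RSP-style rating list could come in.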

Infoboxes for elections, pandemic data or other open government data could have parameters such as |credibility_percent = 3 |credibility_refs = that display a probability either as a percentage (3% in this case) or as a decimal in the range 0 to 1, giving a median (more robust than the mean) credibility estimate based on one or more references. As in ordinary Wikipedia editing, the parameter would quite likely be subject to intense debate on source reliability, how to express the overall value, and so on, depending on the quality of sources for individual open-government-data articles.

Openness and verifiability of the credibility research itself
En.Wikipedia generally considers any peer-reviewed research by a reputable research journal to be reliable, without requiring that the research paper be open access, and without requiring that the specific data sources, input parameters and method be presented in a fully reproducible format. Given the risk of initially relying on a small number of research papers in what is a small research field, we could require much higher standards than are typically considered enough. We could require that both:
 * 1) the research papers would have to be open access
 * 2) the research papers would have to be fully reproducible: any results should be documented by making all data and code available in such a way that the computations can be executed again, yielding identical results, by any independent researcher with basic scientific computing skills

How do we combine different researchers' assessments?
If we use the credibility estimates from a single research paper by a single research group (or researcher), then we introduce a high sensitivity to error in that one paper: if the paper is wrong, the error feeds through to a whole range of articles.

If we use the credibility estimates from multiple research papers, then how do we combine them? One solution would be to assign credibility parameters to each of the research papers and/or researchers, and take weighted medians (medians for robustness). These could initially be set to, e.g., 0.5, and then raised or lowered based on qualitative discussion or on the track records of those researchers' previous publications. However, this risks being counted as WP:OR or WP:SYNTH, so there would have to be strong consensus on the method and algorithm. Alternatively, we could report ranges, such as the interquartile range or the central 95% range, if there is a high enough number of research papers.
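A weighted median of the kind suggested above is simple to compute: sort the estimates and return the first one at which the cumulative weight reaches half the total. The estimates and weights below are invented purely for illustration:

```python
def weighted_median(values, weights):
    """Return the smallest value at which the cumulative weight
    reaches half of the total weight. More robust than a weighted
    mean: a single outlying estimate cannot drag the result far."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2:
            return value

# Hypothetical: three papers estimate credibility 0.9, 0.85 and 0.02
# (an outlier), all at the suggested default weight of 0.5.
# The weighted median is 0.85; a mean would be dragged down to 0.59.
estimate = weighted_median([0.9, 0.85, 0.02], [0.5, 0.5, 0.5])
```

This robustness is the reason the essay prefers medians to means: one mistaken or adversarial paper moves a mean arbitrarily far but barely moves a median.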

Policies
Should there be any specific Wikipedia guideline or policy distinguishing "reliable" versus "official" data? Some sort of text label to clarify the distinction?

Reliable sourcing versus geographical bias dilemma
COVID-19 data is generally more dubious in countries with worse press freedom, and election data is generally more dubious in countries with less developed democratic structures, human rights cultures and institutions. If we systematically remove the less reliable open government data from Wikipedia, then we improve our information reliability but risk strengthening the known geographic biases of the English-language Wikipedia. If we don't remove it, then we risk presenting unreliable data as reliable, while appearing to provide less biased encyclopedic coverage. This dilemma is similar to the usual sourcing dilemma in relation to these biases, with the difference that numbers can give a false illusion of reliability, since numbers can seem more objective than words. (Numbers obtained and presented accurately are, of course, at the heart of most of modern science; but there is a huge caveat in the word "accurately".)

Negotiating with other editors on where to compromise, on a case-by-case or topic-by-topic basis on talk pages, with standards evolving over time, is one way to handle this dilemma.