User:Larataguera/Bias

Evaluating qualitative systemic bias in large article sets on Wikipedia
The presence of systemic bias on Wikipedia has been well established by several studies. Most studies demonstrate simple quantitative bias by noting that one class of articles has fewer members than a comparable class (eg, there are more biographies about men than about women, and more articles about Paris than about all of Africa).

Qualitative bias in article content (ie, biased information in existing articles) is more difficult to assess. Creation of new articles helps address simple quantitative bias in the number of articles, but it does not by itself remedy bias in the encyclopedia if those articles are orphaned or poorly interlinked. Integration of missing topics requires an assessment of due weight to determine whether, for example, a new biography about a female mathematician should be linked from articles about her area of expertise, awards she won, or other mathematicians she may have influenced. It’s also possible that some of her works would meet notability guidelines (WP:BK) and are themselves missing. The complexity of this situation means that assessing qualitative bias on the basis of even a limited class of articles (eg, 1,000 women mathematicians) quickly implicates many thousands of other articles outside that class.

I propose a systematic approach for assessing bias by using Wikidata to create an ontological map of related articles. This approach generates a set of statements that could potentially be missing from the encyclopedia, and is intended to be a step toward assessing qualitative bias in large sets of articles. In combination with a separate assessment of sources to determine due weight (only marginally developed here), this approach could lay a foundation for identifying and integrating missing information from large sets of articles and help counter systemic bias (WP:CSB).

Environmental justice on Wikipedia
For this exercise, I examined articles related to environmental conflicts and environmental justice. Wikipedia is overwhelmingly created by white males in Europe and North America, a demographic that generally benefits from environmental injustice. Information about Indigenous people and the Global South is notably absent, creating gaps in knowledge about those portions of the world that bear the burdens of environmental injustice. There are about 4,000 environmental conflicts currently listed in the Environmental Justice Atlas (EJAtlas), and the number is growing. Given the complexity discussed above, these 4,000 conflicts could have implications for tens of thousands of articles (some existing, and some missing) about environmental justice campaigns; resource extraction projects (eg, mines, pipelines, gas fields); notable environmental defenders (individuals and organizations); corporations; commodities; disasters and other events; threatened ecosystems; and more.

Method
For this exercise, I examined 800 environmental conflicts from the EJAtlas to determine how they were represented on Wikipedia and Wikidata. I used script-aided matching to iteratively sort the conflicts and establish an ontological structure comprising statements that define relationships between entities. Wikidata statements have a subject-predicate-object format. For example, the relationship “The Escobal mine protests oppose the Escobal mine” is represented as: Q106830477 (Escobal mine protests), P5004 (in opposition to), Q16957078 (Escobal mine). (That statement does not exist in Wikidata at this writing.)
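The subject-predicate-object shape can be illustrated with a minimal Python sketch. The identifiers and labels are the ones given in the example above; the `Statement` class, `to_tsv`, and `describe` are hypothetical names introduced here for illustration and are not the scripts used in this exercise.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One subject-predicate-object triple of Wikidata identifiers."""
    subject: str    # item ID of the thing being described
    predicate: str  # property ID naming the relationship
    obj: str        # item ID the relationship points to

    def to_tsv(self) -> str:
        # Tab-separated form, a convenient shape for batch tools (assumption).
        return "\t".join(self)


# Labels copied from the example in the text, used only for readable output.
LABELS = {
    "Q106830477": "Escobal mine protests",
    "P5004": "in opposition to",
    "Q16957078": "Escobal mine",
}


def describe(statement: Statement) -> str:
    """Render a statement with its labels; unknown IDs are shown as '?'."""
    return " ".join(f"{part} ({LABELS.get(part, '?')})" for part in statement)


escobal = Statement("Q106830477", "P5004", "Q16957078")
print(describe(escobal))
print(escobal.to_tsv())
```

Keeping the triple as plain identifiers, with labels held separately, means the same structure works whether or not the items already have labels on Wikidata.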

After compiling the list of conflict titles, I queried them in the English Wikipedia, returning the two most relevant matches for each. Of the 800 conflicts, 488 (61%) returned at least one proposed matching entity. I then sorted through the conflicts to remove false positives and confirm that the proposed matches clearly related to the EJAtlas conflict title. This process confirmed a match for about 20% of the 800 EJAtlas conflicts, so roughly two-thirds of the initial 488 results were potentially false positives.
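A title query of this kind could be sketched as follows. This is not the script used in the exercise, which is not reproduced here; it assumes the standard MediaWiki Action API search module (`action=query`, `list=search`) with a limit of two results, and the function names and User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(title: str, limit: int = 2) -> str:
    """Build a full-text search URL that asks for the top `limit` hits."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": title,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_titles(response: dict) -> list[str]:
    """Pull page titles out of a decoded search response (empty if no hits)."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]


def search_wikipedia(title: str, limit: int = 2) -> list[str]:
    """Run the query over the network and return the candidate titles."""
    request = urllib.request.Request(
        build_search_url(title, limit),
        headers={"User-Agent": "bias-mapping-sketch/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_titles(json.load(resp))
```

Calling `search_wikipedia("Escobal mine")` would return up to two candidate titles. Those candidates still need the manual confirmation step described above, since a high-ranking search hit is not necessarily about the conflict.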

Following this initial matching, I reduced the size of the set to the first 250 conflicts (including unmatched conflicts), and developed a second script that enabled matching of multiple entities to a single conflict and placement of the matched entities into one of several categories:


 * Conflict (protest, social movement; directly corresponds to EJAtlas entry)
 * Project (eg, mine, pipeline)
 * Resource (eg, lake, protected area)
 * Company
 * Environmental organization
 * Disaster

To aid accurate and detailed matching, this script also provided a text input and one-click querying to retrieve additional information about the conflicts: up to five relevant Wikipedia entities, the lead paragraph of a Wikipedia article, the lead paragraph of an EJAtlas description, or the company name from a Wikipedia infobox.
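One way to hold several categorised matches per conflict is sketched below. The six categories are the ones listed above; the class and method names are hypothetical, the EJAtlas title is a placeholder, and the two item IDs are taken from the Escobal example in the Method section.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    CONFLICT = "conflict"
    PROJECT = "project"
    RESOURCE = "resource"
    COMPANY = "company"
    ENVIRONMENTAL_ORGANIZATION = "environmental organization"
    DISASTER = "disaster"


@dataclass
class ConflictRecord:
    """One EJAtlas conflict and every Wikidata entity matched to it."""
    ejatlas_title: str
    matches: list[tuple[str, Category]] = field(default_factory=list)

    def add_match(self, entity_id: str, category: Category) -> None:
        # Skip exact duplicates so repeated clicks do not inflate the counts.
        if (entity_id, category) not in self.matches:
            self.matches.append((entity_id, category))

    def by_category(self, category: Category) -> list[str]:
        return [eid for eid, cat in self.matches if cat is category]


# Placeholder title; the item IDs come from the Escobal example in the text.
record = ConflictRecord("Escobal mine (placeholder EJAtlas title)")
record.add_match("Q106830477", Category.CONFLICT)
record.add_match("Q16957078", Category.PROJECT)
record.add_match("Q16957078", Category.PROJECT)  # duplicate, ignored
print(record.by_category(Category.PROJECT))
```

Storing the category alongside each match, rather than one category per conflict, is what lets a single conflict point to a project, a company, and a protest at the same time.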

I was able to match 161 entities to 113 conflicts from the reduced set of 250, so 45% of conflicts were matched with at least one entity, up from 20% in the first iteration. Forty-eight conflicts (about 19% of the set) were matched to more than one entity. Most of the 161 entities were projects (45%) or companies (26%).
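The shares in this paragraph can be recomputed from the raw counts. The short sketch below uses only the figures reported above; the variable and function names are illustrative.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


SET_SIZE = 250               # conflicts in the reduced set
MATCHED_CONFLICTS = 113      # conflicts with at least one matched entity
MULTI_MATCH_CONFLICTS = 48   # conflicts matched to more than one entity
MATCHED_ENTITIES = 161       # entities matched in total

print(share(MATCHED_CONFLICTS, SET_SIZE))               # 45.2
print(share(MULTI_MATCH_CONFLICTS, SET_SIZE))           # 19.2
print(round(MATCHED_ENTITIES / MATCHED_CONFLICTS, 2))   # 1.42 entities per matched conflict
```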

I used these matches to generate a partial ontology for 250 conflicts consisting of 481 relationships, including 250 general statements establishing the conflicts themselves. Very few of these statements are presently included on Wikidata, although a few were added during the activity. Many of the conflicts remain undefined, and some properties that seem integral to environmental conflict ontologies do not exist.
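The step from categorised matches to statements might look like the sketch below. It is an illustration under stated assumptions, not the ontology actually generated: it emits one general statement per conflict, using instance of (P31) environmental conflict (Q5683226), plus an in opposition to (P5004) statement for each matched project. Predicates for the other categories are deliberately left out because they are not specified here, and the second conflict key is a placeholder for an entity that does not yet exist.

```python
P_INSTANCE_OF = "P31"
Q_ENVIRONMENTAL_CONFLICT = "Q5683226"
P_IN_OPPOSITION_TO = "P5004"


def build_statements(conflicts: dict[str, list[tuple[str, str]]]) -> list[tuple[str, str, str]]:
    """Turn {conflict: [(entity, category), ...]} into (subject, predicate, object) triples."""
    statements = []
    for conflict_id, matches in conflicts.items():
        # General statement establishing the conflict itself.
        statements.append((conflict_id, P_INSTANCE_OF, Q_ENVIRONMENTAL_CONFLICT))
        for entity_id, category in matches:
            # Only the project relationship is modelled in this sketch.
            if category == "project":
                statements.append((conflict_id, P_IN_OPPOSITION_TO, entity_id))
    return statements


example = {
    "Q106830477": [("Q16957078", "project")],  # Escobal mine protests -> Escobal mine
    "NEW-CONFLICT-PLACEHOLDER": [],            # unmatched conflict, still gets a general statement
}
for triple in build_statements(example):
    print(triple)
```

Because every conflict receives a general statement whether or not it was matched, the number of general statements equals the size of the set, which is consistent with the 250 general statements reported above.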

These statements are a small portion of all possible relationships that could comprise environmental conflict ontologies; they were chosen to illustrate the process and to (mostly) make use of existing Wikidata relationships more than to recommend a particular ontological structure. Obviously, a different set of relationships would have to be derived for a different set of articles (such as that for female mathematicians discussed above).

Suggestions for further work
The low matching rate found in this activity limits possible statements, and is at least partially due to poor coverage of environmental conflicts on Wikimedia platforms. Although this experiment did not systematically determine what proportion of missing conflicts would meet general notability guidelines (GNG), only seven articles about a conflict were found in the final set of 250 conflicts for which an ontology was generated (7/250, or about 3%). Given that the EJAtlas is a moderated platform that requires secondary sourcing, and that conflicts in general are frequently newsworthy and frequently discussed in academic literature, this rate seems absurdly low. I did identify several missing entities that met GNG and worked with another editor to create articles for them. Since notability requirements for Wikidata are lower, presence of a conflict in the EJAtlas is sufficient to establish a conflict entity as an instance of (P31) environmental conflict (Q5683226). The statements proposed above assume that these entities would be created on the basis of the EJAtlas entry.

This examination suggests that Wikipedia’s coverage of environmental conflict is poor. It also suggests that without further script development, it would take a little over fifty hours of work to accurately match about half of the conflicts listed in the EJAtlas to at least one existing Wikidata entity. Accurate and efficient matching of the remaining 50% (and more complete matching in general) would probably require additional development. These matches could be used to facilitate an information exchange between the EJAtlas and Wikipedia that has potential to improve the quality of information on both platforms.

This work could set a foundation for correction of bias in coverage of environmental conflicts in Wikipedia by identifying entities that are missing entirely, or articles that may lack information about environmental conflict. The structured relationships also provide a framework that should facilitate script-aided editing to establish the missing information (in combination with further development to identify supporting sources).

Sources and due weight
To evaluate whether a missing entity meets GNG, relevant sources would have to be assessed. That assessment is mostly beyond the scope of this exercise, but I did develop a script to extract sources from an EJAtlas entry about a conflict. Similar scripts could be developed to identify relevant sources on Google Scholar or other databases, and these tools should make it relatively easy to establish whether a particular entity meets GNG (although that assessment is frequently subjective, as evidenced by many impassioned debates at AFD!). Assessing due weight for inclusion of a conflict (or any concept) in other related articles is much more difficult, though tools could and should be developed to aid that assessment.
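Source extraction of this kind could be sketched with Python's standard HTML parser. This is not the script mentioned above: it assumes only that an entry's sources appear as ordinary hyperlinks in the page's HTML, which may not match how EJAtlas pages are actually structured, and the sample markup and URLs are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique absolute http(s) links from <a href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep external-looking links only, and skip repeats.
        if href and href.startswith(("http://", "https://")) and href not in self.links:
            self.links.append(href)


def extract_source_links(html: str) -> list[str]:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Invented sample markup standing in for the sources section of an entry.
sample = """
<ul>
  <li><a href="https://example.org/report.pdf">NGO report</a></li>
  <li><a href="/local/page">internal link</a></li>
  <li><a href="https://example.org/report.pdf">NGO report (repeat)</a></li>
  <li><a href="https://example.com/news-story">News story</a></li>
</ul>
"""
print(extract_source_links(sample))
```

A list of links is only raw material. Each source would still need to be checked for independence and depth of coverage before it could support a notability claim.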

Implications and questions
It would be possible to extend this methodology to any class of articles likely to suffer from systemic bias in its representation on the encyclopedia. Our earlier example of women mathematicians could be represented as an ontology consisting of biographies that are related by statements about mathematical concepts, other biographies, places, awards, books, and technologies. In the case of environmental conflicts, the EJAtlas makes it easier to organise this information and provides a resource for identifying many of the related entities. For other sets, a similar resource would be helpful.

In the case of environmental conflict, some care is warranted to ensure that we develop an ontological structure that minimises harm. The central question is which categories should be differentiated and which should remain ambiguous. Within the platforms explored here, conflicts and conflictive projects are frequently conflated. My initial experiment conflated conflicts and disasters (in the single category of events), though I differentiated these categories in the final iteration. Any structure will erase certain distinctions while preserving others, and this erasure has the potential to do harm.

In view of the ongoing violence of environmental injustice and the possibility that this work could reduce harm by making information about that violence more accessible (as well as the reality that some of this ontology is already implicit in the existing structure of Wikidata, Wikipedia, and the EJAtlas), continued development of this approach seems worthwhile, but it will require additional attention to the details of the ontological structure. I doubt that quandaries about how to structure these relationships will have clear and unambiguous answers.

It is also true that sorting through large sets of articles is a lot of work, and perhaps a more organic and less systematic approach could eventually address systemic bias. But given how little attention this problem seems to get (WP:CSB has struggled to remain viable for years, and it teeters on the edge of inactivity), and given the scale of the problem, some organised and systematic approach carried out by a small number of editors seems advisable. This approach is intended to save labour in the long run by facilitating script-aided editing.