Wikipedia:Wikipedia Signpost/2020-05-31/Special report


 * Marc Miquel, Ph.D. is a user researcher and lecturer who focuses on diversity in Wikipedia, engagement in games and user experience in general. The article was originally published by Wikipedia@20 and is licensed CC BY 4.0 -S 
 * No matter whether a language is a real entire worldview or not, a collaborative encyclopaedia like Wikipedia provides the best chance to allow any language speakers to immortalize it. -Editors of Wikipedia@20 

Though I had used Wikipedia for years, it was only ten years ago when I discovered how each language edition community can freely organize its content—as there is no central editorial board. The Catalan version of the encyclopedia, in my native tongue, can have pages dedicated to its culture without impediment. Some might take this for granted, but I cherished this principle because of my memories of my grandfather, who was forbidden to speak his language in public during the forty years of Franco's dictatorship, and of my mother, who did not have the chance to be educated in her mother tongue. I did not immediately become a contributor, but I wanted to learn more and, hopefully, one day give back. Today, I am doing so as a researcher with the Wikipedia Cultural Diversity Observatory (WCDO). Though the English Wikipedia has brought much attention to the larger Wikimedia project, that project's future and potential growth lie in many smaller languages and cultures, which are often overlooked—and under threat, as many human languages are likely to disappear by the end of the century.

The poet Ezra Pound said that "the sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension." Obviously, the same is true of Wikipedia. At the observatory, we work to discover the knowledge that is local to each language, the cultural pearls from every place in the world, and we promote its exchange. I believe this can be advanced using a model assessing project cultural diversity. Such a model will then allow us to better encourage Wikipedia language communities to raise awareness, organize events, adopt tools, and incorporate cultural diversity as part of our strategic plans.

Researching the cultures in Wikipedia language editions
Although cultural diversity appears now to be a crystal-clear priority for the movement, it was not that obvious in 2011, when I attended my first Wikimania. In the most popular and crowded Wikipedia conference, the multitude of nationalities reminded me of the encyclopedists' version of the United Nations. Our apparent differences were in clothing, colors, gestures and many other details. Before the conference, a friend of mine asked me a key question: if English Wikipedia has most of the articles, why should there be hundreds of other language editions? I hesitated a bit, and my answer was that for the different language editions to exist, they had to be different.

Finding these differences became my main interest in Wikipedia. Even though I was initially more focused on the Catalan Wikipedia, I found an exciting quest in using algorithms to compare the contents from any language edition. I could see the extent and particularities of the coverage of each topic in each language as if they were patterns revealed in an aerial view, unperceivable to the eyes of other editors. Analyzing the editors' behavior and the extent of topics in articles became the object of my Master's thesis and later of my Ph.D. thesis. By understanding how this editing process unravels in the data and other researchers' work, I found many reasons to justify the need for multiple language editions. I will try to summarize them into three.

The first aspect I saw during my research was that the articles of every language edition are limited to specific groups of points of view or have a "linguistic point of view." This was something intuitive to any Wikipedia user. Some topics are dealt with very differently in the Catalan and Spanish Wikipedias, especially those concerning politics and culture. Brent Hecht and Darren Gergle showed us that these variations in points of view between the language versions of the same article could be measured by taking into account the outgoing links in the text they have in common. Even in general topics, like "Psychology," one can find differences of 20% in the links pointing at different articles. Massa and Scrinzi pointed out that topics that elicit controversy, for instance, articles about the terrorist "Osama Bin Laden" or the international struggle "Israeli-Palestinian conflict," showed the fewest links in common.
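To make the idea of comparing outgoing links concrete, here is a minimal sketch in Python. It is not the exact measure used in the studies cited above: it assumes each link has already been mapped to a language-independent identifier, and it uses the Jaccard index as one plausible way to express "links in common." The identifiers and the function name are hypothetical.

```python
def shared_link_ratio(links_a: set[str], links_b: set[str]) -> float:
    """Share of outgoing links that two versions of an article have in common.

    Computed as |A and B| / |A or B| (Jaccard index). This is an illustrative
    choice, not necessarily the measure used by the cited researchers.
    """
    union = links_a | links_b
    if not union:
        # Two versions without any links are treated as identical.
        return 1.0
    return len(links_a & links_b) / len(union)


# Hypothetical link sets for the same article in two language editions.
catalan_links = {"topic:mind", "topic:behaviour", "topic:freud", "topic:brain", "topic:llull"}
spanish_links = {"topic:mind", "topic:behaviour", "topic:freud", "topic:brain", "topic:cajal"}

overlap = shared_link_ratio(catalan_links, spanish_links)
difference = 1.0 - overlap
```

In this toy case the two versions share four of six distinct links, so about a third of the links point at different articles.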

This led me to think that even though Wikipedia asks for a neutral point of view (NPOV) (i.e. a fair representation of the different available points of view on a topic), we know this is an ideal. Since a language edition is a community phenomenon, group interests and power dynamics tend to reinforce or undermine certain points of view. Some perspectives are unknown or simply ignored, and very few are novel or exclusive to that particular group of speakers. This latter category is very valuable. Such novelty and uniqueness is, in fact, a valuable contribution, and should be seen as a complement to other language editions.

Linguists sometimes defend a linguistic perspective by saying that every language is a specific worldview, or at least, one of a particular context. Each language you speak gives you concepts to map things and situations, and classify them according to the experience of generations. Any language accumulates knowledge in the vocabulary used to label the species of plants, the nouns to describe climatological changes in the natural environment, and the idioms and adjectives that have originated to understand human character and history in a specific way. Being able to compare linguistic differences and observe from multiple perspectives allows you to contrast and understand reality better.

The eminent linguist Benjamin Lee Whorf went a bit further with this perspective and reinforced the idea that we need more than one language to gain depth in thinking. He claimed that all knowledge is provisional, and therefore, multilingual competencies allow you to advance faster in its development. "Western culture has made, through language, a provisional analysis of reality and, without correctives, holds resolutely to that analysis as final. The only correctives lie in all those other tongues which by aeons of independent evolution have arrived at different, but equally logical, provisional analyses." This quote inevitably reminded me of how Wikipedia allows us to compare the different points of view, jumping through the parallel versions of an article that exists in several language editions.

The second aspect I saw during my research was that the language editions are influenced by the territories where the language is spoken and they are the most complete at creating content about them. Hecht and Gergle measured in several language editions the number of links directed to articles geolocated on the territories where the language is spoken. With such a simple metric they could determine that each Wikipedia tends to be self-focused, as results indicated that these articles received many more links than other geolocated articles, i.e., they were more prominent in the linked graph structure.
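A toy version of such a self-focus comparison might look like the following Python sketch. The data, field names and the choice of comparing mean incoming links are assumptions made for illustration only; they are not taken from Hecht and Gergle's work.

```python
from statistics import mean

# Hypothetical geolocated articles in one language edition, with the number
# of incoming links and a flag for whether the place lies in a territory
# where the language is spoken.
geolocated_articles = [
    {"title": "Place A", "in_language_territory": True, "inlinks": 120},
    {"title": "Place B", "in_language_territory": True, "inlinks": 80},
    {"title": "Place C", "in_language_territory": True, "inlinks": 100},
    {"title": "Place D", "in_language_territory": False, "inlinks": 20},
    {"title": "Place E", "in_language_territory": False, "inlinks": 30},
    {"title": "Place F", "in_language_territory": False, "inlinks": 10},
]


def mean_inlinks(articles: list[dict], in_territory: bool) -> float:
    """Average incoming links for articles inside or outside the territory."""
    counts = [a["inlinks"] for a in articles
              if a["in_language_territory"] == in_territory]
    return mean(counts)


local_mean = mean_inlinks(geolocated_articles, in_territory=True)
other_mean = mean_inlinks(geolocated_articles, in_territory=False)

# A ratio well above 1 would suggest a self-focused link structure.
self_focus_ratio = local_mean / other_mean
```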

Even though geolocated articles show relevant language differences, one could argue that this is only a small portion of each Wikipedia. The articles about many other topics such as traditions, history, organizations, politics, and so on can explain the idiosyncrasies of any culture and the territories where the language is spoken. This way, by collecting all the articles about these topics, I thought we could get a better idea of what is genuine in the cultural and geographical contexts of every language edition.

I hence proposed an algorithm to collect such articles and I entitled the selection of articles "Cultural Context Content" (or CCC). My first questions were (1) how many articles would each Wikipedia dedicate to their cultural contexts, and more importantly, (2) what would be the extent of this group of articles.

A supposition concerning the Catalan Wikipedia was that it would overcompensate for the linguistic and cultural genocide suffered during the past century and that it would also be influenced by the current political self-determination struggle. This might result in an exaggerated number and proportion of articles set in this cultural context, which would be centered around Catalonia, Valencia, Balearic Islands, Andorra and a few scattered territories in the south of France and in the Aragonese autonomous community. Surprisingly, the proportion was only 20% and since the first measurement, it has decreased to the current 17.09%. Taking into account the top forty language editions, the average proportion of content dedicated to their cultural context is a quarter of each Wikipedia. Some like the English and the Japanese presented more than half of them. Others like the German, French, and Italian had lower proportions (33.7%, 26.9%, and 18.8% respectively).



It is difficult to answer why some Wikipedia language editions dedicate more articles to their context than others, as it may depend on many factors. The proportion of articles dedicated to CCC is not related to the density of the population, nor the number of editors, nor the territorial area. But it is surely an indicator of appreciation towards their culture and places. The fact that the proportion of articles dedicated to CCC remains stable over time in every Wikipedia language edition, implies that editors are motivated to continuously create and represent the most significant places around them. This came as a surprise to me, as I expected it would decrease with the growth of each Wikipedia language edition. Why would editors continue to create articles about their culture after the main cities, political figures and historical events have already been documented?

In the beginning, I was not sure whether to consider the large extent of CCC as an undesired bias. But my interpretation of the presence of these "local encyclopedias" drifted from acceptance to encouragement, especially when I realized that the proportion of pageviews was even higher than the proportion of articles itself. Then I assumed that each Wikipedia could have a fundamental role in illuminating the context of each language to readers, and this is probably a key ingredient to explain the overall success and popularity of Wikipedia. One could say that the differences that every language edition presents are even more valuable to readers than to editors, which totally justifies the effort.

Even though I have not yet verified whether this higher reader interest in CCC articles applies to all language editions, the hypothesis "context-encyclopedia-key-ingredient-to-success" is very plausible. In smaller Wikipedias with little traffic, we see the inverse trend. For instance, in some African vernacular languages, the proportion of articles dedicated to their context is very low. Considering 39 Wikipedia language editions in Africa, the average proportion of articles dedicated to each cultural context is 11.1% (median 13.8%). Why is that so? Because these languages are often relegated to a private use while English or French is used for education and official matters. Only Afrikaans—a language with a social situation similar to European languages—has 23.9% of content dedicated to its context. Hence, we can say that cultural context content creation and consumption is a good indicator of a healthy Wikipedia in a society.

The third and probably the most relevant aspect I saw during my research was that Wikipedia language editions do not cover one another's cultural context content, i.e. they do not have sufficient cultural diversity in their content. In 2012, Bao, Hecht, Carton, Quaderi, and Horn found that there is a language gap between Wikipedia language editions, that is, every language edition has many articles with no equivalent version in other languages. Also, contrary to what my Wikimania friend thought (that English Wikipedia would be the only necessary language edition, a sort of "catch-all encyclopedia"), bigger language editions do not cover the articles from smaller ones. Considering this, I wondered whether this language gap could be due to the cultural context. The results showed that, on average, 60% of the articles that are not translated into any other language edition are related to the language's cultural context.
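The arithmetic behind that last figure can be sketched in a few lines of Python. The records below are invented, and the field names (`other_language_versions`, `is_ccc`) are hypothetical; the sketch only shows how the share of untranslated articles that belong to the cultural context could be computed once each article has been labeled.

```python
# Hypothetical articles from one language edition. Each record says in how
# many other language editions an equivalent article exists and whether the
# article was classified as Cultural Context Content (CCC).
articles = [
    {"title": "Local festival", "other_language_versions": 0, "is_ccc": True},
    {"title": "Village church", "other_language_versions": 0, "is_ccc": True},
    {"title": "Regional poet", "other_language_versions": 0, "is_ccc": True},
    {"title": "Obscure mineral", "other_language_versions": 0, "is_ccc": False},
    {"title": "Minor asteroid", "other_language_versions": 0, "is_ccc": False},
    {"title": "Abacus", "other_language_versions": 150, "is_ccc": False},
    {"title": "Capital city", "other_language_versions": 90, "is_ccc": True},
]

# The "language gap": articles with no equivalent in any other edition.
untranslated = [a for a in articles if a["other_language_versions"] == 0]

# Share of that gap that is related to the language's cultural context.
ccc_share_of_gap = sum(a["is_ccc"] for a in untranslated) / len(untranslated)
```

With these invented records, three of the five untranslated articles are CCC, which gives a share of 60%.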

When CCC articles were shared across languages, it tended to be with those geographically closer or with those language editions which had the largest number of articles (especially the English, German and French Wikipedias). It surprised me that sometimes articles related to the context of small Wikipedias were not covered at all, even though one might think it would be an easier effort for the community of editors. Some Wikipedians told me that multilingualism dynamics tend towards translating from bigger language editions into smaller ones. Besides, one must also consider the difficulties in accessing content in an unknown language about unknown territories. As a result, big Wikipedia language editions do not cover the diversity of knowledge available in smaller languages either.

The excess content about the Western world is part of this so-called systemic bias. To me, it seems the large amount of content Wikipedias devote to their context-based institutions, entertainment and sports is not the problem – as it is popular and read. Instead, it is the lack of reciprocal content about their cultural contexts that impedes reaching a minimum of content about the world's cultural diversity. Perhaps even more important is the struggle of these small encyclopedias to represent their cultural context. We have to work in both cases.

The debate on the role of Wikipedia in the future of languages and human knowledge
The first article in a non-English Wikipedia was in the Catalan Wikipedia. It was about Àbac (Abacus), an ancient calculating tool, and it was written by an editor from Andorra named Cdani, who asked Jimmy Wales, the co-founder of Wikipedia, to create a Catalan Wikipedia where Catalan editors could write in their native language, so as "to not inflict his terrible English" on the English Wikipedia, which had been created two months earlier. In fact, Wikipedia has always been global and the need for growth is still very present. At Wikimania 2018, held in South Africa, Jimmy Wales reminded the community about the "desire to be in every language and every culture, on every continent and in every place" and celebrated the first thousand articles in the Zulu Wikipedia.

With the recognition of milestones being reached by small languages, the Wikimedia movement acknowledges that information and knowledge are determinants of wealth creation and social development for any society in general. For several years this has been one of the main directives of UNESCO, which claims that the inclusion of languages in the digital world is urgent, as the digital divide will only increase their marginalization. In this sense, Wikipedia has set a long-term strategic direction aimed at knowledge equity by 2030. This is understood as putting "the focus of our efforts on the knowledge and communities that have been left out by structures of power and privilege," by breaking down "the social, political and technical barriers preventing people from accessing and contributing to free knowledge."

Barriers such as the digital divide—lack of Internet—prevent millions of people from using Wikipedia. At the same time, the inclusion of new languages in the Wikipedia project is not as easy as encouraging their speakers to become editors, as they come across other obstacles as well. Van Dijk states that the lack of language standardization including common grammar, the degree of editor literacy, and the language status or the attitude of speakers towards their language, are all factors that have a major impact. This latter factor is especially delicate as speakers should have the conviction that their language is worthy of such endeavor. But when the speakers internalize marginalization and a subsidiary position, it becomes very difficult to envisage that history could have been different, and revitalize the language and grow a Wikipedia.

I believe that the problem of little content in Wikipedias of less-resourced languages should not only be seen as a language problem but also as a local knowledge problem considering that language and knowledge are inextricable. I am certain that a way to help speakers of endangered languages to enter or expand Wikipedia is to send them the clear message that their knowledge matters, and that it is what we need to reach the best depiction of human cultural diversity. Conceivably, the language problem cannot be tackled without tackling the recognition of their speakers' knowledge, encouraging its representation, or at least the representation of its most relevant concepts (i.e. geographical places, traditions, and leaders) from the speakers' points of view – as we suggest in the Wikipedia Cultural Diversity Observatory.

During the first ten years, Wikipedia grew to include some 260–270 active language editions, and since then it has remained stable at around 300. This is an incredibly low number compared to the approximately 7,000 languages that reportedly exist on the planet. Many linguists like Andrew Dalby foresee massive language death in the next decades. András Kornai presents evidence of a massive die-off caused by the digital divide and estimates that only 5% of all languages can obtain an online presence (i.e. around 350 languages). How they can defy this fate and survive remains an open question. But it seems obvious to me that Wikipedia is the best available strategy for these endangered languages, independently of whether they are fully revitalized or not.

When we fear language loss, what we may fear precisely is the disappearance of that aforementioned worldview, one that required some time to get established and refined. No matter whether it is a real entire worldview or not, a collaborative encyclopedia provides the best chance to allow any language's speakers to immortalize it. The use of Wikipedia and this local knowledge in education may be crucial in order to have a chance to pass it on to further generations, and in any case, Wikipedia's characteristics such as its wide variety of topics, linked nature, and extensive use of images constitute a corpus of knowledge essential to revitalize the language or to study its nuances at any future point. The difficulty lies in breaking down all the barriers and encouraging speakers to edit articles.

I do not doubt that, given that Wikipedia is one of the most visited websites on the Internet, its communities and strategic direction will react and set a clear example of leadership by making the necessary efforts to take up the cultural diversity challenge. Through the commitment to knowledge equity, Wikipedia is in a position to take a step forward towards cultural diversity. It would be easy to subscribe and commit to the UNESCO declaration on Cultural Diversity from 2001, which defines cultural diversity "as a source of exchange, innovation and creativity" […] "as necessary for humankind as biodiversity is for nature." This means that defending cultural diversity is not only a matter of respect for heritage but a pragmatic decision towards humanity's progress. The UNESCO declaration adds that "[cultural diversity is] the common heritage of humanity and should be recognized and affirmed for the benefit of present and future generations." Making a public commitment to this declaration, accompanied by measures such as revising content policies, would most surely bring positive results.

Maturity levels model for cultural diversity in Wikipedia communities
Once we have agreed that Wikipedia must take an active role in preserving cultural diversity, we might ask ourselves what we can do now with the current communities. How could we align all the movement members towards improving cultural diversity in their content? One way we find particularly useful is to evaluate the maturity of each language community in terms of cultural diversity. A maturity model allows us to understand the situation and the barriers an organization comes across when incorporating certain elements in order to succeed at a particular aspect. For cultural diversity in Wikipedia, we propose that each language community work on a) discourse, b) organization (through events and tools), c) degree of awareness of the gaps (through metrics and visualizations), and d) strategy (by setting goals and priorities).

Figure 2 below shows a preliminary version of the maturity model. The different sorts of barriers and levels are based on discussions I held with the communities during international Wikipedia conferences, while the different incorporated elements in the pursuit of cultural diversity are my suggestions. I named the levels: (1) Unintentional, (2) Spontaneous, (3) Organized, (4) Controlled and (5) Distributed.

The more a community moves towards the later levels, the more it is able to create a culturally diverse array of content (or closer to the sum of human knowledge in terms of cultural diversity) in its language and even contribute to the content of other languages. Having a mature understanding of cultural diversity implies that, first, you represent your cultural context (e.g. cities, monuments, leaders, etc.) and, second, you share this content by exporting it across the other language editions, as well as covering their cultural context content.



At the first level, Unintentional, cultural diversity is not yet a goal and not even a topic of discussion. The few editors working on the language edition try to cover the very basic encyclopedic knowledge usually based on a Western perspective. Cultural diversity is scarce considering the superficial knowledge a basic encyclopedia provides: world capitals, most spoken languages, among others. Editors usually come across barriers such as lack of Internet, lack of translation tools or lack of self-recognition of the value of their language and culture.

At the second level, Spontaneous, the community exists and in terms of cultural diversity, editors start creating content about nearby places and people, as they consider it valuable to readers. Even though there is no strategy, they recognize the value of representing their cultural context and of translating articles from other language editions – they incorporate certain elements of discourse. However, there are no community conversations on how editors should organize themselves to create content more efficiently (i.e. using lists or contests) and all contributions are spontaneous. They lack editors and an offline team to move further.

At the third level, Organized, a few people emerge within the community with an organizational mindset that allows them to propose topic-dedicated events. In terms of cultural diversity, some events are dedicated to visually representing their heritage (e.g. Wiki Loves Monuments), to spread it across other languages (e.g. Catalan Culture Challenge), and to cover the cultural context of other languages (e.g. Asian Month). Members of communities that reached this level sometimes have a big picture and are partially aware of the contents that are missing, but they lack measurements and tools to better organize themselves and prioritize their top value actions.

At the fourth level, Controlled, there are different new roles: event organizers, content experts, and international relations. They are able to consider the big challenge to cover the cultural context content of other language editions, and they engage in all sorts of events to do it. An example could be the regional Wikimedia CEE Spring contest organized by Central and Eastern European languages. At this level, the use of metrics and data visualizations in order to be aware of the content coverage is incipient, but it would be very useful to know the cultural context content of every language edition (% of articles) and the knowledge gaps. However, few editors access the metrics. With no regular measurement and no constant communication, the figures on cultural diversity and gaps might not trigger any further action.

At the fifth level, Distributed, cultural diversity is seen as a top priority. Communities count on different area experts (in the field of events, metrics, communication, etc.) and know how to establish reasonable goals and organize themselves to accomplish them. The degree of coverage of other cultural and geographical contexts is common knowledge across the community and editors are aware of the main knowledge gaps. Cultural diversity has its dedicated events and contests and it is also a recurring requirement for other contests based on general topics (e.g. Women, Art, Books, etc.). At this level, discourse, organization, indicators, and strategy are at an advanced stage for the community to represent the existing world cultural diversity. The community has a strong culture of addressing knowledge gaps and every member is able to find the necessary events and resources to do it. The metrics assessing the extent of the gaps are constantly visible in the different types of community communications (e.g. newsletter, mailing list, etc.) that reach the entire group, and the use of tools to browse valuable articles is common in events.

According to the model above, maturity in communities progresses one level at a time. If, for instance, a community is at level 2 (i.e. Spontaneous), it will not be able to fast forward to level 4 (i.e. Controlled) without first passing through level 3 (Organized), gaining the necessary community capacity. Each level requires revising the current processes with more skills and knowledge. While I am writing this, no community has reached the fifth level (and only a few are located on the fourth), because metrics and data visualizations are also being developed and are to be implemented by the end of 2019. I believe that the more awareness is raised on content cultural diversity and the more usable the tools become, the easier it will be for communities to embrace these values and practices. In the end, cultural diversity is a core value of the global movement and the different elements of the model are aimed at improving current activities.
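The five levels and the one-step-at-a-time rule can be written down as a small Python sketch. The class and function names are hypothetical and serve only to illustrate the progression described in this section; the model itself is still preliminary.

```python
from enum import IntEnum


class MaturityLevel(IntEnum):
    """The five levels of the preliminary maturity model."""
    UNINTENTIONAL = 1
    SPONTANEOUS = 2
    ORGANIZED = 3
    CONTROLLED = 4
    DISTRIBUTED = 5


def can_advance_directly(current: MaturityLevel, target: MaturityLevel) -> bool:
    """A community can only move to the level right after its current one."""
    return target == current + 1


def levels_to_pass_through(current: MaturityLevel,
                           target: MaturityLevel) -> list[MaturityLevel]:
    """Every level a community has to reach, in order, to get to the target."""
    return [MaturityLevel(value) for value in range(current + 1, target + 1)]


# A community at the Spontaneous level cannot jump straight to Controlled;
# it first has to pass through Organized.
direct = can_advance_directly(MaturityLevel.SPONTANEOUS, MaturityLevel.CONTROLLED)
route = levels_to_pass_through(MaturityLevel.SPONTANEOUS, MaturityLevel.CONTROLLED)
```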

Without metrics and tools, it is hard for communities to work on topics they may not be able to identify in a foreign language. Metrics may be useful to provide editors with specific points to address the cultural diversity or culture gap problem and have more impact on their contributions. In the near future, I hope to obtain feedback from the communities and understand more thoroughly the barriers that separate one level from another. For instance, the use of a survey would be helpful to obtain data and refine the model, while at the same time disseminating it. The maturity model for cultural diversity is a working vision to help language communities make progress through specific and attainable steps.

Towards a stronger sense of a global community
Thirty years before the commercialization of the Internet and forty years before the birth of Wikipedia, media theorist Marshall McLuhan anticipated that the world would become a global village. Each place would be connected through technology, and information would continuously flow without entailing cultural uniformity. The Internet may not have yet lived up to such humanist ideals, but I truly believe Wikipedia has managed to create a fascinating space, where speakers of any language can present information from different points of view, and search for consensus through a shared representation of provisional knowledge.

As I am writing these lines, I believe cultural diversity remains an unopened box to most of the movement. The sum of human knowledge cannot be contained in one language edition. The sum of human knowledge depends on representing and sharing the content of every language with other languages; in other words, it depends on the content exchange between languages. Current research shows that large language editions like English, French or German cover a considerable amount of content related to the cultural context of other languages, but this is not the general case, nor is it sufficient. We cannot be content when African languages do not reach even a minimal representation of their related cultural context, hence failing to provide a perspective on their leaders, places, food, and traditions, among other things.

All in all, I am confident that cultural diversity will become one of the main objectives in the future. Whenever I attend a Wikipedia meeting or event, I realize that we enjoy being part of a global community. Editors feel this sense of unity in diversity, and the very fact of recognizing the value of cultural diversity and fostering content exchange will strengthen the movement in many senses. I am not sure I can promise my grandfather or mother a specific extent to which Catalan will be used in the next century, the number of new Wikipedias in the next ten years, or the state of coverage of all cultural contexts by minor language editions. But I am positive that Wikipedia is the best possible way to spread human knowledge as there is nothing more Wikipedian than being culturally diverse.


 * Acknowledgments: Thanks to the valuable suggestions on improving the article to Robin Taylor, Laura Vincze, Joseph Reagle, David Laniado, Denny Vrandečić, Stephane Coillet-Matillon, and Jake Orlowitz.