User:DexDor/Categorization

Categorization, if used correctly and consistently, may provide a useful facility for some Wikipedia editors and for some readers. However, placing articles in too many categories (overcategorization) can have negative consequences.

Ideally, the amount of categorization (i.e. the number of categories that each article is placed in) should be enough to realise the benefits of categorization whilst minimising the costs.

For most readers and for many editors categorization is an irrelevance.

Benefits of categorization in Wikipedia
Uses of categorization:
 * Wikipedia categories can be used (by both readers and editors) to find an article on a specific topic (e.g. when they can't remember the name of the person they want to read about) or to navigate between articles. However, other facilities (e.g. templates) often provide a better way of doing this (particularly for readers).
 * Wikipedia categories provide a structured way for editors to look at all the pages of a particular type (e.g. articles on a particular subject) - including those articles that have few links from other articles (which are often articles that are in a poor state).
 * Categorization is a way for editors to spot duplicate articles and content forks.

Costs of categorization in Wikipedia
Costs of categorization include:
 * Edits to the categorization of an article appear in the article's edit history and hence on the watchlist of those maintaining the article. This "watchlist noise" causes extra work for those editors protecting the article against vandalism etc.
 * Editor time is spent working on categorization, rather than on the text of articles. This includes both maintenance of the category tags on articles and maintenance of the category structure. Several editors have asked whether that is worthwhile.
 * Categories that are not fully populated or are badly organized may mislead readers/editors into thinking that Wikipedia does not have an article on the topic they are looking for.
 * Readers/editors may assume that categorization is correct (when it isn't).

Characteristics used for categorization
A Wikipedia article can contain hundreds, even thousands, of pieces of information - e.g. an article about a city may mention the city's opera house, football team etc. In theory, each of these could be a characteristic to categorize by (e.g. we could have a category for articles about "Cities that have had an openly gay mayor"). However, that sort of categorization could cause articles to be in hundreds of categories and require a huge amount of maintenance (on both the articles and the huge category trees that would result). Instead, Wikipedia categorization is based on categorizing articles only by the most important characteristics of the topic of the article (plus a few categories required for administrative reasons). In Wikipedia these are called "defining characteristics". The exact meaning of that term will probably never be agreed by all editors, but the principle is generally accepted. So, for example, the article about a city is normally in a category like "Cities in " and a few other categories for important long-term characteristics (like being a capital city or being on the coast); of the hundreds of facts in the article only a small number are used for categorization.

Problems
Editors interested in a particular topic tend (perhaps inevitably) to view characteristics that relate to that topic as being of particular importance. For example, an editor interested in time zones may think that a city's time zone is an important characteristic of the city. Other editors might be more interested in the types of public transport a city has, the ethnicity of its inhabitants, sporting events held in the city etc. (these are all examples of categories that have been deleted). Some editors even place articles in a category despite the articles not mentioning the characteristic that the category is about. If all these editors got their way then a link to "their" category would appear at the bottom of lots of articles, but it would be hidden amongst hundreds of other categories and thus unlikely to be used to navigate from the article.

Sometimes editors start from an off-wiki list and try to add all the corresponding Wikipedia articles to a category, regardless of whether each article's contents show that it meets the inclusion criteria for the category. Some examples:
 * An editor categorizes articles based on results of a census, even though the article text doesn't contain the results from that census.
 * An editor finds "Hutu" on a list of Māori plant names so places the Hutu article in that category even though that article is about an ethnic group in Africa.

Similarly, editors adding articles to a category based on an off-wiki list may miss articles that meet the inclusion criteria, but aren't on their list. An example:
 * An editor puts articles about places into a category about a path (that goes through the places), but doesn't categorize an article about a tunnel that the path goes through.

Some examples of overcategorization:
 * A person in 5 "descent" categories (i.e. categorizing the person based on the ancestry of a great-grandparent).
 * A person in 7 "people from" categories.
 * A person in 5 categories relating to their suicide.

Consider, for example, an article about a singer. An editor interested in awards might look at the part of the article listing awards the person has received, create categories such as "Winners of " and place the article in those categories. Someone interested in festivals might place the article in categories for "People who performed at ", someone interested in personal lives might add category tags for "People who have dated ", and so on. The list of categories would then be as long as the article - in fact it could be much longer, as there can be categories for combinations of characteristics; so the article might be in categories for both "Singers who performed at " and "People from who performed at ".
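The growth in the number of categories can be made concrete with a little arithmetic. The sketch below assumes a hypothetical singer article with five characteristics (the names are placeholders, not real Wikipedia categories) and counts how many categories the article could end up in if every combination of characteristics were given its own category. The function name `combination_categories` is invented for this illustration.

```python
from itertools import combinations


def combination_categories(characteristics, max_size):
    """Return every combination of 1..max_size characteristics.

    Each combination stands for one possible category, e.g.
    ("singer", "festival performer") for singers who performed
    at a festival.
    """
    result = []
    for size in range(1, max_size + 1):
        result.extend(combinations(characteristics, size))
    return result


# Hypothetical characteristics of one singer (placeholders only).
traits = [
    "singer",
    "award winner",
    "festival performer",
    "from town X",
    "dated person Y",
]

singles = combination_categories(traits, 1)               # one category per characteristic
up_to_pairs = combination_categories(traits, 2)           # singles plus every pair
all_combos = combination_categories(traits, len(traits))  # every non-empty combination

print(len(singles), len(up_to_pairs), len(all_combos))
```

With only five characteristics the count already rises from 5 to 15 once pairs are allowed, and to 31 if every combination is allowed; each additional characteristic roughly doubles the last figure.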

A common problem in Wikipedia is that wherever a list lacks a precise definition of what is eligible for inclusion, (well-meaning) editors keep adding "just one more" item to it. This happens both with new categories and with category tags on articles.

Similar articles and related articles
Categories are for grouping articles about similar topics; that is not quite the same thing as grouping articles about related topics. For example, an article about a soldier who was awarded a medal for his actions in a particular battle should be linked to articles about related topics (e.g. the article about the battle, his regiment, weapons used etc), but in categorization his article should be grouped with articles about similar topics (i.e. other soldiers decorated for valour) even though there are few direct links between such articles.

Another example: Charles Darwin and HMS Beagle are related topics - the articles are linked to each other, but in categorization one belongs under people categories (e.g. Category:English naturalists) and the other belongs under ship categories (e.g. Category:Ships of the Royal Navy). In this case categorizing articles because they are related can lead to a category loop (Category:Charles Darwin and Category:HMS Beagle).
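A category loop of this kind can be found mechanically. The sketch below is illustrative only: it assumes a hypothetical in-memory mapping from each category to the categories it has been placed in (the parent links shown are made up for the example, not read from the live category tree), and the function name `find_category_loop` is invented here. It uses a depth-first search and reports the first loop it encounters.

```python
from typing import Dict, List, Optional


def find_category_loop(parents: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one loop as a list of categories (first == last), or None."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(cat: str) -> Optional[List[str]]:
        state[cat] = ON_PATH
        path.append(cat)
        for parent in parents.get(cat, []):
            status = state.get(parent, UNSEEN)
            if status == ON_PATH:
                # The parent is already on the current path: a loop.
                return path[path.index(parent):] + [parent]
            if status == UNSEEN:
                loop = visit(parent)
                if loop is not None:
                    return loop
        path.pop()
        state[cat] = DONE
        return None

    for cat in list(parents):
        if state.get(cat, UNSEEN) == UNSEEN:
            loop = visit(cat)
            if loop is not None:
                return loop
    return None


# Hypothetical: each category placed in the other because the topics are related.
by_relatedness = {
    "Category:Charles Darwin": ["Category:HMS Beagle"],
    "Category:HMS Beagle": ["Category:Charles Darwin"],
}

# Hypothetical: each category placed only under categories for similar topics.
by_similarity = {
    "Category:Charles Darwin": ["Category:English naturalists"],
    "Category:HMS Beagle": ["Category:Ships of the Royal Navy"],
}

print(find_category_loop(by_relatedness))
print(find_category_loop(by_similarity))
```

Grouping by similarity keeps the people tree and the ship tree separate, so the search finds no loop in the second mapping.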

Solutions
Possible solutions to some/all of the problems outlined in this essay include: (Under construction)
 * Delete the entire category system.
 * Delete bad categories that have been created.
 * Reduce the number of bad categories being created.

Cost-benefit analysis of a category
Cost-benefit analysis for (particular types of) categories:

Costs

 * Clutter, making it harder to see the genuinely defining categories (the main categories are usually, but not always, at the start of the list of categories).
 * Watchlist noise (e.g. potentially hiding vandalism).
 * Encourages the creation of more bad categories.
 * Editor time could be spent more productively.
 * Bandwidth for upload/download (minimal).

Specific topics

 * Awards recipients - see User:DexDor/Categorization of award recipients