User:DexDor/HowManyCatsOnPage

How many categories per page?
The question:
 * Are there predictors of the number of categories assigned per page? Is there a relationship between the number of categories assigned to an article and the article's length, the number of edits it has had, or its age?

Firstly, some assumptions (to limit the scope of the question):
 * Assumption 1 - We're only considering articles (not other types of pages).
 * Assumption 2 - We're only considering normal reader-side categories (that categorize an article by the article's topic). I.e. we're not considering maintenance categories (that are usually added by a template and hidden from readers), talk page categories (e.g. for wikiprojects) and any other categories based on characteristics of the article itself (such as the article being a stub).

An article should be in every category for which the article satisfies the category's inclusion criteria. At this point another assumption is needed to limit the complexity of this analysis:
 * Assumption 3 - Each category (at least in the parts of the category structure relevant to the article) has clear inclusion criteria (from the category name, any category text and any categorization guidelines). Note: There are (currently) many en wp categories for which the inclusion criteria are not clear - e.g. where editors disagree about what the inclusion criteria for the category are.

The number of category tags that there should be on an article depends on the structure of the category tree. An article should be in every category for which the article satisfies the category's inclusion criteria, but it should only have category tags for the most specific of those categories. For example, placing an article in a category for Belgian actors (assuming the category structure is correct) also places the article in the category for Belgian people and in the category for actors so the article should not have category tags for those 2 categories.
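
The "most specific categories only" rule above can be sketched in Python. This is a minimal illustration, not how MediaWiki works internally: the `PARENTS` map, the category names and the helper functions are all hypothetical, and the sketch relies on the category structure being loop-free.

```python
# Hypothetical parent map: each category -> its direct parent categories.
PARENTS = {
    "Belgian actors": {"Belgian people", "Actors"},
    "Belgian people": {"People"},
    "Actors": {"People"},
    "People": set(),
}

def ancestors(category, parents):
    """Return every category above `category` (parents, grandparents, ...)."""
    seen = set()
    stack = list(parents.get(category, ()))
    while stack:
        current = stack.pop()
        if current not in seen:  # visit each category once, so the walk always terminates
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def most_specific(satisfied, parents):
    """Keep only the categories that are not above another satisfied category.

    Assumes the structure has no category loops; in a loop every member
    is above every other member, so all of them would be dropped.
    """
    implied = set()
    for category in satisfied:
        implied |= ancestors(category, parents)
    return satisfied - implied

# The article satisfies the inclusion criteria of all four categories...
satisfied = {"Belgian actors", "Belgian people", "Actors", "People"}
# ...but only the most specific one needs a category tag.
tags = most_specific(satisfied, PARENTS)
print(sorted(tags))  # ['Belgian actors']
```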

Thus, another assumption is needed to limit the complexity of this analysis:
 * Assumption 4 - The category structure (at least the parts of the category structure relevant to the article) conforms to the rules of categorization (e.g. more specific categories have been placed below appropriate parent categories). Note: There may be some categories where the parentage of the category is incorrect, e.g. where it forms a category loop.

When an article is placed (directly) in a category the article also becomes a member of all the parent categories (right up to Category:Contents assuming that the category structure doesn't have a major flaw). Thus, there isn't a direct relationship between the number of category tags on an article and the number of categories the article is in. For example, this edit replaced 3 category tags on an article with 1 category tag, but the new category tag placed the article in a category that is a sub-category of the 3 categories that it was in before so the edit has increased the number of categories that the article is in.
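
The difference between category tags and category membership can be sketched as follows. The category names "A" to "D" and the `PARENTS` map are hypothetical stand-ins (they are not the categories from the edit mentioned above), and `all_categories` is an illustrative helper, not a MediaWiki function.

```python
# Hypothetical structure: D is a sub-category of A, B and C,
# which all sit directly below the top-level category.
PARENTS = {
    "D": {"A", "B", "C"},
    "A": {"Contents"},
    "B": {"Contents"},
    "C": {"Contents"},
    "Contents": set(),
}

def all_categories(tags, parents):
    """Return the tagged categories plus every category above them."""
    member_of = set()
    stack = list(tags)
    while stack:
        current = stack.pop()
        if current not in member_of:  # visit each category once (also stops on loops)
            member_of.add(current)
            stack.extend(parents.get(current, ()))
    return member_of

before = all_categories({"A", "B", "C"}, PARENTS)  # 3 category tags
after = all_categories({"D"}, PARENTS)             # 1 category tag

print(len(before), len(after))  # 4 5
```

Here the article goes from 3 tags to 1 tag, yet the number of categories it is a member of goes up by one, because the new category sits below all three of the old ones.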

 * Category tags currently on a page = TSC + TIC - TNC
 * TSC = Category tags that should be on the page (for the current category structure).
 * TIC = Category tags that are on the page incorrectly (for the current category structure). This includes redlink category tags, duplicate category tags, pages placed in a words&phrases category incorrectly, ...
 * TNC = Category tags that should be on the page that aren't (for the current category structure). E.g. a new article that hasn't yet been categorized, an article where the category tags have been removed by a vandal, an article that hasn't yet been placed in a new category.
 * TSC = TSP + TIP - TNP
 * TSP = Category tags that should be on the page (in a perfect category structure)
 * TIP = ...
 * TNP = ...

This assumes no disagreement about category inclusion criteria, non-diffusion etc.
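
As a worked example of the first identity (tags on page = TSC + TIC - TNC), the sketch below derives the three counts from a list of tags on a page and the set of tags that should be there. The category names and the `tag_breakdown` helper are hypothetical; any tag outside the "should" set, and any extra copy of a tag, is counted as incorrect.

```python
from collections import Counter

def tag_breakdown(current_tags, should_tags):
    """Return (TSC, TIC, TNC) for one page.

    current_tags: list of tags on the page (may contain duplicates).
    should_tags:  set of tags that should be on the page.
    """
    current = Counter(current_tags)
    should = set(should_tags)
    tsc = len(should)
    # Incorrect tags: every copy of a tag that should not be there,
    # plus every extra (duplicate) copy of a tag that should be there.
    tic = sum(n if tag not in should else n - 1 for tag, n in current.items())
    # Missing tags: should be on the page but are not.
    tnc = len(should - set(current))
    return tsc, tic, tnc

should_tags = {"Aviation museums", "Museums in Footown", "2017 establishments"}
current_tags = ["Aviation museums", "Aviation museums", "Redlink category"]

tsc, tic, tnc = tag_breakdown(current_tags, should_tags)
print(tsc, tic, tnc)      # 3 2 2
print(tsc + tic - tnc)    # 3, the number of tags actually on the page
```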

Variation in number of category tags by type of article
Articles can be divided into the following types:
 * Intersection articles (where the article's topic is the intersection of 2 or more topics) - e.g. Agriculture in Brazil. Such articles generally only have a small number of category tags (e.g. to place that article in Category:Agriculture and Category:Brazil). Only a small fraction of Wikipedia articles are intersection articles.
 * Biographical articles. These are sometimes categorized such that one article can have several different values of one characteristic - e.g. a person might be categorized for several nationalities and several occupations. This can lead to many category tags on some articles.
 * Other articles - e.g. an article about a town, an event or an animal species. These are usually categorized such that for each characteristic the article only has one value.

The graph to the right shows how the number of category tags on articles varies with article length.

Number of categorization characteristics
Another (possibly more useful) thing to consider is how many characteristics an article is categorized by.

For example, an article about a museum might be categorized by the type of museum, its location and its year of opening (i.e. 3 characteristics). If each category was for a single characteristic then that article would have 3 category tags. For example, an aviation museum in Footown that opened in 2017 might have category tags for aviation museums, Footown and 2017 establishments. Those categories would each be part of more general categories (e.g. for transport museums).

However, in Wikipedia we also allow categories that are for combinations of 2 or more characteristics (intersection categories). This can reduce the number of category tags on an article (e.g. if one category covers all 3 of the museum's characteristics at the most detailed level), but more often it increases the number of category tags because the characteristics can be combined in different ways at different levels of detail - museums in Footown, aviation museums in Fooland, transport museums opened in 2017, aviation museums opened in the 2010s, museums in Fooland opened in the 2010s etc.

Where articles have more than about 4 characteristics (as many articles do) the number of possible combinations of characteristics becomes large, and if there were a category for every combination it would take a lot of effort to add all the applicable category tags to articles - or (more likely) lead to categories being left incomplete.

Hence, editors constructing the categorization scheme should be careful to avoid having too many intersection categories.
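
How quickly the number of possible combinations grows can be illustrated with a short count. This is a rough sketch under stated assumptions: every characteristic is treated as independent, the second count assumes each characteristic has exactly 2 levels of detail (e.g. a town and its country, or a year and its decade), and both helper functions are hypothetical. The counts include single-characteristic categories and describe how many categories could exist, not how many tags an article would carry.

```python
from itertools import combinations
from math import prod

def count_combinations(characteristics):
    """Count the non-empty combinations of characteristics (one level of detail each)."""
    return sum(
        1
        for size in range(1, len(characteristics) + 1)
        for _ in combinations(characteristics, size)
    )

def count_with_levels(levels_per_characteristic):
    """Count combinations when each characteristic can be left out or used at one of its levels."""
    return prod(levels + 1 for levels in levels_per_characteristic) - 1

for k in (3, 4, 5, 6):
    names = [f"characteristic {i}" for i in range(k)]
    print(k, count_combinations(names), count_with_levels([2] * k))
# 3 7 26
# 4 15 80
# 5 31 242
# 6 63 728
```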

Categorization of categories
Category:War of 1812 was in about 30 parent categories as of 2019:
 * Category:Invasions of the United States
 * Category:19th-century conflicts
 * Category:19th-century military history of the United Kingdom
 * Category:1810s in the United States
 * Category:1812 in the United States
 * Category:1813 in the United States
 * Category:1814 in the United States
 * Category:1815 in the United States
 * Category:1810s in Canada
 * Category:1812 in Canada
 * Category:1813 in Canada
 * Category:1814 in Canada
 * Category:1815 in Canada
 * Category:United Kingdom–United States military relations
 * Category:Canada–United States military relations
 * Category:Wikipedia categories named after wars
 * Category:Conflicts in 1812
 * Category:Conflicts in 1813
 * Category:Conflicts in 1814
 * Category:Conflicts in 1815
 * Category:Conflicts in Canada
 * Category:History of the United States (1789–1849)
 * Category:Wars involving the indigenous peoples of North America
 * Category:Wars involving the United States
 * Category:Wars involving the United Kingdom
 * Category:Warfare of the Industrial era
 * Category:Military history of Quebec
 * Category:Napoleonic Wars
 * Category:Eras of United States history
 * Category:19th-century military history of the United States
 * Category:1810s in the United Kingdom

It has also been in other categories:
 * Category:Military history of Ontario
 * Category:Presidency of James Madison
 * Category:Folklore of the Southern United States
 * Category:American folklore
 * Category:Canadian folklore
 * Category:Pirates
 * Category:Naval battles involving pirates