User:Gracefool/What is a category?

This is a mini-essay on a problem in MediaWikiland: category policy. It was initially discussed at wikiEN-l.

N.B. I'm aware of previous discussions at Wikipedia talk:Categorization. This essay is a more thorough and defensible treatment of the issue, and it highlights the fallacies of many previous arguments.

Update 4 Jun 2005: The rule proposed here, "categories are sets/graphs, not trees", is now enshrined in policy =).

Introduction
What is a category? No-one knows. There isn't consensus on what a category is (see Wikipedia talk:Categorization). Is it a hierarchical tree, with all categorizations representing "is a" relationships? Or is it just a set, a group of related articles, which may belong inside one or more other sets?

This is an important question - just look at Categories for deletion. Changes to categories have more widespread effects than changes to articles, and have a greater possibly of annoying editors.

I believe that categories are, and should be, sets, not hierarchies.

Original purpose of categories
What was the original purpose of the categorization system? Development of a taxonomy of worldy knowledge? I don't think the developers are really that stupid (I'll expand on this below). AFAIK it was as a kind of automatic list-generator for related articles. Lists are sets, not hierarchies. Lists of "related articles" are sets, not hierarchies.

Current software
The way that categories have been developed in software supports the idea that categories are sets. There is implicit support for categories as sets because there is nothing to stop anyone from using them that way. None of the limits of a hierarchical system exist in the category software. Such software is the best way to enforce the idea of hierarchical categories, and would be easy to implement (eg. don't allow arbitrary parenting of categories).

Until policy is decided on (and, preferably, software upgraded to support it), categories will continue to be used as sets. Since sets include hierarchies, while hierarchies don't include sets, the current categorization system is one of sets.

Categories are inherently POV
A categorization system is a worldview. Therefore it is very hard for categories to be NPOV. The following quote from Clay Shirky expands:


 * Many networked projects, including things like business-to-business markets and Web Services, have started with the unobjectionable hypothesis that communication would be easier if everyone described things the same way. From there, it is a short but fatal leap to conclude that a particular brand of unifying description will therefore be broadly and swiftly adopted (the "this will work because it would be good if it did" fallacy.)


 * Any attempt at a global ontology is doomed to fail, because meta-data describes a worldview. The designers of the Soviet library's cataloging system were making an assertion about the world when they made the first category of books "Works of the classical authors of Marxism-Leninism." Melvil Dewey was making an assertion about the world when he lumped all books about non-Christian religions into a single category, listed last among books about religion. It is not possible to neatly map these two systems onto one another, or onto other classification schemes -- they describe different kinds of worlds.


 * Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can't get a standard til you have an agreement, and you can't force an agreement to exist where none actually does.


 * Furthermore, when we see attempts to enforce semantics on human situations, it ends up debasing the semantics, rather then making the connection more informative. Social networking services like Friendster and LinkedIn assume that people will treat links to one another as external signals of deep association, so that the social mesh as represented by the software will be an accurate model of the real world. In fact, the concept of friend, or even the type and depth of connection required to say you know someone, is quite slippery, and as a result, links between people on Friendster have been drained of much of their intended meaning. Trying to express implicit and fuzzy relationships in ways that are explicit and sharp doesn't clarify the meaning, it destroys it.

The whole concept of an all-encompassing hierarchical category system is against the spirit of Wikipedia. It is an all-encompassing worldview, or attribution of value, to the marked-up (categorized) articles.

The "categories are hierarchies" idea presumes that it is even possible for a large group of people to agree on an all-encompassing belief-system, a ridiculous notion totally bereft of realism, a notion that has been shown wrong experientially in many IT metadata projects.

Categories, especially hierarchical categories, are about the followers of one particular worldview implicitly saying "our way is right, everyone should follow it". Note that the proportion of people who follow one particular worldview in every aspect is very small.

Sets are much less POV
Categorization by set is obviously less POV. An article can belong to as many sets as the community thinks it should belong to, whether directly or via multiple parenthood of the article's category (or ancestors).

Conclusion
The benefits of hierarchical categorization are outweighed by its costs
 * 1) decreased redundancy
 * 2) easier navigation (for a minority who have the "right" worldview)
 * 1) the community will never be in agreement over the system
 * 2) harder navigation (for the majority who don't find articles where they expect them to be)
 * 3) decreased accuracy (the real world is not in a big hierarchy, it merely has sets of metadata applied to it by different people)

This essay assumes that sets are taken advantage of fully by allowing multiple inheritance and possibly even inheritance loops, and encouraging articles and categories to be given many categories rather than just one or two.