Wikipedia:Reference desk/Archives/Computing/Early/Category graph study

The Wikipedia category system is a directed graph where each node which represents a category also contains a list of articles included in that specific category. Inside Wikipedia the category system is used to organize and find articles as well as group together articles of similar content. Other uses could include finding out whether an article is inside a particular category (defined as a child of a category that is a child of another category, and so on, until the desired category is reached) or building a list of articles that a category and all of its child categories contain. A true Tree data structure would make this nearly trivial; the current directed graph does not. Additionally the software currently does not have a way to enforce any restraints on categories and this may not be feasible for the software to do at all.

Under the current category system loops are possible which lead to bizarre information being present in the category graph, such as:

Humans->Apes->Primates->Mammals->Animals->Tree of life-> Biology->Science->Human societies->Humans

It is also possible create large loops which contain nearly every category inside the Wikipedia. This severely limits two of the potential uses:


 * Finding out whether an article is inside a particular category
 * Building a list of articles out of a category

As you can see in the above example Humans is a member of itself, Apes is also a member of humans (invalid), as is Animals (A human is an animal, an animal is not a human). If you built a list of all Human articles you would also get every mammal; this is less than useful.

What to do?
It would be highly advantageous to be able to extract useful information out of the Wikipedia categories. One possible solution is to find a minimum spanning tree for the category graph; this is not optimal because the shortest paths may not be the true logical and correct paths. An alternate idea is to start a new category under Fundamental categories and try to maintain it as a stricter category. This is likely to slip over time though as the existing category system is, leading to the inability to trust the category system, leading to less usefulness. The solution is not simple and may not be possible, but it would be highly advantageous.