User:Obiwankenobi/On categories

This is my page for musings about categories on Wikipedia, which are an endless source of pleasure and pain for me. Pleasure because I enjoy organizing articles and having nice clear usable hierarchies to browse through; pain because our category system is just so painfully bad, and so terribly implemented, that I despair at the amount of work left to fix it.

On ghettoization, non-diffusing categories, and LGBT heads of government
I will start by walking you through something I do from time to time, which is de-ghettoizing an area of the category tree.

Let's start with a definition of ghettoization, which I developed in a previous essay, and then I'll give you a new example I found today which has confounded me somewhat...

What is ghettoization?
To start, we will consider that ghettoization only applies to the categorization of human biographies on Wikipedia.

A biography is ghettoized if the following are true:
 * 1) The bio is a member of a gendered, ethnic, sexuality, or religion-based category X, and
 * 2) The biography is not in an ancestor or "blood relative" category of X (e.g. sibling, cousin, parent, grandparent, etc) that is neutral, i.e. non-gendered, non-ethnic, non-sexuality-based, and non-religious and which retains an equivalent descriptive specificity. By equivalent descriptive specificity, you can't say "Well, Y is in, and in , which is gender neutral, so she's not ghettoized." An essential aspect of ghettoization is that the biography is in a ghetto, but not in a neutral category which is an analogue to the ghetto.
 * 3) If multiple facets are intersected on the bio (e.g. gender + ethnicity + sexuality + religion + ...), as you go up the tree, the bio is ghettoized if it is not a member of each extant iteration that removes a facet while retaining the same noun. For example, to avoid ghettoization, Category:African-American women in politics members should also be, at the very least, in Category:American women in politics (removal of "African-American"), Category:American politicians (removal of "Women" and "African-American"), and Category:African-American politicians (removal of "Women").
 * 4) The above rules do not apply for any characteristic which has been fully diffused - i.e. if all men and women are fully diffused, there is no ghettoization concern. See for example, which is empty save its sub-categories. In these cases, there is no need for a neutral category, each person can be an actor or an actress and that is not considered "ghettoization".
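
Rule 3 is the easiest to get wrong by hand, so here is a minimal Python sketch of it. It assumes each category can be modelled as a (set of facets, noun) pair, which real category names do not follow so neatly; the function names are made up for illustration, and rule 4 (fully diffused characteristics) is not modelled at all.

```python
from itertools import combinations

def facet_reduced_categories(facets, noun):
    """Every (facets, noun) pair reachable by dropping at least one facet."""
    ordered = sorted(facets)
    reduced = []
    # Keep progressively fewer facets, down to the fully neutral category.
    for keep in range(len(ordered) - 1, -1, -1):
        for kept in combinations(ordered, keep):
            reduced.append((frozenset(kept), noun))
    return reduced

def missing_neutral_categories(bio_categories, facets, noun, extant):
    """Facet-reduced categories that exist but do not contain the bio."""
    return [
        cat for cat in facet_reduced_categories(facets, noun)
        if cat in extant and cat not in bio_categories
    ]

NOUN = "American politicians"
ghetto = (frozenset({"African-American", "women"}), NOUN)
extant = {
    ghetto,                                   # African-American women in politics
    (frozenset({"women"}), NOUN),             # American women in politics
    (frozenset({"African-American"}), NOUN),  # African-American politicians
    (frozenset(), NOUN),                      # American politicians
}

# A bio that sits only in the intersection category is missing all three.
bio = {ghetto}
missing = missing_neutral_categories(bio, {"African-American", "women"}, NOUN, extant)
```

Only extant categories are required, which mirrors the "each extant iteration" wording in rule 3: nobody is ghettoized for not being in a category that does not exist.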

What are non-diffusing categories?
Another definition related to ghettoization is needed: that of a non-diffusing category. Normally, when you place something in a sub-category, you remove it from the parent. A non-diffusing category behaves differently: if you place something in it, you do not immediately remove it from the parent.

Wrinkle #1: whether a category is non-diffusing or not depends on its parent. A category can be non-diffusing for one parent, and diffusing for another. Here's an example:
 * Category:Scottish_women_novelists is a non-diffusing subcategory of Scottish novelists, because you don't want to ghettoize the women, leaving the men to gloat alone, victorious, in the novelists parent.
 * On the other hand, Category:Scottish women novelists is a diffusing subcategory of Category:Scottish women writers. If someone is really known as a novelist, they don't need to remain in the writers category, and can be diffused down. Scottish women novelists is also diffusing on Category:British women novelists and Category:Women novelists by nationality.

So a single category can be diffusing and non-diffusing at the same time.
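
One way to picture this is to treat "diffusing" as a property of the link between a child and a parent, rather than of the category itself. The sketch below is only an illustration of that idea using the Scottish example above; the dictionary and function names are invented here and do not correspond to anything in MediaWiki.

```python
# (child, parent) -> True if the child diffuses that parent.
DIFFUSES = {
    ("Scottish women novelists", "Scottish novelists"): False,
    ("Scottish women novelists", "Scottish women writers"): True,
    ("Scottish women novelists", "British women novelists"): True,
    ("Scottish women novelists", "Women novelists by nationality"): True,
}

def non_diffusing_parents(child, diffuses=DIFFUSES):
    """Parents the bio is not automatically removed from."""
    return sorted(p for (c, p), d in diffuses.items() if c == child and not d)

def diffusing_parents(child, diffuses=DIFFUSES):
    """Parents the bio can be removed from once it is in `child`."""
    return sorted(p for (c, p), d in diffuses.items() if c == child and d)

keep = non_diffusing_parents("Scottish women novelists")
leave = diffusing_parents("Scottish women novelists")
```

Here `keep` holds only Scottish novelists, while `leave` holds the three women-specific parents.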

Wrinkle #2: a category can be non-diffusing, but members of a non-diffusing category won't always remain in the parent. How does that work? This can happen in cases where the parent category has diffusing categories underneath it, especially if those categories diffuse fully (i.e. if everyone in the parent can be placed in at least one diffusing child cat).
 * For example: If you look at Category:American women in politics, you may think all of those women should be in Category:American politicians, right? Wrong. There is nobody in Category:American politicians - it's empty! In fact, it's marked as a container category - it's not supposed to have anyone in it. So, are all of the women in Category:American women in politics ghettoized? Not necessarily - because as long as they are in a gender-neutral category underneath Category:American politicians, then we are ok.

This particular wrinkle was a bone of contention during the Category-gate discussions, with many arguing "If the category is non-diffusing, it means we must bubble them up!" To show why this is not workable, consider the following simple category structure:
 * Novelists
 ** (Bob)
 ** (Mary)
 ** Women novelists (non-diffusing)
 ** Novelists by country (diffusing)
 *** American novelists (diffusing)
 *** Scottish novelists (diffusing)

We start by placing Bob and Mary in the Novelists category. Now, someone says "Mary is a woman", so she gets added to the Women novelists category as well. Someone else says "Bob is Scottish", so he gets moved to the Scottish novelists category and is removed from the parent, as is normal for diffusing categories - we regularly diffuse based on nationality. Finally, someone comes along and says "Well, Mary is American, so I'm going to move her to the American novelists category and remove her from the parent (in other words, treating her the same as Bob)" - but an editor opposes: "You can't do that - she's in Women novelists which is non-diffusing, so she has to stay in the parent otherwise she will be ghettoized!" - so she gets placed back in the parent.

So now our situation looks like this:
 * Novelists
 ** (Mary)
 ** Women novelists (non-diffusing)
 *** (Mary)
 ** Novelists by country (diffusing)
 *** American novelists (diffusing)
 **** (Mary)
 *** Scottish novelists (diffusing)
 **** (Bob)

Do you notice anything weird? Mary is the only one in the parent "Novelists" category - this is a rich irony indeed, as she now gloats over her spot at the top of the food chain, while Bob languishes down in the Scottish novelists dregs.

There are two ways to fix this problem:
 * 1) The first solution is: Allow people to be bubbled up, and then diffused down, as long as they are diffused to neutral categories. This was ultimately the solution decided by consensus at . To eliminate a step, you can "bubble up and diffuse down" in one fell swoop, placing the biography in a neutral sibling or cousin, and skipping the parent entirely.
 * 2) A second solution - which was proposed but rejected at the time - would be to consider that as soon as you place a non-diffusing category, all of the siblings (and all of their sub-categories) become instantly non-diffusing as well - everyone bubbles up to the parent. To consider the mess this would cause, look at Category:Mathematicians; if we bubbled up all the women in Category:Women mathematicians to the parent, in order to be fair we'd have to bubble up the men as well, so now our nicely diffused structure would be overwhelmed by thousands upon thousands of mathematicians. They would, in addition to their "Mathematicians by country" and "Mathematicians by century" and "Mathematicians by field" categories, need to add an additional redundant one to their list, of "Mathematicians". This solution means a perfectly stable, reasonable, diffused tree can be upended by adding a single non-diffusing category to the parent, causing every other category to un-diffuse itself, spreading virally. It's a bad idea.
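
The first solution can be sketched in a few lines of Python, using the toy Novelists tree from above. This is a simplification under stated assumptions: "non-diffusing" is stored per category rather than per parent, every diffusing category is assumed to be neutral, and the function names are invented for illustration.

```python
CHILDREN = {
    "Novelists": ["Women novelists", "Novelists by country"],
    "Novelists by country": ["American novelists", "Scottish novelists"],
}
NON_DIFFUSING = {"Women novelists"}  # relative to its parent, Novelists

def neutral_descendants(parent):
    """The parent plus everything reachable through diffusing children only."""
    found = [parent]
    for child in CHILDREN.get(parent, []):
        if child not in NON_DIFFUSING:
            found.extend(neutral_descendants(child))
    return found

def is_ghettoized(bio_categories, parent):
    """True if the bio is in no neutral category at or below the parent."""
    return not any(cat in bio_categories for cat in neutral_descendants(parent))

mary = {"Women novelists", "American novelists"}  # bubbled up, diffused down
bob = {"Scottish novelists"}
only_in_ghetto = {"Women novelists"}

mary_ok = not is_ghettoized(mary, "Novelists")
bob_ok = not is_ghettoized(bob, "Novelists")
ghetto_flagged = is_ghettoized(only_in_ghetto, "Novelists")
```

Mary passes without sitting in the Novelists parent itself, because American novelists is a neutral cousin of Women novelists; a bio left only in Women novelists is flagged.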

Real-life example: LGBT heads of government
So, now that we know what ghettoization and non-diffusing categories are, let's do a real-life example, with a puzzle/quiz at the end.

Today I picked Category:Heads_of_government, which has a subcat Category:LGBT heads of government. I think we all agree that being LGBT doesn't mean you are somehow less a head of government, so we want to make sure all of those fellows in Category:LGBT_heads_of_government are also in a diffusing, neutral subcat (and ideally, several) of the parent. How do we find them? It's rather tricky. We will be using the Category intersection tool to help us. We want to find out, who is in LGBT heads of government, but not in any other neutral categories under. But there are dozens of articles, and dozens of nested categories - how do we sort this out? Here are the steps I took:
 * 1) Get a list of all diffusing, neutral sibling categories of . You can do so using the Category intersection tool like this. This gives us a list we can copy paste, which we paste into the Negative categories box. The list looks like this:

14th-century_heads_of_government
15th-century_heads_of_government
16th-century_heads_of_government
17th-century_heads_of_government
18th-century_heads_of_government
19th-century_heads_of_government
1st-century_heads_of_government
2nd-century_heads_of_government
3rd-century_heads_of_government
4th-century_heads_of_government
6th-century_heads_of_government
7th-century_heads_of_government
9th-century_heads_of_government
Assassinated_heads_of_government - should be non-diffusing
Children_of_national_leaders - even if they're in this cat, we don't care
Collective_heads_of_government
Diplomatic_visits_by_heads_of_government
Female_heads_of_government - should be non-diffusing
Heads_of_government|0 - this is the parent cat - should not recurse, so we override the recursion
Heads_of_government_by_country
Heads_of_government_in_Africa
Heads_of_government_in_Asia
Heads_of_government_in_Europe
Heads_of_government_in_North_America
Heads_of_government_in_Oceania
Heads_of_government_in_South_America
Heads_of_government_of_non-sovereign_entities
Kuhina_Nui
LGBT_heads_of_government - remove our target category
Lists_of_heads_of_government - this is only a category for lists
Premiers
Prime_ministers
Rulers
Spouses_of_politicians - not interested in this
Wikipedia_categories_named_after_heads_of_government - more of a maintenance cat, we don't care if they're in it or not


 * 2) Now, we remove some of the categories from the list, such as any non-diffusing categories (for example Female heads of government - if our bios are in that one too, that's great, but that doesn't de-ghettoize them). The sibling cats I removed are struck in the list above.
 * 3) Now we place the target cat, Category:LGBT heads of government, in the Categories box, select a depth of 8 or so (or deeper, depending on how far down your categories go), select 'Subset', and then "do it". This link shows you a filled out form. What this search does is say "Show me all LGBT heads of government that aren't in this whole other list of categories, recursively." Using this technique, you can search through trees with hundreds or thousands of bios, and quickly find the ones that are ghettoized.
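
For those who prefer to see the logic spelled out, the search above boils down to a recursive set difference. The Python below is a rough sketch on made-up data, not the Category intersection tool's actual implementation: the function names, "Bio A", "Bio B" and "Prime_ministers_of_Examplestan" are all hypothetical, and the only behaviour taken from the steps above is the depth limit and the "|0" override that stops the parent cat from recursing.

```python
def members_recursive(category, children, members, depth):
    """Bios in `category` and its subcategories, down to `depth` levels."""
    found = set(members.get(category, ()))
    if depth > 0:
        for sub in children.get(category, ()):
            found |= members_recursive(sub, children, members, depth - 1)
    return found

def find_ghettoized(target, negatives, children, members, depth=8):
    """Bios under `target` that appear under none of the negative categories.

    `negatives` maps each category to its own depth; 0 means "do not recurse".
    """
    excluded = set()
    for cat, cat_depth in negatives.items():
        excluded |= members_recursive(cat, children, members, cat_depth)
    return members_recursive(target, children, members, depth) - excluded

# Hypothetical toy tree; only the top-level names come from the list above.
children = {
    "Heads_of_government": ["LGBT_heads_of_government", "Prime_ministers"],
    "Prime_ministers": ["Prime_ministers_of_Examplestan"],
}
members = {
    "LGBT_heads_of_government": {"Bio A", "Bio B"},
    "Prime_ministers_of_Examplestan": {"Bio A"},
}
negatives = {"Heads_of_government": 0, "Prime_ministers": 8}

result = sorted(find_ghettoized("LGBT_heads_of_government", negatives, children, members))
```

Bio A is found under a neutral sibling and drops out, leaving Bio B as the one to fix. Without the depth 0 override on the parent, the parent would recurse into the target category itself and exclude everybody.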

Ok, that was fun, but now to our puzzling result. We found three fellows who are ghettoized - they are considered "LGBT heads of government", but they are nowhere to be found in the Heads of government tree. Here are the questionable characters:
 * Sardanapalus
 * Tiglath-Pileser III
 * William_II_of_the_Netherlands

Now, how could a great king like Tiglath-Pileser III be ghettoized? He's in, , and even ! But the category intersection tool tells us he's ghettoized?? What's going on?

Well, here's where you need to start to explore your tree. If you do so, you will find that it goes something like this: Assyrian kings -> Kings -> Monarchs -> Heads of state - aha! is sibling to, so the algorithm was correct - these bios *are* ghettoized.. in a manner of speaking. You may notice that there is no - only, so we have a bit of a contradiction - our LGBT category says they are a head of government, but the parenting of the other categories suggest, not really. This is a great example because it illustrates something which you will see all over the category tree: inconsistency. Sometimes you will have a female category and an LGBT category and an African-American category and sometimes even a combination of same, and then for a slightly different job title, you will have none of that. You will also find the gender/ethnic/religious/sexuality categories placed at all points in the category tree - up high, in the middle, and down low. If you've got a woman and you want to make sure she is in a gendered category for something close to her job, you may have to click up 2 or 3 levels before you find the appropriate "Women X" category; in the case of old Tiglath-Pileser III, someone went to distant uncle to find the LGBT category, placing him, somewhat incorrectly, as a "head of government". Sorting this out I leave as an exercise for the reader, as I frankly don't know what the best path is, but here are a couple of options to consider:
 * We could create, since all kings are heads of state, but not all kings are heads of government. This is probably the most "correct" solution.
 * We could just say "Forget it, it's close enough", but it technically violates our rules against ghettoization, so perhaps the rules need changing? Think about it this way - if you classify a woman as a "Woman novelist" and an "American poet", is she ghettoized or not? I think, yes.

Anyway, please share your comments on the talk page. I welcome your feedback on this meandering...