Wikipedia:Auto-categorization/done

Years BC
For the years BC, I suggest adding a corresponding year category to the pages on Category_talk:Years. Oddly, it's being debated whether there should be a category for these pages, rather than which one. If there is support for adding Category:500 BC through Category:101 BC, I would do so. Earlier years don't have individual pages; later years are already categorized. - User:Docu

The BC year articles were actually my next target. I've asked for clarification on Category_talk:Years on whether they should be categorized by year or by decade. I'll announce here when that's decided and then feed them to Pearle if she's authorized to do this. -- Beland 04:21, 6 Oct 2004 (UTC)

Completed manually: Articles with USA state names
Thanks to everyone who helped, especially Sortior, who did a huge chunk of work on this project. -- Beland 06:12, 25 Dec 2004 (UTC)


 * Former and proposed counties in the USA - Complete
 * Alabama - Updated 9 Dec 2004 - Complete
 * Alaska - Updated 9 Dec 2004 - Complete
 * AZ, AR, CA, CO, CT - Updated 26 Nov 2004. - Complete
 * DE - Completed again 26 Nov 2004.
 * FL, GA, HI, ID, IL - Updated 26 Nov 2004. - Complete
 * IN, IA, KS, KY, LA, ME - Updated 26 Nov 2004. - Complete
 * MD, MA, MI, MN, MS, MO - Updated 9 Dec 2004 - Complete
 * MT, NE, NV, NH, NJ, NM - Updated 9 Dec 2004 - Complete
 * NY, NC, ND, OH, OK, OR - Updated 9 Dec 2004 - Complete
 * Pennsylvania - Complete
 * Rhode_Island - Complete
 * South_Carolina - Completed, thanks again.
 * South_Dakota - Complete
 * Tennessee - Complete
 * Texas - Complete
 * Utah - Complete
 * Vermont - Complete
 * Virginia - Complete
 * Washington
 * West_Virginia - Complete
 * Wisconsin - Complete
 * Wyoming - Done

Misclassified CDPs
All the misclassified CDPs have been converted from town to CDP. Many articles where "town" was the correct term have been changed to "CDP" and will have to be manually updated to use the term "town" again. The categorization effort for all U.S. cities has moved to the unabridged city list and should be completed within 48 hours. -- Ram-Man 04:44, Nov 20, 2004 (UTC)

USA municipalities and counties
I recently started creating a system which attempts to suggest appropriate categories for uncategorized articles.

My first idea was to isolate words that commonly appear in article titles and find categories in which these words also appear, which I successfully implemented. A quick look at the results revealed that the most common words are geographical place-names, especially names of states in the United States. A quick look at the contents of these articles, in turn, reveals distinctive patterns in the articles about counties and municipalities that were created by the Rambot. The upshot of this is that tens of thousands of articles can be categorized with minimal human intervention.
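A minimal sketch of the title-word idea, assuming article titles and category names are available as plain lists of strings. The function names and the stop-word list are illustrative assumptions, not taken from the actual system.

```python
from collections import Counter
import re

# Hypothetical stop words; the real system's filtering is not described.
STOP_WORDS = {"of", "the", "in", "and", "a"}

def title_words(title):
    """Split a title such as 'Alaska_Township,_Minnesota' into lowercase words."""
    words = re.findall(r"[a-z]+", title.lower().replace("_", " "))
    return [w for w in words if w not in STOP_WORDS]

def common_title_words(titles, top_n=10):
    """Count how many titles each word appears in, most common first."""
    counts = Counter()
    for title in titles:
        counts.update(set(title_words(title)))
    return counts.most_common(top_n)

def categories_for_word(word, categories):
    """Return the categories whose names contain the given word."""
    return [c for c in categories if word in title_words(c)]
```

Run over a full title dump, the top of the `common_title_words` list would be the place to look for patterns like the state names mentioned above.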

I have created some special routines in my auto-categorization system to do the following:
 * Iterate over the 50 states of the U.S.
 * Grep article titles and the first 500 characters of each article for the state name.
 * Skip disambiguation pages by detecting the phrase "is the name of several places", etc.
 * For the remaining articles, isolate the text of interest. The text of interest is the first 500 characters, excluding a leading div block (usually an image of some kind) and everything after the phrase "As of", which marks the beginning of the Rambot's rambling about census data.
 * Match key phrases in the text of interest, like "is a town in Foo County, Bar ".
 * Where key phrases are found, suggest appropriate categories, like "Category:Foo County, Bar" and "Category:Towns in Bar".
 * Segregate articles that can be automatically classified into one file, and put those that cannot be in another.
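The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual routines: the regular expression is modeled only on the example phrase "is a town in Foo County, Bar", the leading div block is stripped before the 500-character cut, and all function names are hypothetical.

```python
import re

DISAMBIG_PHRASE = "is the name of several places"
PLURALS = {"town": "Towns", "village": "Villages",
           "city": "Cities", "township": "Townships"}

def text_of_interest(text, limit=500):
    """Drop a leading div block, keep `limit` characters, cut at 'As of'."""
    # Non-greedy match; nested div blocks are not handled in this sketch.
    text = re.sub(r"^\s*<div\b.*?</div>\s*", "", text,
                  count=1, flags=re.DOTALL | re.IGNORECASE)
    text = text[:limit]
    cut = text.find("As of")
    return text if cut == -1 else text[:cut]

def suggest_categories(text, state):
    """Match the assumed key phrase and suggest county and kind categories."""
    pattern = re.compile(
        r"is an? (town|village|city|township) in "
        r"([A-Z][\w .'-]*? County), " + re.escape(state) + r"\b")
    match = pattern.search(text_of_interest(text))
    if match is None:
        return []
    kind, county = match.groups()
    return [f"Category:{county}, {state}",
            f"Category:{PLURALS[kind]} in {state}"]

def segregate(articles, state):
    """Split {title: text} into classified suggestions and unclassified titles."""
    classified, unclassified = {}, []
    for title, text in articles.items():
        # Grep the title and the first 500 characters for the state name.
        if state not in title.replace("_", " ") and state not in text[:500]:
            continue
        if DISAMBIG_PHRASE in text:
            continue  # skip disambiguation pages
        categories = suggest_categories(text, state)
        if categories:
            classified[title] = categories
        else:
            unclassified.append(title)
    return classified, unclassified
```

Calling `segregate(articles, state)` once per state, with `articles` as a hypothetical title-to-text mapping, yields the two outputs that the last step writes to separate files.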

The system currently knows how to parse towns, villages, cities, and townships that are part of counties, and counties that are part of U.S. states.

It treats parishes in Louisiana, and boroughs and census areas in Alaska, as the equivalent of counties. It also recognizes boroughs that are part of Pennsylvania counties. In other states, a borough may be part of a township which is in turn part of a county. These and all other areas which are part of subdivisions of a county (or equivalent) are ignored because I'm not sure how they are supposed to be categorized.
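One way to express these special cases is a per-state lookup that feeds the key-phrase pattern. The table layout, the names, and the assumption that Louisiana and Alaska articles use only their own unit names are illustrative; the actual routines may be organized differently.

```python
import re

# County equivalents by state; every other state is assumed to use "County".
COUNTY_EQUIVALENTS = {
    "Louisiana": ["Parish"],
    "Alaska": ["Borough", "Census Area"],
}
# Extra municipality kinds recognized only in particular states.
EXTRA_KINDS = {"Pennsylvania": ["borough"]}
BASE_KINDS = ["town", "village", "city", "township"]

def key_phrase_pattern(state):
    """Build a pattern like 'is a town in Foo County, Bar' for one state."""
    units = COUNTY_EQUIVALENTS.get(state, ["County"])
    kinds = BASE_KINDS + EXTRA_KINDS.get(state, [])
    return re.compile(
        r"is an? (" + "|".join(kinds) + r") in "
        r"([A-Z][\w .'-]*? (?:" + "|".join(units) + r")), "
        + re.escape(state) + r"\b")
```

Keeping the exceptions in data rather than in branching code means a further special case only needs a new table entry.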

Charter townships in Michigan are not automatically added to Category:Charter Townships in Michigan, though they are added to Category:Townships in Michigan. The complexity here will have to be addressed manually or by future automation. -- Beland 02:46, 26 Sep 2004 (UTC)

Next steps
The system has automatically classified about 31,000 articles. I have a human-readable dump, but it's 6.5MB long, which is unwieldy to post on Wikipedia. A truncated version is posted at /workspace. Please let me know if you would like a copy of the full version.

I also have a machine-readable version (3.4MB) that will need to be passed to a bot that can automatically add a given article to a given category if the article isn't already in a category. (Duplicates like "Alaska_Township,_Minnesota" are noted and removed from the machine-readable version.) I will write such a bot if no one else wants to take responsibility for doing this.
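A rough sketch of the two behaviours described here, working on plain strings rather than live pages. It assumes that a "duplicate" means a title that appears more than once in the suggestion list, and that category links use the usual `[[Category:...]]` wikitext form; the function names are hypothetical and no real bot framework is used.

```python
import re
from collections import Counter

CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:", re.IGNORECASE)

def has_category(wikitext):
    """True if the article text already contains any category link."""
    return CATEGORY_LINK.search(wikitext) is not None

def add_categories(wikitext, categories):
    """Append category links only if the article has none yet."""
    if has_category(wikitext):
        return wikitext
    links = "\n".join(f"[[{c}]]" for c in categories)
    return wikitext.rstrip("\n") + "\n\n" + links + "\n"

def drop_duplicates(suggestions):
    """Remove titles that occur more than once; return (kept, duplicate titles)."""
    counts = Counter(title for title, _ in suggestions)
    kept = [(t, c) for t, c in suggestions if counts[t] == 1]
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    return kept, duplicates
```

Returning the duplicate titles separately keeps them available for manual review instead of silently discarding them.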

After these articles have been categorized, I will re-run the auto-categorization system and see what the next-most-common patterns are that might be exploited in a similar fashion. -- Beland 02:46, 26 Sep 2004 (UTC)

Update: Project complete. -- Beland 03:06, 20 Nov 2004 (UTC)

Comments and Concerns

 * Have we decided that we want county-level categories? --Gary D 11:09, Sep 26, 2004 (UTC)


 * I think so. There are about 830 existing county-level categories in the United States.  It seems a logical place to put articles about a particular region that covers multiple municipalities.  Plus, it's useful to be able to see a list of communities in a given county, and to navigate to and fro.  -- Beland 22:00, 26 Sep 2004 (UTC)


 * What happens with cities (or some few villages) which are in multiple counties? I believe Rambot mostly omitted county information from these. In some cases it has been manually added.


 * When county information is omitted, as in the following examples, the municipality-matching subroutine does not suggest categories.


 * Corinth,_Kentucky Corinth is a city located in Kentucky.
 * Iron_City,_Tennessee Iron City is a city located in Tennessee.


 * When multiple-county information has been added manually (as in the following examples), the syntax does not generally match what's expected by the municipality-matching routine, and it does not suggest categories.

 * Calverton,_Maryland Calverton is an unincorporated area located on the boundary between Montgomery and Prince George's Counties, Maryland. \n\n== Geography ==\n[[Image:MDMap-doton-Calverton.PNG|right|Location of Calverton, Maryland]]\nAs an unincorporated area, Calverton's boundaries are not officially defined. Calverton is, however, recognized by the United States Census Bureau as a Census-designated Place, and by the [[United
 * Hillandale,_Maryland Hillandale is an unincorporated area located on the boundary between Montgomery and Prince George's Counties, Maryland. \n\n== Geography ==\n[[Image:MDMap-doton-Hillandale.PNG|right|Location of Hillandale, Maryland]]\nAs an unincorporated area, Hillandale's boundaries are not officially defined. Hillandale is, however, recognized by the United States Census Bureau as a Census-designated Place, and by the [[Un


 * If you have any articles in particular in mind, I can look them up and see where they landed. -- Beland 22:00, 26 Sep 2004 (UTC)


 * No, I don't have any particular articles in mind; all that I've come across recently, I've fixed and added categories manually. Could it generate a list of places where A) it did not find rambot county-language AND B) the article doesn't already have county categories? That might be useful for making manual corrections of the first type (which usually involve a small bit of research). older ≠ wiser 22:11, 26 Sep 2004 (UTC)


 * I already have such a list. It's 1.3MB long, but I can divide it into chunks by state.  I'll post it above as the first salvo in the computer-assisted manual categorization project. -- Beland 00:52, 27 Sep 2004 (UTC)


 * Not having noticed this project earlier, I did not realize how many people have been working on categorizing the city articles. The rambot has the advantage of knowing 99% of the city article names without having to parse anything because it created them.  As such, I don't have to parse through state and county articles to find them.  Anyway, I recently did bot runs over all the county and city articles performing a slew of changes.  One of those was to perform the categorization.  There were a large number of articles that had categorization already, but also many that were missing it.  I know there is a list above, and I don't even want to try to go through all of those cities to remove the ones that I've completed.  But all of the "Category:COUNTY_NAME County, STATE" and "Category:YYYY in STATE_NAME" categories, where YYYY is town, city, etc., have been added.  I also created those missing "Category:COUNTY_NAME County, STATE" articles by adding a "Category:STATE_NAME counties" link (See: Category:Washakie County, Wyoming).  I have not created any missing "Category:YYYY in STATE_NAME" articles, but I could very easily do it on the next bot run.  Oh, I forgot one thing.  There are actually two lists of cities, so I am not totally done with updating the cities. I've seen Beland's bot working away on that, so the rest may have been caught, but I'll have to run the bot again and catch those cities that are not easily associated with a particular county. -- Ram-Man 20:07, Nov 17, 2004 (UTC)

USA county categories
The "USA municipalities" run described above will create lots of new, uncategorized county categories. To fix this, I whipped up a script to add these county categories to the appropriate "State X counties" category. I have created an input file suitable for feeding to Pearle once she's approved. -- Beland 04:30, 6 Oct 2004 (UTC)

Update - This has been completed. -- Beland 08:58, 15 Nov 2004 (UTC)
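The county-category step in this section can be sketched as below. This is an illustration only: it assumes the parent category follows the "STATE_NAME counties" naming seen in Category:Washakie County, Wyoming, and the tab-separated output is a hypothetical format, since Pearle's actual input format is not described here.

```python
def parent_category(county_category):
    """Map 'Category:Foo County, Bar' to 'Category:Bar counties'."""
    prefix = "Category:"
    if not county_category.startswith(prefix):
        return None
    name = county_category[len(prefix):]
    # Split on the last ", " so the state is whatever follows the county name.
    county, sep, state = name.rpartition(", ")
    if not sep:
        return None
    return f"Category:{state} counties"

def build_input(county_categories):
    """Return one 'child<TAB>parent' line per distinct county category."""
    lines = []
    for category in sorted(set(county_categories)):
        parent = parent_category(category)
        if parent is not None:
            lines.append(f"{category}\t{parent}")
    return lines
```

States whose county equivalents use a different parent name would need their own rule, which this sketch does not attempt.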