Wikipedia:Reference desk/Archives/Computing/2024 January 16

= January 16 =

Algorithm to match U.S. state names to USPS state codes
Re this mini project, I wish to match a list of USPS state codes (set A) with a list of U.S. state and territory names which has spelling variations (set B) in any order. Set B may have missing items, in which case set A matches [null]. This is an example successful match: yields Can someone please suggest a robust, performant yet simple to program algorithm? The target language is JavaScript.

Thanks, cm&#610;&#671;ee&#9094;&#964;a&#671;&#954; 09:03, 16 January 2024 (UTC)


 * I've re-read this multiple times and can't understand what this program is supposed to do. Are you trying to convert a natural language name to a USPS code? One which could have many variations, such as eg DC being "District of Columbia", "Washington DC", "Washington (District of Columbia)" for example. Is that what you're trying to do? —Panamitsu (talk) 10:28, 16 January 2024 (UTC)


 * Thanks for your reply, Sorry if I didn't make it clear. Yes, I effectively wish to convert a natural-language name to a USPS code. It is helped that the names are presented in one file, and each state or territory occurs at most once, so if one name somewhat matches one code, no other name can match that code. Here's my use-case:


 * I have created a template SVG map showing the states of the USA (and Washington DC). In the SVG code, I refer to the states' shape and labels with the USPS codes AK, AL, AR, AZ etc.


 * My collaborator,, however, receives data in the following format: The order of the states or their exact spelling cannot be guaranteed. Some states may be missing data: in the above, I have "Alaska" without a number, but that line could be left out entirely.


 * The JavaScript can have a lookup table with their nominal spellings, but may need to make matches e.g. using Levenshtein distances etc to find the best mapping. In my contrived example, the lookup table might have "CO → Colorado, CT → Connecticut", so when the data has "Conn.", it figures that "CT" is more likely than "CO".


 * Does it make sense now? It sounds that this is already a common solved problem in computer science. Thanks, cm&#610;&#671;ee&#9094;&#964;a&#671;&#954; 11:38, 16 January 2024 (UTC)
 * Your lookup table (which should be a map, not a table) is backwards. Let's assume it is called "map" in your code. If you get "Missouri", you set state=map["Missouri"] because map["Missouri"]="MO". It is likely that many people have made this map for their code. You can make it for yours. You have 51 states (including Washington DC), so you type it in. If you get one you haven't seen before, add it. Because of the way this map is made, you can have map["Colorado"] = "CO", map["Col."] = "CO", map["colarodo"] = "CO", etc... 12.116.29.106 (talk) 16:45, 16 January 2024 (UTC)

cm&#610;&#671;ee&#9094;&#964;a&#671;&#954; 10:48, 17 January 2024 (UTC)
 * I would not use Levenshtein distances; I would just put common spellings into the object. I would restrict the case. If a match is not found, then add the failing string to the map.
 * Glrx (talk) 18:56, 16 January 2024 (UTC)
 * For the example given, you cannot use Levenshtein distances. The example is CO and CT with the abbreviation Conn. For that, the distance between CO and Conn is 2. The distance between CT and Conn is 3. CO is less distant than CT. So, it will not claim that CT is a better match. It will claim CO is a better match. If you really want to go down this route, you have to do a Levenshtein check between the abbreviation Conn and all full state names. But, you have a string length weakness. Conn to Colorado has a distance of 6. Conn to Conneticut has a distance of 6. If you divide the distance by the max possible distance, Conn to Colorado is 6/8=0.75 and Conn to Conneticut is 6/10=0.60. Now, 0.60 is less than 0.75. But, you will certainly hit more problems. You are much better off making a very complete map of all abbreviations. 12.116.29.106 (talk) 19:53, 16 January 2024 (UTC)
 * Note that "Washington (District of Columbia)" and "W. Virginia" from your set B do not occur as such in our article List of U.S. state and territory abbreviations, so you may need to add more name variations by hand. You can normalize the key by ignoring case and removing spaces and periods, so that "W. Virg." is replaced by "wvirg"; this will reduce the number of entries. While edit distance should not be the primary criterion, some well-crafted version of least edit distance can perhaps be used as a fallback if the key is not found in the lookup table. --Lambiam 22:08, 16 January 2024 (UTC)
 * Another option to finding these abbreviations is to get a list of redirects to each state/territory which have the redirect templates and  . This isn't going to get them all of course, but it should be a good start. Manually mapping typos isn't too much work. Heck, I have a database of over 1,000 flags which I manually type out descriptions for and it only takes a couple of minutes to map each typo. —Panamitsu (talk) 03:02, 17 January 2024 (UTC)
 * Thank you very much, everyone, for your feedback. Though I originally wanted a generic fuzzy-matching function, I've scaled back my ambitions and adapted 's idea by first converting the name to lowercase and removing any non-alphanumeric-or-underscore characters.
 * If there is an exact match, it returns it, otherwise it finds the one with the lowest Levenshtein distance. This doesn't make use of the fact that each state appears at most once, but is hopefully sufficient for foreseeable data. I'll add a note to add common abbreviations e.g. "conn" or "wvirg" should the need arise.
 * Cheers,
 * Cheers,