
Parsing postal addresses is surprisingly difficult. The USPS document that describes how to properly format US addresses is over 200 pages long. How many types of road (e.g. Street, Court, Avenue) do you suppose are in that document? I once tried to list all of the ones I could think of, and came up with a couple of dozen. That document contains over 200, including some strange ones such as Loaf and Stravenue.
Further complicating matters, many names can serve more than one purpose depending on where they appear in an address; for example, according to the 2000 census data, there are at least 68 cities in the US that share a name with a state (including 25 cities named Washington). And of course, US rules don't apply in other countries, each of which has its own rules. Supposing you had a good parser for US addresses, you'd like to be able to determine whether an address is in the US at all. One way might be to just search your string for the names of other countries. However, again from the census data, there are at least 85 cities in the US that share a name with a country elsewhere in the world (including, strangely, 17 cities named Lebanon).
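To make the country-name problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the four-entry `COUNTRY_NAMES` set stands in for a real country list, the function names are hypothetical, and the addresses are made up.

```python
import re

# Hypothetical, tiny sample of country names; a real list would be much longer.
COUNTRY_NAMES = {"Lebanon", "Mexico", "Peru", "Cuba"}


def looks_foreign_naive(address: str) -> bool:
    """Return True if any country name appears as a whole word anywhere."""
    words = set(re.findall(r"[A-Za-z]+", address))
    return not COUNTRY_NAMES.isdisjoint(words)


def looks_foreign_positional(address: str) -> bool:
    """Return True only if the last comma-separated field is a country name."""
    last_field = address.rsplit(",", 1)[-1].strip()
    return last_field in COUNTRY_NAMES


us_address = "100 Main St, Lebanon, PA 17042"
foreign_address = "Rue Hamra, Beirut, Lebanon"

# The naive search flags both addresses as foreign, including the US one.
naive_results = (looks_foreign_naive(us_address),
                 looks_foreign_naive(foreign_address))

# Checking only the final field separates these two, but nothing more.
positional_results = (looks_foreign_positional(us_address),
                      looks_foreign_positional(foreign_address))
```

Looking only at the last field handles these two strings, but it is still just a heuristic: it assumes comma-separated fields and an address that ends with the country, and plenty of real addresses do neither.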
It's one of those problems where you can throw together a 75% solution in a few hours, but from there each subsequent level of improvement gets increasingly difficult. It's no surprise, then, that the address-parsing software out there is either really expensive, or doesn't work very well, or both. Writing a good open-source address-parsing package is on the long list of projects I hope to get to someday.
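For a sense of what the quick first pass might look like, here is a rough Python sketch of a single-regex parser. It is not a real package or anyone's actual implementation: `ADDRESS_RE` and `parse_simple` are hypothetical names, the suffix list is a handful of the 200-plus road types, and it assumes one rigid "number street suffix, city, ST zip" layout.

```python
import re

# One rigid layout: "<number> <street> <suffix>, <city>, <ST> <zip>".
# The suffix alternation is a small, hand-picked sample, not the USPS list.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<street>.+?)\s+"
    r"(?P<suffix>St|Street|Ave|Avenue|Ct|Court|Rd|Road)\.?,\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_simple(address: str):
    """Return a dict of address fields, or None if the layout doesn't match."""
    match = ADDRESS_RE.match(address.strip())
    return match.groupdict() if match else None


# The easy case parses cleanly.
easy = parse_simple("123 Main St, Springfield, IL 62704")

# An unlisted road type and a PO box both fall straight through.
odd_suffix = parse_simple("456 Elm Stravenue, Tucson, AZ 85701")
po_box = parse_simple("PO Box 12, Lebanon, PA 17042")
```

Each failure suggests an obvious patch (a longer suffix list, a PO box branch, optional unit numbers), and that is exactly how the remaining work piles up.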
Labels: address, geography, parsing