Ideas to solve international postal address formatting issues with YAML, JSON and some programming

When you enter address-related data online, you are faced with all those fields your address has to go into. You know the deal: country, postal code, locality and so on.

When it comes to arranging those elements back together, things may get a bit more complicated:

  • Is the postal code before or after the locality name?
  • Are regions subsets of states or vice versa?
  • Which administrative divisions are we using anyway?
  • Have you ever tried ordering something from abroad, only to have to fill out all the fields “correctly”, even though your country isn’t composed of states or your postcode is shorter than 5 digits?

The list goes on.

I used to work at Iceland Post for several years and we were used to parsing terribly formatted addresses on letters and parcels from abroad, even going as far as being able to read names and addresses garbled by codepage conversion errors (a certain online retailer was the main culprit there).

But not every national postal service is the same, and not all of your deliveries are handled by the post anymore. With more liberalised markets in Europe, for instance, you see even more private parcel and postal operators working across borders, in addition to express courier services such as UPS, TNT and DHL Express.

The current private or semi–private market that is developing may bring in some positive developments, but for the general public, this means that the delivery companies start cutting even more corners.

Lower wages for front-line employees in the shipping and logistics industry, less employer-sponsored education, the resulting turnover and the use of subcontractors as last-mile delivery personnel have led to worse service overall. Being flexible or going the extra mile is no longer a virtue, and the recipient is no longer considered the customer (the sender is).

This means that using correctly formatted addresses for e–commerce deliveries or simply for sending invoices in the post is getting more and more important every year.

Nobody wants to be liable for unhappy customers, and who wants to take on the extra cost of lost, damaged and delayed packets, let alone steep return fees?

Can we use data to solve this as a sender?

I have been fleshing out the idea of using structured data to build properly formatted postal addresses and web forms for the past decade or so. This is something that may sound straightforward to most people, but digitising the correct address formats for the 192 member states of the UPU and their dependencies is hefty work.

What is the UPU? — The Universal Postal Union is one of the specialised agencies of the United Nations (and is in fact older than the UN) and takes care of coordinating things between the postal services and administrations of its member states.

The UPU displays the address formats officially in use by its member states on its website and even includes additional information like correct capitalisation and the correct position of the address on an envelope!

This also includes some exceptions and special cases, such as P.O. Box-specific formatting and information about rural addresses, so we have most bases covered. (Of course there are further exceptions, where a state has officially adopted a specific format, but the general public is using a different one.)

This, however, is laid out in several PDF files, which are human-readable but not easily parsable by computers. In order to make this work, every single PDF file in this collection needs to be examined and manually transcribed into a machine-readable format.

I’ve tried this before and failed

More than 10 years ago, I devised a solution based on a single and quite massive XML file and a parser class written in PHP.

That ended in feature creep and was never fully released to the public.

However, today, with the proliferation of dynamic websites using frameworks such as React and Angular, as well as increased support for structured data formats such as YAML and JSON, I may be closer to solving this riddle.

Storing an address format as YAML

To begin with, I am planning to base this project on a collection of YAML files, which are then converted to JSON and served over HTTP(S). We can consider them to be schemas, one for each country.
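
The YAML-to-JSON conversion step could be sketched in Ruby roughly as follows. This is a minimal illustration using only the standard library; the constant `SCHEMA_TEXT` and the method name `schema_to_json` are made up for this example, and the post does not specify any file layout.

```ruby
require 'yaml'
require 'json'

# Illustrative schema text. In the project this would be read from a file;
# the names used here are assumptions made for this sketch.
SCHEMA_TEXT = <<~SCHEMA
  country_code: is
  country_name: Iceland
  elements: [addressee, supplement, address, postcode, locality]
  required: [addressee, address, postcode, locality]
SCHEMA

# Parse one YAML schema and serialise it as JSON, ready to be served.
def schema_to_json(yaml_text)
  JSON.pretty_generate(YAML.safe_load(yaml_text))
end

puts schema_to_json(SCHEMA_TEXT)
```

One thing to watch out for: YAML 1.1 parsers such as Ruby's Psych read an unquoted `no` as the boolean false, so a country code like Norway's would need to be quoted in the schema file.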

I have concluded that the following attributes would be sufficient for YAML and JSON files:

  • Country code
  • Country name in English
    • This can be translated using something like Poedit
  • Array of used address elements
  • Array of required elements
  • Array of uppercase elements
  • Regular expressions for validations
    • Note that hard validations are not recommended in general, as there are exceptions to any rule!
  • Arrays of valid options (administrative divisions), abbreviations and full names
    • US states
    • Italian regions
  • The format or formats used, line by line
    • Variables are to be replaced with values
    • Spaces are parsed as such
    • Empty lines would be removed.

Other things to consider are conditional statements for different formats (“if”, “if x equals” and “if not”), which could be used to select variations of the address format — for instance if the locality and prefecture/state are the same, if a locality requires a special address format or if special care needs to be taken for indicating specific apartments or office suites.

Example 1:

Iceland has a fairly simple format with a three-digit postcode system, which can be validated using a regular expression.

country_code: is
country_name: Iceland
elements: [addressee, supplement, address, postcode, locality]
required: [addressee, address, postcode, locality]
regex:
- postcode: '/\A[0-9]{3}\z/'
format: 
- '%addressee%'
- '%supplement%'
- '%address%'
- '%postcode% %locality%'
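
As a rough sketch of how a Ruby parser might apply this schema, the snippet below checks the required elements and the postcode pattern. It returns warnings rather than rejecting the address outright, since hard validation is discouraged above. The method names `to_regexp` and `address_warnings` are invented for this example.

```ruby
require 'yaml'

# The Iceland schema from Example 1. The quoted heredoc keeps the
# backslashes in the regular expression intact.
ICELAND = YAML.safe_load(<<~'SCHEMA')
  country_code: is
  country_name: Iceland
  elements: [addressee, supplement, address, postcode, locality]
  required: [addressee, address, postcode, locality]
  regex:
  - postcode: '/\A[0-9]{3}\z/'
SCHEMA

# The schema stores patterns as "/.../" strings; strip the delimiters.
def to_regexp(pattern)
  Regexp.new(pattern.delete_prefix('/').delete_suffix('/'))
end

# Returns a list of human-readable warnings (empty when all looks fine).
def address_warnings(schema, address)
  warnings = []
  schema.fetch('required', []).each do |element|
    warnings << "#{element} is missing" if address[element].to_s.strip.empty?
  end

  # "regex" is a list of single-key mappings; merge them into one hash.
  patterns = schema.fetch('regex', []).reduce({}, :merge)
  patterns.each do |element, pattern|
    value = address[element].to_s
    next if value.empty?

    warnings << "#{element} does not match the expected pattern" unless to_regexp(pattern).match?(value)
  end
  warnings
end

address = { 'addressee' => 'Iceland Post', 'address' => 'Stórhöfða 29',
            'postcode' => '1100', 'locality' => 'Reykjavík' }
puts address_warnings(ICELAND, address)
```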

Example 2:

The USA is composed of 50 different states, plus other territories specified with an abbreviation and a full name.

country_code: us
country_name: "United States of America"
elements: [addressee, supplement, address, locality, state, zip_code]
required: [addressee, address, locality, state, zip_code]
options:
- state:
 - { 'abbr': 'AL', 'name': 'Alabama' }
 - { 'abbr': 'AK', 'name': 'Alaska' }
 - { 'abbr': 'AZ', 'name': 'Arizona' }
format:
- '%addressee%'
- '%supplement%'
- '%address%'
- '%locality% %state% %zip_code%'
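
The options array could, for instance, be used to normalise whatever the user typed into the official abbreviation. The sketch below assumes only the three states listed above; the method name `option_abbreviation` is made up for illustration.

```ruby
# The three states listed in Example 2; a real schema would list them all.
STATES = [
  { 'abbr' => 'AL', 'name' => 'Alabama' },
  { 'abbr' => 'AK', 'name' => 'Alaska' },
  { 'abbr' => 'AZ', 'name' => 'Arizona' }
].freeze

# Accepts an abbreviation or a full name, ignoring case and surrounding
# whitespace, and returns the abbreviation (nil when nothing matches).
def option_abbreviation(options, input)
  needle = input.to_s.strip.downcase
  match = options.find do |option|
    [option['abbr'], option['name']].any? { |candidate| candidate.downcase == needle }
  end
  match && match['abbr']
end

puts option_abbreviation(STATES, 'alaska')
puts option_abbreviation(STATES, ' az ')
```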

Example 3:

Japan poses the particular challenge of having two separate address formats: one for international mail and one for domestic mail (written in Japanese script). This may be solved by describing the format attribute as a mapping with one entry per variant:

country_code: jp
country_name: "Japan"
elements: [addressee, supplement, address, locality, prefecture, postcode]
required: [addressee, address, locality, prefecture, postcode]
format: {
 international: ['%addressee%', '%supplement%', '%address%', '%locality% %prefecture%', '%postcode% %country_name%'],
 domestic: ['%postcode% %prefecture% %locality% %address% %supplement% %addressee%']
}
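
A parser would then need a rule for picking one of the two variants. The sketch below assumes the simplest one: use the domestic format when the sender is in the same country, and the international format otherwise. The method name `format_for` and the fallback behaviour are assumptions for this example.

```ruby
# A trimmed-down version of the Japan schema from Example 3.
JAPAN = {
  'country_code' => 'jp',
  'format' => {
    'international' => ['%addressee%', '%supplement%', '%address%',
                        '%locality% %prefecture%', '%postcode% %country_name%'],
    'domestic' => ['%postcode% %prefecture% %locality% %address% %supplement% %addressee%']
  }
}.freeze

# Returns the format lines to use for mail sent from sender_country.
# Schemas with a single format (a plain array) are returned unchanged.
def format_for(schema, sender_country)
  format = schema['format']
  return format if format.is_a?(Array)

  variant = sender_country == schema['country_code'] ? 'domestic' : 'international'
  # Fall back to the first variant if the preferred one is not defined.
  format.fetch(variant) { format.values.first }
end

puts format_for(JAPAN, 'is').inspect
puts format_for(JAPAN, 'jp').inspect
```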

Storing an address as JSON and parsing the data

Relational databases such as PostgreSQL offer JSON column types, not to mention MongoDB, which stores JSON-like documents. This is why I think it is a no-brainer to use JSON fields or text fields with JSON-formatted data for the job.

The following are very simple examples of how this can be done.

Example 1:

{
  "addressee": "Iceland Post",
  "supplement": "C/O Stamp Collection",
  "address": "Stórhöfða 29",
  "postcode": "110",
  "locality": "Reykjavík",
  "country": "is"
}

This would be rendered as the following text block:

Iceland Post
C/O Stamp Collection
Stórhöfða 29
110 Reykjavík
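
The rendering step itself can be sketched in a few lines of Ruby. This assumes the keys of the stored record match the element names used in the schema (so `addressee` for the recipient's name); the method name `render_address` is invented for this example.

```ruby
# Format lines from the Iceland schema in the previous section.
ICELAND_FORMAT = ['%addressee%', '%supplement%', '%address%', '%postcode% %locality%'].freeze

# Replaces each %placeholder% with the matching value, tidies up spaces
# left behind by missing values and drops lines that end up empty.
def render_address(format_lines, address)
  lines = format_lines.map do |line|
    filled = line.gsub(/%(\w+)%/) { address[Regexp.last_match(1)].to_s }
    filled.squeeze(' ').strip
  end
  lines.reject(&:empty?).join("\n")
end

record = {
  'addressee' => 'Iceland Post',
  'supplement' => 'C/O Stamp Collection',
  'address' => 'Stórhöfða 29',
  'postcode' => '110',
  'locality' => 'Reykjavík',
  'country' => 'is'
}
puts render_address(ICELAND_FORMAT, record)
```

Leaving out the `supplement` value yields a three-line address, because the empty line is removed.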

Example 2:

{
  "addressee": "Quincy Happy",
  "street_name": "Wacholderweg",
  "house_number": "52a",
  "postcode": "26133",
  "locality": "Oldenburg",
  "country": "de"
}

This would be rendered as the following text block:

Quincy Happy
Wacholderweg 52a
26133 Oldenburg

Generating address forms

While the initial aim would be to provide correctly formatted output, the “elements” and “required” arrays should provide enough information to generate address forms.

As long as the address formats don’t include overly complex formatting, it should be good enough to generate more advanced address forms — either on-the-fly using some XHR-wizardry or based on the user’s specified country.

Perhaps limiting the non-variable values used in formatting would make this easier.

The downside UX-wise is that, in order to get this to work properly, the user needs to specify the country before the rest of the form is generated.
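
To illustrate how little is needed, the sketch below derives a list of form field descriptions from a schema, which a front end could then turn into actual inputs. The field structure (`name`, `label`, `required`, `type`, `choices`) is an assumption made for this example, not a defined output format.

```ruby
# A trimmed-down version of the US schema from the earlier example.
US_SCHEMA = {
  'elements' => %w[addressee supplement address locality state zip_code],
  'required' => %w[addressee address locality state zip_code],
  'options' => [
    { 'state' => [
      { 'abbr' => 'AL', 'name' => 'Alabama' },
      { 'abbr' => 'AK', 'name' => 'Alaska' }
    ] }
  ]
}.freeze

# Builds one field description per address element. Elements that have
# a list of valid options become selects, everything else a text input.
def form_fields(schema)
  required = schema.fetch('required', [])
  options = schema.fetch('options', []).reduce({}, :merge)

  schema.fetch('elements', []).map do |element|
    field = {
      name: element,
      label: element.tr('_', ' ').capitalize,
      required: required.include?(element),
      type: options.key?(element) ? 'select' : 'text'
    }
    field[:choices] = options[element].map { |o| [o['abbr'], o['name']] } if options.key?(element)
    field
  end
end

form_fields(US_SCHEMA).each { |field| puts field.inspect }
```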

The programming bit

I intend to start off by writing a Ruby library to parse the files, along with a test suite using RSpec. Later on, something like JSONAPI::Resources can be used to create a microservice API in order to help include the parser in other projects.

Regardless, the main reason why I want to initiate the project as a set of YAML files is to make it easier to convert the data and include it in any other programming language.

The copyright question

One thing that is not clear right now is the intellectual ownership of the data provided. The UPU can claim its own copyright, as opposed to some countries like the USA, where government works generally fall into the public domain.

This project would create a derivative work of the information provided by the UPU, so I would guess they need to provide some sort of support or blessing for the project, as do the national postal operators or administrations that claim their own copyright. It may sound like an impossible task, but I think asking the right people the right questions in the right way should be sufficient.

As a side note, the UPU’s largest “data product” seems to be its postcode database, which is used to facilitate auto-completion and validation of address data. Free samples are available, but the full product appears to be sold commercially.

Starting the project

Last year, I created an organisation on GitHub under the name of AddressKit, which I intend to use for this project in the future. If you want to support the project by participating, then don’t hesitate to contact me or write a comment on this post.
