Add US Postal Abbreviations to Your Atlas Search in 5 Minutes
There are cases when it helps to have synonyms set up to work with your Atlas Search index. For example, if the search in your application needs to work with addresses, it might help to set up a list of common synonyms for postal abbreviations, so one could type in “blvd” instead of “boulevard” and still find all places with “boulevard” in the address.
This tutorial will show you how to set up your Atlas Search index to recognize US postal abbreviations.
To be successful with this tutorial, you will need:
- Python, to use a script that scrapes a list of street suffix abbreviations helpfully compiled by the United States Postal Service (USPS). This tutorial was written using Python 3.10.15, but you could try it on earlier versions of 3, if you’d like.
- A MongoDB Atlas cluster. Follow the Get Started with Atlas guide to create your account and a MongoDB cluster. For this tutorial, you can use your free-forever MongoDB Atlas cluster! Keep a note of your database username, password, and connection string as you will need those later.
- Rosetta, if you’re on macOS with an M1 chip. This will allow you to run MongoDB tools like mongoimport and mongosh.
- A copy of mongoimport. If you have MongoDB installed on your workstation, then you may already have mongoimport installed. If not, follow the instructions on the MongoDB website to install mongoimport.
- The sample_restaurants dataset, which we’ll use in this tutorial since it contains address data. For instructions on how to load sample data, see the documentation. You can also see all available sample datasets there.
The examples shown here were all written on macOS but should run on any Unix-like system. If you're running on Windows, we recommend running the example commands inside the Windows Subsystem for Linux.
To learn about synonyms in Atlas Search, we suggest you start by checking out our documentation. Synonyms allow you to index and search your collection for words that have the same or nearly the same meaning, or, in the case of our tutorial, you can search using different ways to write out an address and still get the results you expect. To set up and use synonyms in Atlas Search, you will need to:
- Create a collection containing the synonyms, in the same database as the collection you’re indexing. Note that every document in the synonyms collection must have a specific format.
- Reference that synonyms collection in a synonym mapping in your Atlas Search index definition, and then reference the mapping’s name in the synonyms option of your $search queries.
We will walk you through these steps in the tutorial, but first, let’s start with creating the JSON documents that will form our synonyms collection.
We will use the list of official street suffix abbreviations and a list of secondary unit designators from the USPS website to create a JSON document for each set of the synonyms.
All documents in the synonyms collection must have a specific format that specifies the type of synonyms—equivalent or explicit. Explicit synonyms have a one-way mapping. For example, if “boat” is explicitly mapped to “sail,” we’d be saying that if someone searches “boat,” we want to return all documents that include “sail” and “boat.” However, if we search the word “sail,” we would not get any documents that have the word “boat.” In the case of postal abbreviations, however, one can use all abbreviations interchangeably, so we will use the “equivalent” type of synonym in the mappingType field.
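For contrast, a hypothetical explicit mapping (not used in this tutorial) would carry an input field listing the one-way search terms, something like:

```json
{
  "mappingType": "explicit",
  "input": ["boat"],
  "synonyms": ["boat", "sail"]
}
```

With this document, a search for “boat” also matches “sail,” but not the other way around.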
Here is a sample document in the synonyms collection for all the possible abbreviations of “avenue”:
```json
{
  "mappingType": "equivalent",
  "synonyms": ["AVENUE", "AV", "AVEN", "AVENU", "AVN", "AVNUE", "AVE"]
}
```
We wrote the web scraping code for you in Python, and you can run it with the following commands to create a document for each synonym group:
```shell
git clone https://github.com/mongodb-developer/Postal-Abbreviations-Synonyms-Atlas-Search-Tutorial/
cd Postal-Abbreviations-Synonyms-Atlas-Search-Tutorial
python3 main.py
```
To see details of the Python code, read the rest of the section.
In order to scrape the USPS postal website, we will need to install the following packages with pip and import them: requests, BeautifulSoup (from beautifulsoup4), and pandas. We’ll also import the standard-library json and re modules for formatting our data when we’re ready:
```python
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
```
Let’s start with the Street Suffix Abbreviations page. We want to create objects that represent both the URL and the page itself:
```python
# Create a URL object
streetsUrl = 'https://pe.usps.com/text/pub28/28apc_002.htm'

# Create object page
headers = {
    "User-Agent": 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Mobile Safari/537.36'}
streetsPage = requests.get(streetsUrl, headers=headers)
```
Next, we want to get the information on the page. We’ll start by parsing the HTML, and then get the table by its id:
```python
# Obtain the page's information
streetsSoup = BeautifulSoup(streetsPage.text, 'html.parser')

# Get the table by its id
streetsTable = streetsSoup.find('table', {'id': 'ep533076'})
```
Now that we have the table, we’re going to want to transform it into a dataframe, and then format it in a way that’s useful for us:
```python
# Transform the table into a list of dataframes
streetsDf = pd.read_html(str(streetsTable))
```
One thing to take note of is that in the table provided on USPS’s website, one primary name is usually mapped to multiple commonly used names.
This means we need to dynamically group together commonly used names by their corresponding primary name and compile that into a list:
```python
# Group together all "Commonly Used Street Suffix or Abbreviation" entries
streetsGroup = streetsDf[0].groupby(0)[1].apply(list)
```
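To see what this grouping produces, here’s a minimal sketch with made-up rows in the same two-column shape as the scraped table (column 0 is the primary name, column 1 a commonly used abbreviation):

```python
import pandas as pd

# Toy stand-in for streetsDf[0]: column 0 holds the primary street
# suffix name, column 1 a commonly used abbreviation for it
df = pd.DataFrame([
    ["ALLEY", "ALLEE"],
    ["ALLEY", "ALLY"],
    ["AVENUE", "AV"],
    ["AVENUE", "AVE"],
])

# Collect the column-1 abbreviations into one list per primary name
group = df.groupby(0)[1].apply(list)

print(group["ALLEY"])   # ['ALLEE', 'ALLY']
print(group["AVENUE"])  # ['AV', 'AVE']
```

The result is a Series indexed by primary name, with each value a list of that name’s abbreviations—exactly the shape we need for an “equivalent” synonyms document.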
Once our names are all grouped together, we can loop through them and export them as individual JSON files.
```python
for x in range(streetsGroup.size):
    dictionary = {
        "mappingType": "equivalent",
        "synonyms": streetsGroup.iloc[x]
    }

    # Export the JSON into a file named after the primary street suffix
    with open(streetsGroup.index.values[x] + ".json", "w") as outfile:
        json.dump(dictionary, outfile)
```

Note that we use positional access (`.iloc`) on the grouped Series, since its index holds the primary names rather than integers.
Now, let’s do the same thing for the Secondary Unit Designators page.
Just as before, we’ll start with getting the page and transforming it to a dataframe:
```python
# Create a URL object
unitsUrl = 'https://pe.usps.com/text/pub28/28apc_003.htm'
unitsPage = requests.get(unitsUrl, headers=headers)

# Obtain the page's information
unitsSoup = BeautifulSoup(unitsPage.text, 'html.parser')

# Get the table by its id
unitsTable = unitsSoup.find('table', {'id': 'ep538257'})

# Transform the table into a list of dataframes
unitsDf = pd.read_html(str(unitsTable))
```
If we look at the table more closely, we can see that one of the values is blank. While it makes sense that the USPS would include this in the table, it’s not something that we want in our synonyms list.
To take care of that, we’ll simply remove all rows that have blank values:
```python
unitsDf[0] = unitsDf[0].dropna()
```
Next, we’ll take our new dataframe and turn it into a list:
```python
# Create a 2D list that we will use for our synonyms
unitsList = unitsDf[0][[0, 2]].values.tolist()
```
You may have noticed that some of the values in the table have asterisks in them. Let’s quickly get rid of them so they won’t be included in our synonym mappings:
```python
# Remove all non-alphanumeric characters
unitsList = [[re.sub(r"[^ \w]", " ", x).strip().lower() for x in y] for y in unitsList]
```
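To see the cleanup in action, here’s a quick sketch on made-up rows in the same shape as unitsList (the asterisks stand in for the footnote markers on the USPS page):

```python
import re

# Toy rows shaped like unitsList: [abbreviation, full designator]
sample = [["APT*", "APARTMENT"], ["BSMT", "BASEMENT"]]

# Replace anything that isn't a space or word character, then tidy up
cleaned = [[re.sub(r"[^ \w]", " ", x).strip().lower() for x in y] for y in sample]
print(cleaned)  # [['apt', 'apartment'], ['bsmt', 'basement']]
```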
Now we can loop through them and export them as individual JSON files, just as we did before. The one thing to note is that we want to restrict the range we’re iterating over to include only the relevant data:
```python
# Restrict the range to only retrieve the results we want
for x in range(1, len(unitsList) - 1):
    dictionary = {
        "mappingType": "equivalent",
        "synonyms": unitsList[x]
    }

    # Export the JSON into a file named after the abbreviation
    with open(unitsList[x][0] + ".json", "w") as outfile:
        json.dump(dictionary, outfile)
```
Now that we created the JSON documents for abbreviations, let’s load them all into a collection in the sample_restaurants database. If you haven’t already created a MongoDB cluster, now is a good time to do that and load the sample data in.
The first step is to connect to your Atlas cluster. We will use mongosh to do it. If you don’t have mongosh installed, follow the installation instructions in the MongoDB documentation.
To connect to your Atlas cluster, you will need a connection string. Choose the “Connect with the MongoDB Shell” option and follow the instructions. Note that you will need to connect with a database user that has permissions to modify the database, since we will be creating a collection in the sample_restaurants database. The command you need to enter in the terminal will look something like:
```shell
mongosh "mongodb+srv://cluster0.XXXXX.mongodb.net/sample_restaurants" --apiVersion 1 --username <USERNAME>
```
When prompted for the password, enter the database user’s password.
We created our synonym JSON documents in the right format already, but let’s make sure that if we decide to add more documents to this collection, they will also have the correct format. To do that, we will create a synonyms collection with a validator that uses $jsonSchema. The commands below will create a collection with the name “postal_synonyms” in the sample_restaurants database and ensure that only documents with correct format are inserted into the collection.
```javascript
use('sample_restaurants')

db.createCollection("postal_synonyms", {
  validator: {
    $jsonSchema: {
      "bsonType": "object",
      "required": ["mappingType", "synonyms"],
      "properties": {
        "mappingType": {
          "type": "string",
          "enum": ["equivalent", "explicit"],
          "description": "must be either equivalent or explicit"
        },
        "synonyms": {
          "bsonType": "array",
          "items": { "type": "string" },
          "description": "must be an array of strings and is required"
        },
        "input": {
          "type": "array",
          "items": { "type": "string" },
          "description": "must be an array of strings"
        }
      },
      "anyOf": [
        {
          "not": {
            "properties": { "mappingType": { "enum": ["explicit"] } },
            "required": ["mappingType"]
          }
        },
        { "required": ["input"] }
      ]
    }
  }
})
```
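With the validator in place, a document that doesn’t match the schema should be rejected. As a quick sanity check in mongosh (a sketch—an explicit mapping is required to carry an input field, so this insert should fail):

```javascript
// Should fail validation: mappingType "explicit" requires an "input" array
db.postal_synonyms.insertOne({
  "mappingType": "explicit",
  "synonyms": ["BLVD", "BOULEVARD"]
})
// MongoServerError: Document failed validation
```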
We will use mongoimport to import all the JSON files we created.
You will need a connection string for your Atlas cluster to use in the mongoimport command. If you don’t already have mongoimport installed, use the instructions in the MongoDB documentation.
In the terminal, navigate to the folder where all the JSON files for postal abbreviation synonyms were created.
```shell
cat *.json | mongoimport --uri 'mongodb+srv://<USERNAME>:<PASSWORD>@cluster0.pwh9dzy.mongodb.net/sample_restaurants?retryWrites=true&w=majority' --collection='postal_synonyms'
```
Take a look at the synonyms collection you just created in Atlas. You should see around 229 documents there.
Now that we created the synonyms collection in our sample_restaurants database, let’s put it to use.
Let’s start by creating a search index. Navigate to the Search tab in your Atlas cluster and click the “CREATE INDEX” button.
Since the Visual Index builder doesn’t support synonym mappings yet, we will choose JSON Editor and click Next:
In the JSON Editor, pick the restaurants collection in the sample_restaurants database and enter the following index definition. Here, the source collection name refers to the name of the collection with all the postal abbreviation synonyms, which we named “postal_synonyms.”
```json
{
  "mappings": {
    "dynamic": true
  },
  "synonyms": [
    {
      "analyzer": "lucene.standard",
      "name": "synonym_mapping",
      "source": {
        "collection": "postal_synonyms"
      }
    }
  ]
}
```
We are indexing the restaurants collection and creating a synonym mapping with the name “synonym_mapping” that references the synonyms collection “postal_synonyms.”
Click on Next and then on Create Search Index, and wait for the search index to build.
Once the index is active, we’re ready to test it out.
Now that we have an active search index, we’re ready to test that our synonyms are working. Let’s head to the Aggregation pipeline in the Collections tab to test different calls to $search. You can also use Compass, the MongoDB GUI, if you prefer.
Choose $search from the list of pipeline stages. The UI gives us a helpful placeholder for the $search command’s arguments.
Let’s look for all restaurants that are located on a boulevard. We will search in the “address.street” field, so the arguments to the $search stage will look like this:
```javascript
{
  index: 'default',
  text: {
    query: 'boulevard',
    path: 'address.street'
  }
}
```
Let’s add a $count stage after the $search stage to see how many restaurants with an address that contains “boulevard” we found:
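As a sketch, the full pipeline would then look like this (in the Atlas UI you add each stage separately; “restaurants” is just an arbitrary name for the output count field):

```javascript
[
  {
    $search: {
      index: 'default',
      text: {
        query: 'boulevard',
        path: 'address.street'
      }
    }
  },
  { $count: 'restaurants' }
]
```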
As expected, we found a lot of restaurants with the word “boulevard” in the address. But what if we don’t want to have users type “boulevard” in the search bar? What would happen if we put in “blvd,” for example?
```javascript
{
  index: 'default',
  text: {
    query: 'blvd',
    path: 'address.street'
  }
}
```
Looks like it found us restaurants with addresses that have “blvd” in them. What about the addresses with “boulevard,” though? Those did not get picked up by the search.
And what if we weren’t sure how to spell “boulevard” and just searched for “boul”? USPS’s website tells us it’s an acceptable abbreviation for boulevard, but our $search finds nothing.
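For instance, with no synonyms configured, this query comes back empty, since no address contains the literal token “boul”:

```javascript
{
  index: 'default',
  text: {
    query: 'boul',
    path: 'address.street'
  }
}
```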
This is where our synonyms come in! We need to add a synonyms option to the text operator in the $search command and reference the synonym mapping’s name:
```javascript
{
  index: 'default',
  text: {
    query: 'blvd',
    path: 'address.street',
    synonyms: 'synonym_mapping'
  }
}
```
And there you have it! We found all the restaurants on boulevards, regardless of which way the address was abbreviated, all thanks to our synonyms.
Synonyms are just one of many features Atlas Search offers to give you all the necessary search functionality in your application. All of these features are available right now on MongoDB Atlas. We just showed you how to add support for common postal abbreviations to your Atlas Search index—what can you do with Atlas Search next? Try it now on your free-forever MongoDB Atlas cluster and head over to the community forums if you have any questions!