Building a Python Data Access Layer
This tutorial will show you how to use some reasonably advanced Python techniques to wrap BSON documents in a way that makes them feel much more like Python objects and allows different ways to access the data within. It's the first in a series demonstrating how to build a Python data access layer for MongoDB.
This tutorial is loosely based on the first episode of a new livestream I host, called "Coding with Mark." I'm streaming on Wednesdays at 2 p.m. GMT (that's 9 a.m. ET or 6 a.m. PT, if you're an early riser!). If that time doesn't work for you, you can always catch up by watching the recordings!
For the first few episodes, you can follow along as I attempt to build a different kind of Pythonic data access layer, a library to abstract underlying database modeling changes from a hypothetical application. One of the examples I'll use later on in this series is a microblogging platform, along the lines of Twitter/X or Bluesky. In order to deal with huge volumes of data, various modeling techniques are required, and my library will attempt to find ways to make these data modeling choices invisible to the application, making it easier to develop while remaining possible to change the underlying data model.
I'm using some pretty advanced programming and metaprogramming techniques to hide away some quite clever functionality. It's going to be a good series whether you're looking to improve your Python skills or your MongoDB skills.
If that doesn't sound exciting enough, I'm lining up some awesome guests from the Python community, and in the future, we may branch away from Python and into other strange and wonderful worlds.
In any well-architected application of a reasonable size, you'll usually find that the codebase is split into at least three areas of concern:
- A presentation layer is concerned with formatting data for consumption by a client. This may generate web pages to be viewed by a person in a browser, but increasingly, this may be an API endpoint, either driving an app that runs on a user's computer (or within their browser) or providing data to other services within a broader service-based architecture. This layer is also responsible for receiving data from a client and parsing it into data that can be used by the business logic layer.
- A business logic layer sits behind the presentation layer and provides the "brains" of an application, making decisions on what actions to take based on user requests or data input into the application.
- The data access layer, where I'm going to be focusing, provides a layer of abstraction over the database. Its responsibility is to request data from the database and provide it in a usable form to the business logic layer, but also to take requests from the business logic layer and store data appropriately in the database.
Given the subject of this article, let's look a little more closely at the responsibilities of the data access layer. A good data access layer provides a layer of abstraction away from the physical storage of the data.
With relational databases, a data access layer may be a relatively straightforward implementation of the Active Record pattern, closely matching the way the data is divided across one or more tables.
MongoDB is more flexible than a tabular database: It can store documents with different schemas within a single collection, and related data can either be embedded within a document or stored elsewhere and joined on read. Because of this, a data access layer may need to provide more abstraction capabilities than simply porting an ORM (an Object-Relational Mapper, the kind of library that handles mapping between relational data in a tabular database and objects in your application) to work with documents.
Why not just use an existing ODM? Good question! Many great ODMs have been developed for MongoDB. ODM is short for "Object Document Mapper" and describes a type of library that attempts to map between MongoDB documents and your application objects. Just within the Python ecosystem, there are MongoEngine, ODMantic, PyMODM, and more recently, Beanie and Bunnet. The last two are more or less the same, but Beanie is built on asyncio and Bunnet is synchronous. We're especially big fans of Beanie at MongoDB, and because it's built on Pydantic, it works especially well with FastAPI.
On the other hand, most ODMs are essentially solving the same problem — abstracting away MongoDB's powerful query language to make it easier to read and write, and modeling document schemas as objects so that data can be directly serialized and deserialized between the application and MongoDB.
Once your data model becomes relatively sophisticated, however, and you're implementing one or more patterns to improve the performance and scalability of your application, the way your data is stored is not necessarily the way you logically think about it within your application.
On top of that, if you're working with a very large dataset, then data migration may not be feasible, meaning that different subsets of your data will be stored in different ways! A good data access layer should be able to abstract over these differences so that your application doesn't need to be rewritten each time you evolve your schema for one reason or another.
Am I just building another ODM? Well, yes, probably. I'm just a little reluctant to use the term because I think it comes along with some of the preconceptions I've mentioned here. If it is an ODM, it's one which will have a focus on the “M.”
And partly, I just think it's a fun thing to build. It's an experiment. Let's see if it works!
You can check out the current library in the project's GitHub repo. At the time of writing, the README contains what could be described as a manifesto:
- Managing large amounts of data in MongoDB while keeping a data schema flexible is challenging.
- This ODM is not an active record implementation, mapping documents in the database directly into similar objects in code.
- This ODM is designed to abstract underlying documents, mapping potentially multiple document schemata into a shared object representation. It should also simplify the evolution of documents in the database, automatically migrating individual documents' schemas either on-read or on-write.
- There should be "escape hatches" so that unforeseen mappings can be implemented, hiding away the implementation code behind hopefully reusable components.
I think that's enough waffle. Let's get started.
If you want to get a look at how this will all work once it all comes together, skip to the end, where I'll also show you how it can be used with PyMongo queries. For the moment, I'm going to dive right in and start implementing a class for wrapping BSON documents to make it easier to abstract away some of the details of the document structure. In later tutorials, I may start to modify the way queries are done, but at the moment, I just want to wrap individual documents.
I want to define classes that encapsulate data from the database, so let's call that class Document. At the moment, I just need it to store away an underlying "raw" document, which PyMongo (and Motor) both provide as dict implementations:

```python
class Document:
    def __init__(self, doc, *, strict=False):
        self._doc = doc
        self._strict = strict
```
I've defined two parameters that are stored away on the instance: doc and strict. The first will hold the underlying BSON document so that it can be accessed, and strict is a boolean flag I'll explain below. In this tutorial, I'm mostly ignoring the details of using PyMongo or Motor to access MongoDB — I'm just working with BSON document data as a plain old dict.

When a Document instance wraps a MongoDB document, if strict is False, then it will allow any field in the document to automatically be looked up as if it were a normal Python attribute of the Document instance that wraps it. If strict is True, then it won't allow this dynamic lookup.

So, if I have a MongoDB document that contains { 'name': 'Jones' }, then wrapping it with a Document will behave like this:
```python
>>> relaxed_doc = Document({'name': 'Jones'})
>>> relaxed_doc.name
'Jones'

>>> strict_doc = Document({'name': 'Jones'}, strict=True)
>>> strict_doc.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../docbridge/__init__.py", line 33, in __getattr__
    raise AttributeError(
AttributeError: 'Document' object has no attribute 'name'
```
The class doesn't do this magic attribute lookup by itself, though! To get that behavior, I'll need to implement __getattr__. This is a "magic" or "dunder" method that is automatically called by Python when an attribute is requested that is not actually defined on the instance or the class (or any of the superclasses). As a fallback, Python will call __getattr__ if your class implements it, passing in the name of the attribute that's been requested.

```python
def __getattr__(self, attr):
    if not self._strict:
        try:
            return self._doc[attr]
        except KeyError:
            pass  # A missing field falls through to the AttributeError below.
    raise AttributeError(
        f"{self.__class__.__name__!r} object has no attribute {attr!r}"
    )
```
This implements the logic I've described above (although it differs slightly from the code in the repository because there were a couple of bugs in that!).
This is a neat way to make a dictionary look like an object and allows document fields to be looked up as if they were attributes. It does currently require those attribute names to be exactly the same as the underlying fields, though, and it only works at the top level of the document, as the quick sketch below shows.
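Here's what that limitation looks like in practice (a minimal example, using the Document class defined above; the sample document is just for illustration):

```python
doc = Document({"name": "Jones", "address": {"city": "London"}})

print(doc.name)             # Top-level lookup works: prints 'Jones'
print(doc.address)          # Returns the raw nested dict: {'city': 'London'}
print(doc.address["city"])  # Nested values still require dictionary access.
```

In order to make the encapsulation more powerful, I need to be able to configure how data is looked up on a per-field basis. First, let's handle how to map an attribute to a different field name.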
The first abstraction I'd like to implement is the ability to have a different field name in the BSON document to the one that's exposed by the Document object. Let's say I have a document like this:
```json
{
    "cocktailName": "Old Fashioned"
}
```
The field name uses camelCase instead of the more idiomatic snake_case (which would be "cocktail_name" instead of "cocktailName"). At this point, I could rename the field with a MongoDB update, but that's both unnecessary (the name isn't that important) and potentially controversial with other teams using the same database who may be more used to camelCase names. So let's add the ability to explicitly map from one attribute name to a different field name in the wrapped document.
I'm going to do this using metaprogramming, but in this case, it doesn't require me to write a custom metaclass! Let's assume that I'm going to subclass Document to provide a specific mapping for cocktail recipe documents.

```python
class Cocktail(Document):
    cocktail_name = Field(field_name="cocktailName")
```
This may look similar to some patterns you've seen used by other ODMs or with, say, a Django model. Under the hood, Field needs to implement the Descriptor Protocol so that we can intercept attribute lookup for cocktail_name on instances of the Cocktail class and return data contained in the underlying BSON document.

The name sounds highly technical, but all it really means is that I'm going to implement a couple of methods on Field so that Python can treat it differently in two different ways:

- __set_name__ is called by Python when the Field is attached to a class (in this case, the Cocktail class). It's called with, you guessed it, the name of the attribute — in this case, "cocktail_name."
- __get__ is called by Python whenever the attribute is looked up on a Cocktail instance. So in this case, if I had a Cocktail instance called my_cocktail, then accessing my_cocktail.cocktail_name will call Field.__get__() under the hood, passing in the instance and the Cocktail class as arguments. This allows you to return whatever you think should be returned by this attribute access — which is the underlying BSON document's "cocktailName" value.

Here's my implementation of Field. I've simplified it from the implementation in GitHub, but this implements everything I've described above.

```python
class Field:
    def __init__(self, field_name=None):
        """
        Initialize a Field attribute, mapping to an underlying BSON field.

        field_name is the name of the underlying BSON field.
        If field_name is None (the default), use the attribute name for
        lookup in the doc.
        """
        self.field_name = field_name

    def __set_name__(self, owner, name):
        """
        Called by Python when this Field instance is attached to a class (the owner).
        """
        self.name = name  # this is the *attribute* name on the class.

        # If no field_name was provided, then default to using the attribute
        # name to look up the BSON field:
        if self.field_name is None:
            self.field_name = name

    def __get__(self, ob, cls):
        """
        Called by Python when this attribute is looked up on an instance of
        the class it's attached to.
        """
        try:
            # Look up the BSON field and return it:
            return ob._doc[self.field_name]
        except KeyError as ke:
            raise ValueError(
                f"Attribute {self.name!r} is mapped to missing document property {self.field_name!r}."
            ) from ke
```
With the code above, I've implemented a Field object, which can be attached to a Document class. It gives you the ability to allow field lookups on the underlying BSON document, with an optional mapping between the attribute name and the underlying field name.
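Here's a quick demonstration of the descriptor in action, using the Cocktail class defined earlier (the sample documents are just for illustration):

```python
cocktail = Cocktail({"cocktailName": "Old Fashioned"})

# The descriptor maps the snake_case attribute to the camelCase BSON field:
print(cocktail.cocktail_name)  # Prints: Old Fashioned

# If the mapped field is missing, Field.__get__ raises the ValueError above:
try:
    Cocktail({"name": "Negroni"}).cocktail_name
except ValueError as e:
    print(e)  # Attribute 'cocktail_name' is mapped to missing document property 'cocktailName'.
```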
A very common pattern in MongoDB is the schema versioning pattern, which is very important if you want to maintain the evolvability of your data. ("Evolvability" is a term coined by Martin Kleppmann in his book, Designing Data-Intensive Applications.)
The premise is that over time, your document schema will need to change, either for efficiency reasons or just because your requirements have changed. MongoDB allows you to store documents with different structures within a single collection so a changing schema doesn't require you to change all of your documents in one go — which can be infeasible with very large datasets anyway.
Instead, the schema versioning pattern suggests that when your schema changes, as you update individual documents to the new structure, you update a field that specifies the schema version of each document.
For example, I might start with a document representing a person, like this:
```json
{
    "name": "Mark Smith",
    "schema_version": 1
}
```
But eventually, I might realize that I need to break up the user's name:
```json
{
    "full_name": "Mark Smith",
    "first_name": "Mark",
    "last_name": "Smith",
    "schema_version": 2
}
```
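Upgrading a document then means restructuring it and bumping schema_version in the same write. Here's a minimal sketch of what that migration could look like with plain PyMongo (the upgrade_to_v2 helper is hypothetical, and the naive name split is just for illustration):

```python
def upgrade_to_v2(collection, doc_id):
    # Fetch the document only if it's still on the old schema:
    doc = collection.find_one({"_id": doc_id, "schema_version": 1})
    if doc is None:
        return  # Already migrated (or doesn't exist).

    # Naively split the single name field into first and last names:
    first_name, _, last_name = doc["name"].partition(" ")

    collection.update_one(
        {"_id": doc_id},
        {
            "$set": {
                "full_name": doc["name"],
                "first_name": first_name,
                "last_name": last_name,
                "schema_version": 2,
            },
            "$unset": {"name": ""},
        },
    )
```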
In this example, when I load a document from this collection, I won't know in advance whether it's version 1 or 2, so when I request the name of the person, it may be stored in "name" or "full_name" depending on whether the particular document has been upgraded or not.
For this, I've designed a different kind of "Field" descriptor, called a "FallthroughField." This one will take a list of field names and will attempt to look them up in turn. In this way, I can avoid checking the "schema_version" field in the underlying document, but it will still work with both older and newer documents.
FallthroughField looks like this:

```python
from typing import Sequence


class FallthroughField:
    def __init__(self, field_names: Sequence[str]) -> None:
        self.field_names = field_names

    def __get__(self, ob, cls):
        # Loop through the field names until one of them is found in the
        # underlying document:
        for field_name in self.field_names:
            try:
                return ob._doc[field_name]
            except KeyError:
                pass
        # None of the field names were present:
        raise ValueError(
            f"Attribute {self.name!r} references the field names "
            f"{', '.join([repr(fn) for fn in self.field_names])} "
            f"which are not present."
        )

    def __set_name__(self, owner, name):
        self.name = name
```
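As a quick check that the fallthrough works with both schema versions from earlier, here's a throwaway subclass (it mirrors the Person class used in the full example at the end of this tutorial):

```python
class Person(Document):
    name = FallthroughField(["name", "full_name"])

v1 = Person({"name": "Mark Smith", "schema_version": 1})
v2 = Person(
    {
        "full_name": "Mark Smith",
        "first_name": "Mark",
        "last_name": "Smith",
        "schema_version": 2,
    }
)

print(v1.name)  # Found under "name": Mark Smith
print(v2.name)  # "name" is missing, so falls through to "full_name": Mark Smith
```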
Obviously, changing a field name is a relatively trivial schema change. I have big plans for how I can use descriptors to abstract away lots of complexity in the underlying document model.
This tutorial has shown a lot of implementation code. Now, let me show you what it looks like to use this library in practice:
```python
import os
from docbridge import Document, Field, FallthroughField
from pymongo import MongoClient

collection = (
    MongoClient(os.environ["MDB_URI"])
    .get_database("docbridge_test")
    .get_collection("people")
)

collection.delete_many({})  # Clean up any leftover documents.
# Insert a couple of sample documents:
collection.insert_many(
    [
        {
            "name": "Mark Smith",
            "schema_version": 1,
        },
        {
            "full_name": "Mark Smith",
            "first_name": "Mark",
            "last_name": "Smith",
            "schema_version": 2,
        },
    ]
)


# Define a mapping for "person" documents:
class Person(Document):
    version = Field("schema_version")
    name = FallthroughField(
        [
            "name",  # v1
            "full_name",  # v2
        ]
    )


# This finds all the documents in the collection, but wraps each BSON
# document with a Person wrapper:
people = (Person(doc) for doc in collection.find())
for person in people:
    print(
        "Name:",
        person.name,
    )  # The name (or full_name) of the underlying document.
    print(
        "Document version:",
        person.version,  # The schema_version field of the underlying document.
    )
```
If you run this, it prints out the following:
```
$ python examples/why/simple_example.py
Name: Mark Smith
Document version: 1
Name: Mark Smith
Document version: 2
```
I'll be the first to admit that this was a long tutorial given that effectively, I've so far just written an object wrapper around a dictionary that can conduct some simple name remapping. But it's a great start for some of the more advanced features that are upcoming:
- The ability to automatically upgrade the data in a document when data is calculated or otherwise written back to the database
- Recursive class definitions to ensure that you have the full power of the framework no matter how nested your data is
- The ability to transparently handle the subset and extended reference patterns to lazily load data from across documents and collections
- More advanced name remapping to build Python objects that feel like Python objects, on documents that may have dramatically different conventions
- Potentially some tools to help build complex queries against your data
But the next thing to do is to take a step back from writing library code and do some housekeeping. I'm building a test framework to help test directly against MongoDB while having my test writes rolled back after every test, and I'm going to package and publish the docbridge library. You can check out the livestream recording where I attempt this, or you can wait for the accompanying tutorial, which will be written any day now.
I'm streaming on the MongoDB YouTube channel nearly every Wednesday, at 2 p.m. GMT! Come join me — it's always helpful to have more people spot the bugs I'm creating as I write the code!