Coding With Mark: Abstracting Joins & Subsets in Python
Mark Smith • 11 min read • Published Mar 19, 2024 • Updated Mar 19, 2024
This tutorial will talk about MongoDB design patterns — specifically, the Subset Pattern — and show how you can build an abstraction in your Python data model that hides how data is actually modeled within your database.
This is the third tutorial in a series! Feel free to check out the first tutorial or second tutorial if you like, but it's not necessary if you want to just read on.
This tutorial is loosely based on some episodes of a livestream I host, called "Coding with Mark." I'm streaming on Wednesdays at 2 p.m. GMT (that's 9 a.m. ET or 6 a.m. PT, if you're an early riser!). If that time doesn't work for you, you can always catch up by watching the recordings!
Currently, I'm building an experimental data access layer library that should provide a toolkit for abstracting complex document models from the business logic layer of the application that's using them.
The purpose of docbridge, my Object-Document Mapper, is to abstract the data model used within MongoDB from the data model used by a Python program. With a codebase of any size, you need something like this because otherwise, every time you change your data model (in your database), you need to change the object model (in your code). By having an abstraction layer, you localize all of this mapping into a single area of your codebase, and that's then the only part that needs to change when you change your data model. This ability to change your data model really allows you to take advantage of the flexibility of MongoDB's document model.
In the first tutorial, I showed a very simple abstraction, the FallbackField, that would try various different field names in a document until it found one that existed, and then would return that value. This was a very simple implementation of the Schema Versioning pattern.
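As a refresher, the core of that idea can be sketched in a few lines. This is a minimal illustrative reconstruction, not necessarily the exact docbridge code:

```python
class FallbackField:
    """Descriptor that returns the value of the first field name that
    exists in the wrapped BSON document."""

    def __init__(self, *field_names):
        self.field_names = field_names

    def __get__(self, ob, cls):
        for name in self.field_names:
            # ob._doc is the underlying BSON document, as elsewhere in
            # this series:
            if name in ob._doc:
                return ob._doc[name]
        raise ValueError(f"No field in {self.field_names!r} found in document.")
```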
MongoDB allows you to store arrays in your documents, natively. The values in those arrays can be primitive types, like numbers, strings, and dates, or they can even be subdocuments. But sometimes, those arrays can get too big, and the Subset Pattern describes a technique where the most important subset of the array (often just the first few items) is stored directly in the source document, and any overflow items are stored in other documents and looked up only when necessary.
This solves two design problems: First, we recommend that you don't store more than 200 items in an array, as the more items you have, the slower the database is at traversing the fields in each document. Second, the subset pattern also answers a question that I've seen many times when we've been teaching data modeling: "How do I stop my array from growing so big that the document becomes bigger than the 16MB limit?" While we're on the subject, do avoid your documents getting this big — it usually implies that you could improve your data model, for example, by separating out data into separate documents, or if you're storing lots of binary data, you could keep it outside your database, in an object store.
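If you're unsure how close a document is getting to that limit, you can measure its encoded size with the bson package that ships with PyMongo. A quick, illustrative check:

```python
import bson

doc = {"user_id": "4", "followers": [{"user_name": "@rduncan"}] * 200}

# bson.encode serializes a dict to BSON bytes, so len() tells you how much
# space the document would occupy in MongoDB (hard limit: 16MB).
print(f"{len(bson.encode(doc))} bytes")
```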
Before delving into how to abstract a lookup for the extra array items that aren't embedded in the source document, I'll first implement a wrapper type for a BSON array. This can be used to declare array fields on a `Document` class, instead of the `Field` type that I implemented in previous articles. I'm going to define a `SequenceField` to map a document's array into my access layer's object model. The core functionality of a `SequenceField` is that you can specify a type for the array's items, and when you iterate through the sequence, it will return objects of that type, instead of just yielding whatever is stored in the document.

A concrete example would be a social media API's `UserProfile` class, which would store a list of `Follower` objects. I've created some sample documents with a Python script using Faker. A sample document looks like this:
```
{
  "_id": { "$oid": "657072b56731c9e580e9dd70" },
  "user_id": "4",
  "user_name": "@tanya15",
  "full_name": "Deborah White",
  "birth_date": { "$date": { "$numberLong": "931219200000" } },
  "email": "deanjacob@yahoo.com",
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "follower_count": { "$numberInt": "59" },
  "followers": [
    {
      "_id": { "$oid": "657072b66731c9e580e9dda6" },
      "user_id": "58",
      "user_name": "@rduncan",
      "bio": "Rich beautiful color life. Relationship instead win join enough board successful."
    },
    {
      "_id": { "$oid": "657072b66731c9e580e9dd99" },
      "user_id": "45",
      "user_name": "@paynericky",
      "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid."
    }
    # ... other followers
  ]
}
```
I can model this data using two classes — one for the top-level Profile data, and one for the summary data for that profile's followers (embedded in the array).
```python
class Follower(Document):
    _id = Field(transform=str)
    user_name = Field()

class Profile(Document):
    _id = Field(transform=str)
    followers = SequenceField(type=Follower)
```
If I want to loop through all the followers of a profile instance, each item should be a `Follower` instance:

```python
profile = Profile(SOME_BSON_DATA)
for follower in profile.followers:
    assert isinstance(follower, Follower)
```
This behavior can be implemented in a similar way to the `Field` class, by implementing it as a descriptor, with a `__get__` method that, in this case, yields a `Follower` constructed for each item in the underlying BSON array. The code looks a little like this:

```python
class SequenceField:
    """
    Allows an underlying array to have its elements wrapped in
    Document instances.
    """

    def __init__(
        self,
        type,
        field_name=None,
    ):
        self._type = type
        self.field_name = field_name

    def __set_name__(self, owner, name):
        """
        Called when the enclosing Document subclass (owner) is defined.
        """
        self.name = name  # Store the attribute name.

        # If a field-name mapping hasn't been provided,
        # the BSON field will have the same name as the attribute name.
        if self.field_name is None:
            self.field_name = name

    def __get__(self, ob, cls):
        """
        Called when the SequenceField attribute is accessed on the enclosing
        Document subclass.
        """
        try:
            # Look up the field in the BSON, and return an array where each
            # item is wrapped by the class provided as type in __init__:
            return [
                self._type(item, ob._db)
                for item in ob._doc[self.field_name]
            ]
        except KeyError as ke:
            raise ValueError(
                f"Attribute {self.name!r} is mapped to missing document property {self.field_name!r}."
            ) from ke
```
That's a lot of code, but quite a lot of it is duplicated from `Field`; I'll fix that with some inheritance at some point. The most important part is near the end:

```python
return [
    self._type(item, ob._db)
    for item in ob._doc[self.field_name]
]
```
In the concrete example above, this would resolve to something like this fictional code:
```python
return [
    Follower(item, db=None) for item in profile._doc["followers"]
]
```
The dataset I've created for working with this only stores the first 20 followers within a profile document. The rest are stored in a "followers" collection, and they're bucketed to store up to 20 followers per document, in a field called "followers." The "user_id" field says who the followers belong to. A single document in the "followers" collection looks like this:
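Based on that description, a bucket document has just a "user_id" and a "followers" array. It looks something like this (the specific values here are illustrative, not taken from the real dataset):

```
{
  "_id": { "$oid": "657072b86731c9e580e9ddf2" },
  "user_id": "4",
  "followers": [
    {
      "_id": { "$oid": "657072b76731c9e580e9ddc1" },
      "user_id": "136",
      "user_name": "@wmorgan",
      "bio": "..."
    }
    # ... up to 20 followers per bucket document
  ]
}
```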
The Bucket Pattern is a technique for putting lots of small subdocuments together in a bucket document, which can make it more efficient to retrieve documents that are usually retrieved together, and it can keep index sizes down. The downside is that it makes updating individual subdocuments slightly slower and more complex.
I have a collection where each document contains an array of followers — a "bucket" of followers. But what I want is a query that returns individual follower documents. Let's break down how this query will work:
- I want to look up all the documents for a particular user_id.
- For each item in followers — each item is a follower — I want to yield a single document for that follower.
- I want to restructure each document so that it only contains the follower information, not the bucket information.
This is what I love about aggregation pipelines — once I've come up with those steps, I can often convert each step into an aggregation pipeline stage.
Step 1: Look up all the documents for a particular user:
1 {"$match": {"user_id": "4"}}
Note that this stage has hard-coded the value "4" for the "user_id" field. I'll explain later how dynamic values can be inserted into these queries. This outputs a single document, a bucket, containing many followers, in a field called "followers":
1 { 2 "user_name": "@tanya15", 3 "full_name": "Deborah White", 4 "birth_date": { 5 "$date": "1999-07-06T00:00:00.000Z" 6 }, 7 "email": "deanjacob@yahoo.com", 8 "bio": "Music conference able doctor degree debate. Participant usually above relate.", 9 "user_id": "4", 10 "follower_count": 59, 11 "followers": [ 12 { 13 "_id": { 14 "$oid": "657072b66731c9e580e9dda6" 15 }, 16 "user_id": "58", 17 "user_name": "@rduncan", 18 "bio": "Rich beautiful color life. Relationship instead win join enough board successful." 19 }, 20 { 21 "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid.", 22 "_id": { 23 "$oid": "657072b66731c9e580e9dd99" 24 }, 25 "user_id": "45", 26 "user_name": "@paynericky" 27 }, 28 { 29 "_id": { 30 "$oid": "657072b76731c9e580e9ddba" 31 }, 32 "user_id": "78", 33 "user_name": "@tiffanyhicks", 34 "bio": "Sign writer win. Look television official information laugh. Lay plan effect break expert message during firm." 35 }, 36 . . . 37 ], 38 "_id": { 39 "$oid": "657072b56731c9e580e9dd70" 40 } 41 }
Step 2: Yield a document for each follower — the $unwind stage can do exactly this:
1 {"$unwind": "$followers"}
This instructs MongoDB to return one document for each item in the "followers" array. All of the document contents will be included, but the followers array will be replaced with the single follower subdocument each time. This outputs several documents, each containing a single follower in the "followers" field:
```
# First document:
{
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "follower_count": 59,
  "followers": {
    "_id": {
      "$oid": "657072b66731c9e580e9dda6"
    },
    "user_id": "58",
    "user_name": "@rduncan",
    "bio": "Rich beautiful color life. Relationship instead win join enough board successful."
  },
  "user_id": "4",
  "user_name": "@tanya15",
  "full_name": "Deborah White",
  "birth_date": {
    "$date": "1999-07-06T00:00:00.000Z"
  },
  "email": "deanjacob@yahoo.com",
  "_id": {
    "$oid": "657072b56731c9e580e9dd70"
  }
}

# Second document:
{
  "_id": {
    "$oid": "657072b56731c9e580e9dd70"
  },
  "full_name": "Deborah White",
  "email": "deanjacob@yahoo.com",
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "follower_count": 59,
  "user_id": "4",
  "user_name": "@tanya15",
  "birth_date": {
    "$date": "1999-07-06T00:00:00.000Z"
  },
  "followers": {
    "_id": {
      "$oid": "657072b66731c9e580e9dd99"
    },
    "user_id": "45",
    "user_name": "@paynericky",
    "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid."
  }
}

# . . . More documents follow
```
Step 3: Restructure the document, pulling the "follower" value up to the top-level of the document. There's a special stage for doing this — $replaceRoot:
1 {"$replaceRoot": {"newRoot": "$followers"}},
Adding the stage above results in each document containing a single follower, at the top level:
```
# Document 1:
{
  "_id": {
    "$oid": "657072b66731c9e580e9dda6"
  },
  "user_id": "58",
  "user_name": "@rduncan",
  "bio": "Rich beautiful color life. Relationship instead win join enough board successful."
}

# Document 2:
{
  "_id": {
    "$oid": "657072b66731c9e580e9dd99"
  },
  "user_id": "45",
  "user_name": "@paynericky",
  "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid."
}

# . . . More documents follow
```
Putting it all together, the query looks like this:
```python
[
    {"$match": {"user_id": "4"}},
    {"$unwind": "$followers"},
    {"$replaceRoot": {"newRoot": "$followers"}},
]
```
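If you want to try the pipeline out directly with PyMongo before it gets wrapped by the library, running it looks something like this. The database name here is a hypothetical one for the sample data:

```python
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
db = client["socialmedia"]  # hypothetical database name

pipeline = [
    {"$match": {"user_id": "4"}},
    {"$unwind": "$followers"},
    {"$replaceRoot": {"newRoot": "$followers"}},
]

# aggregate() returns a lazy cursor that yields one follower document at a time:
for follower_doc in db.followers.aggregate(pipeline):
    print(follower_doc["user_name"])
```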
I've explained the query that I want to be run each time I iterate through the followers field in my data abstraction library. Now, I'll show you how to hide this query (or whatever query is required) away in the SequenceField implementation.
I would like to change the behavior of the SequenceField so that it does the following:
- Iterate through the embedded subdocuments and yield each one, wrapped by type (the callable that wraps each subdocument).
- If the user gets to the end of the embedded array, make a query to look up the rest of the followers and yield them one by one, also wrapped by type.
First, I'll change the `__init__` method so that the user can provide two extra parameters:

- The collection that contains the extra documents, `superset_collection`
- The query to run against that collection to return individual documents, `superset_query`

The result looks like this:
```python
from typing import Callable

class SequenceField:
    def __init__(
        self,
        type,
        field_name=None,
        superset_collection=None,
        superset_query: Callable = None,
    ):
        self._type = type
        self.field_name = field_name
        self.superset_collection = superset_collection
        self.superset_query = superset_query
```
The query will have to be provided as a callable, i.e., a function, lambda expression, or method. The reason is that generating the query will usually need access to some of the document's state (in this case, the `user_id`, to construct the query that looks up the correct follower documents). The callable is stored in the `SequenceField` instance, and when the lookup is needed, the field calls the callable, passing it the `Document` that contains the field, so the callable can look up the user's ID in the wrapped `_doc` dictionary.
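As a tiny, hypothetical illustration: if the overflow items were stored one per document, a find-style callable returning a filter dict would be all that's needed. (The followers dataset in this article needs the aggregation pipeline version instead, shown further down.)

```python
# Hypothetical find-style superset query: given the enclosing Document,
# build a filter dict from its state.
superset_query = lambda ob: {"user_id": ob.user_id}
```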
Now that the user can provide enough information to look up the extra followers (the superset), I changed the `__get__` method to perform the lookup when it runs out of embedded followers. To make this simpler to write, I took advantage of laziness. Twice! Here's how:

Laziness Part 1: When you execute a query by calling `find` or `aggregate`, the query is not executed immediately. Instead, the method immediately returns a cursor. Cursors are lazy, which means they don't do anything until you start to use them, by iterating over their contents. As soon as you start to iterate, or loop, over the cursor, it queries the database and starts to yield results.

Laziness Part 2: Most of the functions in the core Python `itertools` module are also lazy, including the `chain` function. `chain` is called with one or more iterables as arguments and only starts to loop through the later arguments when the earlier iterables are exhausted (meaning the code has looped through all of their contents).

These can be combined to create a single iterable that will never request any extra followers from the database, unless the code specifically requests more items after looping through the embedded items:
```python
import itertools

embedded_followers = self._doc["followers"]  # a list
cursor = followers.find({"user_id": "4"})    # 'followers' is a collection; find returns a lazy cursor

# Looping through all_followers will only make a database call if you have
# looped through all of the contents of embedded_followers:
all_followers = itertools.chain(embedded_followers, cursor)
```
The real code is a bit more flexible, because it supports both find and aggregate queries. It recognizes which it has been given by type: find queries are provided as dicts, and aggregate queries are lists.
```python
# (These imports are needed at the top of the module:
#  from collections.abc import Iterable, Mapping
#  from itertools import chain)

def __get__(self, ob, cls):
    if self.superset_query is None:
        # Use an empty sequence if there are no extra items.
        # It's still iterable, like a cursor, but immediately exits.
        superset = []
    else:
        # Call the superset_query callable to obtain the generated query:
        query = self.superset_query(ob)

        # If the query is a mapping, it's a find query; otherwise, it's an
        # aggregation pipeline.
        if isinstance(query, Mapping):
            superset = ob._db.get_collection(self.superset_collection).find(query)
        elif isinstance(query, Iterable):
            superset = ob._db.get_collection(self.superset_collection).aggregate(
                query
            )
        else:
            raise Exception("Returned query was not a mapping or iterable.")

    try:
        # Return an iterable that first yields all the embedded items, and
        # then the items from the superset query:
        return chain(
            [self._type(item, ob._db) for item in ob._doc[self.field_name]],
            (self._type(item, ob._db) for item in superset),
        )
    except KeyError as ke:
        raise ValueError(
            f"Attribute {self.name!r} is mapped to missing document property {self.field_name!r}."
        ) from ke
```
I've added quite a few comments to the code above, so hopefully you can see the relationship between the simplified code above it and the real code here.
Implementing `Profile` and `Follower` is now a matter of providing the query (as a callable) and the collection that should be queried:

```python
# This is the same as it was originally
class Follower(Document):
    _id = Field(transform=str)
    user_name = Field()


def extra_followers_query(profile):
    return [
        {
            "$match": {"user_id": profile.user_id},
        },
        {"$unwind": "$followers"},
        {"$replaceRoot": {"newRoot": "$followers"}},
    ]


class Profile(Document):
    _id = Field(transform=str)
    followers = SequenceField(
        type=Follower,
        superset_collection="followers",
        superset_query=extra_followers_query,
    )
```
An application that used the above `Profile` definition could look up the `Profile` with a "user_id" of "4" and then print out the user names of all their followers with some code like this:

```python
for follower in profile.followers:
    print(follower.user_name)
```
See how the extra query is now part of the type's mapping definition and not the code dealing with the data? That's the kind of abstraction I wanted to provide when I started building this experimental library. I have more plans, so stick with me! But before I implement more data abstractions, I first need to implement updates — that's something I'll describe in my next tutorial.
This is now the third tutorial in my Python data abstraction series, and I'll admit that this was the code I envisioned when I first came up with the idea of the docbridge library. It's been super satisfying to get to this point, and because I've been developing the whole thing with test-driven development practices, there's already good code coverage.
If you're looking for more information on aggregation pipelines, you should have a look at Practical MongoDB Aggregations — or now, you can buy an expanded version of the book in paperback.
If you're interested in the abstraction topics and Python code architecture in general, you can buy the Architecture Patterns with Python book, or read it online at CosmicPython.com.
I livestream most weeks, usually at 2 p.m. UTC on Wednesdays. If that sounds interesting, check out the MongoDB YouTube channel. I look forward to seeing you there!