Utilizing Collection Globbing and Provenance in Data Federation

Benjamin Flast • 4 min read • Published Jan 30, 2023 • Updated Jun 28, 2023
Atlas • Data Federation
A common pattern for MongoDB users running multi-tenant services is to split customers into separate databases. This is an excellent strategy for keeping each customer's data separate from the others, and it helps with scaling in the future. The downside is that you can end up struggling to get a holistic view of your data across all of your customers. There are many ways to mitigate this challenge, and one of the primary ones is to copy and transform your data into another storage solution. However, this leads to some unfortunate compromises. You are now paying to store your data twice. You need to manage the copy and transformation process, which can become onerous as you add more customers. And lastly, and perhaps most importantly, you are now looking at a delayed state of your data.
To solve these exact challenges, we’re thrilled to announce two features that will completely transform how you use your cluster data and the ease with which you can remodel it. The first feature is called Provenance. This functionality allows you to tell Data Federation to inject fields into your documents during query time that indicate where they are coming from. For example, you can add the source collection on the Atlas cluster when federating across clusters or you can add the path from your AWS S3 bucket where the data is being read. The great thing is that you can now also query on these fields to only get data from the source of your choice!
The other feature we’re adding is a bit nuanced, and we are calling it “globbing.” For those of you familiar with Atlas Data Federation, you probably know about our “wildcard collections.” This functionality allows you to generate collection names based on the collections that exist in your underlying Atlas clusters or based on sections of paths to your files in S3. This is a handy feature to avoid having to explicitly define everything in your storage configuration. “Globbing” is somewhat similar, except that instead of dynamically generating new collections for each collection in your cluster, it will dynamically merge collections to give you a “global” view of your data automatically. To help illustrate this, I’m going to walk you through an example.
Imagine you are running a successful travel agency on top of MongoDB. For various reasons, you have chosen to store your customers' data in different databases based on their location. (Maybe you are going to shard based on this and will have different databases in different regions for compliance purposes.)
This has worked well, but now you’d like to query your data based on this information and get a holistic view of your data across geographies in real time (without impacting your operational workloads). So let’s discuss how to solve this challenge!

Prerequisites

In order to follow along with this tutorial yourself, you will need the following:
  1. Experience with Atlas Data Federation.
  2. An Atlas cluster with the sample data in it.
Here is how the data is modeled in my cluster (data in your cluster can be spread out among collections however your application requires):
  • Cluster: MongoTravelServices
    • Database: ireland
      • Collection: user_feedback (8658 Documents)
      • Collection: passengers
      • Collection: flights
    • Database: israel
      • Collection: user_feedback (8658 Documents)
      • Collection: passengers
      • Collection: flights
    • Database: usa
      • Collection: user_feedback (8660 Documents)
      • Collection: passengers
      • Collection: flights
[Screenshot: collections inside the Atlas cluster that need to be remodeled]
The goal here is to consolidate this data into one database, and then have each of the collections for user feedback, passengers, and flights represent the data stored in the collections from each database on the cluster. Lastly, we also want to be able to query on the “database” name as if it were part of our documents.
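To make that goal concrete, here is a minimal sketch in plain JavaScript of what globbing plus provenance amounts to conceptually. It is a mental model only, not how Data Federation is implemented: the `cluster` object, the `globCollection` helper, and the tiny documents are all invented for illustration.

```javascript
// Conceptual model only: plain in-memory arrays stand in for the cluster's
// databases, and the documents here are invented for illustration.
const cluster = {
  ireland: { user_feedback: [{ id: 1 }, { id: 2 }] },
  israel: { user_feedback: [{ id: 3 }] },
  usa: { user_feedback: [{ id: 4 }, { id: 5 }] },
};

// Merge every collection called `collectionName` across all databases and
// tag each document with the database and collection it came from.
function globCollection(source, collectionName, provenanceField) {
  const merged = [];
  for (const [databaseName, collections] of Object.entries(source)) {
    for (const doc of collections[collectionName] ?? []) {
      merged.push({
        ...doc,
        [provenanceField]: { databaseName, collectionName },
      });
    }
  }
  return merged;
}

const globalFeedback = globCollection(cluster, "user_feedback", "_provenance_data");

// Filtering on the injected field mirrors querying on provenance.
const usaOnly = globalFeedback.filter(
  (doc) => doc._provenance_data.databaseName === "usa"
);

console.log(globalFeedback.length, usaOnly.length);
```

The point of the sketch is the shape of the result: one merged collection, with each document carrying a nested field that says where it came from.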

Create a Federated Database instance

  • The first thing you’ll need to do is navigate to the “Data Federation” tab on the left-hand side of your Atlas dashboard and then click “set up manually” in the “create new federated database” dropdown in the top right corner of the UI.
[Screenshot: initial state of the Data Federation section of Atlas]
[Screenshot: visual storage configuration editor for a federated database instance]
  • Then, for this example, we’re going to manually edit the storage configuration as these capabilities are not yet available in the UI editor.
[Screenshot: JSON storage configuration editor for a federated database instance]
{
  "databases": [
    {
      "name": "GlobalVirtualDB",
      "collections": [
        {
          "name": "user_feedback",
          "dataSources": [
            {
              "collection": "user_feedback",
              "databaseRegex": ".*", // This syntax triggers the globbing or combination of each collection named user_feedback in each database of the MongoTravelServices cluster.
              "provenanceFieldName": "_provenance_data", // The name of the field where provenance data will be added.
              "storeName": "MongoTravelServices"
            }
          ]
        }
      ],
      "views": []
    }
  ],
  "stores": [
    {
      "clusterName": "MongoTravelServices",
      "name": "MongoTravelServices",
      "projectId": "5d9b6aba014b768e8241d442",
      "provider": "atlas",
      "readPreference": {
        "mode": "secondary",
        "tagSets": []
      }
    }
  ]
}
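The configuration above only defines user_feedback, while the stated goal also covers passengers and flights. One way to avoid writing the same block three times is to generate the "collections" array. The sketch below is a hypothetical helper (`globbedCollections` is my own name, not part of any MongoDB tooling), and it assumes the same data source fields shown above apply unchanged to the other two collections.

```javascript
// Hypothetical helper: builds the "collections" array of the storage
// configuration, one globbed entry per collection name. The field names
// (collection, databaseRegex, provenanceFieldName, storeName) are copied
// from the configuration shown above.
function globbedCollections(storeName, collectionNames) {
  return collectionNames.map((name) => ({
    name,
    dataSources: [
      {
        collection: name,
        databaseRegex: ".*", // match this collection in every database
        provenanceFieldName: "_provenance_data",
        storeName,
      },
    ],
  }));
}

const collections = globbedCollections("MongoTravelServices", [
  "user_feedback",
  "passengers",
  "flights",
]);

// Paste the printed array into "databases[0].collections" in the JSON editor.
console.log(JSON.stringify(collections, null, 2));
```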
Now when you connect, you will see:
AtlasDataFederation GlobalVirtualDB> show dbs
GlobalVirtualDB 0 B
AtlasDataFederation GlobalVirtualDB> use GlobalVirtualDB
already on db GlobalVirtualDB
AtlasDataFederation GlobalVirtualDB> show tables
user_feedback
AtlasDataFederation GlobalVirtualDB>
And a simple count results in the count of all three collections globbed together:
AtlasDataFederation GlobalVirtualDB> db.user_feedback.countDocuments()
25976
AtlasDataFederation GlobalVirtualDB>
25976 is the sum of 8660 feedback documents from the USA, 8658 from Israel, and 8658 from Ireland.
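To see that breakdown from the federated side rather than adding it up by hand, a grouping on the provenance field should work. This is a sketch I have not run against a federated database instance; it assumes the provenance field can be referenced in a `$group` stage the same way it can be used in a query filter.

```javascript
// Aggregation pipeline that counts documents per source database, using the
// provenance field name configured in the storage configuration above.
const countsByDatabase = [
  { $group: { _id: "$_provenance_data.databaseName", count: { $sum: 1 } } },
  { $sort: { _id: 1 } },
];

// In mongosh, connected to the federated database instance:
//   db.user_feedback.aggregate(countsByDatabase)
console.log(JSON.stringify(countsByDatabase));
```

If the per-database counts listed in the prerequisites still hold, the three groups should add up to the 25976 reported by the count.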
And lastly, I can query on the provenance metadata using the field “_provenance_data.databaseName”:
AtlasDataFederation GlobalVirtualDB> db.user_feedback.findOne({"_provenance_data.databaseName": "usa"})
{
  _id: ObjectId("63a471e1bb988608b5740f65"),
  'id': 21037,
  'Gender': 'Female',
  'Customer Type': 'Loyal Customer',
  'Age': 44,
  'Type of Travel': 'Business travel',
  'Class': 'Business',

  'Cleanliness': 1,
  'Departure Delay in Minutes': 50,
  'Arrival Delay in Minutes': 55,
  'satisfaction': 'satisfied',
  '_provenance_data': {
    'provider': 'atlas',
    'clusterName': 'MongoTravelServices',
    'databaseName': 'usa',
    'collectionName': 'user_feedback'
  }
}
AtlasDataFederation GlobalVirtualDB>
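The same idea extends to more than one source at a time. The sketch below builds a filter with the standard `$in` operator; `fromDatabases` is a hypothetical helper name of my own, and the default field name assumes the `_provenance_data` setting from the storage configuration above.

```javascript
// Hypothetical helper: builds a query filter that restricts results to
// documents whose provenance says they came from one of the given databases.
function fromDatabases(databaseNames, provenanceField = "_provenance_data") {
  return { [`${provenanceField}.databaseName`]: { $in: databaseNames } };
}

const filter = fromDatabases(["ireland", "usa"]);

// In mongosh, connected to the federated database instance:
//   db.user_feedback.countDocuments(filter)
console.log(JSON.stringify(filter));
```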

In review

So, what have we done and what have we learned?
  1. We saw how quickly and easily you can create a Federated Database in MongoDB Atlas.
  2. We learned how to combine and reshape data from your underlying Atlas clusters inside Atlas Data Federation with collection globbing. You can now query a single user_feedback collection and have it read from the user_feedback collections in every database.
  3. We saw how to add provenance data to our documents and query it.

A couple of things to remember about Atlas Data Federation

  1. Collection globbing is a new feature that applies to Atlas cluster sources and allows dynamic manipulation of source collections similar to “wildcard collections.”
  2. Provenance allows you to include additional metadata with your documents. You can indicate that data federation should include additional attributes such as source cluster, database, collection, the source path in S3, and more.
  3. Currently, both features can only be configured in the Data Federation JSON editor or by setting the storage configuration in the shell, not in the visual storage configuration editor.
  4. This is particularly powerful for multi-tenant implementations done in MongoDB.
To learn more about Atlas Data Federation and whether it would be the right solution for you, check out our documentation and tutorials or get started today.
