Handle Duplicate Data
When you embed related data in a single document, you may duplicate data between two collections. Duplicating data lets your application query related information about multiple entities in a single query while logically separating entities in your model.
About this Task
One concern with duplicating data is increased storage costs. However, the benefits of optimizing access patterns generally outweigh potential cost increases from storage.
Before you duplicate data, consider the following factors:
How often the duplicated data needs to be updated. Frequently updating duplicated data can cause heavy workloads and performance issues. However, the extra logic needed to handle infrequent updates is less costly than performing joins (lookups) on read operations.
The performance benefit for reads when data is duplicated. Duplicating data can remove the need to perform joins across multiple collections, which can improve application performance.
Example: Duplicate Data in an E-Commerce Schema
The following example shows how to duplicate data in an e-commerce application schema to improve data access and performance.
Steps
Populate the database
Create the following collections in the eCommerce database:
Collection Name | Description
---|---
customers | Stores customer information such as name, email, and phone number.
products | Stores product information such as price, size, and material.
orders | Stores order information such as date and total price. Documents in the orders collection embed the corresponding products for that order in the lineItems field.
The following properties from the products collection are duplicated in the orders collection:
productId
product
price
size
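As a minimal sketch of this duplication, the following Python models the documents as plain dictionaries rather than calling a database driver. The four duplicated field names come from the list above; the sample values, the extra material and manufacturer fields, the orderId, customerId, and totalPrice field names, and the helper to_line_item are assumptions made for illustration.

```python
# Fields copied from a product document into an entry of an order's lineItems array.
DUPLICATED_FIELDS = ("productId", "product", "price", "size")


def to_line_item(product):
    """Return only the product fields that an order needs to display."""
    return {field: product[field] for field in DUPLICATED_FIELDS}


# Hypothetical product document; all values are made up.
product = {
    "productId": 456,
    "product": "sweater",
    "price": 30,
    "size": "L",
    "material": "silk",
    "manufacturer": "Cool Clothes Co",
}

# Hypothetical order document that embeds the duplicated subset.
order = {
    "orderId": 789,
    "customerId": 123,
    "totalPrice": 30,
    "lineItems": [to_line_item(product)],
}

print(order["lineItems"])
```

Fields such as material stay in the product document only, so the order carries just what it needs to render its line items.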
Benefits of Duplicating Data
When the application displays order information, it also displays the order's line items. If the order and product information were stored in separate collections, the application would need to perform a $lookup to join data from the two collections. $lookup operations are often expensive and can degrade read performance.
Product information is duplicated in the orders collection, rather than stored only in embedded line items, because the application needs only a subset of product information when displaying orders. By embedding only the required fields, the application can store additional product details in the products collection without adding unnecessary bloat to the orders collection.
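To illustrate the difference in read paths, the sketch below uses plain Python dictionaries in place of collections. The helpers order_summary_embedded and order_summary_joined are hypothetical, the sample values are made up, and the in-memory join only stands in for a $lookup; it does not model the actual cost of one.

```python
# Hypothetical in-memory stand-in for the products collection, keyed by productId.
products_by_id = {
    456: {"productId": 456, "product": "sweater", "price": 30, "size": "L",
          "material": "silk"},
    809: {"productId": 809, "product": "t-shirt", "price": 10, "size": "M",
          "material": "cotton"},
}

# Order that duplicates the product fields it needs to display.
order_embedded = {
    "orderId": 789,
    "lineItems": [
        {"productId": 456, "product": "sweater", "price": 30, "size": "L"},
        {"productId": 809, "product": "t-shirt", "price": 10, "size": "M"},
    ],
}

# Order that stores only references, so product data must be joined at read time.
order_referenced = {
    "orderId": 789,
    "lineItems": [{"productId": 456}, {"productId": 809}],
}


def order_summary_embedded(order):
    """Build display rows from the order document alone."""
    return [(item["product"], item["size"], item["price"])
            for item in order["lineItems"]]


def order_summary_joined(order, products):
    """Build the same rows, but with an extra product read per line item."""
    rows = []
    for item in order["lineItems"]:
        product = products[item["productId"]]  # second data source
        rows.append((product["product"], product["size"], product["price"]))
    return rows


print(order_summary_embedded(order_embedded))
print(order_summary_joined(order_referenced, products_by_id))
```

Both helpers produce the same rows; the embedded version reaches them without touching the product data at all.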
Example: Duplicate Data for Product Reviews
The following example uses the subset pattern to optimize access patterns for an online store.
Consider an application where, when a user views a product, the application displays the product's information and its five most recent reviews. The reviews are stored in both a products collection and a reviews collection.
When a new review is written, the following writes occur:
The review is inserted into the reviews collection.
The array of recent reviews in the products collection is updated with $pop and $push.
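The two writes above can be sketched without a database connection. In this Python simulation, a list and a dictionary stand in for the collections; add_review, MAX_RECENT, and every field other than productId and recentReviews are assumptions. The comments show update documents that a driver call might send, where a $pop value of -1 removes the first (here, the oldest) array element.

```python
MAX_RECENT = 5  # the product page shows the five most recent reviews

reviews = []  # stands in for the reviews collection
products = {  # stands in for the products collection, keyed by productId
    1: {"productId": 1, "name": "Laptop", "recentReviews": []},
}


def add_review(review):
    """Apply both writes for a new review."""
    # Write 1: insert the full review into the reviews collection.
    reviews.append(review)

    # Write 2: append to the embedded array,
    # comparable to {"$push": {"recentReviews": review}}.
    recent = products[review["productId"]]["recentReviews"]
    recent.append(dict(review))

    # Drop the oldest entry once the array exceeds the limit,
    # comparable to {"$pop": {"recentReviews": -1}}.
    if len(recent) > MAX_RECENT:
        recent.pop(0)


for n in range(1, 8):
    add_review({"reviewId": n, "productId": 1, "text": f"Review {n}"})

print(len(reviews))
print([r["reviewId"] for r in products[1]["recentReviews"]])
```

Copying the review with dict(review) keeps the two stores independent, which mirrors how the duplicated data lives in separate documents.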
Steps
Populate the database
Create the following collections in the productsAndReviews database:
Collection Name | Description
---|---
products | Stores product information. Documents in the products collection embed the five most recent product reviews in the recentReviews field.
reviews | Stores all reviews for products (not only recent reviews). Documents in the reviews collection contain a productId field that indicates the product that the review pertains to.
Benefits of Duplicating Data
The application only needs to make one call to the database to return all the information it needs to display. If the data were stored entirely in separate collections, the application would need to join data from the products and reviews collections, which could cause performance issues.
Reviews are rarely updated, so storing duplicate data is not expensive, and keeping the data consistent between collections is not a challenge.
Learn More
To learn how to keep duplicate data consistent, see Data Consistency.