Group Data with the Outlier Pattern

On this page

Before You Begin

Pros
Cons
About this Task
Steps
Identify a threshold for outliers
Decide how to handle outliers
Add an indicator for outlier documents
Store additional sales in a separate collection
Results
Updates for Outliers
Learn More

If your collection stores documents of generally the same size and shape, a drastically different document (an outlier) can cause performance issues for common queries.

Consider a collection that stores an array field. If a document contains many more array elements than other documents in the collection, you may need to handle that document differently in your schema.

Use the outlier pattern to isolate documents that don't match the expected shape from the rest of your collection. Your schema still maintains all of the same data, but common queries are not affected by a single large document.

Before You Begin

Before you modify your schema to handle outliers, consider the pros and cons of the outlier pattern:

Pros

The outlier pattern improves performance for commonly-run queries. Queries that return typical documents do not need to also return large outlier documents.

The outlier pattern also handles edge cases in the application. For example, if your application typically displays 50 results from an array, there won't be a document that contains 2,000 results that disrupts the user experience.

Cons

The outlier pattern requires more complex logic to handle updates. If you frequently need to update your data, you may want to consider other schema design patterns. For more information, see Updates for Outliers.

About this Task

Consider a schema that tracks book sales. Typical documents in the collection look like this:

db.sales.insertOne(
   {
      "_id": 1,
      "title": "Invisible Cities",
      "year": 1972,
      "author": "Italo Calvino",
      "customers_purchased": [ "user00", "user01", "user02" ]
   }
)

The customers_purchased array is unbounded, meaning that as more customers purchase a book, the array grows larger. For most documents, this is not a problem because the store does not expect more than a few sales for a particular book.

Suppose that a new, popular book results in a large number of purchases. The current schema design results in a bloated document, which negatively impacts performance. To address this issue, implement the outlier pattern for documents that don't have a typical amount of sales.

Steps

Identify a threshold for outliers

Given your schema's typical document structure, identify when a document becomes an outlier. The threshold may be based on what the UI for your application demands, or what queries you run on your documents.

In this example, a book with more than 50 sales is an outlier.

Decide how to handle outliers

When addressing large arrays, a common way to handle outliers is to store values beyond the threshold in a separate collection. For books that have more than 50 sales, store the extra customers_purchased values in a separate collection.

Add an indicator for outlier documents

For books that have more than 50 sales, add a new document field called has_extras and set the value to true. This field indicates that there are more sales stored in a separate collection.

db.sales.insertOne(
   {
      "_id": 2,
      "title": "The Wooden Amulet",
      "year": 2023,
      "author": "Lesley Moreno",
      "customers_purchased": [ "user00", "user01", "user02", ... "user49" ],
      "has_extras": true
   }
)

Store additional sales in a separate collection

Create a collection called extra_sales to store sales beyond the initial 50. Link documents from the extra_sales collection to the sales collection with a reference:

db.extra_sales.insertOne(
   {
      "book_id": 2,
      "customers_purchased_extra": [ "user50", "user51", "user52", ... "user999" ]
   }
)

Results

The outlier pattern prevents atypical documents from impacting query performance. The resulting schema avoids large documents in the collection while maintaining a full list of sales.

Consider an application page that shows information about a book and all users who bought that book. After implementing the outlier pattern, the page displays information for most books (typical documents) quickly.

For popular books (outliers), the application performs an extra query in the extra_sales collection on book_id. To improve performance for this query, you can create an index on the book_id field.

Updates for Outliers

You need to handle updates for outlier documents differently than typical documents. The logic you use to perform updates depends on your schema design.

To perform updates for outliers for the preceding schema, implement the following application logic:

Check if the document being updated has has_extras set to true.
- If has_extras is missing or false, add the new purchases to the sales collection.
  - If the resulting customers_purchased array contains more than 50 elements, set has_extras to true.
- If has_extras is true, add the new purchases to the sales_extras collection for the corresponding book_id.

Learn More

Back

Bucket Pattern

Versioning