Data modelling patterns in MongoDB
Although MongoDB is often described as a schema-less database, in which schemas can be modified without issuing `ALTER TABLE` statements as you would in a SQL database, this is seldom true in production settings: the performance characteristics of a MongoDB application are largely determined by the document model the application implements.
The amount of physical work MongoDB needs to do to read or write data depends primarily on how that data is distributed across documents. Document size also determines how many documents MongoDB can cache in memory. Choosing the right data model is therefore often described as a critical early task in the design of any data-intensive application.
This article describes a few of the commonly used patterns (or rather recipes) for schema design that are used across the industry to produce solutions that scale and perform under stress.
Linking vs Embedding
All the patterns discussed here follow, more or less, a combination of these two approaches (illustrated briefly below):
- Embedding fields in a single document.
- Linking to data in other collections using reference keys.
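As a quick, hedged illustration (the movie and review collections and their fields are invented just for this example), the same one-to-many relationship could be modelled either way:
Embedding
{
  "title": "Inception",
  "reviews": [
    { "author": "alice", "rating": 5 },
    ...
  ]
}
Linking
// movies collection
{ "_id": 1, "title": "Inception" }
// reviews collection, each document referencing a movie by its key
{ "movie_id": 1, "author": "alice", "rating": 5 }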
We’ll start by looking at a pattern that’s primarily designed with embedding in mind.
Attribute pattern
If we have documents that include a large number of attributes of the same data type, and if we know that we are going to be performing queries using many of these attributes, then we can reduce the number of indexes we need by adopting the attribute pattern.
Consider the scenario where you maintain a website with information about movie screenings across a large number of movie theaters, and you want to let your audience search for the movies playing at a given theater in a given screening-time interval. We could expect a schema like the one below:
{
  "theater": "Sydney Event Cinema",
  "screening_times": {
    "8_10": {
      "movies": [...],
      ...
    },
    "10_12": {...},
    "12_2": {...},
    ...
  }
},
{
  "theater": "IMAX Theatre Sydney",
  "screening_times": {
    ...
  }
}
For each theater, we could have a number of `screening_times`, and we could expect a wide variety of queries against each of those time intervals for any given theater. Naturally, to reduce query latency, we would like to index these time intervals. However, this would require a large number of indexes, such as:
- theater.screening_times.8_10
- theater.screening_times.10_12
- theater.screening_times.12_2
...
These extra indexes could degrade write performance, since many of them need to be updated on every write. We would also have to keep track of screening times that are newly added or removed (perhaps due to a government regulation), so that the corresponding indexes can be created or dropped.
For the former problem, since MongoDB 4.2 there is an index type called wildcard indexes, which lets us index all attributes under a given field; the screening times would be indexed as theater.screening_times.$**. However, this feature should be used with discretion, as it can impose significant overhead on write performance and should only be adopted after careful testing.
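As a rough mongosh sketch (the field path follows the index names used above; adjust it to your actual document structure), such a wildcard index could be created like this:
// A single wildcard index covers every sub-field under screening_times.
db.theaters.createIndex({ "theater.screening_times.$**": 1 })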
Let’s look at how we could re-arrange this rigid document schema to fit the more flexible attribute pattern.
"theater": {
"screening_times": [
{"key": "8_10", "value": { "movies": [], ... }},
{"key": "10_12", "value": { "movies": [], ... }},
{"key": "12_2", "value": { "movies": [], ... }}
...
]
}
Now, in order to query the list of movies playing at a given theater in a given time interval, we could issue the query:
db.theaters.find({ "theater.screening_times.key": "8_10" })
We would only need to create a theater.screening_times.key index for this particular query. Also notice that we can just as easily create an index on the value if we need to query in the reverse direction, such as searching for the time intervals in which a given movie is playing. Evidently, this reduces the number of indexes significantly.
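For reference, a minimal mongosh sketch of the indexes described above (paths follow the example document):
// One index on the attribute keys serves queries for any screening-time interval.
db.theaters.createIndex({ "theater.screening_times.key": 1 })
// A compound index on key and value supports filtering on both sides of the attribute pair.
db.theaters.createIndex({ "theater.screening_times.key": 1, "theater.screening_times.value": 1 })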
Another use case of the attribute pattern is introducing polymorphism to your collection. When all documents in a collection share similar attributes, with the exception of a small number of fields that vary from document to document, those distinctive fields can be arranged into an array of attributes. This makes all the documents in the collection follow a homogeneous structure, which simplifies your application code.
Extended Reference Pattern
When a large data set is maintained across a number of collections, we often find ourselves using lookups to fetch only a portion of data from other collections. When application traffic is high, these lookup queries tend to be a major catalyst for the dreaded CPU peaks you observe on your MongoDB monitoring dashboard.
Let's take an example from a PFM (personal finance management) app. You might display a user's list of transactions on the home screen, showing details like the transaction description, amount, merchant name, logo, etc. Users can click into a transaction to see more merchant information, such as the merchant's address and contact details. Say you have two collections, transactions and merchants, as below:
transactions
{
  "id": string,
  "merchant_id": string,
  "amount": number,
  "date": date,
  ...
}
merchants
{
  "id": string,
  "name": string,
  "address": string,
  "logo": string,
  "contact_details": Array
}
In this scenario, a query containing a lookup operation will run every time a user loads the app, and on a busy day like payday, you would probably notice the congestion these queries produce.
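For context, the home-screen query before applying the pattern might look roughly like this in mongosh (the user_id filter is a hypothetical field added for realism; everything else follows the schemas above):
db.transactions.aggregate([
  { $match: { user_id: "some-user" } },
  { $lookup: {
      from: "merchants",
      localField: "merchant_id",
      foreignField: "id",
      as: "merchant"
  } },
  { $project: { amount: 1, date: 1, "merchant.name": 1, "merchant.logo": 1 } }
])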
This is where the extended reference pattern comes into play. It refers to the simple idea that commonly accessed fields from foreign collections can be duplicated and stored in the collection being queried. In our PFM app, merchant details like the name and logo can be embedded inside a transaction as below:
{
  "id": string,
  "merchant_id": string,
  "amount": number,
  "date": date,
  "merchant_details": {
    "name": string,
    "logo": string
  },
  ...
}
Given that it eliminates a significant portion of the index lookups required for a listTransactions query, this is certain to improve query latency. However, we should be diligent and consider the points below before diving into this pattern:
- What fields are you planning to embed in your document? Pick fields that don't change frequently, and only include the fields you actually need in order to avoid the join.
- Additional storage will be required per document to store more data from other collections. This shouldn’t be a huge concern if the size of your document isn’t alarmingly high.
- How you manage updates will be a product-level discussion. In our example, if merchant details change, should we update our transactions at that exact moment, or are we okay with updating them asynchronously? Everywhere in the application code where merchant details are updated, we now need to remember to update the affected transactions as well (see the sketch after this list).
- Are updates even necessary? Consider an eCommerce application where each order in an orders collection includes the customer's shipping information, captured at the time the order was placed. If the customer's address later changes, rewriting it in existing orders would be improper from a business standpoint.
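To make the update question concrete, here is a rough mongosh sketch of the extra write the application now owns whenever merchant details change (merchantId, newName and newLogo are placeholders; collection and field names follow the schemas above):
// Update the source of truth first.
db.merchants.updateOne({ id: merchantId }, { $set: { name: newName, logo: newLogo } })
// Then fan the change out to the duplicated copies, either immediately or asynchronously.
db.transactions.updateMany(
  { merchant_id: merchantId },
  { $set: { "merchant_details.name": newName, "merchant_details.logo": newLogo } }
)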
Subset Pattern
When you issue a query to fetch a single document with only a few projected fields, you might think that it lessens the resources the database needs to provide you that data. You might be surprised to hear that the number of fields returned has no impact at all on disk and memory usage, although it does reduce network bandwidth thanks to the smaller number of bytes transferred. Unless it's a covered query, MongoDB has to read the whole document into memory.
The set of documents that MongoDB holds in memory at a given time is known as the working set. If the working set exceeds the available cache, MongoDB has to repeatedly evict documents and reread them from disk, which can be highly detrimental to the performance of your application.
In order to mitigate this issue, you can either
- Vertically scale by increasing the memory of the MongoDB nodes.
- Horizontally scale by sharding.
- Reduce the size of the documents to minimize the working set. This is where the subset pattern comes into play.
If you feel that your documents are too large, you can use the subset pattern to your advantage. Imagine a video streaming application like YouTube, where each video has its comments embedded in the video document itself as an array. Some videos will accumulate an enormous number of comments, which can quickly get out of hand. A solution is to keep only a subset of those comments in each document and move the rest to a different collection, as below:
videos collection
{
  "video_id": string,
  "video_title": string,
  "video_author": string,
  "comments_latest": [
    {
      "comment_id": string,
      "author": string,
      "date": date,
      "text": string
    },
    ...
  ]
}
extra_comments collection
{
  "comment_id": string,
  "video_id": string,
  "author": string,
  "text": string,
  "date": date
}
...
The date of a comment can be the deciding factor for which comments to keep in the video document and which to move to the extra_comments collection. The benefit is that we only have to fetch all the comments for a video when a user actually requests them.
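A rough sketch of the two reads this gives us (mongosh; someVideoId is a placeholder, names follow the schemas above):
// Watch page: a single document read already contains the latest comments.
db.videos.findOne({ video_id: someVideoId })
// Only when the user asks for older comments do we touch the overflow collection.
db.extra_comments.find({ video_id: someVideoId }).sort({ date: -1 })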
The subset pattern is a great approach to reducing the working set when a large part of each document is rarely needed.
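If you want to confirm that document size really is the problem before restructuring, newer MongoDB versions provide a $bsonSize aggregation operator; a quick sketch using the videos collection above:
// List the five largest video documents by their BSON size.
db.videos.aggregate([
  { $project: { video_id: 1, size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 5 }
])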
Schema Versioning Pattern
Change is the only constant in life, and your application is bound to change throughout its lifetime, whether due to changing market conditions or feature modifications. If you've worked with SQL-based relational databases, chances are you've seen how migrations are carried out in the form of `ALTER TABLE` scripts that modify the table structures of your database. If you are dealing with large tables with millions of rows, the migration process would either require halting the application or maintaining separate tables and writing complex application-level code to perform the migration asynchronously.
Even though MongoDB lets you modify your schemas without hassle, you would still have to deal with two separate versions of the schema in your application code. The schema versioning pattern is a great tool to facilitate this process.
Imagine a recruitment company. Fifteen or twenty years ago, you might have saved only an applicant's mobile number as their contact details. But with the advent of social media and the increased digital proficiency of applicants, they might now have an email, a LinkedIn profile, a Twitter handle, a portfolio website, and more. If you had saved all of these as separate fields in your document, you would soon find yourself writing a ton of if statements to check whether a field exists for a particular applicant. To mitigate this, you can opt for a structure like the one below, where all the contact details, including the mobile number, have been moved into a contact_details array (much like the attribute pattern discussed earlier).
{
  "_id": string,
  "schema_version": 2,
  "name": string,
  "address": string,
  "contact_details": [
    { "type": "mobile_number", "value": string },
    { "type": "website", "value": string },
    { "type": "email", "value": string },
    ...
  ],
  ...
}
Notice the new field in the document named schema_version; this is essentially what the schema versioning pattern entails. When you make a breaking change to your schema design, you increment the schema_version field in the new documents you write. It may seem like a simple hack, but it can dramatically reduce the overhead of migration, because the new application code can now sit side by side with the old application code in which the mobile number was a top-level field. The migration of old documents can then be done gradually in the background without taxing the database, or you can keep the old documents as they are and avoid introducing any breaking changes into your system. In essence, you'll feel more in control of the migration if you adopt this pattern.
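As a sketch of what the application-side branching could look like (collection and function names here are purely illustrative):
const applicant = db.applicants.findOne({ _id: applicantId });
if (applicant.schema_version >= 2) {
  // New shape: iterate the contact_details array directly.
  showContactDetails(applicant.contact_details);
} else {
  // Legacy shape: wrap the old top-level field into the new structure.
  showContactDetails([{ type: "mobile_number", value: applicant.mobile_number }]);
}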
Synopsis
Ultimately, your schema design should follow the constraints and access patterns of your own application. Below are some key points to consider in order to make an informed decision:
- Avoid joins: in contrast to a relational database, joins are expected to be an exception, not the rule for MongoDB. In general, we should try to ensure that our critical queries can find all the data they need within a single collection.
- Manage redundancy: we should be careful about duplicating data across several collections, which can increase our storage costs as well as the complexity of the code-base.
- Beware of the 16MB limit: MongoDB has a 16MB limit on the size of an individual document. We need to make sure we never embed so much information that we risk exceeding that limit. Thinking about the working set is a good way to avoid shooting yourself in the foot with overly large documents.
- Maintain consistency: MongoDB does support multi-document transactions, but they require special programming and come with significant constraints. If we want to atomically update a set of related information, it can be advantageous to keep those data elements in a single document. MongoDB also doesn't enforce referential integrity the way a SQL database does, where a row cannot be deleted while foreign keys still reference it. Always think about the consistency of your application state when updating data.
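For completeness, a minimal sketch of that "special programming" in mongosh, using the PFM example from earlier (database, collection and variable names are placeholders; transactions require a replica set or sharded cluster):
const session = db.getMongo().startSession();
session.startTransaction();
try {
  const merchants = session.getDatabase("pfm").merchants;
  const transactions = session.getDatabase("pfm").transactions;
  // Both writes commit together or not at all.
  merchants.updateOne({ id: merchantId }, { $set: { name: newName } });
  transactions.updateMany({ merchant_id: merchantId }, { $set: { "merchant_details.name": newName } });
  session.commitTransaction();
} catch (e) {
  session.abortTransaction();
  throw e;
} finally {
  session.endSession();
}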
PS: The patterns discussed in this article were taken from the MongoDB data modelling course, an official course offered by MongoDB University. Consider enrolling, as they provide high-quality free material on various MongoDB-related topics!