Data Modeling in NoSQL databases like MongoDB — to Embed or to Reference

Oct 17, 2021

Schemaless database

A document-oriented database can be schemaless, meaning the database itself does not care about the schema of the data it stores. But what does it really mean to be schemaless?

Schemaless !== schema free

You always have a schema, even if you don’t statically type it. In fact, there are various schemas:

The one you hold in your mental model
The one your app code and database actually implement to store your data structures
The one you should have implemented to fulfill the requirements
The one you will end up implementing later anyway, at a time when you might care about the code more than you do today

Data without a schema is useless. You get a document from MongoDB, what do you do with it? To read or change any field, you need to know the types and meanings of those fields. That’s a schema.
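One way to make that implicit schema explicit is a small check in application code. The sketch below is illustrative only: `carSchema` and `validate` are hypothetical names, and the two fields are borrowed from the car example used later in this article.

```javascript
// Hypothetical application-level schema: field name -> expected `typeof` result.
const carSchema = { carName: "string", carModel: "string" };

// Returns a list of problems; an empty list means the document matches.
function validate(doc, schema) {
  const errors = [];
  for (const [field, type] of Object.entries(schema)) {
    if (!(field in doc)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof doc[field] !== type) {
      errors.push(`${field} should be ${type}, got ${typeof doc[field]}`);
    }
  }
  return errors;
}

const good = { carName: "Xesla", carModel: "Xesla S" };
const bad = { carName: "Xesla", carModel: 42 };

console.log(validate(good, carSchema)); // no problems reported
console.log(validate(bad, carSchema));  // reports the carModel type mismatch
```

The database would store both documents without complaint; only the application knows that the second one is wrong.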

When people say that MongoDB “has no schema”, they really mean that it does not enforce a schema the way SQL databases do. MongoDB pushes schema concerns up to the application level, where you can handle them more flexibly. For example, to add a new field to your documents, you don’t need an all-or-nothing ALTER on your collection, or a migration (and possible rollback) across potentially millions of entries. With a schemaless database like MongoDB, most adjustments to the database are transparent and automatic: just add and save the new field and all is well. Rollbacks are unlikely to cause problems if the code is well written.
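As a rough sketch of how application code can absorb a new field without a bulk migration: older documents that lack the field are given a default when they are read. The `year` field, its `null` default, and the helper `withDefaults` are all made up for illustration.

```javascript
// Documents written before and after a hypothetical `year` field was introduced.
const storedCars = [
  { carName: "Xesla", carModel: "Xesla S" },             // old shape
  { carName: "Xesla", carModel: "Xesla X", year: 2021 }, // new shape
];

// Assumed defaults for fields that older documents may lack.
const defaults = { year: null };

// Fill in missing fields at read time instead of altering every stored document.
function withDefaults(doc) {
  return { ...defaults, ...doc };
}

const cars = storedCars.map(withDefaults);
console.log(cars.map((c) => c.year)); // null for the old document, 2021 for the new one
```

The stored documents are left untouched, which is also why rolling the code back does not require undoing anything in the database.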

Data modeling in NoSQL

The general idea behind NoSQL databases like MongoDB is that they mostly handle denormalized data. You don’t always need to create a relation between two documents to assemble an aggregated data structure, the way a relational (SQL) database uses a JOIN clause to combine rows from two or more tables through a common field. Ideally, there are no relationships between collections: if the same data is required in two or more documents, it is repeated in each of them. One great benefit is that a single read operation is enough to get all the data.

However, this doesn’t mean that NoSQL is never used when we need relations. Many NoSQL databases now offer a lot of the features of relational databases. So rather than there being a fixed use case for NoSQL or an RDBMS, the choice mostly depends on how your system is designed and how your data is modeled.

Embedding documents

Embedding is the ideal case for NoSQL databases: denormalized data.

Suppose you are building a car information website whose primary function is to give information on each car model and the dealerships that carry it. In this scenario:

Since each webpage gives information for one particular car model, each page needs to fetch all the details for that model.
Since the site only shows detailed information for individual models, we don’t need to query the dealerships on their own, so the requirements give us no reason to store them separately.

Hence, we can just model it like this:

{
  carName: "Xesla",
  carModel: "Xesla S",
  dealership: [
    { city: "Xpace", name: "ABC" },
    { city: "Xoon", name: "XYZ" }
  ]
}
{
  carName: "Xesla",
  carModel: "Xesla X",
  dealership: [
    { city: "Xpace", name: "ABC" },
    { city: "Xars", name: "MNO" }
  ]
}

Since we have no requirement to query by dealership, we don’t need to separate dealerships out and reference them from the car model. For each model, a single query returns all the car information, embedded dealership data included.
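To make the single-query point concrete, here is a small in-memory sketch. The array `cars` stands in for a MongoDB collection, and `findOne` is a hypothetical helper written for this example, not the driver method. Against a real database, and assuming a collection named `cars`, the equivalent shell call would be something like `db.cars.findOne({ carModel: "Xesla S" })`.

```javascript
// In-memory stand-in for a `cars` collection with embedded dealership data.
const cars = [
  {
    carName: "Xesla",
    carModel: "Xesla S",
    dealership: [
      { city: "Xpace", name: "ABC" },
      { city: "Xoon", name: "XYZ" },
    ],
  },
  {
    carName: "Xesla",
    carModel: "Xesla X",
    dealership: [
      { city: "Xpace", name: "ABC" },
      { city: "Xars", name: "MNO" },
    ],
  },
];

// Hypothetical helper: return the first document whose fields equal the filter's,
// or null when nothing matches.
function findOne(collection, filter) {
  const match = collection.find((doc) =>
    Object.entries(filter).every(([key, value]) => doc[key] === value)
  );
  return match ?? null;
}

// One lookup returns the car and its dealerships together.
const page = findOne(cars, { carModel: "Xesla S" });
console.log(page.dealership.map((d) => d.name)); // dealer names for this model
```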

Embedded data is completely dependent on the parent document. With this schema design, if you need to query the embedded data independently of its parent, you are stuck and may need to consider other approaches.
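The sketch below illustrates why embedded data is awkward to treat on its own. It uses plain in-memory objects in place of a collection, and `listDealerships` and `renameDealership` are hypothetical helpers: listing dealerships means scanning every car and removing duplicates, and renaming one means touching every copy.

```javascript
const cars = [
  { carModel: "Xesla S", dealership: [{ city: "Xpace", name: "ABC" }, { city: "Xoon", name: "XYZ" }] },
  { carModel: "Xesla X", dealership: [{ city: "Xpace", name: "ABC" }, { city: "Xars", name: "MNO" }] },
];

// Collect every embedded dealership once, keyed by city and name.
function listDealerships(cars) {
  const seen = new Map();
  for (const car of cars) {
    for (const d of car.dealership) {
      seen.set(`${d.city}/${d.name}`, d);
    }
  }
  return [...seen.values()];
}

// Rename a dealership in every car that embeds it; returns how many copies changed.
function renameDealership(cars, city, oldName, newName) {
  let updated = 0;
  for (const car of cars) {
    for (const d of car.dealership) {
      if (d.city === city && d.name === oldName) {
        d.name = newName;
        updated += 1;
      }
    }
  }
  return updated;
}

console.log(listDealerships(cars).length);                   // 3 distinct dealerships
console.log(renameDealership(cars, "Xpace", "ABC", "ABCD")); // 2 copies had to change
```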

Referencing

In this type of modeling, the parent document holds the reference IDs (ObjectId values) of the child documents.

Let us consider the same example as above, but now we also need to be able to search for cars by dealership. As mentioned above, querying embedded data independently is difficult, so in this case we need to create a relation.

car:
{
  _id: ObjectId("ab12"),
  carName: "Xesla",
  carModel: "Xesla S",
  dealership: [
    ObjectId("1234"), ObjectId("5678")
  ]
}
dealership:
{
  _id: ObjectId("1234"),
  city: "Xpace",
  name: "ABC"
}

Now, we can query:

Get all cars sold by the Xpace dealership, using its _id "1234"
Populate the dealership field of each car with the dealership’s information, so that the user sees dealership details and not just an unreadable _id
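A minimal in-memory sketch of those two steps follows. The arrays stand in for `cars` and `dealerships` collections, plain strings stand in for ObjectId values, and `carsByDealership` and `populate` are hypothetical helpers (the name `populate` echoes the wording above, not any specific library API). The second car and the details of dealership "5678" are invented for the example. MongoDB can also perform this kind of join on the server with the `$lookup` aggregation stage.

```javascript
// Stand-in for the dealership collection ("5678" details are assumed).
const dealerships = [
  { _id: "1234", city: "Xpace", name: "ABC" },
  { _id: "5678", city: "Xoon", name: "XYZ" },
];

// Stand-in for the car collection; each car stores only dealership ids.
const cars = [
  { _id: "ab12", carName: "Xesla", carModel: "Xesla S", dealership: ["1234", "5678"] },
  { _id: "cd34", carName: "Xesla", carModel: "Xesla X", dealership: ["1234"] }, // invented
];

// Step 1: find every car that references a given dealership id.
function carsByDealership(cars, dealershipId) {
  return cars.filter((car) => car.dealership.includes(dealershipId));
}

// Step 2: the extra lookup. Replace each referenced id with the dealership
// document, leaving the id in place if no matching document exists.
function populate(car, dealerships) {
  const byId = new Map(dealerships.map((d) => [d._id, d]));
  return { ...car, dealership: car.dealership.map((id) => byId.get(id) ?? id) };
}

const found = carsByDealership(cars, "1234").map((car) => populate(car, dealerships));
console.log(found.map((c) => c.carModel));           // both models reference "1234"
console.log(found[0].dealership.map((d) => d.city)); // cities instead of raw ids
```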

In this type of modeling, each dealership is a separate document, so you can run any dealership-related query directly on these documents without depending on the parent document.
Hence it is very easy to perform CRUD (Create, Read, Update, Delete) operations on each document independently.
However, one major drawback of this method is that you have to perform an extra query to get the dealership details and then do an application-level join with the car document to build the result set. This can lead to a drop in DB performance.

Conclusion

As you assess your options, take some time to think about the road ahead. If you already have a good sense of your data and requirements, you can choose and model the data schema accordingly. Everything is cost and benefit, so choose your tradeoffs and then your model.