A Quick Practical Overview of Cosmos DB

November 8, 2020

Before Cosmos DB, my world was the relational SQL world; that is to say, I have designed plenty of normalized databases for projects of varying sizes. However, for the latest project that I have been working on, my team and I decided to go with Cosmos DB, which is a NoSQL document database. Every NoSQL document database out there has its nuances, and Cosmos DB is no different.

It took me a lot of hands-on coding and documentation reading to figure out the features and nuances of Cosmos DB well enough to design the correct and scalable schema and data structure that our team now uses in our system.

In this blog post, I am going through the pain of documenting my findings so that others may go through this cheat sheet and save a lot of time and energy.

Selection of a partition key

Perhaps the most important aspect in the design of a Cosmos DB Schema is the partition key itself. This partition key will “break” your collection into logical partitions.

Where do I begin with this one? There is a lot of documentation about choosing the partition key.

I will summarize and encapsulate some of that below:

It should have a wide range of values, so that documents spread across many logical partitions.
It should be a value that you commonly supply when accessing documents in the collection.

Example

Let us get into some examples because that’s the best way to understand this.

I have a collection that tracks the orders (purchases) of users. Every access or query pattern into our collection will, for sure, have a UserId associated with it.

We create a new document every time a User places a new Order. We used UserId as our partition key.
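To make the idea of logical partitions concrete, here is a minimal in-memory sketch. It does not talk to Cosmos DB at all; the field names (userId, total) and the helper group_by_partition_key are made up for illustration. It only shows that documents sharing the same partition key value end up together.

```python
from collections import defaultdict

# Hypothetical order documents; the field names are illustrative only.
orders = [
    {"id": "order-1", "userId": "user-a", "total": 25.0},
    {"id": "order-2", "userId": "user-b", "total": 10.0},
    {"id": "order-3", "userId": "user-a", "total": 42.5},
]


def group_by_partition_key(documents, key_field):
    """Group documents by partition key value, one bucket per logical partition."""
    partitions = defaultdict(list)
    for doc in documents:
        partitions[doc[key_field]].append(doc)
    return dict(partitions)


partitions = group_by_partition_key(orders, "userId")

# A query that supplies the UserId only needs to look at one bucket.
orders_for_user_a = [doc["id"] for doc in partitions["user-a"]]
```

Because every query in our system carries a UserId, each lookup can be answered from a single bucket, which is the property we were after when picking the key.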

Things to look out for

The partition key is selected when the collection is created. It cannot be changed later without recreating the collection and migrating the data.
Regardless of what the documentation tells you, you will be best off choosing a String value as the partition key.
Each logical partition has a max size of 20 GB.

References

https://docs.microsoft.com/en-us/azure/cosmos-db/partition-data
https://github.com/Azure/azure-sdk-for-java/issues/14485
https://www.youtube.com/watch?v=9v6WbCOzPiM (Azure Cosmos DB Partition Key Advisor – In case you have your data setup already!)

Using the “id” field

Every document in a Cosmos DB Collection needs to have this “id” field. This id is unique for every document within the logical partition.

Things to look out for

id is a String field.

Contrary to popular belief, this field is NOT the unique identifier for any document in the collection!

The unique identifier for a document is a combination of both the id field and partition key value.

Multiple documents can have the same id as long as they are in different partitions!
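The rule above can be sketched with a toy container that keys documents by the pair (partition key value, id). This is only an in-memory model under my own assumptions, not the SDK; InMemoryContainer and the sample field names are hypothetical.

```python
class InMemoryContainer:
    """Toy model: a document is identified by (partition key value, id)."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.documents = {}

    def create(self, doc):
        key = (doc[self.partition_key_field], doc["id"])
        if key in self.documents:
            raise ValueError(f"conflict: {key} already exists")
        self.documents[key] = dict(doc)


container = InMemoryContainer("userId")
container.create({"id": "order-1", "userId": "user-a"})
# Same id, different partition key value: accepted.
container.create({"id": "order-1", "userId": "user-b"})

try:
    # Same id AND same partition key value: rejected.
    container.create({"id": "order-1", "userId": "user-a"})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```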

Unique Key Policy

A unique key constraint is scoped to a logical partition: the values must be unique within each logical partition, not across the whole collection. Cosmos DB provides the flexibility to implement fairly complex unique key policies.

Things to look out for

RU (Request Unit) charges for Create/Update operations are higher when a unique key policy is set up.
Just like the partition key, the unique key policy cannot be changed once it is set up.
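Here is a small sketch of what "unique within a logical partition" means in practice. It is an in-memory model only; the class UniqueKeyContainer and the invoiceNumber field are assumptions I made up for the example, and a real policy is declared on the collection rather than coded by hand.

```python
class UniqueKeyContainer:
    """Toy model of a unique key constraint scoped to each logical partition."""

    def __init__(self, partition_key_field, unique_field):
        self.partition_key_field = partition_key_field
        self.unique_field = unique_field
        self.documents = []
        self._taken = set()  # (partition key value, unique field value) pairs

    def create(self, doc):
        scoped_value = (doc[self.partition_key_field], doc[self.unique_field])
        if scoped_value in self._taken:
            raise ValueError(f"unique key violation: {scoped_value}")
        self._taken.add(scoped_value)
        self.documents.append(dict(doc))


container = UniqueKeyContainer("userId", "invoiceNumber")
container.create({"id": "1", "userId": "user-a", "invoiceNumber": "INV-1"})
# Same invoiceNumber in another logical partition: allowed.
container.create({"id": "2", "userId": "user-b", "invoiceNumber": "INV-1"})

try:
    # Same invoiceNumber in the same logical partition: rejected.
    container.create({"id": "3", "userId": "user-a", "invoiceNumber": "INV-1"})
    violation_raised = False
except ValueError:
    violation_raised = True
```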

Reference

https://docs.microsoft.com/en-us/azure/cosmos-db/unique-keys

Some Max Numbers in Cosmos DB

Max size of a document: 2 MB
Max length of a partition key value: 2048 bytes
Max length of an id value: 1023 bytes
Max storage across all documents in a logical partition: 20 GB
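If you want to catch oversized values before they reach the database, a pre-flight check along these lines can help. This is a rough sketch: check_limits is a hypothetical helper, and measuring the UTF-8 length of the JSON text is my approximation of how the service counts bytes, not a documented guarantee.

```python
import json

# Limits taken from the list above.
MAX_DOCUMENT_BYTES = 2 * 1024 * 1024
MAX_PARTITION_KEY_BYTES = 2048
MAX_ID_BYTES = 1023


def check_limits(doc, partition_key_field):
    """Return a list of limit violations for one document (empty if none)."""
    problems = []
    # Approximation: size of the document as UTF-8 encoded JSON text.
    if len(json.dumps(doc).encode("utf-8")) > MAX_DOCUMENT_BYTES:
        problems.append("document exceeds 2 MB")
    partition_key_value = str(doc[partition_key_field])
    if len(partition_key_value.encode("utf-8")) > MAX_PARTITION_KEY_BYTES:
        problems.append("partition key value exceeds 2048 bytes")
    if len(doc["id"].encode("utf-8")) > MAX_ID_BYTES:
        problems.append("id exceeds 1023 bytes")
    return problems


ok_problems = check_limits({"id": "order-1", "userId": "user-a"}, "userId")
long_id_problems = check_limits({"id": "x" * 1024, "userId": "user-a"}, "userId")
```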

Reference

https://docs.microsoft.com/en-us/azure/cosmos-db/concepts-limits

Handling Concurrent Writing

(Plain English: a situation where multiple writers update one document at the same time.)

Concurrency is when multiple writers are trying to update the same document in a collection. Relational databases handle this situation with locking, i.e., you can lock the row or the table that is being updated until the transaction is completed.

Unfortunately, there is no locking facility provided by Cosmos DB. Instead, Cosmos DB provides optimistic concurrency control to manage concurrent access to its documents.

Every document has an “_etag” key that is auto-generated. This key is equivalent to a version number of a document. Writers can control access to the document by using this key.

For Example:

There are two writers who want to update the same document almost at the same time.

At the time of writing, partial updates of a document are NOT possible in Cosmos DB. The only way is to read the entire document into memory, update it there, and save the whole document back to the DB.

Without Optimistic Concurrency Control (Causes Information Loss)

Writer 1 reads the document.
Writer 2 reads the document.
Writer 1 updates the document (in-memory) and saves it to the DB.
Writer 2 updates the document (in-memory) and saves it to the DB, thereby losing all of writer 1’s updates.

With Optimistic Concurrency Control (No Information Lost)

Writer 1 reads the document.
Writer 2 reads the document.
Writer 1 updates the document (in-memory) and saves it to the DB (Cosmos DB updates the “_etag” value with every update of the document).
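Writer 2 updates the document (in-memory) and saves it to the DB, passing the “_etag” it read earlier through the “setIfMatchETag(eTag)” option on the request options (“CosmosItemRequestOptions” in the Java SDK v4). The update fails with a precondition failure (HTTP 412) because the stored “_etag” has changed.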
Writer 2 reads the document again.
Writer 2 updates the freshly read document (in-memory) and saves it to the DB along with the new “_etag”.
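The sequence above can be replayed with a small in-memory simulation. To be clear, this is not the Cosmos DB SDK: InMemoryStore, PreconditionFailed and update_with_retry are hypothetical names I invented, and the store only mimics the behaviour described here (a new "_etag" on every write, and a rejected write when the supplied etag is stale).

```python
import copy
import uuid


class PreconditionFailed(Exception):
    """Raised when the supplied etag no longer matches the stored document."""


class InMemoryStore:
    def __init__(self):
        self._docs = {}

    def upsert(self, doc):
        """Unconditional write; assigns a fresh _etag like the server would."""
        stored = copy.deepcopy(doc)
        stored["_etag"] = uuid.uuid4().hex
        self._docs[doc["id"]] = stored
        return copy.deepcopy(stored)

    def read(self, doc_id):
        return copy.deepcopy(self._docs[doc_id])

    def replace(self, doc, if_match_etag=None):
        """Conditional write: fails if the stored _etag differs from if_match_etag."""
        current = self._docs[doc["id"]]
        if if_match_etag is not None and current["_etag"] != if_match_etag:
            raise PreconditionFailed(doc["id"])
        return self.upsert(doc)


def update_with_retry(store, doc_id, mutate, max_attempts=3):
    """Read, mutate in memory, write with the etag; re-read and retry on conflict."""
    for _ in range(max_attempts):
        doc = store.read(doc_id)
        mutate(doc)
        try:
            return store.replace(doc, if_match_etag=doc["_etag"])
        except PreconditionFailed:
            continue
    raise RuntimeError("gave up after repeated etag conflicts")


store = InMemoryStore()
store.upsert({"id": "order-1", "items": []})

writer1 = store.read("order-1")
writer2 = store.read("order-1")

writer1["items"].append("book")
store.replace(writer1, if_match_etag=writer1["_etag"])  # succeeds, etag changes

writer2["items"].append("pen")
try:
    store.replace(writer2, if_match_etag=writer2["_etag"])  # stale etag
    conflict_detected = False
except PreconditionFailed:
    conflict_detected = True

# Writer 2 reads again and reapplies its change on top of writer 1's update.
update_with_retry(store, "order-1", lambda d: d["items"].append("pen"))
final_items = store.read("order-1")["items"]
```

The retry helper caps the number of attempts so that a document under heavy contention cannot keep a writer looping forever; the right cap depends on your workload.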

Things to look out for

Your document structure might need an _etag field, so that the value read from the DB is still available when you write the document back.

Reference

https://docs.microsoft.com/en-us/azure/cosmos-db/database-transactions-optimistic-concurrency

Transactions

First, I would like to provide some context about the Transactions term for people not from the SQL world.

Let us say that multiple operations are executed as a batch. If even one operation in the batch fails, all the operations that were executed before it in the batch are “rolled back” as if they had never been executed, and the batch execution is canceled.

Transactions are possible within the items of a logical partition. It is not possible to execute transactions across items located in multiple partitions.

You can achieve transactions in Cosmos DB using Stored Procedures, which are written in JavaScript.
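The all-or-nothing behaviour within one logical partition can be illustrated with a short in-memory sketch. This is not a stored procedure and not SDK code; execute_batch and create are hypothetical helpers, and the dictionary user_a_partition stands in for a single logical partition.

```python
import copy


def execute_batch(partition, operations):
    """Apply every operation to one logical partition, or none of them."""
    working_copy = copy.deepcopy(partition)
    for operation in operations:
        operation(working_copy)  # any exception aborts the whole batch
    # Only reached when every operation succeeded: commit the changes.
    partition.clear()
    partition.update(working_copy)


def create(doc):
    """Build an operation that inserts doc and fails if its id already exists."""
    def operation(documents):
        if doc["id"] in documents:
            raise ValueError(f"id {doc['id']} already exists")
        documents[doc["id"]] = doc
    return operation


# One logical partition: all documents share userId "user-a".
user_a_partition = {"order-1": {"id": "order-1", "userId": "user-a"}}

try:
    execute_batch(user_a_partition, [
        create({"id": "order-2", "userId": "user-a"}),
        create({"id": "order-1", "userId": "user-a"}),  # conflicts, batch fails
    ])
    rolled_back = False
except ValueError:
    rolled_back = True

ids_after_failed_batch = sorted(user_a_partition)

execute_batch(user_a_partition, [
    create({"id": "order-2", "userId": "user-a"}),
    create({"id": "order-3", "userId": "user-a"}),
])
ids_after_successful_batch = sorted(user_a_partition)
```

Note that order-2 from the failed batch never becomes visible, even though its own insert would have succeeded in isolation.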

Reference

https://docs.microsoft.com/en-us/azure/cosmos-db/database-transactions-optimistic-concurrency

Some more useful concepts

Server-side programming
Triggers – Pre and Post
Stored Procedures
User-Defined Functions
Time to live for a document
Change Feed Processor

Notes for Kotlin Developers

As of today, Microsoft has a Java SDK (v4) available for working with Cosmos DB. The biggest problem Kotlin developers will face using this SDK is serialization, especially when they use Kotlin-specific data structures like sealed classes.

Internally, the Java SDK performs serialization and deserialization using the Jackson library. Currently, there is no way to specify your own serialization and deserialization mechanism in the Java SDK (you can specify your own serialization mechanism in the .NET SDK!). There is, however, a workaround. The workaround is as follows:

Serialization

Serialize your data structure into a JSON String using your favorite Kotlin serializer (we use kotlinx.serialization).
Convert the JSON String into a JSON Node (using the Jackson serializer).

De-serialization

Specify that you want the type of your data structure to be JSON Node.
Convert from the JSON Node to a JSON String (using the Jackson serializer).
Convert from the JSON String to the specified type using your favorite serializer (we use kotlinx.serialization).
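The shape of this two-hop round trip can be sketched in a language-neutral way. The snippet below is only an analogy and is not Kotlin or SDK code: Python's json module stands in for both serializers, a plain dict plays the role of the JSON Node, and Order, to_tree and from_tree are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Order:
    id: str
    userId: str
    total: float


def to_tree(order):
    """Serialization: own serializer to a JSON String, then parse into a generic tree."""
    json_string = json.dumps(asdict(order))  # stand-in for your own serializer
    return json.loads(json_string)           # stand-in for the JSON Node


def from_tree(tree):
    """De-serialization: generic tree to a JSON String, then to the typed object."""
    json_string = json.dumps(tree)           # tree back to a JSON String
    return Order(**json.loads(json_string))  # stand-in for your own deserializer


order = Order(id="order-1", userId="user-a", total=25.0)
tree = to_tree(order)
restored = from_tree(tree)
```

The point of the extra hop is that the SDK only ever sees the generic tree, so the typed conversion stays entirely under your own serializer's control.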