A Quick Practical Overview of Cosmos DB

Before Cosmos DB, my world was a relational SQL world; that is to say, I have designed plenty of normalized databases for projects of varying sizes. However, for the latest project I have been working on, my team and I decided to go with Cosmos DB, which is a NoSQL document database. Every NoSQL document database out there has its nuances, and Cosmos DB is no different.

It took a lot of hands-on coding and documentation reading to understand the features and nuances of Cosmos DB well enough to design the correct, scalable schema and data structures that the team now uses for our system.

In this blog post, I am going through the pain of documenting my findings so that others may go through this cheat sheet and save a lot of time and energy.

Selection of a partition key

Perhaps the most important aspect of designing a Cosmos DB schema is the partition key itself. The partition key “breaks” your collection into logical partitions: all documents that share the same partition key value belong to the same logical partition.

Where do I begin with this one? There is a lot of documentation about choosing the partition key.

I will summarize and encapsulate some of that below:

Example

Let us get into some examples because that’s the best way to understand this.

I have a collection that tracks the orders (purchases) of users. Every access or query pattern against this collection is sure to have a UserId associated with it.

We create a new document every time a user places a new order. We use UserId as our partition key.
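To make the grouping concrete, here is a minimal in-memory sketch of how order documents fall into logical partitions when UserId is the partition key. It is a model only and does not use the Cosmos DB SDK; the `Order` class, its fields, and `partitionByUser` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model for illustration only; this is not the Cosmos DB SDK.
final class PartitionSketch {

    // A hypothetical order document; userId plays the role of the partition key.
    static final class Order {
        final String id;
        final String userId;
        final double total;

        Order(String id, String userId, double total) {
            this.id = id;
            this.userId = userId;
            this.total = total;
        }
    }

    // Groups documents by partition key value, the way logical partitions do.
    static Map<String, List<Order>> partitionByUser(List<Order> orders) {
        Map<String, List<Order>> partitions = new HashMap<>();
        for (Order order : orders) {
            partitions.computeIfAbsent(order.userId, key -> new ArrayList<>()).add(order);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("order-1", "user-a", 20.0),
                new Order("order-2", "user-b", 5.5),
                new Order("order-3", "user-a", 12.0));
        Map<String, List<Order>> partitions = partitionByUser(orders);
        System.out.println("logical partitions: " + partitions.size());
        System.out.println("orders for user-a: " + partitions.get("user-a").size());
    }
}
```

A query that carries a UserId only has to look inside one of these groups, which is why a key that appears in every access pattern is a natural candidate.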

Things to look out for

References

Using the “id” field

Every document in a Cosmos DB collection needs to have this “id” field. The id must be unique among the documents within a logical partition.

Things to look out for

Contrary to popular belief, this field on its own is NOT the unique identifier of a document in the collection!

The unique identifier for a document is a combination of both the id field and partition key value.
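A small in-memory sketch of that rule follows. It is not SDK code; `DocumentIdentitySketch` and its methods are hypothetical names. Documents are keyed by the pair (partition key value, id), so the same id can be reused in another partition but not inside the same one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model for illustration only; this is not the Cosmos DB SDK.
final class DocumentIdentitySketch {

    // The map key is the pair (partition key value, id).
    // List.of(...) compares by value, so equal pairs collide as intended.
    private final Map<List<String>, String> documents = new HashMap<>();

    // Returns false when the (partition key, id) pair is already taken.
    // Assumes a non-null body.
    boolean create(String partitionKey, String id, String body) {
        return documents.putIfAbsent(List.of(partitionKey, id), body) == null;
    }

    int count() {
        return documents.size();
    }

    public static void main(String[] args) {
        DocumentIdentitySketch collection = new DocumentIdentitySketch();
        System.out.println(collection.create("user-a", "1", "{}")); // new pair
        System.out.println(collection.create("user-b", "1", "{}")); // same id, other partition
        System.out.println(collection.create("user-a", "1", "{}")); // duplicate pair
    }
}
```

The practical consequence is that a point read needs both values: the id alone does not say which logical partition to look in.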

Unique Key Policy

A unique key constraint is scoped to a logical partition: the values must be unique within each logical partition, not across the whole collection. Cosmos DB provides the flexibility to implement some fairly complex unique key policies.

Things to look out for

Reference

Some Max Numbers in Cosmos DB

Reference

Handling Concurrent Writing

(Plain English: a situation where multiple writers update one document at the same time)

A concurrent write happens when multiple writers try to update the same document in a collection. Relational databases handle this situation with locking, i.e., you can lock the row or the table that is being updated until the transaction is completed.

Unfortunately, Cosmos DB provides no locking facility. Instead, it provides optimistic concurrency control to manage access to its documents.

Every document has an “_etag” key that is auto-generated. This key is equivalent to a version number of a document. Writers can control access to the document by using this key.

For Example:

There are two writers who want to update the same document almost at the same time.

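At the time of writing, partial updates of a document were NOT possible in Cosmos DB (a partial-update Patch operation has been added since). The flow described here is read-modify-write: read the entire document into memory, update it in memory, and save the whole document back to the DB.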

Without Optimistic Concurrency Control (Causes Information Loss)

  1. Writer 1 reads the document.
  2. Writer 2 reads the document.
  3. Writer 1 updates the document (in-memory) and saves it to the DB.
  4. Writer 2 updates the document (in-memory) and saves it to the DB, thereby losing all of writer 1’s updates.

With Optimistic Concurrency Control (No Information Lost)

  1. Writer 1 reads the document.
  2. Writer 2 reads the document.
  3. Writer 1 updates the document (in-memory) and saves it to the DB (Cosmos DB updates the “_etag” value with every update of the document).
  4. Writer 2 updates the document (in-memory) and saves it to the DB, passing the “_etag” it read in Step 2 via the “setIfMatchETag(eTag)” option on the request options (“CosmosItemRequestOptions” in the Java SDK v4). The update fails with HTTP 412 (Precondition Failed) because Writer 1’s save has already changed the “_etag”.
  5. Writer 2 reads the document again.
  6. Writer 2 updates the document from Step 5 (in-memory) and saves it to the DB along with the new “_etag”.
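The six steps above can be sketched as a compare-and-retry loop. The code below is an in-memory model, not the Cosmos DB SDK: `EtagSketch`, `replaceIfMatch`, `updateWithRetry`, and `PreconditionFailedException` are hypothetical names, and the etag is modeled as a simple counter, whereas the real “_etag” is an opaque value generated by the service.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// In-memory model of optimistic concurrency; this is not the Cosmos DB SDK.
final class EtagSketch {

    // A document body together with the etag it had when it was read.
    static final class Versioned {
        final String body;
        final String etag;

        Versioned(String body, String etag) {
            this.body = body;
            this.etag = etag;
        }
    }

    // Stands in for the "precondition failed" error of a conditional write.
    static final class PreconditionFailedException extends RuntimeException {}

    private final Map<String, Versioned> store = new HashMap<>();
    private long version = 0; // assumption: a counter is enough to model etag changes

    synchronized Versioned create(String id, String body) {
        Versioned created = new Versioned(body, String.valueOf(++version));
        store.put(id, created);
        return created;
    }

    synchronized Versioned read(String id) {
        return store.get(id);
    }

    // Conditional replace: succeeds only if the caller's etag is still current.
    synchronized Versioned replaceIfMatch(String id, String body, String expectedEtag) {
        Versioned current = store.get(id);
        if (current == null || !current.etag.equals(expectedEtag)) {
            throw new PreconditionFailedException();
        }
        Versioned next = new Versioned(body, String.valueOf(++version));
        store.put(id, next);
        return next;
    }

    // Steps 4 to 6: on a stale etag, re-read and reapply the change.
    // Assumes the document already exists.
    Versioned updateWithRetry(String id, UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned current = read(id);
            try {
                return replaceIfMatch(id, change.apply(current.body), current.etag);
            } catch (PreconditionFailedException e) {
                // Another writer saved first; loop to read the fresh version.
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

Note that the retry reapplies the change to the freshly read body rather than resending the stale one; that is what preserves Writer 1’s update. Capping the number of attempts is a design choice for this sketch, so that a heavily contended document fails loudly instead of looping forever.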

Things to look out for

Reference

Transactions

First, I would like to provide some context about the term “transaction” for people who are not from the SQL world.

A transaction executes multiple operations as one batch. If even one operation in the batch fails, all the operations that were already executed in that batch are “rolled back” as if they had never been executed, and the batch execution is canceled.

Transactions are possible within the items of a logical partition. It is not possible to execute transactions across items located in multiple partitions.

You can achieve transactions in Cosmos DB using stored procedures, which are written in JavaScript.
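The all-or-nothing behavior within a single logical partition can be modeled with a short in-memory sketch. This is not a stored procedure and not SDK code; `PartitionTransactionSketch`, `Upsert`, and the rule that an empty body counts as a failed operation are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of a partition-scoped transaction; not the Cosmos DB SDK.
final class PartitionTransactionSketch {

    // One write in the batch.
    static final class Upsert {
        final String id;
        final String body;

        Upsert(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    // partition key value -> (id -> body)
    private final Map<String, Map<String, String>> partitions = new HashMap<>();

    // Applies every operation to ONE logical partition, or none of them.
    boolean executeBatch(String partitionKey, List<Upsert> operations) {
        // Work on a scratch copy so a failure leaves committed data untouched.
        Map<String, String> working =
                new HashMap<>(partitions.getOrDefault(partitionKey, Map.of()));
        for (Upsert operation : operations) {
            // Hypothetical failure rule for this sketch: empty bodies are invalid.
            if (operation.body == null || operation.body.isEmpty()) {
                return false; // "roll back" by discarding the scratch copy
            }
            working.put(operation.id, operation.body);
        }
        partitions.put(partitionKey, working); // commit all writes together
        return true;
    }

    String read(String partitionKey, String id) {
        return partitions.getOrDefault(partitionKey, Map.of()).get(id);
    }

    public static void main(String[] args) {
        PartitionTransactionSketch db = new PartitionTransactionSketch();
        boolean committed = db.executeBatch("user-a",
                List.of(new Upsert("order-1", "{\"total\":20}"), new Upsert("order-2", "")));
        System.out.println("committed: " + committed);
        System.out.println("order-1 present: " + (db.read("user-a", "order-1") != null));
    }
}
```

The batch takes exactly one partition key, which mirrors the restriction described above: there is no way to express a transaction that spans two logical partitions.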

Reference

Some more useful concepts

Notes for Kotlin Developers

As of today, Microsoft has a Java SDK (v4) available for working with Cosmos DB. The biggest problem Kotlin developers will face with this SDK is serialization, especially when they use Kotlin-specific data structures like sealed classes.

Internally, the Java SDK performs serialization and deserialization using the Jackson library. Currently, there is no way to specify your own serialization and deserialization mechanism in the Java SDK (you can specify your own serialization mechanism in the .NET SDK!). There is, however, a workaround. The workaround is as follows:

Serialization

  1. Serialize your data structure into a JSON string using your favorite Kotlin serializer (we use kotlinx.serialization).
  2. Convert the JSON string into a Jackson JsonNode.

De-serialization

  1. Specify that you want the type of your data structure to be a Jackson JsonNode.
  2. Convert the JsonNode into a JSON string (using Jackson).
  3. Convert the JSON string into the specified type using your favorite serializer (we use kotlinx.serialization).