Azure Cosmos DB for the rest of us: a five-part blog series

As Business Intelligence and Data Science professionals, we like nothing better than a new way to store data. So there was a lot of excitement over Azure Cosmos DB when it was announced at Build 2017.

Azure Cosmos DB can be described as the ‘everything everywhere’ database: multi-model, several kinds of consistency, and so on. And that’s what many organisations want… something close to one source of the truth, or at least one source for the data. But does that mean it’s the right source? How can the BI or Data Science consumer understand it? They are often the people closest to the sign-off authority, and they can help articulate the need for it.

I was interviewed recently for TechTarget and it became clear that the language and terminology can make Azure Cosmos DB’s utility harder to understand if you are new to it. I read the announcements and I thought… what does it actually mean, in the real world, for Business Intelligence, analytics and Data Science professionals? Hence this digestible blog series, aimed at explaining it in plain English for the people who will be the ones to consider using it. When you read the material, it pretty much says that Azure Cosmos DB does everything. However, it won’t do anything if it isn’t understood or made relevant.

Over the five days, I’ll pick out some of the underlying technology and why it’s useful in its different guises. In today’s post, I’ll pick out some of the terminology and explain what it actually means. Over the next four days, I’ll talk about the different flavours of database that are contained in Azure Cosmos DB, aimed at BI, Analytics and Data Science professionals. I’ll talk about some of the pieces that you can make use of in Azure Cosmos DB, such as:

  • Key-Value
  • Document Databases
  • Graph
  • Columnar / Column-Oriented Databases

Hopefully, by the end of the series, you’ll be as excited by the opportunities of Azure Cosmos DB as I am. If not, that’s ok – it’s possible that the technology isn’t for you, and inaction is an action in itself.

So let’s get started. Let’s look at the Azure Cosmos DB definition, taken from Microsoft’s site:

Azure Cosmos DB is the first globally-distributed data service that lets you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – backed by the most comprehensive SLAs in the industry. Azure Cosmos DB is built to power today’s IoT and mobile apps, and tomorrow’s AI-hungry future.

What?!?!?

Ok. Let’s go through that again, at normal person pace.

globally-distributed – computation is distributed close to the geographic location of the data and the users. This goes beyond the interconnection of servers found in the ‘olden days’ of legacy architectures. In this definition, the distribution of workloads within the architecture must be visible, adjustable, and automated.

What does this mean for you?
It means you have the capacity to use the cloud facility closest to you. This is important for both legal and practical reasons, such as the data privacy laws in your region.

elastically scale throughput – this means that computing resources can be scaled up and down easily by Azure. Azure will adapt to workload changes by provisioning and de-provisioning resources as required. If your requirement spikes for some reason, then capacity will rise to meet demand, so that available resources match the current demand as closely as possible. Throughput refers to the amount of information being processed in a given time, and this processing capacity does not need to be static.

What does this mean for you?
Think of your monthly reporting. Many organisations will run financial reports for the month end. This is a ‘spike’ in requirement, which you only need 12 times a year. You don’t necessarily want, or need, to buy servers and network resource specifically for this purpose; in fact, it may be overloading your existing resources. This is where Azure steps in. You could, for example, have VMs that wake up once a month, run your reports, and then go to sleep again.
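To make the month-end example concrete, here is a minimal sketch in Python of capacity following demand. Nothing in it is an Azure API: the helper name `scale_capacity`, the size of a capacity unit and the demand figures are all made up for illustration.

```python
import math


def scale_capacity(demand, unit_size=100, minimum_units=1):
    """Return how many capacity units are needed to cover `demand`.

    Hypothetical helper: `unit_size` is the amount of work one unit can handle.
    """
    units = math.ceil(demand / unit_size)
    return max(units, minimum_units)


# Made-up daily demand for one month: quiet days, then a month-end reporting spike.
daily_demand = [120] * 27 + [950, 1000, 400]

# Elastic: capacity is re-sized every day to match that day's demand.
elastic_units = [scale_capacity(d) for d in daily_demand]

# Fixed: capacity has to be sized for the busiest day, every single day.
fixed_units = [max(elastic_units)] * len(daily_demand)

print("Peak units needed:", max(elastic_units))
print("Unit-days used when elastic:", sum(elastic_units))
print("Unit-days used when fixed:", sum(fixed_units))
```

The gap between the last two numbers is the capacity you would be paying for, but not using, if you bought for the spike.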

elastically scale storage – this lets your application size its throughput and storage on demand, worldwide. Azure Cosmos DB is fine-grained enough that you can even scale at second and minute granularities. You can accommodate unexpected spikes in your workloads, or size downwards as required. This is a change from previous architectures, where the database has often been the least scalable component. Often, the phrase “scaling the database” means a project in itself.

What does this mean for you?
A data storage tier of an elastic application might add and remove data storage in response to cost and performance requirements. For example, it could vary the number of Virtual Machines in use – virtual machines ‘on tap’, if you will! Azure can monitor your elastic applications for you.

low latency – latency is the delay between a client request, probably a request made by you at your computer, and a cloud service provider’s response to that request.

What does this mean for you?
It means less waiting. When you run a query or open a report, the answer comes back quickly. Because Azure Cosmos DB is globally distributed, your data can be served from a region close to you, which shortens the round trip.
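Latency, as defined above, is simply something you can time. Here is a minimal sketch in Python that measures the delay between making a request and getting the response. The `fake_query` function is a stand-in I have invented; it sleeps for a moment to imitate a network round trip rather than calling any real database.

```python
import time


def measure_latency_ms(request):
    """Call `request` once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = request()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_query():
    """Stand-in for a database call; sleeps to imitate a network round trip."""
    time.sleep(0.05)  # pretend the round trip takes about 50 ms
    return ["row 1", "row 2"]


rows, latency_ms = measure_latency_ms(fake_query)
print(f"Got {len(rows)} rows in {latency_ms:.1f} ms")
```

Swap `fake_query` for a real call to your own data source and the same wrapper tells you what your users are actually waiting for.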

high availability – this sounds depressing but it’s very necessary. It assumes that there are points of failure at every component of a system, and that these points of failure will fail at some point. High availability is preparing for that eventuality, by building in strategies for coping with failure and using automated processes to recover from it. Fault-tolerant systems designed for high availability are achievable in the cloud.

What does this mean for you?
It means keeping the lights on, and your business running.
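One common strategy for coping with failure is failover: if one copy of the data cannot answer, ask another. The sketch below shows the idea in Python. It is my own toy illustration, not how Azure implements it, and every name in it (`ReplicaDown`, `read_with_failover`, the two replicas) is hypothetical.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request."""


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful answer."""
    errors = []
    for name, read in replicas:
        try:
            return name, read(key)
        except ReplicaDown as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All replicas failed: " + "; ".join(errors))


data = {"basket-42": ["book", "kettle"]}


def broken_replica(key):
    """Imitates a region that is currently offline."""
    raise ReplicaDown("region offline")


def healthy_replica(key):
    """Imitates a region that is up and holds a copy of the data."""
    return data[key]


replicas = [("primary", broken_replica), ("secondary", healthy_replica)]
served_by, value = read_with_failover(replicas, "basket-42")
print(f"Served by {served_by}: {value}")
```

The caller never sees the failure of the first replica; the recovery is automated, which is the point of designing for high availability.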

consistency – different entities (nodes) have their own copy of some data object, and the copies may not always be the same. This is a big topic and you can research further for yourself; this is the tip of the iceberg – or a speck of dust in the Cosmos? There are different types of consistency.

Eventual Consistency – this is the situation where conflicts can arise. However, nodes communicate their changes to each other to resolve those conflicts. In time, each node will agree upon the final value.

Strong Consistency – all nodes agree on the new or updated value. Here, all updates are visible to all clients simultaneously, which introduces a requirement for blocking in update operations.

What does this mean for you?
Let’s take the case of an online shopping basket. Your purchases may be up to date on some nodes… but not all of them. The others need to catch up in order to resolve the conflict. This may not be noticeable to you or the purchaser. This would be eventual consistency. With strong consistency, you need the data to ‘agree’ everywhere – for example, in your monthly reporting. Your consistency level depends on your requirement.
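Here is the shopping basket as a small Python sketch of eventual consistency. It is a deliberately simplified model of my own: each replica carries a version number, and the ‘catching up’ step just copies the newest version everywhere. Real systems resolve conflicts in more sophisticated ways, and none of these names come from Azure Cosmos DB.

```python
class Replica:
    """A node holding its own copy of the shopping basket."""

    def __init__(self, name):
        self.name = name
        self.basket = []
        self.version = 0


def write(replica, item):
    """Apply a write on one replica only; the others have not heard about it yet."""
    replica.basket = replica.basket + [item]
    replica.version += 1


def sync(replicas):
    """Catch-up step: every replica adopts the newest version among them."""
    newest = max(replicas, key=lambda r: r.version)
    for r in replicas:
        r.basket = list(newest.basket)
        r.version = newest.version


london, sydney = Replica("london"), Replica("sydney")
write(london, "book")

before = (london.basket, sydney.basket)  # the replicas disagree
sync([london, sydney])
after = (london.basket, sydney.basket)   # the replicas agree

print("Before sync:", before)
print("After sync:", after)
```

Between the write and the sync, a shopper reading from Sydney would see an empty basket. Strong consistency removes that window by making every write wait until all replicas have it, which is the blocking described above.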

 

How does this relate to Azure Cosmos DB?

Business value will be created in the applications and reorganizations enabled by Azure. You don’t have to worry so much about the Cloud infrastructure itself. For example, when tuning for throughput, Azure Cosmos DB allows you to easily increase or decrease the amount of reserved throughput available to your application. Also, since it is globally distributed, Azure Cosmos DB will replicate your data wherever your users are. For Business Intelligence and Data Science consumers, that’s incredibly useful.

You can think more about your applications and workloads. Often, developers don’t want to think about database structures, and they rely on ORM tools to write SQL for them. This really gives developers something that they want anyway: a very forgiving place to store data.

You can choose what consistency you require. With Azure Cosmos DB, developers do not have to settle for the extreme consistency choices that I described earlier – strong vs. eventual consistency. Instead, Azure Cosmos DB offers some ‘grey’ in there by offering 5 well-defined consistency choices:

Figure: the five consistency levels. Credit: Microsoft https://docs.microsoft.com/en-us/azure/documentdb/documentdb-consistency-levels

Consistency levels and their guarantees (credit: https://docs.microsoft.com/en-us/azure/documentdb/documentdb-consistency-levels)

  • Strong – Linearizability
  • Bounded Staleness – Consistent Prefix. Reads lag behind writes by k prefixes or t interval
  • Session – Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads
  • Consistent Prefix – Updates returned are some prefix of all the updates, with no gaps
  • Eventual – Out of order reads
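If it helps to see the five levels as a scale rather than a list, here is a small Python sketch. It treats the order above, top to bottom, as strongest to weakest. The names `ConsistencyLevel`, `GUARANTEES` and `at_least` are my own for illustration and are not taken from any Azure SDK.

```python
from enum import IntEnum


class ConsistencyLevel(IntEnum):
    """The five levels, numbered from weakest (1) to strongest (5)."""

    EVENTUAL = 1
    CONSISTENT_PREFIX = 2
    SESSION = 3
    BOUNDED_STALENESS = 4
    STRONG = 5


# Guarantees copied from the list of consistency levels above.
GUARANTEES = {
    ConsistencyLevel.STRONG: "Linearizability",
    ConsistencyLevel.BOUNDED_STALENESS: "Consistent Prefix. Reads lag behind writes by k prefixes or t interval",
    ConsistencyLevel.SESSION: "Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads",
    ConsistencyLevel.CONSISTENT_PREFIX: "Updates returned are some prefix of all the updates, with no gaps",
    ConsistencyLevel.EVENTUAL: "Out of order reads",
}


def at_least(required):
    """List the levels that are as strong as, or stronger than, `required`."""
    return [level for level in ConsistencyLevel if level >= required]


for level in at_least(ConsistencyLevel.SESSION):
    print(f"{level.name}: {GUARANTEES[level]}")
```

The useful question for a BI or Data Science consumer is the one `at_least` asks: what is the weakest level that still meets my requirement? Anything stronger than that tends to cost you in latency.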

 

As we progress through this series, we will add more to this question. But for now, over to you!

Your homework!

Here are some videos on Azure Cosmos DB for you to view. You can learn more about the research behind Azure Cosmos DB by watching this video from Dr. Leslie Lamport: Turing Award winner, Microsoft researcher, distributed systems giant and an inspiration.

Next steps!

Tomorrow, we will talk more about key-value databases and how they are manifested in Azure Cosmos DB. Stand by for more Azure Cosmos DB goodness!