Want to learn how to light up Big Data Analytics using Apache Spark in Azure?

Businesses struggle with many different aspects of data and technology. It can be difficult to know which technology to choose, and hard to know where to turn when there are so many buzzwords in the mix: analytics, big data and open source. My session at PASS Summit tackles these questions, using Azure and Apache Spark as a backdrop.

Vendors tend to tell their version of events, as you might expect, so it becomes really hard to get advice on how to have a proper blueprint to get you up and running. In this session, I will examine strategies for using open source technologies to improve existing common Business Intelligence issues, using Apache Spark as our backdrop to delivering open source Big Data analytics.

Once we have looked at the strategies, we will look at your choices on how to make the most of the open source technology. For example, how can we make the most of the investment? How can we speed things up? How can we manipulate data?

These business questions translate into technical ones. We will explore how to parallelize your computations across the nodes of a Hadoop cluster, once your clusters are set up. We will look at combining SparkR for data manipulation with ScaleR for model development in Hadoop Spark. At the time of writing, this scenario requires that you maintain separate Spark sessions, running only one session at a time, and exchange data via CSV files. Hopefully, in the near future, we’ll see an R Server release in which SparkR and ScaleR can share a Spark session and so share Spark DataFrames. With luck, that will be out before the session so that we can see it. Either way, we will look at how ScaleR works with Spark and how we can use sparklyr and SparkR within a ScaleR workflow.
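
To make the CSV hand-off a little more concrete, here is a rough sketch in plain Python rather than R. It is not SparkR or ScaleR code: the two functions `manipulate` and `fit_line` are hypothetical stand-ins for the data manipulation step and the model development step, the sales figures are made up, and the only thing the sketch is meant to show is the shape of the workflow, where one stage finishes and writes a CSV file before the next stage starts from that file.

```python
import csv
import os
import tempfile

def manipulate(rows):
    """Stage 1, standing in for data manipulation: drop incomplete rows, derive a column."""
    return [
        {"month": r["month"], "sales_k": r["sales"] / 1000}
        for r in rows
        if r["sales"] is not None
    ]

def export_csv(path, rows):
    """Write the prepared rows to the CSV file that the next stage will pick up."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["month", "sales_k"])
        writer.writeheader()
        writer.writerows(rows)

def import_csv(path):
    """Read the hand-off file back in; CSV carries text only, so types are restored here."""
    with open(path, newline="") as handle:
        return [(int(r["month"]), float(r["sales_k"])) for r in csv.DictReader(handle)]

def fit_line(points):
    """Stage 2, standing in for model development: least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x

# Made-up monthly sales figures, purely for illustration.
raw = [
    {"month": 1, "sales": 1000.0},
    {"month": 2, "sales": None},
    {"month": 3, "sales": 3000.0},
    {"month": 4, "sales": 4000.0},
]

handoff = os.path.join(tempfile.mkdtemp(), "handoff.csv")
export_csv(handoff, manipulate(raw))               # first stage finishes and writes the file
slope, intercept = fit_line(import_csv(handoff))   # second stage starts from the file
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The cost of this pattern is visible even in a toy: everything has to be serialised to text and parsed back again, which is exactly what a shared session and shared DataFrames would avoid.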

Join my session at PASS Summit 2017 to learn more about open source with Azure for Business Intelligence, with a focus on Azure Spark.

An MVP for 7 years – what’s next?

Well, it’s been a hard year, for a number of reasons, but I appear to have come out the other side.

Looking forward, what comes next?

New things!

As some of you know, I care deeply about diversity in technology.

I have set up a Diversity Charter Slack channel where user group leaders can talk about diversity and how we can encourage one another to think about these issues.

I have also started work on a Diversity Charter that user groups can use. I need help with things like logos, thoughts on a website and so on – so please do help if you can!

The Diversity Charter looks like this, so far:

We believe that all members of the technical community are equally important.
We are part of a tech community where we value a diverse network, and learn and share from one another:
regardless of age,
regardless of colour,
regardless of their ethnicity,
regardless of their religion or beliefs,
regardless of disability,
regardless of gender or sexual orientation,
regardless of their race,
regardless of their ability or lack of ability,
regardless of nationality or accent.
We are a diverse tech community where we are all individuals with differences, but we are all members and we can all learn from each other.

I look forward to your thoughts. Please do join my Slack channel diversitycharter.slack.com/ or ping me an email at diversity@datarelish.com in order to get an invite.

I will continue to share my knowledge through blogging, writing, speaking and presenting, and to increase my online presence. At heart, I am a content producer. It’s what I do, and it’s what I love.

I will continue working hard on the PASS Board. I just attended a Board meeting, which took place over two nights during the week in the PST timezone. I am based in the GMT timezone, so I had a few very late nights or very early mornings, depending on your view. My recent focus has been in a ‘trusted advisor’ capacity, so I am helping to drive the new developer initiatives and business analytics initiatives in a strategic manner.

To keep the community fresh, I will continue to try to help develop other community leaders. I have nominated a lot of people for the MVP Award this year, including David Moss, Tomaz Kastrun and other people that I won’t mention, because they weren’t successful this time.

Business Book Review: Start with Why by Simon Sinek

Perhaps a better subtitle might be: ‘Start with Why: how great leaders inspire others to focus and succeed’.

Great leaders and organizations are good at seeing what most people can’t see, which comes from the mindset of having a longer-term vision. Starting with a clear focus and a *why* is a great start, and presumably shows that you have really thought about it.
People, brands and organisations need to start with WHY, to give people a way to tell the outside world who they are and what they believe.

As an external consultant, I have found the ‘celery test’ extremely useful when advising customers. Here is Sinek himself, on the topic:

Although the subtitle is about inspiring great leaders to take action, I found that I took the idea of ‘focus’ away from the book. Sometimes, I see organisations acting like a start-up: trying to achieve breadth and coverage quickly, and hoping that something will stick with customers.

Starting with WHY means that organisations can achieve integrity, because they will have success, and a story to go with success. Organisations, large and small – and even one-man-band consultants – need to think about their megaphone and what they are actually saying to customers.
Don’t be fooled into thinking that customers pay much attention. Instead, they ‘snapshot’ and you only have a small time to get your message across. By keeping your messaging simple, it means that there is less ‘noise’ for confusion.
I recommend you read it – I found it inspirational, and it helped me to get back to my ‘story’, and to think about my customers’ ‘stories’ as well. I had read that it was over-long, but I liked Sinek’s way of weaving story and ‘relatable’ anecdote with the points he was making. Sometimes I would find something different in the anecdote than he intended, so I was taking my understanding to a different level.

Favourite Quotes

For values or guiding principles to be truly effective they have to be verbs. It’s not “integrity,” it’s “always do the right thing.” It’s not “innovation,” it’s “look at the problem from a different angle.” Articulating our values as verbs gives us a clear idea – we have a clear idea of how to act in any situation.

 

Review: Start with Why: How Great Leaders Inspire Everyone to Take Action

Start with Why: How Great Leaders Inspire Everyone to Take Action by Simon Sinek
My rating: 5 of 5 stars

Perhaps a better subtitle might be: ‘Start with Why: how great leaders inspire others to focus and succeed’. Great leaders and organizations are good at seeing what most people can’t see, which comes from the mindset of having a longer-term vision. Starting with a clear focus and a *why* is a great start, and presumably shows that you have really thought about it.
People, brands and organisations need to start with WHY, to give people a way to tell the outside world who they are and what they believe. As an external consultant, I have found the ‘celery test’ extremely useful when advising customers. Although the subtitle is about inspiring great leaders to take action, I found that I took the idea of ‘focus’ away from the book. Often, I see organisations acting like a start-up: trying to achieve breadth and coverage quickly, and hoping that something will stick with customers. Starting with WHY means that organisations can achieve integrity, because they will have success, and a story to go with success. Organisations, large and small – and even one-man-band consultants – need to think about their megaphone and what they are actually saying to customers.
Don’t be fooled into thinking that customers pay much attention. Instead, they ‘snapshot’ and you only have a small time to get your message across. By keeping your messaging simple, it means that there is less ‘noise’ for confusion.
I recommend you read it – I found it inspirational, and it helped me to get back to my ‘story’, and to think about my customer’s ‘stories’ as well.

PASS BA Analytics Days coming up

BA Day Atlanta

Enjoy a day of intensive business analytics training.

BA Days are one-day learning events designed to provide intensive, in-depth training on business analytics topics.

To register or receive more information about either of the two upcoming 2017 BA Days, click on one of the dates below:

Wednesday, June 21 in Atlanta, GA, USA
o Data Science with Excel, Open Source R, and Python for Data Analysts
o Beyond the Basics – Advanced Power BI for the Business Analyst

Tuesday, August 1 in San Diego, CA, USA
o Applied Data Science in a World of Big Data
o Telling Compelling Stories Using Data to Achieve Business Goals

Sign up for the BA Insights newsletter for expert knowledge, articles, and information on future BA Day and other training events. Just go to myProfile and update communications preferences.

Azure Cosmos DB for the rest of us: 5 part blog series

As Business Intelligence and Data Science professionals, we like nothing better than the excitement of new ways to store data. So there was a lot of excitement over Azure Cosmos DB when it was announced at Build 2017.

Azure Cosmos DB can be described as the ‘everything everywhere’ database: multi-model, all kinds of consistency, and so on. And that’s what many organisations want… something that’s close to one source of the truth, or at least one source for the data. But does that mean it’s the right source? How can the BI or Data Science consumer understand it? They are the ones who can be closer to the sign-off authority and they can help articulate the need for it.

I was interviewed recently for TechTarget and it became clear that the language and terminology can make Azure Cosmos DB’s utility harder to understand if you are new to it. I read the announcements and I thought… what does it actually mean, in the real world, for Business Intelligence, analytics and Data Science professionals? Hence this digestible blog series, aimed at explaining it in plain English for the people who will be the ones to consider using it. When you read the material, it pretty much says that Azure Cosmos DB does everything. However, it won’t do anything if it isn’t understood or made relevant.

Over the five days, I’ll pick out some of the underlying technology and why it’s useful in its different guises. In today’s post, I’ll pick out some of the terminology and explain what it actually means. Over the next four days, I’ll talk about the different flavours of database that are contained in Azure Cosmos DB, aimed at BI, Analytics and Data Science professionals. I’ll talk about some of the pieces that you can make use of in Azure Cosmos DB, such as:

  • Key-Value
  • Document Databases
  • Graph
  • Columnar / Column-Oriented Databases

Hopefully, by the end of the series, you’ll be as excited by the opportunities of Azure Cosmos DB as I am. If not, that’s ok – it’s possible that the technology isn’t for you, and inaction is an action in itself.

So let’s get started. Let’s look at the Azure Cosmos DB definition, taken from Microsoft’s site:

Azure Cosmos DB is the first globally-distributed data service that lets you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – backed by the most comprehensive SLAs in the industry. Azure Cosmos DB is built to power today’s IoT and mobile apps, and tomorrow’s AI-hungry future.

What?!?!?

OK. Let’s go through that again, at normal-person pace.

globally-distributed – distribution of computation close to the geographic location of the data and the users. It goes beyond interconnection of servers as in the ‘olden days’ of legacy architectures. In this definition, the distribution of workloads within the architectures must be visible, adjustable, and automated.

What does this mean for you?
It means you have the capacity to use the cloud facility closest to you. This is important for legal and practical reasons, such as data privacy laws in your region.

elastically scale throughput – this means that computing resources can be scaled up and down easily by Azure. Azure will adapt to workload changes by provisioning and de-provisioning resources as required, so that available resources match the current demand as closely as possible. If your requirement spikes for some reason, then capacity will rise to meet demand. Throughput here refers to the number of information units being processed in a given time, and this processing capacity does not need to be static.

What does this mean for you?
Think of your monthly reporting. Many organisations will run financial reports for the month end. This is a ‘spike’ in requirement, which you only need 12 times a year. You don’t necessarily want, or need, to buy servers and network resource specifically for this purpose; in fact, it may be overloading your existing resources. This is where Azure steps in. You could, for example, have VMs that wake up once a month, run your reports, and then go to sleep again.
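
If you like to see the arithmetic, here is a small Python sketch of the month-end idea. Everything in it is an assumption made for illustration: the demand figures are invented, `UNIT_CAPACITY` is an arbitrary number, and `units_needed` is a hypothetical helper rather than anything Azure provides. It simply compares sizing for the peak all the time against sizing to match demand in each period.

```python
import math

def units_needed(demand, unit_capacity, minimum=1):
    """Return how many capacity units are needed to cover `demand` requests per second."""
    return max(minimum, math.ceil(demand / unit_capacity))

# Invented requests-per-second figures for six periods;
# the fifth period is the month-end reporting spike.
demand_per_period = [40, 55, 35, 60, 900, 45]
UNIT_CAPACITY = 100  # assumed requests per second that one unit can handle

# Elastic: provision only what each period needs.
elastic_plan = [units_needed(d, UNIT_CAPACITY) for d in demand_per_period]

# Static: buy enough for the spike and keep it running in every period.
static_plan = [max(elastic_plan)] * len(demand_per_period)

print("elastic plan:", elastic_plan)
print("unit-periods used:", sum(elastic_plan), "elastic vs", sum(static_plan), "static")
```

The gap between the two totals is the capacity you would otherwise pay for and leave idle for most of the month.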

elastically scale storage – this lets your application size its storage on demand, worldwide, according to its throughput and storage needs. Azure Cosmos DB is fine-grained enough that you can even scale at second and minute granularities. You can accommodate unexpected spikes in your workloads, or size downwards as required. This is a change from previous architectures, where the database has often been the least scalable component. Often, the phrase “scaling the database” means a project in itself.

What does this mean for you?
A data storage tier of an elastic application might add and remove data storage due to cost and performance requirements. For example, it could vary the number of Virtual Machines in use – virtual machines ‘on tap’, if you will! Azure can monitor your elastic applications for you.

low latency – latency is the delay between a client request, probably a request made by you at your computer, and a cloud service provider’s response to that request.

What does this mean for you?
The lower the latency, the sooner your answer comes back. Because Azure Cosmos DB is globally distributed, requests can be served from a region close to you and your users, which helps to keep that delay small.
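
Latency is easy to measure for yourself: note the time, make the request, and note the time again when the response arrives. The Python sketch below does exactly that. The `fake_service` function is a made-up stand-in that just waits for a moment before answering, so no real cloud service is involved, and the delay is an arbitrary assumption.

```python
import time

def fake_service(request, delay_seconds=0.02):
    """Stand-in for a cloud service: waits for a moment, then answers."""
    time.sleep(delay_seconds)
    return f"response to {request}"

def timed_call(service, request):
    """Call the service and return its response with the elapsed time in seconds."""
    start = time.perf_counter()
    response = service(request)
    latency = time.perf_counter() - start
    return response, latency

response, latency = timed_call(fake_service, "monthly report")
print(f"{response!r} took {latency * 1000:.1f} ms")
```

In practice you would wrap a real request in `timed_call` and look at many measurements rather than one, since a single timing can be misleading.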

high availability – this sounds depressing but it’s very necessary. It assumes that there are points of failure at every component of a system, and that these points of failure will fail at some point. High availability means preparing for that eventuality, by building in strategies for coping with failure and using automated processes to recover from it. Fault-tolerant systems designed for high availability are achievable in the cloud.

What does this mean for you?
It means keeping the lights on, and your business running.
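
One common coping strategy is failover: if one copy of the service doesn’t answer, automatically try the next. Here is a minimal Python sketch of that idea. The replica names, the `ReplicaDown` exception and the `read_with_failover` helper are all hypothetical, invented for this example; this is not how Azure implements it, only an illustration of the principle.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot answer."""

def make_replica(name, healthy):
    """Build a pretend replica: a function that answers a read, or fails if unhealthy."""
    def read(key):
        if not healthy:
            raise ReplicaDown(name)
        return f"{key} served by {name}"
    return read

def read_with_failover(replicas, key):
    """Try each replica in turn; return the first answer plus the names that failed."""
    failures = []
    for read in replicas:
        try:
            return read(key), failures
        except ReplicaDown as error:
            failures.append(str(error))  # note the failure and move on to the next replica
    raise RuntimeError(f"all replicas failed: {failures}")

replicas = [
    make_replica("region-a", healthy=False),
    make_replica("region-b", healthy=False),
    make_replica("region-c", healthy=True),
]
answer, failed = read_with_failover(replicas, "basket-42")
print(answer, "after failures in", failed)
```

The caller never sees the two failures unless it asks; from the outside, the lights simply stay on.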

consistency – different entities (nodes) each have their own copy of some data object, and the copies may not always be the same. This is a big topic and you can research further for yourself; this is the tip of the iceberg – or a speck of dust in the Cosmos? There are different types of consistency.

Eventual Consistency – this is the situation where conflicts can arise. However, nodes communicate their changes to each other to resolve those conflicts. In time, each node will agree upon the final value.

Strong Consistency – all nodes agree on the new or updated value. Here, all updates are visible to all clients simultaneously, which introduces a requirement for blocking in update operations.

What does this mean for you?
Let’s take the case of an online shopping basket. Your purchases may be up to date on some nodes… but not all of them. The others need to catch up in order to resolve the conflict. This may not be noticeable by you or the purchaser. This would be eventual consistency. In strong consistency, you want the data to ‘agree’ – for example, your monthly reporting. Your consistency level depends on your requirement.
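
The shopping basket example can be played out in a few lines of Python. This is a toy simulation under simple assumptions, not a description of how Azure Cosmos DB works internally: each node holds a `(version, basket)` pair, conflicts are resolved by keeping the higher version (often called last-writer-wins), and nodes only ever talk to their neighbour. The names `merge` and `gossip_round` are made up for the sketch.

```python
def merge(a, b):
    """Last-writer-wins: keep whichever (version, value) pair has the higher version."""
    return a if a[0] >= b[0] else b

def gossip_round(nodes):
    """Each node compares notes with its right-hand neighbour; both keep the newer copy."""
    for i in range(len(nodes) - 1):
        newest = merge(nodes[i], nodes[i + 1])
        nodes[i] = nodes[i + 1] = newest

# Three nodes start out agreeing on an empty basket (version 0).
nodes = [(0, ()), (0, ()), (0, ())]

# A purchase lands on the last node only, so the other two are now stale.
nodes[2] = (1, ("book",))
before = list(nodes)

history = []
rounds = 0
while len(set(nodes)) > 1:   # keep going until every node holds the same copy
    gossip_round(nodes)
    rounds += 1
    history.append(list(nodes))

print("before:", before)
for number, state in enumerate(history, start=1):
    print(f"after round {number}:", state)
```

The state after the first round is the interesting one: some nodes know about the purchase and one does not. A read from that node at that moment would show an empty basket, which is the trade-off you accept with eventual consistency.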

 

How does this relate to Azure Cosmos DB?

Business value will be created in the applications and reorganizations enabled by Azure. You don’t have to worry so much about the Cloud infrastructure itself. For example, when considering tuning for throughput, Azure Cosmos DB allows you to easily increase or decrease the amount of reserved throughput available to your application. Also, since it is globally distributed, Azure Cosmos DB will replicate your data wherever your users are. For Business Intelligence and Data Science consumers, that’s incredibly useful.

You can think more about your applications and workloads. Often, developers don’t want to think about database structures, and they rely on ORM tools to write SQL for them. Azure Cosmos DB really gives developers something that they want anyway: a very forgiving place to store data.

You can choose what consistency you require. With Azure Cosmos DB, developers do not have to settle for the extreme consistency choices that I described earlier – strong vs. eventual consistency. Instead, Azure Cosmos DB offers some ‘grey’ in there by offering 5 well-defined consistency choices:

Consistency Levels and guarantees

Credit: Microsoft https://docs.microsoft.com/en-us/azure/documentdb/documentdb-consistency-levels

Consistency Level | Guarantees
Strong | Linearizability
Bounded Staleness | Consistent Prefix. Reads lag behind writes by k prefixes or t interval
Session | Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads
Consistent Prefix | Updates returned are some prefix of all the updates, with no gaps
Eventual | Out of order reads
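
Two of these guarantees are easy to express as simple checks, which may help to make the wording less abstract. The Python sketch below is my own reading of the guarantees listed above, not Microsoft’s implementation: `writes` is the full ordered list of updates, `seen` is what a reader has observed so far, and both function names are invented for the example. Here I read ‘lag behind by k’ as ‘lag behind by at most k updates’.

```python
def is_consistent_prefix(writes, seen):
    """True when `seen` is exactly the first len(seen) writes, in order, with no gaps."""
    return list(seen) == list(writes[:len(seen)])

def within_bounded_staleness(writes, seen, k):
    """True when `seen` is a consistent prefix that lags by at most k writes."""
    return is_consistent_prefix(writes, seen) and len(writes) - len(seen) <= k

writes = ["w1", "w2", "w3", "w4"]

print(is_consistent_prefix(writes, ["w1", "w2"]))            # in order, no gaps
print(is_consistent_prefix(writes, ["w1", "w3"]))            # w2 is missing: a gap
print(is_consistent_prefix(writes, ["w2", "w1"]))            # out of order
print(within_bounded_staleness(writes, ["w1", "w2", "w3"], k=1))  # one write behind
print(within_bounded_staleness(writes, ["w1"], k=1))         # three writes behind
```

A reader under eventual consistency could legitimately see the out-of-order case; the stronger levels rule it out.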

 

As we progress through this series, we will add more to this question. But for now, over to you!

Your homework!

Here are some videos on Azure Cosmos DB for you to view. You can learn more about the research implemented in Azure Cosmos DB by watching this video from Turing Award winner, Microsoft researcher, distributed systems giant and inspiration, Dr. Leslie Lamport.

Next steps!

Tomorrow, we will talk more about key-value databases and how they are manifested in Azure Cosmos DB. Stand by for more Azure Cosmos DB goodness!