Azure Tools and Technologies Cheat Sheet

Don’t you think the sheer number of Big Data technologies in Azure is confusing? I’ve distilled some information from the official Microsoft Azure blog to make it easier to digest. Before we begin, here is a potted Hadoop history:


This cheat sheet contains high-level descriptions of the tools, APIs, SDKs, and technologies that you’ll see in Azure. They are used in tandem with big data solutions, and they include both proprietary Azure and open source technologies.

I hope that this cheat sheet will help you to more easily identify the tools and technologies you should investigate, depending on the function.

Function | Description | Tools
Data consumption | Extracting and consuming the results from Hadoop-based solutions | Azure Intelligent Systems Service (ISS), Azure SQL Database, LINQ to Hive, Power BI, SQL Server Analysis Services (SSAS), SQL Server Database Engine, SQL Server Reporting Services (SSRS)
Data ingestion | Extracting data from data sources and loading it into Hadoop-based solutions | Aspera, Avro, AZCopy, Azure Intelligent Systems Service (ISS), Azure Storage Client Libraries, Azure Storage Explorer, Casablanca, CloudBerry Explorer, CloudXplorer, Cross-platform Command Line Interface (X-plat CLI), FileCatalyst, Flume, Hadoop command line, HDInsight SDK and Microsoft .NET SDK for Hadoop, Kafka, PowerShell, Reactive Extensions (Rx), Signiant, SQL Server Data Quality Services (DQS), SQL Server Integration Services (SSIS), Sqoop, Storm, StreamInsight, Visual Studio Server Explorer
Data processing | Processing, querying, and transforming data in Hadoop-based solutions | Azure Intelligent Systems Service (ISS), HCatalog, Hive, LINQ to Hive, Mahout, MapReduce, Phoenix, Pig, Reactive Extensions (Rx), Samza, Solr, SQL Server Data Quality Services (DQS), Storm, StreamInsight
Data transfer | Transferring data between Hadoop and other data stores such as databases and cloud storage | Falcon, SQL Server Integration Services (SSIS)
Data visualization | Visualizing and analyzing the results from Hadoop-based solutions | Azure Intelligent Systems Service (ISS), D3.js, Microsoft Excel, Power BI, Power Map, Power Query, Power View, PowerPivot
Job submission | Submitting processing jobs to Hadoop-based solutions | HDInsight SDK and Microsoft .NET SDK for Hadoop
Management | Managing and monitoring Hadoop-based solutions | Ambari, Azure Storage Client Libraries, Azure Storage Explorer, Cerebrata Azure Management Studio, Chef, Chukwa, CloudXplorer, Ganglia, Hadoop command line, Knox, Azure Management Portal, Azure SDK for Node.js, Puppet, Remote Desktop Connection, REST APIs, System Center management pack for HDInsight, Visual Studio Server Explorer
Workflow | Creating workflows and managing multi-step processing in Hadoop-based solutions | Azkaban, Cascading, Hamake, Oozie, SQL Server Integration Services (SSIS)

Any questions, please get in touch.

Collective Intelligence in the SQL Community

How does Collective Intelligence manifest itself in the SQL Server community? In case you are wondering, collective intelligence is intelligence that emerges from a group, and shows itself in the collaboration and competition of many individuals. The way I look at it, this translates easily to the SQL Community, which is a lively ecosystem where the collaboration of many clever people accelerates the flow of knowledge and information for the good of everyone. If you’re interested in collective intelligence and new ways of thinking about this intelligence in an egoless environment, then I’d suggest that you take a look at Ledface.

In particular, the Ledface article ‘Knowledge is not mine, not yours. Only the arrangement is ours’ is interesting, because we can see these concepts manifested in the SQL Server community. For me, Ledface makes a very subtle point about helpers seeing an improvement in the overall domain they care about. I think that’s true in the SQL Community: I can see the passion where people really care about the next level for SQL Server, pushing it forward for the benefit of SQL Server, its users, and the people who look after it. It’s about making it better for everyone else, as well as for individuals.

For this to work in a social environment, however, there needs to be minimal organisation with few or no rules. For example, if you use the Twitter hashtags incorrectly, then the community may sanction you by voicing this directly in a robust 140-character riposte, or by simply unfollowing you. If you’re really unlucky, you’ll be blocked! So I think there is an element of swarm intelligence in the SQL Community: we organise ourselves, we help ourselves, and we sanction ourselves. The community is decentralised, since we work all over the globe, which means that help is available 24x7 in a ‘follow the sun’ fashion.

In the SQL Community, we see examples of this helpfulness in many different ways. For example, a SQL novice contacted me recently to ask for links to T-SQL forums on the Internet, where they could post some T-SQL questions. Here is a quick list of some useful resources:

SQL Server Central
In case you haven’t explored this site, it also has a dedicated scripts section, which is a good place to look for scripts.

Microsoft forums – This is a dedicated T-SQL forum, which is always useful

Twitter – I use Twitter to answer questions sometimes. I like doing this because it means you are helping someone in ‘real time’, at the point at which they need it. The Twitter hashtag is #SQLHelp, and if you need to know the very informal rules around asking these questions, a useful resource by Brent Ozar is here. Although the help is ‘real time’, Jamie Thomson looked at this issue in his blog and I’d suggest you take a look.

Brent Ozar rightly points out that, as a courtesy, it’s nice to thank the Twitterati who have helped you via the #SQLHelp hashtag. I’d extend that courtesy out to the people on SSC and the Microsoft forums.

In my opinion, the SQL Community is stellar, partly because of our collective intelligence, and partly because of the ‘helping hand’ that we extend to one another. Long may it continue. I look forward to your comments.

Intelligent Laziness: Connecting to a GoDaddy SQL Server using SSIS and Tableau

I try to apply the principle of Intelligent Laziness to make life as straightforward as possible. It’s hard to find meaning and clarity in a world that guarantees neither, and the Absurd perspective means that I do not stop trying in any case. Please note that I’m not associated with GoDaddy in any way, and I’ll be pleased to hear of any other hosting options that you use; please leave a comment below.

One aspect is trying to get access to the data I need to deliver Business Intelligence projects as quickly and effectively as possible. As a freelancer, however, there is an additional issue around physical location: mine and the client’s. I don’t always work at the client site, and can work from home. In this case, I need access to data in a way that does not necessarily require access to the client’s network or VPN.

One solution I’ve found is to access client data via GoDaddy-hosted SQL Server databases, and I use SSIS to consume the data directly from there. There are already a lot of posts on connecting to GoDaddy-hosted SQL Server databases using .NET and so on, but it’s also very easy to connect directly from within SSIS. This post aims to show you how easy that is.

In order to show what we’re aiming towards, here is an example of a very simple dataflow, which retrieves the data from the GoDaddy source, and places it into a text file. If you click on the image, it will take you to a larger version held at my Flickr account.


To access the GoDaddy source, you will need to set up the connection to GoDaddy. To do this, you will need the following items:

Server details: This will look something like
Username: this will be SQL authentication, not Windows authentication. So, type your SQL authentication username here. In this example, the username is ‘stirrj’
Password: This will be your SQL authentication password.
Database list: Once you’ve put this information in, you should be able to see a list of databases in the drop down list. Here, the example database is called ‘denali’.
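Putting those items together, the connection manager ends up holding an OLE DB connection string of roughly this shape. This is only a sketch: the server address below is hypothetical (GoDaddy will give you your own), while the username ‘stirrj’ and database ‘denali’ are the values from this example.

```
Provider=SQLOLEDB;
Data Source=<your GoDaddy server address, e.g. myserver.example.secureserver.net>;
Initial Catalog=denali;
User ID=stirrj;
Password=<your SQL authentication password>;
```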

To assist you, here is an example of how it looks:


Once you’ve set up the connection manager, you can take a look at retrieving the data as normal.  Here is an example query, which retrieves data against the database hosted at GoDaddy.
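The query in the OLE DB Source is just ordinary T-SQL. As a sketch, assuming a hypothetical table called dbo.SalesData in the ‘denali’ database (substitute your own table and columns), it might look like this:

```sql
-- Hypothetical example: dbo.SalesData and its columns are illustrative only
SELECT CustomerID,
       OrderDate,
       SalesAmount
FROM   dbo.SalesData
WHERE  OrderDate >= '2012-01-01';
```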


You can see your OLE DB Connection Manager name in the drop-down list. Once you’ve set this up, you can test that it works, and hopefully you will get some green boxes, as below:


I personally haven’t experienced any issues with the speed of retrieving data. However, if you want to load this data into a Tableau workbook, you may want to consider a number of ways of optimising retrieval speed:

– You could try to connect Tableau directly. I haven’t tried this, but I imagine you just use the SQL Server connection details as above.
– Using SSIS, you could extract the data from GoDaddy to a local SQL Server environment for development work. I’m integrating data from various sources, so this is working for me. I then use a Tableau data extract (TDE) to hold the data in the Tableau workbook, and simply refresh the TDE extract. This works very nicely, particularly since I can share the Tableau workbook with other Tableau users as required.

One important note is that the Tableau Desktop edition will not connect to SQL Server, remote or otherwise. If you are a Desktop user, you will need to extract the data to a CSV file, and then import the data from the CSV file.

I hope that helps someone; any questions, just leave a comment and I will come back to you as soon as I can.

SSIS and Intelligent Laziness: SQL Server Package Configurations

Porting and deploying SSIS 2008 packages as part of the software development life cycle can be a tricky process! Applying the principles of ‘Intelligent Laziness’, here are some thoughts on the best package configuration method to use, one which results in the minimum work with the maximum long-term results! There are two main issues regarding porting packages from development through to the production environment. One issue is ensuring that the package runs in the new environment, without any problems. The second is ensuring that the deployed package picks up the correct production settings, not the old development settings.
Fortunately, there are different ways in SSIS of achieving the maximum portability in deploying packages throughout the software engineering cycle from development to production. The SSIS package is a collection of objects which depend on the environment. This can make deployment more complex than it perhaps needs to be for two reasons, which I call the ‘between environment’ and ‘within environment’ issues.
Between environment issue: for example, the package may take in data from one system, or perhaps load data from a file. This may change from environment to environment at the same point in time.
Within environment issue: there is the practical consideration that the same environment will change over time. For example, security access may change, which means that your package login to source systems may need to be amended.
In order to address these issues, it’s important to be able to modify a package with the minimum disruption to the package operation; preferably, to make these changes in a way that is insulated from the package itself. SSIS offers a way of doing this, which is called Package Configurations. These have two important aspects:
Package Path to the property – this is like an address for each object in the package. An example may include \Package.connections[ConnectionName].ConnectionString, which may hold the connection string for an Oracle source database, for example.
Property Value – the actual value when the address is resolved, e.g. server name, user name
There are many different ways to set up Package Configurations, but my preferred method is the SQL Server configuration method. Essentially, SQL Server table configurations mean that the package configuration values are stored in a SQL Server table, which is de-coupled from the package itself. The packages only need to be deployed once. Each time the package executes, it pulls in the configuration parameters from the SQL Server table. The SQL Server table does not even have to be stored on the SSIS server; it can be stored on a dedicated server. This is my preferred method for a number of reasons:
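When you choose the SQL Server configuration type, SSIS can create the configuration table for you. By default in SSIS 2008 it is called [SSIS Configurations] and looks like this (you can choose your own table and column names if you prefer):

```sql
-- Default configuration table generated by the SSIS 2008
-- SQL Server configuration type
CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups related settings for one or more packages
    ConfiguredValue     NVARCHAR(255) NULL,     -- the value applied at run time
    PackagePath         NVARCHAR(255) NOT NULL, -- e.g. \Package.Connections[ConnectionName].ConnectionString
    ConfiguredValueType NVARCHAR(20)  NOT NULL  -- e.g. String, Int32
);
```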
Flexible, central administration – wholesale changes can be made much more easily than with XML configuration files, which would need to be re-deployed each time.
Transparency – by allowing support and development teams to be able to view the configuration table, then it’s possible to make the packages easier to administer. In the case where a package is being deployed, it’s possible for the configurations to be double-checked by everyone involved in the deployment process.
Managing ‘within environment’ changes – For example, let’s take the case where the package login needs to be changed from one value to another, or a password needs to be changed. By simply running an update statement before the next iteration of package execution, the password value would be simply retrieved from the central SQL table at the point of execution. In the situation where there are a number of packages that require an amended login, then this one simple change would mean that all of the packages would be automatically updated.
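As a sketch of that ‘within environment’ change, assuming the default [SSIS Configurations] table and a hypothetical configuration filter and connection name, a password change becomes a single statement:

```sql
-- Hypothetical filter and connection names; every package sharing this
-- configuration picks up the new value on its next execution
UPDATE [dbo].[SSIS Configurations]
SET    ConfiguredValue = 'NewP@ssw0rd'
WHERE  ConfigurationFilter = 'SourceSystemConnection'
  AND  PackagePath = '\Package.Connections[SourceDB].ConnectionString';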
Managing ‘between environment’ changes – When taking a package from development through to production, it is possible to execute update statements so that the package is pointing at the production settings rather than the development settings. This can be done as part of the deployment process. Further, by releasing an update script, the deployment personnel can double-check that the settings are correct.
Connection Manager changes – It’s possible to use the table to store connection manager information. This is useful when moving the SSIS package through the software engineering cycle from dev onwards.
Configuration Re-use – since the configurations are centrally stored, it is possible to re-use existing configurations. In reality, you probably end up building different packages from the same data stores; so why not re-use? This may save you some time in the longer run.
Package configurations can be your friend! They can really help developers to stick to the software development principle of moving through dev/sys/uat/production. Personally, I don’t like using a production environment as a test environment and I would never recommend it. The usual ‘we don’t have the money for separate environments’ excuse no longer washes, given the availability of Virtual PCs and Hyper-V. To summarise, SQL Server package configurations can help you manage SSIS packages centrally with the minimum work involved in the management process.

Intelligent Laziness in Analysis Services 2008: Working Smart rather than Working Harder!

Further to my earlier theme of Intelligent Laziness, here is some key advice before setting out on the journey of creating Analysis Services cubes. Hopefully this advice will help you to ‘Work Smart’ rather than ‘Work Hard’!

Take the users along with you on the journey
If the end users don’t understand the cube, then they won’t like it; end result, they won’t use the cube. You could work iteratively so that you have a ‘working’ cube for users, and a development cube, so you can comfortably develop in a way that is insulated from users.

Plan, Plan, Plan!
It is important to plan early on in the process about your key success criteria for delivery. It’s easy to be swayed by enthusiastic users to include work that isn’t essential to the project.

Choose the correct version of Analysis Services
It sounds simple, but check you have the correct version of SQL Server for your needs. For example, in SQL Server 2008, Enterprise edition is aimed at large cubes which require partitioning. Motto: if in doubt, choose Enterprise.

End users querying the cube
How are the end users going to query the cube? There are lots of products which allow users to interact with the cube; for example, Reporting Services is the obvious solution for MDX-savvy users. If the users are Excel power-users, then they might find XLCubed a useful addition. If the users want data visualisation par excellence, then Tableau might be worth a look.

Letting Users know what’s in the Cube
It can be difficult to know how best to document the cubes. Try to reduce the time spent responding to user enquiries by relying on a mapping spreadsheet, which traces the lineage of the data from source to final cube measure.

Data Sources
It’s always worth checking how complete the data sources are. The SSAS cube depends on a well-formed data warehouse, and if this isn’t in place, then it is difficult to generate a cube from a source that wasn’t designed for it.

Dates, Dates, Dates
Users need dates; lots of them, and lots of different types, e.g. fiscal calendars, public holiday calendars, fiscal calendars for their business customers, and so on. It is essential to work out what dates they need and, importantly, who is responsible for generating the calendars.
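As a sketch of generating a basic calendar table to feed a date dimension (the column names are hypothetical; extend it with the fiscal and holiday columns your users require), a recursive CTE keeps it to a few lines:

```sql
-- Generate one row per day for 2012 (hypothetical column names)
WITH Dates AS
(
    SELECT CAST('2012-01-01' AS DATE) AS CalendarDate
    UNION ALL
    SELECT DATEADD(DAY, 1, CalendarDate)
    FROM   Dates
    WHERE  CalendarDate < '2012-12-31'
)
SELECT CalendarDate,
       YEAR(CalendarDate)              AS CalendarYear,
       MONTH(CalendarDate)             AS CalendarMonth,
       DATENAME(WEEKDAY, CalendarDate) AS DayName
FROM   Dates
OPTION (MAXRECURSION 366); -- default limit of 100 is too low for a year of days
```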

Cube Calculations
The users can be involved here; ask them to give you their most popular calculations. It is useful to use a mapping spreadsheet to record this information.

Cube Hierarchies
The users can also create the hierarchy structure within the cube, by using sample data as a base. Some hierarchies are straightforward, such as geography. Product hierarchies, for example, may not be very straightforward, and it is easy to make assumptions which are not correct.

Security
It’s better to get the security resources on board sooner rather than later; it is essential to know who can access which calculations, and it’s important to know where the restrictions should be placed, sometimes for legal reasons.

Thanks for reading and hopefully this will save you some time and effort!