Note to Self: A roundup of the latest Azure blog posts and whitepapers on polybase, network security, cloud services, Hadoop and Virtual Machines

Here is a roundup of Azure blogs and whitepapers which I will be reading this month.

This is the latest as at June 2014, and there is a focus on cloud security in the latest whitepapers, which you can find below..

·         PolyBase in APS – Yet another SQL over Hadoop solution?
·         Desktop virtualization deployment overview
·         Microsoft updates its Hadoop cloud solution
·         LG CNS build a B2B virtual computer service in the cloud
·         Deploying desktop virtualization
·         Microsoft updates its Hadoop cloud solution
·         Accessing desktop virtualization
·         The visualization that changed the world of data
·         Access and Information Protection: Setting up the environment
·         Access and Information Protection: Making resources available to users
·         Access and Information Protection: Simple registration for BYOD devices
·         Success with Hybrid Cloud webinar series
·         Power BI May round-up
·         Access and Information Protection: Syncing and protecting corporate information

Here are the latest whitepapers, which focus on security:

Windows Azure Security: Technical Insights. Update to the Security Overview whitepaper which provides a detailed description of security features and controls.
  • Security Best Practices for Windows Azure Solutions. Updated guidance on designing and developing secure solutions.
  • Windows Azure Network Security. Recommendations for securing network communications for applications deployed in Windows Azure.
  • Microsoft Antimalware for Azure Cloud Services and Virtual Machines This paper details how to use Microsoft Antimalware to help identify and remove viruses, spyware, and other malicious software in Azure Cloud Services and Virtual Machines.
  • Data Visualisation with Hadoop, Hive, Power BI and Excel 2013 – Slides from SQLPass Summit and SQLSaturday Bulgaria

    I presented this session at SQLPass Summit 2013 and at SQLSaturday Bulgaria.

    The topic focuses on some data visualisation theory, an overview of Big Data and finalises the Microsoft distribution of Hadoop. I will try to record the demo as part of a PASS Business Intelligence Virtual Chapter online webinar at some point, so please watch this space.

    I hope you enjoy and I look forward to your feedback.

    Hadoop Summit Europe 2014 Call for Abstracts is now open

    Hadoop Summit Europe 2014 Call for Abstracts is now open

    If you are interested in registering, please click here. Good luck! The call for Abstracts for the EMEA Hadoop Summit is now officially open. FYR the closing date is 31st October 2013.
    Who should submit? If you are a developer, architect, administrator, data analyst, data scientist, IT or business leader or otherwise involved with Apache Hadoop and have a compelling topic that you would like to present at Hadoop Summit, the Hadoop folks will welcome your submission. 
    What do you get?
    Being part of a fun Hadoop crowd! Being a speaker is all kinds of cool. (For me, this is the best bit!)
    All presenters receive a complimentary all-access pass to Hadoop Summit so you get to learn too. 
    What should you submit?
    The content selection committee is particularly interested in compelling use cases and success stories, best practices, cautionary tales and technology insights that help to advance the adoption of Apache Hadoop.
    Tracks this year include:
    • Committer – Speakers in this track are restricted to committers across all Hadoop-related Apache projects only and content will be curated by a group of senior committers
    • The Future of Apache Hadoop – Investigating the key projects, incubation projects and the industry initiatives driving Hadoop innovation.
    • Data Science & Hadoop – Discussing applications, tools, and algorithms, research and emerging applications that use and extend the Hadoop platform for data science
    • Hadoop Deployment & Operations – Focusing on deployment, operation and administration of Hadoop clusters at scale, with an emphasis on tips, tricks, best practices and war stories
    • Hadoop for Business Applications and Development – Presentation topics that discuss the languages, tools, techniques, and solutions for deriving business value from data.

    Submission deadline October 31, 2013.

    Good luck!

    Eating the elephant, one bite at a time: Loading data using Hive

    In the previous ‘Eating the elephant’ blogs, we’ve talked about tables and their implementation. Now, we will look at one of the ways to get data into a table. There are different ways to do this, but here we will look only at getting data into an external Hive table using HiveQL, which is the Hive query language. Sounds similar to SQL, doesn’t it? Well, that’s because a lot of it looks and smells similar. When you look at this example here, you will see what I mean!

    Here is the Hive syntax to create an external table. In this post, we will look at simple examples, and perhaps look at partitioned tables on another occasion.

    First things first; you need to create an external table to hold your data. Before you do, check the contents of the folder /user/hue/UNICEF if it exists. If not, don’t worry, we have the rest of the syntax below:

    Create your table

    CountryName     STRING,
    AvgRankPosition     FLOAT,
    MaterialWellBeing     FLOAT,
    HealthAndSafety     FLOAT,
    EducationalWellBeing     FLOAT,
    FamilyPeersRelationships     FLOAT,
    BehavioursAndRisks     FLOAT,
    SubjectiveWellBeing     FLOAT)
    LOCATION '/user/hue/UNICEF';
    Note that the directory has now been created: /user/hue/UNICEF If it was not there already. 
    You may now see it in the File Explorer, and here is an example below:


    The next step is to upload some data to folders. For the purpose of showing what Hive does when you load up data files, let’s place a file into a different folder, called /user/hue/IncomeInequality/ and the file will be called ‘UNICEFreportcard.csv’ The syntax to load the file statically is here:

    LOAD DATA INPATH ‘/user/hue/IncomeInequality/UNICEFreportcard.csv’ OVERWRITE INTO TABLE `unicefreportcard`;

    You can see this here, and if it is not clear, you can click on the link to go to Flickr to see it in more detail:


    Once you have executed the query, you can check that the data is in the table. This is easy to do, using the HUE interface. Simply to go the Beeswax UI, which you can see as the yellow icon at the top left hand side of the HUE interface, and then select the ‘Tables’ menu item from the grey bar just under the green bar. You can select the UNICEFreportcard table, and then choose ‘Sample’. You should see data, as in the example screenshot below:


    When the data is imported, where does it get stored for the external table?

    The UNICEFreportcard file was originally in the IncomeInequality folder, but now the original UNICEFreportcard.csv file is moved from the IncomeInequality folder. Where has it gone? It has been moved to the UNICEF folder, which was specified as the location when we created the external table.

    When we drop the table, the data file is not deleted. We can still see the CSV file in the UNICEF folder. This is straightforward to test; simply drop the table using the button, and then use File Explorer to go to UNICEF folder and you will see the CSV. Here it is:


    To test the separation of the data and the table metadata, simply re-run the query to create the table as before. Now, when you view a sample of the data, you see the data as you did before. When you dropped the table originally, you simply removed the table metadata; the data itself was stored. The reason for this is due to the fact that this table is an External table, and Hive preserves a separation between the data and the table metadata.

    If the table was an internal table, if you drop the table, the data will be deleted as well – so be careful!

    To summarise, in this post,we’ve looked at one way of getting data into Hive. Our next steps will be to do something with the data, now that it is in Hive. So let’s move on to do some simple querying before we proceed to start to visualise the data.

    I hope that helps!


    Eating the elephant, one bite at a time: Partitioning in SQL Server 2012 and in Hive

    Hive and SQL Server offer you the facility to partition your tables, but their features differ slightly. This blog will highlight some of the main differences for you to look out for.
    What problems does partitioning solve? The problem arises due to the size of the tables. Therefore, the simplest solution is to divide the table up into smaller parts. These are called partitions. When a database table grows in size to the hundreds of gigabytes or more, it can become more difficult to complete actions such as: 
    * load new data
    * remove old data
    * maintain indexes 
    This is due to the size of the table, which simply means that the operations take longer since they must traverse more data. This does not only apply to reading data; it also applies to other operations such as INSERT or DELETE. 
    How does partitioning solve the problem? 
    In SQL Server, you have table partitioning features, which can make very large tables and indexes easier to manage, and improve the performance of appropriately filtered queries. If you’d like more information, see Joe Sack’s excellent presentation here
    The SQL Server Optimiser can send the query to the correct partition of the table, rather than sending the query to the whole table. Since the table is broken down into manageable chunks, it is easier for maintenance operations to manage a smaller number of partitions, rather than a massive lump of data in a table. SQL Server partitioning means that you need to use the partitioned keys everywhere in a partitioned table, so that the query can run more efficiently.

    SQL Server partitioning can be complex because it’s up to the SQL Server person, with their fingers on the keyboard, to decide how partitions will be swapped in and out, for example. This means that, often, two people might do two completely different things when it comes to deciding how the partitioning should go. As someone who often goes onsite to deliver projects, this means that you can sometimes see different things, done different ways.

    On the other hand, Hive does not use sophisticated indexes like many RDBMS which are able to answer queries in seconds. Hive queries tend to run for much longer, and are really intended for complex analytics queries.

    In SQL Server, what types of partitioning are available?
    There are two types of partitioning in SQL Server:
    Manually subdivide a large table’s data into multiple physical tables
    Use SQL Server’s table partitioning feature to partition a single table
    The first option is not available by default in SQL Server, and would require extensive DBA skills to implement.

    In Hive, it is possible to set up an automatic partition scheme at the point at which the table is created.  As you can see from the example below, we create partitioned columns as part of the table creation. This creates a subdirectory for each value in the partition column. This means that queries that have a WHERE clause will go straight to the relevant subdirectory and scan the subdirectory, therefore looking only at  a subset of the data. This means that the query returns the results faster, by use of the partition to ‘direct’ the query quickly towards the relevant data.What’s great about this, for me, is the consistency. Hive looks after the partition maintenance and creation for us.  Here is an example table script:

        CountryName STRING,
        GNI INT

    In Hive, partitioning key columns are specified as part of the table DDL and we can see them quite clearly in the PARTITIONED BY clause.  This helps Hive to use the folder structure to generate folders which are married to the partition. Here is an example of adding a partition using the command line:

    ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2008)
    LOCATION ‘user/hue/IncomeInequality/2008/1’;

    In Hive, partitioning data can help by making queries faster. If you work with data, you’ll know that users don’t like waiting for data and it’s important to get it to them as quickly as you can. How does it do this? Hive uses the directory structure to zoom quickly straight to the correct partition, based on the file structure. Here is the resulting file structure for our queries above:

    Partition Location in Hive

    You can also use the ‘Show Partitions ‘ command to show the partitions in a given table. Here is the result in Hive:

    Partition Show Command Result

    You can also use ‘DESCRIBE EXTENDED ‘ to provide partition information:


    Here is the resulting file:

    Partition Describe extended

    Incidentally, the naming structure is flexible and you don’t need to locate partition directories next to one another. This is useful because it means that you could locate some data on some cheaper storage options, and keep other pieces of data elsewhere. This means that you could clearly indicate, by naming, what is your older data from your cheaper data.

    What are your options for storing data? In the Microsoft sphere, Windows Azure HDInsight Service supports both Hadoop Distributed Files System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage, or ASV, is a Windows Azure storage solution which provides a full featured HDFS file system interface for Blob Storage that enables the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blog Storage. Blob Storage is not a relatively cheap solution, and storing data in Blob Storage enables the HDInsight clusters used for computation to be safely deleted without losing user data due to their separation.

    For example, you could store archive on HDFS and Azure, for example, by specifying different locations for your partitions. It is also an option to scale horizontally by simply adding further servers to the cluster, which is a simple ‘hardware’ oriented approach that people understand i.e. if you want more, buy a server, which seems to be a common route for solving problems, whether it is the correct one or not!

    In the next segment, we will start to look at HiveQL, which focuses on getting data into the filesystem and back out again. Being a Business Intelligence person, it’s all about the data and I hope you’ll join me for the next adventure.

    I hope that helps,

    Eating the elephant one bite at a time: dropping databases

    In the last post, you learned how simple it is to create a database using Hive. The command is very similar to the way that we do this in SQL Server. As discussed, the underlying technology works differently, but ultimately they achieve the same end; database created.

    Now that you’ve created your database, how do you drop them again?  Fortunately, this is very simple to know how to do, if you’re already a SQL Server  aficionado.

    Here is an example:

    DROP DATABASE IncomeInequality;

    If you want to drop the database without  error messages if the database doesn’t exist, then you can use:

    DROP DATABASE IF EXISTS IncomeInequality;

    Dropping the database in Hive isn’t straightforward, however. In SQL Server, when you drop a database, it removes all of the database and the disk files used by the database. On the other hand, Hive will not allow you to drop a database if it contains tables. To get around this default behaviour, you have to add the CASCADE command to the command.


    Using this command will delete the tables first, and then drop the database.  Then, like SQL Server, its directory is deleted too.

    Once you’ve executed the command, you should double-check that the database has gone:


    You can then see the result in Hue (more on this later! Let’s keep it small bites of the elephant for now!)


    Here, we have dropped the database and we only have the default left.

    In Hive, we will look at dropping tables in the next post. This is more complex than it first seems.
    I hope that helps!


    Eating the elephant one bite at a time: creating a database using Hive

    Following on from the first part in my Hadoop series for the Microsoft Business Intelligence professional, we move to the next stage where we look at the simplicity of creating a database.

    Before we move forward, ensure that the Hortonworks Sandbox is up and running. Our objective today is to create a new database called ‘IncomeInequality’. Before we create a database, let’s see what is in there already. To do this, we use the SHOW command:


    If you are using the Query Editor, then the query still returns the answer if you don’t append the semi-colon at the end of the sentence.

    On the other hand, if you use the Hive Command Console in HDInsight, for example, it will expect the semi-colon. I’ve just got accustomed to adding it, so you will see it in the examples here.

    CREATE DATABASE IncomeInequality;

    This small statement will do what it says; create you a database called IncomeInequality.  As before, I have appended the semi-colon.

    Now when we execute the command to show databases, we can see that the new database has been added:


    As you can imagine, with BigData, you may well have a big number of databases!

    If you want to restrict the number of databases shown, you can use a Regular Expression to be more specific about the name of the database you are looking for.  Here is an example where we are looking for the database beginning with ‘i’:


    In this case, the list of databases shown includes only the names of the databases that begins with the letter ‘i’.  Here is an example:


    So far, so good. What actually happens when you create a database?

    The structure is very neat: Hive creates a new directory for each database. The tables within the database are stored in subdirectories of the original database directories.

    This way of database creation is different from SQL Server. When SQL Server creates a database, the following steps happen:

     – the model database is used as a ‘model’ (hence the name!) to initialise the new database and its metadata. User-defined objects in the Model database are copied to the new database. If a size is not specified, the new primary data file is given the same size as the primary data file for the Model database.
     – SQL Server then associates a Service Broker GUID to name the database.
     – two files are created: a primary file and a transaction log file.
     – the database fills the rest of the database with empty pages.

    To summarise, using Hive can also be very simple. However, there are ways in which we could add more detail to the database, and we will look at that in our next topic.

    I hope that helps someone!
    Kind Regards,