SQL Server partitioning can be complex because it’s up to the person with their fingers on the keyboard to decide, for example, how partitions will be swapped in and out. This means that two people will often take completely different approaches to the same partitioning problem. As someone who often goes onsite to deliver projects, I regularly see the same thing done in different ways.
Hive, on the other hand, does not use the sophisticated indexes that allow many RDBMSs to answer queries in seconds. Hive queries tend to run for much longer, and are really intended for complex analytics queries.
In Hive, it is possible to set up an automatic partition scheme at the point at which the table is created. As you can see from the example below, we declare the partition columns as part of the table creation. Hive then creates a subdirectory for each value in the partition column, so a query with a WHERE clause on those columns goes straight to the relevant subdirectory and scans only that subset of the data. The query returns results faster because the partition 'directs' it quickly towards the relevant data. What's great about this, for me, is the consistency: Hive looks after partition creation and maintenance for us. Here is an example table script:
CREATE EXTERNAL TABLE GNI (
CountryName STRING,
GNI INT
)
PARTITIONED BY (CountryID INT, Year INT);
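For example, a query that filters on the partition key columns only has to scan the matching subdirectory, rather than the whole table. A minimal sketch:

-- Only the countryid=1/year=2008 subdirectory is scanned,
-- not the entire GNI dataset
SELECT CountryName, GNI
FROM GNI
WHERE CountryID = 1 AND Year = 2008;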
In Hive, the partition key columns are specified as part of the table DDL, and we can see them quite clearly in the PARTITIONED BY clause. Hive uses them to generate a folder structure that maps directly to the partitions. Here is an example of adding a partition from the command line:
ALTER TABLE GNI ADD PARTITION (countryid = 1, year = 2008)
LOCATION '/user/hue/IncomeInequality/2008/1';
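Once the partition exists, data can be loaded into it directly. This is a minimal sketch, and the source file path is hypothetical:

-- Move a file already in HDFS into the partition created above
LOAD DATA INPATH '/user/hue/staging/gni_2008_1.txt'
INTO TABLE GNI PARTITION (countryid = 1, year = 2008);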
In Hive, partitioning data helps by making queries faster. If you work with data, you'll know that users don't like waiting, and it's important to get results to them as quickly as you can. How does Hive do this? It uses the directory structure to go straight to the correct partition. Here is the resulting file structure for the statements above:
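Based on the LOCATION specified above, the layout looks something like this (the second partition shown here is hypothetical, for illustration):

/user/hue/IncomeInequality/
    2008/
        1/    <- partition (countryid=1, year=2008)
        2/    <- a second, hypothetical partition (countryid=2, year=2008)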
You can also use the SHOW PARTITIONS command to list the partitions in a given table:
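SHOW PARTITIONS GNI;

With only the single partition we added above, the result in Hive would look like this:

countryid=1/year=2008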
You can also use DESCRIBE EXTENDED to provide partition information:
DESCRIBE EXTENDED GNI;
The output lists the table's columns, including the partition key columns, followed by a block of detailed table information.
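Abbreviated, and with the exact formatting varying by Hive version, the relevant portion looks something like this:

countryname    string
gni            int
countryid      int
year           int

Detailed Table Information    Table(tableName:gni, ...
    partitionKeys:[FieldSchema(name:countryid, type:int, comment:null),
    FieldSchema(name:year, type:int, comment:null)] ...)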
Incidentally, the naming structure is flexible and you don’t need to locate partition directories next to one another. This is useful because it means that you could put some partitions on cheaper storage options and keep other pieces of data elsewhere. The naming convention can then make it clear which is your older, archived data and which is your current data.
What are your options for storing data? In the Microsoft sphere, Windows Azure HDInsight Service supports both the Hadoop Distributed File System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage, or ASV, is a Windows Azure storage solution which provides a full-featured HDFS file system interface for Blob Storage, enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blob Storage. Blob Storage is a relatively cheap solution, and storing data in Blob Storage means that the HDInsight clusters used for computation can be safely deleted without losing user data, because the data is kept separately from the cluster.
For example, you could keep current data on HDFS and archive data in Azure Blob Storage by specifying different locations for your partitions, as shown below. It is also possible to scale horizontally by simply adding further servers to the cluster, which is a straightforward ‘hardware’ oriented approach that people understand: if you want more, buy a server. That seems to be a common route for solving problems, whether it is the correct one or not!
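As a sketch, a partition holding archived data could point at Blob Storage while newer partitions stay on HDFS. The container name archive is hypothetical, and HDInsight used the asv:// URI scheme at the time of writing:

-- Hypothetical archive partition stored in Azure Blob Storage
ALTER TABLE GNI ADD PARTITION (countryid = 1, year = 2000)
LOCATION 'asv://archive/IncomeInequality/2000/1';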
In the next segment, we will start to look at HiveQL, which focuses on getting data into the filesystem and back out again. For a Business Intelligence person like me, it’s all about the data, and I hope you’ll join me for the next adventure.
I hope that helps,
Jen