Eating the elephant, one bite at a time: Partitioning in SQL Server 2012 and in Hive

Hive and SQL Server offer you the facility to partition your tables, but their features differ slightly. This blog will highlight some of the main differences for you to look out for.
What problems does partitioning solve? The problems arise from the sheer size of the tables, so the simplest solution is to divide a table up into smaller parts, known as partitions. When a database table grows to hundreds of gigabytes or more, it becomes more difficult to complete actions such as:
* load new data
* remove old data
* maintain indexes 
This is simply down to the size of the table: the operations take longer because they have to traverse more data. It is not only reads that are affected; the same applies to other operations such as INSERT or DELETE.
How does partitioning solve the problem? 
In SQL Server, you have table partitioning features, which can make very large tables and indexes easier to manage, and improve the performance of appropriately filtered queries. If you’d like more information, see Joe Sack’s excellent presentation here.
The SQL Server optimiser can direct a query to the relevant partitions of the table, rather than scanning the whole table; this is known as partition elimination. Since the table is broken down into manageable chunks, maintenance operations can also work on a small number of partitions at a time, rather than on one massive lump of data. To get this benefit, your queries need to filter on the partitioning key, so that the optimiser knows which partitions it can skip.
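
To make this concrete, here is a minimal, hypothetical sketch (the Orders table, its columns and the boundary values are invented for illustration, and everything is mapped to the PRIMARY filegroup to keep it short) of a partition function, a partition scheme, a table partitioned on OrderDate, and a query whose filter on the partitioning key lets the optimiser touch only one partition:

-- Hypothetical example: partition an Orders table by year on OrderDate.
CREATE PARTITION FUNCTION pfOrderDate (DATE)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01', '2013-01-01');

-- Map every partition to the PRIMARY filegroup, purely to keep the sketch short.
CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders (
    OrderID   INT   NOT NULL,
    OrderDate DATE  NOT NULL,
    Amount    MONEY NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderDate, OrderID)
) ON psOrderDate (OrderDate);

-- Filtering on the partitioning key allows partition elimination:
SELECT SUM(Amount)
FROM dbo.Orders
WHERE OrderDate >= '2012-01-01' AND OrderDate < '2013-01-01';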

SQL Server partitioning can be complex because it’s up to the person with their fingers on the keyboard to decide, for example, how partitions will be switched in and out. This means that two people will often approach the same partitioning problem in completely different ways. As someone who often goes onsite to deliver projects, I regularly see the same thing done in very different ways.
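
The ‘switching in and out’ is usually done with partition switching. As a rough sketch, continuing the hypothetical Orders example above (the staging and archive tables are placeholders; they would need to match the partitioned table’s structure, carry a suitable CHECK constraint and sit on the appropriate filegroup for SWITCH to succeed):

-- Switch a fully loaded staging table in as the newest partition...
ALTER TABLE dbo.Orders_Staging SWITCH TO dbo.Orders PARTITION 4;

-- ...switch the oldest partition out to an archive table...
ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.Orders_Archive;

-- ...and then remove the now-empty boundary from the partition function.
ALTER PARTITION FUNCTION pfOrderDate() MERGE RANGE ('2011-01-01');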

On the other hand, Hive does not use the sophisticated indexes that allow many RDBMSs to answer queries in seconds. Hive queries tend to run for much longer and are really intended for complex analytical queries, so partitioning is one of the main ways of cutting down the amount of data each query has to read.

In SQL Server, what types of partitioning are available?
There are two types of partitioning in SQL Server:
* Manually subdivide a large table’s data into multiple physical tables
* Use SQL Server’s table partitioning feature to partition a single table
The first option is not a built-in feature of SQL Server; it has to be designed and maintained by hand, which requires extensive DBA skills to implement well.
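
One common way of doing that manual subdivision is to create one physical table per range of data and stitch them back together with a UNION ALL view (a partitioned view). The sketch below is purely illustrative and the table names are invented; the CHECK constraints are what allow the optimiser to skip tables that cannot contain the requested rows:

-- One physical table per year, each with a CHECK constraint on its date range.
CREATE TABLE dbo.Orders2011 (
    OrderID   INT   NOT NULL,
    OrderDate DATE  NOT NULL CHECK (OrderDate >= '2011-01-01' AND OrderDate < '2012-01-01'),
    Amount    MONEY NOT NULL
);
CREATE TABLE dbo.Orders2012 (
    OrderID   INT   NOT NULL,
    OrderDate DATE  NOT NULL CHECK (OrderDate >= '2012-01-01' AND OrderDate < '2013-01-01'),
    Amount    MONEY NOT NULL
);
GO

-- A view presents them as a single logical table.
CREATE VIEW dbo.OrdersAll AS
    SELECT OrderID, OrderDate, Amount FROM dbo.Orders2011
    UNION ALL
    SELECT OrderID, OrderDate, Amount FROM dbo.Orders2012;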

In Hive, it is possible to set up an automatic partitioning scheme at the point at which the table is created. As you can see from the example below, the partition columns are declared as part of the table creation, and Hive creates a subdirectory for each value in the partition columns. Queries with a WHERE clause on those columns go straight to the relevant subdirectories and scan only that subset of the data, so the results come back faster: the partitions ‘direct’ the query quickly towards the relevant data. What’s great about this, for me, is the consistency: Hive looks after the partition creation and maintenance for us. Here is an example table script:

CREATE EXTERNAL TABLE GNI (
    CountryName STRING,
    GNI INT
    )
PARTITIONED BY (CountryID INT, Year INT);
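
With that in place, a query that filters on the partition columns only has to read the matching subdirectory rather than the whole table, for example:

-- Only the subdirectory for countryid=1/year=2008 is scanned.
SELECT CountryName, GNI
FROM GNI
WHERE CountryID = 1 AND Year = 2008;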

In Hive, the partitioning key columns are specified as part of the table DDL, and we can see them quite clearly in the PARTITIONED BY clause. Hive uses them to build the folder structure, so that each folder corresponds to one partition. Here is an example of adding a partition from the command line:

ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2008)
LOCATION 'user/hue/IncomeInequality/2008/1';
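
Once the partition exists, you can load data into it. Here is a minimal sketch (the source file path is just an assumption):

-- Move a file from HDFS into the partition we just added.
-- The staging path here is hypothetical.
LOAD DATA INPATH '/user/hue/staging/gni_2008_1.txt'
INTO TABLE GNI PARTITION (countryid = 1, year = 2008);

Because GNI is an external table, you could equally just copy files into the partition's directory yourself; LOAD DATA simply moves the file there for you.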

In Hive, partitioning data helps by making queries faster. If you work with data, you’ll know that users don’t like waiting for it, and it’s important to get it to them as quickly as you can. How does Hive do this? It uses the directory structure to zoom straight to the correct partition, rather than reading the whole table. Here is the resulting file structure for the statements above:

Partition Location in Hive

You can also use the SHOW PARTITIONS command to list the partitions in a given table:
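
SHOW PARTITIONS GNI;

Here is the result in Hive: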

Partition Show Command Result

You can also use DESCRIBE EXTENDED to display partition information:

DESCRIBE EXTENDED GNI;

Here is the resulting output:

Partition Describe extended

Incidentally, the naming structure is flexible, and you don’t need to locate partition directories next to one another. This is useful because it means you could put some data on cheaper storage options and keep other pieces of data elsewhere, and the naming lets you clearly distinguish your older, archived data from your current data.

What are your options for storing data? In the Microsoft sphere, the Windows Azure HDInsight Service supports both the Hadoop Distributed File System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage, or ASV, is a Windows Azure storage solution that provides a full-featured HDFS file system interface over Blob Storage, enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data it manages. Blob Storage is a relatively cheap option, and because the data lives separately from the compute cluster, the HDInsight clusters used for computation can be safely deleted without losing user data.

For example, you could keep current data on HDFS and archive data in Azure Blob Storage simply by specifying different locations for your partitions. It is also an option to scale horizontally by adding further servers to the cluster, which is a simple, ‘hardware’-oriented approach that people understand: if you want more, buy another server. That seems to be a common route for solving problems, whether it is the correct one or not!
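
As a rough sketch (the storage account, container and paths below are invented, and the exact Blob Storage URI scheme depends on your HDInsight version and configuration), you might keep recent partitions on the cluster’s HDFS and point older, archive partitions at Blob Storage:

-- Recent data on the cluster's own HDFS.
ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2013)
LOCATION 'hdfs:///user/hue/IncomeInequality/2013/1';

-- Archive data on Windows Azure Blob Storage; account, container and path are placeholders.
ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2005)
LOCATION 'asv://mycontainer@mystorageaccount.blob.core.windows.net/IncomeInequality/2005/1';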

In the next segment, we will start to look at HiveQL, which focuses on getting data into the filesystem and back out again. For a Business Intelligence person like me, it’s all about the data, and I hope you’ll join me for the next adventure.

I hope that helps,
Jen
