Eating the elephant, one bite at a time: Loading data using Hive

In the previous ‘Eating the elephant’ blogs, we’ve talked about tables and their implementation. Now, we will look at one of the ways to get data into a table. There are different ways to do this, but here we will look only at getting data into an external Hive table using HiveQL, which is the Hive query language. Sounds similar to SQL, doesn’t it? Well, that’s because a lot of it looks and smells similar. When you look at this example here, you will see what I mean!

Here is the Hive syntax to create an external table. In this post, we will look at simple examples, and perhaps look at partitioned tables on another occasion.

First things first: you need to create an external table to hold your data. Before you do, check the contents of the folder /user/hue/UNICEF, if it exists. If it does not, don't worry; Hive will create it for you when you run the syntax below:

Create your table
CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard (
    CountryName                STRING,
    AvgRankPosition            FLOAT,
    MaterialWellBeing          FLOAT,
    HealthAndSafety            FLOAT,
    EducationalWellBeing       FLOAT,
    FamilyPeersRelationships   FLOAT,
    BehavioursAndRisks         FLOAT,
    SubjectiveWellBeing        FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/UNICEF';
 
 
Note that the directory /user/hue/UNICEF has now been created, if it was not there already. You may now see it in the File Explorer, as in the example below:

1_NewDirectoryCreatedWithTable

The next step is to upload some data. For the purpose of showing what Hive does when you load up data files, let's place a file into a different folder, called /user/hue/IncomeInequality/; the file will be called 'UNICEFreportcard.csv'. The syntax to load the file statically is here:

LOAD DATA INPATH '/user/hue/IncomeInequality/UNICEFreportcard.csv' OVERWRITE INTO TABLE `unicefreportcard`;

You can see this here, and if it is not clear, you can click on the link to go to Flickr to see it in more detail:

2_HiveInsertDataIntoTable

Once you have executed the query, you can check that the data is in the table. This is easy to do using the HUE interface. Simply go to the Beeswax UI, which you can see as the yellow icon at the top left-hand side of the HUE interface, and then select the 'Tables' menu item from the grey bar just under the green bar. You can select the UNICEFreportcard table, and then choose 'Sample'. You should see data, as in the example screenshot below:

1_DataInsertedResult
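If you prefer to check from the query editor rather than relying on the GUI, a simple query along these lines should return rows; this is just a sketch, and the LIMIT clause keeps the output small:

SELECT * FROM unicefreportcard LIMIT 10;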

When the data is imported, where does it get stored for the external table?

The UNICEFreportcard.csv file was originally in the IncomeInequality folder, but after the load it is no longer there. Where has it gone? It has been moved to the UNICEF folder, which was specified as the location when we created the external table.

When we drop the table, the data file is not deleted; we can still see the CSV file in the UNICEF folder. This is straightforward to test: simply drop the table using the button, and then use File Explorer to go to the UNICEF folder, where you will see the CSV. Here it is:

4_FileLocationforExternalFiles
 

To test the separation of the data and the table metadata, simply re-run the query to create the table as before. Now, when you view a sample of the data, you see the data as you did before. When you dropped the table originally, you simply removed the table metadata; the data itself was preserved. This is because the table is an external table, and Hive maintains a separation between the data and the table metadata.
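As a minimal sketch, the whole round trip looks something like this, re-using the CREATE statement from earlier:

-- Drop the table: only the metadata goes; the CSV stays in /user/hue/UNICEF
DROP TABLE IF EXISTS unicefreportcard;

-- Re-create the external table over the same location
CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard (
    CountryName STRING,
    AvgRankPosition FLOAT,
    MaterialWellBeing FLOAT,
    HealthAndSafety FLOAT,
    EducationalWellBeing FLOAT,
    FamilyPeersRelationships FLOAT,
    BehavioursAndRisks FLOAT,
    SubjectiveWellBeing FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/UNICEF';

-- The data should still be there, because the CSV file survived the DROP
SELECT COUNT(*) FROM unicefreportcard;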

If the table were an internal table, dropping it would delete the data as well – so be careful!

To summarise, in this post we've looked at one way of getting data into Hive. Our next step will be to do something with the data, now that it is in Hive. So let's move on to do some simple querying before we start to visualise the data.

I hope that helps!
Jen

 
 

Eating the elephant, one bite at a time: Partitioning in SQL Server 2012 and in Hive

Hive and SQL Server offer you the facility to partition your tables, but their features differ slightly. This blog will highlight some of the main differences for you to look out for.
What problems does partitioning solve? When a database table grows in size to the hundreds of gigabytes or more, it can become more difficult to complete actions such as:
* load new data
* remove old data
* maintain indexes 
These operations take longer simply because they must traverse more data, and this does not only apply to reading data; it also applies to operations such as INSERT or DELETE. The simplest solution is to divide the table up into smaller parts, which are called partitions.
How does partitioning solve the problem? 
In SQL Server, you have table partitioning features, which can make very large tables and indexes easier to manage and can improve the performance of appropriately filtered queries. If you'd like more information, see Joe Sack's excellent presentation here.
The SQL Server optimiser can send the query to the correct partition of the table, rather than to the whole table. Since the table is broken down into manageable chunks, it is also easier for maintenance operations to work with a smaller number of partitions, rather than one massive lump of data. SQL Server partitioning does mean that you need to filter on the partitioning key throughout your queries against a partitioned table, so that they can run more efficiently.

SQL Server partitioning can be complex because it's up to the SQL Server person, with their fingers on the keyboard, to decide, for example, how partitions will be swapped in and out. This means that two people might do two completely different things when deciding how the partitioning should go. As someone who often goes onsite to deliver projects, I can say that you sometimes see the same thing done in quite different ways.

On the other hand, Hive does not have the sophisticated indexes that allow many RDBMSs to answer queries in seconds. Hive queries tend to run for much longer, and are really intended for complex analytical queries.

In SQL Server, what types of partitioning are available?
There are two types of partitioning in SQL Server:
* manually subdividing a large table's data into multiple physical tables
* using SQL Server's table partitioning feature to partition a single table
The first option is not a built-in feature, and would require extensive DBA skills to implement.

In Hive, it is possible to set up an automatic partition scheme at the point at which the table is created. As you can see from the example below, we declare the partition columns as part of the table creation. Hive creates a subdirectory for each value in a partition column, which means that queries with a WHERE clause on that column go straight to the relevant subdirectory and scan it alone, therefore looking only at a subset of the data. The query returns its results faster because the partition 'directs' it quickly towards the relevant data. What's great about this, for me, is the consistency: Hive looks after the partition creation and maintenance for us. Here is an example table script:

CREATE EXTERNAL TABLE GNI (
    CountryName STRING,
    GNI INT
    )
PARTITIONED BY (CountryID INT, Year INT);

In Hive, partitioning key columns are specified as part of the table DDL, and we can see them quite clearly in the PARTITIONED BY clause. Hive uses this to generate a folder structure whose folders are married to the partitions. Here is an example of adding a partition using the command line:

ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2008)
LOCATION '/user/hue/IncomeInequality/2008/1';

In Hive, partitioning data can help by making queries faster. If you work with data, you'll know that users don't like waiting for it, and it's important to get it to them as quickly as you can. How does Hive do this? It uses the directory structure to go straight to the correct partition.
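To see the benefit, here is a sketch of a query that filters on the partition columns; Hive only needs to scan the matching subdirectory, rather than the whole table:

-- Only the subdirectory for countryid=1, year=2008 is scanned
SELECT CountryName, GNI
FROM GNI
WHERE countryid = 1 AND year = 2008;

Here is the resulting file structure for our queries above: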

Partition Location in Hive

You can also use the SHOW PARTITIONS command to list the partitions in a given table.
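For our GNI table, assuming the partition added above, the command looks like this:

SHOW PARTITIONS GNI;

Here is the result in Hive: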

Partition Show Command Result

You can also use ‘DESCRIBE EXTENDED ‘ to provide partition information:

DESCRIBE EXTENDED GNI;

Here is the resulting output:

Partition Describe extended

Incidentally, the naming structure is flexible, and you don't need to locate partition directories next to one another. This is useful because it means that you could put some data on cheaper storage options and keep other pieces of data elsewhere. It also means that you could clearly indicate, through your naming convention, which is your older, archived data and which is your current data.
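As a sketch, and using a hypothetical archive path, an older partition could be placed on a different storage location like this:

-- Hypothetical path: an older year's data kept on cheaper storage
ALTER TABLE GNI ADD PARTITION(countryid = 1, year = 2005)
LOCATION '/archive/IncomeInequality/2005/1';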

What are your options for storing data? In the Microsoft sphere, the Windows Azure HDInsight Service supports both the Hadoop Distributed File System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage, or ASV, is a Windows Azure storage solution which provides a full-featured HDFS file system interface for Blob Storage, enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blob Storage. Blob Storage is a relatively cheap solution, and storing data in Blob Storage means that the HDInsight clusters used for computation can be safely deleted without losing user data, because the storage and the compute are kept separate.

For example, you could store archive data on HDFS and other data in Azure Storage by specifying different locations for your partitions. It is also an option to scale horizontally by simply adding further servers to the cluster, which is a simple, 'hardware'-oriented approach that people understand (if you want more, buy a server), and seems to be a common route for solving problems, whether it is the correct one or not!

In the next segment, we will start to look at HiveQL, which focuses on getting data into the filesystem and back out again. Being a Business Intelligence person, it’s all about the data and I hope you’ll join me for the next adventure.

I hope that helps,
Jen

Eating the elephant one bite at a time: knowing your tables

In this segment, we will look at the most basic unit of the database: tables. We will look at their implementation in SQL Server and also in Hive.

Tables, in SQL Server, contain the data in a database. SQL Server has a number of different types of table in addition to the standard tables defined by the user. These table types include partitioned tables, temporary tables, system tables and wide tables. If you need more information on these types, please visit the official Microsoft documentation here.

In Hive, there are two types of tables: internal and external tables. In our last topic, we touched on tables and how we need to drop them in Hive before we can delete the database (unless you use the CASCADE command). Cindy Gross has written an excellent blog topic on internal vs external tables, and you can access it here.

For SQL Server people, it is important not to think that the internal/external point is similar to the local/global temporary tables discussion. Local temporary tables are visible only to the current session, whereas global temporary tables are visible to all sessions. Temporary tables in SQL Server cannot be partitioned.

It is important to note that Hive tables are not akin to SQL Server system tables, where SQL Server stores its internal knowledge about the configuration of the server. Users cannot query or update system tables directly; instead, they view the information via system views.

Instead, the internal/external point refers to the level at which Hive manages the table, and whether Hive sees the data as being shared across different tools or as used by Hive only. Unlike SQL Server system tables, Hive's internal tables are not used to store data about the system itself. Both internal and external tables in Hive can be partitioned, and we will look at this point further in a later post.

For our purposes here of eating the Hadoop elephant, one bite at a time, we will look at the differences between these table types. Here is a summary below:

Managed / internal tables – Hive owns the data; dropping the table deletes both the table metadata and the actual data.
External tables – Hive points at the data but does not own it; dropping the table deletes the metadata, not the actual data.
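To make the contrast concrete, here is a minimal sketch of creating a managed (internal) table; the table name is hypothetical, and note the absence of the EXTERNAL keyword and the LOCATION clause, which means Hive stores, and owns, the data under its warehouse directory:

-- A managed table: Hive controls both the metadata and the data
CREATE TABLE IF NOT EXISTS ManagedExample (
    CountryName STRING,
    AvgRankPosition FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Dropping it deletes the data files as well as the metadata
DROP TABLE ManagedExample;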

The data for the table we are going to create is taken from the UNICEF Report Card, which details children's wellbeing in different countries. Here, the data looks at children's wellbeing in rich countries, and you can read the report here.


Creating an External Table

In Hue, you need to be sure that you are pointing at the correct database. On the left-hand side, you will see a drop-down list of databases. We are going to choose the database 'IncomeInequality'.

The external table can be created using the following statement, which is terminated by a semi-colon “;”.

CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard (
    CountryName STRING,
    AvgRankPosition FLOAT,
    MaterialWellBeing FLOAT,
    HealthAndSafety FLOAT,
    EducationalWellBeing FLOAT,
    FamilyPeersRelationships FLOAT,
    BehavioursAndRisks FLOAT,
    SubjectiveWellBeing FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/UNICEF';


This query can be copied and pasted into the Query Editor, and then you can execute it simply by pressing the Execute button.

In order to check that the table has been created, you can also run a simple SELECT statement in the Beeswax Query Editor to see if the data exists:

SELECT * FROM IncomeInequality.UNICEFReportCard;

This will return no data, but if it does not error, then we know that the table has been found.

Alternatively, you can click on the Beeswax (Hive UI) and then select the ‘Tables’ button.

Make sure you've selected the correct database, IncomeInequality, and then you will see the table on the right-hand side.

When you click on the table, you can see two options: you can view the columns, or view a sample of the data.

So, how do we get data into the table? In SQL Server, we'd use SQL Server Integration Services. In Hive, using the Beeswax GUI, we load up some data files. The external table is essentially a structure over the data, so if we drop the table, we don't delete the data; it still resides in the data files.

There are a number of ways to load files. The simplest is through the GUI: simply select 'Upload' and choose your file. If you have more than one file, you could select a zip archive and your files will be loaded. One caveat here: when I was using Internet Explorer version 10.9, the 'Upload' button didn't seem to work properly. I switched to Firefox, and it worked fine. Once you've uploaded the file, you can see it in Beeswax.

8 Beeswax File is uploaded

Now, if you go back and view the table, you can see it contains the data from the file. Here is an example below:

9 UNICEF Report Card data is loaded

You could also execute the following command from the Hive Query Editor in Beeswax:

LOAD DATA INPATH '/user/hue/UNICEF/UNICEFreportcard.csv' OVERWRITE INTO TABLE `IncomeInequality.unicefreportcard`;

Using the interface or queries, how can you tell whether a table is managed (internal) or external? Hive has a command which, rather like sp_help in SQL Server, gives you details about a table. The Hive command is given here:

DESCRIBE EXTENDED and then the name of the table, for example:

DESCRIBE EXTENDED UNICEFReportCard;
We can see the output below. If you click on the image, it will pop up the original and you can read it better:

10 Extended Table Results

However, this command's output isn't very readable, so you might want to try the following:

DESCRIBE FORMATTED default.sample_07;

This produces a prettier format, which is easier to read, and you can see that this sample table, in the default database, is a managed or internal table.

11 DESCRIBE FORMATTED TableName

The data files for an internal table are stored in the 'warehouse' subdirectory, and Hive looks after the data in these tables.
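On the Hortonworks Sandbox, for example, the data for the sample_07 table seen earlier typically lives at a path something like this; the exact root depends on how the Hive warehouse directory is configured, so treat the path as an assumption:

/apps/hive/warehouse/sample_07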
To summarise, we have looked at the different table types, how to create them, and how to upload data into them. Thank you for joining me in this series so far. Our next step is to look at partitioning tables, and then we will start to look at analysing data.

Any questions in the meantime, please email me at jen.stirrup@copper-blue.com
Kind Regards,
Jen


Eating the elephant one bite at a time: dropping databases

In the last post, you learned how simple it is to create a database using Hive. The command is very similar to the way that we do this in SQL Server. As discussed, the underlying technology works differently, but ultimately they achieve the same end; database created.

Now that you've created your database, how do you drop it again? Fortunately, this is very simple to do if you're already a SQL Server aficionado.

Here is an example:

DROP DATABASE IncomeInequality;

If you want to drop the database without an error message if the database doesn't exist, then you can use:

DROP DATABASE IF EXISTS IncomeInequality;
 

Dropping a database in Hive isn't always straightforward, however. In SQL Server, when you drop a database, it removes the whole database and the disk files used by it. Hive, on the other hand, will not allow you to drop a database if it contains tables. To get around this default behaviour, you have to append the CASCADE keyword to the command.

DROP DATABASE IF EXISTS IncomeInequality CASCADE;

Using this command will drop the tables first, and then drop the database. Then, as in SQL Server, its directory is deleted too.

Once you’ve executed the command, you should double-check that the database has gone:

SHOW DATABASES; 

You can then see the result in Hue (more on this later! Let’s keep it small bites of the elephant for now!)

ShowDatabasesResult

Here, we have dropped the database and we only have the default left.

We will look at dropping tables in Hive in the next post. This is more complex than it first seems.
I hope that helps!
Jen


Eating the elephant one bite at a time: creating a database using Hive

Following on from the first part in my Hadoop series for the Microsoft Business Intelligence professional, we move to the next stage where we look at the simplicity of creating a database.

Before we move forward, ensure that the Hortonworks Sandbox is up and running. Our objective today is to create a new database called ‘IncomeInequality’. Before we create a database, let’s see what is in there already. To do this, we use the SHOW command:

SHOW DATABASES;

If you are using the Query Editor, the query still returns the answer even if you don't append the semi-colon at the end of the statement.

On the other hand, if you use the Hive Command Console in HDInsight, for example, it will expect the semi-colon. I've just got accustomed to adding it, so you will see it in the examples here. Now, let's create our database:

CREATE DATABASE IncomeInequality;

This small statement will do what it says: create a database called IncomeInequality. As before, I have appended the semi-colon.

Now when we execute the command to show databases, we can see that the new database has been added:

HUE_ShowDatabasesCommand

As you can imagine, with Big Data, you may well have a big number of databases!

If you want to restrict the number of databases shown, you can use a Regular Expression to be more specific about the name of the database you are looking for.  Here is an example where we are looking for the database beginning with ‘i’:

SHOW DATABASES LIKE 'i.*';

In this case, the list of databases shown includes only the names of the databases that begin with the letter 'i'. Here is an example:

HUE_ShowDatabasesREGEXCommand

So far, so good. What actually happens when you create a database?

The structure is very neat: Hive creates a new directory for each database. The tables within the database are stored in subdirectories of the original database directories.
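Assuming the default warehouse location, the directory for our new database would look something like this; the exact root is controlled by the hive.metastore.warehouse.dir setting, so treat the path as an assumption:

/user/hive/warehouse/incomeinequality.db/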

This way of database creation is different from SQL Server. When SQL Server creates a database, the following steps happen:

 – the model database is used as a ‘model’ (hence the name!) to initialise the new database and its metadata. User-defined objects in the Model database are copied to the new database. If a size is not specified, the new primary data file is given the same size as the primary data file for the Model database.
 – SQL Server then associates a Service Broker GUID with the new database.
 – two files are created: a primary file and a transaction log file.
 – SQL Server fills the rest of the database with empty pages.

To summarise, using Hive can also be very simple. However, there are ways in which we could add more detail to the database, and we will look at that in our next topic.

I hope that helps someone!
Kind Regards,
Jen

Eating the elephant one bite at a time: Some tips in setting up the Hortonworks Sandbox VM

In this blog post series, we will look at how the Business Intelligence professional can learn to use Big Data as a source. This series is mainly aimed at Microsoft and Tableau professionals, but everyone is welcome. Creighton Abrams once said, "When eating an elephant, take one bite at a time", and this is how we are going to start learning Big Data using Microsoft and Hortonworks technologies.

A great place to start is the Hadoop Sandbox, generously provided by Hortonworks. To get started, you can download it here. It has a number of pre-configured tutorials, and we are going to use it here to get us started looking at Hadoop.

Once you've downloaded the VM, you can open it in your preferred virtualisation tool. I chose VMware Player simply because it's what I am most accustomed to using. The instructions for VMware can be obtained from here.

When the VM is booted up, it instructs you to visit an IP address on the host machine. What happens if you get the message:

http://192.168.154.178: Page cannot be displayed
There are two things to check:
1. Make sure your VM Player is set up to use host-only adapter with DHCP enabled. To get this, go to the Virtual Machine settings, and look for ‘Network Adapter’. Here is an example:

VMSettingsForHortonworks

2. Check to make sure that the VMware network adapter is able to receive IP addresses via DHCP. To do this, go to the Network and Sharing Centre, click on the VM adapter and select 'Diagnose'.

These steps should resolve the issue.
I hope that helps.

The next step in this series is to see how we can create databases, and we can start to have some fun with data.

Jen

Power Business Intelligence for Office 365 – resolving the Scylla and Charybdis dilemma

People have often complained about the fragmented methodology that seemed to accompany Microsoft technology releases. I see this as a manifestation of the Scylla and Charybdis problem: people complain if Microsoft don't release everything as a unified whole, or complain that things take a long time to deliver. In other words, Microsoft never seem to win, and seem to be required to take one choice or another, with neither outcome pleasing everyone. I've commented before that it often seems that Microsoft can't win whatever they decide to do, and there is an example blog and commentary here.

From today's announcement, it seems as if things are coming together and Microsoft are indeed winning. The news includes a mobile business intelligence deliverable. As you know, business users love Excel, and it made sense for Microsoft to put Excel at the heart of their business intelligence strategy. For those who skim read, the main point is this: the Microsoft Business Intelligence strategy is coming together, and now that Microsoft are delivering a user-oriented Business Intelligence solution with mobile functionality (yes, even iPad!), hopefully a lot of the heat will go away. Microsoft have resolved the Scylla and Charybdis problem by producing something very special and useful for business users.

So what has been delivered? Today at the World Partner Conference, exciting news was released about this very topic, with the news that Power BI for Office 365 has been released.

Power BI for Office 365 is a self-service business intelligence solution surfaced to users through Excel and Office 365. Essentially, business users can get Excel-happy with data analysis and visualization capabilities to identify deeper business insights.  The data can be held on-premise or within a trusted cloud environment. Excel, in other words, becomes the front end for allowing business users to have fun with their data.

What does Power BI for Office 365 mean for ordinary users? With Power BI for Office 365, customers can connect to data in the cloud or extend their existing on-premises data sources and systems to quickly build and deploy self-service BI solutions hosted in Microsoft's enterprise cloud.

Power BI for Office 365 enables customers to do more with their data:

  • Analyze and present insights from data in compelling visual formats either on premises or in the cloud from Excel. 
  • Share reports and data sets online with data that is always kept up to date. 
  • Ask questions of your data using natural language search and get immediate answers through interactive tables, charts and graphs. 
  • Access and stay connected to data and reports from your mobile devices wherever you are.

I'm very happy about this announcement because it's what customers have been waiting for. My customers have been asking for this for years, and now that it is almost here, they will see things fall into place. Excel is the way forward for Microsoft Business Intelligence: hence, Power View, Data Explorer (now Power Query) and GeoFlow (now Power Map) are now part of the Office 365 story. And what a story!

The truth is that the majority of BI work is done in Excel, and I think Microsoft are just bringing everything home to where the Business Users do most of their work. It’s about working with the Excel users, and giving people opportunities to work with other data that are outside of their internal business walls.

You can visit http://www.office.com/powerbi on Monday, July 8 to sign up to be notified when the preview of Power BI for Office 365 is available later this summer. I'm heading over to do it now, and finally I can give my customers the answers they've been looking for.

Goodbye Scylla and Charybdis 🙂

Odysseus vor Scilla und Charybdis by Johann Heinrich Füssli