Eating the elephant one bite at a time: creating a database using Hive

Following on from the first part in my Hadoop series for the Microsoft Business Intelligence professional, we move to the next stage where we look at the simplicity of creating a database.

Before we move forward, ensure that the Hortonworks Sandbox is up and running. Our objective today is to create a new database called ‘IncomeInequality’. Before we create a database, let’s see what is in there already. To do this, we use the SHOW command:

SHOW DATABASES;

If you are using the Query Editor, then the query still returns the answer if you don’t append the semi-colon at the end of the sentence.

On the other hand, if you use the Hive Command Console in HDInsight, for example, it will expect the semi-colon. I’ve just got accustomed to adding it, so you will see it in the examples here.

CREATE DATABASE IncomeInequality;

This small statement will do what it says; create you a database called IncomeInequality.  As before, I have appended the semi-colon.

Now when we execute the command to show databases, we can see that the new database has been added:

HUE_ShowDatabasesCommand

As you can imagine, with BigData, you may well have a big number of databases!

If you want to restrict the number of databases shown, you can use a Regular Expression to be more specific about the name of the database you are looking for.  Here is an example where we are looking for the database beginning with ‘i’:

SHOW DATABASES LIKE 'i.*'

In this case, the list of databases shown includes only the names of the databases that begins with the letter ‘i’.  Here is an example:

HUE_ShowDatabasesREGEXCommand

So far, so good. What actually happens when you create a database?

The structure is very neat: Hive creates a new directory for each database. The tables within the database are stored in subdirectories of the original database directories.

This way of database creation is different from SQL Server. When SQL Server creates a database, the following steps happen:

 – the model database is used as a ‘model’ (hence the name!) to initialise the new database and its metadata. User-defined objects in the Model database are copied to the new database. If a size is not specified, the new primary data file is given the same size as the primary data file for the Model database.
 – SQL Server then associates a Service Broker GUID to name the database.
 – two files are created: a primary file and a transaction log file.
 – the database fills the rest of the database with empty pages.

To summarise, using Hive can also be very simple. However, there are ways in which we could add more detail to the database, and we will look at that in our next topic.

I hope that helps someone!
Kind Regards,
Jen

Leave a Reply