Eating the elephant, one bite at a time: Loading data using Hive

In the previous ‘Eating the elephant’ blogs, we’ve talked about tables and their implementation. Now, we will look at one of the ways to get data into a table. There are different ways to do this, but here we will look only at getting data into an external Hive table using HiveQL, which is the Hive query language. Sounds similar to SQL, doesn’t it? Well, that’s because a lot of it looks and smells similar. When you look at this example here, you will see what I mean!

Here is the Hive syntax to create an external table. In this post, we will look at simple examples, and perhaps look at partitioned tables on another occasion.

First things first; you need to create an external table to hold your data. Before you do, check the contents of the folder /user/hue/UNICEF if it exists. If not, don’t worry, we have the rest of the syntax below:

Create your table

CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard (
CountryName     STRING,
AvgRankPosition     FLOAT,
MaterialWellBeing     FLOAT,
HealthAndSafety     FLOAT,
EducationalWellBeing     FLOAT,
FamilyPeersRelationships     FLOAT,
BehavioursAndRisks     FLOAT,
SubjectiveWellBeing     FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hue/UNICEF';
 
 
Note that the directory has now been created: /user/hue/UNICEF If it was not there already. 
You may now see it in the File Explorer, and here is an example below:
  
 

1_NewDirectoryCreatedWithTable



The next step is to upload some data to folders. For the purpose of showing what Hive does when you load up data files, let’s place a file into a different folder, called /user/hue/IncomeInequality/ and the file will be called ‘UNICEFreportcard.csv’ The syntax to load the file statically is here:

LOAD DATA INPATH ‘/user/hue/IncomeInequality/UNICEFreportcard.csv’ OVERWRITE INTO TABLE `unicefreportcard`;

You can see this here, and if it is not clear, you can click on the link to go to Flickr to see it in more detail:

2_HiveInsertDataIntoTable

Once you have executed the query, you can check that the data is in the table. This is easy to do, using the HUE interface. Simply to go the Beeswax UI, which you can see as the yellow icon at the top left hand side of the HUE interface, and then select the ‘Tables’ menu item from the grey bar just under the green bar. You can select the UNICEFreportcard table, and then choose ‘Sample’. You should see data, as in the example screenshot below:

1_DataInsertedResult

When the data is imported, where does it get stored for the external table?

The UNICEFreportcard file was originally in the IncomeInequality folder, but now the original UNICEFreportcard.csv file is moved from the IncomeInequality folder. Where has it gone? It has been moved to the UNICEF folder, which was specified as the location when we created the external table.

When we drop the table, the data file is not deleted. We can still see the CSV file in the UNICEF folder. This is straightforward to test; simply drop the table using the button, and then use File Explorer to go to UNICEF folder and you will see the CSV. Here it is:

4_FileLocationforExternalFiles
 

To test the separation of the data and the table metadata, simply re-run the query to create the table as before. Now, when you view a sample of the data, you see the data as you did before. When you dropped the table originally, you simply removed the table metadata; the data itself was stored. The reason for this is due to the fact that this table is an External table, and Hive preserves a separation between the data and the table metadata.

If the table was an internal table, if you drop the table, the data will be deleted as well – so be careful!

To summarise, in this post,we’ve looked at one way of getting data into Hive. Our next steps will be to do something with the data, now that it is in Hive. So let’s move on to do some simple querying before we proceed to start to visualise the data.

I hope that helps!
Jen

 
 

Leave a Reply