Eating the elephant one bite at a time: knowing your tables

In this segment, we will look at the most basic unit of the database: tables. We will look at their implementation in SQL Server and also in Hive.

Tables, in SQL Server, contain the data in a database. SQL Server has a number of different types of database as well as the tables that are defined by the user. These table types include partitioned tables, temporary tables, system tables and wide tables. If you need more information on these types, please visit the official Microsoft documentation here.

In Hive, there are two types of tables: internal and external tables. In our last topic, we touched on tables and how we need to drop them in Hive before we can delete the database (unless you use the CASCADE command).

Cindy Gross has written an excellent blog topic on internal vs external tables, and you can access it here.

For SQL Server people, it is important not to think that the internal/external point is similar to the local / global temporary tables discussion. Local temporary tables are visible only to the current session, whereas global temporary tables are visible to all sessions. Temporary tables in SQL Server cannot be partitioned.

It is important to note that Hive tables are not akin to SQL Server system tables, where SQL Server stores its internal knowledge about the configuration of the server. In System tables, users cannot query or update these tables directly, and view the information via system views.

Instead, the Internal/External point refers to the level at which Hive manages the table, and whether Hive sees the data as being shared across different tools, or simply used by Hive only. Internal and external tables in Hive can be partitioned, and we will look at this point further in a later post. Unlike system tables, internal tables are used to store data which isn’t about the system itself, as in a SQL Server system table.

For our purposes here of eating the Hadoop elephant, one bite at a time, we will look at the differences between these table types. Here is a summary below:

Managed / Internal Tables – Hive owns the data, dropping the table deletes the table metadata and the actual data.
External tables – points at the data but does not own it. Dropping the table deletes the metadata, not the actual data.

The data for this table is taken from the UNICEF Report Card, which details childrens’ wellbeing in different countries. Here, the data looks at childrens’ wellbeing in rich countries, and you can read the report here.

Creating an External Table

In Hue, you need to be sure that you are pointing at the correct database.
On the left hand side, you will see a drop down list of databases.
We are going to choose the database ‘IncomeInequality’

The external table can be created using the following statement, which is terminated by a semi-colon “;”.

CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard ( CountryName STRING, AvgRankPosition FLOAT, MaterialWellBeing FLOAT, HealthAndSafety FLOAT, EducationalWellBeing FLOAT, FamilyPeersRelationships FLOAT, BehavioursAndRisks FLOAT, SubjectiveWellBeing FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ LOCATION ‘/user/hue/UNICEF’;

This query can be copied and pasted into the Query button, and then you can execute the query

simply by pressing the Execute button.

In order to check that the table has been created, you can also do a simple SELECT statement in

Beeswax Query Editor in order to see if the data exists:

SELECT * FROM IncomeInequality.UNICEFReportCard;

This will return no data, but if it does not error, then we know that the table has been found.

Alternatively, you can click on the Beeswax (Hive UI) and then select the ‘Tables’ button.

Make sure you’ve selected the correct database, IncomeInequality, and then you will see the table on the right hand side.

When you click on the table, you can see two options: you can see the columns, or view a sample of the data.

So, how do we get data into the table?

In SQL Server, we’d use SQL Server Integration Services.

In Hive, using the Beeswax GUI, we load up some data files. The external table is essentially a structure over the data.

So, if we drop the table, we don’t delete the data; it still resides in the data files.

There are a number of ways to load files:

The simplest is through the GUI. Simply select ‘Upload’ and insert your file.

If you have more than one file, you could select a zip archive and your files will be loaded.

One caveat here: When I was using Internet Explorer version 10.9,

the ‘Upload’ button didn’t seem to work properly. I switched to FireFox, and it worked fine.

Once you’ve uploaded the file, you can see it in Beeswax.

Now, if you go back and view the table, you can see it contains the data from the file. Here is an example below:

You could also execute the following command from the Hive Query Editor in Beeswax:

LOAD DATA INPATH ‘/user/hue/UNICEF/UNICEFreportcard.csv’ OVERWRITE INTO TABLE `IncomeInequality.unicefreportcard`

Using the interface or queries, how can you distinguish between an internal table and an external table?

How can you tell if a table is managed or external? Hive has a command which is like sp_help in SQL Server is used to give you details about a table.

The Hive command is given here:

DESCRIBE EXTENDED and then the name of the table, for example:

DESCRIBE EXTENDED UNICEFReportCard;

We can see the output below. If you click on the image, it will pop up the original and you can read it better:

However, this command isn’t very readable, so you might want to try the following:

DESCRIBE FORMATTED default.sample_07;

This produces a more pretty format, which is easier to read.

and you can see that this sample table, in the default database, is a managed or internal table.

The data files for an internal table are stored in the subdirectory ‘warehouse’,

which is where data for internal tables is stored. Hive looks after the data in these tables.

To summarise, we have looked at the different table types and how to create them, and how to upload data into them.

Thank you for joining me in this series so far. Our next step is to look at partitioning tables, and then we will start to look at analysing data.

Any questions in the meantime, please email me at jen.stirrup@copper-blue.com

Kind Regards,

jen

Eating the elephant one bite at a time: knowing your tables

Like this:

Leave a ReplyCancel reply

Share this:

Like this:

Leave a ReplyCancel reply

Discover more from Jen Stirrup