Eating the elephant one bite at a time: knowing your tables

In this segment, we will look at the most basic unit of the database: tables. We will look at their implementation in SQL Server and also in Hive.

Tables, in SQL Server, contain the data in a database.  SQL Server has a number of different types of database as well as the tables that are defined by the user. These table types include partitioned tables, temporary tables, system tables and wide tables. If you need more information on these types, please visit the official Microsoft documentation here.

In Hive, there are two types of tables: internal and external tables. In our last topic, we touched on tables and how we need to drop them in Hive before we can delete the database (unless you use the CASCADE command). 
Cindy Gross has written an excellent blog topic on internal vs external tables, and you can access it here

For SQL Server people, it is important not to think that the internal/external point is similar to the local / global temporary tables discussion. Local temporary tables are visible only to the current session, whereas global temporary tables are visible to all sessions. Temporary tables in SQL Server cannot be partitioned. 

It is important to note that Hive tables are not akin to SQL Server system tables, where SQL Server stores its internal knowledge about the configuration of the server. In System tables, users cannot query or update these tables directly, and view the information via system views

Instead, the Internal/External point refers to the level at which Hive manages the table, and whether Hive sees the data as being shared across different tools, or simply used by Hive only. Internal and external tables in Hive can be partitioned, and we will look at this point further in a later post. Unlike system tables, internal tables are used to store data which isn’t about the system itself, as in a SQL Server system table.

For our purposes here of eating the Hadoop elephant, one bite at a time, we will look at the differences between these table types. Here is a summary below:

Managed / Internal Tables – Hive owns the data, dropping the table deletes the table metadata and the actual data.
External tables – points at the data but does not own it.  Dropping the table deletes the metadata, not the actual data. 

The data for this table is taken from the UNICEF Report Card, which details childrens’ wellbeing in different countries. Here, the data looks at childrens’ wellbeing in rich countries, and you can read the report here.


Creating an External Table

In Hue, you need to be sure that you are pointing at the correct database. 
On the left hand side, you will see a drop down list of databases. 
We are going to choose the database ‘IncomeInequality’

The external table can be created using the following statement, which is terminated by a semi-colon “;”.

CREATE EXTERNAL TABLE IF NOT EXISTS UNICEFReportCard ( CountryName STRING, AvgRankPosition FLOAT, MaterialWellBeing FLOAT, HealthAndSafety FLOAT, EducationalWellBeing FLOAT, FamilyPeersRelationships FLOAT, BehavioursAndRisks FLOAT, SubjectiveWellBeing FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ LOCATION ‘/user/hue/UNICEF’; 


This query can be copied and pasted into the Query button, and then you can execute the query 
simply by pressing the Execute button.
In order to check that the table has been created, you can also do a simple SELECT statement in 
Beeswax Query Editor in order to see if the data exists:
SELECT * FROM IncomeInequality.UNICEFReportCard;

This will return no data, but if it does not error, then we know that the table has been found.

Alternatively, you can click on the Beeswax (Hive UI) and then select the ‘Tables’ button.

Make sure you’ve selected the correct database, IncomeInequality, and then you will see the table on the right hand side.

When you click on the table, you can see two options: you can see the columns, or view a sample of the data. 
So, how do we get data into the table? 
In SQL Server, we’d use SQL Server Integration Services.
In Hive, using the Beeswax GUI, we load up some data files. The external table is essentially a structure over the data. 
So, if we drop the table, we don’t delete the data; it still resides in the data files. 
There are a number of ways to load files:
 The simplest is through the GUI. Simply select ‘Upload’ and insert your file. 
If you have more than one file, you could select a zip archive and your files will be loaded.
One caveat here: When I was using Internet Explorer version 10.9, 
the ‘Upload’ button didn’t seem to work properly. I switched to FireFox, and it worked fine.
Once you’ve uploaded the file, you can see it in Beeswax.

8 Beeswax File is uploaded

Now, if you go back and view the table, you can see it contains the data from the file. Here is an example below:

9 UNICEF Report Card data is loaded

You could also execute the following command from the Hive Query Editor in Beeswax:
LOAD DATA INPATH ‘/user/hue/UNICEF/UNICEFreportcard.csv’ OVERWRITE INTO TABLE `IncomeInequality.unicefreportcard`
Using the interface or queries, how can you distinguish between an internal table and an external table? 
How can you tell if a table is managed or external? Hive has a command which is like sp_help in SQL Server is used to give you details about a table. 
The Hive command is given here:

DESCRIBE EXTENDED and then the name of the table, for example:

DESCRIBE EXTENDED UNICEFReportCard;
We can see the output below. If you click on the image, it will pop up the original and you can read it better:

10 Extended Table Results

However, this command isn’t very readable, so you might want to try the following:
DESCRIBE FORMATTED default.sample_07;
This produces a more pretty format, which is easier to read.
 and you can see that this sample table, in the default database, is a managed or internal table.

11 DESCRIBE FORMATTED TableName

The data files for an internal table are stored in the subdirectory ‘warehouse’, 
which is where data for internal tables is stored. Hive looks after the data in these tables. 
To summarise, we have looked at the different table types and how to create them, and how to upload data into them. 
Thank you for joining me in this series so far. Our next step is to look at partitioning tables, and then we will start to look at analysing data.
Any questions in the meantime, please email me at jen.stirrup@copper-blue.com
Kind Regards,
jen


Mobile Business Intelligence Presentation Portugal 2013

I gave this presentation at SQLSaturday event in Portugal during March 2013.  Thanks to the organisers for taking such good care of me during my visit.

As I discussed during the session, there were three main strands of mobilising Microsoft technology:

– use third party products such as PivotStream to free your PowerPivots
– use cloud computing e.g. Azure
– use SharePoint

These options will not suit everyone but it might help somebody to decide which path is the right one for them. I look forward to your commentary and feedback.

SQLSaturday Edinburgh Preconference Training Day sponsored by SQLSkills

The SQLSaturday Edinburgh team are extremely proud to present the following one day, advanced level training with SQLskills Jonathan Kehayias and Joseph Sack. The Training Day will be held on Friday 7th June at the University of Edinburgh Pollok Hall, Edinburgh, Scotland. You can find full details here, such as travel, accommodation and venue., and you can register here.
 
I’m very proud that Jon and Joe are joining us. They are extremely well-respected for their knowledge in SQL Server.
 
You may well have read their books or material. Jonathan is the ‘go-to’ expert and authority on Extended Events in SQL Server 2008— for example, he wrote the definitive whitepaper on Extended Events for Microsoft.
 
Joe is a SQL Server author, and you can see his Amazon page here. In 2009, Joe took over responsibility for the entire SQL Server Microsoft Certified Master program and held that post until 2011.
 
Jonathan and Joe are part of the highly-respected SQLskills team. We are also delighted to announce that SQLskills are our official Preconference Training Day Sponsor, and we are grateful to them for their fantastic support of SQLSaturday Edinburgh.
 
The SQLskills team are brought together by Paul S. Randall and Kimberly L Tripp, two of the world’s most renowned SQL Server experts, is dedicated to providing the highest quality and most in-depth training and consulting in the market. The SQLskills team have an emphasis on real-world expertise that you can immediately put into practice, making you a better SQL Server professional.
 
The Training Day itself has the following Agenda and you can register here.
 
Leveraging SQL Server 2012 Features and Tools
SQL Server 2012 offers a number of new features that improve high availability, security, performance analysis, and scalability that can be immediately leveraged for an immediate return on investment associated with upgrading. In this session, Joe and Jonathan will go through a huge list of problems solved by these new features and tools – and will demo them along the way!
Some of the new SQL Server 2012 features that will be discussed:
  • partitioning enhancements
  • ColumnStore indexes
  • contained databases
  • distributed replay
  • server-level roles
  • predictable failover with failover policies
  • availability groups
  • multi-database failover
  • read-only secondaries
  • and… much, much more!
If you want an information packed session with practical reasons to upgrade and powerful tips and tricks for using the new release – this workshop is for you!
Who are the presenters?
Jonathan Kehayias and Joseph Sack are well-known SQL Server experts, part of the elite SQLskills team of expert consultants.
What will you take away from the session?
By attending this session, you will obtain advanced SQL Server 2012 knowledge that will help you to conduct your upgrade more smoothly, whilst helping you to understand the new SQL Server 2012 features can help your environment.
SQLSaturday Edinburgh offer very reasonably priced sessions aimed at helping the SQL Server community. Given the low cost of £125 (exclusive of VAT) for this one-day training course, we believe that you will recoup this investment by delivering your upgrade with the expertise that you will gain from this session.

Microsoft SQL Server 2012 Latest Announcements: Hekaton, Service Pack 1 and PDW

There are three main announcements for Microsoft SQL Server 2012 announced today at the SQLPass Summit 2012 in Seattle, Washington. These are being announced by Ted Kummert, Corporate Vice President, Business Platform Division, Microsoft. You can also see Doug Leland’s announcement from the Business Platform Group here.
The announcements focus on three areas:
Ted Kummert introduces Project codenamed Hekaton a new in-memory technology for transaction processing that is an integral part of the data platform. Hekaton will ship in the next major version of SQL Server, and is wholly a part of it. It’s fast because the

The next version of the Parallel Data Warehouse is announced. 


SQL Server 2012 Service Pack 1 will be released. This will have Business Intelligence features and features that are not contained within the Business Intelligence sphere. I will cover these items later on in this blog.


What makes these announcements interesting?

The Hekaton announcement is interesting since it shows that Microsoft are real contenders in the in-memory area. SAP have been doing some interesting things, but now it’s Microsoft’s turn to play.  Hekaton will be part of an end-to-end SQL Server solution with in-memory capabilities from streaming and caching workloads right through to analytics. This offers a complete portfolio for transactional processing – full circle.

Is it fast? You betcha! Based on private customer testing to date, Hekaton will offer performance gains of up to 10 times for existing apps and up to 50 times for new applications optimized for in-memory performance. This is significant, and watch this space for more details. 

Here are the main points:

  • High performance, in memory OLTP engine.
  • Full ACID support.
  • T-SQL compiled to native code.
  • Enables massive improvement in performance. This will allow you to onramp existing apps to give you integrated performance for data loading, updating. And reading throughput is increased comparing best cases, due to the lack of latching and locking contention on the table.


The Parallel Data Warehouse announcement is very interesting. It will be available in the first half of 2013. 

It will be powered by PolyBase, which allows you to process queries across using relational and non-relational Hadoop data – right from SQL Server Management Studio. I don’t know about you, but I think that’s pretty awesome. The analytics that you’ll be able to do, right from good old familiar SSMS, will lead to breakthrough insights. You already know SSMS, and the query engine will translate the queries for you so you can focus on generating your data out for the insights. This will empower you to deal with big data and really use it, from an interface that you already know. It’s mixing HTFS nodes with SQL Server, which is incredibly powerful. It means that the PDW is a federated data processing engine, and very powerful.

The PWD is Built for big data. We have Next generation performance at scale. It uses the XVelocity column store. For big data, we start small with TBs and scale out to petabytes. This means that we can see a move away from SAN storage to reduce price. Hardware and software combined reduces the price. Since the hardware and the software are already aligned, this will help customers to reduce costs, since the hardware footprint will be reduced. 

Now for my favourite announcement – SQL Server 2012 Service Pack 1 will be announced, with a focus on Business Intelligence enhancements. It has new capabilities for Business Intelligence in Office 365. 



3. SQL Server Service Pack 1 Enhancements

This was announced back in June 2012. For Business Intelligence, there will be Power view enhancements. – Geographical mapping. Drill up and drill down. Hierarchies. Backgrounds. Oh, and pie charts 🙂

The significance is Excel 2013 – powerpivot and Powerview so everyone is now BI capable. This really offers a Data Democracy for every Excel user – we will all be data citizens.
Since powerpivot and powerview are natively into the product, everyone can use their data to make decisions, and even change the world around them. Look at the great work that the San Francisco and London data citizens are doing with their data. Excel can really help everyone – not just business users – to grab their data and use it to empower themselves.

To summarise, these announcements offer a full, rounded big picture – it will accelerate insights for all data. Big data crunching. Rapid, real time insights.This is the power of analytics, and means that the Microsoft Data Platform offers end-to-end from transactions right through to the business users and the insights that they can gain from their data. It’s not enough simply to own data; you have to use it. The new announcements from SQL Server are empowering for business users as well as technical DBAs.

New capabilities = in memory, big data, business analytics. It’s all there. Take a look at the earlier links, and watch the keynote from the SQLPass website.  

SQL Server Data Tools – September 2012 Update and updated Power Tools are released!

The SSDT team’s continued commitment to delivering regular, frequent releases of SSDT to the web has shown itself once again! There’s a new download available to you, and you can get it here.
Also – the SSDT team have produced new Power Tools. The full blog list is here. The purpose of the current blog is to release this information to Business Intelligence people, in the hope that some of this news will be relevant and provoke some interest amongst BI people to take a look! The full SSDT Team Blog is here
So what is all the latest news for Business Intelligence people? Here is a summary of the main topics:
Visual Studio 2012
 – this release of SSDT supports the Visual Studio 2012 shell.  
 – contains several bug fixes to the SSDT version that shipped in Visual Studio 2012
 – release can be applied  as an update on top of Visual Studio 2012 Professional, Premium, and Ultimate Editions

Updated SSDT Power Tools
The September 2012 update is compatible with updated Power Tools. Take a look here:

 
 These updated Power Tools rely on the September 2012 update, so make sure that you install it first! 
 

LocalDB Configuration Improvements
  – easier to create a new single localdb instance called Projects to host all the project debug databases.  
– enhancement to SQL Server Object Explorer by surfacing the default localdb instance and enabling the deletion of localdb instances.
 
Support for ANSI_NULLS and QUOTED_IDENIFIER Properties
  – now, SSDT support database objects with non-standard ANSI_NULLS or QUOTED_IDENTIFIER properties on project and script files.
I hope that helps – so head on over to the SSDT Team Site and take a look at the September 2012 update and the updated Power Tools.
Enjoy!
Jen

SQL Server 2008 R2 Service Pack 2 (SP2) is Released!

Note the release of SQL Server 2008 R2 Service Pack 2 (SP2). Both the Service Pack and Feature Pack updates are available for download on the Microsoft Download Center. 
This a free upgrade, so please head over to the Download Center for a look.
This particular pack is interesting because it contains a few key community Reporting Services requests

  • Reporting Services Charts Maybe Zoomed & Cropped
    Customers using Reporting Services on Windows 7 may occasionally find charts are zoomed in and cropped. To work around the issue some customers set ImageConsolidation to false.This is tackled in the Service Pack
  • Batch Containing Alter Table not Cached
    In certain situations with batch files containing the alter table command, the entire batch file is not cached.
  • Collapsing Cells or Rows, If Hidden Render Incorrectly
  • Some customers who have hidden rows in their Reporting Services reports may have noticed rendering issues when cells or rows are collapsed. When writing a hidden row, the Style attribute is opened to write a height attribute. If the attribute is empty and the width should not be zero.

I’m holding a Business Intelligence and Data Visualisation Precon!

I’m delighted to announce that I’m holding a pre-conference one-day training in Data Visualisation in SQL Server 2012 in Cambridge on 7th September, 2012. For more details, and to register for the precon, please click here and to register for SQL Saturday 162 event, please click here.
My thanks go to SQLPass and to Mark Broadbent in particular, who have worked extremely hard in order to put on this community event. Here is an outline for my precon and please don’t hesitate to get in touch if you have any other queries!
If you want to conduct advanced, business-oriented Business Intelligence analysis in SQL Server 2012, then it is essential to understand data visualisation.

Using the new SQL Server 2012 Tabular Model and PowerPivot as data sources, this course will aim to teach about the new self-service Business Intelligence features in SQL Server 2012 whilst focusing on data visualisation. We will also look at SharePoint, and what it can offer.

This session is aimed at Excel and/or Business Intelligence developers who want to make informed data visualisation decisions about reporting, with supporting cognitive psychology theory where relevant. The takeaways will focus on:

  • Finding patterns in the data. 
  • Further Data Visualisations – learn about visualisations that are perhaps not so well-known including Stephen Few’s bullet charts and Tufte’s Sparklines in SSRS.
  • The complexities of displaying multivariate data. For example, we will look at Tufte’s “small multiples” in Power View and in Reporting Services.
  • Putting it all together: Considerations for Dashboards with PerformancePoint
9:00   Welcome and What is data visualisation? What it is not, and why is it important? What products make up the SQL Server 2012 Business Intelligence stack?
09:30  Tabular Model – what is it? When is it best used? What distinguishes it from multidimensional cubes?
10:30   PowerPivot – what is it? When is it best used? What distinguishes it from multidimensional cubes? What are the new features in SQL Server 2012?
11:15     Break
11:30   Finding patterns in the data. You will learn about the process involved in finding patterns in the data, looking at some of the more well-known data visualisation examples. 

 – We will learn about: Power View, using PowerPivot and the Tabular model as a basis. 
 – We will also look at tips and tricks in optimising Tabular Models so that they can render Power View effectively.

12:30     Lunch
13:30  Further Data Visualisations. In this section, you will learn about visualisations and how they are best used. This includes bullet charts, and other visualisations based on Tufte’s “small multiples” such as sparklines. We will learn:

 – We will look at PerformancePoint, and how we can use PowerPivot as a basis and ‘gotchas’ in using PowerPivot as a basis for PerformancePoint
 – We will look at Excel Services, and how we can provision self-service using the Tabular Model and PowerPivot as the basis for Excel and Excel Services
–   We will look at other helpful visualisations such as sparklines, bullet charts and marimekko charts.

15:00 Multivariate statistical data. In this section, we will cover the complexities of displaying multivariate data since is potentially more complex. Here, we look at ways of displaying multivariate data such as table lenses and crosstab arrangements of Tufte’s ‘small multiples’. We will use Reporting Services as a technology to surface multi-dimensional data.
15:15 Break
16:15 Dashboards – Putting it all together. We will look at different ways of implementing Dashboards, KPIs and other visualisations. This will involve a range of technologies, from KPIs in PowerPivot, to the new features in Power View.
17:30 Close