What does the Data Analyst need for a Toolkit?
- Power BI – Excel, Power Query, Power View
Power BI – cleanse, model, visualise
R – check, check, check, explore, statistics.
In this series, we will look at:
Power BI – cleanse data
Some basic R scripts
analysing and visualising data in R
Accessing data in the Hortonworks Sandbox for analysis with R
Power BI visualisation in Excel such as the little People Graph, which I love, because it is simply fun!
So let’s get started!
Microsoft Power Query for Excel, is a new add-in that provides a seamless experience for data discovery, data transformation and enrichment for Information Workers, BI professionals and other Excel users.
With Power Query you can:
· Identify the data you care about from the sources you work with (e.g. relational databases, Excel, text and XML files, OData feeds, web pages, Hadoop HDFS, etc.).
· Discover relevant data from inside and outside your organization using the search capabilities within Excel.
· Combine data from multiple, disparate data sources and shape it in order to prepare the data for further analysis in tools like Excel and Power Pivot, or visualization in tools like Power View and Power Map.
· Share the queries that you created with others within your organization, so they can be easily found via Search.
It is part of our toolkit because it helps us to easily clean and prepare the data for consumption by R.
So let’s get to it.
Download and install Power Query
Before we proceed to download, let’s check the supported operating systems for Power Query. At the time of writing, these are listed below:
· Supported Operating System
· Windows 7, Windows 8, Windows Server 2008 R2, Windows Server 2012
· Windows Vista (requires .NET 3.5 SP1)
· Windows Server 2008 (requires .NET 3.5 SP1)
· Windows 7
· Windows 8
· Windows 8.1
The following Office versions are supported:
· Microsoft Office 2010 Professional Plus with Software Assurance
· Microsoft Office 2013 Professional Plus, Office 365 ProPlus or Excel 2013 Standalone
Microsoft Power Query for Excel requires Internet Explorer 9 or greater.
Microsoft Power Query for Excel is available for 32-bit (x86) and 64-bit (x64) platforms, your selection must match architecture of the installed version of Office.
Once you have checked these items, you can download the version of the Power Query add-in that matches the architecture (x86 or x64) of your Office installation.
- Download the Power Query for Excel add-in.
- Open the folder where you downloaded Microsoft Power Query for Excel.
- Double-click the downloaded installer file, and then follow the steps in the wizard.
- After the installation is complete, click Finish.
- Power Query will appear as a new tab in the Excel ribbon.
- Run the MSI installer and follow the setup steps.
Enabling Power Query
- Once Power Query is installed, you may need to enable it. To do this, go to File and then choose Options. The Excel Options dialogue box will open.
- Go to the Add-Ins option on the left hand side.
- At the foot of the Excel Options box, you will see an option to Manage options. In the drop down list, look for COM Add-ins.
- Make sure that Power Query is checked.
Connecting to External Data Sources
We have options to connect to a wide range of data sources, both public and internal.
Connecting to Web Page data
- To open Internet Explorer, press the Windows ( ) button.
- Then, in Start, commence typing Internet Explorer.
- When the Internet Explorer application appears, click the application tile.
- Go to the Wikipedia Human Development Index page in order to see it before we import it.
- We will now import the data to this page.
Add a Wikipedia page data source to Power Query
- In the POWER QUERY ribbon tab, click From Web.
- In the Web Content dialog box, in the URL text box, paste the Wikipedia URL (http://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index).
- Click OK.
- In the right hand side, you will see a list of Tables. Click Table 2, and you will get a preview of the data.
- Double click on it, and you will see the data appear in the Excel window.
- Click on it, and rename the query to HDI Very High.
- Make sure to click on Add to the Data Model at the bottom right hand side.
- Next, click on Apply and Close.
- You will be returned to the Excel workbook.
We will re-run the query now for the other HDI levels: High, Medium, and Low.
To do this, we will reuse the existing query by clicking on the Power Query tab again, and then selecting Recent Sources.
- In the drop-down list, you will see a list of your recent sources. Select the Wikipedia page.
- You will be returned to the Power Query Editor, and you will see the tables on the right hand side.
- Select Table 3 by double-clicking it. Next, Make sure you have clicked on Add to the Data Model at the bottom right hand side.
- Repeat these steps for the medium Human Development information.
- For the Medium and Low Human Development Indexes, you will need to repeat these steps using Table 4 and Table 5 respectively.
Transforming the data
Go back to the first HDI Very High Human Development query, and we can double click on it to produce the Power Query Editor.
We can transform the HDI Very High Human Development data by removing the column Rank Change in rank between 2013 report to 2011 report. It is not required.
- To do this, highlight the column Rank Change in rank between 2013 report to 2011 report and right-click to delete it. We do not need it so if we select Remove, the column will be deleted.
- Click on Row number 1. We will make this row the header row. To do this, select Use First Row as Headers and you will see that this row is now the header row.
- Rename the HDI column to HDI 2012 by right clicking it, and selecting Rename.
- We will still have a row that is text. To remove it, highlight the first row and select the button Remove top Rows from the Home tab in the Power Query Editor.
- Change the HDI column to a Decimal Number by highlighting the column, going to Data Type and selecting Decimal Number from the drop down list.
- Let’s rename the HDI_1 column to Difference by right-clicking it and selecting Rename.
- Let’s add in a custom column. To do this, right click on the Difference column and select Insert Custom Column
- In the Insert Custom Column editor, double click the HDI 2012 column to put it into the text box.
- Then, enter a minus sign after the HDI 2012 column
- Finally, double click the Difference column to select it and put it into the text box.
- Your query should look like this:
- [HDI 2012] – [Difference]
- Press the return button. Your final query output should appear as follows.
- Now click Apply and Close.
Now, you can see in Excel we have five columns:
- HDI 2012
- HDI 2010
Two of the columns contain HDI data, which is split by date. It will be easier to visualise the data if all of the HDI data is in the same column. Therefore, we can unpivot the columns so that it is easier to visualise.
- To do this, select the HDI 2010, HDI 2012 and the Difference columns.
- Go to the Transform tab and go to the option Unpivot Columns.
- You will now see that the HDI 2010, HDI 2012 and the Difference columns would go in the same column, which is headed Value.
- The attribute names have gone into a column called Attribute.
- The other two columns, Rank and Country, have stayed the same.
- We don’t actually need the Difference column now that we have obtained the data, so let’s filter it out.
- To do this, go to the Attribute column heading and click the downward arrow next to it.
- Deselect the Select All button, and select only the HDI 2010 and HDI 2012 options, and then click OK.
Now we can rename the attributes HDI 2010 and HDI 2012 so that it contains the year name only. To do this, select the Attribute column and go to the Transform tab.
- Go to the option Replace Values and the Replace Values box will appear.
- In the Value to Find box, enter HDI 2012.
- In the Replace With box, enter 2012 and then click OK.
- Repeat these steps for HDI 2010, renaming it to 2010.
- Let’s rename the Attribute column to year. To do this, select the column and then right click. In the pop up menu that appears, select Rename. Rename the column to Year.
- Now, let’s rename the Value column so that it is now called HDI Value. To do this, select the column and right click to get the pop up menu again. Look for the rename option, and rename the column to HDI Value.
- If we examine the Rank column, we can see that some countries and jointly ranked. So, for example, Hong Kong and Iceland are jointly ranked 13th top in the list.
- Let’s add an index to our Power Query. We could use this later as a key in the table.
- To do this, right click on HDI Data and click Edit.
- The Power Query Editor will appear again. Go to the Add Column tab in the Power Query Editor and select the option Insert Index Column.
- The Index column will appear at the right hand side. To move it, select the column and go to the Move option.
- Look for the option ‘To the Beginning’ and select it. The Index column will move to the left hand side of the table.
- Once you have done this, go to the Home Tab and click on the Apply and Close button.
Connecting to Online data
Let’s do an online search. This is a fun part of Power Query.
Before we dive in to use the Editor, let’s take a look at some of the features in the Power Query Editor.
You can see that there are four main tabs: Home, Transform, Add Column and View.
The Home tab contains useful functionality which are common when we are cleansing data.
Sometimes data files contain unnecessary data when they are exported from other sources. This involves specifying to Power Query which data you want to keep, not which data you want to throw away.
One example of this unnecessary data might include headers, titles, dates, owner information and so on. We can remove these rows by removing the top number of rows from the data set.
- Click on the Power Query tab.
- Click on the Online Search button.
- The Online Search box will appear on the right hand side.
- Type in past and future GDP in the search box and click return.
- The search box will show two tabs: one is organisation and the other is public. If you click on organisation, you will probably see no data. If you click on public, there will be many data sources to choose from.
Extracting data from the Azure Data Market.
In the real world, however, we need to be able to connect to different data sources, which may contain large amounts of data.
We will look at connecting to multiple data sources at a time. This is a useful way of enriching our data., which is another way of describing data. In this definition, it is a field that can be considered an independent variable, regardless of the data type.
To connect the online data and local data, we will connect to Windows Azure Datamarket using Odata, which is a standardized protocol to provide CRUD (create, read, update, delete) access to a data source via a website. It is the data API for Microsoft Azure, but other organizations use it as well, such as eBay, SAP and IBM.
Before you start, you need to create a folder where you can download data for the purposes of running through the examples. You should pick a folder name that is meaningful for you. Also, be sure to select a location that has plenty of space. In this example, we will use the following location to store data:
To connect to Windows Azure Datamarket, please sign up for a free account using a Windows Live ID. To do this, please visit https://datamarket.azure.com/ and follow the instructions. This may involve activating your account via a link, so please follow the instructions carefully.
Sign in to the Windows Azure Datamarket and navigate to the following URL:
About half way down the page, look for the Sign Up button and select it.
- This will take you to a Terms and Conditions page. After you’ve read the Terms and Conditions, and, if you agree with them, tick the box to specify that you agree, and click on Sign Up.
- This will take you to a Thank You page. Look for the link to Explore this Dataset on this page, and click it.