Some Fun – a Film Critics Dashboard!

It’s the time of year again, where the film critics and pundits try to guess who will be awarded an Academy Award©, otherwise known as an Oscar, at this year’s event.  In order to explore the numbers behind the Oscar© guesswork, I’ve tried to create a ‘Film Critic dashboard’ which based on Guardian Datastore data on awards made by awarding committees such as the European Film Awards, Golden Globes awards and Screen Critics awards in  Jan 2010. These awards have been achieved by luminaries in the film industry for 2009; these include British actress Kate Winslet, Meryl Streep and George Clooney, amongst others.
It’s possible to provide an interesting dashboard, produced using Tableau, which explores the numbers, or counts, of the awards achieved by film or by individual. The dashboard created below has used color in order to distinguish films for each other. For example, ‘Inglorourious Basterds’ is denoted in orange in the ‘Best Film’ and ‘Best Actor/Actress’ graphs for consistency.
Further, the color palette has been chosen with colour-blind individuals in mind. Figures show that approximately 10% of males and 1% of females are color-blind, which means that they have problems in distinguishing between green and red (Few, 2008). Further, the colour ‘red’ has different culture connotations. In the US, red can be associated with jeopardy, as we can see in the phrase ‘caught red-handed’.  This perception is in contrast with China, where red is considered a lucky colour. With this in mind, Tableau automatically provide a ‘Color-blind’ palette, which means that it is possible to try and accommodate issues surrounding color-blindness for viewers affected with this condition. An example dashboard, using a color-blind palette, is given below, and please click here for a larger version:

Overall Film Dashboard

It is also interesting to note that the only actor to have won any ‘Best Supporting’ Actor awards this year has been Christoph Waltz. Similarly, the only winner of major ‘Best Supporting Actress’ awards has been Mo’Nique, for her role in Precious. If the past is a correct indicator of future wins, then they had better get their ‘Winner’ speeches ready for the Academy Awards©! Please click here for a larger version, if required. 

Best Actor Actress Award Winners

The bottom left hand side of the graph shows a count of ‘Best Film’ awards given to different films. It is possible to see that ‘The Hurt Locker’, directed by Kathryn Bigelow, has won more ‘Best Film’ awards than the other films. The example below allows the graph to be seen in more detail, and please click here for a larger version:

Best Film Award Winners

Some contextual information is accommodated by provisioning a list of the Academy Award© wins by the actors and actresses who have won awards in 2009. Here is an example below:

Academy Awards To Date

Again, the intensity of color is used to identify the holder of the most Academy Awards©. It’s clear to see that James Cameron tops the list with a total of three Oscars© to his name, with George Clooney and Meryl Streep holding second place with two awards each.  If past Academy Award© achievements are an indicator of future wins, then we could guess that these two actors are a safe bet. But who knows?
Finally, it was interesting to see the overall awards achieved by films. The ‘Best Film’ award winners were shown in the bottom half of the dashboard, and the bottom right-hand heat map shows the ranking of film overall award winners. This view has been expanded over the previous ‘Best Film’ award winners to include awards such as ‘Best Screenplay’ and ‘Best Score’.  The heat map is provided here for review:


Total Film Awards

When other award types are included, it is possible to see that ‘Crazy Heart’ is the winner, with ‘The Hurt Locker’ and ‘Inglourious Basterds’ following quickly behind.
With this data in mind, we can watch the Academy Award© achievements with interest, and see how this dashboard and corresponding data has helped –or hindered – the correct prediction of the outcome!

Add to Technorati Favorites

Data Visualisation: Beauty and Clarity?

This blog post focuses on an analysis of the donations made, by country, to assist the healing of Haiti after the devastating series of earthquakes in January 2010.  The process of tagging donations as specifically coming from organization, governments and countries is complex, and not as quick as one might expect. For example, the European Union has an emergency aid fund, but so do most of its members. So, for example, Germany might donate to the EU fund, and provide its own donation as well. This makes it difficult to track exactly where the generous donations are coming from, due to the actual route of the donation. Further, the type of donation can also make things more complex. For example, Canada citizens will have their donations matched by a donation by Canadian government – in addition to the funds that the kind-hearted Canadian government has already pledged as an emergency response to funds required immediately. This means that the actual full amount of donations is still to be determined since funds are still coming in from different sources within Canada in mind.
The data community is finding its own way of determining donation amounts. Jer Thorp uses the concept of an ‘Avatar minute’, is how much of the film that a viewer would be able to watch with a given amount of money. For example, 16 cents would purchase you about 3 minutes of the film, for example. So, Thorp’s analysis is broken down into two pieces: the first part aims to show each government’s donation is worth per head, and secondly, how many minutes of Avatar each citizen would be able to watch with the donation.  The ‘Avatar minute’ concept has been applied to visualizations of Haiti Earthquake donations at Jer Thorp’s website and if you haven’t had a look yet, I would recommend it. The images are incredibly beautiful, like this one below, which aims to show the amount of donations pledged by each country in ‘Avatar minutes’. All credit (and copyright) is given to Jer Thorp for this beautiful image:


Thorp’s image is undoubtedly beautiful, so the image met its objective as displaying data as a beautiful piece of art. Displaying information beautifully is emerging field, and Microsoft’s Pivot project is an example of this phenomenon.  Further, Stephen Few provides an analysis in a recent blog post.
As a data analyst, however, the visualisation raised more questions. For example, the imagery made it difficult to compare donations between countries. In the above example, the size of the letters seemed to indicate the donation amount pledged in ‘Avatar minutes’. However, the analysis didn’t seem to make it clear that Luxembourg, Denmark and Guyana made similar donations per head, partly because the word length and letter size made it difficult to compare between them easily. Comparison was made more difficult by the 3D imagery, which adds beauty but didn’t add clarity. As Stephen Few points out, 3D imagery can make data more difficult for viewers to decode. In the above image, it isn’t clear whether the length or width of the film reel has a relation with the donation amount, or whether it is underlying the headline point about the concept of ‘Avatar minutes’.
With these issues in mind, the data has been re-visualised here using Tableau in order to clarify underlying patterns in the data, and make comparison straightforward. With this in mind, I set out to use the same data, from two sources: ReliefWeb and the ever-excellent Guardian Data Source. Before we look at the visualization note that, with the aforementioned issues in mind, the underlying data set has been restricted to include indubitable funds from different countries at the time of writing. This has been done to focus the ‘business question’ of the graph, this excludes money from individuals, organizations, and money that hasn’t been tagged definitively as coming from a specific source yet. Further, this data from ReliefWeb was accurate as at 21st January, 2010, and it is clear that this situation regarding inbound donations is subject to rapid change. The sample dashboard can be found here, or a larger sample can be obtained by clicking here:
Overall Dashboard

One single unified dashboard, with a number of graphs showing different views of the data, can allow the viewer to compare and contrast patterns more easily. For example, the top graph undoubtedly shows that the United States have provided the most funds, and a larger image can be obtained by clicking here:
Top 10 Donations by Country
The sheer size of the US Government donation is further clarified by the right-hand image, which shows that the United States have provided 5 times the individual donations of France and the UK, which have provided second and third highest donations respectively.
Top 3 Donations
When population numbers are added to the mix, then some additional features of the data are highlighted. The following graph shows population as the thin blue line, and donation amount as a thicker blue background line. For example, it’s possible to see that, at the time of writing, Australia and Sweden have given proportionally more per head than the other countries, since their ‘donation’ line is longer than their ‘population’ line. A larger sample can be obtained by clicking here:

Haiti Total Donations by Country and Population
This is further quantified by the final section of the dashboard, which shows the amount of donation, per head, provided by each government. This was calculated simply as: [Donations in US Dollars]/[Population] The image is provided next:
Donor Ranking
The donation numbers have been included, and the proportion is also indicated by the strength of the blue colour; in other words, the stronger the blue colour, the higher the population.  This allows  the user the facility to look at the numbers, or just have an ‘at-a-glance’ view in order to obtain the ‘sense’ of the data. It’s possible to see that Australia, Sweden and Norway have given more then one dollar per head, at the time of writing (21st Jan, 2010).
To summarise, the initial data visualization was incredibly beautiful. However, the features that made it beautiful, also served to make the underlying data less clear. To conclude, in information visualization, there can be a payoff between beauty and clarity, but both domains can live side-by-side to serve different needs, and to learn from each other.

New DATE Format in SQL Server 2008

I’m currently using SSIS to import data into Replicon, editions of timesheet software that track time and expenses for attendance, payroll, project costing and client billing. As expected, there are always some opportunities to pick up experience along the way.
For some columns, Replicon expects the imported date format to be in a strict date format, i.e. 2010-10-01., or in other words, the date part only. In previous versions of SQL Server, this would mean whittling the time portion from the normal SQL Server datetime datatype, and inserting the remaining date format into Replicon.
The good news is SQL Server 2008, there is a nice new date format which is simply called DATE. This is much more readable, and here is an example of it below:

select CONVERT(date, getdate())

For today, 20th January 2010, this would provide the following result:


Then it is possible to use this new DATE format to extract data from the source system, and use SSIS to write the information to a CSV file. However, when the information is written to a CSV file, the output to the CSV file might have the time portion added back into it, for example:

2010-01-20 00:00:00:000

Why does this occur? Well, if the output column type is set to anything else other than DT_DATE, then the output will include the time as well as the date. This holds true even if the original SQL query, extracting the data, takes the DATE format; when exporting to a CSV file, SSIS will just add the time portion to the end of the date portion to provide a datetime format.
This makes sense, since the resulting target CSV has been configured to provide a datetime format. So, SSIS is simply doing what it has been asked to do: it adds a default HH:MM:SS:ms is added to the date format to ensure that the CSV file is created in accordance with the configured column date types.
For Replicon, the date format YYYY-MM-DD HH:MM:SS:ms is invalid since it expects the date only, i.e. YYYY-MM-DD. So, this part of the import would fail.
So, it is important to ensure that the DATE format is preserved when the output is written to a CSV file. In the Flat File Connection Manager Editor, it is important to double-check that the column data type for the DATE is set to date (DT_DATE). A sample image of this requirement is provided below, and if it helps, please click on the image if you need to a larger image:
Hope that helps!

Add to Technorati Favorites

Analysis Services 2008: Issue and Resolution #2

Here is another Analysis Services error message which I’ve seen crop up from time to time:

Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: ‘dbo_mytablename’, Column: ‘ColumnName’, Value: ‘UNK’. The attribute is ‘ColumnName’.

There are a number of ways to get rid of this. I’m in favour of sorting out the source, but if the source does not belong to you then it can be difficult to take the action that you might like.

1. Sort out the source.

You may not be the ‘owner’ of the source. If you are, then there must be duplicates in the source somewhere, and this will require some digging around to discover the duplicates. This should be easy to discover using basic counts and grouping in SQL, and sorting the count in descending order so that the offending keys bubble to the top of the SSMS screen.

2. Get rid of NULLS.

If amending the NULLs isn’t an option for you, then perhaps you can circumvent this issue by using a VIEW in the DSV.

Note that SSAS will convert nulls to zeros. So, if your key contains NULLs, you will end up with these items converted to zeros. Thus, if you have more than one NULL, your column will contain lots of zeros; hence the duplicate attribute key error. The conversion to zeros might well be hiding NULLS in the source system, so it’s always good to have a look in the source column.

3. Analysis Services properties

In the properties, set errorprocessing to default or change keyduplicate setting to ‘ignore’. This might be a ‘quick fix’ but it isn’t ideal since it may well result in issues later on .

I hope that helps!

Add to Technorati Favorites

Creating a new database from a backup

Here’s a quick note on how to take a backup of an existing database, and restore it to a brand new database where you don’t have a copy of the schema.
A few preliminaries:
If you try to do this without using REPLACE, then you might get the following error message:
Error 3154: The backup set holds a backup of a database other than the existing database
You may have to rename the backup to give it a file extension of *.bak
There are a number of ways to create a new database from a backup.
Here is a quick script that takes a backup called star.bak, and uses it to make a database called Star:


FROM DISK = ‘C:\Star\Star.bak’
Alternatively, it is possible to do this using the interface. Here is the ‘WITH REPLACE’ option that you need to use:

You may also need to give the file an ‘owner’. Here is the location where the valid file owner is set:

Data Visualisation of Worldwide Chocolate Sales – Kraft and Cadbury

Despite the current economic difficulties, chocolate sales are continuing to rise throughout the world (Mintel, 2010). For example, British chocolate lovers have increased sales by nearly 6% from 2007, and in China, chocolate sales rose by 18%. Currently, chocolate sales remain in the headlines with the potential takeover of Cadbury by Kraft Foods Inc. In order to understand the sales and brand value numbers better, and to understand the value in combining Kraft and Cadbury, Tableau data visualization software was applied to the sales and brand value amounts for 2008.

The results of the data analysis resulted in a dashboard, which can be seen below. Since the brand value and actual worldwide sales metrics are related, it makes sense to have them on the same dashboard since the relationships allows the viewer to understand more about the relationships between the data. The dashboard shows an analysis of the worldwide sales amount and brand value amount for chocolate sales by company for 2008. The top graph displays the worldwide sales for the major players, sorted in descending order. The bottom graph displays the brand value for individual Cadbury and Kraft brands. To be clear, the ‘brand value’ is the value that a brand is considered terms of income, potential income, reputation, prestige, and market value’. Please click here for a larger image, if necessary.

resize overall dashboard - chocolate worlwide sales and brand value comparison

Each graph has horizontal bars, which have different colour blocks. The darkest block shows 0% – 60% of the amount, the medium coloured block shows 61% – 80% of the total amount, and the lightest colour is reserved for over 80% of the total amount. Finally, at the bottom right hand side, there is a ‘football league’ of brand values, regardless of their company, at the bottom right hand side, which has been sorted to show their brand value amounts in millions sterling, in descending order.
To quote Stephen Few, ‘Comparison is the beating heart of data analysis’ (Few, 2009, p55). When users compare data, they are really doing two things: looking for similarities, and looking for differences in the data. One common area of comparison is looking at magnitudes, and Tableau easily offers the ability to look at magnitudes in a user-oriented way. This can clearly be seen in the top graph, which is highlighted below, and a full version can be reviewed by clicking here.

resize world chocolate sales

In the example of the ‘World Chocolate Sales’ graph, the length and the width of the bars in the bar chart allow us to easily compare sizes. This is clearly seen in the top graph; the magnitude of worldwide chocolate sales, by company, is expressed by the length and the width of the bar. So, to illustrate, Mars Inc have the longest and widest bar, since Mars was the overall winner in terms of worldwide chocolate sales. In contrast, Kraft Foods Inc has the shortest and thinnest bar, since they were ranked last out of the top 5 chocolate vendors.
A simple sort of the worldwide chocolate sales data reveals something counter-intuitive: the sort shows that Cadbury is the second-largest company in terms of chocolate sales in the world, and additionally that Cadbury is shown to have more worldwide chocolate sales than Kraft. This analysis is further enhanced by the gradations in the bar colour, which help the viewer to identify the magnitude of sales. These gradations show that Kraft’s total sales are less than 60% of Cadbury’s worldwide total sales. This is surprising, given the recent takeover bid by Kraft. Tableau has helped us to understand this counter-intuitive feature of the data by facilitating a very simple sort. In the words of Stephen Few, ‘Don’t underestimate the power of a simple sort’ (Few, 2009, p61).
Given this surprising result in the data analysis, what further insights can be revealed regarding why Kraft, the fifth-largest chocolate company in terms of sales, is interested in taking over the second-largest chocolate seller, Cadbury? A further sort of the brand value data, as displayed in the bottom right hand gauge, might help reveal the answer to this question; the graph is given below.

Kraft and Cadbury Brand Value Gauge

It is clear to see that Kraft have a greater amount of brand value than Cadbury. Here, the magnitude of the brand value is reflected in the colour; the darker brown shows the higher value, and the lighter value shows the smaller brand value for Cadbury PLC. Further Analysis shows that each vendor has one outstanding brand which serves as an outlier, pulling up the overall average. Tableau allows the data analyst to easily add in a reference line, which shows the average. Further, a line graph has been added in order to facilitate comparison of magnitude of the brand value differences. A large version can be found by clicking here.

resized cadbury and kraft brand value

To show consistency with the earlier brand value finding, the brand value is ranked in two ways: Kraft Foods Inc have been placed at the top half of this graph, and Cadbury has been placed at the bottom of this graph to reveal that, overall, the brand value of Kraft was greater than that of Cadbury. This visualisation may be surprising, since it shows that Kraft’s brand value is greater than Cadbury’s brand value, even though the worldwide sales for Kraft are less than Cadbury’s.
It is also interesting to work out the brand values without looking at the company name. This has been supplied in the bottom right hand gauge entitled ‘Brand Gauge’. The example is given here:

Brand Gauge

Kraft has the greatest brand value in its Milka Range, followed by Cadbury’s Dairy Milk. It is interesting that Cadbury’s Creme Egg features on the list at all, since it is only sold for the first half of each year.
Looking forward, it is clear to see that, by purchasing Cadbury, Kraft would acquire the brand value of Dairy Milk, which is significantly higher than many of its own brands. Further, if the chocolate sale worldwide of each company was added together, this would create a new leader in chocolate sales since this would dwarf the worldwide chocolate sales of the current leader, Mars.
To summarise, two items have been covered: an analysis of recent chocolate sales and brand value data for 2008, along with some thoughts around the sales and brand value for a company combining both Kraft and Cadbury. It has been shown that Tableau as a data visualisation tool served to facilitate an understanding of the sales and brand value data, in addition to stimulating speculation over the future.

Mintel, 2010. Chocoholics Unite as Chocolate Sales Worldwide Defy Recession

Stephen Few (2009). Now You See It: Simple Visualization Techniques for Quantative Analysis (Hardcover) Analytics Press

Add to Technorati Favorites