Useful Data Sources for Demos, Learning and Examples

One question that pops up from time to time is where to find sample datasets for self-learning, creating training materials or just playing with data. I love this question: I learn by actively trying things out too. I love the stories in the data, and this is a great way to find the stories that bring the data to life, and offer real impact.


Since I deliver real projects with customer impact, I can’t demonstrate any real customer data during any of my presentations since my projects are confidential, so I have three approaches:
  • I use sample data and have a signed NDA.
  • I ask the customer for their data, anonymised, and have a signed NDA.
  • I use their live data and have a signed NDA.
If the customer elects the first option, then I use sample data from below.
To help you get started, I’ve cobbled together some pointers here, and I hope it’s useful. Please feel free to leave more ideas in the comments.


The latest edition of Entrepreneur has an insightful article on open source (frameworks vs libraries) and it has some good pointers to datasets at the bottom of the page. I’ve also pasted them here for you:
Bernard Marr has an updated list of datasets here, on Forbes. I’m not going to steal Marr’s list, so I recommend that you head over to his page, where you’ll find sixty-plus options.

Data Source Direct Connectivity to R, Python, Ruby and Stata

R has a number of packages that connect to public datasets, e.g. the World Data Bank, which allows connectivity from R, Python, Ruby and Stata. I used this for my recent demos at the Power BI event in the Netherlands, and it worked sweetly. So you’d write your script to call the package, embed it in Power BI, and it will go and get the data for you. I then create the chart in R, and put it into the Power BI workbook.
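To give a flavour of what such a script might look like, here is a minimal sketch in Python rather than R, using only the standard library. It assumes the public World Bank v2 REST endpoint and its usual JSON shape (a two-element list: paging metadata, then the records). The function names are hypothetical, and the indicator code SP.POP.TOTL (total population) is just an example. This is not the script I used in the demos.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the public World Bank v2 API.
WORLD_BANK_API = "https://api.worldbank.org/v2"


def build_url(country: str, indicator: str, per_page: int = 100) -> str:
    """Build the request URL for one country and one indicator."""
    return (
        f"{WORLD_BANK_API}/country/{country}/indicator/{indicator}"
        f"?format=json&per_page={per_page}"
    )


def parse_observations(payload):
    """Turn the decoded JSON into sorted (period, value) pairs.

    Assumes the payload is [metadata, [records...]]; anything else
    (for example an error message) yields an empty list.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    rows = [
        (record["date"], record["value"])
        for record in payload[1]
        if record["value"] is not None  # skip periods with no data
    ]
    return sorted(rows)


def fetch_indicator(country: str, indicator: str):
    """Download one indicator series (needs Internet access)."""
    with urlopen(build_url(country, indicator), timeout=30) as response:
        return parse_observations(json.load(response))


# Example (not run here, because it needs the network):
# series = fetch_indicator("NLD", "SP.POP.TOTL")
```

Keeping the parsing separate from the download means you can try the logic on a saved response without being online, which is handy when the conference Wi-Fi lets you down.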


Quandl offers financial data, and it has a hook so that R can connect directly to it as well.


Kaggle is owned by Google, presumably so that Google can promote TensorFlow. Since people share code based on Kaggle datasets, it’s very easy to pick up code, copy it, change it, and see how it works. However, this can be an issue, since you can’t be sure that the code is correct.

Final Note

If you’re teaching or presenting using this data and/or sample code, you can be pretty sure that your training delegates have got access to the Internet too so you need to be sure that you credit people properly.
Training is not most of what I do, although I do deliver training now and again. I am a consultant first and foremost. I track my time with Desktime and Trello so that I can measure exactly how I spend it; training does not account for a high percentage, and project delivery takes the majority of my time.
Although I’m a guest lecturer for the MBA program at the University of Hertfordshire, and I’m going to be a guest lecturer on their MSc Business Analysis and Consultancy course, I do not consider myself a trainer. I am a consultant who sometimes does training as part of a larger project. I haven’t gone down the MCT route because I regard training as part of a bigger consultancy route. I never stop learning, and I don’t expect anyone else to stop learning, either.


How do you choose the right data visualisation in Power BI to show your data?

How do you choose the right visualisation to show your data? Usually the customer wants one thing, the business user wants something else, the business sponsor wants something flashy… and it’s hard to tease out the requirements, and that’s before you’ve even opened up Power BI, Power View, Excel, Tableau or whatever your preferred data visualisation software is.

In other words, there are simply too many charts to choose from, and too many requirements to meet. Where do you start?

I found this fantastic diagram which can help you to choose the right visualisation. I’m often surprised to see that people haven’t seen this before. Note: this diagram was done by Andrew Abela of Extreme Presentation and the source is here and his email address is on the slide, so be sure to thank him if you’ve found it useful. If you can’t see it very well, click here to go to the source.


Chart Choosers should not replace common sense, however, and Naomi Robbins has written a nice piece here which is aimed at the wary. However, diagrams like Abela’s can really help a novice to get started, and for that, I’d like to thank him for his work.

How does it relate to Microsoft’s Power BI? If you look at the visualisations that are available in Power View, you can see that most of the visualisations in the diagram are available in Power BI. The ones that are excluded are the 3D graphs, circular area charts, variable width charts and the waterfall chart.

Why no 3D? I personally hope that Microsoft will leave 3D out of Power BI tools, unless of course it is in Power Map. With 3D on a chart, it is harder to identify the endpoints, and it can take us longer to read the chart. It might also mean that points are occluded. If you’re interested and want to see examples, here is one by the Consultant Journal team or you can go ahead and read Stephen Few’s work. If you haven’t read anything by Stephen Few, get yourself over to his site right now. You won’t regret it. Why is it different for Power Map? 3D maps provide context, and they are the exception where I will use 3D for a data visualisation showing business data. I’m obviously excluding other types of non-business data here, such as medical imaging and so on.

Why no circular area or variable width charts? I am not a fan of variable width or circular area charts because we aren’t very good at evaluating area when we look at charts and graphs, and Robert Kosara has an old-but-good post on this topic here.

This blog is mainly for me to remember stuff but I hope it helps someone out there too.

Best Wishes,

Data Visualisation with Hadoop, Hive, Power BI and Excel 2013 – Slides from SQLPass Summit and SQLSaturday Bulgaria

I presented this session at SQLPass Summit 2013 and at SQLSaturday Bulgaria.

The session covers some data visualisation theory and an overview of Big Data, and finishes with the Microsoft distribution of Hadoop. I will try to record the demo as part of a PASS Business Intelligence Virtual Chapter online webinar at some point, so please watch this space.

I hope you enjoy it, and I look forward to your feedback.

Precons: What we learn by listening

Where words are scarce, they are seldom spent in vain. (Shakespeare, Richard II, Act 2, Scene 1)

This quotation, from Shakespeare’s Richard II, sums up a lot of my thinking about Business Intelligence generally. It’s always about people. As a consultant, presenter, speaker or panellist, I’m there to listen as well as talk. I have posted my Data Visualisation and Business Intelligence pdf notes here. If you cannot access it, please let me know at jen.stirrup [at] copperblueconsulting dot com.
I used the slides as the basis of my Precon in Poland at the Poland SQLDay event. The precon is mainly demo, and trying to solve delegates’ business problems on the day. I like to suggest a few common business problems, and demonstrate different ways I’ve solved them.
As a Data Visualisation practitioner, I believe in visualisation – but I also believe in listening, too. I do not like to PowerPoint people to death.  I like to do end-to-end – so I start from the ground up. In other words, I set out with the business problem, and then walk through to the end result. Sometimes these business problems are issues which delegates didn’t know that they had, until they’ve thought about it whilst attending the precon! I like to try and do useful things that people can apply to their own environments when they get home. I always have my own ideas about things that I’d like to show people, but I welcome attendees to bring their own thoughts and issues about their current and future business problems. After all, Business Intelligence is about people.
Since I do a lot of the demo work pitched at the audience needs, I don’t have that many slides. I can produce more if I need to do so, but I always try and get a balance between demo and slides. 
In this way, although I’m talking a lot during the day to the delegates, I am also listening to the delegates as well.  I believe that this is vital to the success of the day. If I can help with a specific business problem that the other delegates are interested in, then this is a good day for me since I’ve made a difference somewhere.
People often ask me: how can you get up on stage, and speak in front of so many people?  I don’t hold with the ‘I talk, you applaud’ school of presentation at all. Here is my answer:

Courage is what it takes to stand up and speak; courage is also what it takes to sit down and listen. Winston Churchill

Sometimes I get questions to which I don’t know the answer. SQL Server now covers such a wide arena of users that I think it is no longer possible to know everything about it unless you actually work for Microsoft (and I don’t).
Fundamentally, however, I have to agree with Hemingway: to listen, you’ve got to want to hear the message. I always respect my delegates. Even if they’re not the ones talking, they live, breathe and sleep SQL Server and related technologies all day, every day. They deserve our respect and in order to serve them best, it is up to me to listen. So my slide deck isn’t extensive; but I hope that the attendees got something out of the day, and it was my pleasure to work with them.

I like to listen. I have learned a great deal from listening carefully. Most people never listen. Ernest Hemingway. 


Colour-blindness – why does it happen, and how can data visualisations help?

How people perceive colour is an interesting issue. The Young-Helmholtz Trichromatic theory of vision proposed that we have red/green/blue receptors, which are then combined to show different colours.  It is thought that the red-green receptors are close together, and perhaps this is the root of the issue. Needless to say, the issue is complex but interesting.
According to popular wisdom, it is thought that the red-green chromatic channel developed in order to provide an evolutionary advantage for picking out ripe fruits against a background of foliage. Tell that to your children, next time they refuse to eat fruit! However, this ‘ripe fruit’ theory has been difficult to observe in field studies. One group of researchers conducted field studies in black-handed spider monkeys, and found that luminance contrast was just as important in distinguishing fruits. If you’re interested to read more, here is an interesting study that illustrates the complexities of perception, which involves the field study of primates. On the other hand, a separate study showed that trichromatic primates found it easier to determine and select ripe fruits, and you can find more information here.
How does this impact data visualisation? It is possible to produce visualisations that make the most of luminosity in order to encode values, along with the size of the data point, in order to convey the message of the data visualisation. Another issue is that determining colour and luminosity can be subjective, and point size may help to provide additional cues. I think of it as being like judging the ripeness of fruit in viticulture: one winemaker might decide that a grape reaches its optimal ripeness at one point in time, and another viticulturist might decide that it does so at another. Similarly, it isn’t always easy to ask experimental subjects to ascertain the amount of ‘greenness’, ‘redness’, or ‘blueness’ of a point. There has been some work in computer vision, aimed at distinguishing the RGB in fruit, which is interesting to read.
It is suggested that about 8% of males (roughly 1 in 12) are colour-blind, which means that they are restricted from using the red-green channel. If you are interested in reading more about the experience of a colour-blind person, please do read this entertaining blog by Geoffrey Hope-Terry.
To summarise, data visualisations can therefore augment understanding by assisting the perceptual processes involved in luminance and the blue-yellow colours. It is also possible to use the size of the data point to convey the message of the data. In other words, data visualisations should aim not to exclude members of the audience by including lots of red and green together.
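If you want to check how much two colours differ in luminance, here is a rough sketch in Python. It assumes the WCAG definition of relative luminance for sRGB colours, which is one common way of putting a number on luminance; the function names are my own. It does not simulate colour-blindness, so treat it as a quick sanity check rather than a guarantee.

```python
def _linearise(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an (R, G, B) colour: 0.0 is black, 1.0 is white."""
    r, g, b = (_linearise(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(colour_a, colour_b) -> float:
    """WCAG contrast ratio between two colours, from 1 (none) to 21 (max)."""
    lum_a = relative_luminance(colour_a)
    lum_b = relative_luminance(colour_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Example: compare two hypothetical series colours before using them together.
# print(contrast_ratio((214, 39, 40), (44, 160, 44)))
```

A pair of colours with a contrast ratio close to 1 relies almost entirely on hue, so that is the pair to reconsider, or to back up with point size.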

Mobile Business Intelligence – Try it out!

Thank you to everyone who attended my SQLBits ‘Mobile Business Intelligence in Action’ session recently. If you are interested to try out Mobile Business Intelligence on your iPad or mobile device, here are the links below:

Jedi Knight Actuals of UK Census 2001 Dashboard 
Jedi Knights Percentage of UK Census 2001 Dashboard
AdventureWorks Sales by Geography Dashboard
AdventureWorks Actuals Sales
AdventureWorks Analysis Dashboard

I haven’t tried this on every browser and every device, so I would be very interested in your feedback.
I look forward to hearing from you. Please leave a comment below, or email me at jenstirrup [at]

Representing data about the iPad

The current blog will take three different ways of representing the same data set, in order to see how it can be done simply and clearly – or not so clearly. I have taken some samples, and reworked them as a progression throughout this blog.

Although I am discussing the iPad here, this is not a preview of the iPad and Mobile Business Intelligence sessions which I’m delivering at SQLBits in October, or of my User Group sessions in Leeds and Surrey this year; however, the iPad is obviously very much on my mind, hence the tangential topic of this blog!

The dataset is interesting because it aims to show the impact of the iPad announcement on notebook sales. This study was conducted by NPD, Morgan Stanley Research. CNN Money has written a short article on the impact of the iPad on netbook sales, which proposes that the iPad is at least ‘partially’ responsible for the decline in netbook sales. The rather dramatic bar chart, which underlines this point, is given here:
There are a few issues with the bar chart:

– The axis doesn’t go from 0 – 100%, which I would expect, given that it is supposed to show percentages. This skews the results slightly; for example, the 70% seems higher.
– The 3D gradient effects don’t add anything. Sometimes 3D can make an image look more ‘pretty’. Here, the 3D does not add anything ‘pretty’ or enhance anything about the message of the data.
– It’s not clear why the data has been represented as distinct categories when time is continuous rather than discrete.
– The big pink arrow shouldn’t have been necessary; the graphic should have been enough.
– There is nothing to make the negative value stand out, or to distinguish it in any way.
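The first point in the list above is easy to put a number on. Here is a small sketch in Python that works out how much of the plot height a value fills for a given axis range. The axis limits in the example are hypothetical, since I am not reading them off the original chart.

```python
def fraction_of_axis(value: float, axis_min: float, axis_max: float) -> float:
    """Share of the plot height that a value reaches on a linear axis."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return (value - axis_min) / (axis_max - axis_min)


# Hypothetical axis limits, for illustration only.
full_axis = fraction_of_axis(70, 0, 100)   # axis runs from 0 to 100%
short_axis = fraction_of_axis(70, 0, 80)   # axis stops at 80%
print(f"0-100% axis: the bar fills {full_axis:.1%} of the height")
print(f"0-80% axis:  the bar fills {short_axis:.1%} of the height")
```

On the shortened axis the same 70% value fills more of the chart, which is why it looks higher than it should.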

There have been other examples of the same data, re-visualised. Here is an example from a wonderful infographic, which has been completed by the Focus Group. I have taken an excerpt of it here since the whole infographic is not the focus of this blog:

iPad and Notebook sales by the Focus Group

The above infographic solves some of the issues of the earlier version, which was reproduced by CNN Money.

– There is no 3D
– The big arrows have gone

However, although it is visually appealing, it does repeat some of the earlier issues found in the CNN Money chart, since the scale still does not reach 100% on the Y-axis. Further, it also introduces some new issues:

– The black background might be visually appealing, but as a ‘best practice’, a white background is better. This allows the representation of the data to dominate the scene, not the background or other non-necessary items.
– Hatched lines replace the arrows, to denote the time of the announcement of the iPad and the actual release of the iPad. This is an issue because it is slightly jarring to the eye.
– The month timeline isn’t evenly marked in terms of months; it is therefore difficult to ascertain if the data is skewed horizontally in any way.

In order to improve these representations of the data, I have used Tableau in order to create a simple line graph. This was all that was needed in order to get the message across, without skewing it or obscuring it in any way. Here is my example below, which can also be found on the Tableau Public website:

iPad and Notebook Sales

I have removed the issues found in the earlier visualisations and added some further enhancements:

– highlighted the negative growth percentage with red colouring
– added in clean annotations which do not obscure other parts of the data visualisation
– ensured that the Y-axis shows 100% so that the data is not skewed
– used a line graph since the X-axis is continuous, not discrete
– removed the black background to emphasise the components of the data that provide the message of the data

Although the data visualisation has been improved, there are still contextual questions which the graph cannot answer:

– what about the impact of the iPhone, or other tablets?
– what about the impact of the time of year e.g. post-Christmas sales?
– what about the impact of the impending recession?

Therefore, the initial analysis as described by CNN Money simply provided a ‘headline’ message, and further analysis would need to be conducted in order to answer the question more fully. That said, a proper visualisation of the data is a useful tool towards getting the ‘bigger picture’ right, as well as the ‘smaller picture’.

I hope that this was interesting, and look forward to your comments.
Jen x