Obtaining Sample Data Sources

One of my readers, Misbah, wrote recently to ask about obtaining sample data sets. I’ve done some research on this – Misbah, this post is dedicated to you and I wish you all the best in your studies. Below, I discuss some of the problems in obtaining data sets, along with some resolutions.

There are different issues in obtaining data, which are summarised here:

Confidentiality – in my career, I’ve come across some amazing data sets. However, out of respect for the confidentiality and sensitivity of the data, I never share it.
Accuracy – it can be difficult to obtain data that has been rigorously collected.
Data types – psychological data can be more difficult to obtain than true ratio data. For example, a question such as ‘on a scale of 1 to 5, how happy are you today’ cannot produce true ratio data; the answer is simply more of a label.

How is it possible to get some reliable, free data sets that are easy to use and free from confidentiality constraints? Well, here are some resources which I like to use for sample data sets:

The Guardian Datastore – this has plenty of sample data sets on everything from security, war and MPs’ expenses to fun things such as chocolate sales. Some of the sample Tableau images on this blog have used data from this source.

The London Datastore – this has plenty of London-focused data sets.

Good old Excel also has the RAND and RANDBETWEEN functions. Both are volatile functions, so they recalculate whenever the worksheet recalculates, and they can be used to fill a spreadsheet with random sample data.
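
If you would rather script the same idea outside Excel, here is a minimal sketch in Python. It assumes you only need an id column, a random fraction (comparable to RAND) and a random whole number (comparable to RANDBETWEEN); the function names make_sample_rows and rows_to_csv are made up for illustration. Unlike Excel's volatile functions, passing a seed gives you the same data every time, which can be handy when you want a visualisation to stay put.

```python
import csv
import io
import random


def make_sample_rows(n_rows, low, high, seed=None):
    """Return n_rows of (id, fraction in [0, 1), whole number in [low, high])."""
    rng = random.Random(seed)  # a fixed seed makes the data repeatable
    rows = []
    for i in range(1, n_rows + 1):
        fraction = rng.random()         # comparable to Excel's RAND()
        whole = rng.randint(low, high)  # comparable to RANDBETWEEN(low, high), both ends inclusive
        rows.append((i, fraction, whole))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "fraction", "whole"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


# Five rows with whole numbers between 1 and 100
sample = make_sample_rows(5, 1, 100, seed=42)
print(rows_to_csv(sample))
```

The CSV text can be saved to a file and opened in Excel or Tableau like any other sample data set.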

Another place to look for data is the Tableau Public website. Unlike the Google Data Explorer, Tableau allows you to download the data as well.

A final place to look is Swivel, which describes itself as a YouTube for data.

I hope that this helps you to get some sample data for the visualisations.


4 thoughts on “Obtaining Sample Data Sources”

  1. Hi Jen, good post, I hadn't come across Swivel before.

    One thing I would add: if you can't find an external data set that suits, create your own. You can stress test performance and create boundary cases, which is vital for testing.

    In SSIS, start with a script component as a kicker and then use derived columns to change your data as required.

  2. Jen, you gave me a new idea. You mentioned that you use the London Datastore; I am blogging about Russia, so maybe I should try to find some nice data about Moscow. The data situation in the UK and Russia is not the same, but I am sure I will be able to find something interesting. Thanks
