One question that pops up from time to time is where to find sample datasets for self-learning, creating training materials, or just playing with data. I love this question: I learn by actively trying things out too. I love the stories in the data, and this is a great way to find the stories that bring the data to life and offer real impact.
Since I deliver real projects with customer impact, and those projects are confidential, I can’t show any real customer data during my presentations. Instead, I take one of three approaches:
- I use sample data, and I have a signed NDA.
- I ask the customer for their data, anonymised, and have a signed NDA.
- I use their live data and have a signed NDA.
If the customer elects the first option, then I use sample data from the sources below.
To help you get started, I’ve cobbled together some pointers here, and I hope they’re useful. Please feel free to leave more ideas in the comments.
The latest edition of Entrepreneur has an insightful article on open source (frameworks vs libraries) and it has some good pointers to datasets at the bottom of the page. https://www.entrepreneur.com/article/310965 I’ve also pasted them here for you:
- Medical data
- Mixture of data sets
- National Park Service
- Google Trends
- General Services Administration (GSA)
Bernard Marr has an updated list of datasets on Forbes. I’m not going to steal Marr’s list, so I recommend you head over to his page, where you’ll find sixty-plus options.
Direct Data Source Connectivity from R, Python, Ruby and Stata
R has a number of packages that connect to public datasets, e.g. the World Bank’s DataBank, which allows connectivity from R, Python, Ruby and Stata. I used this for my recent demos at the Power BI event in the Netherlands, and it worked sweetly. So you’d write your script to call the package, embed it in Power BI, and it will go and fetch the data for you. I then create the chart in R and put it into the Power BI workbook.
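As a minimal sketch of what such a script does under the hood, here is a Python example that composes a World Bank v2 REST request URL and flattens a response payload into a year-to-value series. The country and indicator codes are just examples, and the sample payload is hand-made to mirror the API’s shape; a live script would fetch the JSON over HTTP instead.

```python
WB_BASE = "https://api.worldbank.org/v2"

def worldbank_url(country: str, indicator: str, fmt: str = "json") -> str:
    """Compose a World Bank v2 indicator URL for one country."""
    return f"{WB_BASE}/country/{country}/indicator/{indicator}?format={fmt}"

def to_series(payload):
    """Flatten a World Bank JSON payload (paging metadata, then a list
    of observations) into a {year: value} dict, skipping null values."""
    _meta, observations = payload  # first element is paging metadata
    return {obs["date"]: obs["value"]
            for obs in observations if obs["value"] is not None}

# Hand-made sample shaped like a real response (values are made up).
sample = [
    {"page": 1, "pages": 1, "total": 3},
    [
        {"date": "2019", "value": 17345.2},
        {"date": "2018", "value": None},   # gaps come back as nulls
        {"date": "2017", "value": 16854.9},
    ],
]

url = worldbank_url("NL", "NY.GDP.PCAP.CD")  # GDP per capita, Netherlands
series = to_series(sample)
```

In a Power BI scenario, the equivalent R or Python script runs inside the report, and the resulting data frame feeds the visual.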
Quandl offers financial data, and it also provides an R package so you can connect to it directly.
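For illustration, a Quandl request can be composed in much the same way. This Python sketch builds a v3 dataset URL; the dataset code and API key are placeholders, and a live call needs network access and a registered key.

```python
from urllib.parse import urlencode

QUANDL_BASE = "https://www.quandl.com/api/v3/datasets"

def quandl_url(dataset_code: str, api_key: str, **params) -> str:
    """Build a Quandl v3 dataset data URL, e.g. for 'WIKI/AAPL'."""
    query = urlencode({"api_key": api_key, **params})
    return f"{QUANDL_BASE}/{dataset_code}/data.json?{query}"

# Placeholder key and example dataset code.
url = quandl_url("WIKI/AAPL", "YOUR_API_KEY", start_date="2018-01-01")
```

The R package wraps calls like this, so you never compose the URL by hand; the sketch just shows what travels over the wire.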
Kaggle is owned by Google, presumably so that Google can promote TensorFlow. Since people share code based on Kaggle datasets, it’s very easy to pick up code, copy it, change it, and see how it works. However, this can be an issue, since you can’t be sure that the code is correct.
If you’re teaching or presenting using this data and/or sample code, you can be pretty sure that your training delegates have access to the Internet too, so be sure to credit people properly.
Training is not most of what I do, although I do deliver it now and again; I am a consultant first and foremost. I’m meta-tracking my time with DeskTime and Trello to measure exactly how I spend it, and training does not account for a high percentage; project delivery takes the majority of my time.
Although I’m a guest lecturer for the MBA programme at the University of Hertfordshire, and I’m going to be a guest lecturer on their MSc Business Analysis and Consultancy course, I do not consider myself a trainer. I am a consultant who sometimes does training as part of a larger project. I haven’t gone down the MCT route because I regard training as part of a bigger consultancy offering. I never stop learning, and I don’t expect anyone else to stop learning, either.