Data Mining: good data, then do the maths

Data mining can be loosely described as ‘searching for patterns in data’, but it is important to ensure that the data is properly in place before starting. It’s easy to get sidetracked with the selection of Data Mining algorithm, for example; that’s obviously a core piece, but if the data isn’t in place, then we can’t be sure that the results will be correct. This emphasis on data collection and integrity hasn’t always been around, however. It was hard to think of a data mining example where the importance of data made interesting reading! So I looked instead to the history of astronomy; Tycho Brahe (1546 – 1601) laid the foundations for today’s astronomy, by emphasising the rigorous and clean collection of data regarding the planets and the stars, which was a real innovation.

Tycho believed that astronomy could not be pursued by non-rigorous collations of astronomical data. Instead, Tycho believed that astronomy could only be understood through systematically collecting data. This also meant including redundant data, which Tycho was prepared to put the work into completing; he actually conducted this study for almost twenty years, and, amazingly, without a refracting telescope! This is an incredible achievement, since this feat occurred prior to the development of the refracting telescope (Keplerian Telescope), which was created by Kepler in 1611.

Significant to us today, however, is that Tycho produced the most accurate and systematic astronomical data of his time; he successfully managed to note the orbits of the planets to a very close degree. Tycho systematically collected the triangulated locations of the planets and stars throughout the course of the year, believing that the factual, observed data was the only way forward. Tycho produced hundreds and hundreds of statements about the location of each planet over the course of the year, for example ‘On the 15th March 1572, at 2.04am the planet Mars was 32’38” above the horizon, and 12’30 west of the pole star’.

Many people sought after Tycho’s data, and Tycho was much-sought after as a teacher, due to his data. Eventually, Johannes Kepler became Tycho’s apprentice. Kepler took Tycho’s data, and used it to generate his Laws of Interplanetary Motion. Briefly, these laws showed that the planets moved in elliptical orbits, rather than in circular orbits. Newton once described himself as ‘standing on the shoulders of giants’, and rightly so, since he used Kepler’s work to inform his work on gravity. Building on Tycho Brahe’s data, Isaac Newton (1642–1727) later deduced the fundamental mechanisms underlying the movements of planets. Newton’s Three Laws of Motion (uniform motion, force=(mass * acceleration), action-reaction), along with his Law of Universal Gravitation, therefore come directly from Tycho’s original observations.

Until mid-18th Century, the known planets were Mercury, Venus; Earth (obviously), Mars, Jupiter and Saturn. A detailed investigation of the orbit of Saturn showed that it was not an ellipse, but there was a slight deformation. This could be explained by another planet, whose gravitational pull was affecting Saturn’s ellipse. In 1740, Newton’s mathematics were used to create predictions surrounding the existence of another planet were published, and in 1781, Uranus was found, within 2 space minutes of the predicted location. Uranus’ orbit was also shown to have a deformed elliptical orbit, and again, another planet was posited. In 1821, based on predictions made by Newton’s mathematics, Neptune was discovered in 1846.

Thus, Newton’s theory, together with the appropriate accurate data in place, had deduced the existence and precise orbits of two previously-undiscovered planets. The theory didn’t fit all the facts, however; Mercury’s orbit also is not totally elliptical. Newton’s theory required the presence of another planet, located between Mercury and the Sun, in order to affect Mercury’s orbit enough to pull it slightly away from its ellipse. As an aside, this hypothetical planet was proposed to be called Vulcan, which Star Trek fans will know as the home planet of Spock. However, Mercury’s activity was explained by Einstein’s Theory of Relativity.

Central to this story, however, is Tycho’s stringent collection of data, and emphasis on rigorous data integrity. It is important to note that Tycho was an innovator in terms of his determination to collect careful observations of data; until this point, no-one had done this. Without his work and emphasis on clean data, the history of achievements in astronomy might have taken longer to achieve.

To summarise, it is not just all about finding patterns; the data has to be right in the first place. There has to be enough data for trial and test, along with a reduction in missing data, and even repeatable observations where possible. This applies not just to Data Mining, but other spheres too.

Leave a Reply