5 Factors to Consider Before Pouring Data into Your Data Lake

As organizations move into a Big Data world, many projects will include a Data Lake component. What is a data lake, and how do we get our data into the Data Lake?

There are many different interpretations of a Data Lake, but the one given in Attunity and Hortonworks’ jointly produced Data Lake Adoption and Maturity Survey Findings Report is a good explanation, since it refers to the Data Lake as a strategy as well as an architecture.

The report defines the data lake as an architectural strategy and an architectural destination, thus addressing both the end-state architecture and the adoption and transformation strategy for data-architecture decisions on the journey to the data lake.

For the investment in a Data Lake to pay off, the data must first get into it. If you believe the hype, organizations should simply be able to pour the data into the Lake without being concerned with joining it together. Things are not that simple, however. Data is stored as raw flat files in the Hadoop Distributed File System (HDFS), and one point that analysts find difficult to grasp is that this storage model is different from what they are used to, which affects them at the point of query. Moving from rectangular data to non-rectangular data held in Hadoop is a real mind-shift for business users. People are moving from a relational world to a batch-processing world and having to mix both; this will not be straightforward for many organizations, which already struggle with rectangular data. And this is before the complexities of combining Data Lake data with data from in-memory databases, for example.
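To make the rectangular versus non-rectangular distinction concrete, here is a minimal Python sketch. The sample records, field names and the flatten helper are all hypothetical and chosen purely for illustration; real Data Lake files and tooling will differ.

```python
import json

# Hypothetical raw events as they might land in a data lake: one JSON
# document per line, with nested, repeated and optional fields.
RAW_LINES = [
    '{"customer": "C1", "orders": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}',
    '{"customer": "C2", "orders": []}',
    '{"customer": "C3", "orders": [{"sku": "A", "qty": 5}], "region": "EU"}',
]

# The rectangular shape an analyst expects: fixed columns, one row per order line.
COLUMNS = ["customer", "region", "sku", "qty"]


def flatten(lines):
    """Turn nested records into flat rows, one row per order line."""
    rows = []
    for line in lines:
        record = json.loads(line)
        for order in record.get("orders", []):
            rows.append({
                "customer": record["customer"],
                "region": record.get("region"),  # absent in some records
                "sku": order["sku"],
                "qty": order["qty"],
            })
    return rows


rows = flatten(RAW_LINES)
for row in rows:
    print([row[c] for c in COLUMNS])
```

Note that the customer with no orders disappears from the flattened result, and the optional region field has to be filled with an empty value. Decisions like these are exactly what analysts face at the point of query.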

The complexity lies in getting the data into the Data Lake in the first place. A data lake is primarily a collection of data services. Raw data takes up more space, however, and it is much more difficult for analysts to navigate. This makes it harder to query, so querying will not happen at the speed of the business. It also makes it harder to take machine-scale data and make it human-scale, so that it can be summarized, sliced and diced, and compressed for business decision making. The cloud offers room for experimentation and exploration in tackling these issues, and is well suited to Data Lake implementations. Locating the Data Lake in the cloud can, however, add a further point of complexity when getting the data in.

Who actually owns the data going into the Data Lake? IT retains overall responsibility for the guardianship of the data in the Data Lake, plus cataloguing the data for retrieval. The business analysts and data scientists are responsible for the usage of the Data Lake, surfing this unstructured data repository in order to answer business questions by progressively adding structure to the data. The data lifecycle of the data lake alters how the data pipeline works to get data into the Data Lake as a first step.
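The idea of progressively adding structure can be sketched in a few lines of Python. This is an illustration only: the raw text, the column names and the apply_schema helper are assumptions made for the example, not part of any particular Data Lake product.

```python
import csv
import io

# Hypothetical raw file content, stored exactly as it arrived.
RAW_TEXT = (
    "2016-01-05,store-7,19.99\n"
    "2016-01-06,store-7,not-available\n"
    "2016-01-06,store-9,5.00\n"
)


def read_raw(text):
    """Read the raw lines as lists of strings, with no structure imposed."""
    return list(csv.reader(io.StringIO(text)))


def apply_schema(raw_rows, schema):
    """Apply (name, converter) pairs at read time; bad values become None."""
    typed = []
    for raw in raw_rows:
        row = {}
        for (name, convert), value in zip(schema, raw):
            try:
                row[name] = convert(value)
            except ValueError:
                row[name] = None
        typed.append(row)
    return typed


# First pass: only name the columns, keep everything as text.
first_pass = apply_schema(
    read_raw(RAW_TEXT), [("date", str), ("store", str), ("amount", str)]
)

# Second pass: add a numeric type once the business question needs it.
second_pass = apply_schema(
    read_raw(RAW_TEXT), [("date", str), ("store", str), ("amount", float)]
)

total = sum(r["amount"] for r in second_pass if r["amount"] is not None)
print(round(total, 2))
```

The raw data never changes between the two passes; only the structure applied when reading it does. That is why cataloguing by IT and sensible usage by analysts both matter.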

One key opportunity offered by Data Lakes is the ability to break down silos in the organization. The information can be amassed from various sources from different departments and from external sources, and it can be co-located in order to create a wider organizational Big Data program, with a focus on Analytics. Many organizations have a series of dirty data puddles rather than a data lake, and this isn’t conducive to business insights driven by data.

Any strategy will need to reflect that the landscape of data is fluid and changing. These data puddles will need to be incorporated into the Data Lake as part of the data lake creation process, and organizations will need to review their in-house legacy systems. For larger organizations, the in-house data warehouse is a key concern, since it has more short-term relevance. All of this happens before the business has the opportunity to create queries on its data lake. Bringing these sources together will help to move the organization forward by making it easier to ask horizontal questions across the business, rather than just vertical questions which focus on its siloed departments. The Data Lake is an attractive strategy for circumventing these disconnections within the organization, so it is not just a technical concept but a mover for deeper organizational change.

Anyone who has worked with data will know the pain of meshing it together in spreadsheets. In order to move at the speed of the business, Business Analysts spend a lot of time restructuring, reformatting, blending, meshing and consolidating data. There is real pressure to find answers and insights in an increasing number of data sources, which are themselves increasing in size, and to do so in a decreasing space of time. However, in order to get that far, the data has to get into the Data Lake in the first place. This involves a lot of moving parts, but it’s clear from the study that many organizations, whilst remaining skeptical of the hype, are continuing to move forward. How is it possible to move forward? What needs to be considered before putting data into the Data Lake?

1. Organizational maturity in adopting the Data Lake

Organizations need to understand better where they are in terms of their maturity and commitment to the Data Lake. In their report, Data Lake Adoption and Maturity Survey Findings Report, Radiant Advisors lay out a number of key stages in determining an organization’s maturity in its Data Lake approach. These stages are listed here, but the reader is referred to the Data Lake Adoption and Maturity Survey Findings Report for a full description.

Evaluate
Reactionary
Proactive
Core Competency

The organization will need to take a step back to understand its existing status better. Is it just starting out? Are there other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands its state better, it can start to work out, in broad terms, the strategy that the Data Lake is intended to support.

As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.

2. Executive Sponsorship

If an organization has only started out on the journey towards a Data Lake, then they will need to involve an executive sponsor in order to help provide the right vision and strategy for the Data Lake. This means that there will be executive sponsorship and support to lead the organization in the execution of the data-driven strategy. Understanding the ROI and business goals will help to keep the project on track, and the goals in focus.

3. Governance

Data Governance is critical for enterprise data, and particularly for data which concerns people. Governance will help the organization to find a common framework to consider important points when putting data in the Lake, such as discovering what data should go in the Data Lake, how it gets there, and how it should be protected.

4. Engaging the Users

As a consequence of obtaining executive sponsorship, a key item is to ensure that the organization has the right skills in-house, both technical and professional, to make execution easier. Eventual adoption of the Data Lake will be easier if the users are involved as the Data Lake develops and progresses. This will also involve learning new skill sets, and users will feel that this is a valuable exercise. It is also a way of getting over the Shiny New Objects syndrome, whereby people get dazzled by bright new things and the technology is emphasized at the expense of other relevant factors, such as the business processes. User engagement facilitates long-term adoption and usage of many technologies, and this also applies to the Data Lake.

5. Prepare – Call to Action

As a next step, it’s recommended to read the Data Lake Adoption and Maturity Survey Findings Report from the Attunity website for more data and advice in order to devise a strategy to facilitate the Data Lake adoption process.

The PASS Business Analytics Conference features a number of Big Data experts, with a focus on the Data Lake. The emphasis is on practical information, and there is also a ‘Communicate and Lead’ track to help developing leaders take their organization towards a data-driven enterprise strategy.

Conclusion

To summarise, there are a number of strands which need to be considered when adopting a Data Lake and populating it with data. It’s clear that there are many issues involved and it’s not as simple as pouring the data in. These issues are wide-ranging and touch many parts of the organization, and enterprises will need to take these issues into account when approaching the Data Lake opportunity.

References

Best Practices for Data Lake – Who’s Using It and How Can You Get the Most Value From It? By Carole Gunst – http://attunity.com/blog/best-practices-for-data-lake-whos-using-it-and-how-can-you-get-the-most-value-from-it#sthash.DUefFh7b.dpuf

Data Lake Adoption and Maturity Survey Findings Report – http://learn.attunity.com/data-lake-adoption-and-maturity-survey-findings-register-0