Why are Data Lakes Gaining Popularity as Data Processing Platforms?

Analytics, and especially machine learning, is undoubtedly the lynchpin of every modern data-driven organization. Machine Learning (ML) is the ability to learn from data without being explicitly programmed. As a programming technique, ML has been around for many years. So why has it gained such popularity recently? The main reason is the improved economics of data processing for ML workloads in the cloud. On top of that, data lakes make it easier to extract value by simplifying ML data processing. In this post, let’s look at the history of data lakes and how the concept has evolved over the past three decades.

Let’s Go Back Three Decades

ML techniques require a large amount of data to produce meaningful insights. Academics studied ML algorithms as early as the 1950s, but the algorithms were rarely applied to real-world problems because data at scale was not available. Data was being generated, but there was no way to store it all: applications kept producing data, while databases could not scale beyond a single server (the vertical-scaling limit). Once businesses hit machine storage limits, they backed up historical or less critical data to tape and archived it. ML algorithms never got enough historical data to learn from and hence were not very effective, so ML adoption stayed low.

On-premise Data Lakes Version 1.0 with Hadoop

As time passed, storage became cheaper and cloud-based technologies gained traction. It became viable to store large volumes of data on-premise or in the cloud. Storage prices fell far enough that businesses could keep years’ worth of data, even though most enterprises had no clear plan for how to use it. That data could sit in storage for many years, untouched.

Falling storage prices did not provide any relief to traditional data stores. They could only scale vertically within a single system, and that was extremely expensive. Horizontal scalability for storing and processing large datasets was still a challenge. The answer was Hadoop, which gained popularity around 2007. Hadoop grew out of open-source projects inspired by work done at Google and Yahoo, and it could store and process large datasets on distributed clusters of commodity hardware. Hadoop enabled two main things. First, it allowed compute and storage to scale horizontally beyond one machine: a Hadoop cluster of hundreds of machines could be made to behave like a single system. Second, the software kept track of all the data without requiring a schema up front.

Traditional data stores work on the principle of schema-on-write: data follows a table structure, and the tables have to be designed in advance. Technologies like Hadoop made it possible to dump data without a schema, which means data can be stored without worrying about its structure. This allowed all types of data, structured or unstructured, to be stored in its raw form.
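
To make the schema-on-read idea concrete, here is a minimal sketch using PySpark, one of several engines that can read raw files from a data lake. The file path and field names are hypothetical; the point is that no table definition exists before the data is written, and the structure is inferred only when the data is read.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed).
spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the raw JSON files were dumped into the lake as-is,
# with no CREATE TABLE step. Spark infers the structure at read time.
# "/data/lake/raw/clickstream/" is a hypothetical path.
events = spark.read.json("/data/lake/raw/clickstream/")

# Inspect whatever structure was inferred from the raw files.
events.printSchema()

# Query it like a table even though no schema was declared up front.
events.groupBy("page").count().show()
```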

Hadoop solved the problem of horizontal scaling beyond a single machine. Customers kept adding general-purpose machines to grow their Hadoop clusters, which added both compute and storage. But while Hadoop solved the horizontal-scaling problem, it never matched the performance of traditional data stores, which were fast thanks to their proprietary data indexing techniques. Businesses saw little value in using Hadoop as a compute engine, so most large companies used it simply as cheap storage, offloading less critical data from their data stores to Hadoop clusters. This became Version 1.0 of the data lake.

Challenges with On-premise Hadoop Data Lakes: the Rise of Data Lakes Version 2.0

Hadoop relies on two core components: HDFS, which handles file management, and MapReduce, which handles compute. In its early days, Hadoop required intense programming; writing complex MapReduce jobs in Java just to execute a simple query was cumbersome. To address this, the ecosystem introduced a service sitting on top of MapReduce, known as Hive, which converts SQL queries into MapReduce code executed on the Hadoop cluster. This made things a little easier and Hadoop more usable. Over time, the Hadoop ecosystem evolved into more than ten different tools powered by Apache projects, stitched together to create a data lake.
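
As a rough illustration of what Hive made possible, the sketch below submits a single SQL statement instead of hand-writing a MapReduce job in Java. It uses the PyHive client and assumes a HiveServer2 endpoint at localhost:10000 and a hypothetical web_logs table; Hive compiles the query into MapReduce (or, in later versions, Tez or Spark) jobs behind the scenes.

```python
from pyhive import hive  # assumes the PyHive package is installed

# Connect to a (hypothetical) HiveServer2 endpoint.
conn = hive.Connection(host="localhost", port=10000, username="analyst")
cursor = conn.cursor()

# One SQL statement replaces pages of hand-written Java MapReduce code.
# Hive translates it into jobs that run on the Hadoop cluster.
cursor.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs "
    "GROUP BY page ORDER BY hits DESC LIMIT 10"
)

for page, hits in cursor.fetchall():
    print(page, hits)
```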

Managing this ecosystem became another nightmare. Later, vendors like Cloudera, MapR, and Hortonworks appeared in the market, offering packaged distributions and software to manage the Hadoop ecosystem components. This helped trigger Hadoop adoption among Fortune 1000 companies, which started migrating their less critical data and long-running ETL workloads to Hadoop-based data lakes. These data lakes, usually built on one of these distributions, emerged as Version 2.0.

Yet customers still ran their critical analytical workloads on traditional data warehouses, because compute in Hadoop was comparatively slow. Hadoop usage kept growing for storage, while its compute capacity sat largely idle: customers added more general-purpose machines just to gain storage, resulting in ever-bigger clusters with unused compute, and scaling storage this way soon became a challenge in itself. Later, engines like Spark arrived, but none of them could match traditional data warehouses in performance, and Spark’s appetite for memory became another procurement challenge for enterprises. By the 2015-16 timeframe, customers were realizing that on-premise Hadoop (Versions 1.0 and 2.0) could not solve all their problems because of its inability to scale and perform.

The Emergence of Cloud Data Lakes – Version 3.0

Public cloud offerings were gaining prominence in the 2015-16 timeframe. Cloud service providers like Amazon Web Services brought their own unique benefits, but above all they solved the problem of Data Lake Version 2.0 by providing two things: separation of compute and storage, and the ability to scale each independently. Customers now paid only for what they used, whether storage or compute. Hadoop customers who had been buying three times the storage to account for HDFS replication saw the value immediately. This triggered the migration of many analytics workloads from on-premise clusters to the cloud, and Data Lake Version 3.0 emerged.
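
Here is a minimal sketch of what this separation looks like in practice, assuming a hypothetical S3 bucket and a Spark cluster with the Hadoop S3A connector and AWS credentials configured: the data sits in object storage that scales (and is billed) on its own, while the compute cluster can be sized up, sized down, or shut off entirely without touching the data.

```python
from pyspark.sql import SparkSession

# Compute: a Spark cluster that can be resized or terminated at any time.
spark = SparkSession.builder.appName("decoupled-compute").getOrCreate()

# Storage: data lives in object storage (a hypothetical S3 bucket),
# independent of any cluster, with no 3x HDFS replication to pay for.
orders = spark.read.parquet("s3a://example-datalake/orders/")

# Run the analysis and write results back to the lake; the data remains
# in S3 regardless of what happens to the compute cluster afterwards.
orders.groupBy("region").count().write.mode("overwrite").parquet(
    "s3a://example-datalake/reports/orders_by_region/"
)

spark.stop()
```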

Cloud data lakes also brought the extra benefit of not having to make capital investments for new initiatives. A customer can provision data lake resources in the cloud just for the duration of a project, which matches the exploratory, “fail fast” nature of ML initiatives.

Today, cloud-based data lakes offer on-demand scalability of compute and storage. They can seamlessly handle structured and unstructured data, whether business, operational, customer-interaction, or machine data, and they power enterprise-level BI, advanced analytics, and AI/ML applications. In today’s digital era, data is the new oil, and machine learning applications powered by cloud-based data lakes are set to launch a new data revolution.

Let’s wait and watch.