Journey of Building a Data Lake on AWS – from Ideation to Business Value

The number one question we get asked by potential data lake customers is "how do I set up a production-ready enterprise data lake, and how long does it take to build one?" From our experience of building data lakes on AWS over the past three years, it can take anywhere from three months to a year depending on the end goal. To understand these timelines, let us first walk through the journey of setting up a data lake from scratch.

Who Wants to Build a Data Lake?

Enterprise customers are always searching for insights from a variety of data sources. Most of the time they have to deal with challenges like (a) combining data from two or more data sources and (b) limitations of technology platforms. These challenges draw them toward exploring a data lake-based solution. Every customer has a unique set of requirements, yet there are two distinct categories of customers approaching Cloudwick for data lake implementation strategies.

(A) Use-case driven: These are companies where the business teams identify a use case for which they need a data lake. Instead of the IT team, the business team (a manager, analyst, or data scientist) directly approaches Cloudwick to build a data lake that serves their particular use cases. These purpose-built data lakes do not require all the enterprise features.

(B) Platform driven: These customers are mostly enterprises that want to build a data lake as a platform for multiple use cases across their entire business. IT leads the implementation and takes a platform approach, with the aim of building all enterprise-level features.

Building Blocks for Setting Up an Enterprise Data Lake

Most enterprises wanting to build a data lake follow the progression shown in Figure 1. The blocks appear in increasing order of complexity, rising roughly linearly with time (months) on the x-axis and cost on the y-axis.

Figure 1: Building blocks for setting up a data lake

Figure 1 shows feature blocks required to build a data lake from scratch to a production ready solution capable of performing advanced analytics and machine learning. We will get into details of these building blocks in the following sections as we analyze the journey.

Skills Required to Build a Data Lake

Also listed on the right side of Figure 1 (and below) are the different skills required to build and operate a data lake. The expertise needed spans a wide variety of roles, from project management to machine learning, and these people have to work together as a team from day one to the end. This kind of talent and skill set is hard to find and keep.

  • Product Owner
  • AWS Solution Architect
  • AWS Technical Lead
  • AWS DevOps
  • AWS Data Engineer
  • AWS ML Engineer
  • Serverless UI Developer
  • Quality Assurance
  • Project Manager
  • Support Engineers

How Does Everyone Get Started?

Cloudwick’s journey of data lake implementation with most customers starts during the ideation phase. The customer typically has a use case in mind and has attempted to build a data lake to support it. Occasionally we also get requests from customers who tried setting up a data lake on their own and failed; they are now looking for a proof point before justifying a bigger investment. During this stage, performance is not a consideration; proving the concept is the main goal.

In Figure 1, the first block covers moving from ideation to a basic proof of concept (POC): establishing connections to bring the data into the data lake, running simple transformation jobs or SQL queries, and creating basic reports. To build a POC, you first need to set up an AWS account, roles, users, storage, and so on. Then comes data gathering; from our experience, coordinating and getting data in a usable format from even two data sources can take at least two to three weeks. Once data is available, transformation and creation of basic reports follow.

To do the above, we need a data architect to conceptualize the solution and a data engineer to do the actual work (using the AWS console, with no automation). Ideation to basic POC takes at least a month.
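The POC-stage work described above amounts to joining a couple of source extracts and producing a simple report. As a minimal sketch of that kind of transformation (the table names, columns, and sample values here are hypothetical; in a real POC these extracts would land in S3 and the join would run as a transformation job or SQL query):

```python
import csv
import io

# Hypothetical extracts from two source systems.
orders_csv = """order_id,customer_id,amount
1,100,250.00
2,101,75.50
3,100,120.00
"""

customers_csv = """customer_id,region
100,West
101,East
"""

def basic_report(orders_text, customers_text):
    """Join the two extracts and total order amounts per region."""
    customers = {row["customer_id"]: row["region"]
                 for row in csv.DictReader(io.StringIO(customers_text))}
    totals = {}
    for row in csv.DictReader(io.StringIO(orders_text)):
        region = customers.get(row["customer_id"], "unknown")
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals

print(basic_report(orders_csv, customers_csv))
# {'West': 370.0, 'East': 75.5}
```

The same join-and-aggregate logic is what a basic Athena SQL query or Glue job would express once the data sits in the lake.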

How to Move Ahead Beyond Concept?

You need to convince the decision makers and influencers with the results of the POC to get full-scale buy-in for a production go-ahead. In our experience, most customers who opt in for a data lake POC have more or less decided on a data lake but need a proof point, and a successful POC in most cases moves into production. Once the decision makers and stakeholders in an organization understand the business value of combining data sources to get insights, the next step is to build the data lake.

We advise customers to design their enterprise data lake project like any other major enterprise cloud project, i.e., as an infrastructure-as-code implementation. You can then do further maintenance and enhancements using a CI/CD (continuous integration and continuous delivery) process. This CI/CD process, along with AWS serverless components like Lambda, API Gateway, and Step Functions, delivers the entire data lake infrastructure (roles, users, storage, and the data lake user interface) into production.
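To make the Step Functions piece concrete, below is a sketch of the kind of artifact a CI/CD pipeline would version and deploy: an Amazon States Language (ASL) workflow definition built in code. The workflow shape, function names, and ARNs are hypothetical placeholders, and the actual deployment (via CloudFormation, CDK, or the Step Functions API) is not shown.

```python
import json

def build_ingest_state_machine(ingest_arn, transform_arn):
    """Build a minimal ASL definition: a Lambda-backed ingest task
    followed by a transform task. ARNs are placeholders."""
    return {
        "Comment": "Hypothetical data lake ingestion workflow",
        "StartAt": "Ingest",
        "States": {
            "Ingest": {
                "Type": "Task",
                "Resource": ingest_arn,
                "Next": "Transform",
            },
            "Transform": {
                "Type": "Task",
                "Resource": transform_arn,
                "End": True,
            },
        },
    }

definition = build_ingest_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:ingest",
    "arn:aws:lambda:us-east-1:123456789012:function:transform",
)
print(json.dumps(definition, indent=2))
```

Because the definition is plain code, it can be linted, reviewed, and promoted through environments by the CI/CD process like any other source file.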

Does Every Company Need Enhanced Data Lake Capabilities?

The answer might be no. For smaller companies that have just one or two use cases and only three or four reports to generate, a purpose-built data lake with limited capabilities is enough. They also sometimes choose the AWS console for production deployment, i.e., without a CI/CD process. For these customers, a one-month enhancement effort converts the POC itself into a production version. These implementations correspond to the first block of Figure 1, a.k.a. "purpose-built" data lakes, and most of the time will not have advanced features like search, catalog, etc.

Large enterprises want to build a data platform beyond one use case. Their internal teams and business functions drive the need for a data lake to run analytics and Machine Learning (ML). These companies follow the progression shown in Figure 1. They do a POC and later move into production by following the CI/CD process. These companies leverage data lake platforms for ingestion, ETL, ML and visualization.

The second block in Figure 1 is the first step of a production platform build: the actual process of building a base data lake. The third and remaining blocks correspond to adding more features on top of this base, such as integration with existing or new AWS services (S3, Redshift, Cognito, etc.). Customers can also integrate the data lake with enterprise data warehouse (EDW) solutions or implement an enterprise-level search feature. By following this process flow, companies can have a basic data lake in as little as three months. Activities from the third block onwards warrant contributions from the entire team (UI developer, QA engineer, PM, etc.).

From the fourth block onwards, the data lake implementation focuses on the company’s governance and security roadmap. In Cloudwick’s experience, most enterprises do custom Glue and Athena integrations and UI development to address their specific needs. Enterprise customers who opt for complex features should expect to invest at least nine months to build a production-ready data lake platform with all the required features: security, governance, an automated deployment process, and integration with other AWS services.
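One detail that Glue and Athena integrations typically standardize on is a Hive-style partition layout in S3 (key=value path segments), which Glue crawlers can detect and Athena can prune at query time. A small sketch of how such an integration might generate daily partition prefixes (the bucket and table names here are hypothetical):

```python
from datetime import date

def partition_prefix(bucket, table, day):
    """Build a Hive-style daily partition prefix in S3, the layout
    Glue crawlers and Athena recognise for partitioned tables."""
    return (f"s3://{bucket}/{table}/"
            f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/")

print(partition_prefix("acme-datalake", "orders", date(2020, 3, 7)))
# s3://acme-datalake/orders/year=2020/month=03/day=07/
```

Keeping this convention consistent across ingestion jobs is what lets an Athena query filter on `year`, `month`, and `day` without scanning the whole table.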

In summary, block 1 is the POC and block 2 is setting up the deployment process. Blocks 3 to 6 cover the breadth of features and integration with other AWS services.

How to Manage a Data Lake?

Ongoing management and support of a data lake will require at least 40% of the resources used during the build phase. In most enterprises, the project team hands over ownership of the data lake platform to IT after production deployment to manage, support, and evolve the platform.

There are two kinds of activities involved in managing a data lake. First, to evolve the platform and keep it current, it is important that any updates AWS releases to its services get incorporated into the platform as soon as possible. Second is the day-to-day support and maintenance of the platform: ensuring its availability and performance, and supporting end users.

When Does a Company See Value from its Data Lake Investment?

Once the data lake platform is production ready with all automation and integration in place, business users can access the platform and get started with their use-case implementations. With a well-architected and well-developed data lake platform, implementing any use case, including machine learning and analytics, becomes easier. Cloudwick has seen many customers build new reports within two weeks, and machine learning use cases that once took months delivered within four weeks. A well-orchestrated data lake platform lifts the burden of access controls, automation, and DevOps off the business teams and lets them focus on real business problems.