Eight data challenges addressed by a cloud Data Lake

A data lake infrastructure designed for handling customer data should not be a one-size-fits-all traditional solution. Customers do not think of your business as a collection of silos; they see it as a single unit. They have traded their data with the expectation of a thoughtful experience in return. To deliver on that expectation fully, the data infrastructure should be designed around the requirements of each individual customer’s data. An individual’s daily interactions with digital services and devices generate data that keeps accumulating over time, resulting in large and complex datasets. This leads to the familiar global “Big Data” problems of variety, velocity and volume.

Most current data warehouses do not meet the challenges posed by the emergence of Big Data:

  1. The data-quality measures applied to certain kinds of data make the warehouse inflexible when handling other kinds of data.
  2. The static structure of data inside a data warehouse limits the kinds of analysis that can be performed on certain data, while the variety, volume and veracity of the data make it difficult to know upfront which analyses will be needed.
  3. The data cleaning and aggregation performed by a data warehouse may discard fragments of data during aggregation that could prove valuable in the future.

A data lake solution has therefore become increasingly popular. A customer’s interactions with an organization generate fragmented datasets for the same user, with the fragments stored in different locations using different formats. This fragmentation makes it challenging both to collect the data fragments from different service providers and to store them for analytics purposes. The task of building a data lake infrastructure should be approached with a thorough understanding of these challenges. The following sections highlight them:

Variety:

A customer’s interactions with an organization generate data in many formats, not just the familiar rows, columns and database joins. The data differs greatly from application to application, and much of it is unstructured. Apart from standard structured formats, data may arrive as photos, videos, audio recordings, email messages and documents. An effective data lake solution should be able to store all of these kinds of data and provide a way to process them.

Volume:

Just like a data warehouse, a data lake should be able to handle the volume of big data. Consumers and enterprises create an estimated 2.5 quintillion bytes of data globally every day. A data lake should be able to handle this volume in a cost-effective and scalable manner.

Decoupled Storage and Compute:

In traditional Hadoop and data warehouse solutions, storage and compute are tightly coupled. As a result, compute and storage must be scaled together, and clusters must be kept running at all times or the data becomes inaccessible. This makes it difficult to optimize costs and data processing workflows. While bulk data storage keeps getting cheaper, compute is expensive and ephemeral. By separating compute and storage, data teams can easily and economically scale storage capacity to match rapidly growing datasets while scaling distributed computation only as their big data processing needs require. It is therefore important to decouple storage and compute so that each can be scaled independently according to a data team’s needs.
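As an illustration, here is a minimal PySpark sketch of the decoupled pattern: the data stays in object storage while a short-lived Spark session supplies the compute. The bucket name, paths and column names are hypothetical, and the snippet assumes the environment is already configured to read from S3.

```python
# Minimal sketch: compute (a short-lived Spark session) exists only for the
# duration of the job, while the data itself stays in cheap object storage.
# Bucket name and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("decoupled-storage-compute-demo")
    .getOrCreate()
)

# Read Parquet files straight from object storage -- no persistent HDFS
# cluster needs to be running to keep the data available.
events = spark.read.parquet("s3a://my-data-lake/raw/events/")

# Run the analysis, write results back to object storage, then release compute.
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_counts/")

spark.stop()  # compute goes away; the data remains in storage
```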

Data Governance:

A data lake can solve the challenges above, but in doing so it can easily turn into a data swamp. If a data lake is polluted with data from everywhere, it can quickly become incapable of delivering meaningful data to anyone. A good data governance system is required, one that maintains quality datasets in the data lake that are clean, relevant, and can be easily found, accessed, shared, managed and protected.

Metadata Management:

Effective metadata management is a central part of a data lake. A data lake should be able to handle any format of data, from structured and semi-structured to unstructured. But, as alluded to above, this risks turning the data lake into a data swamp. An effective metadata management system stores metadata about the data alongside the data itself. This helps convert raw data into cleaned, transformed data and alleviates, to some extent, the problem of incomprehensible data in a data lake. Any good data lake solution should have a solid metadata management solution in place.
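One simple way to keep metadata travelling with the data is to write a small “sidecar” record alongside each ingested object. The sketch below is only one possible pattern, not a prescribed standard; the paths, fields and helper function are illustrative.

```python
# Illustrative pattern: write a small "sidecar" JSON document next to each
# ingested file, recording where the data came from and what it looks like.
import json
from datetime import datetime, timezone
from pathlib import Path

def write_metadata_sidecar(data_path: str, source: str, owner: str, schema: dict) -> Path:
    """Record who produced a dataset, where it came from, and its schema."""
    metadata = {
        "data_path": data_path,
        "source_system": source,
        "owner": owner,
        "schema": schema,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = Path(data_path).with_suffix(".metadata.json")
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

# Example usage for a hypothetical call-center export.
write_metadata_sidecar(
    data_path="raw/call_center/2023-01-15.csv",
    source="call_center",
    owner="customer-analytics",
    schema={"customer_id": "string", "call_duration_sec": "int", "sentiment": "string"},
)
```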

Unified Data: 

For successful real-time engagement with the customer, a unified storage facility is needed for the data fragments that are originally stored in different locations using different formats. Besides handling the ingestion of streaming data, the platform should be extensible enough to enrich the real-time data with other sources such as CRM, marketing databases, call-center records, social media, IoT and voice-of-customer (VOC) data. A unified data store helps in analyzing and querying all customer-related records. For instance, say you have ERP data in one data mart and weblogs on a different file server; a query that joins them would otherwise require a complex federation scheme or yet another piece of software. A data lake helps by putting all the data into one big repository, making it readily accessible for any query you might conceive. Thus, a data lake should not only ingest raw data but also provide a way to easily unify customer data from disparate sources.
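As a sketch of the ERP-plus-weblogs example above: once both sources land in the lake, a single join produces the unified view, with no federation layer in between. The bucket, paths and column names here are hypothetical.

```python
# Sketch of the join described above: ERP exports and weblogs both live in
# the lake, so one query can combine them. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-customer-view").getOrCreate()

orders = spark.read.parquet("s3a://my-data-lake/raw/erp/orders/")      # ERP extract
weblogs = spark.read.json("s3a://my-data-lake/raw/web/clickstream/")   # web activity

# Join the two sources on the shared customer identifier to build a unified view.
customer_360 = (
    orders.join(weblogs, on="customer_id", how="inner")
          .select("customer_id", "order_id", "order_total", "page_url", "event_time")
)

customer_360.write.mode("overwrite").parquet("s3a://my-data-lake/curated/customer_360/")
```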

Data Protection & Security:

Since users can ingest all kinds of datasets into the platform, you cannot afford to overlook the security issues that arise when warehousing all this data. In terms of protection and security, all datasets in a data lake should be protected with network isolation to prevent undesired access to the environment. Moreover, since a data lake stores content retrieved from other sources, you may need to protect the data in the same ways it was already protected in the original content sources. Data encryption is one well-known way to protect content. For strong data protection, encryption should be applied at the storage level (data at rest) and while data travels over the network (data in transit). Data lakes are intended to break the barriers that silos create by giving users of the lake access to the centralised content in it. Still, some applications, or rather some data, require document-level access restrictions: users must only see documents to which they have been granted read permission. Hence, some degree of user-level access control is required.
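For data at rest, one hedged example, assuming the lake’s storage layer is an Amazon S3 bucket, is to enable default server-side encryption and block public access with boto3. The bucket name below is hypothetical.

```python
# Sketch: enforce encryption at rest and block public access on the bucket
# that backs the data lake (bucket name is hypothetical).
import boto3

s3 = boto3.client("s3")

# Default server-side encryption for every new object written to the bucket.
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Prevent the bucket or its objects from ever being exposed publicly.
s3.put_public_access_block(
    Bucket="my-data-lake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```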

Advanced Analytics and ML:

A data lake should support a carefully governed integration with AI and ML services. These services can be applied directly to structured data, or used to generate metadata for unstructured datasets. Integrating AI and ML services lets data consumers such as data scientists go beyond the boundaries of the warehouse and glean new insights from the data. Generating metadata from unstructured datasets in this way also helps alleviate the problem of data incomprehensibility in a data lake.
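As one possible illustration of such metadata generation, the sketch below sends an image stored in the lake to a managed vision service and keeps the detected labels as searchable metadata. It assumes AWS Rekognition is available, but any labelling model could play the same role; the bucket and object key are hypothetical.

```python
# One way to auto-generate metadata for unstructured content: run an image
# through a managed vision service and store the detected labels as tags.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-data-lake", "Name": "raw/images/receipt-0001.jpg"}},
    MaxLabels=10,
    MinConfidence=80.0,
)

# Keep only the label names; these become searchable metadata for the object.
labels = [label["Name"] for label in response["Labels"]]
print(labels)  # e.g. ["Receipt", "Text", "Document"]
```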

To handle these challenges effectively, the data lake infrastructure should let you begin with a custom-built schema for structured data (or no schema for unstructured data) and then give the platform’s users (IT, data science or analytics) the ability to query and transform only the relevant data, producing a subset or a transformed dataset with a defined schema when required. The data lake should keep all of the data, even data that may never be used, as opposed to a data warehouse, in which only processed data is stored for reporting purposes. Data lakes usually have a flat architecture that stores data regardless of its format and places the responsibility for understanding and generating the data on the data provider and the data consumer, who are directly involved in data generation and consumption.
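A brief schema-on-read sketch of this idea: raw JSON lands in the lake untouched, and a schema is applied only when a consumer queries it, producing a curated subset with a defined schema. Paths and field names are illustrative.

```python
# Schema-on-read sketch: the schema lives with the consumer's query, not with
# the stored raw data. Paths and field names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

interaction_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("channel", StringType()),
    StructField("amount", DoubleType()),
    StructField("occurred_at", TimestampType()),
])

# Raw JSON stays as-is in the lake; the schema is applied at read time.
interactions = spark.read.schema(interaction_schema).json("s3a://my-data-lake/raw/interactions/")

# Only the relevant subset is transformed into a curated, schema-ful dataset.
purchases = (
    interactions.filter(interactions.channel == "web")
                .select("customer_id", "amount", "occurred_at")
)
purchases.write.mode("overwrite").parquet("s3a://my-data-lake/curated/web_purchases/")
```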

The potential benefits of a properly constructed data lake far outweigh the build-out effort, so setting proper expectations is paramount. Measurable results may not surface immediately and usually show up over the long term, but in the end the benefits of a data lake are significant. A data lake can help generate unified data, which can be pushed to external services to provide real-time insights to customers. It can provide metadata about raw data for use in custom analytics. The data can also be pulled for personalisation and used to trigger interaction events with consumers, and transformed data can be exported for visualisations. All of this leads to a better customer experience, maximised revenue and a reduced cost to serve.

In our other blog posts, titled “How a cloud data lake solution can handle decoupling of storage and compute?” and “Effective Governance in Amorphic Data”, we look into how these challenges are addressed in the implementation of the orchestration service, Amorphic Data.