Data Lakes are a way to store huge amounts of data in their native form. Data stored this way may have varying degrees of structuring or may not even be structured at all. In PwC’s Transforming Risk into Confidence podcast series, PwC’s Joe Sousa compared a data lake to … a lake.
Analogy not creative enough? The truth is, no better one can be found. In his podcast, Joe describes a classic data warehouse as a store of bottled water that can be easily used and consumed. A data lake is like a huge body of water containing water in its natural state. The lake is filled from various sources, and users can examine it, take samples or dive into its depths.
Growth of data
Before I go on to present what in my opinion are the most important issues related to the implementation of data lake technology, a few words about the growing importance of data in the world of business:
- Data is a big deal. In 2020, humanity created and consumed nearly twice as much data as in 2018, and IDC predicted in its May Global DataSphere report that "the next three years of data creation and consumption will eclipse that of the previous 30 years" (yes, video streaming accounts for a significant share of that volume).
- Companies have begun to notice the great potential of being data driven and are using data to add value: ResearchAndMarkets.com expects the global enterprise data management market to reach $133.4 billion by 2026, growing at a CAGR of 10.2% over the forecast period, and that's just one of many similar predictions.
As mentioned before, a data lake is a collection of variously structured data, coming from different sources and available for different users. It seems that such a solution may mean chaos. Meanwhile, this technology is used by companies from various sectors, including — as my own experience confirms — organizations operating in a highly regulated environment, such as the pharmaceutical industry. For a mature data-driven organization, the data lake approach allows for democratization of access to data, ensures transparency and allows the entire organization to use one authoritative data source.
As much as possible, as quickly as possible and without errors
Businesses in the pharmaceutical sector build their competitive advantage by constantly refreshing their portfolios with new products. With ever-faster product introduction cycles, it's technology that brings new opportunities, not only in production, but also in collecting and analyzing market feedback based on data, running advanced analytics and spreading data literacy among employees.
Pharma companies handle much of their data themselves, but they often rely on specialized IT companies with data management competencies that are hard to build in-house.
This is especially applicable now, as industry leaders have taken on the task of researching and manufacturing hundreds of millions of COVID-19 vaccine doses in a much shorter time than ever before. This would be hard to achieve without IT support and the ongoing provision of the data necessary to analyze the situation. Data Lakes seem to be the natural sources of this data.
The key question is: how do you meet these expectations and deliver new data into Data Lakes quickly and with agility in an industry as highly regulated as pharma?
Below are some suggestions.
1. Tailor-made solution delivery framework
On the one hand, far-reaching regulations. On the other, the need for quick, agile release and deployment of new IT solutions. This tension is typical of the pharma industry. What these companies do is set up a framework for IT solution delivery that ensures quality and regulatory requirements are met. Data Lake ingestion is usually based on repetitive procedures (where data sources and transformation logic may differ from one ETL to another). This makes it possible to adapt the solution delivery framework to these specific circumstances and create predefined blueprints for the various data ingestion flow types and scenarios. Project delivery teams can then follow a clearly defined path and start producing the required compliance and quality deliverables right from the beginning of the project lifecycle.
IT solution owners cooperate closely with Compliance to assess the risk, which is a key indicator of which delivery path should be chosen for a particular implementation.
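As a rough illustration of how such predefined blueprints and risk-driven delivery paths might be modeled in code — every name, field and risk value below is an assumption for the sketch, not the framework described above:

```python
from dataclasses import dataclass

@dataclass
class IngestionBlueprint:
    """A predefined template for one class of data ingestion flows."""
    flow_type: str                  # e.g. "sftp-batch", "db-incremental"
    risk_level: str                 # assessed with Compliance; drives the path
    required_deliverables: tuple    # compliance/quality artifacts due per release

# A small catalog of blueprints; a project team picks one and follows its path.
BLUEPRINTS = {
    "sftp-batch": IngestionBlueprint(
        flow_type="sftp-batch",
        risk_level="low",
        required_deliverables=("design spec", "test evidence"),
    ),
    "db-incremental": IngestionBlueprint(
        flow_type="db-incremental",
        risk_level="high",
        required_deliverables=("design spec", "test evidence", "validation report"),
    ),
}

def delivery_path(flow_type: str) -> str:
    """Choose the delivery path from the blueprint's assessed risk level."""
    blueprint = BLUEPRINTS[flow_type]
    return "full-validation" if blueprint.risk_level == "high" else "lean"
```

The point of the sketch is that the risk assessment, not the team's preference, selects the path — mirroring the Compliance cooperation described above.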
2. Microservices Data Lake for easier deployments
Lean procedures are important, but their execution is limited if the technology lags behind. Data Lakes usually grow together with the business. It is important to deliver value as early as possible, so the Data Lake technology also needs to support growing data volumes and processing requirements.
Additionally, with a lot of time pressure, you cannot afford much, if any, production downtime, and one source of downtime is the need to make changes to data repositories. With a classic data warehouse, all new functionalities must be introduced to the whole repository.
This approach requires a lot of time and work, and most importantly, re-testing and validation of the whole solution to make sure it was not impacted by the changes introduced.
One idea to address such disadvantages is to apply Microservices as an architectural pattern for building Data Lake ingestion platforms. Decomposing key functionalities of the data ingestion flow into separate and independent applications may significantly reduce time spent on delivering the whole Data Lake solution as testing and validation is focused mainly on the modified microservice (others can just follow a standard regression testing path).
For example: adding a new source data connector to the component responsible for data spooling doesn't require re-testing the other microservices dedicated to data transformation or publishing.
That's why microservices matter so much in data lakes. Decomposing the platform's functionality into independent microservices significantly reduces effort and resources, because changes are made only to the relevant parts of the repository. In a well-designed data lake with this kind of decomposed, data-centric architecture, a change to a table in a data source feeding the lake becomes a quick, small-scale event.
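To make the decomposition concrete, here is a minimal sketch of the three ingestion stages as independent components. In a real platform each would be a separately deployed and versioned service communicating over a queue or an API; the stage names and stub logic are illustrative assumptions:

```python
# Each function stands in for an independently deployable microservice.
# Changing one stage leaves the others untouched, so testing and validation
# focus on the modified service only.

def spool(source: str) -> list:
    """Pull raw records from a source system (stubbed here with fixed values).
    Adding a new source connector changes only this service."""
    return [{"source": source, "value": v} for v in (1, 2, 3)]

def transform(records: list) -> list:
    """Apply transformation logic. New business rules change only this service."""
    return [{**r, "value": r["value"] * 10} for r in records]

def publish(records: list) -> int:
    """Write records to the lake zone; returns the count published."""
    return len(records)

# The stages compose into one ingestion flow.
count = publish(transform(spool("erp")))
```

A new connector in `spool()` leaves `transform()` and `publish()` needing only standard regression tests, which is exactly the saving described above.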
3. DataOps
Although "DataOps" may sound like another buzzword, it actually provides a comprehensive approach to faster delivery of data-driven innovation and analytics. For Data Lakes, it covers practices such as:
- Data ingestion automation — for example, by leveraging the metadata-driven ETL approach, where business users just provide business mappings and parameters as the inputs, and ETL tasks are automatically generated on this basis.
- DevOps and Cloud technology usage — this applies to various aspects, but some key benefits are:
a) use of Cloud and Containerization allows you to have multiple environments where Data Engineers can independently create their own transformations and Data Products;
b) automated deployment pipelines — thanks to CI/CD tools, the ETL can be deployed with minimal human interaction.
- Data Quality checks implemented as part of the ETL — e.g., alerting you when a business check fails for data loaded into the Data Lake.
- Self-service data ingestion — this works especially well for repetitive and simple data ingestion flows built with microservices. If a user wants to see a new table in the Data Lake, they can simply "self-order" it through a customer portal.
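The metadata-driven ETL idea from the list above can be sketched briefly: business users supply mappings as plain parameters, and the load statement is generated from them, with a simple data-quality gate before anything runs. The table names, column names and check are invented for the example:

```python
def generate_load_sql(mapping: dict) -> str:
    """Generate a load statement from a user-supplied business mapping.
    In a metadata-driven setup, no ETL code is written per table."""
    # A simple data-quality gate of the kind mentioned above:
    # fail loudly before generating anything if the mapping is empty.
    assert mapping["columns"], "DQ check failed: no column mappings provided"
    cols = ", ".join(
        f"{src} AS {dst}" for src, dst in mapping["columns"].items()
    )
    return (
        f"INSERT INTO {mapping['target']} "
        f"SELECT {cols} FROM {mapping['source']}"
    )

# What a business user might provide through a self-service portal:
mapping = {
    "source": "staging.orders_raw",
    "target": "lake.orders",
    "columns": {"ord_id": "order_id", "ord_amt": "amount"},
}
sql = generate_load_sql(mapping)
```

In practice the generator would emit tasks for whatever orchestration tool is in use; the SQL string here just shows the principle.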
4. Data governance
A Data Lake is like one big data reservoir. Look closely, however, and it turns out there are certain rules of the game governing access to data and data allocation. These rules are the foundation of a well-run data lake. They include both precise arrangements for data classification (for example, the isolation of sensitive data) and the management of access rights, whether handled within the tools used or centrally based on data owner decisions, as in the Apache Ranger model.
In pharma, this is a fundamental necessity, because there are formal requirements regarding the procedures and mechanisms of data access.
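A toy model of tag-based access control in the spirit of the Apache Ranger approach mentioned above — the data owner classifies datasets, and access decisions follow a central policy. The roles, classification tags and policy table are all assumptions for illustration, not Ranger's actual API:

```python
# Central policy table: (role, data classification) -> allowed?
# Populated from data owner decisions.
POLICIES = {
    ("analyst", "public"): True,
    ("analyst", "sensitive"): False,
    ("data_owner", "sensitive"): True,
}

def can_read(role: str, classification: str) -> bool:
    """Central access decision: deny unless an explicit grant exists.
    Default-deny matches the regulated-industry posture described above."""
    return POLICIES.get((role, classification), False)
```

The default-deny behavior is the important design choice here: in a regulated environment, an unclassified dataset or unknown role should never be readable by accident.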
5. Data about data
If we are to stick to the lake analogy, and it might be worth it, imagine that you have knowledge of every drop of water that falls into that great body of water. In data lakes, this is not only possible, but in fact essential.
Tools such as Collibra or Alation inform users about the location of specific data, acting in essence like user guides to the lake.
This data about data — metadata — has to be collected in the data lake system. All the solutions we have designed log every data ingestion automatically.
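A minimal sketch of what such automatic ingestion logging might look like: every load writes a metadata record so users can trace where each "drop of water" came from. The field names are illustrative, not a specific catalog's schema:

```python
import datetime

# In production this would be a metadata store or catalog, not a list.
INGESTION_LOG = []

def log_ingestion(source: str, table: str, rows: int) -> dict:
    """Record one ingestion event; called automatically by every load job."""
    entry = {
        "source": source,
        "target_table": table,
        "row_count": rows,
        "loaded_at": datetime.datetime.utcnow().isoformat(),
    }
    INGESTION_LOG.append(entry)
    return entry

log_ingestion("crm_export", "lake.customers", 1250)
```

Because the logging sits inside the load path rather than being a manual step, the lineage record can never be skipped or forgotten.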
A Data Lake can mean transparent access to data at all levels of the organization in a self-service model. It creates a path towards one authoritative and reliable source of data for the entire organization. It's also fast and agile — and these are the features needed to meet the demanding reality of the pandemic.