Data lake in the pharmaceutical industry: 5 things to keep in mind to get it right
Published on Jan 14, 2021

Data Lakes are a way to store huge amounts of data in their native form. Data stored this way may have varying degrees of structuring or may not even be structured at all. In PwC’s Transforming Risk into Confidence podcast series, PwC’s Joe Sousa compared a data lake to … a lake.

Analogy not creative enough? The truth is, no better one can be found. In his podcast, Joe describes a classic data warehouse as a store of bottled water that can be easily used and consumed. A data lake is like a huge body of water containing water in its natural state. The lake is filled from various sources, and users can examine it, take samples or dive into its depths.

Growth of data

Before I go on to present what in my opinion are the most important issues related to the implementation of data lake technology, a few words about the growing importance of data in the world of business:

  • Data is a big deal. In 2020, humanity will create and consume nearly twice as much data as in 2018. And IDC predicted in their May Global DataSphere that “the next three years of data creation and consumption will eclipse that of the previous 30 years” (yes, video streaming is responsible for a significant part of this amount).
  • Companies have begun to notice the great potential of being data-driven and are using data to add value: one forecast expects the global enterprise data management market to reach $133.4 billion by 2026, growing at a CAGR of 10.2% over the forecast period, and that’s just one of many similar predictions.

The lake

As mentioned before, a data lake is a collection of variously structured data, coming from different sources and available for different users. It seems that such a solution may mean chaos. Meanwhile, this technology is used by companies from various sectors, including — as my own experience confirms — organizations operating in a highly regulated environment, such as the pharmaceutical industry. For a mature data-driven organization, the data lake approach allows for democratization of access to data, ensures transparency and allows the entire organization to use one authoritative data source.

As much as possible, as quickly as possible and without errors

Businesses in the pharmaceutical sector build their competitive advantage by constantly updating their offer with new products. As product introduction cycles get faster and faster, technology brings new opportunities not only in production, but also in collecting and analyzing market feedback through data and advanced analytics, and in spreading data literacy among employees.
Pharma companies are great at handling data themselves. But they often rely on specialized IT companies with data management competencies that are hard to find elsewhere.
This is especially applicable now, as industry leaders have taken on the task of researching and manufacturing hundreds of millions of COVID-19 vaccines in a much shorter time than ever before. This would be hard to achieve without IT support, that is, the ongoing provision of the data necessary to analyze the situation. Data Lakes seem to be the natural sources of this data.
The key question is: how can you meet these expectations and deliver new data to Data Lakes quickly and with agility in an industry as highly regulated as pharma?
Below are some suggestions.

1. Tailor-made solution delivery framework

On the one hand, far-reaching regulations. On the other, the need for quick and agile release and deployment of new IT solutions. This is typical for the pharma industry. These companies set up a framework for IT solution delivery to ensure that quality and regulatory requirements are met. Data Lake ingestion is usually based on repetitive procedures (although data sources and transformation logic may differ from one ETL to another). This makes it possible to adapt the solution delivery framework to these specific circumstances and create predefined blueprints for various data ingestion flow types and scenarios. Project delivery teams can then follow a clearly defined path and start producing the required compliance and quality deliverables right from the beginning of the project lifecycle.
IT solution owners cooperate closely with Compliance to assess the risk, which is a key indicator of which delivery path should be chosen for a particular implementation.
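
To make the idea of predefined blueprints more concrete, here is a minimal sketch in Python. It assumes that a delivery path is chosen from two inputs, the ingestion flow type and the risk level agreed with Compliance. The flow types, blueprint names and deliverables are hypothetical placeholders, not an actual quality framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Blueprint:
    name: str
    deliverables: Tuple[str, ...]


# Hypothetical blueprints: the real list of deliverables would come from
# the company's own quality and compliance framework.
BLUEPRINTS: Dict[Tuple[str, Risk], Blueprint] = {
    ("batch_file", Risk.LOW): Blueprint(
        "standard-batch", ("test summary",)
    ),
    ("batch_file", Risk.HIGH): Blueprint(
        "validated-batch",
        ("risk assessment", "validation plan", "test summary"),
    ),
    ("database_table", Risk.LOW): Blueprint(
        "standard-table", ("test summary",)
    ),
}


def choose_blueprint(flow_type: str, risk: Risk) -> Blueprint:
    """Return the predefined delivery path for a flow type and risk level."""
    try:
        return BLUEPRINTS[(flow_type, risk)]
    except KeyError:
        # No predefined path: the case needs an individual assessment.
        raise LookupError(
            f"No blueprint for flow type {flow_type!r} at risk "
            f"{risk.value!r}; assess it with Compliance"
        ) from None


path = choose_blueprint("batch_file", Risk.HIGH)
print(path.name, path.deliverables)
```

The point of the lookup table is that the delivery team does not negotiate deliverables per project; anything without a predefined path is flagged explicitly instead of being improvised.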

2. Microservices Data Lake for easier deployments

Lean procedures are important, but their execution is limited if technology lags behind. Data Lakes usually grow together with businesses. It is important to deliver value as early as possible, so the Data Lake technology also needs to support growing data volumes and processing requirements.
Additionally, with a lot of time pressure, you cannot afford much, if any, production downtime, and one source of downtime is the need to make changes to data repositories. With a classic data warehouse, all new functionalities must be introduced to the whole repository.
This approach requires a lot of time and work, and most importantly, re-testing and validation of the whole solution to make sure it was not impacted by the changes introduced.
One idea to address such disadvantages is to apply Microservices as an architectural pattern for building Data Lake ingestion platforms. Decomposing key functionalities of the data ingestion flow into separate and independent applications may significantly reduce the time spent on delivering the whole Data Lake solution, as testing and validation are focused mainly on the modified microservice (others can just follow a standard regression testing path).
For example: adding a new source data connector to a component responsible for data spooling doesn't require re-testing other microservices which are dedicated to data transformation or publishing.
That’s why microservices are so important in data lakes. The decomposition of the database functionality into independent microservices allows you to significantly reduce the effort and resources, because changes are made to relevant parts of the repository only.
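
The decomposition described above can be sketched in a few lines of Python. This is only an in-process illustration under simplifying assumptions: in a real platform each stage would be a separately deployed service, and the names `spool`, `transform`, `publish` and the connector registry are made up for this example.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# --- Spooling component: the only place that knows about source formats ---
CONNECTORS: Dict[str, Callable[[str], List[Record]]] = {}


def connector(name: str):
    """Register a source connector under a source type name."""
    def register(func):
        CONNECTORS[name] = func
        return func
    return register


@connector("csv")
def read_csv(payload: str) -> List[Record]:
    return list(csv.DictReader(io.StringIO(payload)))


# Adding this connector changes the spooling component only.
@connector("json")
def read_json(payload: str) -> List[Record]:
    return [{k: str(v) for k, v in row.items()} for row in json.loads(payload)]


def spool(source_type: str, payload: str) -> List[Record]:
    return CONNECTORS[source_type](payload)


# --- Transformation component: independent of where the records came from ---
def transform(records: Iterable[Record]) -> List[Record]:
    return [{k.strip().lower(): v.strip() for k, v in r.items()} for r in records]


# --- Publishing component: writes to the lake (here a plain dict) ---
def publish(records: List[Record], lake: Dict[str, List[Record]], table: str) -> int:
    lake.setdefault(table, []).extend(records)
    return len(records)


lake: Dict[str, List[Record]] = {}
publish(transform(spool("csv", "Batch_ID, Site\nB-1, Warsaw\n")), lake, "batches")
publish(transform(spool("json", '[{"Batch_ID": "B-2", "Site": "Berlin"}]')), lake, "batches")
print(lake["batches"])
```

Because `transform` and `publish` only see lists of records, registering the JSON connector does not touch them, which is why they would only need the standard regression path.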

3. DataOps

In a well-designed data lake, a change to a table in a data source feeding the lake is a quick, small-scale event if the appropriate data-centric approach is taken.
Although “DataOps” may sound like another buzzword, it actually provides a comprehensive approach for quicker delivery of data-driven innovation and analytics. With regard to Data Lakes, it may apply to such practices as:

  • Data ingestion automation — for example, by leveraging the metadata-driven ETL approach, where business users just provide business mappings and parameters as the inputs, and ETL tasks are automatically generated on this basis.
  • DevOps and Cloud technology usage — this applies to various aspects, but some key benefits are:
    a) use of Cloud and Containerization allows you to have multiple environments where Data Engineers can independently create their own transformations and Data Products;
    b) automated deployment pipelines — thanks to CI/CD tools, the ETL can be deployed with minimum human interaction
  • Data Quality checks implementation as a part of the ETL — e.g., alerts you when a business check fails for data loaded to the Data Lake
  • Self-service data ingestion — this may work especially well for repetitive and simple data ingestion flows built with microservices. If a user wants to see a new table in the Data Lake, they can just “self-order” it through a customer portal
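
Two of the practices above, metadata-driven ETL and data quality checks built into the ETL, can be illustrated together. The sketch below is a simplified assumption of how such a generator might look: the `Mapping` structure, the column names and the check name are hypothetical, and a real implementation would generate jobs for an actual ETL engine rather than plain Python functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Mapping:
    """Business-provided metadata for one target table (hypothetical schema)."""
    target_table: str
    columns: Dict[str, str]  # source column -> target column
    checks: Dict[str, Callable[[Row], bool]] = field(default_factory=dict)


def build_task(mapping: Mapping) -> Callable[[List[Row]], Dict[str, object]]:
    """Generate an ETL task from metadata instead of hand-writing it."""
    def task(source_rows: List[Row]) -> Dict[str, object]:
        loaded, alerts = [], []
        for index, row in enumerate(source_rows):
            # Apply the business mapping: rename source columns to target columns.
            target = {dst: row.get(src) for src, dst in mapping.columns.items()}
            # Run the business checks and record an alert for every failure.
            failed = [name for name, check in mapping.checks.items() if not check(target)]
            if failed:
                alerts.append({"row": index, "failed_checks": failed})
            loaded.append(target)
        return {"table": mapping.target_table, "rows": loaded, "alerts": alerts}
    return task


sales_mapping = Mapping(
    target_table="sales",
    columns={"QTY": "quantity", "SKU": "product_code"},
    checks={
        "quantity_is_positive": lambda r: isinstance(r["quantity"], int) and r["quantity"] > 0
    },
)
load_sales = build_task(sales_mapping)
result = load_sales([{"QTY": 5, "SKU": "A1"}, {"QTY": -2, "SKU": "B7"}])
print(result["alerts"])
```

In this sketch a failed check raises an alert but does not block the load, which matches the alerting behaviour described in the list. Whether failing rows should be rejected instead is a design decision for each data product.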

4. Data governance

A Data Lake is like one big data reservoir. However, when you look closely, it turns out that there are certain rules of the game here, regarding the access to data and data allocation. These rules are the foundation of data lake optimization. They include both precise arrangements for data classification (for example, isolation of sensitive data) and management of access rights based on the tools used or central management (data owner decisions), as in the Apache Ranger model.
In pharma, this is a fundamental necessity, because there are formal requirements regarding the procedures and mechanisms of data access.
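
As a rough illustration of these rules, the sketch below combines a central policy based on data classification with grants decided by the data owner. It does not use the Apache Ranger API. The classifications, roles and dataset names are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Dataset:
    name: str
    classification: str  # e.g. "public", "internal", "sensitive"
    owner: str


# Central policy: roles allowed to read each classification (hypothetical roles).
READ_POLICY: Dict[str, FrozenSet[str]] = {
    "public": frozenset({"analyst", "data_engineer", "data_steward"}),
    "internal": frozenset({"data_engineer", "data_steward"}),
    "sensitive": frozenset({"data_steward"}),
}


def can_read(
    user: str,
    roles: FrozenSet[str],
    dataset: Dataset,
    owner_grants: Dict[str, FrozenSet[str]],
) -> bool:
    """Allow access via the central policy or an explicit data owner grant."""
    # An unknown classification matches no role, so access is denied by default.
    if roles & READ_POLICY.get(dataset.classification, frozenset()):
        return True
    return user in owner_grants.get(dataset.name, frozenset())


trial = Dataset("clinical_trial_results", "sensitive", owner="owner_a")
grants = {"clinical_trial_results": frozenset({"alice"})}

print(can_read("bob", frozenset({"data_steward"}), trial, grants))
print(can_read("carol", frozenset({"analyst"}), trial, grants))
print(can_read("alice", frozenset({"analyst"}), trial, grants))
```

Keeping the classification on the dataset and the policy in one central table means that isolating sensitive data is a matter of labelling it correctly, not of configuring every tool separately.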

5. Data about data

If we are to stick to the lake analogy, and it might be worth it, imagine that you have knowledge of every drop of water that falls into that great body of water. In data lakes, this is not only possible, but in fact essential.

Tools such as Collibra or Alation inform users about the location of specific data, acting in essence like user guides on the lake.
The data about data has to be collected in the data lake system itself. Every solution we have designed logs all data ingestions automatically.
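
One way to make such logging automatic is to wrap every ingestion function so that it cannot run without leaving a record. The decorator below is a minimal sketch of that idea, not a description of any specific product: `INGESTION_LOG` stands in for a real metadata store, and the source and table names are hypothetical.

```python
import functools
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Stand-in for a metadata store or data catalog (assumption for this sketch).
INGESTION_LOG: List[Dict[str, object]] = []


def logged_ingestion(source: str, target_table: str):
    """Record every run of an ingestion function, whether it succeeds or fails."""
    def decorate(ingest: Callable[..., List[dict]]):
        @functools.wraps(ingest)
        def wrapper(*args, **kwargs):
            entry: Dict[str, object] = {
                "source": source,
                "target_table": target_table,
                "started_at": datetime.now(timezone.utc).isoformat(),
            }
            try:
                rows = ingest(*args, **kwargs)
            except Exception as exc:
                entry.update(status="failed", error=repr(exc))
                raise
            else:
                entry.update(status="succeeded", row_count=len(rows))
                return rows
            finally:
                # Runs in both cases, so no ingestion goes unrecorded.
                INGESTION_LOG.append(entry)
        return wrapper
    return decorate


@logged_ingestion(source="erp_export", target_table="batches")
def ingest_batches(rows: List[dict]) -> List[dict]:
    # Keep only rows that carry a batch identifier.
    return [r for r in rows if r.get("batch_id")]


ingest_batches([{"batch_id": "B-1"}, {}])
print(INGESTION_LOG[-1]["status"], INGESTION_LOG[-1]["row_count"])
```

Because the log entry is written in the `finally` branch, failed ingestions are recorded as well, which is usually what an auditor in a regulated environment will ask about first.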

A Data Lake can mean transparent access to data on all levels of the organization in a self-service model. It creates a path towards one authoritative and reliable source of data for the entire organization. It’s also fast and agile — and these are the features necessary to answer the demanding reality of the pandemic.

Written by

Tobiasz Kroll



