Data Lake: Agile, Unstructured, but Governed and Controlled
Published on Dec 21, 2021

Companies in regulated industries, such as pharmaceuticals, finance, and logistics, can use Data Lakes to make their data processing more agile and become data-driven. But it is crucial for these organizations to stay compliant with industry regulations while not slowing down the development of their data capabilities.

This, of course, can be done.

The agile approach assumes we cannot predict future needs, so it is difficult to write rules for the future. Accordingly, the Data Lake holds data waiting for the right moment to become useful. That data is raw, neither modeled nor integrated. It becomes structured as soon as a well-defined query arises.

This provides many benefits, but you need a special approach to leverage them. Especially when you are in a regulated industry, such as finance, pharma, or transport and logistics.

No structure, no limits

The Data Lake environment can accept any number of raw files without requiring them to be structured and without affecting the existing infrastructure. It imposes no limits and does not force you to define a schema on write. Instead, by collecting raw data, a Data Lake allows it to be modeled on read. The high scalability achieved this way opens the door to analytics on very large amounts of data.
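The schema-on-read idea above can be sketched in a few lines. This is a minimal, hypothetical example (the field names and the quarantine behavior are illustrative assumptions, not part of any specific product): raw records land in the lake unchanged, and structure is applied only when a concrete query defines it.

```python
import json

# Raw records land in the lake as-is; no schema is enforced on write.
raw_records = [
    '{"id": 1, "temp": "21.5", "site": "WAW"}',
    '{"id": 2, "temp": "19.0"}',                 # a missing field is fine on write
    '{"id": 3, "temp": "bad", "site": "KRK"}',
]

def read_with_schema(lines):
    """Apply structure only at read time: cast types, default missing
    fields, and quarantine records that do not fit the query's schema."""
    rows, quarantined = [], []
    for line in lines:
        rec = json.loads(line)
        try:
            rows.append({
                "id": int(rec["id"]),
                "temp_c": float(rec["temp"]),
                "site": rec.get("site", "UNKNOWN"),
            })
        except (KeyError, ValueError):
            quarantined.append(rec)
    return rows, quarantined

rows, bad = read_with_schema(raw_records)
```

Note that nothing was rejected at write time; the third record is set aside only because this particular read-time schema cannot interpret it, and a different query could still use it.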

A Data Lake is not a tool, nor a specific solution or product that a company can buy, install, and use. As a reference architecture, it is an organization's approach to using data in its operations, and involves:

  • Data management: implementation of data governance rulesets throughout the whole lifecycle of data, from creation to retirement.

Can such an agile, unstructured repository filled with raw data be used in regulated industries? What about the risk of breaching compliance with strict formal requirements and restrictions that, as in the pharmaceutical industry, govern drug research and pharmaceutical product development, manufacturing with quality control, drug registration, distribution, and even marketing?

The answer is yes, it can, if the data governance principles are well-designed and implemented. Only a well-governed Data Lake can minimize the risk and cost of non-compliance, legal complications, and security breaches.

Regulatory requirements entail:

  • Thorough analysis and repair of every error encountered
  • Formalization of installation and other database maintenance processes
  • Validation of any IT tools used

Working with the leaders of the pharmaceutical industry, we have developed a method of implementing Data Lakes in a way that allows you to control data while maintaining its agility. Let me share the two most important elements of that method below.


1. Declarative configuration

This feature keeps the Data Lake open while extracting its common functionalities. Further configuration stays declarative and simple, which eases the work of analysts, data engineers, and other advanced users who draw reliable insights from huge amounts of data. There is no rigid architecture; instead, small services regulate the inflows and outflows of the lake.
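To make the idea concrete, here is a minimal sketch of what a declarative inflow configuration could look like. The feed names, fields, and retention values are invented for illustration; the point is that analysts describe what should happen to a feed, while a single generic service decides how.

```python
# Hypothetical declarative inflow configuration: per-feed rules are data,
# not code, so adding a feed means adding an entry, not a new service.
INFLOW_CONFIG = {
    "clinical_trials_feed": {
        "format": "csv",
        "required_columns": ["trial_id", "site", "result"],
        "retention_days": 3650,   # e.g. regulated data kept for 10 years
        "pii": False,
    },
    "marketing_feed": {
        "format": "json",
        "required_columns": ["campaign_id"],
        "retention_days": 365,
        "pii": True,
    },
}

def validate_inflow(feed_name, columns):
    """Generic check driven purely by configuration, not per-feed code."""
    spec = INFLOW_CONFIG[feed_name]
    missing = [c for c in spec["required_columns"] if c not in columns]
    return {
        "accepted": not missing,
        "missing": missing,
        "retention_days": spec["retention_days"],
    }

result = validate_inflow("clinical_trials_feed", ["trial_id", "site"])
```

Because the rules live in configuration, they can be reviewed and versioned like documents, which is exactly what a regulated change process expects.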


2. Microservices: smooth control

In our experience, control in a Data Lake can be achieved by establishing the underlying principles and using microservices to organize the whole system of dispersed rules. These microservices make up the Data Lake engine and are the key to reconciling the two approaches: data openness and data regulation. They are configurable, simple to implement, and their behavior can be adjusted to the formal requirements.

How does the microservice approach solve the strict regulation problem?

  • Isolation of central services enables adding new data items as configuration items that are not strictly regulated.
  • Microservices are loosely coupled, so they evolve independently and can be scaled individually. Precisely marked boundaries make testing simpler, faster, and safer.
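The rule-dispatch idea behind such an engine can be sketched as follows. Each "microservice" is modeled here as an independent, configurable rule, and the engine only dispatches; all names and thresholds are illustrative assumptions, not a real product API.

```python
# Each rule is independent and driven by its own configuration, so rules
# can evolve, be tested, and be replaced in isolation.
def completeness_rule(record, config):
    missing = [f for f in config["required_fields"] if f not in record]
    return ("completeness", not missing, missing)

def range_rule(record, config):
    lo, hi = config["temp_range"]
    ok = lo <= record.get("temp_c", lo) <= hi
    return ("range", ok, record.get("temp_c"))

# The registry is configuration: adding control means adding an entry.
RULES = [
    (completeness_rule, {"required_fields": ["batch_id", "temp_c"]}),
    (range_rule, {"temp_range": (2.0, 8.0)}),   # e.g. cold-chain limits
]

def run_engine(record):
    """Apply every registered rule; the record passes only if all do."""
    results = [rule(record, cfg) for rule, cfg in RULES]
    return all(ok for _, ok, _ in results), results

passed, report = run_engine({"batch_id": "B-17", "temp_c": 5.5})
```

In a real deployment each rule would run as its own service behind a queue or an API, but the governance property is the same: the engine enforces the dispersed rules, while each rule remains small enough to validate against the formal requirements.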

A highly regulated operating environment requires companies to be cautious about innovation, yet organizations are effectively transforming towards data-driven decision making. In this process, technology is secondary; what matters most is the right architecture and its execution. Success in this area means the ability to achieve new business goals and increase competitiveness, as well as greater openness to further technological advancements such as the cloud and machine learning, all while maintaining compliance. This requires the kind of agility provided by the compliant Data Lake approach.

Written by

Wojciech Winnicki


