Real Time Data Lake

1 min read

The Challenge

The largest data lake at a top pharma corporation runs 30,000 batch jobs daily, storing data in a 2 PB AWS S3 bucket and a Snowflake database.

Problems to solve

  • Users need data faster, preferably in near real time
  • There are tens of source systems – some of them can publish data as streams, some cannot
  • SQL transformations must be applied to the data before it is published to users

The solution

  1. An event streaming platform (Kafka) was deployed on AWS Cloud. Data is streamed to it:
    • Directly by source systems
    • By custom connectors running on a Kafka Connect cluster
    • By commercial/external connectors
  2. A Spark application was deployed on a Kubernetes cluster. It can blend data from different topics and from AWS S3 using a user-provided SQL transformation (a sketch follows this list)
  3. Starburst Trino SQL access was deployed on the Kubernetes cluster, allowing data stored on S3 to be queried with SQL tools (see the query example below)
  4. A Snowflake connector streams data directly from Kafka to Snowflake (see the connector sketch below)
  5. All the actions are orchestrated with a simple declarative configuration kept in Git (an illustrative example follows)
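The kind of Spark job described in step 2 could look roughly like the following minimal sketch. Topic names, S3 paths, and column names are assumptions for illustration; in the real solution these would come from the declarative Git configuration rather than being hard-coded.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rt-lake-blend").getOrCreate()

# Stream of manufacturing events from a Kafka topic (names are assumptions).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "plant.events")
    .load()
    .select(F.col("key").cast("string"),
            F.col("value").cast("string").alias("payload")))

# Static reference data already sitting in the S3 data lake (path is an assumption).
reference = spark.read.parquet("s3a://example-lake/reference/materials/")

events.createOrReplaceTempView("events")
reference.createOrReplaceTempView("materials")

# The user-provided SQL transformation is applied to the blended views.
blended = spark.sql("""
    SELECT e.key, e.payload, m.material_name
    FROM events e
    JOIN materials m ON e.key = m.material_id
""")

# Write the result back to the lake; checkpointing keeps the stream restartable.
query = (
    blended.writeStream
    .format("parquet")
    .option("path", "s3a://example-lake/curated/events/")
    .option("checkpointLocation", "s3a://example-lake/checkpoints/events/")
    .start())
query.awaitTermination()
```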
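Once tables are exposed through Starburst Trino (step 3), any SQL tool can query the S3 data. A small example using the open-source trino Python client; host, catalog, schema, and table names are assumptions.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",    # catalog exposing the S3 data lake tables
    schema="curated",
)
cur = conn.cursor()
cur.execute(
    "SELECT material_name, count(*) AS events "
    "FROM events GROUP BY material_name LIMIT 10")
for row in cur.fetchall():
    print(row)
```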
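The Snowflake connector in step 4 runs on the Kafka Connect cluster and is registered through the Connect REST API. A hedged sketch is below; the URL, topic, and Snowflake settings are placeholders, and the full set of required properties (keys, authentication, and so on) is documented with the Snowflake Kafka connector.

```python
import json
import requests

connector = {
    "name": "snowflake-sink-plant-events",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "plant.events",
        "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
        "snowflake.user.name": "KAFKA_CONNECTOR",
        "snowflake.database.name": "RT_LAKE",
        "snowflake.schema.name": "RAW",
        "buffer.flush.time": "30",   # seconds between flushes to Snowflake
    },
}

# Register the connector with the Kafka Connect REST API (URL is an assumption).
resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```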
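The source does not describe the exact format of the declarative configuration from step 5, but a purely hypothetical pipeline definition kept in Git, and the way an orchestrator might read it, could look like this:

```python
import yaml

# Hypothetical pipeline definition as it might be committed to Git.
PIPELINE_SPEC = """
pipeline: plant-events
source:
  kafka_topic: plant.events
blend_with:
  s3_path: s3a://example-lake/reference/materials/
transform_sql: |
  SELECT e.key, e.payload, m.material_name
  FROM events e JOIN materials m ON e.key = m.material_id
sinks:
  - s3: s3a://example-lake/curated/events/
  - snowflake_table: RT_LAKE.RAW.PLANT_EVENTS
"""

spec = yaml.safe_load(PIPELINE_SPEC)
print(f"Deploying pipeline '{spec['pipeline']}' "
      f"for topic {spec['source']['kafka_topic']}")
```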

Results

  • Better data quality thanks to the ability to apply data quality checks in real time (a sketch follows this list) – this allows the business to find issues at an early manufacturing stage
  • Data is delivered to Snowflake and to SQL Access tool users in near real time
  • Simple transformations can be applied in real time
  • The solution is scalable: new sources/streams can be added easily through the Git configuration
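Continuing the Spark sketch from the solution section (where blended is the streaming DataFrame), a real-time quality check might simply route rows that fail a rule to a quarantine path so issues surface during manufacturing rather than after batch loads. Column names and rules here are assumptions.

```python
from pyspark.sql import functions as F

# Flag each row with a simple, illustrative quality rule.
checked = blended.withColumn(
    "dq_ok",
    F.col("material_name").isNotNull() & (F.length("payload") > 0),
)

good = checked.filter("dq_ok")
bad = checked.filter("NOT dq_ok")

# Valid rows go to the curated area, failing rows to quarantine for review.
good.writeStream.format("parquet") \
    .option("path", "s3a://example-lake/curated/events/") \
    .option("checkpointLocation", "s3a://example-lake/checkpoints/good/") \
    .start()
bad.writeStream.format("parquet") \
    .option("path", "s3a://example-lake/quarantine/events/") \
    .option("checkpointLocation", "s3a://example-lake/checkpoints/bad/") \
    .start()
```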