Data Platforms: The Unsung Hero of AI Transformation

[Maciej Kłodaś – MK]
Hi, my name is Maciej. I’m the leader of Analytics Experience Competency Group at C&F and this is C&F Talks Season 2, a podcast where we discuss our challenges, ideas and trends from the perspective of an IT partner. So if you like our content, please be sure to leave a comment, like and subscribe, because we really love to hear your feedback about the content we create.
And I would like to welcome my guest today, Tobiasz Kroll, Senior Director at C&F. Hello, Tobiasz.
[Tobiasz Kroll: TK]
Hello.
[MK]
So Tobiasz, I know that you’ve been at C&F for ages now. So tell me, what is your expertise? What do you do?
[TK]
So mainly my team is responsible for delivering the data platforms and data products. And that’s my role. I’m the lead of the delivery team in this space for human pharma clients.
[MK]
And data platforms will be the topic of today’s talk. So tell me, what is a modern data platform?
[TK]
It’s probably kind of like a buzzword, which is very popular now. Everyone uses this term, especially in this AI space. From my perspective, we should always start with the business perspective, the business purpose, right?
That’s what we as IT teams really live for. That’s our purpose. So from my perspective, the data platform is an ecosystem that helps drive businesses or enterprises to be more data-driven companies, right?
Data-driven meaning you can make a decision fast in an agile manner based on the data that you have in your company. So in a nutshell, that’s my perspective. Of course, then it translates into more technology-related components, which I think we’ll be talking about later.
[MK]
Okay. But everybody or every company nowadays is data-driven, right? So the data platform is not a new thing on the market. It’s an industry standard. Everybody has data platforms. So what do you do? You have implemented all the data platforms in the world for our clients. So what is your role now?
[TK]
I think if you look into the details, or if you try to dig deeper into the structure, the current situation that many companies have, you may notice that, well, people will tell you that they have data lakes, they have data platforms, they’ve been building them for years, right? But when you see the details, you can notice that it’s not necessarily like that. For example, there are companies that, of course, implement modern data platforms, meaning they follow the modern architectural patterns of implementing not only data ingestion pipelines, but also data catalogs, data quality tools, data governance tools, right?
They are the top ones. They are bringing the tools to make this competitive advantage, right? That’s what gives you the power to be faster than your competitors.
But on the other hand, you can also notice that, even within the same companies, there are many business units that still stay with this legacy style applications or systems integrating data in a very traditional way, just because of the fact that maybe they are not as, let’s say, critical or crucial for building this competitive advantage. So for example, my vision or my perspective is that if you look into the pharmaceuticals, into the human pharma space, probably you will notice that such business units like research and development, manufacturing, supply chain, these are the ones that have the modern solutions, right? Because that’s the power of the company.
Their impact is the highest. But at the same time, you can notice that there are some business units that still operate in this traditional way, like HR, for example, right?
And it doesn’t mean that the HR data is not important. It’s just that it’s not as important as the data from the units I mentioned. So my point is that I would say everyone has data platforms, but not everyone really has all of them transitioned into this modern state, I would say.
[MK]
Okay. But if it works, why do you need to change that?
[TK]
Well, I would say there are a few factors that should drive it, right?
So ultimately it’s better to have the majority of your company’s technology stack following the same pattern. In many cases it’s related to cost, to the unification of operational cost, you know, to having a similar skillset among your technology teams, and also to having the same experience from the user perspective. If you have multiple standards and multiple technologies, it slows you down, I think. And ultimately, I believe that the value will come out of a really unified view of the data you have in the company.
So each data set really matters at the end. And again, if you have it all unified, I wouldn’t say compiled, but brought to the same space and, you know, have the same user experience, meaning ability to use the data, according to some common pattern, it’s easier for everyone to build the value.
[MK]
Okay, there are multiple buzzwords and trends on the market, like data mesh, like data as a product, right? But for a few episodes now, we’ve been talking about AI and the impact of AI on the business. So my question is, how is AI impacting the data platform space?
[TK]
I would say that, from my perspective, well, AI is still, it’s a strategy, right? I mean, implementing AI in the company is a strategy. And everyone will tell you, well, the data, that’s the key enabler, the foundation of the real AI strategy implementation, because the quality of data you get into the AI, let’s call it this way, will give you the right outcome, the right output, right?
So, well, which is, to be honest, something similar to what, you know, how the businesses operated over the years. Everyone will tell you that they need the right quality of data, the data needs to be on time, I need to be able to use the data, I want to use the self-service capabilities to be able to do something with this data by myself. So it’s not really different, but AI enforces it even more because it’s more sensitive in terms of the quality of the data, right?
[MK]
So before AI, the quality of the data was good enough for people to analyze it, but AI requires the quality to be perfect.
[TK]
Yes. And if you don’t have this human factor, you need to ensure that really what you, what you bring in is something that, you know, will at the end give you the right result.
[MK]
Okay, so now the companies that have been building those data platforms for years are at a point where they need to optimize them, change them, modernize them, improve the technology stack of the data platform in order to comply with the needs of AI solutions, is that correct? So we might have two scenarios: either you are changing the legacy elements of the data platform, or you are building new data platforms.
[TK]
My observation is that in many cases, we follow the roadmap of increasing the maturity of the existing data platforms. For example, the data lake concept is nothing new, right? It’s something that has been with us for, you know, 10 years probably, right?
So if you look into the past, at the beginning, everyone wanted to just bring as much data, as many data sets as possible, to the data lake. It was, of course, business-driven, because someone wanted to build some use case on top of that. But in the end, it was just, let’s bring, you know, the data to the data lake, and cloud hosting, all of the technologies that were rising over the last years, helped to do that, right?
So you were able to massively push all of the data that you have inside your organization to the data lake, right? So that’s the concept. And now you end up in a situation where you have, like, a huge storage with the data and, you know, the business may still complain, well, I don’t know what to do with that, right?
So then you think about increasing the value by adding some self-consumption capabilities on top of that, right? And again, that’s a broader concept. One of them, which I personally like, is called data mesh.
Again, it’s not very new. I think it’s a few years old already, but it basically tells you that the paradigm of building data platforms should be based on a few pillars. One of them is to ensure that the ownership of the data is decentralized.
The second one is that you have the data products approach, meaning you treat data as a product. The third one is that you give self-service capabilities. And the last one, I believe, is called federated governance.
So again, it’s to ensure that the data is accessible in the right way. I like this comparison: the data mesh concept is something like, you know, a huge supermarket. Just imagine that you have so many products inside and they are unorganized. How would you be able to use it as a customer, as a client, right? And literally that’s what it is, right?
It’s just a matter of bringing in the structure. So you have sections in the supermarket: here’s the food, here are the tools, clothing, and so on. And you have some teams, sub-teams, responsible for each of the sections.
You have the products inside, which are tagged. You have, you know, not just the price tag, you have the information: what is the best-before date for the product, the manufacturing date, and so on. So you have some attributes that help you to understand what’s inside.
You have the capabilities to search through the catalogue of products, you know, websites, applications, or you can even just walk in and, of course, see what you need on the shelf. And finally, for example, you have a section with, you know, alcohol that you cannot sell to minors, so that’s the policy. A role-based access policy is applied.
So to me, it’s as simple as that, just to bring in this kind of approach for the data.
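The supermarket analogy above maps fairly directly onto a data catalog. As a rough, non-authoritative sketch, not any specific product, and with every name and rule invented for illustration, the idea can be expressed in Python like this:

```python
from dataclasses import dataclass, field

# Toy sketch of the supermarket analogy: a catalog with "sections"
# (domains), tagged products (metadata), self-service search, and an
# access policy for the "alcohol aisle". All names are invented.

@dataclass
class DataProduct:
    name: str
    domain: str                       # the supermarket "section"
    owner: str                        # decentralized ownership (data mesh)
    refreshed: str                    # freshness, like a best-before date
    sensitive: bool = False           # restricted, like the alcohol section
    tags: list = field(default_factory=list)

class Catalog:
    def __init__(self):
        self.products = []

    def publish(self, product: DataProduct) -> None:
        self.products.append(product)

    def search(self, tag: str) -> list:
        # Self-service discovery through metadata, not tribal knowledge.
        return [p for p in self.products if tag in p.tags]

    def request_access(self, product: DataProduct, roles: list) -> str:
        # Federated governance: sensitive products need an extra role.
        if product.sensitive and "hr-restricted" not in roles:
            return "denied"
        return "granted"

catalog = Catalog()
catalog.publish(DataProduct("employee", "HR", "hr-data-team", "2024-06-01",
                            sensitive=True, tags=["hr", "core"]))
catalog.publish(DataProduct("org_structure", "HR", "hr-data-team",
                            "2024-06-01", tags=["hr"]))

hits = catalog.search("hr")
print([p.name for p in hits])                        # ['employee', 'org_structure']
print(catalog.request_access(hits[0], ["analyst"]))  # denied
```

The point is not the code itself but the shape: discovery happens through metadata, each section has an owner, and policy is applied at access time.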
[MK]
But this is from the business user’s perspective, right? The usability of the data, they know what to search, where to search for. From the organizational standpoint, it’s about cost optimization, for instance, right?
Because before that, when the data lake started, you were ingesting all of the data you had in your company. The problem was that you couldn’t search for the data in a proper way. And the cost of manipulating the data was very, very high.
Now, we are, the companies are trying to narrow down the data sets to be more usable for data users and to reduce the cost of working with the data. Is that correct?
[TK]
I think so. Ultimately, the cost, that’s the main thing that, you know, drives the decisions. But I mean, it’s obviously not the only one, but still, I agree.
[MK]
Okay. We’ve been discussing one of the recent use cases of improving a legacy data platform, right? Can you elaborate on that topic?
[TK]
I think I can also refer back to your earlier question, you know, do we really build new data platforms or just modify or modernize the existing ones? Again, I said that I feel we mainly modernize, but of course it’s a matter of designing some strategy. In other words, the case I will talk about is related to one of the platforms in the HR space, one of those examples where HR was probably left until the very end for implementation.
But it’s not like, you know, all of the others who already built their platforms need to start from scratch. No, they will leverage what we are building for this platform. So that’s one of the points: think about the strategy that will, you know, help everyone inside the company.
Try not to build everything from scratch. Of course, that’s pretty obvious, but I think it’s worth repeating. So the use case I’ll be talking about is related to modernizing a legacy data hub into a modern data platform, following the principles of the data products approach, right?
So the challenge we had there was mainly about complexity. There were, like, hundreds of downstream systems and applications using the existing legacy solution. And as you know, the HR space is very specific.
You have many business rules, country-level business rules, the data is sensitive, and you need to ensure that, per country, you will execute this access, you know, in a proper way. So in order to help with that, we started small, meaning we built, like, a huge end-to-end mapping of the existing data pipelines to understand what the usage of the data is, who is using it, how often, and what data attributes are being processed from the source to the outbound. And some people will say it’s a waste of money, because you just spend time doing analysis when, you know, usually what you’re doing in development is what matters.
But my point is that thanks to this analysis, we were able to identify that around, like, 20% of these outbounds were really inactive, right? So at the beginning you spend some time, but then you optimize the cost. You see that people don’t really need this data.
Then you learn that out of the existing huge data sets, or huge data objects, that are inside, you can really distinguish a few major data products. So you start thinking about the data as a product, meaning not just like a table from the source, but as some business sub-domain. For example, in the case of HR, it will be the employee, it will be the organization structure, it will be the learning history for the employee.
When you start thinking like that, you can quickly identify those outbounds, or those downstreams, which really need, you know, a particular sub-domain, right? So we started designing in this data-as-a-product mode. Of course, without having the proper information about this data, it’s useless, right? So you need to bring in the data product approach in the context of defining the metadata, right?
So the data about data, in general. And of course there are various tools helping with implementing these concepts. For example, you know, you can have Collibra, which is a data catalog tool, but it also gives you self-service capabilities. You have the marketplace feature, where you just search through the published data products and can see the information about them.
You can even push there some of the data observability attributes, so you can see how fresh your data is, how often it’s refreshed, or if there is a new data quality issue. Again, that’s a matter of some integration and usage of data quality tools, and there are many of them on the market.
But at the end, what you get is a single place where the user can go search through the data products and literally just order the access to that. So again, you just implement the self-provisioning. What really matters in this case is also the aspect of the privacy of data.
In many cases, you need to remember that some of these data sets are very confidential, especially in the HR space. But again, the tools will bring you the capabilities. You just need to put some governance structure on top of that, define the approach on how to tag the data, how to tag particular attributes as sensitive, and then there are now, I would say, very broad technical capabilities to implement that.
For example, in Snowflake, you can just set up the proper role-based access and that’s all. It will propagate and be applicable across the organization if you do it like that. So that’s what we’ve done in this use case.
And I think all of the other data lakes that are built within the organization can leverage this approach. For them, it will just be a matter of making this step further to implement the data-as-a-product approach and leveraging the tools that we built.
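To make the tagging idea concrete: platforms like Snowflake enforce this declaratively, with tags and masking policies attached to columns, but the underlying logic can be illustrated engine-agnostically. Here is a hedged Python sketch, where the attribute names, roles, and masking value are all invented examples, not any real schema:

```python
# Toy illustration of tag-based masking: attributes tagged as
# sensitive are hidden unless the caller holds a privileged role.
# Tags, roles, and field names are invented for this sketch.

SENSITIVE_ATTRIBUTES = {"salary", "national_id"}

def mask_record(record: dict, roles: list) -> dict:
    """Return a copy of the record with sensitive fields masked
    unless the caller holds the privileged HR role."""
    if "hr-privileged" in roles:
        return dict(record)
    return {key: ("***" if key in SENSITIVE_ATTRIBUTES else value)
            for key, value in record.items()}

employee = {"employee_id": 101, "country": "PL", "salary": 90000}
print(mask_record(employee, ["analyst"]))        # salary shows as '***'
print(mask_record(employee, ["hr-privileged"]))  # full record
```

The design choice worth noting is that the policy keys off the tag, not the individual column, so tagging a new attribute as sensitive is enough for the rule to propagate, which mirrors what the episode describes about access applying across the organization.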
[MK]
So it looks like a kind of POC or pilot project in a low-impact domain, or with low-impact data. And you also said that building this metadata and these attributes for the data is the enabler for AI products in the company, right? Because AI needs to know exactly what data to use and how to use it in order to deliver quality output for business users, right?
So again, as you said at the very beginning, data platforms are enablers. This is the fundamental element without which AI products will not deliver the quality that is expected from those solutions. Okay.
So what are the main challenges and risks of implementing modern data platforms?
[TK]
Yeah, there are probably many of them, and they have been similar over the years. So it’s nothing really new, because again, if you look into the principles, AI really follows the same needs as before. But now I think you have less room to simply fail, because you need to be as good as possible to truly enable AI.
Even using this example about having this metadata, right? Years back, one of the very popular trends was to implement data catalogs, right? And what you ended up with is that you just put all of the data that you have in your data lake into the data catalog.
And then you ended up with a situation where the metadata was not updated, for example, or even if you set up the data stewardship organization, it didn’t necessarily work very well. So that’s obviously a challenge. And I think what the data product approach helps with here is that it’s decomposed into smaller parts, into smaller pieces, right?
So when you have some subdomain that you’re expert in, it’s easier for you to build the data ownership team. But again, that’s a challenge because you need to enforce the culture within your organization. You need to ensure that there are people assigned, and it’s not probably one person, there is a team, there is some strategy on how to keep it updated.
There are, of course, tools that will help you. Now this term of active metadata is very popular. In other words, you don’t need to update it manually.
The tools and technology can help with always bringing fresh information about the data. But still, there must be some ownership. And that’s definitely a challenge.
Modern data platforms, I would say, emphasize that a data platform is no longer only an IT project, it’s a business project as well, right? So that’s one of the things. To mitigate this challenge, there is also a nice concept of data democratization, which is enforced in many companies now.
It’s about bringing in the tools that will help the users to understand the data or to use it effectively. And it comes through many various aspects: it could be, you know, building a community, building tools, knowledge-sharing platforms, even AI bots that will tell you what data is inside, you know, the data hub you are handling. So these are, like, the methods to try to ensure that everyone feels ownership of the data, right?
That’s one of the challenges. I would say the other challenge, probably, I mean, the technology challenges, to me as a person who is delivering the solutions, maybe they’re not so critical, because I feel that there are so many options now that you can always, you know, use various technologies adapted to your stack, to your enterprise architecture. So this one I will not treat as a challenge at the moment.
I will focus more on this aspect of building the organization and culture, right? Other challenges, which I think I’ve already talked about a bit, are more related to data quality, data governance, and the aspect of building this user-friendly self-service layer.
[MK]
What about the volume of data? Because nowadays, the volume of data is getting bigger and bigger. Is this a challenge at the moment?
It’s connected to data quality and governance, of course, or is it not?
[TK]
I wouldn’t say that the volume of data is a challenge now. You know, again, cloud technologies help us to truly address that. So we can process massive amounts of data.
Of course, the more data you have, the higher the risk of having some data quality issues. But for this, as we talked about, we need to ensure that we have the right strategy in place. And there are many of them, but implementing data quality rules is a must, along with bringing in data observability tools that will be proactively, continuously measuring the quality of data and, you know, raising alerts when the quality drops, or even self-healing procedures, which is again one of the things that can be implemented.
What is a self-healing procedure? It’s something you can implement as long as you know how to address the issue. For example, if you have a source system that you know may have some trouble with the source data, and you know how to fix it when the issue occurs, you can try to automate that. That’s one of the ideas, right? Of course, there are many, many examples.
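The self-healing idea described above amounts to pairing a quality check with a registered automated fix for a known issue type, and raising an alert either way. A minimal Python illustration, where the rule, the fix, and the field names are all invented for the sketch:

```python
# Toy self-healing sketch: a data quality rule detects a known issue,
# a registered remediation is applied automatically, and an alert is
# recorded for observability. Rule and field names are invented.

def missing_country_rows(rows: list) -> list:
    """Quality rule: indices of rows where 'country' is missing."""
    return [i for i, row in enumerate(rows) if row.get("country") is None]

# Known issue -> automated fix; you only automate what you know how to fix.
FIXES = {"missing_country": lambda row: {**row, "country": "UNKNOWN"}}

def run_with_self_healing(rows: list):
    alerts = []
    bad = missing_country_rows(rows)
    if bad:
        rows = [FIXES["missing_country"](row) if i in bad else row
                for i, row in enumerate(rows)]
        alerts.append(f"auto-fixed {len(bad)} row(s) with missing country")
    return rows, alerts

data = [{"id": 1, "country": "PL"}, {"id": 2, "country": None}]
fixed, alerts = run_with_self_healing(data)
print(fixed[1]["country"], alerts)
```

In a real platform the check would come from an observability tool and the fix would be a pipeline step, but the split is the same: detection, known remediation, and an alert trail.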
[MK]
I mentioned the volume of data because I remember from Gartner’s reports, like, I don’t know, from a year or two back, that companies collecting data were using only a fraction of it. Like 90% or more of the data was unused, just stockpiled. This is the concept of narrowing down the data lake into data products, usable data products.
So not only collecting all of the data and storing it somewhere in the cloud, but narrowing it down to usable chunks that will be used by the business. So how do you start a project like that? Because when you’re already working with a very big data platform, which is, like, 10 years old or more, and it’s connected to multiple different systems and technologies, how do you approach such a project?
[TK]
My recommendation would be to start small. So definitely select some candidates to prove that the approach, that the design, will work for others. I wouldn’t say you need to start with the easiest cases, but with the ones that probably have a lower impact on the business, for example.
So coming back to my example about the use case we built: first of all, we built a phased approach just to set some realistic expectations. We need to remember, again, that ultimately what matters is the satisfaction of the data consumers. That’s what the whole thing is built for.
So the phased approach helped us to divide the work into parts according to their importance, right? First of all, we selected a pretty significant number of data products to be built at the initial stage. It probably covers, like, 70% of the outbounds or the use cases.
Also, if you’re building the data product strategy, you will probably see that there are some major data products which are critical for everything. And that’s why we built them from the beginning, but we did not expose them to the majority of users at first, just to give ourselves a chance to test them on a smaller group.
So you give it to the smaller group. Usually they should be, like, trusted partners, so they are aware that something may not be perfect.
So they help you to build the maturity, but at the same time, they represent the real audience, so they prepare you for deployments in the next phases, when you can definitely push more of the scope into the development work.
What obviously helps is agile methodologies. They are, I think, now standard everywhere, and they help you to adapt quickly to the scope, which in many cases changes rapidly. So this continuous adjustment of the work style and the scope should be something obvious.
The business cooperation, even the business ownership of the project, that’s a must, right? So build a community around it. You don’t drive the project as IT, right?
You do that with the business ownership. They need to help you to build the community around it.
[MK]
All right. I know that I shouldn’t ask you this question because we discussed it before the episode. To finish, what do you think is the future of data platforms?
What is the trend? Right now, this is AI and AI readiness, right? So preparing the data to be fed to AI to deliver this quality.
Having in mind that, for instance, the technology has new features, like data catalogs have new features, do you think those data platforms, in the form you know them and deliver them on a daily basis, are going to change in the future? Because some of the technologies will take over some of the functionalities of the data platforms, and it will be, maybe not centralized, but, you know, fewer elements will be needed to operate on data. What is your perspective on that?
[TK]
I think that there will definitely be a bigger focus on self-service, on making the data usable from the end-user perspective. In other words, we will probably need fewer technical people, like engineers, using the data, and more analysts, people with a business understanding of the data, with tools given to let them, you know, work more effectively.
So I think that yes, the data platforms are now evolving into more dynamic and more sophisticated solutions, where you spend less time and effort to get the data you need quickly into your data lake, for example. But at the same time, you put more attention, more focus, on how to use the data, and AI is definitely one of the things driving it. And we’ll see.
But from my perspective, ultimately everything around data usage and self-service will be the key.
[MK]
Great. Okay. Thank you very much for a very inspiring discussion.
Thanks guys. If you like the content, please comment and subscribe to keep us going and see you soon in the next episode of C&F Talks. Thank you.

From legacy systems to unified data ecosystems
Tobiasz Kroll, the expert in charge of C&F’s data platform delivery for human pharma clients, explains why many organizations still struggle with fragmented, legacy systems. He outlines the business and technical forces driving modernization: cost efficiency, unified user experience, and the strategic goal of making every decision data-driven.

How AI is reshaping data governance
AI doesn’t forgive poor data. When the human factor is gone, data has to be perfect. The conversation dives into the interplay between AI and data quality, the growing importance of metadata and observability, and why governance is now a business responsibility, not just an IT task. A highlight of the episode is a case study on HR data hub modernization, which showcases how defining data products, cataloging them in tools like Collibra, and implementing role-based access controls enable scalability, transparency, and readiness for AI.
Explore how modern data platforms enable enterprise AI implementations
Meet the expert

Tobiasz Kroll
Senior Director, Head of BU, Pharma Supply Chain & Corporate Functions, C&F
Tobiasz leads C&F’s data platform delivery team for human pharma clients, overseeing large-scale modernization programs that transform how enterprises manage, govern, and extract value from their data. With years of experience bridging business and technology, he advocates for pragmatic, business-driven data strategies that make AI initiatives sustainable.
Let’s connect
Our engineers, consultants, and experts are here to help you uncover solutions tailored to your business needs. Whether you’re looking for targeted support or planning a complex digital transformation, we’re ready to help you achieve more.