The Unsexy yet Fundamental Part of AI Projects: Data

In my years leading data and product initiatives, I’ve seen firsthand what really drives successful AI projects. While the focus is often on sophisticated algorithms, cutting-edge models, fancy use cases, and cool demos or prototypes, the reality is far less exciting but far more fundamental: data acquisition, preparation, and management typically account for about 80% of the effort in an AI initiative.

The 80/20 Rule of AI Projects

You can discover the truth for yourself (at significant cost), or just trust me and read on. Most of the time, resources, and frustration aren’t spent on developing advanced algorithms or fine-tuning neural networks. Instead, they’re spent on:

Finding and acquiring relevant data – Often scattered and lost across departmental silos, legacy systems, and various formats
Cleaning and standardizing data – Removing duplicates, handling missing values, correcting errors
Transforming data – Converting formats, normalizing values, and creating consistent taxonomies, identities, hierarchies, features
Enriching data – Augmenting internal data with external sources to create more comprehensive datasets
Building data pipelines – Creating sustainable processes for ongoing data collection and management
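
To make the cleaning and standardizing steps above a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the record fields, the COUNTRY_ALIASES mapping, and the rule of keeping the first record per customer_id are hypothetical, and real projects usually need far more nuanced matching logic.

```python
from collections import Counter

# Hypothetical raw records, as they might arrive from two source systems.
raw_records = [
    {"customer_id": "C001", "email": " Anna@Example.com ", "country": "uk"},
    {"customer_id": "C001", "email": "anna@example.com", "country": "UK"},
    {"customer_id": "C002", "email": None, "country": "United Kingdom"},
    {"customer_id": "C003", "email": "bo@example.com", "country": ""},
]

# Assumed mapping onto one consistent taxonomy; real mappings are much larger.
COUNTRY_ALIASES = {"uk": "GB", "united kingdom": "GB", "gb": "GB"}


def standardize(record):
    """Return a copy with a trimmed, lower-cased email and a normalized country code."""
    email = record.get("email")
    country = (record.get("country") or "").strip().lower()
    return {
        "customer_id": record["customer_id"].strip(),
        "email": email.strip().lower() if email else None,
        "country": COUNTRY_ALIASES.get(country, "UNKNOWN"),
    }


def clean(records):
    """Standardize every record, then keep the first record per customer_id."""
    seen = set()
    cleaned = []
    for record in map(standardize, records):
        if record["customer_id"] in seen:
            continue  # duplicate customer: skip it
        seen.add(record["customer_id"])
        cleaned.append(record)
    return cleaned


cleaned_records = clean(raw_records)
# Missing values are counted rather than silently filled in.
missing_email = sum(1 for r in cleaned_records if r["email"] is None)
countries = Counter(r["country"] for r in cleaned_records)
print(len(cleaned_records), "records,", missing_email, "missing email,", dict(countries))
```

Even in this toy version, most of the code is about deciding what counts as "the same" value or "the same" customer, which is where the real effort tends to go.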

Only after these foundational elements are in place can the actual work on AI models begin. And even then, the data challenges continue, as models need to be retrained, monitored, and maintained with fresh, high-quality data. And you may well need to rework your source data.

Why Data Preparation Dominates AI Projects

There are several reasons why data work is so prominent in AI projects:

1. The Reality of Enterprise Data

Even after decades of investment in data warehouses and data lakes, a big chunk of enterprise data remains fragmented, inconsistent, and poorly documented. In many organizations, even basic questions like “how many customers do we have?” can yield different answers depending on which system you query. It has happened to me personally on more than one occasion: I have spent weeks working out how many customers we had, starting from the most basic question of all, the definition of ‘customer’. You’d be surprised how many definitions you can come up with.
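
Here is a small, purely illustrative sketch of how the same question can produce three different answers. The three "systems", the customer IDs, and the 365-day activity window are all made up for the example; they are not taken from any real project.

```python
from datetime import date

# Hypothetical extracts from three systems (all names and values are made up).
crm_accounts = {"C001", "C002", "C003", "C004"}
loyalty_members = {"C002", "C003", "C005"}
orders = [
    ("C001", date(2024, 11, 3)),
    ("C003", date(2023, 1, 15)),
    ("C005", date(2024, 12, 20)),
]

as_of = date(2025, 1, 1)

# Definition A: anyone with a CRM account.
registered = crm_accounts
# Definition B: anyone known to any of the three systems.
known = crm_accounts | loyalty_members | {cid for cid, _ in orders}
# Definition C: anyone who placed an order in the 365 days before as_of.
active = {cid for cid, day in orders if (as_of - day).days <= 365}

customer_counts = {
    "registered": len(registered),
    "known": len(known),
    "active": len(active),
}
print(customer_counts)
```

None of these numbers is wrong. They answer different questions, which is why the definition has to be agreed before anyone starts counting.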

2. Quality, Quality, Quality

Machine learning models amplify the problems in your data. Poor data quality means poor model performance – it’s that simple. As the saying goes: garbage in, garbage out. This reality forces AI teams to spend significant time ensuring data quality before any modeling can begin. You just have to do it if you care about good outcomes. On one occasion, a customer assured us that their data set was as good as gold. After we cleaned and structured their transactional data, it turned out to contain significantly more customers than there were people living in the country where they operated! How good was the model’s prediction going to be?
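
A sanity check like the one that exposed that problem can be very simple. The sketch below is an assumption about how such a check might look, not the check we actually ran: the function name, the inputs, and the toy numbers are hypothetical.

```python
def plausibility_issues(customer_ids, reference_population):
    """Return human-readable warnings; an empty list means no red flag was found."""
    issues = []
    distinct_customers = len(set(customer_ids))
    # More distinct customers than people who could possibly be customers
    # points to duplicated or fragmented identities in the data.
    if distinct_customers > reference_population:
        issues.append(
            f"{distinct_customers} distinct customers exceeds the "
            f"reference population of {reference_population}"
        )
    return issues


# Toy numbers for illustration only: 6 distinct IDs against a "population" of 5.
toy_ids = ["A1", "A2", "A3", "A3", "A4", "A5", "A6"]
toy_issues = plausibility_issues(toy_ids, reference_population=5)
print(toy_issues)
```

Checks like this take minutes to write and are worth running before any modeling starts, precisely because they test the data against the outside world rather than against itself.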

3. Integration Challenges

AI systems require integrating data from multiple sources, often combining structured data (from databases) with unstructured content (like images, text, or voice recordings). Creating cohesive datasets from these diverse sources is complex and time-consuming. All these integrations also need to be maintained: pipelines break and need to be fixed.

Real-World Impact

During my time at dunnhumby, our retail analytics succeeded because of meticulous, almost maniacal attention to data preparation. All teams invested heavily in creating clean, well-structured data assets that AI solutions could use effectively to deliver measurable ROI. This was the bedrock of our continued success.

How Organizations Can Respond

For executives sponsoring AI initiatives, understanding this reality leads to several strategic imperatives:

Invest in data infrastructure before AI – Build robust data pipelines, governance processes, and quality control mechanisms
Budget realistically – Allocate resources with the understanding that data work, not model tuning, will consume most of the project timeline
Build data expertise – Ensure your teams have strong data engineering capabilities, not just data science skills. You’ll need both
Create reusable data assets – Focus on building data platforms that can support multiple AI use cases, not just one-off projects
Consider data strategy as business strategy – Recognize that data capabilities increasingly determine competitive advantage. And put your money where your mouth is

Looking Forward

As AI becomes more central to business operations, the organizations that succeed won’t necessarily be those with the most advanced algorithms. Instead, the winners will be those that have built robust data foundations – what I call “data monetization capabilities” – that enable rapid and reliable deployment of AI.

The breakthroughs in AI research make headlines, but the quiet, persistent work of building data infrastructure is what truly enables AI success. For executives embarking on AI transformations, embracing this reality early can be the difference between success and failure.