
Stop Blaming the Model: How Better Data Foundations Make AI Analytics Actually Work


AI Data Foundations: Why Models Fail Without the Right Inputs

When AI projects stall, the instinct is often to blame the model or buy a bigger one. In reality, most failures in enterprise data analytics trace back to weak AI data foundations: incomplete, inconsistent or poorly governed data. Large generative models are trained on massive public datasets, but manufacturers, banks and HR teams work with smaller, domain-specific data that internet-scale training cannot stand in for. That makes data quality, structure and context far more important than sheer volume. Analysts note that many enterprises sit on messy “data swamps”: sprawls of data lakes, warehouses, SaaS apps and spreadsheets that demand constant plumbing yet still lack business meaning. Agentic AI systems, tools that not only analyze but also act, expose these weaknesses as they emerge. Without a consistent semantic layer, even the most advanced models struggle to navigate siloed ERP, engineering, HR and customer data reliably.

Industrial AI Starts with ERP and Engineering Data, Not Just LLMs

In manufacturing, the biggest productivity gains from AI come from company-specific projects in automation, process control and quality monitoring, not generic chatbots. These use cases depend on high-quality ERP data for AI and detailed engineering information, such as product structures, maintenance histories and simulation outputs. Because firms typically have limited domain data and must keep training costs manageable, they cannot rely on massive pre-training. Instead, they must identify the right operational data sources and prepare them carefully. Tools like simus classmate, for example, extract and unify data from multiple systems according to configurable rules, turning scattered records into consistent training and analytics datasets. When product, maintenance and quality data is standardized, AI can detect failure patterns, optimize spare-parts stocking or recommend process changes. Without this foundational work, models are fed noisy or incomplete signals, leading to inaccurate predictions, unreliable automation and frustrated engineers who stop trusting AI recommendations.
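
To picture the rule-driven unification step, here is a minimal sketch. It is illustrative only: the ERP and maintenance exports and their column names are invented, and pandas stands in for the idea; it is not simus classmate's actual rule engine.

```python
import pandas as pd

# Hypothetical exports from an ERP system and a maintenance log;
# the column names are illustrative, not a real product schema.
erp = pd.DataFrame({
    "MaterialNo": ["A-100", "a100", "B-200"],
    "Description": ["Gear shaft", "GEAR SHAFT", "Bearing"],
})
maintenance = pd.DataFrame({
    "part_id": ["A100", "B200"],
    "failures_last_year": [4, 1],
})

# Configurable normalization rule: strip separators and uppercase IDs
# so records from both systems can be joined on a common key.
def normalize_id(raw: str) -> str:
    return raw.replace("-", "").replace(" ", "").upper()

erp["part_id"] = erp["MaterialNo"].map(normalize_id)
erp["Description"] = erp["Description"].str.capitalize()
erp = erp.drop_duplicates(subset="part_id")

# Unified analytics dataset: product master joined to failure history.
unified = erp.merge(maintenance, on="part_id", how="left")
print(unified)
```

The same pattern scales up: each normalization rule is configuration, not one-off code, so new source systems can be folded into the unified dataset without rewriting the pipeline.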

From Data Swamps to an Agentic Data Cloud as Semantic Core

Enterprises are shifting away from static, passive data estates toward architectures that are designed for continuous intelligence and action. An emerging pattern is the “agentic data cloud” – a semantic core that unifies structured, semi‑structured and unstructured data while giving AI agents the context they need to navigate it. Google Cloud’s evolution illustrates this shift. Its Knowledge Catalog, building on Dataplex, acts as a dynamic semantic engine that provides business meaning to large language models, while a Cross‑Cloud Lakehouse based on Apache Iceberg lets teams query data across platforms without copying it. Smart Storage in object stores can automatically tag and embed unstructured files as they land, turning once opaque content into searchable assets. On top of this, agentic engineering tools move from manual ETL coding to intent‑driven orchestration. The result is an analytics‑ready environment where AI agents can trigger workflows, support predictive maintenance or surface customer insights directly from a consistent, well-governed data layer.
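
To make the query-without-copying idea concrete, here is a minimal sketch using the open-source PyIceberg client. The catalog name, table name, columns and filter are all assumptions for illustration; a real deployment would point the catalog at its own lakehouse configuration.

```python
# A sketch of querying an Apache Iceberg table in place, rather than
# copying data into a separate warehouse first.
from pyiceberg.catalog import load_catalog

# Catalog connection details would normally come from
# ~/.pyiceberg.yaml or environment variables; "lakehouse" is a
# hypothetical catalog name.
catalog = load_catalog("lakehouse")

table = catalog.load_table("sales.orders")  # hypothetical namespace.table

# Predicate and column pruning happen inside the scan, so only the
# needed Parquet files and columns are actually read.
df = (
    table.scan(
        row_filter="order_date >= '2024-01-01'",
        selected_fields=("order_id", "customer_id", "amount"),
    )
    .to_pandas()
)
print(df.head())
```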

People Analytics: Strategy and Process Before HR AI

HR is another area where organizations rush to deploy AI for hiring, retention or performance management without first preparing their data foundations. Experts in people analytics stress that analytical intelligence in talent management does not start with dashboards or algorithms. It begins with a clear HR strategy: defining whether the priority is reducing turnover, filling critical vacancies faster, strengthening succession pipelines or spotting burnout risk. Only then should processes be redesigned to consistently capture the required information—job histories, performance reviews, engagement metrics and more—in a structured, comparable way. Data quality and governance come next, ensuring common definitions for roles, skills and metrics. When this sequence—strategy, processes, data, analytics—is respected, AI models can be safely applied to predict attrition, recommend learning paths or optimize workforce planning. When it is ignored, organizations end up with flashy HR dashboards built on inconsistent indicators and models that produce untrustworthy recommendations.
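
Once that sequence has produced consistent, comparable records, even a simple model becomes viable. The sketch below is a toy scikit-learn example; the schema and the tiny inline dataset are invented for illustration, not a real HR system.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented records; the payoff of the strategy-process-data sequence is
# that fields like job_family and engagement mean the same thing everywhere.
hr = pd.DataFrame({
    "job_family":   ["eng", "sales", "eng", "hr", "sales", "eng"],
    "tenure_years": [1.0, 4.5, 0.5, 7.0, 2.0, 3.0],
    "engagement":   [3.2, 4.1, 2.5, 4.6, 3.0, 3.8],
    "left_company": [1, 0, 1, 0, 1, 0],
})

X, y = hr.drop(columns="left_company"), hr["left_company"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Encode the categorical role, scale the numeric features, then fit a
# simple attrition classifier.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job_family"]),
        ("num", StandardScaler(), ["tenure_years", "engagement"]),
    ])),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
print("attrition risk:", model.predict_proba(X_test)[:, 1])
```

The model itself is deliberately unremarkable; the point is that it only produces trustworthy risk scores when the upstream definitions are consistent.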

Modern File Formats Like Delta Parquet Supercharge AI Workflows

Even with strong governance, enterprise data analytics will falter if storage formats are inefficient. Modern column-oriented formats such as Delta Parquet (Parquet data files managed by a Delta Lake-style transaction log) are becoming a cornerstone of AI data foundations because they optimize large-scale datasets for analytical and AI workloads. Built on the widely used Parquet format, Delta Parquet adds transaction logs, schema enforcement and performance optimizations that bring database-like intelligence directly to the data layer. A 1TB CSV file can be compressed to roughly 130GB in Delta Parquet, an 87% reduction in size. In one benchmark, query times fell from 236 seconds to 6.78 seconds, a roughly 35x speedup, while compute costs dropped from USD 5.75 (approx. RM26.45) per query to USD 0.01 (approx. RM0.05), a 99.8% reduction. Features like columnar storage, advanced compression, row-group statistics and parallel processing make it far cheaper and faster to run AI training, feature engineering and analytics pipelines at scale.
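
A rough sketch of the conversion, using the open-source delta-rs Python bindings (the deltalake package); the file path and column names are assumptions:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Load a CSV export (row-oriented, uncompressed); the file name is
# hypothetical.
df = pd.read_csv("events.csv")

# Rewrite it as a Delta table: Parquet data files plus a _delta_log/
# transaction log that enables schema enforcement and ACID writes.
write_deltalake("./events_delta", df, mode="overwrite")

# Analytical reads now benefit from columnar storage and compression:
# only the requested columns are scanned.
dt = DeltaTable("./events_delta")
subset = dt.to_pandas(columns=["event_type", "ts"])
print(subset.head())
```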

Practical Steps: Build the Foundation Before Buying More AI

For business leaders, the message is clear: stop blaming the model and start fixing the foundations. Begin with a data quality audit across ERP, engineering, HR and customer systems to identify duplicates, gaps and conflicting definitions. Standardize schemas and master data—such as product hierarchies, asset IDs, job families and customer segments—so they mean the same thing everywhere. Invest in an enterprise semantic layer or agentic data cloud that can catalog, classify and describe data in business terms, then expose it to AI agents through governed interfaces. Modernize your storage layer with columnar formats like Delta Parquet or open table formats such as Apache Iceberg to cut costs and accelerate queries. Finally, align each AI initiative with a clear business problem and ensure supporting processes actually capture the necessary data. When these foundations are in place, AI tools cease to be experiments and become reliable engines of automation and insight.
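
As a starting point, the audit step can be as simple as a few dataframe checks. The sketch below assumes invented ERP and CRM extracts; what matters is the three checks (duplicates, gaps, conflicting definitions), not the specific schema.

```python
import pandas as pd

# Hypothetical customer extracts from two systems.
erp = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3"],
    "segment":     ["SMB", "SMB", "Enterprise", None],
})
crm = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "segment":     ["SMB", "Mid-Market"],  # disagrees with ERP on C2
})

# 1. Duplicates within a system.
dupes = erp[erp.duplicated(subset="customer_id", keep=False)]

# 2. Gaps: required fields that are missing.
gaps = erp[erp["segment"].isna()]

# 3. Conflicting definitions across systems for the same key.
merged = erp.drop_duplicates("customer_id").merge(
    crm, on="customer_id", suffixes=("_erp", "_crm")
)
conflicts = merged[merged["segment_erp"] != merged["segment_crm"]]

print(f"{len(dupes)} duplicate rows, {len(gaps)} gaps, {len(conflicts)} conflicts")
```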
