Why Bad Data Could Be Destroying Your AI—and How to Fix It
While 2025 may be celebrated as the Year of the Snake, we're declaring it the Year of AI at Santa Cruz Works! I won't wax poetic about the pros and cons of AI, as I am sure you have become intimately familiar with them while wading through countless AI-centric articles and newsletters. Instead, I will focus on a critical component of AI systems that often goes overlooked. Sure, the capability of an AI framework, its algorithms, and its computational power all play a role. Yet the most crucial aspect of what makes an AI truly intelligent is its data input.
Greg Dolder, the Head of Engineering at ProductOps, puts it best. In an article detailing the importance of data quality in informing AI systems, he explains simply that "if your data is garbage your outcomes are bound to be garbage." While this may sound harsh, the truth is that poor data can yield results riddled with inaccuracies, bias, and gaps. The result can be an AI system that perpetuates widely held stereotypes and prejudices with no grounding in fact. In an article released by the United Nations, aptly named "Bias from the past leads to bias in the future," the author describes how supposedly neutral systems have perpetuated racial, socioeconomic, and educational discrimination.
Bad data input can also result in hallucinations, where the AI generates inaccurate, misleading, or entirely fabricated outputs that undermine its reliability. For example, studies have shown that ChatGPT answers computer programming questions incorrectly an astounding 52% of the time and often fabricates references for essays. While OpenAI's system is not considered to be fueled by poor data, these inaccuracies highlight the critical role that high-quality input plays in ensuring accurate AI outputs. Likewise, when AI systems are used in financial and business forecasting, hallucinations can lead to flawed risk assessments that leave businesses unprepared for market volatility. Without a strong bedrock of reliable data, even the most advanced AI systems lose credibility.
Understanding "Poor" Data and How to Address It
You may be wondering what constitutes "poor" data. Dolder argues that it can include the following (several of which appear in the sketch after this list):

- Inconsistently formatted entries, such as variations in how states are listed ('New York,' 'N.Y.,' or 'NY')
- Incorrect labels, like miscategorized images or products that disrupt computer vision or recommendation systems
- Gaps or duplicates in datasets, which can skew metrics and create confusion
- Reliance on unverified or low-quality third-party sources, leading to flawed predictions
- Outdated or irrelevant data, which can result in misguided recommendations or ineffective marketing strategies
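To make these failure modes concrete, here is a minimal Python sketch of a cleansing pass over a hypothetical customer table. The column names, alias map, and category vocabulary are illustrative assumptions, not part of Dolder's article; the pass normalizes inconsistent state formats, nulls out bad labels, drops duplicates, and surfaces gaps for review.

```python
import pandas as pd

# Hypothetical customer table exhibiting the issues above (illustrative only).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "state": ["New York", "N.Y.", "N.Y.", None],
    "category": ["books", "books", "books", "bokos"],  # "bokos" is a bad label
})

# 1. Normalize inconsistently formatted entries against an alias map.
STATE_ALIASES = {"new york": "NY", "n.y.": "NY", "ny": "NY"}
df["state"] = df["state"].str.strip().str.lower().map(STATE_ALIASES)

# 2. Null out labels that fall outside the known vocabulary.
VALID_CATEGORIES = {"books", "toys"}
df.loc[~df["category"].isin(VALID_CATEGORIES), "category"] = None

# 3. Drop duplicate records that would skew downstream metrics.
df = df.drop_duplicates(subset="customer_id")

# 4. Surface remaining gaps for human review instead of guessing.
needs_review = df[df.isna().any(axis=1)]
print(f"{len(needs_review)} row(s) need review:\n{needs_review}")
```

Note that the sketch flags incomplete rows rather than silently filling them in: guessing at missing values is itself a source of bad data, so gaps are best routed to a review queue.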
Dolder also offers solutions to the pesky problem of poor data: a specialized ProductOps Approach and the OODA Loop. The ProductOps Approach synthesizes data audit and cleansing, a data governance framework, scalable data architecture, and continuous optimization to ensure that data remains accurate, reliable, and adaptable enough to support robust AI systems. The OODA Loop, an iterative cycle of observing, orienting, deciding, and acting, involves gathering insights, analyzing them to identify improvements, creating a plan, and implementing changes while closely monitoring the outcomes. While the ProductOps Approach is a service available to customers through the company's website, the OODA Loop is a process that anyone can implement independently. To learn more about these strategies and how they can transform your data processes, visit the ProductOps website or apply the OODA Loop principles to your own AI systems.
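Since the OODA Loop is something anyone can implement independently, here is a minimal sketch of what one automated observe-orient-decide-act cycle might look like for data quality monitoring. The quality signals, thresholds, and remediation playbook are hypothetical stand-ins, not ProductOps' actual process.

```python
# A minimal OODA loop skeleton for data quality (all signals hypothetical).

def observe(dataset):
    """Observe: collect raw quality signals from the data."""
    rows = max(len(dataset), 1)
    nulls = sum(1 for row in dataset if None in row.values())
    return {"null_rate": nulls / rows}

def orient(signals, thresholds):
    """Orient: compare signals against acceptable thresholds."""
    return [name for name, value in signals.items()
            if value > thresholds.get(name, 1.0)]

def decide(problems):
    """Decide: map each flagged problem to a remediation step."""
    playbook = {"null_rate": "backfill missing fields or quarantine the rows"}
    return [playbook.get(name, f"investigate {name}") for name in problems]

def act(plan):
    """Act: apply the plan (just logged here), then monitor the outcome."""
    for step in plan:
        print("Applying:", step)

def run_ooda_cycle(dataset, thresholds):
    signals = observe(dataset)
    act(decide(orient(signals, thresholds)))
    return signals  # measured outcomes feed the next cycle

# Example: one pass over a toy dataset with a 20% null-rate budget.
data = [{"state": "NY"}, {"state": None}, {"state": "CA"}]
print(run_ooda_cycle(data, {"null_rate": 0.20}))
```

Each pass returns the measured signals so the next cycle can verify whether the remediations actually moved the numbers; that feedback is what makes the loop iterative rather than a one-time cleanup.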
Amid the Year of AI, we must begin to refine the systems we have come to rely on so heavily. By prioritizing quality data and embracing the ProductOps Approach and the OODA Loop, we can ensure that AI meets and exceeds our future expectations!