Data Readiness in the Era of DIY AI
By now, we’re all aware that training and using AI requires high-quality data. Most of us have read the phrase “data readiness for AI” a hundred times.
High-quality data is vital for training any machine learning system. Machine learning (ML) systems use data to which they’ve previously been exposed to recognise patterns and generalise their responses to data they haven’t yet seen. ML underpins a whole family of technologies: generative AI systems, which combine ML with natural language processing; the algorithms behind social media feeds and recommendations; automated trend analysis and prediction; and the classification algorithms that support cybersecurity.
But the term “readiness” can itself be misleading. It implies a static state that can be achieved once and set aside. Unfortunately, businesses cannot just “get data ready” for AI and then wash their hands of the matter.
Data readiness is an ongoing process.
Let us explain.
Types of deployment
Data quality matters in two ways: first, the data on which an ML model is trained must be well-curated, accurate and fit for purpose; second, the data the model refers to once deployed must be just as high quality.
If you’re using an out-of-the-box deployment like Salesforce’s Agentforce AI, the training process is not your responsibility. But data readiness is still an ongoing matter for you, because the data your AI refers to has to be accurate for it to perform its task. In short, these AI applications still follow the old “garbage in, garbage out” adage: feed the AI incorrect information and it will produce incorrect outputs. (We’ve discussed data readiness for this kind of deployment previously, and you can check out our AI readiness checklist here.)
However, there’s a second type of AI deployment to address, and it’s one for which an organisation carries greater responsibility. We have entered an era of DIY model training and deployment: numerous tools (such as Replicate) are now available to help businesses select the right model, curate their own training data, and train their own models.
This presents a slightly different challenge for the business. A DIY deployment requires you to “teach” the model how to respond to novel information, and you need well-curated data for that too, as sketched below.
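To make that concrete, here’s a minimal sketch of curating a prompt/completion dataset ahead of DIY fine-tuning. It assumes a simple JSONL format; the field names, file name and validation rules are illustrative assumptions, not requirements of any particular tool.

```python
# A minimal sketch of curating training data before DIY fine-tuning.
# The record format and file name are illustrative assumptions.
import json

raw_examples = [
    {"prompt": "What is our refund window?", "completion": "30 days from delivery."},
    {"prompt": "What is our refund window?", "completion": "30 days from delivery."},  # duplicate
    {"prompt": "", "completion": "Unanswerable."},  # missing prompt
]

def is_valid(example: dict) -> bool:
    """Keep only examples with a non-empty prompt and completion."""
    return bool(example.get("prompt", "").strip()) and bool(example.get("completion", "").strip())

seen = set()
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for ex in raw_examples:
        key = (ex.get("prompt"), ex.get("completion"))
        if is_valid(ex) and key not in seen:  # drop invalid rows and duplicates
            seen.add(key)
            f.write(json.dumps(ex) + "\n")
```

Even a small script like this catches the duplicate and empty records that quietly degrade a fine-tune.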
The problem of drift
When using models you’ve trained yourself, you’re going to encounter the problem of drift.
The data you give your AI during training is static and necessarily limited. The data it works on in a production environment will be more varied and, over time, increasingly different from the training data.
In ML, concept drift and covariate shift are the most commonly cited types of model drift. Concept drift occurs when the relationship between inputs and the outcome you’re predicting changes: the task itself has shifted under you. Covariate shift occurs when the distribution of inputs changes, even though the task stays the same.
But both have one root cause: training data is frozen at a point in time while the real world keeps moving, and that widening gap gives models ample opportunity to produce less useful, less accurate responses over time.
Model drift can occur slowly or all at once. World events can create obvious reasons for very sudden drift, such as changes to industry regulation or the introduction of a new currency that make model outputs unreliable. Equally, gradual changes in vernacular language can cause slow, unforeseen drift.
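To illustrate the covariate shift described above, here’s a minimal sketch that compares the distribution of a single input feature at training time with the same feature in production, using a two-sample Kolmogorov-Smirnov test from SciPy. The synthetic data and the 0.05 significance threshold are assumptions for demonstration, not rules.

```python
# A minimal sketch of spotting covariate shift: compare the distribution of a
# model input feature at training time with the same feature in production.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # what the model saw in training
production_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)  # what it sees now

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.05:  # 0.05 is a common convention, not a rule
    print(f"Possible covariate shift detected (KS statistic={statistic:.3f}).")
else:
    print("No significant shift in this feature.")
```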
What can you do to detect model drift in ML systems?
The good news is that you can still undertake precautionary and remedial measures to address model drift.
- Test your model
Firstly, it’s vital to continually monitor performance so that it’s clear when your model may be drifting. Test your ML systems, especially generative AI models, to ensure they’re still fit for purpose. This testing can take the form of statistical tests, error analysis or, perhaps easiest for a smaller organisation, performance metrics (see the monitoring sketch after this list).
- Pay attention to end user feedback
This won’t replace testing your model, but whether your end users are customers, dedicated reviewers or internal staff, watching for trends in the kinds of problems they report will keep your finger firmly on the pulse of your model.
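As a concrete example of metric-based monitoring, here’s a minimal sketch that tracks accuracy over a rolling window of recent labelled predictions and flags when it slips well below the accuracy you measured at deployment. The window size and tolerance are assumptions to tune for your own workload.

```python
# A minimal sketch of metric-based drift monitoring: track accuracy over a
# rolling window and alert when it falls well below the deployment baseline.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_accuracy: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline_accuracy
        self.results = deque(maxlen=window)  # rolling window of correct/incorrect flags
        self.tolerance = tolerance

    def record(self, prediction, actual) -> None:
        self.results.append(prediction == actual)

    def is_drifting(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough evidence yet
        current = sum(self.results) / len(self.results)
        return current < self.baseline - self.tolerance

# Usage: feed in predictions as ground truth arrives.
monitor = DriftMonitor(baseline_accuracy=0.92)
monitor.record(prediction="spam", actual="spam")
if monitor.is_drifting():
    print("Accuracy has slipped below baseline: investigate for drift.")
```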
How do you address drift?
As long as models are deployed against data from the real world, model drift can never be fully prevented.
You can help preserve the functioning of your model by vetting all the data it ingests, via data quality activities such as validation, verification, deduplication and automated cleansing. A minimal vetting sketch follows.
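Here’s one way such a vetting step might look using pandas: schema validation, deduplication and simple automated cleansing in sequence. The column names and rules are illustrative assumptions; substitute your own schema.

```python
# A minimal sketch of vetting ingested data before it reaches the model.
# Column names and rules are illustrative assumptions.
import pandas as pd

REQUIRED_COLUMNS = {"customer_id", "country", "order_value"}

def vet(df: pd.DataFrame) -> pd.DataFrame:
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema validation failed, missing columns: {missing}")

    df = df.drop_duplicates(subset=["customer_id", "order_value"])  # deduplication
    df = df.dropna(subset=["customer_id"])                          # drop unusable rows
    df["country"] = df["country"].str.strip().str.upper()           # normalise values
    df = df[df["order_value"] >= 0]                                 # reject impossible values
    return df

clean = vet(pd.DataFrame({
    "customer_id": [1, 1, 2, None],
    "country": [" gb", "gb", "AU ", "nz"],
    "order_value": [10.0, 10.0, -5.0, 3.0],
}))
print(clean)
```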
Periodically, you can also retrain your model, injecting new, high-quality data to help keep it on track. Retraining is a standard part of the ML lifecycle, and regular retraining is a necessary strategy to counter the inevitability of model drift over time.
How often you need to retrain depends on a number of factors: what kind of data the model uses, what tasks it needs to execute, how fast the data it relies on changes, and many other elements besides.
Key to retraining, however, remains data quality: it’s a process that demands a foundation of carefully curated, labelled, accurate data. A simple retraining trigger is sketched below.
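To tie it together, here’s a minimal sketch of a retraining trigger that retrains on a fixed schedule, or sooner if monitoring reports drift. The retrain() function is a placeholder for your own pipeline (curate, label, train, evaluate), not a real API, and the 90-day interval is an assumption to tune to how fast your data changes.

```python
# A minimal sketch of a retraining trigger: retrain on a fixed schedule, or
# sooner if monitoring reports drift. retrain() is a placeholder, not a real API.
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=90)  # an assumption; tune to how fast your data changes

def retrain() -> datetime:
    print("Kick off retraining on freshly curated, labelled, validated data.")
    return datetime.now()

def maybe_retrain(last_trained: datetime, drift_detected: bool) -> datetime:
    if drift_detected or datetime.now() - last_trained > RETRAIN_INTERVAL:
        return retrain()
    return last_trained

last_trained = maybe_retrain(last_trained=datetime.now() - timedelta(days=120),
                             drift_detected=False)
```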
Have questions about data for AI? DCA’s data experts have the answers.