The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology.
Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems that is then used to train large language models (LLMs) — as they reach the limits of the human-made material that can improve the cutting-edge technology.
Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output.
The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions about what will happen once those finite sources are exhausted.
“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”
The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.
The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used.
The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish.
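To make the idea of variance loss concrete, here is a minimal toy sketch in Python, not the paper's experimental setup: a Gaussian is fitted to some data, the next "generation" is trained only on samples drawn from that fit, and the estimated spread tends to shrink as estimation errors compound.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: samples from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 301):
    # Fit a very simple "model" (just a Gaussian) to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    # The next generation is trained only on samples drawn from that model.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    if generation % 50 == 0:
        print(f"generation {generation:3d}: estimated std = {sigma:.3f}")

# The estimated standard deviation tends to drift downwards over generations:
# sampling error compounds, the tails (minority subpopulations) thin out first,
# and the distribution narrows -- a toy version of the "loss of variance" above.
```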
“Your models lose utility because they are overwhelmed with all of the errors and misconceptions that are introduced by previous generations — and the models themselves,” said Shumailov, who carried out the work at Oxford university with colleagues from Cambridge, Imperial College London, Edinburgh and Toronto.
The researchers found the problems were often exacerbated by the use of synthetic data produced by models that had themselves been trained on output from earlier generations. Almost all of the recursively trained language models they examined began to produce repeating phrases.
In the jackrabbit case, the first input text examined English church tower building during the 14th and 15th centuries. In generation one of training, the output offered information about basilicas in Rome and Buenos Aires. Generation five digressed into linguistic translation, while generation nine listed lagomorphs with varying tail colours.
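The recursive loop itself is simple to caricature. The sketch below is a crude stand-in for the paper's setup: it trains a tiny bigram model on a seed sentence, samples new text from it, retrains on that sample, and repeats. The seed sentence, starting word and sample length are illustrative choices, not details from the study.

```python
import random
from collections import defaultdict

random.seed(0)

def train_bigram(words):
    """Count which word follows which in the training text."""
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def sample(table, start, length):
    """Generate text by repeatedly picking a random observed successor."""
    out = [start]
    for _ in range(length - 1):
        successors = table.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return out

# Seed corpus: a stand-in for human-written text about church towers.
words = ("the tower of the church was built in the fourteenth century and "
         "the spire of the tower was rebuilt in the fifteenth century").split()

for generation in range(1, 6):
    model = train_bigram(words)
    # Each new generation is trained only on text sampled from the previous one.
    words = sample(model, start="the", length=60)
    print(f"gen {generation}: {' '.join(words[:15])} ...")

# After a few generations the samples tend to cycle through a shrinking set of
# high-frequency phrases -- a crude analogue of the repetition reported above.
```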
Another example is how an AI model trained on its own output mangles a data set of dog breed images, according to a companion piece in Nature by Emily Wenger of Duke University in the US.
Initially, common types such as golden retrievers would dominate, while less common breeds such as Dalmatians would disappear. Finally, the images of golden retrievers themselves would become an anatomical mess, with body parts in the wrong place.
Mitigating the problem has not so far proved straightforward, said Wenger. One technique already deployed by leading tech companies is to embed a “watermark” that flags AI-generated content for exclusion from training data sets. The difficulty is that this requires co-ordination between technology companies that may not be practical or commercially viable.
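As a sketch of how such filtering might look in practice, the snippet below simply drops flagged documents before training; the `looks_ai_generated` detector is a hypothetical placeholder, since real watermarking schemes and their detection APIs are not described in the article.

```python
from typing import Callable, Iterable, List

def filter_training_corpus(
    documents: Iterable[str],
    looks_ai_generated: Callable[[str], bool],
) -> List[str]:
    """Keep only documents the watermark detector does not flag as AI-generated."""
    return [doc for doc in documents if not looks_ai_generated(doc)]

# Example with a stand-in detector that looks for an embedded marker string.
corpus = ["a human-written essay", "synthetic text [ai-watermark]"]
kept = filter_training_corpus(corpus, lambda doc: "[ai-watermark]" in doc)
print(kept)  # ['a human-written essay']
```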
“One key implication of model collapse is that there is a first-mover advantage in building generative AI models,” said Wenger. “The companies that sourced training data from the pre-AI internet might have models that better represent the real world.”