Large language models require ever-larger training data-sets in order to keep improving. However, given the rate at which models are growing, we are soon going to run out of training data. And synthetic data is not the solution we thought it might be.
This is a link-enhanced version of an article that first appeared in the Mint. You can read the original here.
Unless you have been living under a rock for the past year or so, you have by now not only heard of the many wonders of generative AI but, more than likely, already experimented with its many manifestations. Few technologies in living memory have, over such a short span of time, demonstrated the potential to so radically transform the way we live and work. But, as rapid as this progress has been, it is fast becoming apparent that there is a limit to how long this exponential improvement can continue.
Large language models (and their image and video generation counterparts) need access to vast amounts of training data in order to improve. This is what gives successive generations of AI the ability to compose prose in an increasing variety of literary formats, and poetry and songs in the style of more and more artists, and why even a novice like me can generate increasingly complex code for a range of different use cases. The trouble is that the supply of high-quality content needed to train these models is fast dwindling.
Running Out of Words
According to a recent paper led by Pablo Villalobos, the size of training data-sets has been increasing exponentially, at a rate greater than 50% per year. The volume of available language data needed to satisfy this appetite is, however, growing at a rate of just 7% per year, and is expected to slow steadily to just 1% by 2100. As a result, even though the total stock of language data available today is somewhere between 70 trillion and 700 quadrillion words, given the rate at which it is being consumed, our supply of high-quality language data is likely to run out four years from now. Image data suffers from similar challenges and is estimated to run out somewhere between 2030 and 2070.
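To get a feel for the arithmetic, here is a minimal sketch of that squeeze. The two growth rates are the ones cited above; the starting ratio between the stock of high-quality data and the size of today’s largest training data-sets is an assumption I have made purely for illustration, not a figure from the paper.

```python
# Back-of-the-envelope projection of the data squeeze described above.
# The growth rates come from the article; the starting ratio (a stock of
# high-quality data 3.5x the size of today's largest training data-set)
# is an illustrative assumption, not a figure from the paper.

dataset = 1.0  # size of this year's largest training data-set (arbitrary units)
stock = 3.5    # total stock of high-quality language data, in the same units
year = 2023

while dataset < stock:
    dataset *= 1.50  # training data-sets grow at more than 50% per year
    stock *= 1.07    # available language data grows at roughly 7% per year
    year += 1

print(f"The largest training run outgrows the available stock around {year}")
# With these inputs, this prints 2027: roughly four years out.
```

The exact year is sensitive to the starting ratio, but the broader point is not: anything growing at 50% a year doubles roughly every twenty months, so it closes almost any plausible gap within a decade.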
All this has been further exacerbated by the new constraints under which AI companies are being forced to operate. Since almost all the data in these data-sets has been scraped off popular internet platforms, social media companies and the like have taken to implementing rate limits and other technical restrictions to curb the volume of scraping that takes place. At the same time, artists have begun to file copyright infringement lawsuits to prevent AI companies from incorporating their works in training data-sets, just as controllers of personal data are looking to clamp down on the use of that data without the consent of the persons to whom it pertains.
Synthetic Problem
An approach that is increasingly being suggested as a way to deal with this shortage is the use of ‘synthetic data’ for training. So what is synthetic data, and how will it help?
Generative adversarial networks can already create content that is virtually indistinguishable from what humans produce, whether it is text, images, video or music. Synthetic data-sets accumulate vast quantities of this AI-generated content, which is then used to train AI models in exactly the same way that human-generated content was used previously. If this works, not only will it give us a virtually infinite supply of training data, it will suffer from none of the intellectual property and data protection concerns that scraped content must contend with.
As promising as this sounds, it seems our enthusiasm was misplaced. According to a paper released earlier this month, when synthetic data is repeatedly used to train successive generations of AI models, both the quality and the diversity of those models degrade substantially over time, a phenomenon the authors call Model Autophagy Disorder (or MAD for short). This suggests that the more we rely on synthetic data to train our AI models, the greater the likelihood that our artificial intelligence systems will, in the fullness of time, quite literally go MAD.
The researchers identified a variety of autophagous loops, depending on how much synthetic data is included in the mix. In a fully synthetic loop, for example, the training data-set for each subsequent generation of the model consists solely of synthetic data sampled from previous generations. In this case, the quality and diversity of the generative models was found to degrade noticeably with every subsequent generation. Where synthetic data is used for augmentation, the data-sets are made up of a combination of synthetic data sampled from previous generations of the model as well as a fixed set of real data. In these cases, there is evidence that the use of real training data delays (but does not eliminate) degradation. It is only when training data-sets have a sufficient volume of fresh, real data mixed in with the synthetic data that both the quality and diversity of these generative models remain stable over generations.
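The mechanism behind this is easy to demonstrate even without a neural network. The sketch below is my own deliberately stripped-down caricature of an autophagous loop, not an experiment from the paper: the ‘model’ simply fits a Gaussian to its training data and then samples from that fit to produce the next generation’s data-set. With no fresh data, estimation errors compound and the distribution collapses; with enough fresh real data mixed in, it stays put.

```python
# A toy autophagous loop: the "model" is a Gaussian fitted to its
# training data, and each generation is trained on samples drawn from
# the previous generation's fit, optionally mixed with fresh real data.
# This is an illustrative caricature of the MAD mechanism, not the
# experimental setup used in the paper.
import numpy as np

rng = np.random.default_rng(0)
REAL_MEAN, REAL_STD = 0.0, 1.0  # the "true" data distribution
N = 100                         # training-set size per generation

def run_loop(fresh_fraction: float, generations: int = 500) -> float:
    """Return the model's fitted std after the given number of generations."""
    mean, std = REAL_MEAN, REAL_STD
    for _ in range(generations):
        n_fresh = int(N * fresh_fraction)
        fresh = rng.normal(REAL_MEAN, REAL_STD, n_fresh)  # fresh real data
        synthetic = rng.normal(mean, std, N - n_fresh)    # the model's own output
        data = np.concatenate([fresh, synthetic])
        mean, std = data.mean(), data.std()               # "retrain" the model
    return std

for frac in (0.0, 0.2, 0.5):
    print(f"fresh data {frac:>4.0%} -> fitted std after 500 generations: "
          f"{run_loop(frac):.3f}")
# The fully synthetic loop (0%) collapses towards zero, while mixing in
# fresh real data keeps the fitted distribution close to the true one.
```

The collapsing standard deviation is the toy analogue of the diversity loss the researchers report; in real generative models, the quality of individual outputs degrades alongside it.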
New Alternatives?
One way that has been suggested to mitigate these concerns is to have humans manually curate the synthetic data before it is used for training. The trouble is that the humans carrying out this curation inevitably introduce their own biases into the data-set through the selections they make. As a result, even though this approach would likely improve the quality of AI models, it does so at the cost of their diversity.
As remarkable as the progress of generative AI algorithms has been over the past year or so, unless we quickly do something to address these training-data concerns, our rate of progress is likely to disappoint us. And unless some brand-new technique comes along to change things, our AI advances may grind to an unseemly halt.