Science and Technology

AI Trained by AI Produces Nonsensical Output and Gibberish


Large language models, such as those offered by Google and OpenAI, require enormous amounts of training data to function. Some worry that there might not be enough fresh data left to train future generations of these algorithms, because the most recent versions have already combed through a large portion of the internet. One way to solve the data problem, according to some well-known industry leaders like Meta CEO Mark Zuckerberg, is simply to train new AI systems on the outputs of older ones.


However, recent research indicates that feeding models on previous models' outputs quickly degenerates into long stretches of garbled AI nonsense and ultimately results in "model collapse." In one instance, researchers fed an AI an innocuous passage about church architecture, and over successive generations the output rapidly deteriorated. The final, most "advanced" model simply repeated the phrase "black-tailed jackrabbits" over and over.

This week, a paper published in Nature put the AI-trained-on-AI scenario to the test. The scientists built their own language model and first fed it text originally created by humans. They then created nine further generations of models, each trained on the text output of the model that came before it. By the end, the final generation was producing surrealist-sounding gibberish that had essentially nothing to do with the original material. Over successive generations, the researchers write, the model "becomes poisoned with its own projection of reality."
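To make the setup concrete, here is a minimal, purely illustrative sketch of that recursive training loop, with a toy bigram model standing in for a real LLM. The corpus, the train_bigram and sample helpers, and the loop are assumptions for illustration, not the authors' code.

```python
# Toy illustration (not the study's code): a bigram "language model" is first
# fit on human-written text, then each new generation is trained only on text
# sampled from the generation before it.
import random
from collections import defaultdict

def train_bigram(tokens):
    """Count token-to-next-token transitions."""
    table = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def sample(table, length):
    """Generate text by randomly walking the bigram table."""
    token = random.choice(list(table))
    out = [token]
    for _ in range(length - 1):
        nxt = table.get(token)
        if nxt:
            words, weights = zip(*nxt.items())
            token = random.choices(words, weights=weights)[0]
        else:
            token = random.choice(list(table))  # dead end: restart anywhere
        out.append(token)
    return out

# Stand-in for the original human-written data (a few sentences here).
human_text = (
    "the old church has a tall tower and a narrow nave "
    "the tower of the church was built from local stone "
    "a narrow road leads past the church to the village green"
).split()

model = train_bigram(human_text)
for generation in range(1, 10):              # nine further generations, as in the study
    synthetic = sample(model, len(human_text))
    model = train_bigram(synthetic)          # each model sees only its predecessor's output
    print(f"generation {generation}: distinct context words = {len(model)}")
```

Run on a tiny corpus like this, the word count printed each generation tends to shrink as rarer words stop being sampled and drop out of later models entirely.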

The more AI models train on themselves, the more meaning they lose.

The researchers call this peculiar case of AI seemingly collapsing in on itself "model collapse," a degenerative process that unfolds in early and late stages. In the early stage, models separated from the original training data by multiple generations begin to forget outliers, or rarities, in the original text. As a result, the most likely outputs become more and more frequent. In the real world, that would be a problem, since it could stifle minority opinions and forms of expression. An LLM showing early signs of collapse offers an overly homogeneous, undifferentiated picture of reality.
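A toy statistical example, offered here as an illustration in the spirit of that argument rather than anything taken from the paper, shows why the tails go first: when each generation re-estimates a distribution from a finite sample of the previous generation's output, the estimated spread tends to drift downward, so rare values become less and less likely to ever be produced.

```python
# Illustration of early-stage collapse on a single Gaussian: every "generation"
# fits a normal distribution to a finite sample drawn from the previous fit.
# The estimated spread tends to drift downward, so outliers vanish first.
import numpy as np

rng = np.random.default_rng(seed=0)
mu, sigma = 0.0, 1.0      # the "human" data distribution
n = 20                    # finite data available at each generation

for generation in range(1, 61):
    data = rng.normal(mu, sigma, size=n)     # sample from the current model
    mu, sigma = data.mean(), data.std()      # refit the next model on that sample
    if generation % 10 == 0:
        print(f"generation {generation:2d}: estimated sigma = {sigma:.3f}")
```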

Later phases of collapse take an even stranger turn. In those final generations, models trained on models are so far removed from the original training data that they forget its key details and lose the plot entirely. At that point, they produce completely meaningless nonsense. When this happens, the researchers write, the model's "indiscriminate" self-cannibalization of its own prior outputs "causes irreversible defects in the resulting model."


According to the researchers, this cascading effect, and the eventual model collapse, is unavoidable for large models trained on their own data. It's important to note that the research looked only at language models; it makes no claims about what might happen if multimodal models, such as image and video generators, were trained on themselves. It also considers only a model trained on its own outputs; it's unclear exactly what would happen if one model, say Meta's, were trained on output from OpenAI.

Maintaining the original human text might prevent collapse.

Model collapse in the real world is not a far-fetched scenario. Plenty of websites already host blog posts and articles that were written entirely by LLMs. In the rush to build new models as quickly as possible, it's not out of the question that a large portion of that AI-generated junk could end up in training sets.


One way to keep AI-generated content from being unintentionally included in training sets would be to promote a watermarking standard, consistent across platforms, that flags whether content is authentic or machine-generated. Google, Adobe, and other big tech companies are attempting exactly that with a unique "content credential" badge they are working to standardize through the Coalition for Content Provenance and Authenticity (C2PA).
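As a rough sketch of what a provenance-aware data pipeline could look like, the snippet below filters a training corpus on an ai_generated flag. The Document class and that flag are hypothetical stand-ins for whatever a C2PA-style credential or watermark check would actually return; this is not a real C2PA API.

```python
# Hypothetical sketch: keep only human-attributed documents for training.
# The ai_generated flag stands in for the result of a provenance/watermark
# check; it is not a real C2PA API.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    ai_generated: bool   # assumed to come from a credential or watermark check

def filter_training_corpus(docs):
    """Drop anything whose provenance marks it as machine-generated."""
    return [d.text for d in docs if not d.ai_generated]

corpus = [
    Document("A hand-written essay on church architecture.", ai_generated=False),
    Document("black-tailed jackrabbits black-tailed jackrabbits ...", ai_generated=True),
]
print(filter_training_corpus(corpus))   # only the human-written text survives
```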

That, however, would only cover visuals. AI-generated text is far more difficult to reliably watermark, or even to recognize with current detection software. A more practical strategy would require AI developers to carefully vet content for signs of AI generation, and perhaps even to pay reputable human sources for access to their high-quality data for training. Without human training data to protect it, the internet might fold under a wave of AI puke. Nobody wants that.

Also see: China's Kling Stuns as OpenAI's Sora Faces Fierce Competition


