Will ChatGPT ever stop learning?
Anyone who has ever turned to ChatGPT for answers may have wondered at some point where this popular artificial intelligence (AI) tool gets all the information it provides. The simple answer is that ChatGPT voraciously scrapes vast amounts of the English-language data to be found on the open web and synthesizes it for you. Nearly everything posted on the internet, from news articles to digital books, blogs, podcasts, documentaries, and computer programs, is raw material for this AI system.
The unauthorized use of such content has triggered concerns about intellectual-property violations. Indeed, multiple lawsuits have been filed against OpenAI (the developer of ChatGPT) by computer programmers, book authors, publishing companies, news organizations, and various other content producers who assert their legal rights over the copyrighted works on which these powerful AI systems are trained.
But there is an equally important issue that is troubling many observers of the AI world. Given the ever-growing scale of computing power that is now available to AI systems, it is just a matter of time before all the human-generated digital material that can possibly be mined on the internet runs out. When that point is reached, what will the likes of ChatGPT do? What happens when they have exhausted all the texts on which they could continue to train?
Cade Metz, a New York Times technology reporter and author of the book “Genius Makers: The Mavericks Who Brought AI to Google, Facebook, and the World,” says that, in fact, this is already happening. To make up for this growing scarcity of raw material, OpenAI has begun transcribing audio and video files from various platforms using a speech-recognition tool it has developed. The digital transcripts are then fed into its machine-learning systems to build newer and more powerful versions of ChatGPT.
But even this may only be a stopgap measure. In time, these AI tools may have to turn to non-English digital material on the internet and translate it into readable English text. Right now, according to Matteo Wong in The Atlantic (April 12, 2024), “Nearly 90 percent of websites are written in just 10 languages (English, Russian, Spanish, German, French, Japanese, Turkish, Portuguese, Italian, and Persian).” It is these languages, and the worldviews they represent, that will dominate the world of the chatbot. Wong believes that the rest of the world’s roughly 7,000 languages, particularly those not served by Google Translate, will be pushed further to the periphery.
But Metz thinks that the amount of digital content human beings can post on the web will never be enough to match the virtually inexhaustible computing capability and memory now at the disposal of AI systems. AI systems will then have no choice but to train themselves on machine-generated content, or “synthetic data.” Metz believes this may be a dangerous turn for AI chatbots, as it may only amplify the errors that result from their tendency to “hallucinate,” or fabricate false information, when they try to replicate patterns of human thinking.
All this may sound surreal, but perhaps we are getting a glimpse here of a fully digitized future, with its obsession with speed. Or are we humans just being alarmist? I asked ChatGPT itself for its own thoughts on these issues.

It opens with a denial: “It’s unlikely that large language models will run out of human-made digital data to learn from anytime soon. However, if that were to happen, the models would likely plateau in terms of their ability to learn new information. Researchers would then need to find alternative sources or develop new techniques to keep improving models.”

Me: “Is so-called synthetic data one such alternative?”

ChatGPT: “Synthetic data is artificially generated data that mimics real data but is created by algorithms. It can be used to augment existing datasets or to create entirely new data sets for training machine learning models.”

Me: “Machines learning from machine-generated data? Doesn’t that only amplify errors that may be committed in the process?”

ChatGPT: “It’s crucial for researchers to carefully design and validate the algorithms generating synthetic data to minimize these risks. Additionally, combining synthetic data with real data and employing rigorous validation techniques can help mitigate these issues. Validation techniques for synthetic data can include cross-validation, statistical analysis, domain expert review, adversarial testing, and fine-tuning.”
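Both points in this exchange, the statistical validation ChatGPT describes and the error amplification Metz worries about, can be illustrated with a toy sketch. The code below is a hypothetical illustration, not any lab's actual pipeline: a simple generator fitted only to the mean and spread of some "real" data produces a close statistical match after one generation, but when each new generation is trained only on the previous generation's synthetic output, small fitting errors can accumulate.

```python
# Toy illustration (hypothetical): statistical validation of synthetic data,
# and how retraining on machine-generated output can let errors compound.
import random
import statistics

random.seed(0)  # make the sketch reproducible

# "Real" data: samples from some underlying distribution.
real = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# A toy generator fits only the mean and spread of the real data,
# then emits synthetic samples from that fitted model.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

def drift(a, b):
    """Statistical check: how far apart are the means of two datasets?"""
    return abs(statistics.mean(a) - statistics.mean(b))

print(f"mean drift after 1 generation: {drift(real, synthetic):.3f}")

# Error amplification: each generation is fitted to the previous
# generation's synthetic output (with ever-smaller samples), so
# small estimation errors feed forward into the next round.
data = real
for generation in range(5):
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    data = [random.gauss(mu, sigma) for _ in range(500)]

print(f"mean drift after 5 generations: {drift(real, data):.3f}")
```

In this sketch the first generation stays close to the real data, which is what a statistical-analysis check would confirm; the later generations, trained only on earlier machine output, are free to wander away from it.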
The chatbot offered further elaboration of these validation measures, most of which take the form of mathematical operations performed by machines. I came away feeling that what AI systems are offering is not truth as we understand it, but a model of real-world phenomena.
ChatGPT is no doubt extremely capable of responding to almost any request for information, solution, or advice. One can easily get addicted to it, mistaking it for a real person who is always there. But beware. As Simone de Beauvoir warned through one of her characters in the novel “Les belles images,” “Soon technology will seem to us like nature itself, and we will live in a completely inhuman world.”