OpenAI, Google, and other tech companies train their chatbots with massive amounts of data collected from books, Wikipedia articles, news stories, and other sources on the Internet. But in the future they hope to use something called synthetic data.
That's because tech companies may be running out of high-quality text the Internet has to offer for AI development. And companies are facing copyright lawsuits from authors, news organizations and computer programmers for using their works without permission. (In one of these lawsuits, The New York Times sued OpenAI and Microsoft.)
Tech companies say synthetic data will help reduce copyright issues and expand the supply of training material needed for artificial intelligence. Here's what to know about it.
What is synthetic data?
It is data generated by artificial intelligence.
Does this mean that tech companies want AI to be trained by AI?
Yes. Instead of training AI models with text written by people, tech companies like Google, OpenAI, and Anthropic hope to train their technology with data generated by other AI models.
Does synthetic data work?
Not exactly. AI models make mistakes and make things up. They have also been shown to pick up the biases that appear in the Internet data they were trained on. So if companies use AI to train AI, they may end up amplifying the technology's own flaws.
Is synthetic data widely used by tech companies right now?
No. Tech companies are experimenting with it. But due to the potential flaws of synthetic data, it is not a major part of how AI systems are built today.
So why do tech companies say synthetic data is the future?
Companies think they can refine how synthetic data is created. OpenAI and others have explored a technique in which two different AI models work together to generate more useful and reliable synthetic data.
One artificial intelligence model generates the data. Then a second model judges that data, much as a human would, deciding whether it is good or bad, accurate or not. AI models are actually better at judging text than at writing it.
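As a rough illustration of this generate-then-judge loop, here is a minimal Python sketch. It is not any company's actual pipeline: `generator` and `judge` are hypothetical stand-ins for two separate AI models, and the length-based score is a placeholder heuristic.

```python
# Minimal sketch of the generate-then-judge idea described above.
# `generator` and `judge` are hypothetical stand-ins for two separate
# AI models; the scoring rule is a placeholder, not a real quality metric.

def generator(prompt: str) -> list[str]:
    """Stand-in for a model that drafts several candidate texts."""
    return [f"{prompt} (draft {i})" for i in range(3)]

def judge(candidate: str) -> float:
    """Stand-in for a second model that scores a candidate's quality."""
    # Placeholder heuristic; a real judge model would assess accuracy
    # and usefulness, not length.
    return float(len(candidate))

def make_synthetic_example(prompt: str) -> str:
    """Keep only the draft the judge rates highest; the rest are discarded."""
    candidates = generator(prompt)
    return max(candidates, key=judge)

if __name__ == "__main__":
    print(make_synthetic_example("Explain photosynthesis simply."))
```

The key design point, per the quote below, is that the judging model only has to pick the best option from a set, a task AI handles more reliably than writing from scratch.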
“If you give technology two things, it's pretty good at picking which one looks best,” said Nathan Lile, chief executive of artificial intelligence start-up SynthLabs.
The idea is that this will provide the high-quality data needed to train an even better chatbot.
Does this technique work?
Sort of. It all comes down to that second AI model. How good is it at judging text?
Anthropic has been the most vocal about its efforts to make this work. It builds the second artificial intelligence model using a “constitution” curated by the company's researchers. This teaches the model to choose text that supports certain principles, such as freedom, equality and brotherhood, or life, liberty and personal security. Anthropic's method is known as “constitutional AI.”
Here's how two AI models work in tandem to produce synthetic data using a process like Anthropic's:
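The Python sketch below is a loose, simplified illustration of that idea, not Anthropic's actual implementation. The two principles paraphrase those named above, and `model()` is a hypothetical stand-in for a real language-model call.

```python
# Simplified sketch of a constitutional-AI-style pipeline. The loop
# structure is illustrative only; `model()` is a hypothetical
# placeholder for a call to a real language model.

CONSTITUTION = [
    "Prefer responses that support freedom, equality and brotherhood.",
    "Prefer responses that support life, liberty and personal security.",
]

def model(prompt: str) -> str:
    """Hypothetical placeholder for a call to a language model."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_example(user_prompt: str) -> dict:
    # Step 1: a first model drafts an answer.
    draft = model(user_prompt)
    # Step 2: a second model critiques and revises the draft against
    # each principle in the constitution, standing in for a human reviewer.
    for principle in CONSTITUTION:
        critique = model(f"Critique this answer under the principle: {principle}\n\n{draft}")
        draft = model(f"Rewrite the answer to address this critique: {critique}\n\n{draft}")
    # The final prompt/answer pair becomes one synthetic training example.
    return {"prompt": user_prompt, "response": draft}

print(constitutional_example("How should a chatbot respond to an insult?"))
```

In a real system, each `model()` call would go to a large language model, and thousands of the resulting prompt/answer pairs would be collected into a training set for the next, hopefully better, chatbot.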
Even so, humans are needed to ensure that the second AI model stays on track. This limits the amount of synthetic data this process can generate. And researchers disagree on whether a method like Anthropic's will continue to improve AI systems.
Does synthetic data help companies avoid using copyrighted information?
The same AI models that generate synthetic data were trained on human-created data, much of which was copyrighted. So copyright holders can still claim that companies like OpenAI and Anthropic have used copyrighted text, images, and videos without permission.
Jeff Clune, a computer science professor at the University of British Columbia who previously worked as a researcher at OpenAI, said artificial intelligence models could eventually become somewhat more powerful than the human brain. But they will do so, he said, because they learned from the human brain.
“To borrow from Newton: AI sees further by standing on the shoulders of giant human datasets,” he said.