AI’s Next Step: Synthetic Data

Elon Musk, the CEO of xAI, agrees with other AI experts: we’ve practically run out of real-world data to train AI models. During a livestream with Mark Penn, he said, "We’ve used up basically all the knowledge humans have created to train AI." This is a significant shift, and it happened just last year. Musk echoes what former OpenAI scientist Ilya Sutskever predicted at NeurIPS last December. Sutskever talked about reaching "peak data," meaning the AI industry may need to change how models are built because new data is running short. So, what’s next? Musk believes the answer lies in synthetic data, which is data generated by AI models themselves. "The future is in synthetic data," he said. "AI will create its own data, improve itself, and keep learning."

Companies like Microsoft, Meta, and Google are already using synthetic data. Microsoft's Phi-4 model and Google's Gemma models were trained using both real and synthetic data, and Meta’s Llama models were fine-tuned with AI-generated data. Synthetic data offers big benefits, like cost savings. Writer’s Palmyra X 004 model, built mostly with synthetic data, cost only $700,000 to develop, compared to the millions some other models require. However, synthetic data has downsides too. It can lead to model collapse, in which a model trained on AI-generated output becomes less creative and more biased. If the data used to train the generating model has biases, the synthetic data and the AI’s outputs will too. This shift towards synthetic data is exciting but comes with challenges. As AI starts to create its own learning material, it’s crucial to ensure that the data is unbiased and effective.
