According to ChainCatcher, in a live conversation with Stagwell Chairman Mark Penn, Elon Musk said that AI companies have essentially run out of real-world data for training models: 'We have exhausted the cumulative sum of human knowledge, which happened last year.' His view echoes that of former OpenAI Chief Scientist Ilya Sutskever, who said at the NeurIPS machine learning conference that the AI industry has reached 'peak data' and may need to change how it develops models.
Musk believes that synthetic data is the way to supplement real data, with AI teaching itself by generating data and evaluating its own output. Tech giants including Microsoft, Meta, OpenAI, and Anthropic have already adopted this approach, and models such as Microsoft's Phi-4 and Google's Gemma were trained on a mix of real and synthetic data. Gartner estimates that about 60% of the data used in AI and analytics projects in 2024 was synthetic.
The advantages of synthetic data include cost savings. For example, the AI startup Writer spent only about $700,000 to develop its Palmyra X 004 model, which was trained almost entirely on synthetic data, whereas developing a similarly sized OpenAI model is estimated to cost about $4.6 million. However, synthetic data also poses risks, including reduced model creativity, more biased output, and potential model collapse. In particular, if the data used to train the generating model is itself biased, the synthetic data it produces will carry that bias forward.