The Mystery of AI's Secret Reading List
USA · Wednesday, April 2, 2025
The group behind the findings is co-founded by Tim O’Reilly, who is also the CEO of O’Reilly Media. They used a clever method to test the AI models: checking whether each model could distinguish original texts from AI-generated paraphrases of them. If a model could reliably pick out the original, it had likely seen that text during training. They ran this test on 13,962 paragraph excerpts from 34 O’Reilly books. GPT-4o recognized more paywalled content than older models did, a strong hint that it was trained on this restricted data.
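The distinguish-the-original test can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea only: the `score` function below is a stand-in for however a model expresses preference for one passage over another (the actual study queried OpenAI models), and the quiz format, function names, and scoring are assumptions for the sake of the example.

```python
import random

def pick_original(score, candidates):
    # Return the index of the candidate the "model" scores highest.
    # `score` is a placeholder for a model's preference signal.
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))

def guess_rate(score, quizzes):
    # Each quiz pairs one verbatim excerpt with AI-generated paraphrases.
    # If the model picks the verbatim excerpt far more often than chance
    # (1 / number of options), that suggests it saw the text in training.
    correct = 0
    for original, paraphrases in quizzes:
        options = [original] + paraphrases
        order = list(range(len(options)))
        random.shuffle(order)  # hide which position holds the original
        shuffled = [options[i] for i in order]
        if order[pick_original(score, shuffled)] == 0:
            correct += 1
    return correct / len(quizzes)
```

With four options per quiz, a model that has never seen the originals should land near a 25% guess rate; a rate well above that on paywalled excerpts is the kind of signal the researchers report.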
The group was careful to point out that their method isn't perfect. They acknowledged that OpenAI might have gotten the paywalled content from users copying and pasting it into ChatGPT. They also didn't test OpenAI's most recent models, so it's possible these newer models weren't trained on the same data. But the findings still raise important questions about how AI models are trained and what data they use.
OpenAI has been pushing for less strict rules around using copyrighted data to train AI. They've even hired experts to help fine-tune their models. This is a trend in the industry, with AI companies recruiting specialists to feed their knowledge into AI systems. OpenAI does pay for some of its training data and has licensing deals with various sources. They also offer ways for copyright owners to opt out of having their content used for training. But with lawsuits and criticisms piling up, the O’Reilly paper adds to the growing concerns about OpenAI's data practices.