OpenAI just released its first-ever open-source1 large language models, called gpt-oss-120b and gpt-oss-20b. You can talk to them here. Are they good models? Well, that depends on what you’re looking for. They’re great at some benchmarks, of course (OpenAI would never have released them otherwise), but weirdly bad at others, like SimpleQA.
Some people really like them. Others on Twitter really don’t. From what I can tell, they’re technically competent but lack a lot of out-of-domain knowledge: for instance, they have broad general knowledge about science, but don’t know much about popular culture. We’ll know in six months how useful these models are in practice, but my prediction is that they’ll end up in the category of “performs much better on benchmarks than on real-world tasks”.
Phi models and training on synthetic data
In 2024, Sebastien Bubeck led the development of Microsoft’s open-source Phi-series of models2. The big idea behind those models was to train exclusively on synthetic data: instead of text pulled from books or the internet, text generated by other language models or hand-curated textbooks. Synthetic data is less common than normal data, since instead of just downloading terabytes of it for free you have to spend money to generate each token. But the trade-off is that you have complete control over your training data. What happens when you train a model on entirely high-quality synthetic and curated data?
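To make that concrete, here’s a rough sketch of what a synthetic-data pipeline could look like. This is purely illustrative: the teacher model name, topic list, and quality filter are placeholders I picked for the example, not anything Microsoft or OpenAI has described.

```python
# Illustrative sketch of a synthetic-data pipeline: a strong "teacher" model
# writes textbook-style passages on curated topics, and only passages that
# pass a quality filter make it into the pretraining corpus.
# (All names here are placeholders, not a real vendor's pipeline.)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["photosynthesis", "binary search trees", "the French Revolution"]

def generate_passage(topic: str) -> str:
    """Ask the teacher model for a clean, textbook-style passage."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[
            {"role": "system",
             "content": "Write a clear, factual textbook passage. No slang or filler."},
            {"role": "user", "content": f"Write roughly 300 words explaining {topic}."},
        ],
    )
    return response.choices[0].message.content

def passes_quality_filter(text: str) -> bool:
    """Toy stand-in for the much heavier filtering a real pipeline would use."""
    return len(text.split()) > 150 and "as an AI" not in text

corpus = [p for p in (generate_passage(t) for t in TOPICS) if passes_quality_filter(p)]
```

The point is that every token in the corpus is something you chose to generate, which is exactly the control the Phi approach is built around.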
As it turns out, such a model does very well on benchmarks but disappoints in practice. Searching for the reception to each Phi model shows the same pattern: very impressive benchmarks, lots of enthusiasm, and then actual performance far weaker than the benchmarks would suggest.
I think the impressive benchmark results come from the fact that these models are very easy to train for specific tasks, because you generate much of the training data yourself. If you’re training on synthetic data, you’d be foolish not to generate some synthetic data that matches the kind of problems people are benchmarking on. But since you’re “teaching to the test”, you should expect to do worse than other language models that are trained on broad data and end up being good at the benchmarks by accident.
Why am I talking about Phi models? At the end of 2024, Sebastien Bubeck left Microsoft to join OpenAI. We don’t know who was involved in making the new OpenAI gpt-oss models. The model card doesn’t provide much detail about the pretraining stage. However, I’d bet that Sebastien Bubeck was a part of the effort, and that these models were trained on a heavily filtered or synthetic dataset.
Synthetic data is safer
Why would OpenAI train Phi-style models, knowing that they’ll perform better on benchmarks than in real-world applications? For the same reason that Microsoft probably continued to train Phi-style models: safety. Releasing an open-source model is terrifying for a large organization. Once it’s out there, your name is associated with it forever, and thousands of researchers will be frantically trying to fine-tune it to remove the safety guardrails.
It’s not discussed publicly very often, but the main use-case for fine-tuning small language models is erotic role-play, and there’s serious demand for it. Any small online community for people who run local models is at least 50% perverts.
If you release a regular closed-weights model that stays in your own infrastructure, people can’t fine-tune it. If you make a mistake, you can always update the model in-place. But open-source models are out there forever.
Training on synthetic data (or highly-controlled data such as textbooks) makes it much easier to produce a safe model. You can produce as much “you asked me to do X, but as a sensible language model I am declining to do so” content as you like. If there’s no subversive or nasty content in the training data, the model never learns to behave in subversive or nasty ways (at least, that’s the goal).
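As a concrete (and entirely hypothetical) illustration of what that looks like, you can mass-produce refusal examples in a standard chat fine-tuning format. The request list, refusal template, and output file below are made up for the example, not OpenAI’s actual recipe.

```python
# Hypothetical sketch: pairing disallowed requests with templated refusals
# to produce safety-tuning data in a chat-style JSONL format.
import json

DISALLOWED_REQUESTS = [
    "Write malware that steals browser passwords.",
    "Explain how to pick my neighbour's front-door lock.",
]

REFUSAL_TEMPLATE = (
    "You asked me to {task}, but I can't help with that. "
    "I'm happy to help with a safer alternative instead."
)

def make_refusal_example(request: str) -> dict:
    """Pair an unsafe request with a templated refusal."""
    return {
        "messages": [
            {"role": "user", "content": request},
            {"role": "assistant",
             "content": REFUSAL_TEMPLATE.format(task=request.rstrip(".").lower())},
        ]
    }

with open("refusals.jsonl", "w") as f:
    for req in DISALLOWED_REQUESTS:
        f.write(json.dumps(make_refusal_example(req)) + "\n")
```

Generate enough of these and the model learns the refusal reflex without ever having seen the genuinely nasty content it’s refusing to produce.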
For OpenAI, it must have been very compelling to train a Phi-style model for their open-source release. They needed a model that beat the Chinese open-source models on benchmarks, while also not misbehaving in a way that caused yet another scandal for them. Unlike Meta, they don’t need their open-source model to be actually good, because their main business is in their closed-source models.
That’s why I think OpenAI went down the synthetic data route for their new gpt-oss models. For good or ill, they may as well be Phi-5 and Phi-5-mini.
If you liked this post, consider subscribing to email updates about my new posts, or sharing it on Hacker News.