Testing locally hosted gpt-oss-20b for user research synthesis: How it compares to GPT-4o and Claude 3.5 Sonnet
OpenAI’s new gpt-oss models are a big deal for user research from a data privacy perspective, but they still produce a ton of errors. Here are the results from my first tests ⤵️
Background
Two days ago, OpenAI released their first open-weight models since 2019, and that should be exciting news for anyone who does customer research. These are the first modern OpenAI models you can run entirely on your own computer, which means you can get o3-mini-quality results without your users’ data ever having to leave your device.
This is a major data security win, and it also makes it a lot easier to experiment with AI safely and inexpensively.
Testing
Since it launched, I’ve been testing how well gpt-oss-20b (the smaller model) does at qualitative research synthesis. I’ve been able to run dozens of giant, 52K-token prompts on my standard-issue MacBook Pro in just a minute or two apiece, and the results have been surprisingly good overall, albeit with the same egregious errors we’ve come to expect in LLM citations.
I asked gpt-oss-20b to identify themes across 13 interview transcripts (the same 13 I use in my workshops). Across 30 runs, it produced a median of 9 themes, ranging from 7 to 12 per run. These themes were pretty consistent from run to run (and pretty comparable to what I’ve gotten from GPT-4o, Claude 3.5 Sonnet, and o3-mini), but still different from the themes a human researcher produced.
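If you want to try something similar, the harness can be very small. Below is a minimal sketch, not my exact setup: it assumes gpt-oss-20b is served through a local OpenAI-compatible endpoint (Ollama, for example, exposes one at http://localhost:11434/v1 and tags the model gpt-oss:20b), and the prompt wording, file layout, and theme parsing are simplified placeholders.

```python
# Minimal sketch: repeatedly run a theme-extraction prompt against a locally
# served gpt-oss-20b and tally how many themes come back per run.
# Assumes an OpenAI-compatible local server (e.g. Ollama) and transcripts
# stored as plain-text files in ./transcripts/ -- both are placeholders.
from pathlib import Path
from statistics import median

from openai import OpenAI

# Point the standard OpenAI client at the local server; the API key is unused locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Concatenate the 13 interview transcripts into one prompt.
# (Make sure your local server's context window is large enough for the full prompt.)
transcripts = "\n\n".join(
    p.read_text() for p in sorted(Path("transcripts").glob("*.txt"))
)

PROMPT = (
    "Identify the major themes across the following interview transcripts. "
    "List one theme per line, prefixed with '- ', each with a one-sentence summary.\n\n"
    + transcripts
)

theme_counts = []
for run in range(30):
    response = client.chat.completions.create(
        model="gpt-oss:20b",  # model tag used by Ollama; adjust for your server
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Count the lines the model formatted as themes.
    themes = [
        line
        for line in response.choices[0].message.content.splitlines()
        if line.startswith("- ")
    ]
    theme_counts.append(len(themes))

print(f"median themes: {median(theme_counts)}, range: {min(theme_counts)}-{max(theme_counts)}")
```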
Issues
Despite producing consistent themes overall, gpt-oss-20b was very bad at extracting verbatim quotes to support those themes. Of the runs I tested:
100% contained misquotes
100% contained misattributed quotes
90% had the same quote duplicated across different participants
80% contained irrelevant quotes that did not support the theme
Some misquotes were pretty subtle, while others had a major impact on meaning.
But the bigger issue was quote misattribution, where the AI claimed that one person’s quote came from another participant. This made some user behaviors look much more widespread than they actually were, which can lead product teams to overinvest in building low-value features.
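These failure modes are cheap to catch mechanically, because a verbatim quote either appears in the attributed participant’s transcript or it doesn’t. Here’s a rough sketch of how you might automate that check; it assumes the model’s output has already been parsed into (participant, quote) pairs and that each participant’s transcript lives in its own text file, and the file layout and normalization are placeholders rather than my exact checks.

```python
# Rough sketch of a verbatim-quote check: flags misquotes, misattributions,
# and quotes duplicated across participants. Assumes transcripts are stored
# as transcripts/<participant>.txt -- a placeholder layout.
from collections import Counter
from pathlib import Path


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences alone don't count as misquotes."""
    return " ".join(text.lower().split())


def check_quotes(extracted: list[tuple[str, str]], transcript_dir: str = "transcripts") -> None:
    # Load every participant's transcript, keyed by file name.
    transcripts = {
        p.stem: normalize(p.read_text()) for p in Path(transcript_dir).glob("*.txt")
    }
    # How many times each quote was used across all attributions.
    quote_counts = Counter(normalize(q) for _, q in extracted)

    for participant, quote in extracted:
        q = normalize(quote)
        if q not in transcripts.get(participant, ""):
            # Not in the attributed transcript: either a misquote (appears nowhere)
            # or a misattribution (appears in someone else's transcript).
            actual = [name for name, text in transcripts.items() if q in text]
            label = f"misattributed (actually {actual})" if actual else "misquote"
            print(f"[{label}] {participant}: {quote!r}")
        if quote_counts[q] > 1:
            print(f"[duplicated across {quote_counts[q]} attributions] {quote!r}")
```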
Despite these flaws, I think this is the most exciting new AI development for user research in quite a while. Being able to run a model of this quality on device makes it a lot safer for our users, and cheaper for us to run the kinds of tests at scale that tell us what AI can and can’t do for user research.
I’m already incorporating my learnings from gpt-oss-20b into my “AI Data Privacy for User Research” workshop module. If you’re looking to bring safer AI practices into your product team’s research workflows, we should talk.