“Shallow rabbit holes”: Here’s how the AI underlying NotebookLM performs at user research synthesis

Last week we ran our final “AI for UX Researchers” workshop of the year (through Rosenfeld Media). As in our previous workshops, we put one of the major LLMs to the test for qualitative research analysis.

We’ve previously tested GPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic) because they were the most popular models underlying commercial research tools at the time.

This time, we tested Google’s Gemini 2.5 Flash, the model that powers NotebookLM under the hood. NotebookLM is growing increasingly popular among research teams, which makes it important to understand its strengths and limitations.

The test itself is simple: we run a controlled prompt and dataset many times, asking the LLM to generate themes and provide supporting quotes. We then manually evaluate the outputs from 20+ runs of that same prompt.
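If you want to try something similar yourself, here is a minimal sketch of what a repeated-run setup could look like. It assumes the google-genai Python SDK and an API key in the environment; the file names and prompt wording are illustrative placeholders, not our actual workshop materials.

```python
# Minimal sketch of a repeated-run evaluation setup (not the workshop's actual harness).
# Assumes the google-genai Python SDK and a GEMINI_API_KEY environment variable;
# file names and the prompt text below are placeholders.
import pathlib
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

transcripts = pathlib.Path("interview_transcripts.txt").read_text()
prompt = (
    "You are analyzing qualitative user research data. "
    "Identify the main themes and, for each theme, include verbatim supporting "
    "quotes attributed to the correct participant.\n\n" + transcripts
)

out_dir = pathlib.Path("runs")
out_dir.mkdir(exist_ok=True)

# Run the identical prompt 21 times; save each output for manual evaluation
# (checking quotes for fabrication, misattribution, and relevance).
for i in range(21):
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
    )
    (out_dir / f"run_{i:02d}.md").write_text(response.text)
```

The manual review step is the point: the script only collects comparable outputs, and humans then judge whether the quotes are real, correctly attributed, and relevant to the themes.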

🍎 First the good news: Google Gemini 2.5 Flash performed significantly better than GPT-4o and Claude 3.5 Sonnet at sourcing quotes to support its claims.

  • GPT-4o fabricated quotes in every run, but with Gemini 2.5 Flash this was dramatically reduced: workshop participants found fabricated quotes in only 4 out of 21 runs.

  • Claude 3.5 Sonnet attributed quotes to the wrong participant in every run, but our workshop participants did not see this issue at all with Gemini 2.5 Flash.

🪱 Now the bad news: Just because quotes are accurate doesn’t mean they’re relevant or insightful.

  • Workshop participants were consistently disappointed with the quality of the supporting quotes. Nearly every run contained irrelevant quotes, ranging from slightly off to completely nonsensical.

  • Superficiality was also a major problem across the board. One participant called the LLM-generated themes “shallow rabbit holes” (a compelling metaphor!). Surface-level reading affected the quotes themselves, too: participants felt the LLM often “ignored the larger context/meaning of what the participant was describing.”

So despite their accuracy, the quotes did not save much time when it came to validating the themes the LLM generated.

So what does this mean when we’re working with tools like NotebookLM in user research?

It means that LLM-based tools, while efficient, cannot replace a trained researcher in high-risk or highly innovative domains, for two reasons:

🧪 There’s still no easy way to validate their analysis and protect against error; and

🐰 We risk missing bigger insights while we’re working in “shallow rabbit holes.”

📓 I recently wrote up a report of findings from our first several LLM research synthesis tests, with more to come as we test more models. Give it a look if you’re interested in UX AI benchmarks (email required to download).
