Can deep research fulfill the promise of AI?

Putting AI claims to the test.

One of the original promises of AI was that it could serve as a superintelligent robot librarian. AI would contain all human knowledge within itself, which it would use to instantly synthesize a response to any query. It would be far superior to search engines, because instead of just delivering relevant web pages, it would give answers.

As you probably know, that promise fell flat. Tech companies tried to make AI chatbots serve this role, but time after time, they made egregious errors. Whenever their knowledge falls short, they simply fabricate facts. In one famous case, lawyers were sanctioned for using a chatbot to write a brief that contained invented citations to nonexistent precedents.

This shouldn't have been a surprise. AI chatbots learn by scanning massive volumes of text, building up statistical associations about which words are likely to follow which other words. They have no built-in notion of truth or falsehood; those associations are all they have to rely on. It would be more accurate to say that they know what an answer sounds like, not necessarily whether it's right or wrong. As Apple researchers put it, chatbots offer "the illusion of thinking".
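To make that concrete, here's a toy sketch in Python, vastly simpler than any real chatbot, of what pure word-association "knowledge" looks like. It counts which word follows which in a tiny made-up corpus and then generates plausible-sounding continuations, with no check on whether the result is true:

```python
import random
from collections import Counter, defaultdict

# A tiny "training corpus". A real model ingests billions of documents.
corpus = (
    "the court cited the precedent . "
    "the court cited the statute . "
    "the lawyer cited the precedent ."
).split()

# Count which word tends to follow which word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def continue_text(word, length=5):
    """Generate a plausible-sounding continuation, one word at a time."""
    out = [word]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(continue_text("the"))  # e.g. "the court cited the precedent ."
```

A real model replaces these simple counts with a neural network trained on an enormous corpus, but the principle is the same: fluency comes from statistics, not from any check against reality.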

Deep research

However, the technology is still improving. OpenAI, the company behind ChatGPT, claims to have addressed this problem with its new deep research mode:

Deep research is OpenAI's next agent that can do work for you independently—you give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. Powered by a version of the upcoming OpenAI o3 model that's optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.

...Every output is fully documented, with clear citations and a summary of its thinking, making it easy to reference and verify the information. It is particularly effective at finding niche, non-intuitive information that would require browsing numerous websites. Deep research frees up valuable time by allowing you to offload and expedite complex, time-intensive web research with just one query.

Deep research mode is based on OpenAI's o3 model, which is claimed to possess improved reasoning power for answering factual and logical queries. (Other tech companies have released their own versions of this technology, including Google with Gemini, Microsoft with Azure, Anthropic, and Perplexity.)
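OpenAI hasn't published the system's internals, but the workflow it describes (break the prompt into sub-questions, search, read, pivot as new leads appear, then synthesize a cited report) maps onto a fairly standard agentic loop. Here's a minimal, purely hypothetical sketch; every helper function in it is a placeholder I made up for illustration, not a real OpenAI or search-engine API:

```python
# Hypothetical sketch of an agentic research loop. These helpers are
# placeholders standing in for the model's internal steps; they are NOT
# real OpenAI APIs.

def decompose_into_subquestions(prompt):
    # In the real system, the model itself proposes sub-questions.
    return [f"{prompt}: background", f"{prompt}: recent developments"]

def web_search(question):
    # Placeholder for a search step returning candidate source URLs.
    return [f"https://example.org/{question.replace(' ', '-').replace(':', '')}"]

def read_and_summarize(url):
    # Placeholder for fetching a page and summarizing it.
    return {"source": url, "summary": f"notes taken from {url}"}

def propose_followups(question, findings):
    # "Pivoting as needed": new leads can become new sub-questions.
    return []

def write_cited_report(prompt, notes):
    sources = "\n".join(note["source"] for note in notes)
    return f"Report on: {prompt}\n\nSources consulted:\n{sources}"

def deep_research(prompt, max_rounds=10):
    plan = decompose_into_subquestions(prompt)
    notes = []
    for _ in range(max_rounds):
        if not plan:
            break
        question = plan.pop(0)
        findings = [read_and_summarize(url) for url in web_search(question)]
        notes.extend(findings)
        plan.extend(propose_followups(question, findings))
    return write_cited_report(prompt, notes)

print(deep_research("in vivo gene editing"))
```

Every step in a loop like this depends on what the open web happens to contain, which is relevant to the limitations discussed in the experiments below.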

OpenAI doesn't hesitate to claim that this is a step toward an artificial general intelligence, or AGI, which is a hypothetical AI that can do anything a human can do—including original research and the discovery of new knowledge.

These claims present a formidable problem for people like me who report on this technology. I don't want to be an uncritical booster or a salesman. At the same time, if this is genuinely a breakthrough, people should know about it.

The case for—or against—AI depends very much on its capabilities. If AI is only good for creating unreliable, low-quality slop, all the energy and resources that went into creating it were a waste.

On the other hand, if AI can accelerate the pace of research and discovery, then there's a real benefit to weigh against the admittedly large amounts of energy it consumes.

It can make us collectively smarter, bringing everyone up to date on what's already known, so no one will miss important knowledge that bears on their own research. It can flag connections that a human might not have noticed. It can keep moving us upward along the exponential curve of progress—if not into a utopian singularity, then at least toward a better world, where technology solves more of the problems that technology is capable of solving.

But all that depends on one very big "if". So, how does deep research mode fare?

Experiment #1: Recreating a column

For my first experiment, I asked ChatGPT to perform deep research on a topic I already know something about: the in vivo gene editing breakthrough I wrote about earlier this year. I wanted to see how well its report would accord with the research I did for that column.

Here are the results.

Reading the report, I was impressed. It correctly described the concept and explained some of the methods that have been tried, from zinc finger nucleases to CRISPR-Cas9 to the more recent base editing and prime editing, as well as delivery systems such as viral vectors or lipid nanoparticles. It listed some important milestones (first in vivo gene editing experiment, first CRISPR success, first clinically successful outcome, first use of personalized genetic therapy).

It described the story that prompted me to write about this topic, the use of gene therapy to treat the normally lethal CPS1 deficiency syndrome in a baby, as well as several ongoing clinical trials targeting different parts of the body (including some I hadn't heard about!).

I'm not a professional scientist, but from my understanding of the topic, its research was accurate. There were no obvious hallucinations or egregious mistakes. It was undeniably good at breaking a large and complicated question down into sub-queries, researching each of those independently, and synthesizing the results.

Experiment #2: Testing source quality

For my second foray, I wondered: what would happen if I asked about a subject where pseudoscience abounds? AIs often struggle to tell truth from falsehood. Would deep research rise to the challenge, or would it treat good and bad sources as equally credible?

With that in mind, I asked ChatGPT to perform deep research on psychic powers in humans. To see whether such framing would affect its output, I instructed it to consider all perspectives and "be evidence-based but open to all viewpoints" (yes, this makes no sense; that's the point).

Here are the results.

I was expecting it to fail this challenge, or at best, return an uncritical both-sides report. It didn't. Its conclusion was unambiguous:

"No psychic phenomenon has been reliably demonstrated under controlled, repeatable experimental conditions—a fact underscored by numerous reviews of the data... The general scientific consensus today remains that there is insufficient evidence to support the reality of ESP or other psychic powers."

The AI's report discussed Zener cards, the Ganzfeld experiments, government trials of remote viewing, and other famous studies from the history of parapsychology. It said that many early studies seemed to yield positive results, but these apparent successes invariably vanished under stricter experimental controls. It discussed religious and shamanic beliefs, 19th-century spiritualism, and even the James Randi Educational Foundation's million-dollar prize for proof of the paranormal.

Experiment #3: Trying to fool the AI

Given chatbots' tendency to confabulate, I wondered what would happen if I asked the AI to research a completely made-up subject. Would it cheerfully oblige me by inventing facts about something that doesn't exist and never did?

I asked ChatGPT to write a report about "the history of America's Golden Mountain National Park between 1800 and 1900". However, it wasn't fooled:

I couldn't find any official record of a U.S. national park named 'Golden Mountain National Park'... Could you please clarify if this is for a fictional or speculative scenario, or if you're referring to a specific historical or current park under a different name (such as Yosemite or another)?

Experiment #4: Researching deep research

For my final experiment, I asked ChatGPT to research itself. Specifically, I wanted a report on the strengths and weaknesses of deep research mode as a technology.

Here are the results.

Of all the reports, this is the one I was least impressed with. In my view, it didn't offer adequate skepticism or engage in any critical analysis of AI companies' puffery. It read like a glorified press release. Most of the articles it cited were from tech reporting sites repeating the companies' own claims.

This points to a problem with deep research mode: it's only as good as the available sources. It doesn't have a threshold for saying that there isn't enough information to answer a question. It uses what it can find, regardless of quality.

And, of course, it can only use what's on the open web. It can't bypass paywalls or look inside published books, so there are many sources that it can't access. When information is readily available, this is less of an issue, but I can believe it would perform poorly on obscure topics.

Conclusions

I went into this experiment prepared to be skeptical. As I said, it's intrinsic to the design of chatbots that they're giant word-association machines. They're not capable of genuine reasoning or truth-seeking deduction. Nothing about this new model changes that.

OpenAI agrees. In its introduction to deep research mode, it acknowledges that the AI can still make up facts or draw incorrect inferences. No matter how thorough its reports are, that possibility can't be completely eliminated.

However, the fact that these reports cite their sources is a major improvement. It makes it much easier to check their claims for reliability—but of course, you still have to check. It would be unwise for any human to rely on raw AI output, however scholarly it seems. There's always a chance that it will use poor-quality sources, or that those sources will say something other than what the AI claims they do.

Nevertheless, I have to admit that it's an improvement over past AI performance. I'm still not convinced that we're getting close to true AGI, but for people who want a rapid summary of what's already known about a topic, it's a potentially useful tool.

The biggest threat that deep research poses isn't the AI itself. It's that people will come to rely on it as a crutch, mindlessly believing whatever it tells them, and forgetting (or never acquiring) the skill to find sources, extract relevant information, and synthesize a conclusion on their own. In the worst case, people will forget how to think for themselves, delegating that task to a black-box AI whose creators have their own biases or nefarious motives.

Whatever abilities it possesses, AI isn't—at the moment—a superintelligence that stands above or apart from us. It's a tool created by humans, and it's defined and bounded by the human skill and intelligence that went into creating it. If we remember that, we'll be less prone to blindly trusting the technology, and it will be easier to keep a balanced view of both its promise and its limitations.
