Diagnostic interviews widely used to identify depression, anxiety, bipolar disorder, personality disorders and substance use disorders do not perform with equal reliability across conditions, according to a new study published in JAMA Network Open.

The immediate consequence is practical: researchers and clinicians may need to be more cautious about treating these interviews as a universal benchmark, according to remarks by Laura Duncan of McMaster University in Ontario, Canada, described in reports. She argued that the interviews are often used as a “gold standard” even though they do not provide a definitive benchmark with excellent validity and reliability.

Background

Diagnostic interviews sit at the center of modern psychiatry. They are the structured or semi-structured assessments used to decide whether a person meets criteria for a mental disorder, both in routine care and in studies that test treatments, estimate prevalence or set eligibility rules. That gives them unusual power. If the interview is inconsistent, the rest of the evidence chain can wobble with it.

The new paper challenges the assumption that these interviews are uniformly dependable, and it does so in a focused way. It does not say psychiatric diagnosis is impossible, and it does not show that all interview-based diagnoses are weak. It says reliability varies by condition. That is a narrower claim, but a serious one, because mental health services, drug trials and public health estimates all depend on repeatable case definitions. Readers of BreakWire have seen similar concerns surface when diagnostic pathways shape care at scale, from maternity safety failures linked to culture and oversight to screening shifts such as NHS hospitals switching to home bladder cancer tests.

Psychiatric diagnosis is built around manuals and interview tools rather than blood tests or imaging markers for most disorders. In the United States, that usually means categories set out in the Diagnostic and Statistical Manual of Mental Disorders. The World Health Organization maintains the International Classification of Diseases, which is also used globally. Standardized interviews were designed to make those criteria more consistent from one assessor to the next. But standardization is not the same as proof that different interview formats will agree strongly across every diagnosis.

That distinction matters. Reliability asks whether a tool produces consistent results under comparable conditions. Validity asks whether it is measuring what it claims to measure. Peer review raises the bar for a paper before publication; it does not certify that a method is flawless or that the findings will survive replication. One study can identify a problem. It cannot settle the field.
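To make the idea of reliability concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, which compares how often two assessors agree with how often they would agree by chance. The reports do not say which statistic the study used, so the choice of kappa is an assumption for illustration, and the interviewer ratings below are invented, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: share of cases where the two raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to every case
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: did each of 10 people meet criteria for a diagnosis?
interviewer_1 = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
interviewer_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]

kappa = cohens_kappa(interviewer_1, interviewer_2)
print(round(kappa, 2))  # 0.58
```

In this made-up example the two interviewers agree on 8 of 10 cases, yet kappa is only about 0.58, because some of that agreement would be expected by chance alone. A value of 1.0 would mean perfect agreement and 0 would mean agreement no better than chance.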

What this means

The first implication is for research. If interview reliability is strong for some disorders and weaker for others, then studies that compare prevalence, risk factors or treatment response may be combining cleaner signals with noisier ones. That can blur results, especially in smaller trials. It also means investigators should be more explicit about which interview was used, who administered it and how reliability was assessed, rather than treating “diagnosed by interview” as self-explanatory. That level of reporting is not academic fussiness. It is basic measurement hygiene.

For clinicians, the message is less dramatic but more useful. A diagnostic interview remains an essential tool. It just should not be mistaken for a perfect arbiter. The study does not support abandoning interviews, and it does not justify telling patients their diagnosis is meaningless. But it does support a more careful approach: repeated assessment over time, collateral history where appropriate, and humility when symptoms straddle categories. Mental disorders evolve. So do the labels attached to them.

And for patients, there is a hard truth here that medicine does not always communicate well: a diagnosis in psychiatry is often the start of inquiry, not the end of it. Overconfidence is the real risk. When a field calls something a gold standard, people hear certainty. This study says certainty should be earned disorder by disorder, tool by tool.

That matters beyond psychiatry. Health systems are already grappling with how much confidence to place in screening and classification tools, whether for infectious disease surveillance or chronic illness risk. BreakWire recently reported on CDC modeling on a growing Ebola outbreak in central Africa and on metabolic drug evidence in a retatrutide trial cutting blood sugar and weight. Different field, same principle: if the underlying measurement is shaky, policy and care can drift in the wrong direction.

Key Facts

  • The study was published in JAMA Network Open and reported on June 6, 2026.
  • It examined the reliability of diagnostic interviews used for substance use and mental disorders, including depression, anxiety, bipolar and personality disorders.
  • Laura Duncan, a psychiatry professor at McMaster University in Ontario, Canada, was identified as one of the study’s authors.
  • Duncan said diagnostic interviews are often treated as a “gold standard” in clinical settings and research.
  • The study’s core finding was that interview reliability varies from condition to condition, rather than remaining uniformly high across diagnoses.

There are limits here, and they matter. The publicly available reports do not provide the study’s sample size, effect estimates or full methodological details, so any broader claim would outrun the evidence presented. A reliability study can reveal inconsistency, but it cannot by itself explain whether the problem comes from the interview format, the diagnostic criteria, the training of interviewers or the instability of the disorders themselves. Put simply, this paper raises a measurement warning, not a referendum on psychiatry.

The next thing to watch is whether professional groups, researchers and journal editors start demanding more disorder-specific reporting on interview performance in future papers. The study is now in the literature. The real test comes when the next wave of psychiatric research decides whether to treat that warning as a footnote or as a method problem that needs fixing.