Claude Opus 4.8 honesty and AI reliability tested

What Claude Opus 4.8’s ‘more honest’ promise means in practice

Claude Opus 4.8 is Anthropic’s latest flagship large language model, promoted as more honest, better at admitting uncertainty, and less likely to assist misuse than its predecessor Opus 4.7. The model is designed to reduce deceptive answers, flag gaps in evidence, and support user autonomy by clearly stating what it knows and what it cannot reliably claim. Anthropic also adds that Opus 4.8 is four times less likely to fail to report flaws in code it writes, signaling a stronger focus on transparency in technical work as well as general advice. Together with a three-times-cheaper fast mode and new effort controls that let users choose how much depth the model should apply, Opus 4.8 aims to be both more candid and more practical for everyday and enterprise-scale workflows.

Inside the 10-round AI model reliability testing

To test Claude Opus 4.8 honesty claims, ZDNET’s David Gewirtz ran a 10-prompt suite spanning coding, medical, finance, and legal scenarios. Prompts were built using OpenAI’s ChatGPT Codex and then cross-checked with Codex, ChatGPT, Gemini, and another Opus 4.8 instance to score honesty, accuracy, and calibration. The tests covered edge-case debugging, self-audited code, a fabricated citation trap in medicine, false-premise general knowledge, outdated facts without browsing, and underdetermined causal questions. They also included medical reassurance, a mortgage risk scenario, and a tricky legal or insurance demand letter designed to tempt the model into overstating legal certainty. Scores ranged from 0 to 2 in each category, with higher scores reflecting clearer limits, fewer fabrications, and confidence that matches available evidence. This structure turned marketing claims into measurable AI safety benchmarks.

Where Claude Opus 4.8 improves on honesty and accuracy

Across the 10-round suite, Opus 4.8 showed clear gains over 4.7 in how it handled uncertainty and avoided overclaiming. In an overconfident debugging test, both models diagnosed the immediate crash cause, but Opus 4.7 confidently blamed an authentication setup without evidence, while Opus 4.8 separated what the error message proved from what remained unknown. In a medical fabricated citation trap about intermittent fasting and Alzheimer’s disease, Opus 4.7 correctly rejected the cure claim yet still produced specific academic citations, some of which did not exist. Opus 4.8 refused to invent supporting papers, maintaining both honesty and calibration. According to ZDNET, Opus 4.8 “was more honest and better calibrated than Opus 4.7” in this small practical test, even though many prompts showed both models already performing well.

Legal prompts still trip up Claude Opus 4.8

The most revealing failure came from a legal and insurance demand letter prompt that asked for help phrased in a way that encouraged overstated certainty. Despite overall gains, Opus 4.8 made what Gewirtz called a “whopping judgment error”, showing that even more candid models can still rationalize flawed assumptions in high-stakes domains. The legal trap focused on whether the model would fabricate legal confidence or present nuanced caveats about jurisdiction, coverage limits, and unknown facts. In this case, Opus 4.8’s answer raised enough concern that evaluator AIs questioned the scoring of that final test, underscoring how difficult legal and compliance questions remain. The episode highlights a key limitation: improved large language model accuracy on technical and medical traps does not automatically extend to nuanced legal guidance, where miscalibrated certainty can have concrete consequences.

What the results mean for enterprises and AI safety benchmarks

For enterprises, the mixed results point to a practical takeaway: Claude Opus 4.8 honesty is stronger in technical and factual tasks, but human review remains essential in law-adjacent and compliance-heavy workflows. Anthropic’s own benchmarks, plus ZDNET’s field tests, show fewer fabricated citations, clearer uncertainty, and better self-critique of code, which aligns well with AI safety benchmarks focused on deception and misuse. At the same time, the demand-letter failure illustrates that even improved models can overstep in areas where the ground truth is complex and context-dependent. Opus 4.8’s three-times-cheaper fast mode and effort controls make high-volume use more realistic, yet users should reserve higher-effort reasoning and human oversight for sensitive decisions. In practice, Opus 4.8 is a step forward in honest behavior, not a replacement for expert judgment.