Last week, Sullivan & Cromwell did something you rarely see from a white-shoe law firm: They apologized to a court. The reason? AI hallucinations in a legal filing.
Let that sink in. One of the most sophisticated law firms in the world, armed with top-tier talent and resources, submitted work that included fabricated or unreliable AI-generated content.
This gets to the heart of a question we at Alphy hear regularly: “Why not just use GenAI for legal and compliance review?”
Here’s the answer: GenAI was built to sound right. Legal work requires being provably right.
The problem with “smart” AI is that frontier models, including GPT-4o, Claude, Gemini, and others, are extraordinary generalists. They summarize beautifully. They write fluidly. They reason impressively. But they are not built for deterministic detection.
In recent internal benchmark testing, we compared Alphy’s HarmCheck against leading frontier LLMs across legal and regulatory documents. We saw three consistent issues:
- Hallucination isn’t the only problem. Noise is. On a document tampering (spoliation) opinion, leading models flagged 180 to 634 passages when only 55 were actually relevant: at the high end, more than eleven flags for every relevant passage. That’s not intelligence. That’s a review burden.
- Coverage is wildly inconsistent. The same model that caught 94% of issues in one document missed everything in another. In the legal sector, “sometimes right” is just another way of being wrong.
- Outputs aren’t reproducible. Run the same document twice, and you can get different answers (see the sketch after this list). Try explaining that to a regulator. Or a judge.
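For readers who want to see why, here is a minimal sketch using the OpenAI Python SDK. The model name and prompt are illustrative placeholders, not our benchmark setup: the point is simply that generative models sample tokens stochastically at nonzero temperature, so two identical requests can return two different answers.

```python
# Illustration only: why generative output varies run to run.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY set in the environment;
# the model name and prompt are placeholders, not our benchmark setup.
from openai import OpenAI

client = OpenAI()
prompt = "List every passage in this opinion that evidences document tampering."

answers = set()
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sampling: each token is drawn from a distribution
    )
    answers.add(response.choices[0].message.content)

# With nonzero temperature this set routinely holds more than one entry:
# the same input does not guarantee the same output.
print(f"{len(answers)} distinct answer(s) across 3 identical runs")
```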
What “We Don’t Hallucinate” Actually Means
With HarmCheck, we took a fundamentally different approach to classifier training and user experience. We don’t generate answers. We classify risk deterministically, at the sentence level.
That means every flagged sentence is traceable and auditable. The same document produces the same result every time. There are no invented citations, no fabricated language, no guesswork.
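To make the contrast concrete, here is a toy sketch of the pattern, not HarmCheck’s proprietary models: a fixed classifier is applied sentence by sentence, every flag carries the index of the sentence that triggered it, and re-running the same document provably yields the same flags.

```python
# Toy sketch of deterministic, sentence-level risk classification.
# HarmCheck's actual classifiers are proprietary; this only illustrates the
# pattern: no generation, no sampling, and every flag traces to a sentence.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Flag:
    sentence_index: int   # traceability: which sentence triggered the flag
    sentence: str
    label: str

# Stand-in for a trained classifier: any pure function of the sentence text.
# A production system would run a frozen model with argmax decoding here.
def classify(sentence: str) -> str | None:
    if re.search(r"\b(shred|delete|destroy)\b.*\b(document|email|record)s?\b",
                 sentence, re.IGNORECASE):
        return "document_tampering"
    return None

def review(document: str) -> list[Flag]:
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [Flag(i, s, label)
            for i, s in enumerate(sentences)
            if (label := classify(s)) is not None]

doc = "Please review the contract. Then delete the emails about the audit."
assert review(doc) == review(doc)  # same document, same result, every time
for flag in review(doc):
    print(flag.sentence_index, flag.label, "->", flag.sentence)
```

The design choice that matters here is the absence of generation: a classifier can only label sentences that already exist in the document, so there is nothing to hallucinate.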
We also made a deliberate choice early on: We are not trying to be good at everything. We have built specialized classifiers, intentionally and strategically, to detect some of the most egregious, expensive, and damaging forms of harm in business and legal environments. From discrimination and retaliation to document tampering, unfair lending, and insider trading, each model is trained for precision in high-stakes scenarios.
We are not aiming to be good for all and great for none. We are intent on being great for some, including providing the most trustworthy and accurate tools for expedited eDiscovery or rapid-deploy audits, where completeness, consistency, and defensibility matter most. That’s not because we’re “smarter.” It’s because we’re built for the job.
The Real Takeaway for Legal Teams
The Sullivan & Cromwell moment isn’t about one firm. It’s a signal. We’re entering a phase where AI is embedded in legal workflows, but the tooling choices matter more than ever.
Use GenAI for summarization, brainstorming, and first drafts. But when it comes to risk detection, compliance, and evidence, you need defensible systems.
We’ve built HarmCheck to detect with clarity and accuracy, and we now have 50 proprietary AI models surfacing a range of harms. Our vision is not to replace lawyers or outwrite LLMs, but to provide something they can’t: a system that doesn’t hallucinate because it doesn’t guess. A system trained on the underlying laws. A system that understands context instead of merely searching for keywords.
For legal teams, that means less noise, clearer evidence, and results they can stand behind.
Book a free demo of HarmCheck today.
By Alphy Staff
HarmCheck by Alphy is an AI communication compliance solution that detects and flags harmful, unlawful, or unethical language in digital communication. Alphy was founded to reduce the risk of litigation from harmful and discriminatory communication.
