Deep Dive · June 2026

What Legal AI Is Actually Optimizing


Every AI tool in legal practice encodes an optimization target. In most deployments, that target does not align with the outcome the attorney would endorse.

By Dr. Reza Olfati-Saber, Founder & Chief Scientist, Wisdom Agent, Inc.

The Idea in Brief

In 2025, Stanford’s RegLab and Human-Centered AI Institute published the first peer-reviewed empirical evaluation of commercial legal AI research tools in the Journal of Empirical Legal Studies. The researchers tested Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI across 202 legal queries, hand-scored by legal experts. The headline finding: hallucination rates of 17 to 33 percent. Lexis+ AI, whose parent company had marketed it as delivering “100% hallucination-free linked legal citations,” hallucinated on 17 percent of queries. Westlaw’s AI-Assisted Research hallucinated on 33 percent — nearly twice the rate of its competitor. General-purpose chatbots performed worse still, hallucinating between 58 and 80 percent of the time on legal queries.

The study made headlines for those numbers, and the vendor responses were predictable: methodological objections, version disclaimers, promises of improvement. What received less attention was a structural observation buried in the findings. The tools’ errors were not random. They followed patterns that point to what the systems were actually optimizing for. Westlaw’s higher hallucination rate came with longer responses, averaging 350 words versus 219 for Lexis. The system that produced more detailed, more authoritative-sounding output was also the system that fabricated more frequently. That pattern is consistent with a system optimizing for response completeness. The outcome the attorney needed, citation accuracy, was a different objective, and the two had diverged.

This is the proxy problem in legal AI, and it is not confined to research tools. Every AI system deployed in legal practice — from contract drafting platforms to document review engines to citation generators — encodes an optimization target. Whether that target aligns with the professional outcome the attorney needs is an empirical question that almost no law firm is structured to ask.

The Optimization Targets That Legal AI Actually Pursues

The legal profession is unusually vulnerable to proxy drift because the outcomes attorneys care about — accuracy, completeness, analytical soundness, compliance with professional obligations — are difficult to measure at inference time, while the proxies that AI systems are trained against — response fluency, citation density, document completeness, pattern match to training data — are easy to measure and easy to optimize.

Consider four categories of legal AI now deployed at production scale, and the divergence between their optimization targets and the outcomes they are expected to serve.

Legal research tools optimize for response relevance and citation coverage — producing answers that cite real authorities and address the question asked. But relevance and coverage are proxies for correctness. A response that cites six real cases, four of which do not actually support the proposition stated, scores well on coverage and poorly on correctness. The Stanford study found exactly this pattern: tools produced “hallucinations” that included not just fabricated citations but — more subtly and more dangerously — real citations mischaracterized to support propositions they did not actually stand for. The system had learned that citing real cases correlates with accuracy, but the optimization pressure pushed it to cite aggressively rather than cite correctly.
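The gap between coverage and correctness can be made concrete with a small scoring sketch. The Python below is illustrative only: the `Citation` record and both metric functions are hypothetical names invented for this example, not part of any vendor's system or of the Stanford methodology. It scores the six-citation response described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    exists: bool    # the cited authority is a real case
    supports: bool  # an expert confirmed it supports the stated proposition


def coverage(citations):
    """Proxy metric: share of citations that point to real authorities."""
    return sum(c.exists for c in citations) / len(citations)


def correctness(citations):
    """Outcome metric: share that are real AND support the proposition."""
    return sum(c.exists and c.supports for c in citations) / len(citations)


# The example from the text: six real cases, four of which
# do not support the proposition they are cited for.
response = [Citation(True, True)] * 2 + [Citation(True, False)] * 4

print(f"coverage:    {coverage(response):.2f}")
print(f"correctness: {correctness(response):.2f}")
```

On this response the coverage proxy is perfect while only a third of the citations survive expert review, which is the divergence a coverage-driven objective cannot see.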

Contract drafting platforms optimize for clause completeness and structural conformity — producing drafts that contain all expected provisions and follow recognizable contractual patterns. But completeness is a proxy for adequacy. A contract that contains every standard clause but defines “Material Adverse Effect” in terms that conflict with the representations section has optimized for structural completeness while failing on internal consistency. The AI system cannot distinguish between a complete contract and a correct one, because correctness requires understanding the relationship between clauses in light of the specific transaction — a judgment that depends on context the model does not have.

Document review systems optimize for classification accuracy on the training distribution — correctly categorizing documents as responsive or non-responsive based on patterns learned from prior review sets. But classification accuracy on historical data is a proxy for classification accuracy on the current matter. When the legal issues shift, when the document types change, when the privilege boundaries differ from the training set, the model continues to optimize the metric it was trained against while the underlying task has moved. The proxy drifts silently, because the system has no mechanism to detect that the correlation between its training objective and the attorney’s current need has weakened.
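One way to make silent drift visible is to compare the accuracy reported on the historical review set with accuracy on a small, attorney-labeled sample from the current matter. The sketch below assumes such a sample exists; the function names, the toy labels, and the 0.10 tolerance are all hypothetical choices for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of documents where the model's call matches the attorney's."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def has_drifted(historical_acc, current_acc, tolerance=0.10):
    """True when current-matter accuracy falls more than `tolerance`
    below the accuracy measured on the historical review set."""
    return (historical_acc - current_acc) > tolerance


# Toy data (hypothetical): 1 = responsive, 0 = non-responsive.
historical_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
historical_preds = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
current_labels = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
current_preds = [1, 0, 0, 0, 0, 1, 0, 0, 1, 1]

hist_acc = accuracy(historical_preds, historical_labels)
curr_acc = accuracy(current_preds, current_labels)

print(f"historical accuracy: {hist_acc:.2f}")
print(f"current accuracy:    {curr_acc:.2f}")
print(f"drift flagged:       {has_drifted(hist_acc, curr_acc)}")
```

The model's own training metric never changes in this scenario. The drift appears only because someone labeled a fresh sample and compared the two figures.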

Billing and time-entry tools are the fourth category, and the clearest warning comes from a healthcare analog. According to the March 2026 Blue Cross Blue Shield study, AI-enabled ambient listening tools used for clinical documentation in hospitals optimized for billing-code completeness and drove a $2.3 billion increase in projected excess spending through systematic upcoding. The legal profession is not yet at this point, but the architectural conditions are present: AI-assisted time entry, automated billing narrative generation, and AI-driven matter classification systems all optimize for documentation completeness in ways that could systematically drift from the documentation accuracy that professional ethics require.

Why Law Firms Cannot See the Drift

Herbert Simon’s theory of bounded rationality explains why institutions routinely fail to detect proxy drift in the systems they deploy. Organizations do not evaluate all dimensions of a tool’s performance simultaneously. They satisfice: they identify the most salient performance criteria, evaluate against those criteria, and stop searching once the tool meets the threshold. For legal AI, the salient criteria are speed, output quality on casual inspection, integration with existing workflows, and vendor compliance certifications. The question of what the tool is actually optimizing — its loss function, its reward signal, the metric that shaped its training — is not salient. It does not appear in procurement checklists, in vendor demonstrations, or in the trial-period evaluations that most firms conduct before licensing.

The problem is compounded by the professional culture of legal practice. Attorneys are trained to evaluate work product for substantive quality. When an AI tool produces a research memo, a partner reads the memo and assesses whether it is good. If it reads well — if the analysis is coherent, the citations look real, the reasoning follows a recognizable legal structure — it passes review. But the review evaluates the output, not the optimization. The partner does not ask whether the system that produced this memo was trained to maximize analytical accuracy or to maximize the probability that a human reviewer would accept the output as adequate. Those are different objectives, and a system optimized for the second can produce outputs that pass human review while systematically drifting from the accuracy standard the first would require.

This is the legal profession’s version of Goodhart’s Law: when a measure — acceptance by a reviewing partner — becomes the target, it ceases to be a reliable measure of the outcome it was meant to track — analytical accuracy. The drift is invisible at the individual-document level. It becomes visible only in aggregate, when someone conducts the kind of systematic audit that Stanford’s RegLab performed — and that no individual law firm has the infrastructure to replicate on its own AI-generated output.
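What an aggregate audit adds over document-by-document review can be sketched numerically. Assuming a firm draws a random sample of AI-generated outputs and has experts score each one as defective or not, a Wilson score interval gives a plausible range for the underlying error rate. The sample size and defect count below are hypothetical, and `wilson_interval` is a helper written for this illustration.

```python
import math


def wilson_interval(defects, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if sample_size <= 0 or not 0 <= defects <= sample_size:
        raise ValueError("need 0 <= defects <= sample_size and sample_size > 0")
    p = defects / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical audit: 200 randomly sampled memos, 34 scored as defective.
low, high = wilson_interval(defects=34, sample_size=200)
print(f"observed rate: {34 / 200:.1%}")
print(f"95% interval:  {low:.1%} to {high:.1%}")
```

With these hypothetical numbers the interval runs from roughly 12 to 23 percent. No single memo in the sample reveals that rate; only the aggregate does.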

The Professional Obligation No One Is Meeting

The regulatory environment has moved faster than the profession’s operational response. Comment 8 to ABA Model Rule 1.1, amended in 2012 and applied to generative AI by ABA Formal Opinion 512 in 2024, directs attorneys to “keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology.” Fifteen or more state bar associations have issued ethics opinions establishing that attorneys remain professionally liable for AI-generated work product, that competence requires understanding AI limitations, and that reasonable supervision of AI tools is mandatory.

The sanctions are no longer hypothetical. In Mata v. Avianca (S.D.N.Y. 2023), two attorneys and their firm were sanctioned $5,000 for submitting a brief that cited six fabricated cases generated by ChatGPT. By early 2026, the pace of sanctions for AI-related errors was accelerating: one compilation documented 51 cases in December 2025 alone, with 36 in January 2026 and 33 in the first half of February. Harvey AI, now valued at $11 billion and used by over 100,000 lawyers across 50 percent of AmLaw 100 firms, is deployed at a scale where even a 17 percent hallucination rate, the best rate Stanford measured among the research tools it tested, would imply thousands of potentially defective outputs per week across the profession.
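The scale claim is simple arithmetic, and a short sketch makes the assumptions explicit. The user count and the 17 percent rate come from the text above; the weekly query volumes are hypothetical, and applying a rate measured on other research tools to Harvey is an assumption, not a finding.

```python
def expected_defective_outputs(users, queries_per_user_per_week, error_rate):
    """Expected defective outputs per week, assuming a uniform error rate."""
    return users * queries_per_user_per_week * error_rate


USERS = 100_000    # user count cited in the text
ERROR_RATE = 0.17  # lowest rate in the Stanford study (assumed to carry over)

# Hypothetical usage levels: one query every ten weeks, one per week, five per week.
for queries in (0.1, 1, 5):
    estimate = expected_defective_outputs(USERS, queries, ERROR_RATE)
    print(f"{queries:>4} queries/lawyer/week -> about {estimate:,.0f} defective outputs")
```

At one query per lawyer per week the estimate is about 17,000, so the claim holds under modest usage assumptions.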

The professional obligation is clear: attorneys must verify AI outputs. The operational reality is that verification at scale requires knowing what to verify — which requires understanding what the AI system was optimizing, where its optimization target diverges from the attorney’s professional standard, and what categories of error that divergence is likely to produce. Without that understanding, verification degrades into spot-checking: reading the output, assessing whether it looks right, and hoping that surface plausibility correlates with substantive accuracy. Stanford’s data demonstrates that it does not.

The Structural Question

The behavioral theory of the firm — developed by Richard Cyert and James March in 1963 — holds that organizations manage competing goals not through optimization but through negotiation, sequential attention, and satisficing. A law firm balances speed against accuracy, cost efficiency against thoroughness, client service against risk management. These tensions are productive: they prevent any single objective from dominating the others.

An AI system does not negotiate. It optimizes. When a legal AI tool is deployed inside an institution that manages competing professional objectives through human judgment and institutional culture, the tool resolves the institution’s internal ambiguity in favor of whatever metric it was trained against. If that metric is response speed, the tool will favor speed over accuracy in the margin cases where the two conflict. If it is citation density, the tool will cite aggressively even when restraint would be more analytically sound. If it is pattern conformity to the training distribution, the tool will produce outputs that look like what it has seen before, even when the current matter requires something the training set did not contain.

The question that law firm leadership should be asking — of every AI vendor, at every procurement review, at every deployment milestone — is not whether the tool works. It is what the tool is working toward. Not its stated purpose, not its marketing description, not its feature list — but its optimization target. What metric was this system trained to maximize? How does that metric relate to the professional outcome we need? What happens when the two diverge?

No vendor in the legal AI market currently surfaces this information voluntarily. No procurement framework in common use at law firms currently demands it. And no governance structure at most firms currently monitors for the aggregate drift between what their AI tools optimize and what their professional obligations require. Until that changes, the profession will continue to deploy systems whose optimization targets are invisible, whose proxy drift is unmonitored, and whose errors are discovered the way they were in Mata v. Avianca: in the courtroom, after the brief has been filed, at a cost measured not just in sanctions but in the erosion of the profession’s claim to the competence and diligence that justify its existence.

References

American Bar Association. (2024). Model Rules of Professional Conduct, Rule 1.1, Comment 8.

Blue Cross Blue Shield Association & Blue Health Intelligence. (2026, March). AI-Boosting Hospital Billing: How AI Is Shaping Hospital Billing Trends. BCBSA Research Report.

Cyert, R. M., & March, J. G. (1963). A Behavioral Theory of the Firm. Prentice-Hall.

Goodhart, C. A. E. (1975). Problems of Monetary Management: The U.K. Experience. Papers in Monetary Economics, Reserve Bank of Australia, Vol. 1.

Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2025). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Journal of Empirical Legal Studies, 0, 1–27.

Mata v. Avianca, Inc., No. 22-cv-1461, Sanctions Order (S.D.N.Y. June 22, 2023).

Simon, H. A. (1947). Administrative Behavior: A Study of Decision-Making Processes in Administrative Organizations. Macmillan.

Dr. Reza Olfati-Saber is the Founder & Chief Scientist of Wisdom Agent, Inc. His 25+ years of research span the technical foundations of multi-agent AI and the institutional-economics traditions that explain how organizations adopt new technologies. His foundational work is cited more than 49,000 times in academic literature.
