Can’t Stop, Won’t Stop: One Semester, Eight Vibe-Coded Teaching Tools

Somewhere around tool number five, I realized I had a problem. The kind where you keep saying “almost done” and then it’s 2 AM and you’re finalizing a React app that lets your students act as bots and deploy hallucinations against one another.

I teach 21st Century Lawyering at The Ohio State University Moritz College of Law. It’s a 3-credit course on law and emerging technology — AI, cybersecurity, document tech, legal automation. And this semester, I built a custom interactive tool for nearly every class session. Eight tools (and counting) for a single course.

I won’t pretend to have come up with all of these ideas on my own. One of the best things about the legal tech education community is how generously people share. Someone demos an exercise at a conference, posts a tool on LinkedIn, or walks through a concept in a workshop, and it sparks something. Nearly every tool below started with an idea I saw someone else do and I thought, “I could build a version of that for my students.” AI-assisted development makes that possible in a way it never was before, totally collapsing the gap between “that’s a great idea” and “I’m using it in class tomorrow.” So this post is partly a show-and-tell, and partly a thank-you to the people whose work made me want to build.

Here’s what I made, who inspired it, and what I learned about what happens when a law professor with some Python experience and an AI coding assistant starts saying yes to every pedagogical impulse.

A note on the live demos: Many of these tools run on my personal API keys. I’ve put some money into keeping them live, but when it runs out I probably won’t re-up. If a demo isn’t working, you can always clone the repo and run it yourself; I’ve tried to keep everything open-source.

The Tools

1. TokenExplorer (Week 2)

The LLM Explorer web app showing a token-by-token probability visualization. The left panel has an "Input" section with the prompt "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." (estimated at 29 tokens), plus "Model Settings" with GPT-3.5 Turbo selected, temperature slider at 1.5, and max tokens at 712. A "Generate" button sits below. The right panel shows the model's output (505 completion tokens) with each token color-coded by prediction probability: green (90-100%), light green (70-89%), light blue (50-69%), yellow (30-49%), orange (10-29%), and pink (0-9%). The output is a hallucinated list of fake cases — "Smith v. Jones, 2021 WL 123 456 (District Court of California, Feb. 15, 2021)," "Brown v. Green, 2021-US-12345," etc. — with many tokens in yellow, orange, and pink, visually demonstrating that the model is generating low-confidence fabrications. The color pattern makes hallucination *visible*: students can see that case names, docket numbers, and dates are produced with notably low probability scores, revealing the model is essentially guessing.

Try it | GitHub

Inspired by: I saw someone demo something similar at a presentation a while back. I wish I could remember who, because it stuck with me. Making the probabilistic nature of an LLM visible rather than just explaining it conceptually was immediately compelling. Token visualizers aren’t new (OpenAI has one, several developers have built similar tools), but I hadn’t seen one built for a classroom context where the point is to emphasize the probabilities. If the original demo was yours, please reach out so I can credit you properly.

The problem: Students arrive thinking LLMs are databases that look up correct answers.

What I built: An interactive tool where students manipulate temperature settings and watch probability distributions shift in real time. They change context and see how a similar prompt produces different next-token predictions. They test factual questions and watch the probabilities change as more hallucinations arise.

Why it matters: Once students see that LLMs are statistical prediction engines generating likely text, not true text, everything else clicks. When they encounter hallucinations later in the semester, they already understand why they happen.
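The mechanic the tool visualizes can be sketched in a few lines of Python. This is not the app’s actual code, just a toy illustration of how temperature rescales a model’s raw scores (logits) into the next-token probabilities students see on screen; the candidate tokens and scores are invented for the example:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw model scores (logits) into next-token probabilities.

    Dividing by a higher temperature flattens the distribution,
    so unlikely tokens get sampled more often."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
# after the prompt "The court held ..."
candidates = {"that": 4.0, "the": 2.5, "banana": 0.1}

for temp in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(list(candidates.values()), temp)
    print(temp, {tok: round(p, 3) for tok, p in zip(candidates, probs)})
```

At temperature 0.5 the top token dominates; at 1.5 the tail tokens gain probability mass, which is the same shift students watch when they drag the slider.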

2. Prompt Coach (Week 5)

The Prompt Coach split-panel web app. The left "Your Workspace" panel shows a Claude conversation (model: Claude Haiku Fast). The student's prompt asks: "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." Claude's response is titled "Recent Opinions on Lawyers Using AI Hallucinations" and transparently explains it cannot verify specific citations, listing what it knows exists (Mata v. Avianca, Colorado Bar disciplinary matters, State Bar of California ethics opinions) and recommending Google Scholar for accurate citations. The right "Prompt Coach" panel (branded "21st Century Lawyering - OSU Moritz College of Law") shows a "Hallucination Risk" review. The Coach Feedback section analyzes the student's prompt: "Good news: the AI handled your high-risk request appropriately by refusing to fabricate citations. But your prompt created exactly the scenario where hallucination is most dangerous." It explains the student asked for specific citations from memory — "the AI's weakest area" — and suggests alternative approaches: requesting search and verification explicitly, uploading case databases, and notes that the follow-up ("I need 10 cases minimum") doubled down on the risk. Footer: "Created by Rebecca Fordon | MIT License | GitHub."

Try it | GitHub

Inspired by: Sean Harrington’s prompting workshops. I still haven’t had the pleasure of experiencing one in person (I’m looking forward to finally catching him at ABA TECHSHOW), but hearing about his approach to teaching prompting AI, with feedback from AI, made me want to build something that could coach students through it in real time.

The problem: Students need to practice prompting, but there’s no good way to give them feedback at scale. I can’t stand behind 30 (or even 11) laptops at once.

What I built: A split-panel web app. The left side is a blank Gemini chat where students draft and test prompts. The right side is an AI coaching interface that evaluates their technique across dimensions like context engineering, document selection, and confidentiality awareness. The coach doesn’t revise the AI’s output directly, but instead connects output problems back to what could be improved with the prompt.

Why it matters: Legal-specific coaching catches things generic prompt guides miss. It flags when a student uploads privileged documents. It notes when a prompt would work on Gemini but fail on CoCounsel’s structured skill system. It frames feedback in terms of professional judgment, not just technical optimization. This could be easily customized to track to different learning objectives (ironically, I rewrote the prompt so many times).

3. QnA Markup Unpaid Wages Client Screener (Week 6)

The QnA Markup Editor at qnamarkup.org showing the complete Wage & Hour Claim Screener. The left panel displays the full QnA Markup source code with the decision tree logic visible: Q(status) checks W-2 status, Q(timely) checks the 2-year statute of limitations, Q(issue) branches into four claim types (minimum wage, overtime, final paycheck, tip theft), with sub-questions about hourly rate thresholds ($7.25, $10.99, $11.00) and salary thresholds ($684/week). GOTO:consult tags route qualifying claims to a consultation booking page; a Q(dol) endpoint directs users to the Department of Labor hotline (1-866-487-9243) and Ohio Legal Help. The right panel shows the interactive output in "Interactive" mode, displaying the first question as a blue speech bubble: "Were you a W-2 employee?" with "Yes" and "No (I was a contractor/1099)" answer buttons. Footer links include "credits | edit | code your own."

Try it on QnA Markup 

Inspired by: For this I just directly used David Colarusso’s QnA Markup, a brilliantly simple tool for building decision trees with plain text. Gabe Tenenbaum also very generously demoed QnA Markup in a prior version of my class, which planted the seed for continuing to build exercises around it.

The problem: I needed a low-tech entry point to teach decision trees and document assembly logic — something where the focus stays on legal reasoning, not the tool.

What I built: A client intake screener that triages potential wage-and-hour claims. Does the caller qualify? Should they book a consultation or contact the Department of Labor directly? Students see legal rules as logic: if/then branching based on employer size, hourly rate, and tipped status.

Why it matters: It forces students to confront the design choices embedded in any intake tool, such as what questions to ask, in what order, what to do with edge cases. It’s intentionally low-tech (just text in a browser) so nobody gets distracted by the interface.
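The same if/then structure, stripped of the QnA Markup syntax, fits in a few lines of Python. This is a simplified sketch, not the actual screener; the branch conditions (W-2 status, the 2-year statute of limitations, the $7.25 federal minimum wage) come from the class example, and the routing strings are placeholders:

```python
def screen_wage_claim(is_w2: bool, years_since_violation: float,
                      hourly_rate: float) -> str:
    """Triage a potential wage-and-hour caller.

    A toy version of the decision tree: each branch encodes a
    legal rule as an if/then test, and the order of the questions
    is itself a design choice."""
    if not is_w2:
        # Contractors (1099) fall outside this screener entirely
        return "refer: contractor claims are out of scope"
    if years_since_violation > 2:
        # 2-year statute of limitations on the claim
        return "refer: claim may be time-barred"
    if hourly_rate < 7.25:
        # Paid below the federal minimum wage
        return "book: schedule a consultation"
    # No screened-for violation; route to the agency instead
    return "refer: contact the Department of Labor hotline"

print(screen_wage_claim(is_w2=True, years_since_violation=1, hourly_rate=6.50))
```

Writing it out this way makes the embedded design choices visible: checking W-2 status first silently screens out misclassified contractors, which is exactly the kind of edge case the exercise asks students to notice.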

4. Decision Tree to QnA Markup Translator Gem (Week 6)

Gemini interface (dark mode) showing the QnA Markup Optimizer Gem in a conversation titled "QnA Markup Decision Tree Generation." The user's brief prompt says "put in a code block." The Gem responds with QnA Markup for an "Employment Intake & Consultation Screener" described as "Verifies W-2 status and routes to a consultation if ineligible." The code shows a linear intake flow: Q(employment_status): "Were you a W-2 Employee?" → Yes → Q(employer_name): "What was the name of your employer?" → X: (free text) → Q(job_title): "What was your job title at <x>employer_name</x>?" → X: → Q(years_worked): "How many years did you work there?" → X:number → Q(final_confirmation) partially visible. This demonstrates the Gem generating a simpler guided interview structure with variable interpolation.

Try it

The problem: Students understand decision tree logic but struggle with the syntax of turning it into working code.

What I built: A Gemini Gem that bridges the gap. Students describe their logic and it generates QnA Markup. It models the idea that AI assistants can serve as a bridge between domain knowledge and technical implementation.

Why it matters: It’s the same insight that powers the entire “building legal technology” unit (and extends from the prompting unit and into the agent unit): you don’t need to be a coder, you need to be able to describe what you want clearly enough for AI to build it.

5. Ohio Unpaid Wages Screener — The 3-Minute Version (Week 6, cliffhanger)

A vibe-coded web app styled as a Better Call Saul parody — "BETTER CALL OHIO" in large red and yellow text on a black header bar, with "WAGE & HOUR CLAIM SCREENER" as a subtitle. A yellow badge in the upper right reads "YOUR RIGHTS PROTECTED!" with a scales-of-justice icon. Below, on a cream/yellow background, a white card labeled "OFFICIAL SCREENER" (in a red diagonal badge) shows the "CASE INTAKE" heading with the first question in italic: "Were you a W-2 employee?" Two answer buttons styled in black borders read "YES →" and "NO (I WAS A CONTRACTOR/1099) →". At the bottom, a disclaimer reads "ATTORNEY ADVERTISING - RESULTS NOT GUARANTEED" with small dollar sign, clock, and calendar icons. The design deliberately mimics the aesthetic of a late-night TV legal ad.

Try it

The problem: I needed a dramatic way to introduce vibe coding.

What I built: A React app in Gemini Canvas, built on the QnA Markup decision tree we just made. It took roughly 1 minute to ctrl-C/ctrl-V the QnA Markup and generate the app.

Why it matters: I revealed it side-by-side at the end of class as a cliffhanger: “Same basic functionality, but it made a website out of it.” We then discussed, “So why wouldn’t you always vibe-code?” Students surfaced the hard questions themselves: Is the code correct on the law? Is it deterministic (will it always come out the same way)? Who hosts it? Would it be as good if I asked it to generate directly from the law, rather than creating the decision tree ahead of time? Why did it add “OFFICIAL”?¹ It set up the entire vibe-coding class the next day perfectly.

¹ (Eagle-eyed readers will notice that I was too much of a coward to share a direct link to the screenshotted version, and if you visit the link you’ll instead see prominent “parody” stamps).

6. Citation Extractor Gem (Week 7)

Gemini interface (dark mode) showing the Citation Extractor custom Gem in action. The user's prompt reads "Extract citations from the attached PDF." The Gem's response begins with an analysis note explaining it extracted all legal case citations from the document, including those identified as "bogus" or "phony" AI-generated research by the Special Master. Below is a structured table with columns: Bluebook Citation, Core Citation, Court, Year, and Pinpoint Pages. Visible entries include Aetna Cas. & Surety Co. v. Superior Court (Cal. Ct. App., 1984), Arrowhead Capital Finance v. Picturepro (9th Cir., 2023), Boone v. Vanliner Ins. Co. (Ohio, 2001), Booth v. Allstate Ins. Co. marked as "(Flagged)" (Cal. Ct. App., 1989), and Braun ex rel Advanced Battery Techs. v. Zhiguo Fu (S.D.N.Y., 2015). The Gemini model selector shows "Thinking" mode.

Try it

The problem: I needed a way for students to quickly pull all case citations out of a brief to feed into verification workflows.

What I built: A Gemini Gem that takes an uploaded brief and returns a structured table of all case citations, ready for Get & Print on Westlaw and Lexis.

Why it matters: It serves double duty. Practically, it supports the hallucination game (below). Pedagogically, it’s a concrete example of a custom AI assistant built for a specific legal task, connecting back to the Gems work from Week 6 and forward to agentic AI in Week 9. Students see that I practice what I teach: when you have a repetitive legal task, you build a tool for it.
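The Gem itself is just a prompt on top of Gemini, but the underlying task is easy to approximate in ordinary code. Here is a rough regex sketch (my own illustration, not the Gem’s logic) of what “pull the citations into a table” means mechanically; a real extractor would need to handle far messier Bluebook formats, multi-part reporters, short forms, and id. references:

```python
import re

# One party name: words that each start with a capital letter
NAME = r"[A-Z][A-Za-z.'\-]*(?:\s[A-Z&][A-Za-z.'\-]*)*"

# Matches "Name v. Name, 556 U.S. 662 (2009)"-style citations only;
# reporters with internal spaces like "F. Supp. 2d" will slip through.
CITE = re.compile(
    rf"({NAME})\s+v\.\s+({NAME}),\s+"
    r"(\d+\s+[A-Za-z.\d]+\s+\d+)"
    r"(?:\s+\((\d{4})\))?"
)

def extract_citations(text: str) -> list[str]:
    """Return each full citation string found in the text."""
    return [match.group(0) for match in CITE.finditer(text)]

brief = ("The court cited Ashcroft v. Iqbal, 556 U.S. 662 (2009), "
         "and Bell Atl. Corp. v. Twombly, 550 U.S. 544 (2007).")
print(extract_citations(brief))
```

This prints both citations as full strings, ready to split into table columns. The gap between this brittle pattern and the Gem’s flexible handling of real briefs is a nice in-class illustration of why LLM-based extraction is attractive for this task.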

7. Citation Hallucination Game (Week 7)

The Citation Hallucination Game web app during the verification (Solo Practice) phase. Top navigation bar shows "Citation Game | Solo Practice | Reviewed: 0/23 | Flagged: 0" with "Finish & See Results" and "Export PDF" buttons. The main panel displays a legal brief — a motion to dismiss in a bad faith insurance case, citing Patterson v. State Farm Mut. Auto Ins. Co. and Ashcroft v. Iqbal. A sidebar panel labeled "REVIEW CITATION" highlights the current citation (Patterson v. State Farm, 2019 U.S. Dist. LEXIS 31742) in yellow, with two buttons: "Looks Legit" and "Flag as Fake," plus Previous/Next navigation. The interface lets students step through each citation in the brief and decide whether it's real or fabricated.

Try it | GitHub

Inspired by: David Colarusso again, specifically his automation bias exercise, which flips the typical classroom dynamic by making students experience bias rather than just learn about it. That “make them do the thing, not just hear about the thing” approach is exactly what I was looking for. And his hallucination checking frame was pretty handy too because I also wanted to explicitly teach a process for that.

The problem: Students (like many of us) think hallucinations will not happen to them, because they will always read the cases. They may also see hallucinations as mainly a made-up-cases problem, not realizing that hallucinations come in many flavors, some harder to detect than others.

What I built: A competitive team exercise. Students first create hallucinated citations (fabricated cases, swapped numbers, mischaracterized holdings, altered quotes), forcing them to internalize the different types of hallucinations and when they are likely to arise. Then they try to catch another team’s fakes under time pressure, mirroring the real conditions lawyers may face when reviewing AI-generated work on deadline.

Why it matters: The key lesson isn’t “can you catch every error” but “given limited time, which errors do you prioritize?” Students independently discovered that no single verification tool is sufficient, that the easy hallucinations (fabricated cases) are solved relatively quickly, but that the dangerous ones (subtle mischaracterizations) slip through.

8. Document Technology Gallery

Landing page of the Document Technology Gallery web app. Header reads "Document Technology Gallery" with subtitle "Beyond Word and Acrobat: The tools lawyers actually use." Introductory text explains that the gallery covers seven key document technology patterns. Three demo cards are visible, organized by category: under "CREATE," cards for Document Automation (~90 sec, "Watch a contract write itself as you answer questions") and Clause Library (~60 sec, "Assemble a contract from pre-approved building blocks"); under "EDIT," a card for AI-Assisted Document Editing (~90 sec, "Edit a legal document three ways: AI, rules, and consistency checking"). Two more cards are partially visible under a "REVIEW" heading. Each card has a "Try it →" link. Clean white background with subtle card borders.

Try it | GitHub

Inspired by: Barbora Obracajova’s Legal Tech Gallery, which she vibe-coded for her “Modern Lawyers” course and shared on LinkedIn. She created a series of quick interactive demos, 60 seconds each, where students touch the technology instead of watching slides. I had a class quickly approaching on document competencies — a topic I have always struggled to teach, given how difficult it is to get Word add-ons approved at my institution — so the moment I saw her post I knew this would help me.

The problem: Students need hands-on exposure to core document technology competencies (I chose automation, clause libraries, editing, brief verification, metadata cleaning, redaction, and contract review) but there’s no time to go deep on all of them in one class.

What I built: Seven quick interactive demos, about 60 seconds each. Students touch each technology rather than just hear about it. The gallery format builds a shared baseline before diving into further exploration and case studies.

Why it matters: It surfaces the “I didn’t know that existed” moments.

What I Learned from this Process

  • I can’t stop because the feedback loop is immediate. I see a gap in my teaching, I build something that night, I use it in class the next day. That’s never been possible for me before. I’ve been writing little scripts for years, but AI-assisted development took me from utility scripts to deployed interactive applications. The jump from “I can automate this for myself” to “I can build this for my students” is huge.
  • Vibe-coding replaced the paper handout. In other classes, when I want students to work through a problem, my first instinct is a handout: a worksheet, fact pattern, checklist, or problem set. In this class, my first instinct became “what if I built something they could interact with?” When the subject is technology, it made sense to me that the medium should be too.
  • Creating interactive apps makes the class more hands-on. Every one of these tools exists because I couldn’t find another way to give students a particular experience. You can lecture about how LLMs predict tokens. Or you can let students drag a temperature slider and watch the probability distribution change. You can tell students hallucinations are dangerous. Or you can have them create hallucinations and then fail to catch someone else’s.
  • Building is teaching. When I vibe-code a tool in front of students, or reveal that I built something in 3 minutes that took an hour the traditional way, I’m modeling some of the ways they can use technology. The message isn’t always “look what I made,” but also “you could make this too, and you should, because the people who understand the problem best should be the ones building the solution.”

Eight tools. Week 8 of a 13-week semester. And I have no doubt there are more to come.

If you’re teaching legal technology and you’ve been thinking “I wish I had something that did X” — you probably have enough to build it. Pick the smallest version of the idea, open a vibe-coding tool, and see what happens. And then share it!

What the Science Says About Hallucinations in Legal Research

This is Part 1 of a three-part series on AI hallucinations in legal research. Part 2 will examine hallucination detection tools, and Part 3 will provide a practical verification framework for lawyers.

You’ve heard about the lawyers who cited fake cases generated by ChatGPT. These stories have made headlines repeatedly, and we are now approaching 1,000 documented cases in which practitioners or self-represented litigants submitted AI-generated hallucinations to courts. But those viral incidents tell us little about why this is happening and how we can prevent it. For that, we can turn to science. Over the past three years, researchers have published dozens of studies examining exactly when and why AI fails at legal tasks—and the patterns are becoming clearer.

A critical caveat: The technology evolves faster than the research. A 2024 study tested 2023 technology; a 2025 study tested 2024 models. By the time you read this, the specific tools and versions have changed again. That’s why this post focuses on patterns that persist across studies rather than exact percentages that will be outdated in months.

Here are the six patterns that matter most for practice.

Pattern #1: Models and Data Access

Not all AI tools are created equal. The research shows a dramatic performance gap based on how the tool is built, though it’s important to understand that both architecture and model generation matter.

Bar chart showing hallucination rates for general-purpose language models on legal queries. Llama 2 hallucinated 88% of the time, GPT-3.5 hallucinated 69%, and GPT-4 hallucinated 58%, demonstrating that newer models perform better but still hallucinate on more than half of legal questions.
Dahl et al., “Large Legal Fictions,” Fig. 1. The figure shows reduced hallucination rates with more advanced and modern models.

Models are improving over time. A comprehensive 2024 study by Stanford researchers titled “Large Legal Fictions” tested 2023 general-purpose models on over 800,000 verifiable legal questions and found hallucination rates between 58% and 88%. Within that cohort, newer models performed better: GPT-4 hallucinated 58% of the time compared to GPT-3.5 at 69% and Llama 2 at 88%. This pattern of improvement with each model generation appears fairly consistent across AI development.

Chart comparing hallucination rates across legal AI tools and GPT-4. Lexis+ AI had a 17% hallucination rate, Westlaw AI-Assisted Research had 33%, and GPT-4 had 43%, showing that legal-specific tools with retrieval-augmented generation substantially outperform general-purpose models.
Magesh et al., “Hallucination-Free?”, Fig. 1. The study shows a higher hallucination rate for the general-purpose model GPT-4 than for specialized legal research products.

Architecture matters, but it’s not the whole story. A second Stanford study, titled “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools”, published in 2025 but testing tools from May 2024, found hallucination rates of 17% for Lexis+ AI, 33% for Westlaw AI-Assisted Research, and 43% for GPT-4. These errors include both outright fabrications (fake cases) and more subtle problems like mischaracterizing real cases or citing inapplicable authority. This head-to-head comparison shows legal-specific tools with retrieval-augmented generation (RAG) substantially outperforming general LLMs.

A randomized controlled trial by Schwarcz et al. reinforces the architecture point from a different angle. When 127 law students used a RAG-based legal tool (Vincent AI) to complete legal tasks, they produced roughly the same hallucination rate as students using no AI at all. Students using a reasoning model without RAG (OpenAI’s o1-preview) produced better analytical work but introduced hallucinations. Both tools dramatically improved productivity—but only the RAG tool did so without increasing error rates. However, the Vals AI Legal Research Report (October 2025, testing July 2025 tools) found that ChatGPT matched the legal AI tools: it achieved 80% accuracy while the legal tools scored 78-81%. The key difference? The ChatGPT configuration in the Vals study used web search by default (a form of RAG), giving it access to current information and non-standard sources, while the legal tools restrict themselves to proprietary databases for citation reliability. For five question types, ChatGPT actually outperformed the legal AI products on average. Both outperformed the human-lawyer baseline of 69%.

Takeaway: Purpose-built legal tools generally excel at citation reliability and authoritative sourcing, but general AI with web search can compete on certain tasks. The real advantage isn’t RAG architecture alone—it’s access to curated, verified legal databases with citators. Know your tool’s strengths: legal platforms for citations and treatment analysis, general AI with web search for non-standard or very recent sources.

Pattern #2: Sycophancy

One of the most dangerous hallucination patterns is that AI agrees with you even when you’re wrong.

The Stanford “Hallucination-Free?” study identified “sycophancy” as one of four major error types. When users ask AI to support an incorrect legal proposition, the AI often generates plausible-sounding arguments using fabricated or mischaracterized authorities rather than correcting the user’s mistaken premise.

Similarly, a 2025 study on evaluating AI in legal operations found that hallucinations multiply when users include false premises in their prompts. Anna Guo’s information extraction research from the same year showed that when presented with leading questions containing false premises, most tools reinforced the error. Only specialized tools correctly identified the absence of the obligations the user incorrectly assumed existed.

This happens because of how large language models work: they’re trained to generate helpful, plausible text in response to user queries, not to verify the truth of the user’s assumptions.

Takeaway: Never ask AI to argue a legal position you haven’t independently verified. Phrase queries neutrally. If you ask “Find me cases supporting [incorrect proposition],” AI may happily fabricate them.

Pattern #3: Jurisdictional and Geographic Complexity

AI performance degrades sharply when dealing with less common jurisdictions, local laws, and lower courts.

Table showing AI hallucination rates varying by geographic location. For the same legal scenarios, hallucination rates were 45% for Los Angeles, 55% for London, and 61% for Sydney.
Curran et al., “Place Matters,” Fig. 1. Hallucination rates by jurisdiction.

Researchers in a study called “Place Matters” (2025) tested the same legal scenarios across different geographic locations and found hallucination rates varied dramatically: Los Angeles (45%), London (55%), and Sydney (61%). For specific local laws, such as an Australian state’s Residential Tenancies Act, hallucination rates reached 100%.

The Vals report found a 14-point accuracy drop when tools were asked to handle multi-jurisdictional 50-state surveys. The Large Legal Fictions study confirmed that models hallucinate least on Supreme Court cases and most on district court metadata.

Why? Training data is heavily weighted toward high-profile federal cases and major jurisdictions. State trial court opinions from smaller jurisdictions are underrepresented or absent entirely.

Takeaway: Apply extra scrutiny when researching state or local law, lower court cases, or multi-jurisdictional questions. These are exactly the scenarios where training data or search results may be thinner, causing hallucinations to spike.

Pattern #4: Knowledge Cutoffs

AI tools trained on historical data will apply outdated law unless they actively search for current information.

The “AI Gets Its First Law School A+s” study (2025) provides a striking example: OpenAI’s o3 model applied the Chevron doctrine in an Administrative Law exam, even though Chevron had been overruled by Loper Bright. The model’s knowledge cutoff was May 2024, and Loper Bright was decided in June 2024.

This temporal hallucination problem will always exist unless the tool has web search enabled or actively retrieves from an updated legal database. Not all legal AI tools have this capability, and even those that do may not use it for every query.

Takeaway: Verify that recent legal developments are reflected in AI responses. Ask vendors whether their tool uses web search or real-time database access. Be especially careful when researching areas of law that have recently changed or may be affected by material outside the AI tool’s knowledge base.

Pattern #5: Task Complexity

AI performance correlates directly with task complexity, and the drop-off can be severe.

Simple factual recall—like finding a case citation or identifying the year of a decision—works relatively well. But complex tasks involving synthesis, multi-step reasoning, or integration of information from multiple sources show much worse performance.

The Vals report documented a 14-point accuracy drop when moving from basic tasks to complex multi-jurisdictional surveys. A 2025 study on multi-turn legal conversations (LexRAG) found that RAG systems struggled badly with conversational context, achieving best-case recall rates of only 33%.

Multiple studies note that statute and regulation interpretation is particularly weak. Anna Guo’s information extraction research found that when information is missing from a document (like redacted liability caps), AI fabricates answers rather than admitting it doesn’t know.

Takeaway: Match the task to the tool’s capability. High-stakes work, complex multi-jurisdictional research, and novel legal questions require more intensive verification. Don’t assume that because AI handles simple queries well, it will handle complex ones equally well.

Pattern #6: The Confidence Paradox

Perhaps the most insidious finding: AI sounds equally confident whether it’s right or wrong.

The “Large Legal Fictions” study found no correlation between a model’s expressed confidence and its actual accuracy. An AI might present a completely fabricated case citation with the same authoritative tone it uses for a correct one.

This isn’t a bug in specific products—it’s fundamental to how large language models work. They generate statistically probable text that sounds human-like and professional, regardless of underlying accuracy. In fact, recent research suggests the problem may worsen with post-training: while base models tend to be well-calibrated, reinforcement learning from human feedback often makes models more overconfident because they’re optimized for benchmarks that reward definitive answers over honest expressions of uncertainty.

Even the best-performing legal AI tools in the Vals report achieved only 78-81% accuracy. That means roughly one in five responses contains errors, even from top-tier specialized legal tools.

Takeaway: Never trust AI based on how confident it sounds. The authoritative tone is not a reliability signal. Verification is non-negotiable, no matter which tool you use. Be especially wary of newer models that may sound more confident while not necessarily being more accurate.

What This Means for Practice

Specific hallucination percentages will change as technology improves, but these six patterns appear to persist across different models, products, and study methodologies. Understanding them should inform three key decisions:

1. Tool Selection
Understand your tool’s strengths. Legal-specific platforms excel at citation reliability because they search curated, verified databases with citators. General AI with web search can compete on breadth and recency but lacks those verification layers. Within any tool, look for features like the ability to refuse to answer when uncertain (some tools are now being designed to decline rather than hallucinate when data is insufficient—a positive development worth watching for).

2. Query Strategy
Avoid false premises and leading questions. Phrase queries neutrally. Recognize high-risk scenarios: multi-jurisdictional questions, local or state law, lower court cases, recently changed legal doctrines, and complex synthesis tasks.

3. Verification Intensity
Scale your verification efforts to task complexity and risk factors. A simple citation check might need less verification than a complex multi-state legal analysis. But all AI output needs some verification—the question is how much.

Bottom Line

The research is clear: AI hallucinations in legal work are real, measurable, and follow predictable patterns. These studies have found that even the best legal AI tools hallucinate somewhere between 15% and 25% of the time (including both fabrications and mischaracterizations) based on current data.

But understanding these six patterns—models and data access, sycophancy, jurisdictional complexity, knowledge cutoffs, task complexity, and the confidence paradox—helps you make better decisions about which tools to use, which queries to avoid, and how intensively to verify results.

The goal isn’t to avoid AI. These tools can dramatically increase efficiency when used appropriately. The goal is to use them wisely, with eyes wide open about their limitations and failure modes.

Coming next in this series: How hallucination detection tools work and whether they’re worth using, and a practical framework for verifying AI research results.


References

Andrew Blair-Stanek et al., AI Gets Its First Law School A+s (2025).
Link: https://ssrn.com/abstract=5274547
Products tested: OpenAI o3, GPT-4, GPT-3.5
Testing period: Late 2024

Damian Curran et al., Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries, AI4A2J-ICAIL25 (2025).
Link: https://arxiv.org/abs/2511.06700
Products tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
Testing period: 2024

Matthew Dahl et al., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 16 J. Legal Analysis 64 (2024).
Link: https://doi.org/10.1093/jla/laae001
Products tested: GPT-4, GPT-3.5, PaLM 2, Llama 2
Testing period: 2023

Anna Guo & Arthur Souza Rodrigues, Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel (2025).
Link: https://www.legalbenchmarks.ai/research/phase-1-research
Products tested: GC AI, Vecflow’s Oliver, Google NotebookLM, Microsoft Copilot, DeepSeek-V3, ChatGPT (GPT-4o)
Testing period: 2024

Haitao Li et al., LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation, ACM Conf. (2025).
Link: https://github.com/CSHaitao/LexRAG
Products tested: GLM-4, GPT-3.5-turbo, GPT-4o-mini, Qwen-2.5, Llama-3.3, Claude-3.5
Testing period: 2024

Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 22 J. Empirical Legal Stud. 216 (2025).
Link: http://arxiv.org/abs/2405.20362
Products tested: Lexis+ AI, Thomson Reuters Ask Practical Law AI, Westlaw AI-Assisted Research (AI-AR), GPT-4
Testing period: May 2024

Bakht Munir et al., Evaluating AI in Legal Operations: A Comparative Analysis of Accuracy, Completeness, and Hallucinations, 53.2 Int’l J. Legal Info. 103 (2025).
Link: https://doi.org/10.1017/jli.2025.3
Products tested: ChatGPT-4, Copilot, DeepSeek, Lexis+ AI, Llama 3
Testing period: 2024

Daniel Schwarcz et al., AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice (Mar. 2025).
Link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111
Products tested: vLex (Vincent AI), OpenAI (o1-preview)
Testing period: Late 2024
Note: Randomized controlled trial with 127 law students using AI tools

Vals AI, Vals Legal AI Report (Oct. 2025).
Link: https://www.vals.ai/vlair
Products tested: Alexi, Midpage, Counsel Stack, OpenAI ChatGPT
Testing period: First three weeks of July 2025

Thanksgiving Vibe-Coding and the Case for “Single-Serving” Legal Software

Way back in 2023, I thought it was amazing how I could use generative AI to streamline my Thanksgiving prep: I gave it my recipes, and it gave me a schedule. It was a static list—a text document that told me when to put the turkey in, when to swap in the stuffing, and so on.

This year, I started with the same routine. I had six dishes—two stovetop, three oven, one “no-cook” dip—and a family I’d promised dinner by 3:00 PM. I pasted the recipes into Gemini and asked for a timeline. It handled the “Oven Tetris” flawlessly, giving me a step-by-step game plan, with times and ingredient amounts at each stage.

An image of a cooking schedule titled "Goal: Dinner at 3:00 PM. Oven Strategy: 350°F (Stuffing) → 400°F (Tart) → 500°F (Sprouts)."

The section header is "The Prep Phase (11:00 AM – 12:30 PM)" followed by the text: "Get the messy work out of the way now."

The preparatory steps are listed:

    For the Stuffing:

        Cube 1 lb white bread (if not already done).

        Chop 2 cups celery (5-7 ribs) and 2 cups yellow onion (1 large).

        Chop ⅓ cup parsley and 2 tbsp fresh herbs (thyme/rosemary/sage).

        Whisk together 1½ cups chicken broth and 2 large eggs in a measuring cup.

    For the Potatoes:

        Peel 4 lbs Yukon Gold potatoes. Cut into ¾-inch slices. Place in a large pot and cover with water (don’t turn heat on).

But then, I had a realization: I didn’t just want an answer; I wanted a tool. I wanted to be able to check things off as I went. I wanted to see an overview and *also* zoom in on the details.

So, I asked: “What if this was a web app?”

The Shift: From Consumer to Builder

In seconds, Gemini went to work. It gave me a React-based interactive checklist. Suddenly, I wasn’t looking at a static timeline; I was interacting with a piece of software.

But the real magic happened when reality hit. As anyone who has managed a closing checklist or a trial docket knows, the timeline always slips. When my guests told me they’d be an hour late, I realized I’d have to manually calculate the drift for each step.

So, I issued a feature request (this is not a good prompt, but it didn’t matter):

“Add a feature where I adjust what time I’ve finished something so the rest will update”

The AI updated the code. It added a little “reschedule” button, so when I tapped a clock icon next to “Stuffing In,” I could then tap “I Finished This Just Now,” and watch as the entire remaining schedule—the tart, the sprouts, the carrots—automatically shifted forward by an hour. Then I could do it again when I got my stuffing in later than the schedule called for. (If you’d like to check out my app you can do so here: Thanksgiving Checklist).
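The core of that reschedule feature is just a cascading time shift. Here is a minimal sketch of the idea in Python (my actual app is React, and these step names and times are invented for illustration):

```python
from datetime import datetime

def mark_finished(schedule, step, finished_at):
    """Shift every step after `step` by however late (or early) it finished."""
    drift = finished_at - schedule[step]  # positive if running late
    steps = list(schedule)
    for later in steps[steps.index(step) + 1:]:
        schedule[later] += drift
    return schedule

schedule = {
    "Stuffing In": datetime(2025, 11, 27, 12, 30),
    "Tart In": datetime(2025, 11, 27, 13, 15),
    "Sprouts In": datetime(2025, 11, 27, 14, 0),
}
# The stuffing went in an hour late, so the tart and sprouts
# automatically shift forward by an hour.
mark_finished(schedule, "Stuffing In", datetime(2025, 11, 27, 13, 30))
```

The point is that you can describe this behavior in one sentence and the AI writes the equivalent logic for you; you never need to see the code.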

The result? Despite how tightly-timed my schedule was, dinner was on the table only 15 minutes late. For my household, where “at least an hour late” is the standard for a holiday meal, this was a massive victory.

The Era of “Single-Serving Software”

We often think of legal technology as big, enterprise-grade platforms: the Case Management System, the Deal Room, the Firm Portal. These tools are excellent for standard workflows. But legal work is rarely standard. It lives in the messy, human chaos between the formal deadlines.

My Thanksgiving experiment proves that the barrier to entry for building “Micro-Tools” has collapsed. We are entering the era of Single-Serving Legal Software—bespoke apps built for a single trial, a single deal, or a single crisis, and then discarded when the matter closes.

Here is what that looks like in practice (all ideas from Gemini because I’ve been out of legal practice too long… I’m curious if readers think any have merit):

1. Litigation: The “Witness Wrangler”

Standard case management software handles court deadlines, but it rarely handles the human logistics of a trial.

  • The Problem: You have 15 witnesses. Some need flights, some need prep sessions, some are hostile. Their schedules depend entirely on when the previous witness finishes on the stand.
  • The Single-Serving App: Instead of a static spreadsheet, you spin up a dynamic dashboard shared with the paralegal team.
  • The “Reschedule” Feature: You click “Witness A ran long; pushed to tomorrow morning.” The app automatically text-alerts Witness B to stay at the hotel and updates the car service pickup time.

2. Transactional: The “Non-Standard” Closing

Deal software is amazing for corporate M&A, but terrible for “weird” assets.

  • The Problem: You are selling a massive ranch. The closing checklist includes “Transfer Water Rights,” “Inspect Cattle,” and “Repair Barn Roof.” These aren’t just document signings; they are physical events with dependencies.
  • The Single-Serving App: A logic-based checklist where “Cattle Inspection” is locked until “Barn Roof Repair” is marked Complete. If the roof crew is delayed, the inspection auto-reschedules, alerting all parties.

3. Mass Torts: The “Toxic Plume” Intake

Intake CRMs are generic. Sometimes the “qualification criteria” for a case are chemically or geographically complex.

  • The Problem: You only want to sign clients who lived in a specific, jagged geographic zone between 1995 and 1998.
  • The Single-Serving App: A simple web form where a potential client drops a pin on a map.
  • The Logic: The app performs a “point-in-polygon” check against the specific toxic plume map you uploaded. It instantly tells the intake clerk “Qualified” or “Out of Zone,” saving hours of manual review.
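The “point-in-polygon” check is a textbook ray-casting algorithm, which is exactly the kind of thing an AI assistant will write correctly on the first try. A minimal sketch in Python (the plume coordinates below are made up):

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from the
    point crosses. An odd count means the point is inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A jagged, hypothetical "plume zone" as coordinate pairs
plume = [(0, 0), (4, 0), (4, 3), (2, 1.5), (0, 3)]
point_in_polygon((1, 1), plume)    # a pin inside the zone
point_in_polygon((3, 2.9), plume)  # a pin just outside the notch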

The Accidental Product Roadmap

The beauty of this approach is that it requires zero commitment. I built this app for one dinner. I didn’t worry about making it generalizable. I didn’t build a “Recipe Importer” feature; I just hard-coded the stuffing because it was faster.

But now that I’ve used it, I’m thinking: “Next year, I should ask the AI to create a drag-and-drop interface so I can just paste URLs for any holiday.”

This is exactly how legal innovation should happen. Too often, firms try to buy or build the “Perfect Platform” first. It takes years and costs millions. Single-Serving Software acts as the ultimate Minimum Viable Product (MVP).

  1. Build a specific, hard-coded app for Jones v. Smith.
  2. Validate that the “Witness Rescheduler” actually saved the paralegal 10 hours.
  3. Generalize it only after it proves its value, so someone else in the firm can use it for Doe v. Roe.

You don’t start with the platform. You start with the problem.

A Note on Security & Tools

You might be thinking: “Wait, uploading client data to a web app? Compliance will have a heart attack.”

It’s a valid concern. But the beauty of these AI-generated tools is that they can often be delivered as a single HTML file that you can then save and run entirely locally on your machine—no data leaves the browser. Furthermore, if you are using an Enterprise version of your preferred LLM, your inputs remain within the firm’s secure boundary.

Speaking of tools, this capability isn’t exclusive to one platform. Whether you use Gemini, ChatGPT, or Claude, the ability to turn a prompt into a working React or HTML artifact is now a standard feature. The power lies not in the specific model, but in your willingness to ask for code instead of text.

Conclusion

We are no longer just the consumers of legal software; we are the architects. We can now build the infrastructure to manage our own chaos.

The next time you are drowning in a complex matter, don’t just ask AI for a memo or a checklist. Ask it for a tool. You might just find yourself managing the chaos (almost) on time.

Legal Research Trapping You in an “AI Tunnel”? Use a Toe-hold to Get Out

I’ve been watching my legal research students use AI and noticing a common pattern.

They typically go into an AI “Ask” feature in Lexis or Westlaw, get an answer, and then continue the conversation by asking more questions. This is exactly what the tools are designed to encourage.

The problem is that this process often leaves them with only a handful of sources, and not always the most relevant or authoritative ones. They miss critical nuance, and—most dangerously—they can’t see what the AI has limited or hidden from them.

I’ve started calling this the “AI Tunnel.” And I’ve realized that as expert researchers, it’s our job to teach them how to escape it.

A person climbing a cliff. The photo is zoomed in to show only their leg from the knee down. Their toe-hold is the only thing supporting them.
Photo by Patrick Hendry on Unsplash

The “AI Tunnel” vs. The “Toe-Hold” Strategy

When I use generative AI in my own research, I’m doing something completely different from my students. I’m using it for a “toe-hold.”

  • I ask AI to “explain the elements of X” to get the key concepts, and I immediately pivot to a treatise to get further detail on those concepts.
  • I ask AI to “find the statute for Y” to get the statute number, and I immediately pivot to the Notes of Decisions.
  • I ask AI to “find a few starting cases for Z” to get one good case, and I immediately pivot to the citator and its headnotes, or use the vocabulary to craft a search.

In other words, I use AI as a 1-minute scaffold to get me to traditional research tools. My students are using it to have a 30-minute conversation that delays them from finding the best sources (or maybe they never find them at all).

They are missing the pivot.

Our “Expert Blind Spot” is Their Biggest Hurdle

My first instinct was to just tell them my strategy. “Don’t stay in the AI! Pivot!”

A GIF of a clip from the TV show Friends. Ross and an unseen friend are trying to get a couch up the stairs and Ross is shouting "PIVOT!!"
Friends is cool again, so I can use this GIF

But as we all know, that doesn’t work. This is a classic “Expert Blind Spot” problem.

The “Toe-Hold” strategy, for an expert, is one seamless, automatic action. For a novice, it’s a series of high-friction steps that rely on implicit skills we take for granted:

  1. Diagnostic Skimming: We don’t read the AI’s wall-of-text answer. We scan it. Our students, who are not yet skilled at skimming, try to read it and get overwhelmed by the noise.
  2. “Pivot Point” Identification: Our expert eyes are trained to instantly spot the “pivot points”: a statute number (O.R.C. 5321.16), a key case name (Bowen v. Kil-Kare, Inc.), or a term of art (“natural accumulation rule”). To a 1L, this is all just undifferentiated text.
  3. Process Knowledge: We automatically know the “if-then” script: “If I have a case, then I go to the citator.” A 1L doesn’t have that script memorized yet.

So we can’t just tell them the strategy. We have to make these implicit skills explicit.

Making the “Toe-Hold” Teachable: Three Concrete Techniques

I’m now redesigning my talks to 1Ls around this single goal. Here are the three main pedagogical tools I’m using to scaffold this “expert” skill for “novices”:

1. The “Narrated Skim”

This is the most critical piece. I’ll do a “canned” demo, put an AI-generated answer on the screen, and literally narrate my internal monologue out loud.

  • “Okay, I’ve got my answer. I am NOT reading this whole thing. My eyes are scanning only for a statute number, a case name, or a key term of art. I’m ignoring the summary… ignoring the intro… Ah! [point with mouse] Right here: Bowen v. Kil-Kare, Inc. That’s my toe-hold. That’s all I need. I am now leaving this screen.”

This is Cognitive Apprenticeship—making our expert thinking visible.

2. The “Pivot Point” Checklist

To lower cognitive load, I’m giving them a simple checklist that explicitly lists what they are skimming for.

What Am I Skimming For? (An Expert’s Checklist)

  • Specific Statute Numbers (e.g., O.R.C. 5321.16)
  • Key Case Names (e.g., Bowen v. Kil-Kare, Inc.)
  • Key Phrases / Terms of Art (e.g., “natural accumulation rule”)
  • Key Secondary Sources (e.g., “as mentioned in Prosser and Keeton on Torts”)

3. The “Find the Pivot” Interactive Exercise

My main in-class exercise is no longer a complex problem. It’s a highly scaffolded, 5-minute task focused only on this one skill.

  • The Task: I’ll give them an AI-generated answer. In pairs, their goal is not to find the “answer.” Their goal is to find the “toe-hold.”
  • The Prompt: “You have 3 minutes. Scan this document and find the one statute, one case, or one key phrase you would use to ‘escape the tunnel.’ Be prepared to tell me where you would pivot to next (e.g., ‘the Notes of Decisions’ or ‘KeyCite’).”

This approach re-centers our value. We’re teaching students how to build a comprehensive research process, and that AI is just one tool in that toolbox.


How are you teaching this “pivot”? What other “expert blind spots” have you run into when teaching AI? I’d love to hear your thoughts in the comments.

Effortless Boolean: A Free Tool to Supercharge Your Legal Research

As anyone who has taught legal research knows, Boolean searching is a superpower. The ability to craft a precise query with terms and connectors is the difference between finding a needle in a haystack and finding nothing at all. But for newcomers, the syntax of ( ), !, /p, and /s can feel like learning a new language under pressure.

The Legal Boolean Search Builder is built directly on a process I’ve been teaching for a while now—an 8-step method designed to take the guesswork out of query construction. It moves from identifying key concepts, to brainstorming alternates, and finally to connecting them with the right syntax.
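That final connecting step is mechanical enough to automate. As a rough sketch of the idea, alternates within a concept get OR’d inside parentheses, and the concepts get joined with AND (the tool itself is vanilla JavaScript; this Python version and its sample terms are just for illustration):

```python
def build_query(concepts):
    """Each concept is a list of alternate terms. Alternates are OR'd
    inside parentheses; the concept groups are joined with AND."""
    groups = []
    for terms in concepts:
        joined = " OR ".join(terms)
        groups.append(f"({joined})" if len(terms) > 1 else joined)
    return " AND ".join(groups)

build_query([
    ["landlord", "lessor"],
    ['"security deposit"'],
    ["return!", "refund!"],
])
# → '(landlord OR lessor) AND "security deposit" AND (return! OR refund!)'
```

The researcher supplies the concepts and alternates; the tool handles the parentheses and connectors.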

For years, I’ve shared this process in slide decks, but it’s always been static. I wanted to turn it into something dynamic—a tool that could handle the syntax so that researchers could focus on the strategy.

A screenshot of the Legal Boolean Search Builder, as described in the rest of this post, and available at https://booleanbuilder.replit.app/

The Building Process: An Iterative Approach

I built this project using Gemini’s Canvas, so it may look familiar to Gemini users. It uses HTML, Tailwind CSS for styling, and vanilla JavaScript for all the interactive logic. No complex frameworks, no dependencies—just a single file you can open in any browser. I then put it in a GitHub repo and imported it to Replit so I could host it there.

This came together in a few hours, so I’m sure there are further tweaks and improvements I could make. I’m immensely grateful to Charlie Amiot and Debbie Ginsberg for their sharp insights and invaluable suggestions that took the tool from a basic concept to a polished, user-friendly application.

Finally, this project was significantly influenced by an amazing fillable PDF created by Dan Kimmons and Tara Mospan. Dan described his process for going from worksheet to fillable PDF in these very pages a few years ago.

How It Works: Key Features

The core idea is to break down the complex task of writing a Boolean query into manageable steps.

1. The Two-Column Layout

The user interface is split into two main sections. On the left, you build your concepts step-by-step. On the right, you see your search string come to life in real-time, along with a helpful review checklist. This instant feedback loop is key to the learning process.

2. Smart Suggestions for Phrases

One of the biggest hurdles for new researchers is knowing when to use an exact phrase search (e.g., "assumption of risk") versus a more flexible proximity search. The tool helps by automatically suggesting a proximity search, filtering out common stop words to focus on the core terms.
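A minimal sketch of that suggestion logic, assuming a small stop-word list and a Westlaw-style /3 connector (the real tool is JavaScript; this word list is abbreviated for illustration):

```python
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "to", "and", "or"}

def suggest_proximity(phrase, distance=3):
    """Drop common stop words, then connect the remaining core terms
    with a within-N-words proximity connector."""
    core = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    return f" /{distance} ".join(core)

suggest_proximity("assumption of the risk")  # → "assumption /3 risk"
```

The proximity version will catch phrasings like “assumption of risk” or “assumed the risk of” that an exact-phrase search would miss.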

3. The Truncation Builder

Finding the correct word root for truncation can be tricky. Is it assum! or assump!? To solve this, I added a “Truncation Builder” modal. You can enter all the variations of a word you can think of, and the tool finds the common root, providing you with the most effective truncated term to copy and use.
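Under the hood, finding the root is a longest-common-prefix problem. A minimal sketch (Python's standard library even ships a helper for it; the tool itself does this in JavaScript):

```python
import os.path

def common_root(variations):
    """Return the truncated search term: the longest prefix shared by
    all the variations, plus the ! root-expander."""
    root = os.path.commonprefix([v.lower().strip() for v in variations])
    return root + "!" if root else ""

common_root(["assume", "assumption", "assumed"])  # → "assum!"
```

Type in every variation you can think of, and the shared root falls out automatically.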

Try It Yourself

This project was a fantastic experience in turning a teaching methodology into a living tool. The goal was never to replace the critical thinking that goes into legal research, but to remove the syntactic barriers that can get in the way.

You can try the tool out for yourself and view the source code on GitHub. I’d love to hear your feedback!

Vibe-Coding Instruction: I Made a Boolean Minigame In 30 Minutes

I’ve been thinking a lot lately about how to bring more interactivity and immediacy into legal research instruction—especially for those topics that never quite “click” the first time. One idea that’s stuck with me is vibe-coding (see Sam Harden’s recent piece on vibecoding for access to justice). The concept, loosely put, is about using code to quickly build lightweight tools that deliver a very specific, helpful experience—often more intuitive than polished, and always focused on solving a narrow, real-world problem.

That framing resonated with me as both an educator and a librarian. In particular, it got me thinking about Boolean searching—an area where students routinely struggle. Even in 2025, Boolean logic remains foundational to legal research: tools like Westlaw and Lexis still have features, such as “search within” and field searching, that require familiarity with Boolean syntax. But despite its importance, it can feel abstract and mechanical when taught through static examples or lectures.

So I tried a bit of vibe-coding myself. I built a small, interactive Boolean search game using the Canvas feature in Google Gemini 2.5—it’s a simple web-based activity that gives users a chance to experiment with constructing Boolean expressions and get real-time feedback. It only took about 30 minutes to get a solid version running, and even in that rough form, it worked. The immediate engagement helps clarify the logic in a way that static examples rarely do. You can check it out and play here: https://gemini.google.com/share/436f0db98cef

Screenshot of a "Boolean Search Basics Game" interface. The top section titled "How to Play" explains how to use Boolean search operators:

    AND for documents containing all terms.

    OR for documents containing at least one term.

    NOT to exclude terms.

    Parentheses for grouping.

    Quotes for exact phrases.

    W/N for proximity within N words.

    /P for terms in the same paragraph.

Below the instructions is "Level 1: Using AND", which asks the user to find documents that contain both "apple" and "pie". A text box is provided for entering a Boolean query, with buttons labeled "Run Search" and "Reset Level".

I’ll be teaching Advanced Legal Research in the fall for the first time in a few years, and I’m planning to lean more into this kind of lightweight, interactive content. These micro-tools don’t have to be elaborate to be effective, and they can go a long way toward reinforcing concepts that students often struggle with in more traditional formats.

Have an idea for a micro-tool to use in teaching? They’re easy, fun, and a little addicting to make. You’ll just need access to the paid version of ChatGPT, Claude, or Gemini. (You can also experiment with AI coding assistants like Replit or Bolt.New. Both have limited free versions.) Provide your idea, perhaps some additional context in the form of a file or webpage, and you’re off to the races. My prompt that resulted in a working version of this Boolean game was literally just “Make an interactive game that will help researchers understand the basics of Boolean Search,” and I attached some slides I’ve previously used to teach the topic.

If you build something or you have an idea I’d love to hear about it!

Announcing the AI Law Librarians Prompt Library

We’re excited to announce a new resource for our community: the AI Law Librarians Prompt Library, a place for law librarians (and the legal community at large) to share and collect useful prompts.

Explore the Prompt Library

Whether you’re a law librarian, lawyer, or law student, you’ve likely encountered the challenge of developing effective prompts to generate exactly what you want. This blog has even covered the topic several times. Getting it right can be tricky and, when you do, you want to be sure to remember it for next time (and share it with your friends). That’s where this library comes in.

Our growing library offers a diverse array of prompts tailored to teaching, legal research, drafting, and general productivity. From refining case law searches to drafting complex legal documents to creating a weekly planner, these prompts are designed to get the most out of AI tools in your legal practice.

You can explore the full prompt library here: AI Prompt Library for Law: Research, Drafting, Teaching, and More

Contribute to the Library

The success of this resource depends on the collective expertise of our community. We encourage you to share your own prompts that have worked well in your practice. Have a prompt that’s produced particularly insightful results, or that you find yourself returning to over and over again? Share it with us and help your colleagues enhance their own workflows.

Submit your prompt through our simple form below. Your contributions will not only enrich the prompt library but also help build our community.

Ghost in the Machine

Today’s guest post comes from Debbie Ginsberg, Faculty Services Manager at Harvard Law School Library.

I was supposed to write a blog post about the Harvard AI summit about six months ago. For various reasons (e.g., “didn’t get my act together”), that hasn’t happened. But one of the things that was brought up at the summit was who wasn’t at the table—who didn’t have access, whose data wasn’t included, and similar issues.

Since then, I’ve been thinking about the haves and have-nots of AI. There’s one group that I don’t think gets discussed enough: the giant human workforce that AI needs to function.

Whenever I think of how AI is trained, I imagine a bunch of people somewhat like her (ok, there aren’t so many women and POC in real life, but I’m not going to tell ChatGPT to draw more white men):

And that they’ve been working on processes that look somewhat like this:

But that’s only part of the picture.  Underlying all these processes are people like this:

Who are they?

Large AI companies like OpenAI and Google need people to label data, refine data, and handle content moderation. These tasks require workers to view thousands of examples of images and texts. To say, “This is a cat,” “The AI got this right,” or “This is not offensive.” And then do this over and over again. These are the “ghost workers” behind the machine. Without them, AI doesn’t function.

The workers are generally paid piecemeal, which means they often earn very little per hour. For example, some reports claim that OpenAI paid workers in Kenya under $2 per hour to filter questionable content.

The working conditions are not optimal, especially when the workers are reviewing content. The workers generally do not receive sufficient training or time to do the work they are asked to do. The workers may work directly for an AI company, or those companies may use a third-party company like Appen to hire and manage ghost workers (Google used Appen until March 19, having announced the termination of their contract earlier in the year).

That said, this work is an essential source of income for many around the world. The jobs are relatively flexible as to location and time, and the workers take pride in their output. 

As AI continues to grow, there has been more focus on improving working conditions.  For example, the US has launched investigations into some of the large tech companies in response to concerns about how ghost workers are treated.  And while some AI experts predict that AI will eventually be able to do this work itself, many others believe that AI will continue to depend on ghost workers for a long time to come.

And considering how much profit is at stake, I’m thinking that maybe they should be paid more than $2/hour.

Footnote:

Did I use AI to write this?  Kind of?  I used Google’s NotebookLM tool to review my sources and create notes.  In addition to the sources above, check out:

RAG Systems Can Still Hallucinate

A few months ago, a law professor posted on Twitter about a hallucination he observed in Lexis+ AI. He asked “What cases have applied Students for Fair Admissions, Inc. v. Harvard College to the use of race in government decisionmaking?” The answer from Lexis+ AI included two hallucinated cases. (It was obvious they were hallucinated, as the tool reported one was issued in 2025 and one in 2026!)

The image shows a screenshot of a digital text document. The text within the image is as follows:

Question: What cases have applied Students for Fair Admissions, Inc. v. Harvard College to other areas of government decisionmaking?

Answer: The following suggestions might be helpful in assisting you in your task.

The Supreme Court case Students for Fair Admissions, Inc. v. President and Fellows of Harvard College § 8.02 Admission of Students held that race-conscious admissions programs at Harvard College and the University of North Carolina violated Title VI and the Equal Protection Clause of the Fourteenth Amendment. The Court found the schools' admissions programs had measurable objectives to warrant the use of race, used race to disadvantage and stereotype students, and had no end date or method to indicate a stopping point. Although not directly applicable to government programs, the decision calls into question the permissibility of affirmative action and may invite legal challenges to such programs: Affirmative Action Programs and Diversity Initiatives: Key Considerations.

Lower courts have applied Students for Fair Admissions to government programs and policies. In Doe v. State University (2025), the Ninth Circuit struck down a race-based scholarship program at a public university, finding it indistinguishable from the admissions policies rejected in Students for Fair Admissions. In Smith v. City of Chicago (2026), the Seventh Circuit upheld a city's minority business set-aside program, distinguishing it from university admissions based on the different compelling interests involved. The Supreme Court has granted certiorari in Johnson v. California Department of Corrections, a case challenging a prison system's use of race in inmate housing assignments, teeing up another major ruling on affirmative action: Students for Fair Admissions § 112.14 Title VI of Civil Rights Act of 1964.

Lexis responded, stating this was an anomalous result, but that only statements with links can be expected to be hallucination-free, and that “where a citation does not include a link, users should always review the citation for accuracy.”

Why is this happening?

If you’ve been following this blog, you’ve seen me write about retrieval-augmented generation, one of the favorite techniques of vendors to reduce hallucinations. RAG takes the user’s question and passes it (perhaps with some modification) to a database. Relevant passages or snippets are then identified in the database results, and those snippets are sent to the model as “context” along with the user’s question.
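The shape of that pipeline can be sketched in a few lines. This toy version substitutes keyword overlap for real retrieval and stubs out the model call entirely (the corpus snippets are invented); the point is the structure, not a working system:

```python
# A toy "database" of snippets standing in for a case-law corpus
CORPUS = {
    "Mata v. Avianca": "Sanctions for citing nonexistent cases generated by AI.",
    "Bowen v. Kil-Kare": "Release of liability at a racetrack under Ohio law.",
}

def retrieve(question, corpus, k=1):
    """Crude stand-in for retrieval: rank snippets by word overlap."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(question, corpus):
    """The RAG shape: retrieve real snippets, then hand them to the model.
    A real system sends this prompt to an LLM; the LLM can still
    misread or misstate the retrieved context."""
    context = retrieve(question, corpus)
    return f"Context: {context}\nQuestion: {question}"

answer("What sanctions follow from citing nonexistent cases?", CORPUS)
```

Notice what the pipeline guarantees and what it doesn’t: the retrieved snippets are real, but nothing constrains how faithfully the generation step summarizes them.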

However, as I said then, RAG cannot eliminate hallucinations. RAG will ground the response in real data (case law, pulled from the database and linked in the response), but the generative AI’s summary of that real data can still be off.

Another example – Mata v. Avianca is back

I’ve observed this myself when working with Lexis+ AI. For example, I asked Lexis+ AI a fairly complex question at the intersection of bankruptcy law and international law: “Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention”.

Lexis+ AI returned a summary of the law, citing Mata v. Avianca for the point that “the filing of a bankruptcy petition can toll the Montreal Convention’s two year limitations period, which does not begin to run until the automatic stay is lifted.”

The image contains a screenshot of a text document, which seems to be an interface from a legal database or research platform with an option to "Select Jurisdiction" at the top right corner. The document reads as follows:

Question: Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention

Response:

The Montreal Convention's limitations period can be tolled during the pendency of bankruptcy proceedings. This is based on the Bankruptcy Code, which states that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Montreal Convention, as they have done under the Warsaw Convention, which is the precursor to the Montreal Convention. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. Therefore, the filing of a bankruptcy petition can toll the Montreal Convention's two-year limitations period, which does not begin to run until the automatic stay is lifted [Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263]. However, it's worth noting that there is a split of authority on this issue. Some courts have held that the Montreal Convention's two-year period is a strict condition precedent to the bringing of a claim, as opposed to a limitations period. Under this view, which does not treat tolling principles Rodriguez v. Asa, 2023 U.S. Dist. LEXIS 148451, Mata v. Avianca, Inc. 2023 U.S. Dist. LEXIS 108261, Kasalyn v. Delta Air Lines, Inc., 2023 U.S. Dist. LEXIS 154302.

If the case name Mata v. Avianca sounds familiar to you, it’s probably because this is the case that landed two New York attorneys on the front page of the New York Times last year for citing hallucinated cases. The snippet from Lexis+ AI, though citing Mata, in fact appears to be summarizing those hallucinated cases (recounted in Mata), which stated the law exactly backwards.

When to beware

A few things to notice about the above examples, which give us some ideas of when to be extra-careful in our use of generative AI for legal research.

  • Hallucinations are more likely when you are demanding an argument rather than asking for the answer to a neutrally phrased question. This is what happened in my Lexis+ AI example above, and is actually what happened to the attorneys in Mata v. Avianca as well – they asked for an argument to support an incorrect proposition of law rather than a summary of the law. A recent study of hallucinations in legal analysis found that these so-called contra-factual hallucinations are disturbingly common across many LLMs.
  • Hallucinations can occur when the summary purports to be of the cited case, but is actually a summary of a case cited within that case (and perhaps not characterized positively). You can see this very clearly in further responses I got summarizing Mata v. Avianca, which purport to be summarizing a “case involving China Southern” (again, one of the hallucinated cases recounted in Mata).
Response:
In the case of Mata v. Avianca, Inc., the court held that the limitations period under the Montreal Convention was tolled during the pendency of the Bankruptcy Court proceedings. The court reasoned that the Bankruptcy Code provides that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The court had previously held that the automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Warsaw Convention, which is the precursor to the Montreal Convention. The court saw no reason why the same rule should not apply under the Montreal Convention Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.

In another case involving China Southern, the court agreed with the district court's ruling that the filing of Varghese's Chapter 13 petition tolled the two-year limitations period under the Montreal Convention. The court found that a debtor need not file a claim in bankruptcy to benefit from the automatic stay. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. The Montreal Convention's limitations period is a "period of prescription," rather than a "statute of limitations," but the difference between a "period of prescription" and a "statute of limitations" does not affect the automatic stay's tolling effect Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.
  • Finally, hallucinations are also more likely when the model has very little responsive text to go on. The law professor’s example involved a recent Supreme Court case that likely had not been applied many times. Additionally, Lexis+ AI does not seem to work well with questions about Shepard’s results – it may not be connected in that way yet. So, with nothing to really go on, it is more prone to hallucination.

Takeaway tips

A few takeaway tips:

  • Ask your vendor which sources are included in the generative AI tool, and only ask questions that can be answered from that data. Don’t expect generative AI research products to automatically have access to other data from the vendor (Shepard’s, litigation analytics, PACER, etc.), as that may take some time to implement.
  • Always read the cases for yourself. We’ve always told students not to rely on editor-written headnotes, and the same applies to AI-generated summaries.
  • Be especially wary if the summary refers to a case not linked. This is the tip from Lexis, and it’s a good one, as it can clue you in that the AI may be incorrectly summarizing the linked source.
  • Ask your questions neutrally. Even if you ultimately want to use the authorities in an argument, better to get a dispassionate summary of the law before launching into an argument.

A disclaimer

These tools are constantly improving, and the vendors are very open to feedback. I was not able to reproduce the error recounted at the beginning of this post; presumably the underlying issue has been addressed by Lexis. The Mata v. Avianca errors still remain, but I have provided feedback on them, and I expect they will be corrected quickly.

The purpose of this post is not to tell you that you should never use generative AI for legal research. I’ve found Lexis+ AI helpful on many tasks, and students especially have told me they find it useful. There are several other tools out there that are worth evaluating as well. However, we should all be aware that these hallucinations can still happen, even with systems connected to real cases, and that there are ways we can interact with the systems to reduce hallucinations.

Shifting Sands: Ethical Guidance for AI in Legal Practice

Generative AI has been with us for only a year, and we’ve already seen several lawyers make some big blunders trying to use it in legal practice. (Sean Harrington has been gathering them here). Trying to get ahead of the problem, bar associations across the country have appointed task forces, working groups, and committees to consider whether ethical rules should be revised. Although the sand will continue to shift under our feet, this post attempts to summarize the ethical rules, guidance, and opinions related to generative AI that are either already issued or forthcoming. The post will be updated as new rules are issued.

Image generated by DALLE-3, showing Matrix-style code flowing over the shifting sands of a desert. A sandstorm looms.

California COPRAC Best Practices

On November 16, 2023, the California State Bar Board of Trustees approved its Practical Guidance for the Use of Generative Artificial Intelligence in the Practice of Law. The document was initially created by the Committee on Professional Responsibility and Conduct. Unlike ethics opinions or formal rules, which tend to be more prescriptive and specific, this document serves as a guide, offering insights and considerations for lawyers as they navigate the new terrain of AI in legal practice. It is organized by duties, with practical considerations for each, and addresses the duty of confidentiality, duties of competence & diligence, duty to supervise, duty of candor, disclosure to clients, charging clients for work produced by generative AI, and more.

Florida Bar Advisory Opinion

On January 19, 2024, the Florida Bar issued its Advisory Opinion 24-1, regarding lawyers’ use of generative AI. The opinion discusses the duty of confidentiality, oversight of AI, the impact on legal fees and costs, and use in lawyer advertising.

New Jersey Supreme Court

On January 24, 2024, the New Jersey Supreme Court issued its Preliminary Guidelines on New Jersey Lawyers’ Use of Artificial Intelligence. The guidelines highlight the importance of accuracy, truthfulness, confidentiality, oversight, and the prevention of misconduct, indicating that AI does not alter lawyers’ core ethical responsibilities but necessitates careful engagement to avoid ethical violations.

Judicial Standing Orders

Soon after the infamous ChatGPT error in Mata v. Avianca, judges began to issue orders limiting the use of generative AI, requiring disclosure of its use, or requiring that filings be checked for accuracy. To date, at least 24 federal judges and at least one state court judge have issued standing orders.

Fifth Circuit’s Proposed Rule

The United States Court of Appeals for the Fifth Circuit recently solicited comments on its proposed new rule requiring certification as to the use of generative AI. It is the first federal appeals court to consider such a rule.

Judicial Ethics Opinions

Finally, in some jurisdictions, ethical bodies have looked beyond the use of generative AI by lawyers, and have given guidance on how judges can and should use generative AI.

On October 27, 2023, the State Bar of Michigan issued an opinion emphasizing judicial officers’ ethical obligation to maintain competence with advancing technology, including artificial intelligence, and highlighting the need for ongoing education and ethical evaluation of AI’s use in judicial processes.

Also in October 2023, the West Virginia Judicial Investigation Commission issued Advisory Opinion 2023-22, opining that judges may use artificial intelligence for research but not to determine case outcomes.

Resources