Audit Trails for AI in Legal Research

LLMs have come a long way even in the time since I wrote my article in June.  Three months of development time with this technology feels like three years – or maybe that’s just me catching up.  Despite that, there are still a couple of nagging issues that I would like to see implemented to improve their usage to legal researchers.  I’m hoping to raise awareness about this so that we can collectively ask vendors to add quality-of-life features to these tools for the benefit of our community. 

Audit Trails

Right now the tools do not have a way for us to easily check their work.  Law librarians have made a version of my argument for over a decade now. ‌The legendary Susan Nevelow Mart famously questioned the opacity of search algorithms in legal research and evaluated their impact on legal research.  More recently, I was in the audience at AALL2023 when the tenacious and brilliant Debbie Ginsburg from Harvard asked Fastcase, BLaw, Lexis, and Westlaw how we (law librarians) could evaluate the inclusivity of the dataset of cases that the new AI algorithms are searching.  How do we know if they’ve missed something if we don’t know what they’re searching and how complete it is?

As it stands, the legal research AI that I’ve demoed do not give you a summary of where they have gone and what they have done.  An “audit trail” (as I’m using this expression) is a record of which processes were used to achieve a specific task, the totality of the dataset, and why they chose the results to present to the user. This way if something goes wrong, you can go back and look at what steps were taken to get the results. This would provide an extra layer of security and confidence in the process.

Why Do We Need This?

These tools have introduced an additional layer of abstraction that separates legal researchers from the primary documents they are studying, altering how legal research is conducted. While the new AI algorithms can be seen as a step forward, they can undermine the precision that boolean expressions once offered, which allowed researchers to predict the type of results they would encounter with more certainty. Coverage maps are still available to identify gaps in the data for some of these platforms but, there is a noticeable shift towards less control over the search process, calling for a thoughtful reassessment of the evolving dynamics in legal research techniques.  

More importantly, we (law librarians) are deep enough into these processes and technology to be highly skeptical and evaluate the output with a critical eye.  Many students and new attorneys may not.  I have told this story at some of my presentations but a recent graduate called me with a Pacific Reporter citation for a case that they could not find on Westlaw.  This person was absolutely convinced that they were doing something wrong and had spent around an hour searching for this case because “this was THE PERFECT case” for their situation.  It ended up being a fabrication from ChatGPT but the alumni had to call me to discover that.  This is obviously a somewhat outdated worry, since Rebecca Fordon has gamed all of us up on the steps being taken to reduce hallucinations (and OpenAI got a huge amount of negative publicity from the, now infamous, ChatGPT Lawyer). 

My point is less about the technology and more about the incentives set in place – if there is a fast, easy way to do this research then there will inevitably be people who are going to uncritically accept those results.  “That’s their fault and they should get in trouble,” you say?  Probably, but I plan to write about the duty of technological competency and these tools in a future post, so we’ll have to hash that out together later.  Also, what if there was a fast, easy way to evaluate the results of these tools…

What Could Be Done

Summarizing the steps involved in research seems like it would be a feasible task for Westlaw, Lexis, Blaw, et al. to implement.  They already have to use prompting to tell the LLM where to go and how to search; we’re just asking for a summary of those steps to be replicated somewhere so that we can double-check it.  Could they take that same prompting and place a prompt around that says something to the effect of, “Summarize the steps taken in bullet points” and then place that into a drop-down arrow so that we could check it?  Could they include hyperlinks to coverage maps in instances where it would be useful to the researcher to know how inclusive the search is?  In instances where they’re using RAG, could they include a prompt that says something to the effect of, “Summarize how you used those underlying documents to generate this text?” 

As someone who has tinkered with technology, all of these seem like reasonable requests that are well within the ability of these tools. I’m interested to hear if there are reasons why we couldn’t have these features or if people have other features they would like. Please feel free to post your ideas in the comments or email me.