LLMs have come a long way even in the time since I wrote my article in June. Three months of development with this technology feels like three years (or maybe that's just me catching up). Despite that, there are still a couple of nagging issues I would like to see addressed to improve these tools' usefulness to legal researchers. I'm hoping to raise awareness about this so that we can collectively ask vendors to add quality-of-life features for the benefit of our community.
Audit Trails
Right now the tools do not have a way for us to easily check their work. Law librarians have made a version of my argument for over a decade now. The legendary Susan Nevelow Mart famously questioned the opacity of search algorithms on legal research platforms and evaluated their impact on results. More recently, I was in the audience at AALL 2023 when the tenacious and brilliant Debbie Ginsburg from Harvard asked Fastcase, BLaw, Lexis, and Westlaw how we (law librarians) could evaluate the inclusivity of the dataset of cases that the new AI algorithms are searching. How do we know if they've missed something if we don't know what they're searching and how complete it is?
As it stands, the legal research AI tools that I've demoed do not give you a summary of where they have gone and what they have done. An "audit trail" (as I'm using this expression) is a record of which processes were used to achieve a specific task, the totality of the dataset, and why the tool chose the results it presented to the user. This way, if something goes wrong, you can go back and look at what steps were taken to get the results. This would provide an extra layer of security and confidence in the process.
Why Do We Need This?
These tools have introduced an additional layer of abstraction that separates legal researchers from the primary documents they are studying, altering how legal research is conducted. While the new AI algorithms can be seen as a step forward, they can undermine the precision that Boolean expressions once offered, which allowed researchers to predict the type of results they would encounter with more certainty. Coverage maps are still available to identify gaps in the data for some of these platforms, but there is a noticeable shift towards less control over the search process, calling for a thoughtful reassessment of the evolving dynamics in legal research techniques.
More importantly, we (law librarians) are deep enough into these processes and this technology to be highly skeptical and evaluate the output with a critical eye. Many students and new attorneys may not be. I have told this story at some of my presentations: a recent graduate called me with a Pacific Reporter citation for a case that they could not find on Westlaw. This person was absolutely convinced that they were doing something wrong and had spent around an hour searching for this case because "this was THE PERFECT case" for their situation. It ended up being a fabrication from ChatGPT, but the graduate had to call me to discover that. This is obviously a somewhat outdated worry, since Rebecca Fordon has brought us all up to speed on the steps being taken to reduce hallucinations (and OpenAI got a huge amount of negative publicity from the now-infamous ChatGPT Lawyer).
My point is less about the technology and more about the incentives set in place – if there is a fast, easy way to do this research then there will inevitably be people who are going to uncritically accept those results. “That’s their fault and they should get in trouble,” you say? Probably, but I plan to write about the duty of technological competency and these tools in a future post, so we’ll have to hash that out together later. Also, what if there was a fast, easy way to evaluate the results of these tools…
What Could Be Done
Summarizing the steps involved in research seems like it would be a feasible task for Westlaw, Lexis, BLaw, et al. to implement. They already have to use prompting to tell the LLM where to go and how to search; we're just asking for a summary of those steps to be replicated somewhere so that we can double-check it. Could they take that same prompting and wrap it in another prompt that says something to the effect of, "Summarize the steps taken in bullet points," and then place that into a drop-down arrow so that we could check it? Could they include hyperlinks to coverage maps in instances where it would be useful to the researcher to know how inclusive the search is? In instances where they're using RAG, could they include a prompt that says something to the effect of, "Summarize how you used those underlying documents to generate this text"?
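As a rough sketch of what that wrapping might look like in code (the `call_llm` function, the prompt wording, and the return shape are all hypothetical stand-ins, not any vendor's actual API):

```python
# Sketch: wrapping an existing research prompt so the model also produces
# an "audit trail" summary that the interface could tuck behind a
# drop-down arrow. `call_llm` is a placeholder for whatever model call
# the vendor actually makes.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call the vendor's model here.
    return "- Searched the state case law database\n- Ranked results by citation count"

AUDIT_SUFFIX = (
    "\n\nAfter answering, summarize the steps you took in bullet points: "
    "which sources you searched, how you ranked the results, and why you "
    "chose the documents you cited."
)

def run_with_audit_trail(research_prompt: str) -> dict:
    """Run the normal research prompt, then request a step-by-step summary."""
    answer = call_llm(research_prompt)
    audit = call_llm(research_prompt + AUDIT_SUFFIX)
    return {"answer": answer, "audit_trail": audit}

result = run_with_audit_trail("Find cases on adverse possession in Ohio.")
```

The audit trail here is just a second, cheap model call layered on the prompting the vendor already does, which is why this seems like a modest ask.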
As someone who has tinkered with technology, all of these seem like reasonable requests that are well within the ability of these tools. I’m interested to hear if there are reasons why we couldn’t have these features or if people have other features they would like. Please feel free to post your ideas in the comments or email me.
I agree that these tools "add a layer of abstraction" which we should be careful not to accept at face value. Of course, there's also value in understanding how these tools work. But are we confusing research with search? I never actually used Fastcase's advanced search tool, which allowed us to adjust their relevancy algorithm. Did anyone? And even if I understood what an AI-based audit trail might be trying to tell me, wouldn't it have to describe a machine learning function, not a rules-based function? If so, would the audit trail address any of my research concerns?

This is a very limited application of AI for lawyering. With search, I'm mostly looking for the AI to surface highly relevant content so I can move on to the next phase of my research. I get that that's not everyone's purpose, and these memo-writing tools are cool, but those aren't research applications, and I would argue that their users care about accuracy (or the tool's reputation for accuracy), not process. That's because if you have to verify every memo, you lose the efficiency gains of the AI.

Researchers have to demonstrate methodology. There may be fewer of us as AI tools become more trustworthy, but we're not losing keywords, taxonomies, citation networks, indexes, and all the other stuff researchers rely on. If we lose those tools, then we have a transparency problem. Otherwise, it's a user issue.
I would push back because I think that you get two very valuable pieces of information from the audit trail:
1) You know where it has gone so you can spend less time doing your deep-dive if you need to do it (for search and retrieval).
2) You learn the motivation, perspective, and potential bias of the AI in the case where you’re using it for RAG.
Imagine if the RAG prompt said something along the lines of, “I used that document instead of this document because that author has a higher citation count” but you knew that it was an author speaking outside their area of expertise.
What if the prompt said something like, “I chose that document because it represents a consensus of the majority opinion in that area of law” but you were arguing a minority viewpoint? Wouldn’t that be exceptionally useful information to you?
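A minimal sketch of how a RAG pipeline could surface that kind of justification, assuming hypothetical document metadata and prompt wording (none of this reflects any vendor's real implementation):

```python
# Sketch: composing a RAG prompt that asks the model to justify its use of
# each retrieved document, so a researcher can spot things like an author
# speaking outside their expertise or a consensus bias. All field names
# and prompt text are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    title: str
    author: str
    citation_count: int


def build_provenance_prompt(question: str, docs: list[RetrievedDoc]) -> str:
    """Compose a prompt requesting both an answer and a per-source rationale."""
    sources = "\n".join(
        f"- {d.title} ({d.author}, cited {d.citation_count} times)"
        for d in docs
    )
    return (
        f"Question: {question}\n\nSources:\n{sources}\n\n"
        "Answer the question, then explain for each source why you did or "
        "did not rely on it (e.g., citation count, consensus vs. minority view)."
    )


docs = [RetrievedDoc("Adverse possession treatise excerpt", "A. Author", 120)]
prompt = build_provenance_prompt("Is the minority rule viable here?", docs)
```

Exposing the citation count and the rationale side by side is what would let a researcher notice, say, a heavily cited author being weighted on a topic outside their expertise.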
If "Every Algorithm Has a POV," as Susan Nevelow Mart famously said, then every AI algorithm has a whole personality and perspective, making it even less trustworthy (and more worthy of interrogation). You could interrogate the AI manually about each element of this process, but it seems easier to generate a little report that you can view to alleviate many of these concerns right out of the gate. I think we'll get to a point in law where these algorithms will give you a range of viewpoints and this will be less necessary, but right now they're still replicating a lot of what they're trained on.
Thanks for your thoughtful response, Sean. I should say, I’m all for a better understanding of an algorithm’s POV. I don’t mean to say we should abandon that as a project. I also neglected to point out that there are plenty of places where traditional research methods aren’t typically efficient enough. I’m thinking about something like Fiscal Note’s capacity to analyze sentiment of public comments submitted as responses to proposed federal regs. That’s not something I’m going to ‘manually’ confirm unless the stakes are high enough. Plus, not all texts we rely on to ‘perform justice’ are legal texts. So an audit trail would be valuable if it’s trustworthy. But I sorta suspect — to your last point — AI tools will likely learn (and perhaps expose) my bias before I learn theirs.