Unlocking the Power of Semantic Searches in the Legal Domain

The language of law has many layers. Legal facts are more than objective truths; they tell the story and ultimately decide who wins or loses. A statute can have multiple interpretations, and those interpretations depend on factors like the judge, context, purpose, and history of the statute. Legal language has distinct features, including rare legal terms of art like “restrictive covenant,” “promissory estoppel,” “tort,” and “novation.” This complex legal terminology poses challenges for normal semantic search queries.  

Vector databases represent an exciting new trend, and for good reason. Rather than relying on traditional Boolean logic, semantic search leverages word associations by creating embeddings and storing them in a vector database. In machine learning and natural language processing, embeddings represent words or sentences as dense vectors of real numbers in a continuous vector space. This numerical representation of text is typically generated by a model that tokenizes the text and learns embeddings from the data, so that the vectors capture the contextual and semantic meaning of the text. When a user makes a semantic query, the search system works to interpret their intent and context. The system breaks the query into tokens, converts them into a vector representation using an embedding model, compares that vector with the stored document vectors, and returns results ranked by similarity. Unlike Boolean search, which requires specific syntax (“AND”, “OR”, etc.), semantic search allows queries in natural language and opens up a whole new world of potential, because searches are no longer constrained by exact matching of text.
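To make the ranking step concrete, here is a minimal sketch of similarity-based retrieval. Everything in it is an assumption for illustration: the three-dimensional vectors in TOY_VECTORS are hand-made stand-ins for what a trained embedding model would produce, and embed, cosine_similarity, and semantic_search are hypothetical helper names, not any vendor's API.

```python
import math

# Hypothetical hand-made "embeddings". A real system would get these from a
# trained model. The dimensions loosely stand for contract, injury, property.
TOY_VECTORS = {
    "contract": [0.9, 0.1, 0.1],
    "agreement": [0.8, 0.1, 0.2],
    "promise": [0.7, 0.2, 0.1],
    "injury": [0.1, 0.9, 0.1],
    "negligence": [0.2, 0.8, 0.1],
    "land": [0.1, 0.1, 0.9],
    "covenant": [0.5, 0.0, 0.7],
}


def embed(text):
    """Average the toy vectors of the known words; unknown words are ignored."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query, documents):
    """Return (score, document) pairs, most similar first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(d)), d) for d in documents]
    return sorted(scored, reverse=True)


DOCUMENTS = [
    "breach of contract and broken promise",
    "negligence causing personal injury",
    "restrictive covenant on land",
]

results = semantic_search("agreement", DOCUMENTS)
for score, document in results:
    print(f"{score:.2f}  {document}")
```

Note that the query word “agreement” appears in none of the documents, yet the contract document ranks first because its vector points in nearly the same direction. That is the behavior exact-match Boolean search cannot provide.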

However, legal language differs from everyday language. The large number of technical terms, the careful precision, and the fluid interpretations inherent in law mean that semantic search systems may fail to grasp the context and nuances of legal queries. The interconnected and evolving nature of legal concepts poses challenges in neatly mapping them into an embedding space representation. One potential way to improve semantic search in the legal domain is by enhancing the underlying embedding models. Embedding models are often trained on generalized corpora like Wikipedia, giving them a broad but shallow understanding of law. This surface-level comprehension proves insufficient for legal queries, which may seem simple but have layers of nuance. For example, when asked to retrieve the key facts of a case, an embedding model might struggle to discern what facts are relevant versus extraneous details.  

The model may also fail to distinguish between majority and dissenting opinions because it lacks the legal background needed to make such differentiations. Training models on domain-specific legal data represents one promising approach to overcoming these difficulties. By training on in-depth legal corpora, embeddings could better capture the subtleties of legal language, ideas, and reasoning. For example, Legal-BERT (BERT stands for Bidirectional Encoder Representations from Transformers) was introduced alongside the CaseHOLD dataset and pre-trained on the Harvard Law case corpus. The size of this corpus (37GB) is large, representing 3,446,187 legal decisions across all federal and state courts, and it is larger than the BookCorpus/Wikipedia corpus originally used to train the BERT model. When tested on LexGLUE, a benchmark dataset for evaluating the performance of NLP methods on legal tasks, Legal-BERT performed better than ChatGPT.

Semantic search shows promise for transforming legal research, but realizing its full potential in the legal domain poses challenges. Legal language is complex and can make it difficult for generalized embedding models to grasp the nuances of legal queries. However, recent optimized legal embedding models indicate these hurdles can be overcome by training on ample in-domain data. Still, comprehensively encoding the interconnected, evolving nature of legal doctrines into a unified embedding space remains an open research problem. Hybrid approaches combining Boolean and vector models are a promising new frontier that many researchers are exploring. 
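One way to picture a hybrid approach is a two-stage search: a Boolean filter first, then vector ranking of whatever survives. The sketch below is only an illustration under stated assumptions; the CORPUS entries, their two-dimensional vectors, and the hybrid_search function are all hypothetical, and real systems may combine the two signals quite differently.

```python
import math

# Hypothetical corpus: each entry pairs a text with a made-up embedding.
CORPUS = [
    {"text": "promissory estoppel enforces a promise without a contract",
     "vector": [0.9, 0.1]},
    {"text": "novation replaces one contract with another",
     "vector": [0.8, 0.3]},
    {"text": "tort liability for negligence",
     "vector": [0.1, 0.9]},
]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(must_contain, query_vector, corpus):
    """Boolean step: keep documents containing every required term (AND).
    Vector step: rank the survivors by similarity to the query vector."""
    required = [term.lower() for term in must_contain]
    survivors = [
        doc for doc in corpus
        if all(term in doc["text"].lower().split() for term in required)
    ]
    return sorted(survivors,
                  key=lambda doc: cosine(query_vector, doc["vector"]),
                  reverse=True)


# The query vector is also made up; a real one would come from an embedding model.
hits = hybrid_search(["contract"], [1.0, 0.2], CORPUS)
for doc in hits:
    print(doc["text"])
```

The appeal of this design is that the researcher keeps the predictability of the Boolean step (nothing without “contract” can appear) while the vector step handles ordering.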

Realizing the full potential of semantic search for law remains an ambitious goal requiring innovative techniques. But the payoff could be immense – responsive, accurate AI assistance for case law research and analysis. While still in its promising infancy, the continued maturation of semantic legal search could profoundly augment the capabilities of legal professionals. A shift from generic to domain-specific models holds promise.  

Audit Trails for AI in Legal Research

LLMs have come a long way even in the time since I wrote my article in June. Three months of development time with this technology feels like three years – or maybe that’s just me catching up. Despite that, there are still a couple of nagging gaps that I would like to see addressed to make these tools more useful to legal researchers. I’m hoping to raise awareness about this so that we can collectively ask vendors to add quality-of-life features to these tools for the benefit of our community.

Audit Trails

Right now the tools do not have a way for us to easily check their work. Law librarians have been making a version of my argument for over a decade now. The legendary Susan Nevelow Mart famously questioned the opacity of search algorithms in legal research and evaluated their impact on legal research. More recently, I was in the audience at AALL 2023 when the tenacious and brilliant Debbie Ginsberg from Harvard asked Fastcase, BLaw, Lexis, and Westlaw how we (law librarians) could evaluate the inclusivity of the dataset of cases that the new AI algorithms are searching. How do we know if they’ve missed something if we don’t know what they’re searching and how complete it is?

As it stands, the legal research AI tools that I’ve demoed do not give you a summary of where they have gone and what they have done. An “audit trail” (as I’m using this expression) is a record of which processes were used to achieve a specific task, the totality of the dataset searched, and why the tool chose the results it presented to the user. This way, if something goes wrong, you can go back and look at what steps were taken to get the results. This would provide an extra layer of security and confidence in the process.
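As a rough sketch of what such a record could hold, the three parts of that definition (processes used, dataset searched, reasons for each result) map naturally onto a small data structure. The class and field names below are hypothetical and do not come from any vendor's product; the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditStep:
    action: str  # what the system did, e.g. "retrieve"
    detail: str  # human-readable description of that step


@dataclass
class AuditTrail:
    query: str             # the researcher's original question
    dataset_coverage: str  # description of what was searched
    steps: list = field(default_factory=list)           # processes used, in order
    result_reasons: dict = field(default_factory=dict)  # result id -> why it was shown

    def log_step(self, action, detail):
        self.steps.append(AuditStep(action, detail))

    def explain_result(self, result_id, reason):
        self.result_reasons[result_id] = reason

    def to_json(self):
        # asdict also converts the nested AuditStep objects into plain dicts
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
trail = AuditTrail(
    query="Is a restrictive covenant enforceable against a later buyer?",
    dataset_coverage="hypothetical: state appellate opinions, 1990 to present",
)
trail.log_step("retrieve", "ran a vector search and kept the top 20 opinions")
trail.log_step("summarize", "asked the LLM to summarize the top 5 opinions")
trail.explain_result("result-1", "highest similarity score to the query")

report = trail.to_json()
print(report)
```

A record like this could sit behind a drop-down next to each answer, so a researcher can see what was searched and why a result surfaced.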

Why Do We Need This?

These tools have introduced an additional layer of abstraction that separates legal researchers from the primary documents they are studying, altering how legal research is conducted. While the new AI algorithms can be seen as a step forward, they can undermine the precision that Boolean expressions once offered, which allowed researchers to predict the type of results they would encounter with more certainty. Coverage maps are still available to identify gaps in the data for some of these platforms, but there is a noticeable shift towards less control over the search process, calling for a thoughtful reassessment of the evolving dynamics in legal research techniques.

More importantly, we (law librarians) are deep enough into these processes and technology to be highly skeptical and evaluate the output with a critical eye. Many students and new attorneys may not be. I have told this story at some of my presentations, but a recent graduate called me with a Pacific Reporter citation for a case that they could not find on Westlaw. This person was absolutely convinced that they were doing something wrong and had spent around an hour searching for this case because “this was THE PERFECT case” for their situation. It ended up being a fabrication from ChatGPT, but the graduate had to call me to discover that. This is obviously a somewhat outdated worry, since Rebecca Fordon has caught all of us up on the steps being taken to reduce hallucinations (and OpenAI got a huge amount of negative publicity from the now-infamous ChatGPT Lawyer).

My point is less about the technology and more about the incentives set in place – if there is a fast, easy way to do this research, then there will inevitably be people who uncritically accept those results. “That’s their fault and they should get in trouble,” you say? Probably, but I plan to write about the duty of technological competency and these tools in a future post, so we’ll have to hash that out together later. Also, what if there were a fast, easy way to evaluate the results of these tools…

What Could Be Done

Summarizing the steps involved in research seems like it would be a feasible task for Westlaw, Lexis, BLaw, et al. to implement. They already have to use prompting to tell the LLM where to go and how to search; we’re just asking for a summary of those steps to be reproduced somewhere so that we can double-check it. Could they take that same prompting and wrap it in a prompt that says something to the effect of, “Summarize the steps taken in bullet points,” and then place the output under a drop-down arrow so that we could check it? Could they include hyperlinks to coverage maps in instances where it would be useful to the researcher to know how inclusive the search is? In instances where they’re using retrieval-augmented generation (RAG), could they include a prompt that says something to the effect of, “Summarize how you used those underlying documents to generate this text”?
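To show how small that ask is, here is a minimal sketch of the wrapping idea. It only builds the text of a follow-up prompt and does not call any model; build_audit_prompt, the sample research instructions, and the placeholder document titles are all hypothetical, and I have no knowledge of how the vendors actually structure their prompts.

```python
AUDIT_INSTRUCTION = "Summarize the steps taken in bullet points."
RAG_INSTRUCTION = (
    "Summarize how you used those underlying documents to generate this text."
)


def build_audit_prompt(research_prompt, source_titles=None):
    """Wrap the original research instructions in a request for a step summary.

    research_prompt: the instructions the tool already sends to the LLM.
    source_titles: optional titles of retrieved documents (the RAG case).
    """
    parts = [
        "You previously followed these research instructions:",
        research_prompt.strip(),
        "",
    ]
    if source_titles:
        parts.append("You retrieved these documents:")
        parts.extend(f"- {title}" for title in source_titles)
        parts.append("")
        parts.append(RAG_INSTRUCTION)
    parts.append(AUDIT_INSTRUCTION)
    return "\n".join(parts)


# Placeholder inputs for illustration only.
prompt = build_audit_prompt(
    "Search state appellate opinions for restrictive covenant cases.",
    source_titles=["Hypothetical Case A", "Hypothetical Case B"],
)
print(prompt)
```

The text this produces would be sent to the model as a second request, and its answer is what would sit behind the drop-down arrow.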

As someone who has tinkered with technology, all of these seem like reasonable requests that are well within the ability of these tools. I’m interested to hear if there are reasons why we couldn’t have these features or if people have other features they would like. Please feel free to post your ideas in the comments or email me.

Why Law Librarians?

Some of you reading this may be skeptical that these new AI technologies are 1) within your skillset and/or 2) worth the effort to learn. I’m the congenital optimist who is here to win you over. These tools are on the verge of revolutionizing the field of law (once they get out of their prototype phase) and I can’t think of a better group of people on law school campuses, in government organizations, and in law firms to evaluate and implement these technologies. Law librarians (traditionally) have two crucial skill sets that make us well-suited to take the lead here:

  • We understand how information is organized and
  • We understand how information is used in the research and practice of law.

David Shapiro is an AI YouTuber with ~70k subscribers who develops and trains LLMs from scratch. In one of his videos, he lists the disciplines that people need to learn to use these tools. Computer science skills rank third on his list, behind “Librarianship and Information Science” at #1.

This dude gets it.

Many of the tips that David Shapiro provides in that video for people creating custom LLMs will be absolutely obvious to law librarians because we live and breathe these every day at our jobs: taxonomies, data organization, “source of truth,” etc. Whether in the tech services department or research instruction, we are well-versed in organizing and finding information.

We already have many of the data structures in place that could be easily used by these technologies. Besides constructing the initial models, our role will be pivotal in continuously updating and assessing their effectiveness. Moreover, we will provide vital guidance on the proper utilization of these tools.

Does this list look like something your Technical Services department does? Can you think of anyone else in your organization who would be better at making knowledge graphs, indexes, or tables of contents for legal materials? Who would be better suited than your Research and Instruction team to teach newcomers how to interact with these tools to get the information that they need? Who in your organization is best positioned to teach (or already teaches) information literacy? I would argue that nobody can do it better than law librarians (not even computer science people).

Now What?

Let’s mobilize a push to collaborate on these tools. We need to get groups of law librarians together who are interested in rolling up their sleeves and digging into the nitty-gritty of creating, auditing, and using LLMs. I am a member of LIT-SIS in AALL and maybe we need a special caucus to address this specific technology. Additionally, we can get consortiums of schools together in each state to develop our own LLMs – outside of the subscription-based products that will roll out for Lexis and Westlaw. Anything we build ourselves will have the needs of our community at the forefront. We can build in all of the transparency, privacy, and accuracy that may be lacking in commercial models. Schools can build tools that would not be commercially viable at firms. Firms and courts could build specialized tools to achieve their unique workflows. It opens up many options that are not available if we’re stuck with the one-size-fits-all nature of Lexis and Westlaw subscriptions.

There are open-source models that are close to competing with GPT-4 (ChatGPT’s underlying model). There are many of these and new models show up every day.

There are many options to create, train, and locally run custom LLMs as long as you have the data. As David Shapiro said in the video, “data is the oil of the information age” and law libraries are deep wells of the type of data that could be used to accurately train these services. Additionally, when you are locally hosting an LLM many of the concerns surrounding privacy, permissions, and student data completely evaporate because you are in control of what information is being sent and stored.

To do all of this, we need organization, collaboration, and funding. Individually this could be difficult, but if we band together in a consortium, we can get a lot done.

Students

Students are an incredible resource in this area. Many of them come to law school with computer science and data science backgrounds and can help with the creation and development of these models. They need mentors and organizers to help focus their efforts, provide resources, and nurture their creativity. In addition, they provide a deep reservoir of diverse voices and experiences that may not occur to people who have spent decades in academia, the public sector, or law firms. We can bring in students to have competitions to create their own LLM apps for law practice and access to justice initiatives. We can fund fellowships to do work at schools, courts, and firms. We can bring them under our wing to usher in the next generation of tech-savvy law librarians. We can leverage the excitement and energy associated with these new tools to attract new talent into our field – I skimmed TikTok and the #ChatGPT hashtag has around 7.7 billion views. To do that, we need to brainstorm together so that we can get these programs in place.

In Sum

As the torchbearers in this promising venture, it’s time for us, the law librarians, to step up and show the world our unmatched prowess in harnessing the potential of LLMs in law, weaving our expert knowledge in information science, law, and emerging technology. Let us band together, utilizing the rich data reserves at our disposal, and carve out a future where legal technology is not just efficient and transparent, but also a collaborative masterpiece fostered by our relentless pursuit of innovation and excellence.