Is Better Case Law Data Fueling a Legal Research Boom?

Recently, I’ve noticed a surge of new and innovative legal research tools. I wondered what could be fueling this increase, and set off to find out more. 

The Moat

An image generated by DALL-E, depicting a castle made of case law reporters, with sad business children trying to construct their own versions out of pieces of paper. They just look like sand castles.

Historically, acquiring case law data has been a significant challenge, acting as a barrier to newcomers in the legal research market. Established players are often protective of their data. For instance, in an antitrust counterclaim, ROSS Intelligence accused Thomson Reuters of withholding their public law collection, claiming they had to instead resort to purchasing cases piecemeal from sources like Casemaker and Fastcase.  Other companies have taken more extreme measures. For example, Ravel Law partnered with the Harvard Law Library to scan every single opinion in their print reporter collections. There’s also speculation that major vendors might even license some of their materials directly to platforms like Google Scholar, albeit with stringent conditions.

The New Entrants

Despite the historic challenges, several new products have recently emerged offering advanced legal research capabilities:

  • Descrybe.ai (founded 2023) – This platform leverages generative AI to read and summarize judicial opinions, streamlining the search process. Currently hosting around 1.6 million summarized opinions, it’s available for free.
  • Midpage (2022) – Emphasizing the integration of legal research into the writing process, users can employ generative AI to draft documents from selected source (see Nicola Shaver’s short writeup on Midpage here). Midpage is currently free at app.midpage.ai.
  • CoPilot (by LawDroid, founded 2016) – Initially known for creating chatbots, LawDroid introduced CoPilot, a GPT-powered AI legal assistant, in 2023. It offers various tasks, including research, translating, and summarizing. CoPilot is available in beta as a web app and a Chrome extension, and is free for faculty and students.
  • Paxton.ai (2023) – Another generative AI legal assistant, Paxton.ai allows users to conduct legal research, draft documents, and more. Limited free access is available without signup at app.paxton.ai, although case law research will require you to sign up for a free account.
  • Alexi (2017) Originally focused on Canadian law, Alexi provides legal research memos. They’ve recently unveiled their instant memos, powered by generative AI. Alexi is available at alexi.com and provides a free pilot.

Caselaw Access Project and Free Law Project

With the Caselaw Access Project, launched in 2015, Ravel Law and Harvard Law Library changed the game. Through their scanning project, Harvard received rights to the case law data, and Ravel gained an exclusive commercial license for 8 years. (When Lexis acquired Ravel a few years later, they committed to completing the project.) Although the official launch date of free access is February 2024, we are already seeing a free API at Ravel Law (as reported by Sarah Glassmeyer).

Caselaw Access Project data is only current through 2020 (scanning was completed in 2018, and has been supplemented by Fastcase donations through 2020) and does not include digital-first opinions. However, this gap is mostly filled through CourtListener, which contains a quite complete set of state and federal appellate opinions for recent years, painstakingly built through their network of web scrapers and direct publishing agreements. CourtListener offers an API (along with other options for bulk data use).

And indeed, Caselaw Access Project and Free Law Project just recently announced a dataset called Collaborative Open Legal Data (COLD) – Cases. COLD Cases is a dataset of 8.3 million United States legal decisions with text and metadata, suitable for use in machine learning and natural language processing projects.

Most of the legal research products I mentioned above do not disclose their precise source of their case law data. However, both Descrybe.ai and Midpage point to CourtListener as a partner. My theory/opinion is that many of the others may be using this data as well, and that these new, more reliable and more complete sources of data are responsible for fueling some amazing innovation in the legal research sphere.

What Holes Remain?

Reviewing the coverage of CourtListener and Caselaw Access Project it appears to me that they have, when combined:

  • 100% of all published U.S. case law from 2018 and earlier (state and federal)
  • 100% of all U.S. Supreme Court, U.S. Circuit Court of Appeals, and state appellate court cases

There are, nevertheless, still a few holes that remain in the coverage:

  • Newer Reporter Citations. Newer appellate court decisions may not have reporter citations within CourtListener. These may be supplemented as Fastcase donates cases to Caselaw Access Project.
  • Newer Federal District Court Opinions. Although CourtListener collects federal decisions marked as “opinions” within PACER, these decisions are not yet available in their opinion search. Therefore, very few federal district court cases are available for the past 3-4 years. This functionality will likely be added, but even when it is, district courts are inconsistent about marking decisions as “opinions” and so not all federal district court opinions will make their way to CourtListener’s opinions database. To me, this brings into sharp relief the failure of federal courts to comply with the 2002 E-Government Act, which requires federal courts to provide online access to all written opinions.
  • State Trial Court Decisions. Some other legal research providers include state court trial-level decisions. These are generally not published on freely available websites (so CourtListener cannot scrape them) and are also typically not published in print reporters (so Caselaw Access Project could not scan them).
  • Tribal Law. Even the major vendors have patchy access to tribal law, and CourtListener has holes here as well.

The Elephant in the Room

Of course, another major factor in the increase in legal research tools may be simple economics. In August, Thomson Reuters acquired the legal research provider Casetext for the eye-watering sum of $650 million.  And Casetext itself is a newer legal research provider, founded only in 2013. In interviews, Thomson Reuters cited Casetext’s access to domain-specific legal authority, as well as its early access to GPT-4, as key to its success. 

What’s Next?

Both Courtlistener and Caselaw Acess Project have big plans for continuing to increase access to case law. CAP will launch free API access in February 2024, coordinating with LexisNexis, Fastcase, and the Free Law Project on the launch. CourtListener is planning a scanning project to fix remaining gaps in their coverage (CourtListener’s Mike Lissner tells me they are interested in speaking to law librarians about this – please reach out). And I’m sure we can expect to see additional legal research tools, and potentially entire LLMs (hopefully open source!), trained on this legal data.

Know of anything else I didn’t discuss? Let me know in the comments, or find me on social media or email.

5 thoughts on “Is Better Case Law Data Fueling a Legal Research Boom?

  1. This is a wonderful post, thank you so much, Rebecca. Helping build a complete and open collection of American case law has been a goal of ours (and certainly others!) for over a decade and it’s really coming to fruition now. It’s an exciting time for sure.

    The one spot I want to mention that we have trouble is that our scrapers have been imperfect and somewhat poorly maintained. We have spent the past few years focused on making our historical corpus really solid, and let the current coverage slide a little. We’re fixing that now, but for those reading, I thought it was a caveat worth mentioning.

    We also made a really detailed coverage page here:

    https://www.courtlistener.com/coverage/opinions/

    I’d love to hear people’s thoughts and I love that we’re having this moment in legal tech and access to justice.

  2. I love the castle made out of case reporters. can you share secrets on how to make better images on Dall-e. I haven’t been happy with my results.

    • This was using DALLE-3 via the paid ChatGPT, which is much easier to use than the others because it helps create the prompts and has lots of other guidance it gives the image generator. My prompt was fairly simple: “An illustration of a moat, in the sense it is used by technology companies to describe data used for competitive advantage. The moat I’m talking about is specifically about case law data and legal research providers. I want a castle built of case law reporters, a moat, and then on the other side of the moat, the competitors trying to build sand castles with little crumpled of pieces of paper. The image should be wide-screen format.”

      Bing has DALLE-3 available for free, but does not have the prompt generation help. (I tried this prompt on Bing and on Midjourney – not nearly as successful)

  3. Thanks so much for mentioning descrybe.ai in this post! Interestingly, we got our start with data from the Caselaw Access Project and now use CourtListener by the Free Law Project. This shows how important access to the actual data is for others to build on top of.

Leave a Reply

Your email address will not be published. Required fields are marked *