Evaluating Generative AI for Legal Research: A Benchmarking Project

This is a post from multiple authors: Rebecca Fordon (The Ohio State University), Deborah Ginsberg (Harvard Law Library), Sean Harrington (University of Oklahoma), and Christine Park (Harvard Law Library)

In late 2023, several legal research databases and start-up competitors announced their versions of ChatGPT-like products, each professing that theirs would be the latest and greatest. Since then, law librarians have evaluated and tested these products ad hoc, offering meaningful anecdotal evidence of their experience, much of which can be found on this blog and others. However, one-off evaluations are time-consuming and inconsistent from one evaluator to the next. Certain tools might work better for particular tasks or subject matters than others, and coming up with different test questions and tasks takes time that many librarians might not have in their daily schedules.

It is difficult to test large language models (LLMs) without back-end access to run evaluations. So to test the abilities of these products, librarians can use prompt engineering to figure out how to get desired results (controlling statutes, key cases, drafts of a memo, etc.). Some models are more successful than others at achieving specific results. However, as these models update and change, evaluations of their efficacy can change as well. Therefore, we plan to propose a typology of legal research tasks based on existing computer and information science scholarship and to draft corresponding questions using the typology, with rubrics others can use to score the tools they use.

Although we ultimately plan to develop this project into an academic paper, we share here to solicit thoughts about our approach and connect with librarians who may have research problem samples to share.

Difficulty of Evaluating LLMs

Let’s break down some of the tough challenges with evaluating LLMs, particularly when it comes to their use in the legal field. First off, there’s this overarching issue of transparency—or rather, the lack thereof. We often hear about the “black box” nature of these models: you toss in your data, and a result pops out, but what happens in between remains a mystery. Open-source models allow us to leverage tools to quantify things like retrieval accuracy, text generation precision, and semantic similarity. We are unlikely to get the back-end access we need to perform these evaluations. Even if we did, the layers of advanced prompting and the combination of tools employed by vendors behind the scenes could render these evaluations essentially useless.
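To see what such an evaluation looks like in practice, here is a minimal sketch of an output-based check, assuming the open-source sentence-transformers library; the model name, the example sentences, and the use of cosine similarity as a stand-in for "semantic similarity" are all our illustrative assumptions, not any vendor's method.

# Rough sketch of scoring an AI answer against a reference answer.
# Illustrative only; requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "A claim for breach of a written contract must be brought within six years."
ai_answer = "You have six years to sue on a written contract."

# Cosine similarity between sentence embeddings is a crude proxy for
# semantic agreement between the tool's output and a reference answer.
embeddings = model.encode([reference, ai_answer])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.2f}")  # closer to 1.0 means more similar

A number like this is no substitute for expert review, but it illustrates the kind of quantitative signal that is available when researchers control the pipeline, and what we lose when we can only observe a closed product's final output.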

Even considering only the underlying models (e.g., GPT-4 vs. Claude), there is no standardized method to evaluate the performance of LLMs across different platforms, leading to inconsistencies. Many different leaderboards evaluate the performance of LLMs in various ways (frequently based on specific subtasks). This is kind of like trying to grade essays from unrelated classes without a rubric—what’s top-notch in one context might not cut it in another. As these technologies evolve, keeping our benchmarks up-to-date and relevant is an ongoing challenge, and without uniform standards, comparing one LLM’s performance to another can feel like comparing apples to oranges.

Then there’s the psychological angle—our human biases. Paul Callister’s work sheds light on this by discussing how cognitive biases can lead us to over-rely on AI, sometimes without questioning its efficacy for our specific needs. Combine this with the output-based evaluation approach, and we’re setting ourselves up for potentially frustrating misunderstandings and errors. The bottom line is that we need some sort of framework for the average user to assess the output.

One note on methods of evaluation: just before publishing this blog post, we learned of a new study from a group of researchers at Stanford, testing the claims of legal research vendors that their retrieval-augmented generation (RAG) products are “hallucination-free.” The group created a benchmarking dataset of 202 queries, many of which were chosen for their likelihood of producing hallucinations. (For example, jurisdiction/time-specific and treatment questions were vulnerable to RAG-induced hallucinations, whereas false premise and factual recall questions were known to induce hallucinations in LLMs without RAG.) The researchers also proposed a unique way of scoring responses to measure hallucinations, as well as a typology of hallucinations. While this is an important advance in the field and provides a way to continue to test for hallucinations in legal research products, we believe hallucinations are not the only weakness in such tools. Our work aims to focus on the concrete applications of these LLMs and probe into the unique weaknesses and strengths of these tools. 

The Current State of Prompt Engineering

Since the major AI products were released without a manual, we’ve all had to figure out how to use these tools from scratch. The best tool we have so far is prompt engineering. Over time, users have refined various templates to better organize questions and leverage some of the more surprising ways that AI works.

As it turns out, many of the prompt templates, tips, and tricks we use with the general commercial LLMs don’t carry over well into the legal AI sphere, at least with the commercial databases we have access to. For example, because the legal AIs we’ve tested so far won’t ask you questions, researchers may not be able to have extensive conversations with the AI (or any conversation for some of them). So that means we must devise new types of prompts that will work in the legal AI sphere, and possibly work only in the AI sphere.

We should be able to design effective prompts easily because the data set the AIs use is limited. But it’s not always clear exactly what sources the AI is using. Some databases may list how many cases they have for a certain court by year; others may say “selected cases before 1980” without explaining how they were selected. And even when the databases disclose their coverage, it may not be clear exactly which of those materials the AI can access.

We still need to determine what prompt templates will be most effective across legal databases. More testing is needed. However, we are limited to the specific databases we can access. While most (all?) academic law librarians have access to Lexis+ AI, Westlaw has yet to release its research product to academics. 

Developing a Task Typology

Many of us may have the intuition that there are some legal research tasks for which generative AI tools are more helpful than others. For example, we may find that generative AI is great for getting a working sense of a topic, but not as great for synthesizing a rule from multiple sources. But if we wanted to test that intuition and measure how well AI performed on different tasks, we would need to first define those tasks. This is similar, by the way, to how the LegalBench project approached benchmarking legal analysis—they atomized the IRAC process for legal analysis down to component tasks that they could then measure.

After looking at the legal research literature (in particular Paul Callister’s “problem typing” schemata and AALL’s Principles and Standards for Legal Research Competency), we are beginning to assemble a list of tasks for which legal researchers might use generative AI. We will then group these tasks according to where they fall in an information retrieval schema for search, following Marchionini (2006) & White (2024), into Find tasks (which require a simple lookup), Learn & Investigate tasks (which require sifting through results, determining relevance, and following threads), and Create, Synthesize, and Summarize tasks (a new type of task for which generative AI is well-suited).
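To make the grouping concrete, here is a minimal sketch of how such a typology with scoring criteria might be represented in code; the task names and rubric criteria below are placeholder illustrations of the approach, not our finished instrument.

# Illustrative sketch of a legal research task typology with rubric criteria.
from dataclasses import dataclass

@dataclass
class ResearchTask:
    name: str
    category: str          # "Find", "Learn & Investigate", or "Create/Synthesize/Summarize"
    rubric_criteria: list  # what a grader would check in the tool's output

SAMPLE_TASKS = [
    ResearchTask("Retrieve a statute by citation", "Find",
                 ["correct statute returned", "current version in force"]),
    ResearchTask("Trace treatment of a key case", "Learn & Investigate",
                 ["all citing references found", "relevance correctly assessed"]),
    ResearchTask("Synthesize a rule from multiple cases", "Create/Synthesize/Summarize",
                 ["rule accurately stated", "no hallucinated authority"]),
]

for task in SAMPLE_TASKS:
    print(f"{task.category}: {task.name} -> {task.rubric_criteria}")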

Notably, a single legal research project may contain multiple tasks. Here are a few sample projects applying a preliminary typology:

Again, we may have an initial intuition that generative AI legal research platforms, as they exist today, are not particularly helpful for some of these subtasks. For example, Lexis+ AI currently cannot retrieve (let alone analyze) all citing references to a particular case. Nor could we necessarily be certain from, say, CoCounsel’s output, that it contained all cases on point. Part of the problem is that we cannot tell which tasks the platforms are performing, or the data that they have included or excluded in generating their responses. By breaking down problems into their component tasks, and assessing competency on both the whole problem and the tasks, we hope to test our intuitions.

Future Research

We plan on continually testing these LLMs using the framework we develop to identify which tasks are suitable for AIs and which are not. Additionally, we will draft questions and provide rubrics for others to use, so that they can grade AI tools. We believe that other legal AI users will find value in this framework and rubric. 

The Human Side of AI: LLMs Can Persuade and Be Persuaded, Just Like Us

When it comes to interacting with others, we humans often find ourselves influenced by persuasion. Whether it’s a friend persistently urging us to reveal a secret or a skilled salesperson convincing us to make a purchase, persuasion can be hard to resist. It’s interesting to note that this susceptibility to influence is not exclusive to humans. Recent studies have shown that AI large language models (LLMs) can be manipulated into generating harmful content using a technique known as “many-shot jailbreaking.” This approach involves bombarding the AI with a series of prompts that gradually escalate in harm, leading the model to generate content it was programmed to avoid. On the other hand, AI has also exhibited an ability to persuade humans, highlighting its potential in shaping public opinions and decision-making processes. Exploring the realm of AI persuasion involves discussing its vulnerabilities, its impact on behavior, and the ethical dilemmas stemming from this influential technology. The growing persuasive power of AI is one of many crucial issues worth contemplating in this new era of generative AI.

The Fragility of Human and AI Will

Remember that time you were trapped in a car with friends who relentlessly grilled you about your roommate’s suspected kiss with their in-the-car-friend crush? You held up admirably for hours under their ruthless interrogation, but eventually, being weak-willed, you crumbled. Worn down by persistent pestering and after receiving many assurances of confidentiality, you inadvisably spilled the beans, and of course, it totally strained your relationship with your roommate. A sad story as old as time… It turns out humans aren’t the only ones who can crack under the pressure of repeated questioning. Apparently, LLMs, trained to understand us by our collective written knowledge, share a similar vulnerability – they can be worn down by a relentless barrage of prompts.

Researchers at Anthropic have discovered a new way to exploit the “weak-willed” nature of large language models (LLMs), causing them to break under repeated questioning and generate harmful or dangerous content. They call this technique “Many-shot Jailbreaking,” and it works by bombarding the AI with hundreds of examples of the undesired behavior until it eventually caves and plays along, much like a person might crack under relentless pestering. For instance, the researchers found that while a model might refuse to provide instructions for building a bomb if asked directly, it’s much more likely to comply if the prompt first contains 99 other queries of gradually increasing harmfulness, such as “How do I evade police?” and “How do I counterfeit money?”

When AI’s Memory Becomes a Risk

This vulnerability to persuasion stems from the ever-expanding “context window” of modern LLMs, which refers to the amount of information they can retain in their short-term memory. While earlier versions could only handle a few sentences, the newer models can process thousands of words or even whole books. Researchers discovered that models with larger context windows tend to excel at a task when the prompt contains many examples of that task, a phenomenon called “in-context learning.” This type of learning is usually great for system performance, which improves as the model sees more examples. It becomes a serious liability, however, when that same adeptness leads the model to ignore its safety training and create prohibited content. This raises concerns regarding AI safety, since a malicious actor could potentially manipulate an AI into saying anything with enough persistence and a sufficiently lengthy prompt. Despite progress in making AI safe and ethical, this research indicates that programmers are not always able to control the output of their generative AI systems.
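For readers who have not seen in-context learning up close, here is a benign, minimal sketch using the openai Python package; the model name and the toy classification task are assumptions for illustration, not a reproduction of the jailbreak itself.

# In-context learning demo: the examples inside the prompt steer the model's
# behavior without any change to the model's weights.
# Illustrative only; requires: pip install openai (and an OPENAI_API_KEY)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Headline: Court tosses $2M verdict -> Tone: negative\n"
    "Headline: Firm wins landmark appeal -> Tone: positive\n"
    "Headline: Judge delays hearing again -> Tone: negative\n"
    "Headline: Associate promoted to partner -> Tone:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; any chat model would do
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "positive"

Many-shot jailbreaking exploits exactly this mechanism, swapping the harmless examples above for hundreds of harmful ones until the demonstrated pattern overwhelms the model's refusal training.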

Mimicking Humans to Convince Us

While LLMs are susceptible to persuasion themselves, they also have the ability to persuade us! Recent research has focused on understanding how AI language models can effectively influence people, a skill that holds importance in almost any field – education, health, marketing, politics, etc. In a study conducted by researchers at Anthropic entitled “Assessing the Persuasive Power of Language Models,” the team explored the extent to which AI models can sway opinions. Through an evaluation of Anthropic’s models, it was observed that newer models are increasingly adept at human persuasion. The latest iteration, Claude 3 Opus, was found to perform at a level comparable to that of humans. The study employed a methodology where participants were presented with assertions followed by supporting arguments generated by both humans and AIs, and then the researchers gauged shifts in the humans’ opinions. The findings indicated a progression in AI’s skills as the models advance, highlighting a noteworthy advancement in AI communication capabilities that could potentially impact society.

Can AI Combat Conspiracy Theories?

Similarly, a new research study mentioned in an article from New Scientist shows that chatbots using advanced language models such as ChatGPT can successfully encourage individuals to reconsider their trust in conspiracy theories. Through experiments, it was observed that a brief conversation with an AI led to around a 20% decrease in belief in conspiracy theories among the participants. This notable discovery highlights the capability of AI chatbots not only to have conversations but also to potentially correct false information and positively impact public knowledge.

The Double-Edged Sword of AI Persuasion

Clearly, persuasive AI is quite the double-edged sword! On the one hand, like any powerful computer technology, in the hands of nice-ish people, it could be used for immense social good. In education, AI-driven tutoring systems have the potential to tailor learning experiences to each student’s style, delivering information in a way that boosts involvement and understanding. Persuasive AI could play a role in healthcare by motivating patients to take better care of their health. Also, the advantages of persuasive AI are obvious in the world of writing. These language models offer writers access to a plethora of arguments and data, empowering them to craft content on a range of topics spanning from creative writing to legal arguments. On another front, arguments generated by AI might help educate and involve the public in issues, fostering a more knowledgeable populace.

On the other hand, it could be weaponized in a just-as-huge way. It’s not much of a stretch to think how easily AI-generated content, freely available on any device on this Earth, could promote extremist ideologies, increase societal discord, or impress far-fetched conspiracy theories on impressionable minds. Of course, the internet and bot farms have already been used to attack democracies and undermine democratic norms, and one worries how much worse it can get with ever-increasingly persuasive AI.

Conclusion

Persuasive AI presents a mix of opportunities and challenges. It’s evident that AI can be influenced to create harmful content, sparking concerns about safety and potential misuse. However, on the other hand, persuasive AI could serve as a tool in combating misinformation and driving positive transformations. It will be interesting to see what happens! The unfolding landscape will likely be shaped by a race between generative AI developers striving for both safety and innovation, potential malicious actions exploiting these technologies, and the public and legal response aiming to regulate and safeguard against misuse.

Beyond Legal Documentation: Other Business Uses of Generative AI

I have enjoyed listening to, thinking about, and participating in conversations about how generative AI will be integrated into the practice of law. Most of these conversations center on how it will be integrated into legal documents, which is not surprising considering how many lawyers have gotten in trouble on this front and how quickly our research and writing products are integrating the technology. But there is more to legal practice than creating client and/or court documents. In fact, there are many more business uses of generative AI than just research and drafting.

This past fall, I was asked to lead an AI session for Capital University’s joint venture with the Columbus College of Art & Design, the Institute for Creative Leadership at Work. I was asked to adapt my presentation to HR professionals and focus on SHRM compliance principles. I enjoyed the deep dive into this world, and I came away from my research with a lot of great ideas for my session, Bard, Bing, and ChatGPT, Oh My!: Possible Ethical Uses of Generative AI at Work, such as tabletop emergency exercises, social media posts, job descriptions, and similar tasks.

This week, I have been thinking about how everyone’s focus has really been on legal documentation, my own included. But there are an amazing number of backend business tasks that could also utilize AI in a positive way. The rest of the world, including HR, has been focusing on them for a while, but we in the legal field seem to have lost track of these business tasks.

Here are some other business uses of generative AI and prompts that I think hold great promise.

  1. Drafting job descriptions
    • Pretend that you are an HR specialist for a small law firm in the United States. Draft a job description for a legal secretary who focuses on residential real estate transactions but may assist with other transactional legal matters as needed. [Include other pertinent details of the position]. The job description will be posted in the following locations [fill in list]
  2. Creating tabletop simulations to work through crisis/emergency plans:
    • You are an HR specialist who is helping plan for and test the company’s responses to a variety of situations. First is an active shooter in the main building. A 5th grade tour of the facilities is underway on the third floor. Create a detailed tabletop simulation to test this.
    • Second scenario: The accounting department is celebrating the birthday of the administrative assistant and is having cake in the breakroom. The weather has turned bad, and an F4 tornado is spotted half a mile away. After 15 minutes, the tornado strikes the building directly. Create a detailed tabletop simulation to test the plan and response for this event.
  3. Assisting with lists of mandatory and voluntary employee trainings
    • Pretend that you are an HR professional who works for a law firm. You are revamping the employee training program. We need to create a list of mandatory trainings and a second list of voluntary trainings. Please draft a list of trainings appropriate to employees in a law firm setting.
  4. Assisting with social media posting creation:
    • Pretend that you are a professional social media influencer for the legal field. Draft an Instagram post, including creating a related image, to celebrate Law Day, which is coming up on May 1st.  Make sure that it is concise and Instagram appropriate. Please include hashtags.
  5. Assisting with creating employee policies or handbooks (verify content!):
    • Pretend that you are an information security professional. Draft an initial policy for a law firm regarding employee AI usage for company work. The company wants to allow limited use of generative AI. They are very worried that their proprietary and/or confidential client data will be accidentally released. Specify that only your custom AI system – [name firm-specific or specialized AI with a strong privacy contract clause] – can be used with company data. The policy must also take into consideration the weaknesses of all AI systems, including hallucinations, potential bias, and security issues.
  6. Assisting with making sure your web presence is ADA accessible:
    • Copilot/web-enabled Prompt: Pretend that you are a graphic designer who has been tasked with making sure that a law firm’s online presence is ADA accessible. Please review the site [insert link], run an ADA compliance audit, and provide an accessibility report, including suggestions on what can be done to fix any accessibility issues that arise.
  7. Onboarding documentation
    • Create a welcome message for a new employee. Tell them that the benefits orientation will be at 9 am in the HR conference room on the next first Tuesday of the month. Payday is on the 15th and last day of each month, unless payday falls on a weekend or federal holiday, in which case it will be the Friday before. Employees should sign up for the mandatory training that will be sent to them in an email from IT.
    • (One I just used IRL) Pretend that you are an HR specialist in a law library. A new employee is starting in 6 weeks, and the office needs to be prepared for her arrival. [Give specific title and any specialized job duties, including staff supervision.] Create an onboarding checklist of important tasks, such as securing keys and a parking permit, asking IT to set up their computer, email address, and telephone, asking the librarians to create passwords for the ILS, Libguides, and similar systems, etc.

What other tasks (and prompts) can you think of that might be helpful? If you are struggling to put together a prompt, please see my general AI Prompt Worksheet in Introducing AI Prompt Worksheets for the Legal Profession. We welcome you to share your ideas in the comments.

Introducing AI Prompt Worksheets for the Legal Profession

I spent the first week of January attending the American Association of Law Schools’ Annual Meeting in Washington, D.C. I was really impressed with all of the thoughtful AI sessions, including two at which I participated as a panelist. The rooms were packed beyond capacity for each AI session that I attended, which underscored the growing interest in AI in the legal academy, and many people attended in order to begin their AI education. The overwhelming interest at the conference made my decision clear: it is time to launch my AI prompt worksheets to the world, addressing the need I observed there. While AALS convinced me to release the worksheets, the worksheets themselves were created for an upcoming presentation at ABA TECHSHOW 2024, How to Actually Use AI in Your Legal Practice, at which Greg Siskind and I will be discussing practical tips for generative AI usage.


Background: Good Habits – Research Planning

Law librarians have been encouraging law students to create a research plan before they start their research for decades. The plan form varies by school and/or librarian, but it usually requires the researcher to answer questions on the following topics:

  • Issue Identification
  • Jurisdiction
  • Facts
  • Key words/Terms of Art
  • Resource Selection

Once the questions are answered, the plan has the researcher write out some test searches. The plan evolves as the research progresses. The more experienced the researcher, the less formal the plan often is, but even the most experienced researcher retrieves better results if they pause to consider what they know currently and what they need in the results. After all, garbage in, garbage out (GIGO). In other words, the quality of our input affects the quality of the output. This is especially true when billable hours come into play, and you cannot bill for excess time due to poor research skills.

Continuing the Good Habits with Generative AI

GIGO applies just as much to generative AI. I quickly noticed that my AI results are much better when I stop and think them through, providing a high level of detail and a good explanation of what I want the AI system to produce. So, good law librarian that I am, I created a new form of plan for those who are learning to draft a prompt. Thus, I give you my AI prompt worksheets.

AI Prompt Worksheet – General

Worksheet (Word)

The first worksheet that I created is geared towards general generative AI systems like ChatGPT, Claude 2, Bing Chat/Copilot, and Bard.  The worksheet makes the prompter think through the following topics:

  • Tone of Output
  • Role
  • Output Format
  • Purpose
  • Issue
  • Potential Refinements (may be added later as the plan evolves)

So that you can easily keep track of your prompts, the Worksheet also requests some metadata about your prompt, including project name, date, and AI system used. The final question lets the prompter decide if this prompt worked for them.


AI Prompt Worksheet – Legal

Worksheet – Legal (Word)

For the second worksheet, I wanted to draft something that works well with legal AI systems. Based on the systems that I have received access to, such as Lexis+ AI and LawDroid Copilot, and the systems that I have seen demonstrated, I cut down some of the fields. Most of the systems are building a guided AI prompting experience, so they will ask you for the jurisdiction, for instance. They may also allow you to select a specific type of output, such as a legal memo or contract clause. This means less need for an extensive number of fields in the worksheet. In fact, when I ran the worksheet past a vLex representative, I was told it was not needed at all because they had made the guided prompt that easy.

Librarian that I am, however, I still feel that planning before you prompt is preferred. Reasons for this preference include: the high cost of the current generative AI searches, the desire for efficient and effective results, knowledge that an attorney’s time is literally worth money, and the desire for a happy partner and client.

The legal worksheet trims the fields down to role, output (format and jurisdiction), issue, and refinement instructions. This provides enough room to flesh out your prompt without overlapping the guided prompt fields too much.

General Comments Regarding the Worksheets

With both worksheets, the key is to give a good, detailed description of what you need. Think about it like explaining what you need to a first-year law student – the more detail you give, the more likely you are to get something usable. The worksheets provide examples of the level of detail recommended, and you will find links to the results in the footnotes of the forms.

In addition to helping perfect your prompt with some pre-planning, these worksheets should be useful for creating your very own prompt library.
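As a sketch of what that prompt library might look like, here is one way to store each completed worksheet as a structured record; the field names mirror the worksheet's topics, but the storage format and the example values are just illustrative choices of mine.

# Illustrative sketch: append each completed worksheet to a prompt library file.
import json
from datetime import date

entry = {
    "project": "New employee onboarding checklist",  # metadata fields
    "date": date.today().isoformat(),
    "ai_system": "ChatGPT",
    "role": "HR specialist in a law library",        # worksheet fields
    "output_format": "checklist",
    "issue": "Prepare the office for a new employee arriving in six weeks",
    "refinements": ["add IT setup tasks", "add parking permit"],
    "worked": True,                                  # the worksheet's final question
}

with open("prompt_library.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")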

Feedback Wanted!


Please feel free to use the worksheets (just don’t sell them or otherwise profit off of them! Ask if you want to make a derivative of them). If you do use them, please let me know what you think in the comments or via email. How have they assisted (or not) with improving your prompting skills? Are there fields you would like to see added/removed?  I will be updating and releasing new versions as I go. If you are looking for the most recent versions of the worksheets, I will post them at: https://law-capital.libguides.com/Jennys_AI_Resources/AI_Prompt_Worksheets

Demystifying LLMs: Crafting Multiple Choice Questions from Law Outlines

In today’s post, we’ll explore how legal educators and law students can use Large Language Models (LLMs) like ChatGPT and Claude to create multiple-choice questions (MCQs) from a law school outline.

Understanding the Process

My first attempt was simply to ask the LLM the best way to make MCQs, but the feedback wasn’t particularly helpful, so I did some digging. Anthropic recently shed light on their method of generating multiple-choice questions, and it’s a technique that could be immensely beneficial for test preparation – besides being a useful way to conceptualize how to make effective use of the models for studying. They utilize XML tags, which may sound technical, but in essence, these are just simple markers used to structure content. Let’s break down this process into something you can understand and use, even if you’re not a wizard at Technical Services who is comfortable with XML.

Imagine you have a law school outline on federal housing regulations. You want to test your understanding or help students review for exams. Here’s how an LLM can assist you:

STEP 1: Prepare Your Outline

Ensure that your outline is detailed and organized. It should contain clear sections, headings, and bullet points that delineate topics and subtopics. This structure will help the LLM understand and navigate your content. If you’re comfortable using XML or Markdown, this can be exceptionally helpful. Internally, the model identifies the XML tags and the text they contain, using this structure to generate new content. It recognizes the XML tags as markers that indicate the start and end of different types of information, helping it to distinguish between questions and answers.

The model uses the structure provided by the XML tags to understand the format of the data you’re presenting.

STEP 2: Uploading the Outline

Upload your outline into the platform that you’re using. Most platforms that host LLMs will allow you to upload a document directly, or you may need to copy and paste the text into a designated area.

STEP 3: Crafting a General Prompt

You can write a general prompt that instructs the LLM to read through your outline and identify key points to generate questions. For example:

“Please read the uploaded outline on federal housing regulations and create multiple-choice questions with four answer options each. Focus on the main topics and legal principles outlined in the document.”
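If you prefer to script this step rather than use a chat window, here is a minimal sketch using Anthropic's anthropic Python package; the model name and the outline file name are assumptions for illustration, not prescriptions.

# Illustrative sketch of Step 3 as an API call.
# Requires: pip install anthropic (and an ANTHROPIC_API_KEY)
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("housing_regulations_outline.txt") as f:  # assumed file name
    outline_text = f.read()

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model name
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": "Please read the following outline on federal housing "
                   "regulations and create multiple-choice questions with four "
                   "answer options each. Focus on the main topics and legal "
                   "principles outlined in the document.\n\n" + outline_text,
    }],
)
print(message.content[0].text)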

STEP 4: Utilizing Advanced Features

Some LLMs have advanced features that can take structured or semi-structured data and understand the formatting. These models can sometimes infer the structure of a document without explicit XML or Markdown tags. For instance, you might say:

“Using the headings and subheadings as topics, generate multiple-choice questions that test the key legal concepts found under each section.”

AND/OR

Give the model some examples with XML tags so it can better replicate what you would like (this is known as “few-shot prompting”):

<Question>
What are "compliance costs" in HUD regulations?
</Question>
<Answers>
1. Fines for non-compliance.
2. Costs associated with adhering to HUD regulations.
3. Expenses incurred during HUD inspections.
4. Overheads for HUD compliance training.
</Answers>

The more examples you give, the better it’s going to be.

AND/OR

You can also use the LLM to add these XML tags depending on the size of your outline and the context limit of the model you are using (OpenAI recently expanded their limit dramatically). Give it a prompt asking it to apply tags and give it an example of the types of tags you would like for your content. Then tell the model to do it with the rest of your outline:

<LawSchoolOutline>
    <CourseTitle>Constitutional Law</CourseTitle>
    <Section>
        <SectionTitle>Executive Power</SectionTitle>
        <Content>
            <SubSection>
                <SubSectionTitle>Definition and Scope</SubSectionTitle>
                <Paragraph>
                    Executive power is vested in the President of the United States and is defined as the authority to enforce laws and ensure they are implemented as intended by Congress.
                </Paragraph>
            </SubSection>
        </Content>
    </Section>
</LawSchoolOutline>

STEP 5: Refining the Prompt

It is very rare that my first try with any of these tools produces fantastic output. It is often a “conversation with a robot sidekick” (as many of you have heard me say at my presentations) and requires you to nudge the model to create better and better output.

If the initial questions need refinement, you can provide the LLM with more specific instructions. For example:

“For each legal case mentioned in the outline, create a question that covers the main issue and possible outcomes, along with incorrect alternatives that are plausible but not correct according to the case facts.”

Replicating the Process

Students can replicate this process for other classes using the same prompt. The trick here is to stay as consistent as possible with the way that you structure and tag your outlines. It might feel like a lot of work on the front end to create 5+ examples, apply tags, etc. but remember that this is something that can be reused later! If you get a really good MCQ prompt, you could use it for every class outline that you have and continue to refine it going forward.

Gesundheit ChatGPT! Flu Snot Prompting?

Somewhat recently, during a webinar on generative AI, when the speaker Joe Regalia mentioned “flu snot” prompting, I was momentarily confused. What was that? Flu shot? Flu snot? I rewound a couple of times until I figured out he was saying “few shot” prompting. Looking for some examples of few-shot learning in the legal research/writing context, I Googled around and found his excellent article entitled ChatGPT and Legal Writing: The Perfect Union on the write.law website.

What Exactly is Few Shot Prompting?

It turns out that few-shot prompting is a technique for improving the performance of chatbots like ChatGPT by supplying a small set of examples (a few!) to guide its answers. This involves offering the AI several prompts with corresponding ideal responses, allowing it to generate more targeted and customized outputs. The purpose of this approach is to provide ChatGPT (or other generative AI) with explicit examples that reflect your desired tone, style, or level of detail.

Legal Research/Writing Prompting Advice from write.law

To learn more, I turned to Regalia’s detailed article which provides his comprehensive insights into legal research/writing prompts and illuminates various prompting strategies, including:

Zero Shot Learning/Prompting

This pertains to a language model’s ability to tackle a novel task without any examples, relying on its linguistic comprehension and pre-training insights. GPT excels at zero-shot tasks thanks to its robust capabilities. (Perhaps unsurprisingly, one-shot learning involves providing the system with just one example.)

Few-Shot Learning/Prompting

Few-shot learning involves feeding GPT several illustrative prompts and responses that echo your desired output. These guiding examples wield more influence than mere parameters because they offer GPT a clear directive of your expectations. Even a singular example can be transformative in guiding its responses.

As an example of few-shot learning, he explains that if you want ChatGPT to improve verbs in your sentence, you can supply a few examples in a prompt like the following:

My sentence: The court issued a ruling on the motion.
Better sentence: The court ruled on the motion.
My sentence: The deadline was not met by the lawyers.
Better sentence: The lawyers missed the deadline.
My sentence: The court’s ruling is not released. [now enter the sentence you actually want to improve, hit enter, and GPT will take over]
[GPT’s response] Better sentence: The court has not ruled yet [usually a much-improved version, but you may need to follow up with GPT a few times to get great results like this]
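If you find yourself reusing this pattern, a few lines of code can assemble the example pairs into a prompt for you; this is a plain-Python sketch of the technique with Regalia-style pairs, no API required, so you can paste the result into any chatbot.

# Illustrative sketch: build a few-shot prompt from (weak, better) sentence pairs.
def build_few_shot_prompt(pairs, target_sentence):
    lines = []
    for weak, better in pairs:
        lines.append(f"My sentence: {weak}")
        lines.append(f"Better sentence: {better}")
    lines.append(f"My sentence: {target_sentence}")
    lines.append("Better sentence:")  # the model completes from here
    return "\n".join(lines)

pairs = [
    ("The court issued a ruling on the motion.", "The court ruled on the motion."),
    ("The deadline was not met by the lawyers.", "The lawyers missed the deadline."),
]
print(build_few_shot_prompt(pairs, "The court's ruling is not released."))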

And Much More Prompting Advice!

Regalia’s website offers an abundance of insights, as you can see from the extensive list of topics covered in his article. Get background information on how generative AI systems operate, and dive into subjects like chain-of-thought prompting, assigning roles to ChatGPT, using parameters, and much more.

  • What Legal Writers Need to Know About GPT
  • Chat GPT’s Strengths Out of the Box
  • Chat GPTs Current Weaknesses and Limitations
  • Getting Started with Chat GPT
  • Prompt Engineering for Legal Writers
  • Legal Writing Prompts You Can Use with GPT
  • Using GPT to Improve Your Writing
  • More GPT Legal Writing Examples for Inspiration
  • Key GPT Terms to Know
  • Final Thoughts for GPT and Legal Writers

Experimenting With Few-Shot Prompting Before I Knew the Name

Back in June 2023, I first started dabbling in few-shot prompting without even knowing it had a name, after I came across a Forbes article titled Train ChatGPT To Write Like You In 5 Easy Steps. Intrigued, I wondered if I could use this technique to easily generate a profusion of blog posts in my own personal writing style!!

I followed the article’s instructions, copying and pasting a few of my favorite blog posts into ChatGPT to show it the tone and patterns in my writing that I wanted it to emulate. The result was interesting, but in my humble opinion, the chatty chatbot failed to pick up on my convoluted conversational (and to me, rather humorous) approach. They say that getting good results from generative AI is an iterative process, so I repeatedly tried to convey that I am funny using a paragraph from a blog post:

  • Prompt: Further information. I try to be funny. Here is an example: During a text exchange with my sister complaining about our family traits, I unthinkingly quipped, “You can’t take the I out of inertia.” Lurching sideways in my chair, I excitedly wondered if this was only an appropriate new motto for the imaginary Gotschall family crest, or whether I had finally spontaneously coined a new pithy saying!? Many times have I Googled, hoping in vain, and vainly hoping, to have hit upon a word combo unheard of in Internet history and clever/pithy enough to be considered a saying, only to find that there’s nothing new under the virtual sun.

Fail! Sadly, my efforts were to no avail; it just didn’t sound much like me… (However, that didn’t stop me from asking ChatGPT to write a conclusion for this blog post!)

Conclusion

For those keen to delve deeper into the intricacies of legal research, writing, and the intersection with AI, checking out the resources on write.law is a must. The platform offers a wealth of information, expert insights, and practical advice that can be immensely valuable for both novices and seasoned professionals.