
Coding with ChatGPT for Law Librarians

  • bentice1996
  • May 4
  • 10 min read

Updated: May 5

This article was written with the assistance of ChatGPT


Increasingly, law librarians are involved in supporting or conducting empirical legal research, building access-to-justice tools, managing digital collections, or analyzing legal texts. Being able to write simple scripts—even just to automate repetitive tasks—can be a powerful skill in a modern legal information environment. Thus, this article has two aims: (1) to show how generative AI tools like ChatGPT can be helpful for learning how to code while creating useful programs, and (2) to provide a tutorial on how to write an R script that retrieves case opinion texts from CourtListener using their APIs.


ChatGPT, while flawed, is a decent coding partner. Even if it doesn’t produce perfect, working scripts, debugging the code it provides is a good way to teach yourself. R has many packages for data manipulation, API access, and cleaning text, and RStudio, the most popular development environment for R, is also free and easy to install.


CourtListener is maintained by the non-profit Free Law Project and allows you to access a huge database of legal opinions via its APIs. CourtListener’s website provides documentation of its different APIs, including the Citation Lookup API, which lets you input a case citation and retrieve that case’s unique ID number. You can then use that ID to call the Opinions API, which returns the full-text opinion in various formats. This article walks you through writing a script that does exactly that.


Using ChatGPT to Kickstart Your Script

If you’re totally new to coding or suffering from blank page syndrome, prompting ChatGPT with a description of what you want your program to do is a helpful first step. The code it produces probably won't work right out of the gate, but it gives you something to learn from and build on.


A good starter prompt might be:

“I want to write an R script that takes a CSV file of case citations, uses the CourtListener API to retrieve full opinions, and saves them in a structured format.”

It’s helpful to be specific and tell ChatGPT what you already have (a CSV file) and what you want the output to look like (a dataframe or separate text files).


Once ChatGPT gives you a draft, you can start breaking the task into smaller steps and asking for help with each one.


Step-by-Step Example with Code

Below, I walk through several steps you might ask ChatGPT for help with and discuss the shortcomings of its output as examples of things to watch out for.


1. Load Required Packages

Here, ChatGPT might suggest you use httr or httr2 to call the API and readr or readxl to load your CSV file. I recommend httr2 because it has additional functions that httr does not. dplyr is also a really useful package for changing and cleaning your data once you have it loaded as a dataframe (you can also install the entire tidyverse, an excellent collection of packages that includes dplyr). Lastly, you can always instruct ChatGPT to use a specific package when prompting it to write code. Notice that ChatGPT forgot to recommend a package for handling JSON, the format in which your responses from most APIs will arrive.

library(httr2)
library(readr)
library(dplyr)
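A corrected loading block might add jsonlite, the JSON-parsing package used in the full script at the bottom of this article:

```r
# Load packages for API calls, CSV reading, data wrangling, and JSON parsing
library(httr2)    # building and performing API requests
library(readr)    # read_csv()
library(dplyr)    # data manipulation
library(jsonlite) # parsing the JSON responses the API returns
```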

2. Load Your Case Citations

Here, you can load your case citations into R by reading your CSV file into your script using its file path. If you don't know the file path, you can often right-click the file, select "Copy as path," and paste the result into your script. If you're unsure what the column containing your citations is titled, check with the colnames() function, then use that column name in your code. Lastly, you can always rename any column in your dataframe and use the new name instead.

citations_df <- read_csv("path/to/your/citations.csv")
colnames(citations_df)
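For instance, if colnames() reveals an inconveniently capitalized column title, you can rename it with dplyr; the column names below are hypothetical example data:

```r
library(dplyr)

# Suppose colnames(citations_df) returned "Citation" (capitalized);
# rename it to lowercase "citation" so later code can refer to it consistently
citations_df <- data.frame(Citation = c("410 U.S. 113", "347 U.S. 483"))
citations_df <- citations_df %>% rename(citation = Citation)
colnames(citations_df)
```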

3. Define a Function to Get Case ID Using Citation Lookup API

Here, ChatGPT has created a function that makes API calls to CourtListener. Notice three important things. First, ChatGPT has selected the wrong URL: CourtListener's current APIs are v4, not v3, and the correct endpoint is "citation-lookup," not "citations." This is a typical ChatGPT error, since it draws on older code examples written against earlier API versions; you can see the correct version in the full script at the bottom of this article. Second, you'll need your own API key (easy to obtain by making a free account on CourtListener) for the "req_headers" line of the function to work. Third, ChatGPT assumes you have already defined key variables like "citation" and "api_key," which you supply when you actually call the function.

get_case_id <- function(citation, api_key) {
  req <- request("https://www.courtlistener.com/api/rest/v3/citations/") %>%
    req_url_query(citation = citation) %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_perform()
  
  resp <- resp_body_json(req)
  
  if (length(resp$results) > 0) {
    return(resp$results[[1]]$resource_uri)
  } else {
    return(NA)
  }
}
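A corrected sketch of the same function, using the v4 "citation-lookup" endpoint and a POST request with the citation in the body, matching the full script at the end of this article. The response fields accessed here (each match's "clusters" list and its "id") follow the shape the full script relies on, but should be verified against CourtListener's live documentation:

```r
library(httr2)
library(dplyr)  # for the %>% pipe

# Corrected sketch: v4 citation-lookup endpoint, POST request,
# citation text passed in the request body
get_case_id <- function(citation, api_key) {
  resp <- request("https://www.courtlistener.com/api/rest/v4/citation-lookup/") %>%
    req_method("POST") %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_body_form(text = citation) %>%
    req_perform() %>%
    resp_body_json()

  # Each element of the response corresponds to one citation found in the
  # submitted text; matched cases appear under its "clusters" field
  if (length(resp) > 0 && length(resp[[1]]$clusters) > 0) {
    return(resp[[1]]$clusters[[1]]$id)
  }
  NA
}
```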

4. Define Function to Get Full Opinion Using the ID

Here, ChatGPT has created a function that takes each case's unique ID number and queries CourtListener's Opinions API. Like the previous function, it assumes you have already defined the important variables.

get_opinion_text <- function(opinion_url, api_key) {
  req <- request(opinion_url) %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_perform()
  
  resp <- resp_body_json(req)
  
  return(resp$plain_text)
}
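One thing to watch for: plain_text is empty for many opinions, with the text instead living in one of several HTML fields. The full script at the bottom of this article handles this with a preference order, which you could factor into a small helper like this sketch:

```r
# Pick the first non-empty text field from an opinion record,
# mirroring the preference order used in the full script below
pick_opinion_text <- function(resp) {
  fields <- c("html_with_citations", "html_columbia", "html_lawbox",
              "xml_harvard", "html_anon_2020", "html", "plain_text")
  for (f in fields) {
    val <- resp[[f]]
    if (!is.null(val) && val != "") return(val)
  }
  NA
}

# Example with a mock API response (hypothetical data):
mock <- list(plain_text = "", html = "<p>Opinion text</p>")
pick_opinion_text(mock)  # returns the html field
```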

5. Loop Through Citations and Get Text

Here, ChatGPT recommended using a loop or purrr::map to build a dataframe of results. The snippet below uses rowwise() to apply the two functions to each row, creating new columns populated with information from the API calls alongside your existing CSV data. I recommend being very specific when prompting ChatGPT about what data you would like to include in your final dataframe.

opinions_df <- citations_df %>%
  rowwise() %>%
  mutate(
    case_id_url = get_case_id(citation, api_key),
    opinion_text = if (!is.na(case_id_url)) get_opinion_text(case_id_url, api_key) else NA
  )
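The same step can be written with purrr::map_chr, the alternative ChatGPT mentioned. The sketch below uses stand-in lookup functions and placeholder data so it runs on its own; in practice, substitute the real get_case_id() and get_opinion_text() from steps 3 and 4:

```r
library(dplyr)
library(purrr)

# Stub lookup functions so this sketch runs without network access;
# replace these with the real functions defined in the earlier steps
get_case_id <- function(citation, api_key) paste0("id-", citation)
get_opinion_text <- function(case_id_url, api_key) paste("text for", case_id_url)

api_key <- "demo-key"  # placeholder
citations_df <- data.frame(citation = c("410 U.S. 113", "347 U.S. 483"))

opinions_df <- citations_df %>%
  mutate(
    case_id_url = map_chr(citation, ~ get_case_id(.x, api_key)),
    opinion_text = map_chr(case_id_url,
                           ~ if (is.na(.x)) NA_character_
                             else get_opinion_text(.x, api_key))
  )
```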

6. Save Your Results

Here, ChatGPT has opted for "write_csv" from the "readr" package to save the final dataframe as a new CSV file, though you could also use "write.csv" from base R. Note that ChatGPT has carried the variable name ("opinions_df") over from the previous step. This is an example of how ChatGPT can help preserve consistency across your code, which is especially useful when debugging.

write_csv(opinions_df, "retrieved_opinions.csv")
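If you would rather save each opinion as a separate text file (the other output format suggested in the starter prompt), here is a sketch; the file-naming scheme is illustrative:

```r
# Write each opinion to its own .txt file, named by a sanitized citation
save_opinions_as_txt <- function(df, out_dir = "opinions") {
  dir.create(out_dir, showWarnings = FALSE)
  for (i in seq_len(nrow(df))) {
    if (is.na(df$opinion_text[i])) next  # skip cases with no retrieved text
    safe_name <- gsub("[^A-Za-z0-9]+", "_", df$citation[i])
    writeLines(df$opinion_text[i], file.path(out_dir, paste0(safe_name, ".txt")))
  }
}

# Example usage with a small dataframe
demo_df <- data.frame(citation = "410 U.S. 113", opinion_text = "Opinion text here")
save_opinions_as_txt(demo_df, out_dir = tempdir())
```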

Conclusion

Hopefully, this article has helped you figure out where to start when using ChatGPT to help you code, and has shown how useful CourtListener can be despite ChatGPT's shortcomings.


Below, you can see what my final code ended up looking like, and you'll notice that it's much more involved than what ChatGPT provided in the above snippets. So, a final tip: don't be discouraged by lots of error messages. Coding can be tedious and time-consuming, but the initial investment in creating working code can enable you to provide new forms of assistance to your patrons.


P.S. Here's a tutorial for using ChatGPT to help you debug your code as you work to address the inevitable error messages you receive.


# Install required packages if not already installed
if (!requireNamespace("httr2", quietly = TRUE)) {
  install.packages("httr2")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (!requireNamespace("stringdist", quietly = TRUE)) {
  install.packages("stringdist")
}
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
# Load required packages
library(httr2)
library(jsonlite)
library(dplyr)
library(stringdist)
library(stringr)  # needed for str_squish() and str_starts() used below
# Base URL
base_url <- "https://www.courtlistener.com/api/rest/v4/citation-lookup/"
# Set your API Token
Sys.setenv(api_key = "insert your key here")
# Read in CSV file of cases
csv_cases <- read.csv("insert your own file path", header=TRUE, stringsAsFactors=FALSE)
# Turn CSV into a dataframe
cases_1979_to_1989 <- as.data.frame(csv_cases)
# Extract citations from dataframe
citations <- cases_1979_to_1989$Citation
# Initialize a list to store API responses
api_responses <- list()
# Make API call for each case citation with debugging
for (i in 1:length(citations)) {
  citation <- as.character(citations[i])
  
  # Create the request for the specific citation
  request <- request(base_url) %>%
    req_method("POST") %>%
    req_headers(
      "Authorization" = paste("Token", Sys.getenv("api_key"))  # Set the API key in headers
    ) %>%
    req_body_form(
      text = citation
      )  # Pass the citation to the API request
  
  # Print the request to check if it’s correctly formatted
  print(request)
  
  # Dry run the request to check for issues without performing it
  request %>% req_dry_run()
  
  # Perform the request and capture the response
  response <- request %>% req_perform()
  
  # Check if the response is successful
  if (resp_status(response) == 200) {
    # Parse the JSON response into R objects
    data <- response %>% resp_body_string() %>% fromJSON(flatten = TRUE)
    
    # Store the data in the list
    api_responses[[i]] <- data
  } else {
    # Handle errors by adding a placeholder or logging
    api_responses[[i]] <- list(error = paste("Failed for citation:", citation, "with status code:", resp_status(response)))
  }
  
  # Optional: Add a short pause to avoid hitting API rate limits
  Sys.sleep(2)
}
# Combine the list of responses into a dataframe if possible
api_responses_df <- do.call(rbind, lapply(api_responses, as.data.frame))
# View or save the dataframe for further analysis
View(api_responses_df)
View(api_responses_df[[7]][[1]][["id"]])
# Opinions url
opinions_url <- "https://www.courtlistener.com/api/rest/v4/opinions/"
# Initialize an empty dataframe to store API responses
cases_api_responses_df <- data.frame(case_id = character(),
                                     response_data = I(list()),  # Use I() to store lists in a dataframe column
                                     stringsAsFactors = FALSE)
# Loop through each case in api_responses_df[[7]]
for (i in 1:length(api_responses_df[[7]])) {
  
  # Extract the list of IDs for the current case
  case_ids <- api_responses_df[[7]][[i]][["id"]]
  
  # Check if case_ids is empty or NULL
  if (is.null(case_ids) || length(case_ids) == 0) {
    print(paste("No case IDs found for index:", i))
    next  # Skip to the next iteration if no IDs are found
  }
  
  # Iterate over each ID in the list
  for (j in 1:length(case_ids)) {
    
    case_id <- case_ids[j]
    
    # Check if case_id is empty
    if (case_id == "") {
      print(paste("Empty case ID found at index:", j, "for index:", i))
      next  # Skip this iteration if case_id is empty
    }
    
    print(paste("Fetching opinion for Case ID:", case_id))
    
    # Check if the opinion exists with a HEAD request
    head_request <- request(opinions_url) %>%
      req_method("HEAD") %>%
      req_headers(
        "Authorization" = paste("Token", Sys.getenv("api_key"))
      ) %>%
      req_url_path_append(case_id)
    
    # Perform the HEAD request and check if the resource exists
    head_response <- req_perform(head_request)
    if (resp_status(head_response) == 404) {
      print(paste("Opinion not found for Case ID:", case_id))
      next  # Skip if opinion not found
    }
    
    # Fetch the opinion with GET request
    response <- request(opinions_url) %>%
      req_method("GET") %>%
      req_headers(
        "Authorization" = paste("Token", Sys.getenv("api_key"))
      ) %>%
      req_url_path_append(case_id) %>%
      req_perform()
    
    # Store the raw API response directly in the dataframe
    if (resp_status(response) == 200) {
      # Parse JSON response and store the entire response
      data <- response %>% resp_body_string() %>% fromJSON()
      
      # Append to the dataframe
      cases_api_responses_df <- rbind(cases_api_responses_df, data.frame(
        case_id = case_id,
        response_data = I(list(data)),  # Store the entire response as a list
        stringsAsFactors = FALSE
      ))
      
      print(paste("Stored response for Case ID:", case_id))
    } else {
      print(paste("Failed request for Case ID:", case_id, "Status:", resp_status(response)))
    }
  }
}
# Check if any data was added
if (nrow(cases_api_responses_df) > 0) {
  print("API responses successfully stored in dataframe.")
  View(cases_api_responses_df)
} else {
  print("No data was added to the dataframe.")
}
# Create an empty dataframe to store the extracted information with case_id only
extracted_opinions_df <- data.frame(case_id = character(),
                                    opinion_text = character(),
                                    stringsAsFactors = FALSE)
# Loop through each response in cases_api_responses_df
for (i in 1:nrow(cases_api_responses_df)) {
  
  # Extract the case ID and response data
  case_id <- cases_api_responses_df$case_id[i]
  response_data <- cases_api_responses_df$response_data[[i]]  # Get the list from the dataframe
  
  # Set variables for opinion text fields
  html_with_citations <- response_data$html_with_citations
  html_columbia <- response_data$html_columbia
  html_lawbox <- response_data$html_lawbox
  xml_harvard <- response_data$xml_harvard
  html_anon_2020 <- response_data$html_anon_2020
  html <- response_data$html
  plain_text <- response_data$plain_text
  
  # Initialize an empty opinion_text variable
  opinion_text <- NA
  
  # Extract the opinion text based on preference order
  if (!is.null(html_with_citations) && html_with_citations != "") {
    opinion_text <- html_with_citations
  } else if (!is.null(html_columbia) && html_columbia != "") {
    opinion_text <- html_columbia
  } else if (!is.null(html_lawbox) && html_lawbox != "") {
    opinion_text <- html_lawbox
  } else if (!is.null(xml_harvard) && xml_harvard != "") {
    opinion_text <- xml_harvard
  } else if (!is.null(html_anon_2020) && html_anon_2020 != "") {
    opinion_text <- html_anon_2020
  } else if (!is.null(html) && html != "") {
    opinion_text <- html
  } else if (!is.null(plain_text) && plain_text != "") {
    opinion_text <- plain_text
  }
  
  # Add the extracted data to the dataframe if opinion_text is not NA
  if (!is.na(opinion_text) && opinion_text != "") {
    extracted_opinions_df <- rbind(extracted_opinions_df, data.frame(case_id = case_id, opinion_text = opinion_text, stringsAsFactors = FALSE))
  }
}
# Check if any data was extracted
if (nrow(extracted_opinions_df) > 0) {
  print("Data successfully extracted into extracted_opinions_df.")
  View(extracted_opinions_df)  # View the extracted opinions dataframe
} else {
  print("No data was extracted.")
}
write.csv(extracted_opinions_df, 
          file = "insert your own file path here", 
          row.names = FALSE, 
          na = "")
# Read in CSV file of cases
csv_extracted_opinions <- read.csv("insert your own file path here", 
                      header=TRUE, stringsAsFactors=FALSE)
# Turn CSV into a dataframe
extracted_opinions_1979_to_1989_df <- as.data.frame(csv_extracted_opinions)
# Initialize vectors to store id, case_name_full, and date_filed
case_ids <- c()         # Vector to store ids
case_names_full <- c()  # Vector to store case names
date_fileds <- c()      # Vector to store date filed
# Loop through api_responses_df[[7]] and extract id, case_name_full, and date_filed
for (i in seq_along(api_responses_df[[7]])) {
  current_entry <- api_responses_df[[7]][[i]]
  
  # Check if id, case_name_full, and date_filed are present and not NULL
  if (!is.null(current_entry$id) && !is.null(current_entry$case_name_full) && !is.null(current_entry$date_filed)) {
    case_ids <- c(case_ids, as.character(current_entry$id))                # Append id
    case_names_full <- c(case_names_full, current_entry$case_name_full)   # Append case_name_full
    date_fileds <- c(date_fileds, as.character(current_entry$date_filed))  # Append date_filed
  }
}
# Final lengths of vectors for debugging
cat("Final lengths of vectors:\n")
cat("case_ids:", length(case_ids), "\n")
cat("case_names_full:", length(case_names_full), "\n")
cat("date_fileds:", length(date_fileds), "\n")
# Ensure all vectors have the same length before creating the data frame
if (length(case_ids) == length(case_names_full) && length(case_ids) == length(date_fileds)) {
  
  # Create a new dataframe with id, case_name_full, and date_filed columns
  case_name_df <- data.frame(
    case_id = case_ids,
    case_name_full = case_names_full,
    date_filed = date_fileds,  # Include date_filed
    stringsAsFactors = FALSE
  )
  
  # Convert case_id in case_name_df to integer
  case_name_df$case_id <- as.integer(case_name_df$case_id)
  
  # Now perform the left join
  extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
    left_join(case_name_df, by = "case_id")  # Add columns from case_name_df to extracted_opinions_1979_to_1989_df
  
  # View the updated dataframe with case names and date filed
  View(extracted_opinions_1979_to_1989_df)
} else {
  cat("Error: Vectors have different lengths. Cannot create data frame.\n")
}
# Remove the specified columns
extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
  select(-case_name_full.x, -case_name_full.y, -date_filed.y)
# Rename the column
extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
  rename(date_filed = date_filed.x)
# Remove duplicates by keeping the case with the largest opinion_text length
cleaned_cases_df <- extracted_opinions_1979_to_1989_df %>%
  mutate(
    case_name_full = str_squish(tolower(case_name_full)), # Standardize case names
    opinion_length = nchar(as.character(opinion_text)),  # Calculate opinion length
    is_majority_opinion = str_starts(opinion_text, "<opinion type=\"majority\">") # Check for majority opinion
  ) %>%
  filter(opinion_length >= 250) %>% # Remove cases with opinion length < 250
  group_by(case_name_full) %>%
  slice_max(order_by = opinion_length, n = 1) %>% # Remove duplicates based on case_name_full
  ungroup() %>%
  group_by(date_filed) %>%
  arrange(is_majority_opinion) %>% # Prioritize non-majority opinions
  filter(!(is_majority_opinion & n() > 1 & row_number() > 1)) %>% # Remove majority opinion if duplicate date
  ungroup() %>%
  select(-opinion_length, -is_majority_opinion) # Remove temporary columns 
# Save the cleaned case data as a CSV
write.csv(cleaned_cases_df, 
          file = "insert your own file path here", 
          row.names = FALSE, 
          na = "")

 
 
 
