
Coding with ChatGPT for Law Librarians

  • bentice1996
  • May 4
  • 10 min read

Updated: May 5

This article was written with the assistance of ChatGPT


Increasingly, law librarians are involved in supporting or conducting empirical legal research, building access-to-justice tools, managing digital collections, or analyzing legal texts. Being able to write simple scripts—even just to automate repetitive tasks—can be a powerful skill in a modern legal information environment. Thus, this article has two aims: (1) to show how generative AI tools like ChatGPT can be helpful for learning how to code while creating useful programs, and (2) to provide a tutorial on how to write an R script that retrieves case opinion texts from CourtListener using their APIs.


ChatGPT, while flawed, is a decent coding partner. Even if it doesn’t produce perfect, working scripts, debugging the code it provides is a good way to teach yourself. R has many packages for data manipulation, API access, and cleaning text, and RStudio, the most popular development environment for R, is also free and easy to install.


CourtListener is maintained by the non-profit Free Law Project and allows you to access a huge database of legal opinions via its APIs. CourtListener’s website provides documentation of its different APIs, including the Citation Lookup API, which lets you input a case citation and retrieve that case’s unique ID number. You can then use that ID to call the Opinions API, which returns the full-text opinion in various formats. This article walks you through writing a script that does exactly that.


Using ChatGPT to Kickstart Your Script

If you’re totally new to coding or suffering from blank page syndrome, prompting ChatGPT with a description of what you want your program to do is a helpful first step. The code it produces probably won't work right out of the gate, but it gives you something to learn from and build on.


A good starter prompt might be:

“I want to write an R script that takes a CSV file of case citations, uses the CourtListener API to retrieve full opinions, and saves them in a structured format.”

It’s helpful to be specific and tell ChatGPT what you already have (a CSV file) and what you want the output to look like (a dataframe or separate text files).


Once ChatGPT gives you a draft, you can start breaking the task into smaller steps and asking for help with each one.


Step-by-Step Example with Code

Below, I walk through several steps you might ask ChatGPT for help with and discuss the shortcomings of its output as examples of things to watch out for.


1. Load Required Packages

Here, ChatGPT might suggest you use httr or httr2 to call the API and readr or readxl to load your CSV file. I recommend httr2 because it has additional functions that httr does not. dplyr is also a really useful package for changing and cleaning your data once you have it loaded as a dataframe (you can also install the entire tidyverse, an excellent collection of packages that includes dplyr). Lastly, you can always instruct ChatGPT to use a specific package when prompting it to write code. Notice that ChatGPT forgot to recommend a package for handling JSON, the format in which your responses from most APIs will arrive.

library(httr2)
library(readr)
library(dplyr)
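A corrected loading block might add jsonlite, the JSON-parsing package used in the full script at the bottom of this article:

```r
# Load packages for API calls, CSV reading, data wrangling, and JSON parsing
library(httr2)    # building and performing API requests
library(readr)    # read_csv()
library(dplyr)    # data manipulation
library(jsonlite) # parsing the JSON responses the API returns
```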

2. Load Your Case Citations

Here, you can load your case citations into R by reading your CSV file into your script using its file path. If you don't know the file path, you can often right-click the file, select "Copy as path," and paste the result into your script. If you're unsure what the column containing your citations is titled, check with the colnames() function, then use that column name in your code. Lastly, you can always rename any column in your dataframe and use the new name instead.

citations_df <- read_csv("path/to/your/citations.csv")
colnames(citations_df)
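For instance, if colnames() reveals an inconveniently capitalized column title, you can rename it with dplyr; the column names below are hypothetical example data:

```r
library(dplyr)

# Suppose colnames(citations_df) returned "Citation" (capitalized);
# rename it to lowercase "citation" so later code can refer to it consistently
citations_df <- data.frame(Citation = c("410 U.S. 113", "347 U.S. 483"))
citations_df <- citations_df %>% rename(citation = Citation)
colnames(citations_df)
```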

3. Define a Function to Get Case ID Using Citation Lookup API

Here, ChatGPT has created a function that makes API calls to CourtListener. Notice three important things. First, ChatGPT has selected the wrong URL: CourtListener's current APIs are v4, not v3, and the correct endpoint is "citation-lookup," not "citations." This is a typical ChatGPT error, since it draws on older code examples written against earlier API versions; you can see the correct version in the full script at the bottom of this article. Second, you'll need your own API key (easy to obtain by making a free account on CourtListener) for the "req_headers" line of the function to work. Third, ChatGPT assumes you have already defined key variables like "citation" and "api_key," which you supply when you actually call the function.

get_case_id <- function(citation, api_key) {
  req <- request("https://www.courtlistener.com/api/rest/v3/citations/") %>%
    req_url_query(citation = citation) %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_perform()
  
  resp <- resp_body_json(req)
  
  if (length(resp$results) > 0) {
    return(resp$results[[1]]$resource_uri)
  } else {
    return(NA)
  }
}
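A corrected sketch of the same function, using the v4 "citation-lookup" endpoint and a POST request with the citation in the body, matching the full script at the end of this article. The response fields accessed here (each match's "clusters" list and its "id") follow the shape the full script relies on, but should be verified against CourtListener's live documentation:

```r
library(httr2)
library(dplyr)  # for the %>% pipe

# Corrected sketch: v4 citation-lookup endpoint, POST request,
# citation text passed in the request body
get_case_id <- function(citation, api_key) {
  resp <- request("https://www.courtlistener.com/api/rest/v4/citation-lookup/") %>%
    req_method("POST") %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_body_form(text = citation) %>%
    req_perform() %>%
    resp_body_json()

  # Each element of the response corresponds to one citation found in the
  # submitted text; matched cases appear under its "clusters" field
  if (length(resp) > 0 && length(resp[[1]]$clusters) > 0) {
    return(resp[[1]]$clusters[[1]]$id)
  }
  NA
}
```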

4. Define Function to Get Full Opinion Using the ID

Here, ChatGPT has created a function that takes each case's unique ID number and queries CourtListener's Opinions API. Like the previous function, it assumes you have already defined the important variables.

get_opinion_text <- function(opinion_url, api_key) {
  req <- request(opinion_url) %>%
    req_headers(Authorization = paste("Token", api_key)) %>%
    req_perform()
  
  resp <- resp_body_json(req)
  
  return(resp$plain_text)
}
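One thing to watch for: plain_text is empty for many opinions, with the text instead living in one of several HTML fields. The full script at the bottom of this article handles this with a preference order, which you could factor into a small helper like this sketch:

```r
# Pick the first non-empty text field from an opinion record,
# mirroring the preference order used in the full script below
pick_opinion_text <- function(resp) {
  fields <- c("html_with_citations", "html_columbia", "html_lawbox",
              "xml_harvard", "html_anon_2020", "html", "plain_text")
  for (f in fields) {
    val <- resp[[f]]
    if (!is.null(val) && val != "") return(val)
  }
  NA
}

# Example with a mock API response (hypothetical data):
mock <- list(plain_text = "", html = "<p>Opinion text</p>")
pick_opinion_text(mock)  # returns the html field
```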

5. Loop Through Citations and Get Text

Here, ChatGPT recommended using a loop or purrr::map to build a dataframe of results. The snippet below uses rowwise() to apply the two functions to each row, creating new columns populated with information from the API calls alongside your existing CSV data. I recommend being very specific when prompting ChatGPT about what data you would like to include in your final dataframe.

opinions_df <- citations_df %>%
  rowwise() %>%
  mutate(
    case_id_url = get_case_id(citation, api_key),
    opinion_text = if (!is.na(case_id_url)) get_opinion_text(case_id_url, api_key) else NA
  )
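The same step can be written with purrr::map_chr, the alternative ChatGPT mentioned. The sketch below uses stand-in lookup functions and placeholder data so it runs on its own; in practice, substitute the real get_case_id() and get_opinion_text() from steps 3 and 4:

```r
library(dplyr)
library(purrr)

# Stub lookup functions so this sketch runs without network access;
# replace these with the real functions defined in the earlier steps
get_case_id <- function(citation, api_key) paste0("id-", citation)
get_opinion_text <- function(case_id_url, api_key) paste("text for", case_id_url)

api_key <- "demo-key"  # placeholder
citations_df <- data.frame(citation = c("410 U.S. 113", "347 U.S. 483"))

opinions_df <- citations_df %>%
  mutate(
    case_id_url = map_chr(citation, ~ get_case_id(.x, api_key)),
    opinion_text = map_chr(case_id_url,
                           ~ if (is.na(.x)) NA_character_
                             else get_opinion_text(.x, api_key))
  )
```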

6. Save Your Results

Here, ChatGPT has opted for "write_csv" from the "readr" package to save the final dataframe as a new CSV file, though you could also use "write.csv" from base R. Note that ChatGPT has carried the variable name ("opinions_df") over from the previous step. This is an example of how ChatGPT can help preserve consistency across your code, which is especially useful when debugging.

write_csv(opinions_df, "retrieved_opinions.csv")
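If you would rather save each opinion as a separate text file (the other output format suggested in the starter prompt), here is a sketch; the file-naming scheme is illustrative:

```r
# Write each opinion to its own .txt file, named by a sanitized citation
save_opinions_as_txt <- function(df, out_dir = "opinions") {
  dir.create(out_dir, showWarnings = FALSE)
  for (i in seq_len(nrow(df))) {
    if (is.na(df$opinion_text[i])) next  # skip cases with no retrieved text
    safe_name <- gsub("[^A-Za-z0-9]+", "_", df$citation[i])
    writeLines(df$opinion_text[i], file.path(out_dir, paste0(safe_name, ".txt")))
  }
}

# Example usage with a small dataframe
demo_df <- data.frame(citation = "410 U.S. 113", opinion_text = "Opinion text here")
save_opinions_as_txt(demo_df, out_dir = tempdir())
```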

Conclusion

Hopefully, this article has helped you figure out where to start when using ChatGPT to help you code, and has shown how useful CourtListener can be despite ChatGPT's shortcomings.


Below, you can see what my final code ended up looking like, and you'll notice that it's much more involved than what ChatGPT provided in the above snippets. So, a final tip: don't be discouraged by lots of error messages. Coding can be tedious and time-consuming, but the initial investment in creating working code can enable you to provide new forms of assistance to your patrons.


P.S. Here's a tutorial for using ChatGPT to help you debug your code as you work to address the inevitable error messages you receive.


# Install required packages if not already installed
if (!requireNamespace("httr2", quietly = TRUE)) {
  install.packages("httr2")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (!requireNamespace("stringdist", quietly = TRUE)) {
  install.packages("stringdist")
}
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
# Load required packages
library(httr2)
library(jsonlite)
library(dplyr)
library(stringdist)
library(stringr)  # needed for str_squish() and str_starts() used below
# Base URL
base_url <- "https://www.courtlistener.com/api/rest/v4/citation-lookup/"
# Set your API Token
Sys.setenv(api_key = "insert your key here")
# Read in CSV file of cases
csv_cases <- read.csv("insert your own file path", header=TRUE, stringsAsFactors=FALSE)
# Turn CSV into a dataframe
cases_1979_to_1989 <- as.data.frame(csv_cases)
# Extract citations from dataframe
citations <- cases_1979_to_1989$Citation
# Initialize a list to store API responses
api_responses <- list()
# Make API call for each case citation with debugging
for (i in 1:length(citations)) {
  citation <- as.character(citations[i])
  
  # Create the request for the specific citation
  request <- request(base_url) %>%
    req_method("POST") %>%
    req_headers(
      "Authorization" = paste("Token", Sys.getenv("api_key"))  # Set the API key in headers
    ) %>%
    req_body_form(
      text = citation
      )  # Pass the citation to the API request
  
  # Print the request to check if it’s correctly formatted
  print(request)
  
  # Dry run the request to check for issues without performing it
  request %>% req_dry_run()
  
  # Perform the request and capture the response
  response <- request %>% req_perform()
  
  # Check if the response is successful
  if (resp_status(response) == 200) {
    # Parse the JSON response into R objects
    data <- response %>% resp_body_string() %>% fromJSON(flatten = TRUE)
    
    # Store the data in the list
    api_responses[[i]] <- data
  } else {
    # Handle errors by adding a placeholder or logging
    api_responses[[i]] <- list(error = paste("Failed for citation:", citation, "with status code:", resp_status(response)))
  }
  
  # Optional: Add a short pause to avoid hitting API rate limits
  Sys.sleep(2)
}
# Combine the list of responses into a dataframe if possible
api_responses_df <- do.call(rbind, lapply(api_responses, as.data.frame))
# View or save the dataframe for further analysis
View(api_responses_df)
View(api_responses_df[[7]][[1]][["id"]])
# Opinions url
opinions_url <- "https://www.courtlistener.com/api/rest/v4/opinions/"
# Initialize an empty dataframe to store API responses
cases_api_responses_df <- data.frame(case_id = character(),
                                     response_data = I(list()),  # Use I() to store lists in a dataframe column
                                     stringsAsFactors = FALSE)
# Loop through each case in api_responses_df[[7]]
for (i in 1:length(api_responses_df[[7]])) {
  
  # Extract the list of IDs for the current case
  case_ids <- api_responses_df[[7]][[i]][["id"]]
  
  # Check if case_ids is empty or NULL
  if (is.null(case_ids) || length(case_ids) == 0) {
    print(paste("No case IDs found for index:", i))
    next  # Skip to the next iteration if no IDs are found
  }
  
  # Iterate over each ID in the list
  for (j in 1:length(case_ids)) {
    
    case_id <- case_ids[j]
    
    # Check if case_id is empty
    if (case_id == "") {
      print(paste("Empty case ID found at index:", j, "for index:", i))
      next  # Skip this iteration if case_id is empty
    }
    
    print(paste("Fetching opinion for Case ID:", case_id))
    
    # Check if the opinion exists with a HEAD request
    head_request <- request(opinions_url) %>%
      req_method("HEAD") %>%
      req_headers(
        "Authorization" = paste("Token", Sys.getenv("api_key"))
      ) %>%
      req_url_path_append(case_id)
    
    # Perform the HEAD request and check if the resource exists
    head_response <- req_perform(head_request)
    if (resp_status(head_response) == 404) {
      print(paste("Opinion not found for Case ID:", case_id))
      next  # Skip if opinion not found
    }
    
    # Fetch the opinion with GET request
    response <- request(opinions_url) %>%
      req_method("GET") %>%
      req_headers(
        "Authorization" = paste("Token", Sys.getenv("api_key"))
      ) %>%
      req_url_path_append(case_id) %>%
      req_perform()
    
    # Store the raw API response directly in the dataframe
    if (resp_status(response) == 200) {
      # Parse JSON response and store the entire response
      data <- response %>% resp_body_string() %>% fromJSON()
      
      # Append to the dataframe
      cases_api_responses_df <- rbind(cases_api_responses_df, data.frame(
        case_id = case_id,
        response_data = I(list(data)),  # Store the entire response as a list
        stringsAsFactors = FALSE
      ))
      
      print(paste("Stored response for Case ID:", case_id))
    } else {
      print(paste("Failed request for Case ID:", case_id, "Status:", resp_status(response)))
    }
  }
}
# Check if any data was added
if (nrow(cases_api_responses_df) > 0) {
  print("API responses successfully stored in dataframe.")
  View(cases_api_responses_df)
} else {
  print("No data was added to the dataframe.")
}
# Create an empty dataframe to store the extracted information with case_id only
extracted_opinions_df <- data.frame(case_id = character(),
                                    opinion_text = character(),
                                    stringsAsFactors = FALSE)
# Loop through each response in cases_api_responses_df
for (i in 1:nrow(cases_api_responses_df)) {
  
  # Extract the case ID and response data
  case_id <- cases_api_responses_df$case_id[i]
  response_data <- cases_api_responses_df$response_data[[i]]  # Get the list from the dataframe
  
  # Set variables for opinion text fields
  html_with_citations <- response_data$html_with_citations
  html_columbia <- response_data$html_columbia
  html_lawbox <- response_data$html_lawbox
  xml_harvard <- response_data$xml_harvard
  html_anon_2020 <- response_data$html_anon_2020
  html <- response_data$html
  plain_text <- response_data$plain_text
  
  # Initialize an empty opinion_text variable
  opinion_text <- NA
  
  # Extract the opinion text based on preference order
  if (!is.null(html_with_citations) && html_with_citations != "") {
    opinion_text <- html_with_citations
  } else if (!is.null(html_columbia) && html_columbia != "") {
    opinion_text <- html_columbia
  } else if (!is.null(html_lawbox) && html_lawbox != "") {
    opinion_text <- html_lawbox
  } else if (!is.null(xml_harvard) && xml_harvard != "") {
    opinion_text <- xml_harvard
  } else if (!is.null(html_anon_2020) && html_anon_2020 != "") {
    opinion_text <- html_anon_2020
  } else if (!is.null(html) && html != "") {
    opinion_text <- html
  } else if (!is.null(plain_text) && plain_text != "") {
    opinion_text <- plain_text
  }
  
  # Add the extracted data to the dataframe if opinion_text is not NA
  if (!is.na(opinion_text) && opinion_text != "") {
    extracted_opinions_df <- rbind(extracted_opinions_df, data.frame(case_id = case_id, opinion_text = opinion_text, stringsAsFactors = FALSE))
  }
}
# Check if any data was extracted
if (nrow(extracted_opinions_df) > 0) {
  print("Data successfully extracted into extracted_opinions_df.")
  View(extracted_opinions_df)  # View the extracted opinions dataframe
} else {
  print("No data was extracted.")
}
write.csv(extracted_opinions_df, 
          file = "insert your own file path here", 
          row.names = FALSE, 
          na = "")
# Read in CSV file of cases
csv_extracted_opinions <- read.csv("insert your own file path here", 
                      header=TRUE, stringsAsFactors=FALSE)
# Turn CSV into a dataframe
extracted_opinions_1979_to_1989_df <- as.data.frame(csv_extracted_opinions)
# Initialize vectors to store id, case_name_full, and date_filed
case_ids <- c()         # Vector to store ids
case_names_full <- c()  # Vector to store case names
date_fileds <- c()      # Vector to store date filed
# Loop through api_responses_df[[7]] and extract id, case_name_full, and date_filed
for (i in seq_along(api_responses_df[[7]])) {
  current_entry <- api_responses_df[[7]][[i]]
  
  # Check if id, case_name_full, and date_filed are present and not NULL
  if (!is.null(current_entry$id) && !is.null(current_entry$case_name_full) && !is.null(current_entry$date_filed)) {
    case_ids <- c(case_ids, as.character(current_entry$id))                # Append id
    case_names_full <- c(case_names_full, current_entry$case_name_full)   # Append case_name_full
    date_fileds <- c(date_fileds, as.character(current_entry$date_filed))  # Append date_filed
  }
}
# Final lengths of vectors for debugging
cat("Final lengths of vectors:\n")
cat("case_ids:", length(case_ids), "\n")
cat("case_names_full:", length(case_names_full), "\n")
cat("date_fileds:", length(date_fileds), "\n")
# Ensure all vectors have the same length before creating the data frame
if (length(case_ids) == length(case_names_full) && length(case_ids) == length(date_fileds)) {
  
  # Create a new dataframe with id, case_name_full, and date_filed columns
  case_name_df <- data.frame(
    case_id = case_ids,
    case_name_full = case_names_full,
    date_filed = date_fileds,  # Include date_filed
    stringsAsFactors = FALSE
  )
  
  # Convert case_id in case_name_df to integer
  case_name_df$case_id <- as.integer(case_name_df$case_id)
  
  # Now perform the left join
  extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
    left_join(case_name_df, by = "case_id")  # Add columns from case_name_df to extracted_opinions_1979_to_1989_df
  
  # View the updated dataframe with case names and date filed
  View(extracted_opinions_1979_to_1989_df)
} else {
  cat("Error: Vectors have different lengths. Cannot create data frame.\n")
}
# Remove the specified columns
extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
  select(-case_name_full.x, -case_name_full.y, -date_filed.y)
# Rename the column
extracted_opinions_1979_to_1989_df <- extracted_opinions_1979_to_1989_df %>%
  rename(date_filed = date_filed.x)
# Remove duplicates by keeping the case with the largest opinion_text length
cleaned_cases_df <- extracted_opinions_1979_to_1989_df %>%
  mutate(
    case_name_full = str_squish(tolower(case_name_full)), # Standardize case names
    opinion_length = nchar(as.character(opinion_text)),  # Calculate opinion length
    is_majority_opinion = str_starts(opinion_text, "<opinion type=\"majority\">") # Check for majority opinion
  ) %>%
  filter(opinion_length >= 250) %>% # Remove cases with opinion length < 250
  group_by(case_name_full) %>%
  slice_max(order_by = opinion_length, n = 1) %>% # Remove duplicates based on case_name_full
  ungroup() %>%
  group_by(date_filed) %>%
  arrange(is_majority_opinion) %>% # Prioritize non-majority opinions
  filter(!(is_majority_opinion & n() > 1 & row_number() > 1)) %>% # Remove majority opinion if duplicate date
  ungroup() %>%
  select(-opinion_length, -is_majority_opinion) # Remove temporary columns 
# Save the cleaned case data as a CSV
write.csv(cleaned_cases_df, 
          file = "insert your own file path here", 
          row.names = FALSE, 
          na = "")

 
 
 
