How to Scrape FDA Drug Approval Data with Python

Personal Update:

Before we embark on today’s tutorial, I wanted to share a personal update with you that sheds light on my recent hiatus from blogging. In the past few months, I’ve been immersed in the world of motherhood, cherishing precious moments with my newborn, Ellie.

As I transition back to my role in the biotech world, I find myself navigating the delicate balance of work and motherhood. It’s been a journey filled with new challenges, learning experiences, and a deep sense of fulfillment.

As I continue to share coding insights related to data science, visualization and Cheminformatics, I look forward to weaving in more stories about the joys and challenges of being a working mom. I also encourage and welcome others to share their experiences, thoughts, and suggestions.

Now, let’s jump into the tutorial!

Introduction

Drug approval data from the U.S. Food and Drug Administration (FDA) is a valuable resource for researchers, pharmaceutical professionals, and enthusiasts. In this tutorial, we will explore how to scrape FDA drug approval data with Python. Our objective is to retrieve information about newly approved drugs for a specified range of years. Additionally, I will demonstrate how to extract label PDF links associated with each drug. We’ll be using Python, along with the requests, BeautifulSoup, and pandas libraries.

Webpage Structure

The FDA provides a dedicated page for each year, summarizing the newly approved drugs (e.g. Novel Drug Approvals for 2015).

These pages typically contain tables with information such as the approval date, drug name, and additional details. The structure of these pages is HTML-based, and we will use web scraping techniques to extract the relevant information.
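To make the one-page-per-year layout concrete, here is a minimal sketch that builds the page address from the year alone. It reuses the URL pattern that appears in Step 2 of this tutorial. The helper name novel_approvals_url is my own invention, not part of the FDA site or any library, and the pattern may stop working if the FDA reorganizes its site.

```python
# Base path copied from the URL used in Step 2 of this tutorial
BASE = ("https://www.fda.gov/drugs/"
        "new-drugs-fda-cders-new-molecular-entities-and-new-therapeutic-biological-products/")


def novel_approvals_url(year: int) -> str:
    """Return the assumed address of the Novel Drug Approvals page for one year (hypothetical helper)."""
    return f"{BASE}novel-drug-approvals-{year}"


# Build the addresses for a short range of years
urls = [novel_approvals_url(y) for y in range(2015, 2018)]
for u in urls:
    print(u)
```

Keeping the pattern in one place means only one line needs to change if the address format changes.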

Objective:

The objective of this tutorial is to automate the extraction of FDA drug approval data from specific years. By fetching the HTML content, parsing it, and extracting drug approval tables, we aim to organize the data into a Pandas DataFrame for subsequent analysis and exploration.

Throughout this tutorial, we will:

  1. Scrape FDA drug approval data: Employ web scraping techniques to retrieve data from FDA webpages for a specified range of years.
  2. Extract hyperlinks and drug names: Navigate through the drug approval data, extracting hyperlinks associated with drug names, and structuring the information into a comprehensible Pandas DataFrame.
  3. Search for additional drug information and extract label PDF links: Utilize the extracted hyperlinks to navigate to individual drug pages, fetching and extracting label PDF links.

The FDA drug label (PDF) is a vital guide for healthcare professionals, providing crucial details for safe and effective medication use. It outlines uses, dosage, side effects, contraindications, and more, ensuring informed decisions and patient safety. This resource is essential for understanding a drug’s characteristics, usage and safety profile, contributing significantly to evidence-based medical practice and regulatory oversight.

The Jupyter notebook containing the tutorial is available on GitHub.

Prerequisites

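Before we begin, make sure you have the required Python libraries installed. Besides requests, beautifulsoup4, and pandas, the code needs lxml (used by pandas.read_html to parse HTML tables) and html5lib (the BeautifulSoup parser used in Step 5):

pip install requests beautifulsoup4 pandas lxml html5lib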

Step 1: Importing Libraries

Let’s start by importing the necessary libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Step 2: Scrape FDA Approved Drugs

scrape_fda_drug_approvals scrapes FDA drug approval data for a range of years. For each year, it constructs the URL of that year’s Novel Drug Approvals page and requests the HTML content. It then reads the first table on the page into a DataFrame, renames columns for consistency, and adds the hyperlink and link text for each drug by calling extract_links_from_fda_drugname. That helper is defined in Step 3, so make sure its definition has been run before you call the scraper.

def scrape_fda_drug_approvals(start_year, end_year):
    """
    Scrapes FDA drug approvals data from specified years.

    Parameters:
    - start_year (int): The starting year for scraping.
    - end_year (int): The ending year for scraping.

    Returns:
    - df_final (DataFrame): Pandas DataFrame containing drug approval information.
    """

    # Initialize an empty list to store DataFrames
    tables = []

    # Iterate through each year in the specified range
    for year in range(start_year, end_year + 1):
        print(f"Scraping data for year {year}")

        # Construct the URL for the FDA drug approvals page for the current year
        url = f'https://www.fda.gov/drugs/new-drugs-fda-cders-new-molecular-entities-and-new-therapeutic-biological-products/novel-drug-approvals-{year}'

        # Make a request to the URL and get the HTML content
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code != 200:
            print(f"Failed to retrieve content for year {year}. Status code: {response.status_code}")
            continue  # Skip to the next iteration

        # Extract the tables from the HTML content.
        # pd.read_html raises ValueError (rather than returning an empty list) when no table is found.
        try:
            df_list = pd.read_html(response.content)
        except ValueError:
            print(f"No tables found for year {year}.")
            continue  # Skip to the next iteration

        # Use the first table found
        df = df_list[0]

        # Rename columns for consistency
        df.rename(columns={'Date': 'Approval Date', 'Drug  Name': 'Drug Name'}, inplace=True)

        # Extract links and names from the drug names in the table
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table')

        # Check if the table is found
        if table is None:
            print(f"No table found for year {year}.")
            continue  # Skip to the next iteration

        links, names = extract_links_from_fda_drugname(table)

        # Add links and names as new columns in the DataFrame
        df['links'], df['check_names'] = links, names

        # Append the DataFrame to the list of tables
        tables.append(df)
        
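    # Combine the yearly tables; pd.concat fails on an empty list, so return an empty DataFrame instead
    df_final = pd.concat(tables, ignore_index=True) if tables else pd.DataFrame()
    return df_final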
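To see what the table-reading step does in isolation, here is a small sketch that runs pd.read_html on an inline HTML string instead of a live FDA page. The rows are invented placeholders, not real FDA data, and the parsing relies on lxml being installed. Note that read_html keeps only the cell text, so the href on the drug name is lost, which is why the function above parses the page a second time with BeautifulSoup.

```python
from io import StringIO

import pandas as pd

# Invented placeholder rows; not real FDA data
html = """
<table>
  <tr><th>No.</th><th>Drug Name</th><th>Date</th></tr>
  <tr><td>1</td><td><a href="https://example.com/a">Drug A</a></td><td>1/2/2015</td></tr>
  <tr><td>2</td><td>Drug B</td><td>2/3/2015</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element in the document
df = pd.read_html(StringIO(html))[0]

# Same renaming as in the tutorial; keys that match no column are simply ignored
df = df.rename(columns={'Date': 'Approval Date', 'Drug  Name': 'Drug Name'})

print(df.columns.tolist())
print(df)
```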

Step 3: Extract Links from FDA Drug Names to Explore Further Details

extract_links_from_fda_drugname is a function that, given an HTML table (BeautifulSoup object), retrieves hyperlinks and drug names. It iterates through each row, excluding the header, finds the first hyperlink, and extracts the hyperlink’s URL (href) and associated text (drug name).

def extract_links_from_fda_drugname(table_provided):
    """
    Extracts hyperlinks and corresponding drug names from an HTML table.

    Parameters:
    - table_provided (bs4.element.Tag): HTML <table> element containing drug information.

    Returns:
    - links (list): List of hyperlinks.
    - names (list): List of drug names.
    """

    # Initialize lists to store links and names
    links, names = [], []

    # Iterate through each row in the provided table, excluding the header (first row)
    for tr in table_provided.select("tr")[1:]:
        # Find the first hyperlink in the row; find() returns None if there is none
        anchor = tr.find("a")

        if anchor is not None:
            actual_link, name = anchor.get('href', ''), anchor.get_text()
        else:
            # Use empty placeholders so both lists stay aligned with the table rows
            actual_link, name = '', ''

        # Append the extracted link and name to the respective lists
        links.append(actual_link)
        names.append(name)
        
    return links, names
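Here is a quick way to try the link extraction without hitting the FDA site. The table below is an invented placeholder, not real FDA data, and the function is a condensed copy of the one above so that the snippet runs on its own. It uses Python’s built-in html.parser, so only beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup


def extract_links_from_fda_drugname(table_provided):
    # Condensed copy of the tutorial function, so this snippet is self-contained
    links, names = [], []
    for tr in table_provided.select("tr")[1:]:
        anchor = tr.find("a")
        links.append(anchor.get('href', '') if anchor is not None else '')
        names.append(anchor.get_text() if anchor is not None else '')
    return links, names


# Invented placeholder table; not real FDA data
html = """
<table>
  <tr><th>Drug Name</th><th>Date</th></tr>
  <tr><td><a href="https://example.com/drug-a">Drug A</a></td><td>1/2/2015</td></tr>
  <tr><td>Drug B</td><td>2/3/2015</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')
links, names = extract_links_from_fda_drugname(table)

print(links)
print(names)
```

The second row has no hyperlink, so it yields empty strings. That keeps both lists the same length as the table body, which matters when they are assigned as DataFrame columns.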

Step 4: Specify the Range of Years and Call the Scraping Function

# Specify the range of years for scraping
start_year = 2015
end_year = 2023

# Call the function to scrape FDA drug approvals data
df_result = scrape_fda_drug_approvals(start_year, end_year)

You should now have a DataFrame of FDA-approved drugs with one row per drug, including the links and check_names columns added by the scraper.
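Beyond eyeballing the result, the check_names column can be used for a quick sanity check that each link landed on the right row. This is my reading of what that column is for, so treat the sketch below as one possible check rather than a required step. The DataFrame here is an invented stand-in for df_result, not real FDA data.

```python
import pandas as pd

# Invented placeholder rows standing in for df_result; not real FDA data
df_result = pd.DataFrame({
    'Drug Name': ['Drug A', 'Drug B', 'Drug C'],
    'links': ['https://example.com/a', '', 'https://example.com/c'],
    'check_names': ['Drug A', '', 'Drug X'],
})

# Rows where no hyperlink was found in the table row
missing_link = df_result[df_result['links'] == '']

# Rows where the link text disagrees with the table text (ignoring case and outer whitespace)
table_name = df_result['Drug Name'].str.strip().str.lower()
link_name = df_result['check_names'].str.strip().str.lower()
mismatched = df_result[(df_result['links'] != '') & (table_name != link_name)]

print(len(missing_link), "row(s) without a link")
print(mismatched[['Drug Name', 'check_names']])
```

Any mismatched rows are worth inspecting by hand before following their links in the next step.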

Step 5: Extract Label PDF Links from Drug Detail Pages

all_main_label_pdf_links is a list that stores one main label PDF link per drug entry. The code iterates through each URL in the ‘links’ column of the DataFrame (df_result) and checks that the URL starts with http:// or https://. It then retrieves the page with the requests library and parses it with BeautifulSoup.

From the parsed page, it collects every hyperlink that contains both the FDA label folder address and ‘.pdf’, strips any ‘#’ fragment, and removes duplicates. The first remaining link is appended to all_main_label_pdf_links, or an empty string if no link matches. If a request fails, the loop prints an error message and appends an empty string, so the list always stays aligned with the rows of the DataFrame.

all_main_label_pdf_links = []

for counter, each_url in enumerate(df_result['links']):
    # Check if the URL is correctly formatted
    if each_url.startswith(('http://', 'https://')):
        try:
            html = requests.get(each_url, timeout=30).content
            soup = BeautifulSoup(html, 'html5lib')

            # Collect label PDF links, dropping any '#fragment' suffix
            label_pdf_pattern = ['https://www.accessdata.fda.gov/drugsatfda_docs/label/', '.pdf']
            possible_label_pdf_links = []
            for link in soup.find_all('a'):
                current_link = link.get('href')
                if current_link is not None and all(x in current_link for x in label_pdf_pattern):
                    possible_label_pdf_links.append(current_link.split('#', 1)[0])

            # Remove duplicates while keeping page order, so the "first" link is deterministic
            possible_label_pdf_links = list(dict.fromkeys(possible_label_pdf_links))

            # Keep the first candidate, or an empty string if none was found
            all_main_label_pdf_links.append(possible_label_pdf_links[0] if possible_label_pdf_links else '')

        except requests.exceptions.RequestException as e:
            print(f"Error fetching content for {each_url}: {e}")
            all_main_label_pdf_links.append('')
    else:
        # Skip invalid URLs
        all_main_label_pdf_links.append('')

# Check if the final lists have the same number of items as the number of rows in the DataFrame
if len(all_main_label_pdf_links) != len(df_result):
    print("The lengths of the lists do not match the number of rows in the DataFrame.")
    
df_result['main_label_pdf'] = all_main_label_pdf_links

At the end, you should have a Pandas DataFrame containing the FDA approved drugs for the specified years. The DataFrame integrates embedded links to individual drug pages and direct links to each drug’s Label (PDF) for further analysis.
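If you want to reuse or test the link-filtering logic from Step 5 on its own, it can be pulled into a small helper. The sketch below is one way to do that using only the standard library. The function name pick_label_pdf and the example paths are hypothetical; only the label folder address comes from the code above.

```python
from urllib.parse import urldefrag

# Label folder address taken from the pattern used in Step 5
LABEL_PREFIX = 'https://www.accessdata.fda.gov/drugsatfda_docs/label/'


def pick_label_pdf(hrefs):
    """Return the first label PDF link among hrefs, or '' if there is none (hypothetical helper)."""
    candidates = []
    for href in hrefs:
        if href and LABEL_PREFIX in href and '.pdf' in href:
            # urldefrag splits off any '#fragment' suffix
            candidates.append(urldefrag(href).url)
    # dict.fromkeys removes duplicates while preserving first-seen order
    unique = list(dict.fromkeys(candidates))
    return unique[0] if unique else ''


# Invented placeholder paths; not real label files
hrefs = [
    None,
    'https://example.com/other.pdf',
    LABEL_PREFIX + '2099/000000s000lbl.pdf#page=3',
    LABEL_PREFIX + '2099/000000s000lbl.pdf',
    LABEL_PREFIX + '2099/000000s001lbl.pdf',
]

result = pick_label_pdf(hrefs)
print(result)
```

Unlike a plain set, dict.fromkeys keeps the links in page order, so the same link is picked on every run.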

Conclusion

This tutorial covers the process of web scraping FDA drug approvals data for specific years, extracting additional information by following links, and organizing the data in a Pandas DataFrame. The extracted label PDF links provide valuable information for further analysis or research.

Feel free to customize and enhance the provided code to meet your specific requirements. Thanks a lot for reading this blogpost tutorial! I hope you find this post useful. Your constructive feedback and suggestions are always appreciated.

If you’re interested in more content related to data science, visualization, or cheminformatics, be sure to check out my other blog posts. Happy coding!