This blog post is for readers as well as for myself. In this tutorial, I will show how to make different types of boxplots, including horizontal, vertical, grouped, and interactive ones.
It’s not meant to be comprehensive. It’s just a collection of different styles and visualizations that I like.
For the code, you will need the following Python libraries: pandas, NumPy, Plotly, Matplotlib, and seaborn. They can all be installed with either pip or conda.
I will be using fake data to show different types of boxplots. I normally create a conda environment and install the required libraries there. Once they are installed, make sure the environment is activated.
Okay, let’s get to the code. All the code in this tutorial is available in this GitHub repository.
Below is the code for importing the libraries needed for data visualization in this tutorial.
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
Please note that I am using a Jupyter notebook to run the code. It’s very useful for testing different snippets of Python code and displaying visualizations.
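If you want to double-check the environment before going further, here is a small sketch that prints the installed version of each library. It only uses the standard library's importlib.metadata. The package names are the PyPI distribution names, and I am assuming they match how you installed them.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed to match your install).
packages = ["pandas", "numpy", "plotly", "matplotlib", "seaborn"]

installed = {}
for pkg in packages:
    try:
        installed[pkg] = version(pkg)
    except PackageNotFoundError:
        # Not installed in the currently active environment
        installed[pkg] = None

for pkg, ver in installed.items():
    print(f"{pkg:<12} {ver or 'NOT INSTALLED'}")
```

If anything shows up as not installed, the conda environment is probably not activated in the kernel the notebook is using.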
Generating Fake Data
Now, let’s create fake data for the test scores. Let’s imagine the scores range from 0 to 100, and there are 100 students taking the exam for five different subjects.
# The upper bound of randint is exclusive, so 101 lets a score of 100 appear
df = pd.DataFrame(np.random.randint(0, 101, size=(100, 5)), columns=['Literature', 'Chemistry', 'Biology', 'History', 'Geography'])
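A side note on reproducibility: the scores change every time the cell runs, so your plots will not match mine exactly. Below is a minimal sketch of a seeded alternative using NumPy's Generator API. The seed value and the name df_seeded are my own choices for illustration. Keep in mind that the upper bound of np.random.randint is exclusive, while endpoint=True in the Generator API makes it inclusive.

```python
import numpy as np
import pandas as pd

subjects = ["Literature", "Chemistry", "Biology", "History", "Geography"]

# Seeded generator so every run produces the same fake scores.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# endpoint=True makes the upper bound inclusive, so 100 is a possible score.
scores = rng.integers(0, 100, size=(100, len(subjects)), endpoint=True)

df_seeded = pd.DataFrame(scores, columns=subjects)
print(df_seeded.head())
```

The rest of this tutorial keeps using df from the line above; df_seeded is only there to show the idea.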
Let’s check what the first few rows look like for now.
Okay, that looks fine. But I want to get fancy and create random fake names for the dataset. It’s really not important for the purpose of this tutorial, but I am doing it for fun.
I am going to use the ‘names’ package, which can also be installed with pip. Then, I will create a column called ‘student_names’ which contains randomly generated names, and set that column as the index.
import names
df['student_names'] = [names.get_full_name() for _ in range(len(df))]
df.set_index('student_names', inplace=True)
Okay, so now the dataset looks like this.
Generating & Modifying a Basic Box Plot for Categorical Distributions
Let’s start plotting it in a boxplot. We will use the seaborn package for this, and we will want to see the distribution of test scores for all the subjects in one plot.
# Set the resolution first; rcParams only affect figures created afterwards
plt.rcParams["figure.dpi"] = 300
ax = sns.boxplot(data=df)
plt.xlabel("Subjects", size=12)
plt.ylabel("Test Scores", size=12)
plt.show()
Alright, this class is not looking so hot, considering the median of all the test scores is around 50 across all subjects.
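To put numbers on what the boxes show, here is a small sketch that computes the same summary by hand with pandas: the quartiles, the median, the interquartile range (IQR), and the 1.5 × IQR fences that seaborn uses by default to decide where the whiskers stop. The toy data frame and its seed are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test score data (two subjects, 100 students)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 101, size=(100, 2)),
    columns=["Literature", "Chemistry"],
)

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

summary = pd.DataFrame(
    {
        "q1": q1,                # bottom edge of the box
        "median": toy.median(),  # line inside the box
        "q3": q3,                # top edge of the box
        "iqr": iqr,              # height of the box
        # Whiskers reach the furthest data points inside these fences
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }
)
print(summary)
```

With uniformly random scores, the medians land near 50 and the fences fall outside the 0 to 100 range, which is why the plot shows hardly any outlier markers.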
Let’s try turning the notch on to better see the medians of the distributions.
ax = sns.boxplot(data=df, notch=True)
plt.show()
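In case you are wondering what the notch means: it marks an approximate confidence interval around the median. As far as I know, Matplotlib (which seaborn draws with) uses median ± 1.57 × IQR / √n by default. The sketch below computes that by hand and compares it with matplotlib.cbook.boxplot_stats; the toy scores are made up, and I am assuming seaborn does not override this default.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=100)  # stand-in for one subject column

q1, med, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
n = len(scores)

# Manual notch limits: median +/- 1.57 * IQR / sqrt(n)
half_width = 1.57 * iqr / np.sqrt(n)
manual_notch = (med - half_width, med + half_width)

# What matplotlib reports for the same data
stats = cbook.boxplot_stats(scores)[0]
library_notch = (stats["cilo"], stats["cihi"])

print("manual :", manual_notch)
print("library:", library_notch)
```

A common rule of thumb is that if the notches of two boxes do not overlap, their medians are likely to differ. With random scores like ours, they overlap almost everywhere.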
You can also change the colors of your box plots with the ‘palette’ parameter. See the seaborn documentation for the available color palettes, and if you like, you can create your own as well.
ax = sns.boxplot(data=df, notch=True, palette="flare")
plt.show()
You can also change the orientation of the box plot distributions to horizontal.
ax = sns.boxplot(data=df, orient = 'h')
plt.show()
Incorporating a Swarm Plot into a Box Plot
Now, let’s try incorporating a swarm plot into a box plot. Basically, we get to see the data distribution better in each category.
ax = sns.boxplot(data=df)
ax = sns.swarmplot(data=df, color=".25", size = 4)
plt.show()
You can adjust the color of the swarm plot by changing the ‘color’ parameter. There are also several other parameters that you can change. For more detail, please see the documentation.
Generating a Violin Plot
Let’s try a violin plot next.
sns.violinplot(data=df)
plt.show()
Incorporating a Swarm Plot into a Violin Plot
Now, let’s try incorporating a swarm plot with a violin plot.
ax = sns.violinplot(data=df, inner=None)
ax = sns.swarmplot(data=df,
color="white", edgecolor="gray", size = 4)
plt.show()
Reformatting Data for Grouped Box Plot
Okay, let’s try making some grouped boxplots based on a variable, perhaps “gender”. So, we will assign random genders to the dataset.
df['Gender'] = np.random.choice(['Men', 'Women'], len(df))
Now, our dataset has a new column called ‘Gender’. We can now proceed to our grouped box plotting adventure!
Out of curiosity, let’s see how many are men and women.
df.Gender.value_counts()
Men      54
Women    46
Name: Gender, dtype: int64
For generating a grouped box plot, we will need to change the format of the current dataset.
Remember that the current dataset has 5 columns for subjects and a column for ‘Gender’. We will now melt the data frame into four major columns: ‘student_names’, ‘subjects’, ‘scores’, and ‘gender’.
We will be grouping each subject category by ‘Gender’. Note that even though I am preserving the student names, it is not really necessary. I just want to keep that information. Our melted data frame will be called ‘new_df’.
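# Keep 'Gender' as an identifier too; otherwise it gets melted into the 'variable' column
new_df = df.reset_index().melt(id_vars=['student_names', 'Gender'])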
Below is a snippet to see how the melted data frame looks.
Let’s rename a couple of columns here for clarity. The ‘variable’ column contains the subjects, and the ‘value’ column contains the test scores. So, let’s change ‘variable’ to ‘Subjects’, and ‘value’ to ‘Test Scores’.
new_df.rename({'variable': 'Subjects', 'value': 'Test Scores'}, axis=1, inplace=True)
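If melting is new to you, here is a tiny sketch on a two-student toy frame so you can see every row. The student names and scores are made up. I also use var_name and value_name here, which do the melt and the rename in one step.

```python
import pandas as pd

# Wide format: one row per student, one column per subject (toy values)
wide = pd.DataFrame(
    {
        "student_names": ["Ada Lee", "Ben Cruz"],
        "Gender": ["Women", "Men"],
        "Literature": [71, 48],
        "Chemistry": [55, 90],
    }
)

# Long format: one row per (student, subject) pair
long = wide.melt(
    id_vars=["student_names", "Gender"],  # columns to keep as identifiers
    var_name="Subjects",                  # name for the former column headers
    value_name="Test Scores",             # name for the former cell values
)
print(long)
```

Two students times two subjects gives four rows, and each row still carries the student's name and gender.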
Generating a Grouped Box Plot
Now, to do a grouped box plot, write the following code snippet. Note that in this case, we are grouping each categorical distribution by ‘Gender’, so we will need to specify that in the ‘hue’ parameter of the boxplot.
sns.boxplot(x='Subjects', y='Test Scores', hue='Gender', data=new_df)
plt.xlabel("Subjects", size=12)
plt.ylabel("Test Scores", size=12)
plt.legend(loc='upper right')
plt.show()
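If you want the numbers behind a grouped box plot, a groupby gives you the median of every box. This is a minimal sketch on a made-up long-format frame with the same column names as above; toy_long and its seed are just for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Toy long-format data: one row per score
toy_long = pd.DataFrame(
    {
        "Subjects": rng.choice(["Literature", "Chemistry"], size=n),
        "Gender": rng.choice(["Men", "Women"], size=n),
        "Test Scores": rng.integers(0, 101, size=n),
    }
)

# One median per (subject, gender) pair, with genders spread into columns
medians = (
    toy_long.groupby(["Subjects", "Gender"])["Test Scores"]
    .median()
    .unstack("Gender")
)
print(medians)
```

Each cell of the resulting table corresponds to the middle line of one box in the grouped plot.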
Generating Sub-Boxplots for Sub-Categories
Next, let’s try splitting the boxplots based on different categories. We will observe test score distributions of all five subjects separately for each gender.
sns.catplot(x='Test Scores', y='Subjects', col='Gender', aspect=0.5, kind='box', data=new_df)
plt.show()
Generating Interactive Box Plots
Cool. Now, let’s try some interactive boxplots. We will use the Plotly library for that; we already imported plotly.express as px at the beginning.
fig = px.box(new_df, x="Subjects", y="Test Scores")
fig.show()
You will see the interactive boxplot in your Jupyter notebook, which I shared at the beginning. For now, I don’t know how to embed that interactive graph in this blog post, so I just took a screenshot of the plot.
Hovering over the graph shows useful information for each category, such as the median and quartiles.
We can also see the underlying data points by setting the parameter ‘points’ to ‘all’.
fig = px.box(new_df, x="Subjects", y="Test Scores", points="all")
fig.show()
Generating Interactive Grouped Box Plots
Next, let’s try interactive, grouped boxplots.
fig = px.box(new_df, x="Subjects", y="Test Scores", color="Gender")
fig.show()
We can also turn on the notches.
fig = px.box(new_df, x="Subjects", y="Test Scores", color = "Gender",
notched=True, # use the notched box shape
title="Box Plot of Test Scores",
)
fig.show()
Generating Interactive Sub-Boxplots for Sub-Categories
Next, let’s try a different layout where we separate each interactive boxplot by subject. We want to see ‘Test Scores’ distributions, so assign that to y. We will separate each box plot by ‘Subjects’, so assign that to facet_col. Then, inside each subject category, we want to see distributions for ‘Gender’ side by side. So, assign ‘Gender’ to color and ‘group’ to boxmode. I also want to see all the points, so I assign ‘all’ to points.
fig = px.box(new_df, y="Test Scores", facet_col="Subjects", color="Gender",
boxmode="group", points='all')
fig.show()
If you don’t want to show the points, then remove that parameter.
fig = px.box(new_df, y="Test Scores", facet_col="Subjects", color="Gender",
boxmode="group")
fig.show()
I will wrap up this tutorial here for now. This blog post is mainly a code tutorial for those who may be interested, and also for myself.
Hopefully, you find this post useful. Thank you so much for reading this post. As always, I welcome constructive feedback and suggestions.