Welcome to Part II of Exploring a new dataset with Python! If you missed Part I: The Basics, you can check it out here. In this article, we’ll be returning to our animal mug company’s dataset to continue our exploratory data analysis and answer some new questions about it.
We’ll be using Python’s Matplotlib and Seaborn data visualization libraries to create our charts. Since we’re creating these visuals for the sole purpose of understanding our dataset, we won’t be getting into any customization features related to titles, labels, and color schemes. The Seaborn documentation is a great resource for this if you need to create more aesthetically pleasing visualizations for a dashboard or if you just want to play around!
Alright, let’s go exploring.
Set up your data for analysis with Seaborn
Import pandas, Matplotlib, and Seaborn, load your data file into a DataFrame, and check head() to make sure the data imported properly.
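A minimal sketch of that setup. The DataFrame name sales and the TotalSales, TotalUnits, TotalAdCost, and Product columns come from this article; the DayOfWeek column name and every value in the rows are made up for illustration, and an in-memory CSV stands in for the real data file.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for the real data file: a few made-up rows in CSV form.
# With a real file, pass its path to pd.read_csv instead of the StringIO object.
csv_text = """Product,DayOfWeek,TotalUnits,TotalSales,TotalAdCost
Sloth,Monday,10,200.0,40.0
Swan,Tuesday,5,100.0,25.0
Dog,Wednesday,8,160.0,30.0
"""

sales = pd.read_csv(io.StringIO(csv_text))

# Confirm the import worked: first rows and the inferred column types.
print(sales.head())
print(sales.dtypes)
```

Checking dtypes alongside head() is a quick way to catch a numerical column that was accidentally read in as text.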
Exploring Numerical Variables with Seaborn
Distribution of a numerical variable
To see the overall distribution of any numerical variable, we can use Seaborn’s distplot, basically their version of a histogram. (Since Seaborn 0.11, distplot is deprecated in favor of histplot and displot, which produce the same kind of chart.)
Below, I’m passing in the TotalSales column to see my distribution of revenue from orders, which appears to follow a roughly normal distribution.
Swap out the column name in the function to check out the distribution of each of your numerical variables.
Relationships between two numerical variables
Pairplot is an awesome and easy (only pass in the name of your DataFrame!) way to use Seaborn to get a high-level look at relationships between all of your numerical variables.
For example, the chart in the middle of the top row is using TotalSales on the x-axis and TotalUnits on the y-axis. We see a straight diagonal line which makes sense since the TotalSales column is calculated by multiplying how many units were sold in the TotalUnits column by the price of the mug. As units sold go up, so does revenue!
Choose your own numerical variables to compare
If you’re interested in diving deeper into any of the charts from pairplot, Seaborn’s jointplot allows you to choose which numerical variables to place on the x- and y-axes.
In the example below, I’m comparing TotalAdCost with TotalSales (revenue). We can see the individual distributions of both, along with the relationship between them in the scatter plot. As we would hope, as our ad spend goes up, so does revenue.
Check for outliers
It’s important to know if outliers exist in the dataset so they can be handled appropriately – whether that means removing them, fixing an error from data collection, or leaving them in. Luckily, Seaborn’s box plot makes it simple to identify outliers in your data so you can take the appropriate action. You’ll be able to see in a second if any values in one of your columns are straying from the rest.
Set the x-axis to any of your column names to check for outliers in that column. Using my TotalAdCost column as an example, we get a completely normal looking boxplot with no outliers identified:
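A minimal sketch of a single-column box plot. The ad cost values are invented and deliberately contain no extreme values.

```python
import pandas as pd
import seaborn as sns

# Made-up ad costs with no extreme values.
sales = pd.DataFrame({"TotalAdCost": [20, 35, 48, 60, 72, 85, 97, 110, 123]})

# One numerical column on the x-axis gives a single horizontal box plot.
ax = sns.boxplot(x=sales["TotalAdCost"])
```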
To demonstrate how the box plot would look if there were an outlier, I changed one of my values in the TotalAdCost column to fall outside the normal range:
The little dot shows that we have an outlier at around the 250 mark, quite a bit beyond the end of the upper whisker at around the 123 mark (the largest value not treated as an outlier).
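To make the rule behind that dot concrete, here is a sketch with invented values. By default, the whiskers extend at most 1.5 times the interquartile range (IQR) beyond the quartiles, and anything past that is drawn as an individual point. The same fences can be computed by hand with pandas.

```python
import pandas as pd
import seaborn as sns

# Made-up ad costs; the values are for illustration only.
sales = pd.DataFrame({"TotalAdCost": [20, 35, 48, 60, 72, 85, 97, 110, 123]})

# Overwrite one value so it sits far from the rest.
sales.loc[0, "TotalAdCost"] = 250

# Fences used by the default box plot: 1.5 * IQR beyond each quartile.
q1 = sales["TotalAdCost"].quantile(0.25)
q3 = sales["TotalAdCost"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows that fall outside the fences are the ones drawn as separate dots.
is_outlier = (sales["TotalAdCost"] < lower_fence) | (sales["TotalAdCost"] > upper_fence)
outliers = sales[is_outlier]
print(outliers)

ax = sns.boxplot(x=sales["TotalAdCost"])
```

Having the outlier rows in a DataFrame makes it easier to inspect them before deciding whether to remove, fix, or keep them.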
I’d then repeat this process for the remainder of my numerical columns to find any other outliers that exist in the dataset.
Using Seaborn to Explore Relationships Between Numerical & Categorical Variables
Distribution of a numerical variable across categories
In the previous section, we used Seaborn’s box plot to only look at one numerical variable. We can also use it to look at two variables – assigning one to the x-axis and one to the y-axis to see the distribution of a numerical variable across a categorical variable.
Here, we’re looking at the distribution of revenue for each day of the week.
Outliers will make an appearance here as well – we can see a few unusually low revenue orders on Wednesday, a few unusually high ones on Thursday, and a couple others throughout the chart. In terms of distribution, days like Monday and Thursday have much wider ranges in revenue than a day like Friday.
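A sketch of the two-variable box plot. DayOfWeek is a hypothetical column name (the article does not show the real one), and the revenue values are invented so that one day has a wide range and another has a single unusually low order.

```python
import pandas as pd
import seaborn as sns

sales = pd.DataFrame({
    # Hypothetical column name for the day of the week.
    "DayOfWeek": ["Monday"] * 5 + ["Wednesday"] * 5 + ["Friday"] * 5,
    "TotalSales": [
        100, 180, 250, 320, 400,  # Monday: wide spread
        40, 240, 250, 260, 270,   # Wednesday: one unusually low order
        230, 240, 250, 260, 270,  # Friday: narrow spread
    ],
})

# Categorical variable on x, numerical variable on y: one box per day.
ax = sns.boxplot(data=sales, x="DayOfWeek", y="TotalSales")
```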
Stripplot offers another way to view distribution
Seaborn’s stripplot also gives us a look at the distribution of values in each category in a different way than the box plot. The visual is simplified with the elimination of quartile information, and it also shows each data point as a dot if you’re more interested in seeing where each observation lies.
In the example below, we get a look at the revenue brought in from each animal mug. Each data point is a single order and shows how much revenue it brought in for that mug, so we can see the lowest and highest amounts, the most frequent order amounts (where the dots cluster), and where values don’t occur at all (like the gap between ~320 and 400 in the sloth mug column!).
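A sketch of the stripplot call with invented order amounts, including a gap in the sloth values like the one described above.

```python
import pandas as pd
import seaborn as sns

sales = pd.DataFrame({
    "Product": ["Sloth"] * 6 + ["Swan"] * 4,
    # Made-up order revenue; note the sloth gap between 310 and 405.
    "TotalSales": [120, 150, 300, 310, 405, 420, 90, 200, 210, 330],
})

# One dot per row, grouped into a vertical strip for each product.
ax = sns.stripplot(data=sales, x="Product", y="TotalSales")
```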
Now that we have the tools to learn everything there is to know about the distribution of our data, let’s move on to some different types of comparisons.
Comparing counts across categories
Seaborn’s countplot allows us to compare the number of occurrences in each category. We only need to pass the categorical column we want to look at to the x parameter, and the count will automatically be plotted on the y-axis.
In this example, I’m looking at the Product column to see how the sales of each animal mug compare:
I can quickly see that my top sellers are the sloth, swan, and dog mugs, and the least sold are the dinosaur and lion mugs.
Note: By default, the counts are not sorted in ascending or descending order. To make the highest and lowest categories easier to spot, I added this argument to order them: order=sales['Product'].value_counts().index
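Put together, the countplot call with the ordering argument might look like the sketch below. The product rows are invented for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up orders: one row per order, recording which mug was bought.
sales = pd.DataFrame({
    "Product": ["Dog", "Sloth", "Swan", "Sloth", "Lion",
                "Sloth", "Swan", "Dog", "Sloth", "Swan"],
})

# value_counts() sorts from most to least frequent,
# so its index is a ready-made bar order.
order = sales["Product"].value_counts().index

ax = sns.countplot(data=sales, x="Product", order=order)
```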
Comparing averages across categories (or other aggregate function of your choosing)
Seaborn’s barplot is basically countplot with more options since here we can add a variable to the y-axis, and also specify what type of aggregation we want to see.
By default, the barplot will use the mean as the aggregation. For example, the barplot below shows the average revenue generated on each day of the week.
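A sketch of the default behavior, using a hypothetical DayOfWeek column and invented revenue values. The groupby line computes the same averages by hand as a cross-check on the bar heights.

```python
import pandas as pd
import seaborn as sns

sales = pd.DataFrame({
    # Hypothetical column name for the day of the week.
    "DayOfWeek": ["Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
    "TotalSales": [100.0, 300.0, 150.0, 250.0, 350.0],
})

# With no estimator given, each bar's height is the mean of TotalSales
# for that day; the line on each bar is an error bar around the mean.
ax = sns.barplot(data=sales, x="DayOfWeek", y="TotalSales")

# The same averages computed directly with pandas.
daily_mean = sales.groupby("DayOfWeek")["TotalSales"].mean()
print(daily_mean)
```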
To use a different aggregation, like the sum, pass it to the estimator parameter (for example, estimator=sum).
The values on the y-axis got much higher, as we’re now seeing the total revenue brought in on each day during the time frame of our dataset rather than the average shown in the previous chart.
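A sketch of the sum version, again with a hypothetical DayOfWeek column and invented values. Python's built-in sum is passed as the estimator here; this is one option, not necessarily the exact call used for the chart above.

```python
import pandas as pd
import seaborn as sns

sales = pd.DataFrame({
    # Hypothetical column name for the day of the week.
    "DayOfWeek": ["Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
    "TotalSales": [100.0, 300.0, 150.0, 250.0, 350.0],
})

# estimator replaces the default mean with another aggregation function.
ax = sns.barplot(data=sales, x="DayOfWeek", y="TotalSales", estimator=sum)

# The same totals computed directly with pandas.
daily_total = sales.groupby("DayOfWeek")["TotalSales"].sum()
print(daily_total)
```

Any function that reduces a column of values to a single number can be passed the same way, such as a median or a maximum.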
Visualize Your Own Adventure
We’ve made it to the end of our exploratory data analysis journey together, but the adventure is just beginning! If you’ve gone through both Part I and Part II, you should be well-equipped with the tools you need to learn about your dataset. There really is so much more that can be said about Seaborn, and about other visualization libraries for that matter. There are tons of ways you can customize your visuals further, not just for aesthetic reasons, but to display your data in different ways with the use of additional parameters in your code. I once again encourage you to visit the Seaborn documentation to learn more about the options available. Let us know what you’re creating in the comments below!