Visualizing distributions: histograms and cumulative distributions

Create a dual distribution visualization of a column in a DataFrame. The top chart is a histogram showing how the observations are distributed, and the bottom chart plots every observation to show the cumulative distribution. Optionally, you can set hover_name to label each observation on hover.

The ecdf function is a thin wrapper around the plotly.express function of the same name, with a few minor options added.

It can be used to visualize how numeric variables are distributed, using both a histogram and the cumulative distribution (ecdf: empirical cumulative distribution function).

While the histogram shows us how many observations we have for each interval, the ecdf shows each observation and its particular position in the ranking order.
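The difference between the two views can be sketched in plain Python. This is only an illustration with made-up numbers, not how the library computes its charts: `histogram_counts` and `ecdf_points` are hypothetical helper names, and the bin width of 10 is an arbitrary choice.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count observations per interval [start, start + bin_width)."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


def ecdf_points(values):
    """Return one (value, cumulative proportion) pair per observation, in ranked order."""
    ranked = sorted(values)
    n = len(ranked)
    # The i-th smallest observation (1-based) sits at height i / n.
    # Tied values get consecutive positions in this sketch.
    return [(v, (i + 1) / n) for i, v in enumerate(ranked)]


impressions = [1, 1, 2, 2, 3, 5, 8, 40]  # made-up values for illustration

# The histogram only tells us how many observations fall in each interval.
print(histogram_counts(impressions, bin_width=10))

# The ecdf keeps every observation and shows where it sits in the ranking.
for value, proportion in ecdf_points(impressions):
    print(value, proportion)
```

The histogram collapses the seven small values into a single bar, while the ecdf still shows each of them individually, together with its rank.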

Note that you have access to all the parameters of the plotly.express ecdf function through **kwargs. To see what else can be modified, run:

import plotly.express as px
help(px.ecdf)


ecdf

 ecdf (df, x, hover_name=None, title=None, subtitle=None, height=None,
       width=None, template=None, **kwargs)

Create an empirical cumulative distribution chart, a thin wrapper around px.ecdf.

| | Type | Default | Details |
|---|---|---|---|
| df | pandas.DataFrame | | A DataFrame from which you want to visualize one of the columns’ distribution. |
| x | str | | The name of the column to visualize. |
| hover_name | NoneType | None | The name of the column to use for labeling the markers on mouseover. |
| title | NoneType | None | The title of the chart. |
| subtitle | NoneType | None | The subtitle of the chart. |
| height | NoneType | None | The height of the chart in pixels. |
| width | NoneType | None | The width of the chart in pixels. |
| template | NoneType | None | |
| kwargs | VAR_KEYWORD | | |
| Returns | plotly.graph_objects.Figure | | A plotly chart of the desired ecdf. |

import pandas as pd

# total impressions per query across all rows of the report
gsc = pd.read_csv("data/gsc_query_month_report.csv")
by_query = gsc.groupby("query")["impressions"].sum().reset_index()
by_query
query impressions
0 "2020 seo tools" 2
1 "anonymizer" 1
2 "best serp checker" 1
3 "cheap seo tools" 1
4 "comma separated phrases to monitor" 2
... ... ...
6907 🥸meaning of this emoji 1
6908 🥺 meaning of this emoji 1
6909 🥺meaning of this emoji 1
6910 🦺 meaning in text 1
6911 🫐 meaning in text 11

6912 rows × 2 columns

Visualizing the distribution of GSC queries and their impressions

To see how these query impressions are distributed we simply run ecdf.

Select a column from the DataFrame to display its value on hover

In this case, we set hover_name='query' and now we can see which query each marker represents.

ecdf(by_query, x="impressions", hover_name="query")  

Hover label

Hovering over any of the points gives you a set of data about that point, and tells you more about it relative to the dataset:

| label | meaning |
|---|---|
| top label | the hover_name value of the item, in this case the keyword “api serp analysis” |
| first value (impressions) | the metric and its value for this item; here we are measuring impressions, and “api serp analysis” has 3,785 impressions |
| Counts | this section has four values that tell you about the counts of the items |
| percent | the percentage of items less than or equal to the current one, 99.6% in this case |
| count below | how many items are less than the current one |
| count above | how many items are greater than the current one |
| total count | the number of all items in the dataset; this is the same for every point |
| Values | this section has two values that tell you about the dataset’s sum of values up to this data point |
| cumulative sum | the total of the metric (impressions in this case) up to this point |
| cumulative percentage | the percentage of the metric’s total (impressions) up to this point |

In other words, the keyword “api serp analysis” generated at least as many impressions as 99.6% of the items in the list. There are 6,886 keywords with fewer impressions, and 25 keywords with more impressions (out of a total of 6,912 keywords).

Keywords up to this point have generated a total of 501,328 impressions, which is 75.1% of the sum of all impressions of all keywords.
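To make the hover label fields concrete, here is a minimal sketch that computes them for one observation in plain Python. It uses made-up numbers, and `hover_stats` is a hypothetical helper, not part of the library. It assumes that "percent", "cumulative sum", and "cumulative percentage" include the current item and anything tied with it; the library may treat ties differently.

```python
def hover_stats(values, current):
    """Summarize where `current` sits within `values`, using the hover label's field names."""
    total_count = len(values)
    count_below = sum(1 for v in values if v < current)
    count_above = sum(1 for v in values if v > current)
    # Assumption: the current item (and any ties) counts toward these three fields.
    at_or_below = sum(1 for v in values if v <= current)
    cumulative_sum = sum(v for v in values if v <= current)
    return {
        "percent": at_or_below / total_count,
        "count below": count_below,
        "count above": count_above,
        "total count": total_count,
        "cumulative sum": cumulative_sum,
        "cumulative percentage": cumulative_sum / sum(values),
    }


impressions = [1, 1, 2, 3, 5, 8, 30, 50]  # made-up values for illustration
stats = hover_stats(impressions, current=30)
for field, value in stats.items():
    print(f"{field}: {value}")
```

In this toy dataset, the item with 30 impressions ranks above most of the items, yet everything up to and including it accounts for only half of all impressions. That gap between rank and share of the total is what the two sections of the hover label let you compare.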

In other cases the distribution can be much more extreme, as is the case in some examples below.

The aim is to show how (in)significant a set of values is in the dataset.

When you mouse over any of the circles, you see the query it represents, its value (impressions in this case), the percentage of observations equal to or below it, and the counts of observations above and below it.

Visualizing clustered/classified keywords

The same applies to keyword research data, as it is crucial to know how the keywords are distributed. We can also gain more insight by categorizing the keywords and applying the same technique as in the previous example.

photography = pd.read_csv("data/photgraphy_keywords.csv")
photography
keyword volume category
0 instagram landscape dimensions 1600 Social Media and Digital Platforms
1 instagram profile picture size 4700 Social Media and Digital Platforms
2 inches to pixels 14000 Design and Graphics
3 profile picture 74000 Photography and Image Editing
4 how to make a watermark 1900 Design and Graphics
... ... ... ...
590 eharmony blurry photos 20 Photography and Image Editing
591 booed images 30 Photography and Image Editing
592 how to make a good checklist 30 Ideas and Inspiration
593 picmonkey coupon 30 NaN
594 printable garage sale sign 20 Design and Graphics

595 rows × 3 columns

ecdf(
    photography,
    x="volume",
    hover_name="keyword",
    height=650,
    template="plotly_dark",
    title="Photography keywords",
    subtitle="Keyword research data",
)

Visualizing keywords split by category

fig = ecdf(
    photography,
    x="volume",
    hover_name="keyword",
    height=750,
    title="Photography keywords",
    subtitle="Keyword research data - split by category",
    template="plotly_dark",
    facet_row="category",  
    color="category",
)  
# optional: blank out the text of every annotation on the figure (such as the per-facet labels)
for annotation in fig.layout.annotations:
    annotation.text = ""
fig

GSC queries - by device

nba = pd.read_csv("data/nba_keywords.csv")
nba.head()
query device impressions
0 larry bird stats Mobile 215
1 stephen curry stats Desktop 209
2 michael jordan stats Desktop 197
3 larry bird stats Desktop 174
4 kobe bryant stats Desktop 129

Group by query and visualize impressions

fig = ecdf(
    nba.groupby("query")["impressions"].sum().reset_index(),
    x="impressions",
    title="GSC impressions by query",
    height=550,
    template="plotly_dark",
    hover_name="query",
)
# optional, just to demonstrate styling options
fig.data[0].marker.color = "silver"
fig.data[1].marker.color = "silver"
fig.layout.font.color = "silver"
fig.update_xaxes(gridcolor="#222222")
fig.update_yaxes(gridcolor="#222222")
fig.layout.yaxis.zeroline = False
fig.layout.yaxis2.zeroline = False
fig
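The grouping step above matters because the same query can appear once per device. As a rough illustration of what the group-by and sum does, here is a plain Python sketch using only the five rows shown in the preview above. `sum_by_query` is a hypothetical helper, and the totals for the full dataset will differ.

```python
from collections import defaultdict

# The five rows shown by nba.head(): (query, device, impressions)
rows = [
    ("larry bird stats", "Mobile", 215),
    ("stephen curry stats", "Desktop", 209),
    ("michael jordan stats", "Desktop", 197),
    ("larry bird stats", "Desktop", 174),
    ("kobe bryant stats", "Desktop", 129),
]


def sum_by_query(rows):
    """Add up impressions per query, ignoring the device."""
    totals = defaultdict(int)
    for query, _device, impressions in rows:
        totals[query] += impressions
    return dict(totals)


by_query = sum_by_query(rows)
print(by_query["larry bird stats"])  # 215 (Mobile) + 174 (Desktop) = 389
```

After grouping, each query is represented by a single marker in the chart instead of one marker per device.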

GSC queries - by device

fig = ecdf(
    nba,
    x="impressions",
    template="plotly_white",
    facet_row="device",  
    color="device",  
    hover_name="query",
    height=600,
    color_discrete_sequence=px.colors.qualitative.Vivid,
    title="GSC query impressions - by device",
)
# optional, just to demonstrate styling options
fig.layout.paper_bgcolor = "#efefef"
fig.layout.plot_bgcolor = "#efefef"
fig.update_xaxes(gridcolor="gray")
fig.update_yaxes(gridcolor="gray")
fig