Visualizing distributions: histograms and cumulative distributions

Create a dual distribution visualization of a chosen column in a DataFrame. A histogram shows the distribution of observations, and the bottom chart displays every observation along its cumulative distribution. Optionally, you can set hover_name to a column whose values label each observation on hover.

The ecdf function is a thin wrapper for the plotly.express function of the same name. It adds a few minor options to it.

It can be used to visualize how numeric variables are distributed, using both a histogram and the cumulative distribution (ecdf: empirical cumulative distribution function).

While the histogram shows us how many observations we have for each interval, the ecdf shows each observation and its particular position in the ranking order.
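
To make the difference between the two views concrete, here is a minimal sketch that computes both by hand with numpy and pandas. The sample values are hypothetical (they do not come from the datasets used on this page), and the ecdf is computed with the standard definition, the share of observations less than or equal to each value. The chart itself may place tied observations slightly differently.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
values = pd.Series([1, 1, 2, 2, 2, 5, 11, 40], name='impressions')

# Histogram view: how many observations fall in each interval (bin).
counts, bin_edges = np.histogram(values, bins=4)

# ECDF view: sort the observations, then for each one compute the
# share of observations that are less than or equal to it.
sorted_values = values.sort_values().reset_index(drop=True)
ecdf_values = sorted_values.rank(method='max') / len(sorted_values)

print(counts)
print(pd.DataFrame({'impressions': sorted_values, 'ecdf': ecdf_values}))
```

The histogram collapses six of the eight observations into a single bin, while the ecdf keeps one row per observation together with its position in the ranking.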

Note that you have access to all the parameters of the px.ecdf function through kwargs. To see what else can be modified, run:

import plotly.express as px
help(px.ecdf)

ecdf

 ecdf (df, x, hover_name=None, height=None, width=None, **kwargs)

Create an empirical cumulative distribution chart, a thin wrapper around px.ecdf.

Parameter   Type                         Default  Details
df          pandas.DataFrame                      A DataFrame from which you want to visualize one of the columns’ distribution.
x           str                                   The name of the column to visualize.
hover_name  NoneType                     None     The name of the column to use for labeling the markers on mouseover.
height      NoneType                     None     The height of the chart in pixels.
width       NoneType                     None     The width of the chart in pixels.
kwargs
Returns     plotly.graph_objects.Figure           A plotly chart of the desired ecdf.

import pandas as pd

gsc = pd.read_csv('data/gsc_query_month_report.csv')
# total impressions per query
by_query = gsc.groupby('query')['impressions'].sum().reset_index()
by_query
query impressions
0 "2020 seo tools" 2
1 "anonymizer" 1
2 "best serp checker" 1
3 "cheap seo tools" 1
4 "comma separated phrases to monitor" 2
... ... ...
6907 🥸meaning of this emoji 1
6908 🥺 meaning of this emoji 1
6909 🥺meaning of this emoji 1
6910 🦺 meaning in text 1
6911 🫐 meaning in text 11

6912 rows × 2 columns

Visualizing the distribution of GSC queries and their impressions

To see how these query impressions are distributed we simply run ecdf.

ecdf(
    by_query,
    x='impressions')

Select a column from the DataFrame to display its value on hover

In this case, we set hover_name='query' and now we can see which query each marker represents.

ecdf(
    by_query,
    x='impressions',
    hover_name='query')

Mousing over any of the circles shows the query it represents, its value (impressions in this case), the percentage of observations equal to or below it, and the counts of observations above and below it.
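
The hover figures described above can be reproduced by hand. The following is a minimal sketch under stated assumptions: the data is hypothetical, the column names pct_at_or_below, count_below, and count_above are made up for illustration, and ties are treated as "at or below" while "below" and "above" are strict. The chart's own hover calculation may handle ties differently.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'query': ['a', 'b', 'c', 'd', 'e'],
    'impressions': [1, 2, 2, 5, 11],
})

n = len(df)
x = df['impressions']

# Number of observations less than or equal to each value (ties included).
at_or_below = x.rank(method='max').astype(int)

df['pct_at_or_below'] = at_or_below * 100 / n
# Strictly smaller observations: the lowest rank among ties, minus one.
df['count_below'] = x.rank(method='min').astype(int) - 1
# Strictly larger observations: everything not at or below.
df['count_above'] = n - at_or_below

print(df)
```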

Visualizing clustered/classified keywords

The same technique applies to keywords, where it is just as important to know how volume is distributed. Categorizing the keywords first gives further insight, because each category's distribution can then be examined separately.

photography = pd.read_csv('data/photgraphy_keywords.csv')
photography
keyword volume category
0 instagram landscape dimensions 1600 Social Media and Digital Platforms
1 instagram profile picture size 4700 Social Media and Digital Platforms
2 inches to pixels 14000 Design and Graphics
3 profile picture 74000 Photography and Image Editing
4 how to make a watermark 1900 Design and Graphics
... ... ... ...
590 eharmony blurry photos 20 Photography and Image Editing
591 booed images 30 Photography and Image Editing
592 how to make a good checklist 30 Ideas and Inspiration
593 picmonkey coupon 30 NaN
594 printable garage sale sign 20 Design and Graphics

595 rows × 3 columns

ecdf(
    photography,
    x='volume',
    hover_name='keyword',
    height=650,
    template='plotly_dark')

Visualizing keywords split by category

fig = ecdf(
    photography,
    x='volume',
    hover_name='keyword',
    height=750,
    template='plotly_dark',
    facet_row='category',
    color='category')
# optional: clear the facet label text that plotly adds to each row
for annotation in fig.layout.annotations:
    annotation.text = ''
fig

GSC queries - by device

nba = pd.read_csv('data/nba_keywords.csv')
nba.head()
query device impressions
0 larry bird stats Mobile 215
1 stephen curry stats Desktop 209
2 michael jordan stats Desktop 197
3 larry bird stats Desktop 174
4 kobe bryant stats Desktop 129

Group by query and visualize total impressions

fig = ecdf(
    nba.groupby('query')['impressions'].sum().reset_index(),
    x='impressions',
    title='GSC impressions by query',
    height=550,
    template='plotly_dark',
    hover_name='query')
# optional, just to demonstrate styling options
fig.data[0].marker.color = 'silver'
fig.data[1].marker.color = 'silver'
fig.layout.font.color = 'silver'
fig.update_xaxes(gridcolor='#222222')
fig.update_yaxes(gridcolor='#222222')
fig.layout.yaxis.zeroline = False
fig.layout.yaxis2.zeroline = False
fig

GSC queries - by device

fig = ecdf(
    nba,
    x='impressions',
    template='plotly_white',
    facet_row='device',
    color='device',
    hover_name='query',
    height=600,
    color_discrete_sequence=px.colors.qualitative.Vivid,
    title='GSC query impressions - by device')
# optional, just to demonstrate styling options
fig.layout.paper_bgcolor = '#efefef'
fig.layout.plot_bgcolor = '#efefef'
fig.update_xaxes(gridcolor='gray')
fig.update_yaxes(gridcolor='gray')
fig