Visualizing distributions: histograms and cumulative distributions

Create a dual distribution visualization of a chosen column in a DataFrame. The top chart is a histogram showing how the observations are distributed, and the bottom chart plots every observation along its cumulative distribution. Optionally, you can set hover_name to a column whose values label each observation on hover.

The ecdf function is a thin wrapper around the plotly.express function of the same name, with a few minor options added.

It can be used to visualize how numeric variables are distributed, using both a histogram and the cumulative distribution (ecdf: empirical cumulative distribution function).

While the histogram shows us how many observations we have for each interval, the ecdf shows each observation and its particular position in the ranking order.
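To make the idea of "each observation and its position in the ranking order" concrete, here is a minimal sketch in plain Python. It assumes one common convention, where the observation with rank r out of n is placed at height r / n. The helper name ecdf_points and the sample values are made up for illustration and are not part of the library.

```python
def ecdf_points(values):
    """Return one (value, cumulative_fraction) pair per observation, sorted ascending."""
    ordered = sorted(values)
    n = len(ordered)
    # The observation with 1-based rank r is placed at height r / n,
    # so tied values get consecutive heights and form a vertical stack.
    return [(value, rank / n) for rank, value in enumerate(ordered, start=1)]


impressions = [2, 1, 1, 1, 2, 11]  # made-up sample values
points = ecdf_points(impressions)
for value, fraction in points:
    print(value, round(fraction, 3))
```

Every observation gets its own point, which is why the cumulative chart can label individual rows on hover while the histogram only shows counts per interval.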

Note that you have access to all the parameters of the px.ecdf function, so check its documentation to see what else can be modified:

import plotly.express as px
help(px.ecdf)


ecdf

 ecdf (df, x, hover_name=None, height=None, width=None, **kwargs)

Create an empirical cumulative distribution chart, a thin wrapper around px.ecdf.

	Type	Default	Details
df	pandas.DataFrame		A DataFrame from which you want to visualize one of the columns’ distribution.
x	str		The name of the column to visualize.
hover_name	NoneType	None	The name of the column to use for labeling the markers on mouseover.
height	NoneType	None	The height of the chart in pixels.
width	NoneType	None	The width of the chart in pixels.
kwargs
Returns	plotly.graph_objects.Figure		A plotly chart of the desired ecdf.

import pandas as pd

gsc = pd.read_csv('data/gsc_query_month_report.csv')
by_query = gsc.groupby('query')['impressions'].sum().reset_index()
by_query

	query	impressions
0	"2020 seo tools"	2
1	"anonymizer"	1
2	"best serp checker"	1
3	"cheap seo tools"	1
4	"comma separated phrases to monitor"	2
...	...	...
6907	🥸meaning of this emoji	1
6908	🥺 meaning of this emoji	1
6909	🥺meaning of this emoji	1
6910	🦺 meaning in text	1
6911	🫐 meaning in text	11

6912 rows × 2 columns

Visualizing the distribution of GSC queries and their impressions

To see how these query impressions are distributed we simply run ecdf.

ecdf(
    by_query,
    x='impressions')

Select a column from the DataFrame to display its value on hover

In this case, we set hover_name='query' and now we can see which query each marker represents.

ecdf(
    by_query,
    x='impressions',
    hover_name='query')

When you mouse over any of the circles, you see the query it represents, its value (impressions in this case), the percentage of observations that are equal to or below it, and the counts of observations above and below it.
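The hover figures can be reproduced by hand, which helps when checking a reading from the chart. The sketch below is an assumption about how such numbers can be derived, not a description of the library's internals. The function hover_stats and the sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def hover_stats(values, value):
    """Summarize where `value` sits within `values`."""
    ordered = sorted(values)
    n = len(ordered)
    count_below = bisect_left(ordered, value)   # strictly smaller observations
    at_or_below = bisect_right(ordered, value)  # smaller or equal observations
    return {
        "value": value,
        "pct_at_or_below": 100 * at_or_below / n,
        "count_below": count_below,
        "count_above": n - at_or_below,         # strictly larger observations
    }


sample = [1, 1, 1, 2, 2, 11]  # made-up impressions
stats = hover_stats(sample, 2)
print(stats)
```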

Publishing trends from XML sitemaps

brighton = pd.read_csv('data/brightonseo_sitemap.csv', parse_dates=['lastmod'])
brighton['segment'] = '/' + brighton['segment'] + '/'
brighton.head()

	loc	lastmod	sitemap	sitemap_size_mb	download_date	scheme	netloc	path	query	fragment	dir_1	dir_2	dir_3	dir_4	last_dir	segment
0	https://brightonseo.com	2024-04-29 10:03:31+00:00	https://brightonseo.com/sitemap.xml	0.149905	2024-06-08 19:15:35.969506+00:00	https	brightonseo.com	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	/Others/
1	https://brightonseo.com/ssas-october-2024	2024-05-15 12:43:37+00:00	https://brightonseo.com/sitemap.xml	0.149905	2024-06-08 19:15:35.969506+00:00	https	brightonseo.com	/ssas-october-2024	NaN	NaN	ssas-october-2024	NaN	NaN	NaN	ssas-october-2024	/Others/
2	https://brightonseo.com/training	2024-05-26 20:48:41+00:00	https://brightonseo.com/sitemap.xml	0.149905	2024-06-08 19:15:35.969506+00:00	https	brightonseo.com	/training	NaN	NaN	training	NaN	NaN	NaN	training	/Others/
3	https://brightonseo.com/measurefest-october-2024	2024-05-24 13:27:10+00:00	https://brightonseo.com/sitemap.xml	0.149905	2024-06-08 19:15:35.969506+00:00	https	brightonseo.com	/measurefest-october-2024	NaN	NaN	measurefest-october-2024	NaN	NaN	NaN	measurefest-october-2024	/Others/
4	https://brightonseo.com/mailing-list	2023-09-07 10:56:08+00:00	https://brightonseo.com/sitemap.xml	0.149905	2024-06-08 19:15:35.969506+00:00	https	brightonseo.com	/mailing-list	NaN	NaN	mailing-list	NaN	NaN	NaN	mailing-list	/Others/

Let’s do the same with URLs in an XML sitemap. We can visualize the cumulative distribution of the loc tags, and give it more context by showing each URL when we mouseover. This becomes a rich report with a lot of data on each URL.

The chart immediately shows that the content on this website spans the period September 2023 to June 2024. The top histogram makes it clear that most updates happened in the earliest periods.

When the dots form a near-vertical stack, we know that many updates happened within a very short period of time. These pages were likely updated in a batch.
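The same batches can be checked numerically by counting updates per day. This is a rough sketch using only the standard library. The function busiest_days, the threshold of three updates, and the timestamps are illustrative assumptions, not values taken from the sitemap.

```python
from collections import Counter
from datetime import datetime


def busiest_days(timestamps, min_updates=3):
    """Return (day, count) pairs for days with at least `min_updates` updates."""
    per_day = Counter(ts.date() for ts in timestamps)
    return [(day, count) for day, count in sorted(per_day.items())
            if count >= min_updates]


# Made-up lastmod values for illustration
lastmods = [
    datetime(2024, 5, 15, 12, 43),
    datetime(2024, 5, 15, 12, 44),
    datetime(2024, 5, 15, 12, 45),
    datetime(2024, 5, 24, 13, 27),
    datetime(2023, 9, 7, 10, 56),
]
batches = busiest_days(lastmods)
print(batches)
```

A day that passes the threshold corresponds to one of the vertical stacks of dots in the cumulative chart.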

With a simple option we can split and color the chart by the website segment.

I took the top five values in /dir_1/ and labelled all other values as “Others”.
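The labelling step can be sketched as follows. This is one possible way to do it in plain Python, not necessarily how the segment column in the file was produced. The helper label_top_values and the directory names are hypothetical.

```python
from collections import Counter


def label_top_values(values, n=5, other="Others"):
    """Keep the `n` most frequent values and replace everything else with `other`."""
    counts = Counter(v for v in values if v is not None)
    top = {value for value, _ in counts.most_common(n)}
    # Missing values (None) are never in `top`, so they are labelled `other` too.
    return [v if v in top else other for v in values]


# Made-up first-level directories; n=2 keeps the example short
dirs = ["blog", "blog", "blog", "talks", "talks", "training", None, "about"]
segments = label_top_values(dirs, n=2)
print(["/" + s + "/" for s in segments])
```

Limiting the number of segments keeps a faceted chart readable, since each distinct value becomes its own row.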

By using facet_row="segment" we have six charts showing us the trend for each segment of the website separately.

ecdf(
    brighton,
    x='lastmod',
    height=1300,
    hover_name='loc',
    template='seaborn',
    facet_row='segment',
    color='segment',
    title='URL lastmod trend by page category<br><b>BrightonSEO.com</b>')

Visualizing clustered/classified keywords

The same applies to keywords, as it is crucial to know how they are distributed. We can also gain more insight after categorizing the keywords and applying the same technique we applied in the previous example.

photography = pd.read_csv('data/photgraphy_keywords.csv')
photography

	keyword	volume	category
0	instagram landscape dimensions	1600	Social Media and Digital Platforms
1	instagram profile picture size	4700	Social Media and Digital Platforms
2	inches to pixels	14000	Design and Graphics
3	profile picture	74000	Photography and Image Editing
4	how to make a watermark	1900	Design and Graphics
...	...	...	...
590	eharmony blurry photos	20	Photography and Image Editing
591	booed images	30	Photography and Image Editing
592	how to make a good checklist	30	Ideas and Inspiration
593	picmonkey coupon	30	NaN
594	printable garage sale sign	20	Design and Graphics

595 rows × 3 columns

ecdf(
    photography,
    x='volume',
    hover_name='keyword',
    height=650,
    template='plotly_dark')

Visualizing keywords split by category

fig = ecdf(
    photography,
    x='volume',
    hover_name='keyword',
    height=750,
    template='plotly_dark',
    facet_row='category',
    color='category')
for annotation in fig.layout.annotations:
    annotation.text = ''
fig

GSC queries - by device

nba = pd.read_csv('data/nba_keywords.csv')
nba.head()

	query	device	impressions
0	larry bird stats	Mobile	215
1	stephen curry stats	Desktop	209
2	michael jordan stats	Desktop	197
3	larry bird stats	Desktop	174
4	kobe bryant stats	Desktop	129

Group by query and visualize impressions

fig = ecdf(
    nba.groupby('query')['impressions'].sum().reset_index(),
    x='impressions',
    title='GSC impressions by query',
    height=550,
    template='plotly_dark',
    hover_name='query')
# optional, just to demonstrate styling options
fig.data[0].marker.color = 'silver'
fig.data[1].marker.color = 'silver'
fig.layout.font.color = 'silver'
fig.update_xaxes(gridcolor='#222222')
fig.update_yaxes(gridcolor='#222222')
fig.layout.yaxis.zeroline = False
fig.layout.yaxis2.zeroline = False
fig

GSC queries - by device

fig = ecdf(
    nba,
    x='impressions',
    template='plotly_white',
    facet_row='device',
    color='device',
    hover_name='query',
    height=600,
    color_discrete_sequence=px.colors.qualitative.Vivid,
    title='GSC query impressions - by device')
# optional, just to demonstrate styling options
fig.layout.paper_bgcolor = '#efefef'
fig.layout.plot_bgcolor = '#efefef'
fig.update_xaxes(gridcolor='gray')
fig.update_yaxes(gridcolor='gray')
fig