Clustergram - visualization and diagnostics for cluster analysis
Clustergram is a diagram proposed by Matthias Schonlau in his paper The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses.
In hierarchical cluster analysis, dendrograms are used to visualize how clusters are formed. I propose an alternative graph called a “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.
The clustergram was later implemented in R by Tal Galili, who also gives a thorough explanation of the concept.
This is a Python translation of Tal's script, written for the scikit-learn and RAPIDS cuML implementations of K-Means and Gaussian Mixture Model (scikit-learn only) clustering.
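To make the idea concrete, here is a minimal sketch of the bookkeeping behind a clustergram. It is not the package's actual code: it assumes you already have cluster labels for two consecutive values of k (the labels below are made up) and counts how many observations move from each cluster at k to each cluster at k + 1. In the diagram, such counts determine how thick the connecting lines are. The function name `count_flows` is hypothetical.

```python
from collections import Counter


def count_flows(labels_k, labels_next):
    """Count observations moving from each cluster at k to each cluster at k + 1."""
    if len(labels_k) != len(labels_next):
        raise ValueError("both label lists must cover the same observations")
    # Each (source, destination) pair is one observation's path between the two solutions.
    return Counter(zip(labels_k, labels_next))


# Made-up labels for six observations, clustered with k=2 and then k=3.
labels_k2 = [0, 0, 0, 1, 1, 1]
labels_k3 = [0, 0, 2, 1, 1, 2]

flows = count_flows(labels_k2, labels_k3)
for (src, dst), n in sorted(flows.items()):
    print(f"cluster {src} (k=2) -> cluster {dst} (k=3): {n} observations")
```

Here the new third cluster draws one observation from each of the two original clusters, which a clustergram would show as two thin lines converging on it.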
Getting started
You can install clustergram from conda or pip:
conda install clustergram -c conda-forge
pip install clustergram
In any case, you still need to install your selected backend (scikit-learn or cuML).
An example of a clustergram on the Palmer penguins dataset:
import seaborn
df = seaborn.load_dataset('penguins')
First, we have to select the numerical columns and scale them.
from sklearn.preprocessing import scale
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
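As a rough illustration of what the scaling step does, the sketch below standardises each column to zero mean and unit variance in plain Python. It is not scikit-learn's implementation: `standardize` and the sample values are hypothetical, and the sketch assumes the population standard deviation (no degrees-of-freedom correction).

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid a ZeroDivisionError.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up measurements on very different scales (e.g. millimetres and grams).
sample = [
    [39.1, 3750.0],
    [46.5, 5200.0],
    [50.0, 3400.0],
]
scaled = standardize(sample)
for row in scaled:
    print([round(value, 3) for value in row])
```

Without this step, the column with the largest raw values would dominate the distance computations used by K-Means.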
And then we can simply pass the data to clustergram.
from clustergram import Clustergram
cgram = Clustergram(range(1, 8))
cgram.fit(data)
cgram.plot()
Styling
Clustergram.plot() returns a matplotlib axis and can be fully customised like any other matplotlib plot.
seaborn.set(style='whitegrid')
cgram.plot(
    size=0.5,
    linewidth=0.5,
    cluster_style={"color": "lightblue", "edgecolor": "black"},
    line_style={"color": "red", "linestyle": "-."},
    figsize=(12, 8)
)
Mean options
On the y axis, a clustergram can use mean values, as in the original paper by Matthias Schonlau, or PCA-weighted mean values, as in the implementation by Tal Galili.
cgram = Clustergram(range(1, 8), pca_weighted=True)
cgram.fit(data)
cgram.plot(figsize=(12, 8))
cgram = Clustergram(range(1, 8), pca_weighted=False)
cgram.fit(data)
cgram.plot(figsize=(12, 8))
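The difference between the two options can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: it assumes the plain option averages the coordinates of each cluster centre, and the PCA-weighted option takes the dot product of each centre with the first principal component of the data. The helper `first_component`, the toy data and the cluster centres are all hypothetical.

```python
import math


def first_component(rows, iterations=100):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # The start vector must not be orthogonal to the component we are looking for.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying on a diagonal, with made-up centres for a two-cluster solution.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
centers = [(0.5, 0.5), (2.5, 2.5)]

pc1 = first_component(points)
plain_means = [sum(c) / len(c) for c in centers]
weighted_means = [sum(x * w for x, w in zip(c, pc1)) for c in centers]

print("plain means:   ", plain_means)
print("weighted means:", [round(m, 4) for m in weighted_means])
```

Both options place the clusters in the same order here; on real data, the weighted version tends to separate clusters along the direction of greatest variance, while the plain mean can let differences in opposite directions cancel out.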
Scikit-learn and RAPIDS cuML backends
Clustergram offers two backends for the computation: scikit-learn, which uses the CPU, and RAPIDS.AI cuML, which uses the GPU. Note that both are optional dependencies, but you will need at least one of them to generate a clustergram.
Using scikit-learn (default):
cgram = Clustergram(range(1, 8), backend='sklearn')
cgram.fit(data)
cgram.plot()
Using cuML:
cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
cgram.plot()
data can be any data type supported by the selected backend (including cudf.DataFrame with the cuML backend).
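Since both backends are optional, it can be handy to check which one is installed before building a clustergram. The helper below is a hypothetical sketch that is not part of clustergram; it relies only on the standard library and assumes the backends are importable as `sklearn` and `cuml`.

```python
from importlib.util import find_spec


def available_backend():
    """Return the first installed backend name, or None if neither is found."""
    # Maps the `backend` value used in the examples above to the module's import name.
    candidates = {"sklearn": "sklearn", "cuML": "cuml"}
    for backend, module in candidates.items():
        if find_spec(module) is not None:
            return backend
    return None


backend = available_backend()
if backend is None:
    print("Install scikit-learn or cuML before fitting a clustergram.")
else:
    print(f"Using backend={backend!r}")
```

The returned string can then be passed as the `backend` argument, for example `Clustergram(range(1, 8), backend=backend)`.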
Supported methods
Clustergram currently supports K-Means and Gaussian Mixture Model clustering methods. Note that GMM is supported only for the scikit-learn backend.
Using K-Means (default):
cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
Using Gaussian Mixture Model:
cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
Partial plot
Clustergram.plot() can also plot only a part of the diagram, if you want to focus on a limited range of k.
cgram = Clustergram(range(1, 20))
cgram.fit(data)
cgram.plot(figsize=(12, 8))
cgram.plot(k_range=range(3, 10), figsize=(12, 8))
Saving clustergram
You can save both the plot and the clustergram.Clustergram object to disk.
Saving plot
Clustergram.plot() returns a matplotlib axis object and as such can be saved like any other plot:
import matplotlib.pyplot as plt
cgram.plot()
plt.savefig('clustergram.svg')
Saving object
If you want to save your computed clustergram.Clustergram object to disk, you can use the pickle library:
import pickle
with open('clustergram.pickle', 'wb') as f:
    pickle.dump(cgram, f)
Then loading is equally simple:
with open('clustergram.pickle', 'rb') as f:
    loaded = pickle.load(f)
References
Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical cluster analyses. The Stata Journal, 2002; 2(4):391-402.
Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with Clustergrams. Computational Statistics, 2004; 19(1):95-111.