Chapter 13: Clustering Algorithms for Mass Cytometry Data

In the intricate world of mass cytometry data, clustering algorithms serve as our trusty guides, helping us navigate the complex cellular landscapes and identify distinct populations. Let’s embark on a journey through the diverse ecosystem of clustering techniques, from the venerable classics to the cutting-edge innovators.

Hierarchical Clustering: The Family Tree of Cells

Hierarchical clustering, one of the oldest techniques in the book, builds a tree-like structure of data points. It’s like creating a family tree for your cells, grouping them based on their similarities.

There are two main approaches:

  1. Agglomerative: Start with each cell as its own cluster and progressively merge the closest ones.
  2. Divisive: Begin with all cells in one cluster and recursively divide them.

While not always the fastest, hierarchical clustering provides an intuitive visualization of relationships between cell populations. It’s particularly useful for exploratory analysis and for understanding the overall structure of your data.
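
To make the agglomerative recipe concrete, here is a minimal Python sketch. It assumes average linkage and Euclidean distance (neither is specified above), and the toy `cells` values and the `agglomerative` helper are invented purely for illustration. It is a naive version meant for reading, not for real cytometry files.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering (toy sizes only)."""
    # Start with every point (cell) in its own cluster, stored as index lists.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters

# Hypothetical toy data: rows are cells, columns are two marker intensities.
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
groups = agglomerative(cells, n_clusters=2)
print(sorted(sorted(g) for g in groups))  # → [[0, 1, 2], [3, 4, 5]]
```

Stopping at a fixed number of clusters is only one way to cut the tree; recording each merge and its distance instead would give the full dendrogram.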

K-means Clustering: The Classic Workhorse

K-means clustering is the dependable workhorse of the clustering world. It aims to partition n observations into k clusters, with each observation belonging to the cluster with the nearest mean.

Despite its simplicity, k-means remains popular due to its speed and interpretability. However, it requires you to specify the number of clusters in advance, which can be challenging in complex cytometry data.
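
The "nearest mean" idea can be sketched in a few lines of Python using the classic alternating scheme (often called Lloyd's algorithm). The toy `cells` values, the `kmeans` helper, and the choice to pass starting centers explicitly are assumptions made for illustration; real implementations usually pick starting centers at random or with a smarter seeding strategy.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until the means stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each cell goes to the cluster with the nearest mean.
        labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        # Update step: recompute each mean from the cells assigned to it.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # leave an empty cluster's mean in place
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: rows are cells, columns are two marker intensities.
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
labels, centers = kmeans(cells, centers=[cells[0], cells[3]])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Note that k, here the number of starting centers, is fixed before the algorithm sees any structure in the data, which is exactly the limitation described above.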

FlowSOM: The Rising Star

FlowSOM, introduced by Van Gassen et al. in 2015, has quickly become a favorite in the cytometry community. This algorithm uses self-organizing maps (SOMs) to cluster and visualize high-dimensional cytometry data.

What sets FlowSOM apart is its speed and scalability. It can handle millions of cells in just minutes, making it ideal for the large datasets typical in mass cytometry. The resulting star charts, laid out on the SOM grid or on a minimal spanning tree of its nodes, provide an intuitive visualization of the data structure.
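
For readers curious about what a self-organizing map does under the hood, here is a minimal Python sketch of SOM training. It is not FlowSOM's implementation: the grid size, learning-rate schedule, Gaussian neighborhood, toy `cells`, and the `train_som` and `assign` helpers are all assumptions chosen for illustration. It also stops at assigning cells to nodes and leaves out the later steps of a full FlowSOM analysis, such as grouping nodes into metaclusters.

```python
import random
from math import dist, exp

def train_som(cells, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a tiny SOM; returns one codebook vector per grid node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialize each node's codebook vector from a randomly chosen cell.
    codes = [tuple(rng.choice(cells)) for _ in grid]
    order = list(range(len(cells)))
    for epoch in range(epochs):
        # Learning rate and neighborhood radius shrink linearly over training.
        frac = 1.0 - epoch / epochs
        lr, radius = lr0 * frac, max(radius0 * frac, 1e-3)
        rng.shuffle(order)
        for idx in order:
            x = cells[idx]
            # Best-matching unit: the node whose codebook vector is nearest.
            bmu = min(range(len(codes)), key=lambda n: dist(x, codes[n]))
            for n, pos in enumerate(grid):
                # Nodes close to the BMU on the grid are pulled harder toward x.
                h = exp(-dist(pos, grid[bmu]) ** 2 / (2 * radius ** 2))
                codes[n] = tuple(w + lr * h * (xi - w) for w, xi in zip(codes[n], x))
    return codes

def assign(cells, codes):
    """Map each cell to the index of its nearest node."""
    return [min(range(len(codes)), key=lambda n: dist(x, codes[n])) for x in cells]

# Hypothetical toy data: rows are cells, columns are two marker intensities.
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
codes = train_som(cells)
node_of_cell = assign(cells, codes)
print(node_of_cell)
```

Because each cell only has to be compared against a small, fixed number of nodes, the cost of a pass grows linearly with the number of cells, which gives some intuition for why SOM-based methods scale well.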

Since its introduction, FlowSOM has been widely adopted and incorporated into numerous studies. For instance, Nowicka et al. (2017) integrated FlowSOM into their CyTOF workflow in “CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets” (F1000Research). Similarly, Weber and Robinson (2016) evaluated FlowSOM in their comprehensive benchmarking study “Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data” (Cytometry Part A).

One of the key advantages of FlowSOM is its availability as a free, open-source package in R. This accessibility has undoubtedly contributed to its widespread adoption in the cytometry community, reflected in a citation count of more than 1,500 as of summer 2024.

Other Specialized Algorithms

The world of clustering algorithms for cytometry data is rich and diverse. Here are some historically noteworthy techniques:

  1. PhenoGraph: Introduced by Levine et al. (2015) in “Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis” (Cell), PhenoGraph uses nearest-neighbor graphs to identify communities of phenotypically similar cells.
  2. DensVM: Becher et al. (2014) presented this density-based algorithm in “High-dimensional analysis of the murine myeloid cell system” (Nature Immunology), combining density-based clustering with support vector machines.
  3. X-shift: This algorithm, introduced by Samusik et al. (2016) in “Automated mapping of phenotype space with single-cell data” (Nature Methods), uses weighted k-nearest-neighbor density estimation to automatically determine the number of clusters.
  4. SPADE (Spanning-tree Progression Analysis of Density-normalized Events): While not strictly a clustering algorithm, this technique by Qiu et al. (2011) in “Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE” (Nature Biotechnology) combines density-based downsampling with hierarchical clustering to create tree-like visualizations of cellular hierarchies.
  5. ACCENSE (Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding): Shekhar et al. (2014) introduced this method in “Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding (ACCENSE)” (Proceedings of the National Academy of Sciences), combining t-SNE with density-based clustering.
  6. Citrus (Cluster identification, characterization, and regression): This algorithm, presented by Bruggner et al. (2014) in “Automated identification of stratifying signatures in cellular subpopulations” (Proceedings of the National Academy of Sciences), combines hierarchical clustering with predictive modeling to identify stratifying cell subpopulations.
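
The nearest-neighbor graph idea behind PhenoGraph (item 1 above) can be sketched in Python as follows. This is an illustration under stated assumptions, not the published method: the toy `cells`, the helper names, the choice to count each cell as part of its own neighborhood, and the `min_weight` threshold are all invented here, and plain connected components stand in for a proper community detection step.

```python
from math import dist

def knn_graph(cells, k):
    """For each cell, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, x in enumerate(cells):
        others = sorted(
            (j for j in range(len(cells)) if j != i),
            key=lambda j: dist(x, cells[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def jaccard_edges(neighbors):
    """Weight each kNN edge by the Jaccard overlap of the two neighborhoods."""
    edges = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            # Assumption: a cell counts as a member of its own neighborhood.
            a, b = neighbors[i] | {i}, neighbors[j] | {j}
            edges[(min(i, j), max(i, j))] = len(a & b) / len(a | b)
    return edges

def components(n, edges, min_weight):
    """Stand-in for community detection: connected components over strong edges."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), w in edges.items():
        if w >= min_weight:
            parent[find(i)] = find(j)
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# Hypothetical toy data: rows are cells, columns are two marker intensities.
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
edges = jaccard_edges(knn_graph(cells, k=2))
labels = components(len(cells), edges, min_weight=0.5)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Unlike k-means, nothing here asks for the number of clusters up front; it falls out of the graph structure, which is part of the appeal of graph-based methods.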

Recent Innovations

The field continues to evolve, with new techniques emerging to address the unique challenges of high-dimensional cytometry data:

  1. PARC (Phenotyping by Accelerated Refined Community-partitioning): Introduced by Stassen et al. (2020) in “PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells” (Bioinformatics), this method uses community detection algorithms for rapid clustering of large datasets.
  2. SAUCIE (Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding): Amodio et al. (2019) introduced this deep learning-based method in “Exploring single-cell data with deep multitasking neural networks” (Nature Methods). It performs multiple tasks, including clustering and batch correction, and has been cited more than 285 times.
  3. CytofDR: This method, presented by Wang et al. (2023) in “Comparative analysis of dimension reduction methods for cytometry by time-of-flight data” (Nature Communications), integrates multiple dimensionality reduction techniques with clustering for comprehensive cytometry data analysis.
  4. CRUSTY: Puccio et al. (2023) presented this web platform in “CRUSTY: a versatile web platform for the rapid analysis and visualization of high-dimensional flow cytometry data”.

[Figure: Performance and workflow of the PARC algorithm]
[Figure: Performance and description of the CRUSTY workflow]

As we survey this rich landscape of clustering algorithms, we’re reminded of the incredible progress made in analyzing high-dimensional cytometry data. From the foundational techniques of hierarchical and k-means clustering to the specialized algorithms like FlowSOM and beyond, each method offers a unique lens through which to view our cellular data.

These algorithms are more than just computational tools; they’re the cartographers of our cellular world, mapping out the complex territories of cell types and states. As mass cytometry continues to push the boundaries of what’s possible in single-cell analysis, clustering algorithms evolve in tandem, rising to meet new challenges and uncover deeper insights.

In this ever-expanding universe of cellular data, clustering algorithms serve as our guiding stars, helping us navigate the vast expanses of high-dimensional space and chart the unexplored territories of cellular biology. With each new algorithm and each refinement of existing techniques, we inch closer to a comprehensive understanding of the intricate cellular ecosystems that underlie health and disease.

As we look to the future, one thing is certain: the journey of discovery in cytometry data analysis is far from over. New challenges will arise, and with them, new algorithmic innovations. In this dynamic field, today’s cutting-edge technique may become tomorrow’s standard tool, and the next revolutionary algorithm may be just around the corner, waiting to unveil new wonders in our cellular universe.

I like FlowSOM. Its algorithm felt very powerful... but its visualization tool? Not so much. So, there I was, caught between my love for FlowSOM's clustering prowess and my desperate need for eye-catching visuals. It was like dating someone brilliant but with a terrible sense of fashion. Enter my knight in shining R code: a colleague who helped me create a visualization package that could make even the dullest data strut its stuff. Suddenly, with just a few lines of code, I could have my cake and eat it too: FlowSOM's brains with Hollywood-worthy good looks. And just like that, I took my most significant step in R programming by developing Cytofast...

Dr. Guillaume Beyrend-Frizon Scientist - Physician
Dr. Guillaume Beyrend-Frizon is an MD-PhD researcher and creator of the Cytofast R package, with 15 peer-reviewed publications in Cell Reports Medicine, JITC, and JoVE focusing on immunotherapy and advanced cytometry analysis. Through LearnCytometry.com, he has trained over 500 scientists worldwide in R-based cytometry analysis, translating cutting-edge research into practical educational tools that provide cost-effective alternatives to expensive commercial software.