**Data Reduction** in Phenotypic Screening:

From Complex Data to **Actionable Insights**

Incredible advancements in microscopy and image analysis software have enabled increased throughput and data content, contributing to the **emergence of phenotypic assays in drug discovery**. Cell Painting is a popular example of such an assay, which uses six fluorescent dyes to reveal eight cellular components or organelles (1, 2). Combined with advanced image analysis approaches, researchers can rapidly obtain hundreds or thousands of morphological features, enabling the detection of subtle phenotypes.

**This capability to rapidly generate highly detailed datasets represents both the technology’s greatest strength and its greatest challenge**. On the one hand, datasets contain comprehensive and holistic information on cellular phenotypes, revealing profound insights into biological processes. On the other hand, many biologists find it challenging to transform these overwhelmingly large datasets into meaningful insights. The task of extracting clear phenotypic insights from thousands of measurements can be daunting.

**The Curse of Dimensionality**

In addition to interpretability, this level of dimensionality presents a number of potential problems, collectively known as the 'curse of dimensionality' (3). As the number of dimensions (features) increases, the data space expands exponentially, resulting in several challenges:

**Computational Complexity**: High-dimensional data demands more computational resources, which makes analysis expensive and time-consuming and often requires specialized infrastructure.

**Sparsity**: As dimensionality increases, data points become sparse. The same number of data points is spread over a much larger space, so most of that space is empty, which complicates the identification of patterns and relationships. Sparsity can lead to overfitting of machine learning models: the model learns noise in the data instead of the underlying patterns, and thus fails to generalize well to new data.

**Meaningful Distances**: In high-dimensional spaces, distances become less informative: the distance from a point to its nearest neighbor approaches the distance to its farthest neighbor, which complicates the identification of similar or dissimilar data points. Algorithms that rely on distance scores, such as k-nearest neighbors (KNN), can therefore struggle to differentiate between data points.

**Visualization**: Traditional visualization techniques like scatter plots are inadequate for more than three dimensions. This can hinder the researcher’s ability to explore and communicate insights from the data, complicating both the analysis and the presentation of findings.
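The effect of dimensionality on distances can be illustrated with a small simulation. The sketch below is an illustration only, assuming uniformly random points rather than real morphological profiles; the function name `relative_contrast` and the point counts are hypothetical choices. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for distances from one query point
    to n_points random points in an n_dims-dimensional unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # synthetic data, not real profiles
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    return float((dists.max() - dists.min()) / dists.min())

# The contrast between nearest and farthest neighbor shrinks as dimensions grow.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(500, n_dims), 3))
```

When the printed contrast approaches zero, nearest and farthest neighbors are almost equally far away, which is why distance-based methods tend to benefit from data reduction first.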

**Dimensionality Reduction Techniques**

Sophisticated data analysis strategies are therefore needed to analyze and interpret the data. Dimensionality reduction or data reduction methods transform high-dimensional data into a lower-dimensional space, while preserving most of the original information. This helps make the data more manageable and interpretable. Two commonly used methods are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP).

**Principal Component Analysis (PCA)**

PCA is a statistical technique that constructs new variables, called Principal Components, as linear combinations of the initial features. The features are combined in such a way that the Principal Components are uncorrelated with one another, and the components are ordered by the amount of variance they capture. PC1 is the combination of features that explains the largest share of the variance in the dataset, PC2 explains the largest share of the remaining variance while being uncorrelated with PC1, and so on.
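As a rough illustration of the idea, PCA can be sketched in a few lines of NumPy using a singular value decomposition. This is a minimal sketch, not the implementation used by any particular tool: the function name `pca`, the choice to standardize features first, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def pca(X: np.ndarray, n_components: int):
    """PCA via SVD on standardized features (standardization is an assumption).

    Returns (scores, components, explained_variance_ratio).
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # avoid dividing by zero for constant features
    Z = (X - mean) / std                     # each feature: mean 0, variance 1
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]           # rows are the principal axes
    scores = Z @ components.T                # data projected onto the components
    return scores, components, ratio[:n_components]

# Synthetic example: 20 features driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

scores, components, ratio = pca(X, n_components=3)
print("explained variance ratio:", np.round(ratio, 3))
```

The returned scores are uncorrelated with one another, and the explained variance ratio is sorted from largest to smallest, matching the ordering of components described above.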

**Uniform Manifold Approximation and Projection (UMAP)**

UMAP is a dimensionality reduction method that, unlike PCA, can capture non-linear relationships in the data. Essentially, UMAP maps high-dimensional data to a lower-dimensional space while aiming to preserve the local neighborhood structure and, to a lesser extent, the global structure of the original data. Although UMAP is not a clustering method itself, its embeddings are often used to visualize groups of similar data points. This makes UMAP particularly useful for complex biological data where interactions are often non-linear.

**Easily Perform Dimensionality Reduction with StratoMineR**

Data reduction methods, such as PCA and UMAP, are powerful tools for overcoming the curse of dimensionality and are essential for phenotypic analysis. Their practical implementation, however, can be complex and time-consuming, requiring significant expertise. To empower biologists to independently perform these complex analyses, we have developed StratoMineR, part of StratoInsight. This intuitive data mining tool guides biologists through a best-practices workflow for multiparametric data, helping users to preprocess their data and subsequently perform one of several data reduction techniques, including PCA and UMAP.

**Explore the Biology in Your Data**

Aided by interactive visualizations, StratoMineR enables in-depth data exploration. One can, for example, **examine the loadings (weights) of features on each principal component.** This allows researchers to determine which features contribute most, or which features load heavily on the same principal component and thus vary together, providing insight into the biology underlying the dataset.
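For readers who want to see what examining loadings involves, the sketch below ranks features by their loading on a chosen principal component. It is an illustration under stated assumptions, not StratoMineR's implementation: the function `top_loading_features`, the synthetic data, and the feature names (such as `Nuclei_Area`) are hypothetical. Loadings are computed here as the component weights scaled by the square root of the component's variance, which for standardized features equals the correlation between each feature and the component.

```python
import numpy as np

def top_loading_features(X, feature_names, component=0, top_n=3):
    """Rank features by absolute loading on one principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize features
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    # Loading = weight * sqrt(component variance); the sign is arbitrary.
    loadings = Vt[component] * S[component] / np.sqrt(X.shape[0] - 1)
    order = np.argsort(-np.abs(loadings))[:top_n]
    return [(feature_names[i], float(loadings[i])) for i in order]

# Synthetic data: three features share one hidden "size" factor, two are noise.
rng = np.random.default_rng(1)
n = 300
size_factor = rng.normal(size=n)
X = np.column_stack([
    size_factor + 0.1 * rng.normal(size=n),
    size_factor + 0.1 * rng.normal(size=n),
    size_factor + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
feature_names = ["Nuclei_Area", "Nuclei_Perimeter", "Nuclei_MajorAxisLength",
                 "Mito_Intensity", "ER_Texture"]  # hypothetical names

top = top_loading_features(X, feature_names, component=0, top_n=3)
for name, loading in top:
    print(f"{name}: {loading:+.2f}")
```

Features that load heavily on the same component, like the three size-related features in this toy example, vary together and can often be interpreted as one underlying biological property.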

**Reduced data can then be visualized and explored in an interactive 3D plot.** This helps researchers to examine phenotype distributions, compare them to reference compounds, and uncover patterns that might not be apparent in higher-dimensional spaces. In StratoMineR, users can apply filters and adjust viewing angles and zoom levels to better understand the relationships between different phenotypes and to identify clusters or anomalies.

**From Data Reduction to Discovery**

Following data reduction, researchers can use supervised or unsupervised methods to perform hit selection, for example by identifying compounds that produce phenotypes similar or dissimilar to that of a known compound of interest. Clustering algorithms can then group similar data points, revealing patterns and relationships within the dataset and providing further insight into mechanisms of action or toxicity profiles.
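One simple way to picture similarity-based hit selection is to measure distances to a reference compound in the reduced space. The sketch below is a hypothetical illustration only: the function `select_hits`, the Euclidean metric, the threshold of 1.0, and the toy scores are assumptions for the example and do not describe how any specific tool selects hits.

```python
import numpy as np

def select_hits(scores: np.ndarray, reference_index: int, max_distance: float):
    """Return indices of profiles within max_distance of the reference profile,
    sorted from most to least similar, plus all distances to the reference."""
    reference = scores[reference_index]
    dists = np.linalg.norm(scores - reference, axis=1)  # Euclidean distance
    hits = np.flatnonzero(dists <= max_distance)
    hits = hits[hits != reference_index]                # exclude the reference itself
    return hits[np.argsort(dists[hits])], dists

# Toy reduced data: rows are compounds, columns are three component scores.
scores = np.array([
    [0.0, 0.0, 0.0],   # reference compound
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 5.0],
    [0.2, 0.1, 0.0],
    [-4.0, 3.0, 0.0],
])

hits, dists = select_hits(scores, reference_index=0, max_distance=1.0)
print("similar compounds:", hits.tolist())
```

Reversing the comparison (keeping profiles farther than a threshold) would select dissimilar phenotypes instead; in practice the metric and threshold would need to be chosen and validated for the assay at hand.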

**Curious to learn more?** Discover how we used StratoMineR for the analysis of a public Cell Painting dataset in our webinar, or request a StratoInsight demo.

**References**

1: Bray MA, Singh S, Han H, Davis CT, Borgeson B, Hartland C, Kost-Alimova M, Gustafsdottir SM, Gibson CC, Carpenter AE. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat Protoc. 2016 Sep;11(9):1757-74. doi: 10.1038/nprot.2016.105. Epub 2016 Aug 25. PMID: 27560178; PMCID: PMC5223290. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/

2: Gustafsdottir SM, Ljosa V, Sokolnicki KL, Anthony Wilson J, Walpita D, Kemp MM, Petri Seiler K, Carrel HA, Golub TR, Schreiber SL, Clemons PA, Carpenter AE, Shamji AF. Multiplex cytological profiling assay to measure diverse cellular states. PLoS One. 2013 Dec 2;8(12):e80999. doi: 10.1371/journal.pone.0080999. PMID: 24312513; PMCID: PMC3847047. https://pmc.ncbi.nlm.nih.gov/articles/PMC3847047/

3: Bellman R. Dynamic programming. Science. 1966 Jul 1;153(3731):34-7. doi: 10.1126/science.153.3731.34. PMID: 17730601. https://pubmed.ncbi.nlm.nih.gov/17730601/