Transparent Single-Cell Set Classification with Kernel Mean Embeddings

Paper Title

Transparent Single-Cell Set Classification with Kernel Mean Embeddings

Authors

Siyuan Shan, Vishal Baskaran, Haidong Yi, Jolene Ranek, Natalie Stanley, Junier Oliva

Abstract

Modern single-cell flow and mass cytometry technologies measure the expression of several proteins in the individual cells of a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which makes it computationally expensive to predict each sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models, because it is difficult to track how each individual cell influences the ultimate prediction. We propose using kernel mean embeddings to encode the cellular landscape of each profiled biological sample. Although our foremost goal is to make a more transparent model, we find that our method, using a simple linear classifier, achieves accuracy comparable to or better than state-of-the-art gating-free methods. As a result, our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model make it easy to interpret classification results. Analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
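The idea in the abstract (encode each sample's set of cells as a kernel mean embedding, then classify with a linear model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the RBF kernel approximated with random Fourier features, the value of `gamma`, the synthetic five-marker data, and the nearest-class-mean linear rule. The sub-selection step mentioned in the abstract is omitted.

```python
import numpy as np

def make_rff(n_markers, n_features, gamma, rng):
    # Random Fourier features for k(x, y) = exp(-gamma * ||x - y||^2):
    # frequencies are drawn from N(0, 2 * gamma * I), phases from U[0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_markers, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b

def cell_features(cells, W, b):
    # Feature map for every cell; cells has shape (n_cells, n_markers).
    return np.sqrt(2.0 / W.shape[1]) * np.cos(cells @ W + b)

def mean_embedding(cells, W, b):
    # Approximate kernel mean embedding: the average feature vector over cells.
    return cell_features(cells, W, b).mean(axis=0)

def simulate_sample(label, rng, n_cells=500, n_markers=5):
    # Hypothetical data: in class 1, roughly 30% of cells over-express marker 0.
    cells = rng.normal(size=(n_cells, n_markers))
    if label == 1:
        shifted = rng.random(n_cells) < 0.3
        cells[shifted, 0] += 3.0
    return cells

rng = np.random.default_rng(0)
W, b = make_rff(n_markers=5, n_features=256, gamma=0.1, rng=rng)

def embed_cohort(labels):
    return np.stack([mean_embedding(simulate_sample(y, rng), W, b) for y in labels])

y_train = np.array([0, 1] * 20)
y_test = np.array([0, 1] * 10)
X_train = embed_cohort(y_train)
X_test = embed_cohort(y_test)

# A simple linear classifier on the embeddings: nearest class mean.
mu0 = X_train[y_train == 0].mean(axis=0)
mu1 = X_train[y_train == 1].mean(axis=0)
w = mu1 - mu0
bias = -0.5 * (mu1 + mu0) @ w
y_pred = (X_test @ w + bias > 0).astype(int)
accuracy = float((y_pred == y_test).mean())
print(f"held-out accuracy: {accuracy:.2f}")

# Linearity gives a per-cell reading of the score: the sample score is the
# average of these per-cell values plus the bias.
new_cells = simulate_sample(1, rng)
per_cell = cell_features(new_cells, W, b) @ w
sample_score = mean_embedding(new_cells, W, b) @ w
```

Because the classifier is linear in the embedding, a sample's score decomposes into an average of per-cell terms (`per_cell` above). This is one way to read the abstract's claim that linearity makes it possible to trace how individual cells influence the prediction.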
