In order to maximize interoperability of Loom files between analysis pipelines, we suggest adhering to the following conventions.
Note: This document is work in progress and subject to change! You can follow the discussion here.
Single analysis per file¶
Each Loom file stores a single analysis. If you want to try two different ways of clustering, you store the results separately.
This convention simplifies things a lot, because we only need to keep track of one set of cluster labels (for example). It also means that we don’t need to store the relationship between different attributes (e.g. that this clustering was done using that PCA). Such relationships must be stored external to the file.
- Columns represent cells or aggregates of cells
- Rows represent genes
Loom files can grow along the column axis, but not the row axis, so this makes sense.
Attribute naming conventions¶
We propose that algorithms always accept an argument specifying the name of each attribute it will use, with
the defaults set as listed below. For example, an algorithm that needs a unique gene accession string would
take an argument
accession_attr="Accession", and a clustering algorithm that generates cluster labels would
take an argument
cluster_id_attr="ClusterID". In this way, Loom files that conform to the conventions below
would work without fuss, but files that don’t could still be made to work by supplying the non-standard attribute names.
CellID a string label unique to each cell (preferrably globally unique across all datasets)
Valid integers 1 or 0, indicating cells that are considered valid after some QC step
ClusterID an integer label with values in [0, n_clusters].
ClusterName a string label with arbitrary values representing cluster names. If both
ClusterName are present, they should correspond 1:1.
Outliers an integer label 1 or 0, indicating cells that are outliers relative to the clusters.
Embedding an M-by-Y matrix where M is the number of columns and Y is the dimensionality of the embedding. This can be used
to store e.g. a PCA or t-SNE dimensionality reduction.
Gene a string human-readable gene name, not necessarily unique
Accession a string, unique within the file (e.g. an ENSEMBL accession etc.)
Selected integers 1 or 0, indicating genes that were selected by some previous step.
Valid integers 1 or 0, indicating genes that are considered valid after some QC step