Conventions

In order to maximize interoperability of Loom files between analysis pipelines, we suggest adhering to the following conventions.

Note: This document is work in progress and subject to change! You can follow the discussion here.

Single analysis per file

Each Loom file stores a single analysis. If you want to try two different ways of clustering, you store the results separately.

This convention simplifies things a lot, because we only need to keep track of one set of cluster labels (for example). It also means that we don’t need to store the relationship between different attributes (e.g. that this clustering was done using that PCA). Such relationships must be stored external to the file.

Orientation

  • Columns represent cells or aggregates of cells
  • Rows represent genes

Loom files can grow along the column axis, but not the row axis, so this makes sense.

Attribute naming conventions

We propose that algorithms always accept an argument specifying the name of each attribute it will use, with the defaults set as listed below. For example, an algorithm that needs a unique gene accession string would take an argument accession_attr="Accession", and a clustering algorithm that generates cluster labels would take an argument cluster_id_attr="ClusterID". In this way, Loom files that conform to the conventions below would work without fuss, but files that don’t could still be made to work by supplying the non-standard attribute names.

Column attributes

CellID a string label unique to each cell (preferrably globally unique across all datasets)

Valid integers 1 or 0, indicating cells that are considered valid after some QC step

ClusterID an integer label with values in [0, n_clusters].

ClusterName a string label with arbitrary values representing cluster names. If both ClusterID and ClusterName are present, they should correspond 1:1.

Outliers an integer label 1 or 0, indicating cells that are outliers relative to the clusters.

Embedding an M-by-Y matrix where M is the number of columns and Y is the dimensionality of the embedding. This can be used to store e.g. a PCA or t-SNE dimensionality reduction.

Row attributes

Gene a string human-readable gene name, not necessarily unique

Accession a string, unique within the file (e.g. an ENSEMBL accession etc.)

Selected integers 1 or 0, indicating genes that were selected by some previous step.

Valid integers 1 or 0, indicating genes that are considered valid after some QC step