API Walkthrough
Creating and connecting
Creating Loom files
To create a loom file from data, you need to supply a main matrix (numpy ndarray or scipy sparse matrix) and two dictionaries of row and column attributes (with attribute names as keys, and numpy ndarrays as values). If the main matrix is N×M, then the row attributes must have N elements, and the column attributes must have M elements.
For example, the following creates a loom file with a 100x100 main matrix, one row attribute and one column attribute:
import numpy as np
import loompy
filename = "test.loom"
matrix = np.arange(10000).reshape(100,100)
row_attrs = { "SomeRowAttr": np.arange(100) }
col_attrs = { "SomeColAttr": np.arange(100) }
loompy.create(filename, matrix, row_attrs, col_attrs)
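The shape requirement described above can be checked before the file is written. The following is a minimal sketch; check_attrs is a hypothetical helper written for this illustration and is not part of loompy:

```python
import numpy as np

def check_attrs(matrix, row_attrs, col_attrs):
    """Raise ValueError if an attribute's length disagrees with the matrix shape."""
    n_rows, n_cols = matrix.shape
    for kind, attrs, expected in (("row", row_attrs, n_rows), ("column", col_attrs, n_cols)):
        for name, values in attrs.items():
            # Only the first dimension has to match the matrix
            length = np.asarray(values).shape[0]
            if length != expected:
                raise ValueError(f"{kind} attribute {name!r} has {length} elements, expected {expected}")

matrix = np.arange(10000).reshape(100, 100)
row_attrs = {"SomeRowAttr": np.arange(100)}
col_attrs = {"SomeColAttr": np.arange(100)}
check_attrs(matrix, row_attrs, col_attrs)  # passes silently for consistent input
```

Because the helper only reads matrix.shape, it should work equally for numpy arrays and scipy sparse matrices.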
loompy.create() accepts numpy dense matrices (numpy.ndarray) as well as scipy sparse matrices (scipy.sparse.coo_matrix, scipy.sparse.csc_matrix, or scipy.sparse.csr_matrix). For example:
import numpy as np
import loompy
import scipy.sparse as sparse
filename = "test.loom"
matrix = sparse.coo_matrix((100, 100))
row_attrs = { "SomeRowAttr": np.arange(100) }
col_attrs = { "SomeColAttr": np.arange(100) }
loompy.create(filename, matrix, row_attrs, col_attrs)
Note that loompy.create() does not return anything. To work with the newly created file, you must loompy.connect() to it.
You can also create an empty file using loompy.new(), which returns a connection to the newly created file. The file can then be populated with data.
This is especially useful when you’re building a dataset incrementally, e.g. by pooling subsets of other datasets:
import logging
import loompy

# samples is assumed to be a list of paths to the input loom files
with loompy.new("outfile.loom") as dsout:
    for sample in samples:
        with loompy.connect(sample) as dsin:
            logging.info(f"Appending {sample}.")
            dsout.add_columns(dsin.layers, col_attrs=dsin.col_attrs, row_attrs=dsin.row_attrs)
You can also create a file by combining existing loom files (loompy.combine()). The files will be concatenated along the column axis, and therefore must have the same number of rows. If the rows are potentially not in the same order, you can supply a key argument; the row attribute corresponding to the key will be used to sort the files.
For example, the following code will combine files and use the “Accession” row attribute as the key:
loompy.combine(files, output_filename, key="Accession")
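To illustrate what sorting by a key achieves, here is a NumPy-only sketch. It is not loompy's implementation, and the accession values and matrices are made up for the example:

```python
import numpy as np

# Two hypothetical files holding the same genes in different row orders
accession_a = np.array(["ENSG03", "ENSG01", "ENSG02"])
accession_b = np.array(["ENSG02", "ENSG03", "ENSG01"])
matrix_a = np.array([[30, 31], [10, 11], [20, 21]])
matrix_b = np.array([[22], [32], [12]])

# Sorting each file by its key puts the rows into a common order
order_a = np.argsort(accession_a)
order_b = np.argsort(accession_b)
assert np.array_equal(accession_a[order_a], accession_b[order_b])

# With the rows aligned, the matrices can be concatenated along the column axis
combined = np.hstack([matrix_a[order_a, :], matrix_b[order_b, :]])
print(combined)
```

Each row of combined now holds the values of a single gene across the cells of both inputs.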
You can import a 10X Genomics cellranger output folder using loompy.create_from_cellranger():
loompy.create_from_cellranger(folder, output_filename)
Connecting to Loom files
In order to work with a loom file, you must first loompy.connect() to it. This does not load the data or attributes, so is very quick regardless of the size of the file. It’s more like connecting to a database than reading a file. Loom supports Python context management, so normally you should use a with statement to take care of the connection:
with loompy.connect("filename.loom") as ds:
    ...  # do something with ds
The connection will be automatically closed at the end of the with block.
Sometimes, especially in interactive use in a Jupyter notebook, you may want to just open the file and keep the connection around:
ds = loompy.connect("filename.loom")
In that case, you should close the file when you are done:
ds.close()
In most cases, forgetting to close the file will do no harm, but may (for example) prevent concurrent processes from accessing the file, and will leak file handles.
In the rest of the documentation below, ds is assumed to be an instance of LoomConnection obtained by connecting to a Loom file.
Manipulate data
Shape, indexing and slicing
The LoomConnection.shape attribute returns the row and column count as a tuple:
>>> ds.shape
(100, 2345)
The data stored in the main matrix can be retrieved by indexing and slicing. The following are supported:
Indices: anything that can be converted to a Python integer
Slices (e.g. : or 0:10)
Lists of the rows/columns you want (e.g. [0, 34, 576])
Mask arrays (i.e. numpy arrays of bool indicating the rows/columns you want)
Lists and mask arrays are supported along one dimension at a time only. Since the main matrix is two-dimensional, two arguments are always needed. Examples:
ds[:, :] # Return the entire matrix
ds[0:10, 0:10] # Return the 10x10 submatrix starting at row and column zero
ds[99, :] # Return the 100th row
ds[:, 99] # Return the 100th column
ds[[0,3,5], :] # Return rows with index 0, 3 and 5
ds[:, bool_array] # Return columns where bool_array elements are True
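Because lists and mask arrays are supported along only one dimension at a time, picking specific rows and specific columns takes two steps. The sketch below uses a plain NumPy array as a stand-in for the loom matrix so that it runs without a file; with a real connection, the first step would be an expression such as ds[row_mask, :]:

```python
import numpy as np

matrix = np.arange(20).reshape(4, 5)   # stand-in for the loom main matrix
row_mask = np.array([True, False, True, False])
col_list = [0, 3, 4]

# Step 1: select the rows (with a loom file, this is the step that reads from disk)
rows_only = matrix[row_mask, :]
# Step 2: select the columns from the in-memory result
submatrix = rows_only[:, col_list]
print(submatrix.shape)  # (2, 3)
```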
Note that performance will be poor if you select many individual rows (columns) out of a large matrix. For example, in a dataset with shape (27998, 160796), loading ten randomly chosen individual full columns took 914 ms, whereas loading 1000 columns took 1 minute and 6 seconds, and loading 5000 columns took 13 minutes. This slowdown is caused by a performance bug in h5py.
If the whole dataset fits in RAM, loading it in full and then selecting the rows/columns you want will be faster. If it doesn’t, consider using the LoomConnection.scan() method (see below), which in this example took 1 minute and 12 seconds regardless of how many columns were selected. As a rule of thumb, LoomConnection.scan() will be faster whenever you are loading more than about 1% of the rows or columns (randomly selected).
Sparse data
On disk, every layer is stored chunked and block-compressed, for efficient storage and access along both axes.
The main matrix and additional layers can be assigned from dense or sparse matrices.
You can load the main matrix or any layer as sparse:
ds.layers["exons"].sparse() # Returns a scipy.sparse.coo_matrix
ds.layers["unspliced"].sparse(rows, cols) # Returns only the indicated rows and columns (ndarrays of integers or bools)
You can assign layers from sparse matrices:
ds.layers["exons"] = my_sparse_matrix
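Once a layer has been loaded as a scipy.sparse.coo_matrix, further work on it usually starts with a format conversion: SciPy's documentation notes that the COO format does not directly support slicing or arithmetic, and recommends converting to CSR or CSC. A small sketch, using a hand-made matrix as a stand-in for the value returned by sparse():

```python
import numpy as np
import scipy.sparse as sparse

# Stand-in for the coo_matrix returned by a layer's sparse() method
dense = np.array([[0, 2, 0], [1, 0, 0], [0, 0, 3]])
coo = sparse.coo_matrix(dense)

# Convert to CSR (efficient row slicing) before indexing; CSC suits column slicing
csr = coo.tocsr()
first_two_rows = csr[0:2, :]
print(first_two_rows.toarray())
print(coo.nnz)  # number of stored entries
```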
Modifying layers
You can modify the data in any layer by assigning to a slice. For example:
ds[:, :] = newdata # Assign a full matrix
ds[3, 500] = 31 # Set the element at (3, 500) to the value 31
ds[99, :] = rowdata # Assign new values to row with index 99
ds[:, 99] = coldata # Assign new values to column with index 99
Global attributes
Global attributes are available at ds.attrs and can be accessed by name or as a dictionary. You create new attributes by assignment, and delete them using the del statement:
>>> ds.attrs.title
"The title of the dataset"
>>> ds.attrs.title = "New title"
>>> ds.attrs["title"]
"New title"
>>> del ds.attrs.title
You can list the attributes and loop over them as you would with a dictionary:
>>> ds.attrs.keys()
["title", "description"]
>>> for key, value in ds.attrs.items():
...     print(f"{key} = {value}")
title = New title
description = Fancy dataset
Global attributes can be scalars, or multidimensional arrays of any shape, and the elements can be integers, floats or strings. See below for the exact types allowed.
Row and column attributes
Row and column attributes are accessed at ds.ra and ds.ca, respectively, and support the same interface as global attributes. For example:
ds.ra.keys() # Return list of row attribute names
ds.ca.keys() # Return list of column attribute names
ds.ra.Gene = ... # Create or replace the Gene attribute
a = ds.ra.Gene # Assign the array of gene names (assuming the attribute exists)
del ds.ra.Gene # Delete the Gene row attribute
Attributes can also be accessed by indexing:
a = ds.ra["Gene"] # Assign the array of gene names (assuming the attribute exists)
del ds.ra["Gene"] # Delete the Gene row attribute
You can pick out multiple attributes into a single numpy array, as long as they have the same type:
a = ds.ra["Gene", "Attribute"] # Returns a 2D array of shape (n_genes, 2)
b = ds.ca["PCA1", "PCA2"] # Returns a 2D array of shape (n_cells, 2)
Note that when you ask for multiple attributes, missing attributes are silently ignored. This can be exploited to access attributes that may have different names:
a = ds.ra["Gene", "GeneName"] # Return one or the other (if only one exists)
b = ds.ca["TSNE", "PCA", "UMAP"] # Return the one that exists (if only one exists)
Of course, if two or more of the attributes exist, they will be stacked as above.
Attributes can be any of the following:
One-dimensional arrays of integers, floats or strings. The number of elements in the array must match the corresponding matrix dimension.
Multidimensional arrays of any of the same element types. The length along the first dimension of a row attribute must equal the number of rows in the main matrix (and vice versa for column attributes). Remaining dimensions can be any size.
For example, if the main matrix has M columns, the result of a dimensionality reduction (for example, a PCA) to 20 dimensions could be stored as a column attribute with shape (M, 20).
You can assign attributes using almost any array or list-like type, but attributes are always returned as numpy arrays (np.ndarray).
Using attributes as masks for indexing the main matrix results in a very compact and readable syntax for selecting subarrays:
>>> ds[ds.ra.Gene == "Actb", :]
array([[ 2., 9., 9., ..., 0., 14., 0.]], dtype=float32)
>>> ds[(ds.ra.Gene == "Actb") | (ds.ra.Gene == "Gapdh"), :]
array([[ 2., 9., 9., ..., 0., 14., 0.],
[ 0., 1., 4., ..., 0., 14., 3.]], dtype=float32)
>>> ds[:, ds.ca.CellID == "AAACATACATTCTC-1"]
array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)
Note that numpy overloads the bitwise operators, not the boolean ones. Use | for ‘or’, & for ‘and’ and ~ for ‘not’. You must also place parentheses around the comparison expressions to ensure proper operator precedence. For example:
(a == b) & (a > c) | ~(c <= b)
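The following sketch shows the three operators on small NumPy arrays. The arrays are stand-ins for attributes such as ds.ra.Gene, and the counts values are invented for the example:

```python
import numpy as np

genes = np.array(["Actb", "Gapdh", "Sox2", "Actb"])   # stand-in for ds.ra.Gene
counts = np.array([5, 0, 7, 0])                       # hypothetical per-gene totals

# Each comparison is parenthesised because |, & and ~ bind more tightly than == and >
either = (genes == "Actb") | (genes == "Gapdh")
expressed_actb = (genes == "Actb") & (counts > 0)
not_actb = ~(genes == "Actb")

print(either)          # [ True  True False  True]
print(expressed_actb)  # [ True False False False]
print(not_actb)        # [False  True  True False]
```

Any of these masks could then be used to index the main matrix along the row axis.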
Modifying attributes
Unlike layers, attributes are always only read and written in their entirety. Thus, assigning to a slice does not modify the attribute on disk. To write new values for an attribute, you must assign a full list or ndarray to the attribute:
with loompy.connect("filename.loom") as ds:
    ds.ca.ClusterNames = values  # where values is a list or ndarray with one element per column
    # This does not change the attribute on disk:
    ds.ca.ClusterNames[10] = "banana"
Adding columns
You can add columns to an existing loom file. It’s not possible to add rows or to delete any part of the matrix.
ds.add_columns(submatrix, col_attrs)
You need to provide a submatrix corresponding to the new columns, as well as a dictionary of column attributes with values for all the new columns.
Note that if you are adding columns to an empty file, you must also provide row attributes:
ds.add_columns(submatrix, col_attrs, row_attrs={"Gene": genes})
You can also add the contents of another .loom file:
ds.add_loom(other_file, key="Gene")
The content of the other file is added as columns on the right of the current dataset. The rows must match for this to work. That is, the two files must have exactly the same number of rows. If key is given, the rows will be ordered based on the key attribute. Furthermore, the two datasets must have the same column attributes (but of course can have different values for those attributes at each column). Missing attributes can be given default values using the fill_values argument. If the files contain any global attribute with conflicting values, you can automatically convert such attributes into column attributes by passing convert_attrs=True to the method.
Layers
Loom supports multiple layers. There is always a single main matrix, but optionally one or more additional layers having the same number of rows and columns. Layers are accessed using the layers property on the LoomConnection object.
Layers support the same pythonic API as attributes:
ds.layers.keys() # Return list of layers
ds.layers["unspliced"] # Return the layer named "unspliced"
ds.layers["spliced"] = ... # Create or replace the "spliced" layer
a = ds.layers["spliced"][:, 10] # Assign the column with index 10 of layer "spliced" to the variable a
del ds.layers["spliced"] # Delete the "spliced" layer
The main matrix is available as a layer named “” (the empty string). It cannot be deleted but otherwise supports the same operations as any other layer.
As a convenience, layers are also available directly on the connection object. The above expressions are equivalent to the following:
ds["unspliced"] # Return the layer named "unspliced"
ds["spliced"] = ... # Create or replace the "spliced" layer
a = ds["spliced"][:, 10] # Assign the column with index 10 of layer "spliced" to the variable a
del ds["spliced"] # Delete the "spliced" layer
Sometimes you may need to create an empty layer (all zeros), to be filled later. Empty layers are created by assigning a type to a layer name. For example:
ds["empty_floats"] = "float32"
ds["empty_ints"] = "int64"
Graphs
Loom supports sparse graphs with either the rows or the columns as nodes. For example, a sparse graph of cells (stored in the columns) could represent a K nearest-neighbors graph of the cells. In that case, the cells are the nodes (so there are M nodes in the graph if there are M columns in the main matrix), which are connected by an arbitrary number of edges. The graph could be considered directed or undirected, and can have float-valued weights on the edges. Loom even supports multigraphs (permitting multiple edges between pairs of nodes). Graphs are stored as arrays of edges and the associated edge weights.
Row and column graphs are accessed at ds.row_graphs and ds.col_graphs, respectively, and support the same interface as attributes. For example:
ds.row_graphs.keys() # Return list of row graphs
ds.col_graphs.KNN = ... # Create or replace the column-oriented graph KNN
a = ds.col_graphs.KNN # Assign the KNN column graph to variable a
del ds.col_graphs.KNN # Delete the KNN graph
Graphs are returned as scipy.sparse.coo_matrix, and can be created/assigned from any scipy sparse format as well as from a numpy dense matrix or ndarray. In each case, the matrix represents the adjacency matrix of the graph.
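As an illustration of the adjacency-matrix representation, the sketch below builds a small weighted graph from arrays of edges using SciPy only. The node count, edges and weights are invented for the example; a matrix like knn is the kind of object that could be assigned to a column graph:

```python
import numpy as np
import scipy.sparse as sparse

n_cells = 4                          # one node per column of the main matrix
sources = np.array([0, 0, 1, 2, 3])  # start node of each edge
targets = np.array([1, 2, 2, 3, 0])  # end node of each edge
weights = np.array([0.9, 0.4, 0.7, 0.8, 0.5])

# Adjacency matrix: entry (i, j) holds the weight of the edge from node i to node j
knn = sparse.coo_matrix((weights, (sources, targets)), shape=(n_cells, n_cells))

# An undirected graph corresponds to a symmetric adjacency matrix
undirected = knn.tocsr().maximum(knn.T.tocsr())
print(knn.shape, knn.nnz)
```

The matrix must be square, with one row and one column per node.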
Views
Loompy views are in-memory views of a slice through the underlying loom file. Views can be created explicitly by slicing:
ds.view[:, 10:20]
This will create a view, fully loaded in memory, containing all the rows of the underlying loom file, but only columns 10 through 19 (zero-based). You can use fancy indexing including slices, arrays of integers (to pick out specific rows/columns) and boolean arrays.
The power of the view is that it slices through everything: the main matrix, every layer, every attribute, and every graph. This hides a lot of messy and error-prone code, and makes it easy to extract relevant subsets of a loom file.
The most common use of a view is in scanning through a file (see scan() below).
Operations
Map
You can map one or more functions across all rows (all columns), while avoiding loading the entire dataset into memory:
ds.map([np.mean, np.std], axis=1)
The functions will receive an array (of floats or integers) as their only argument, and should return a single float or integer value. Internally, map() uses scan() to loop across the file.
Note that you must always provide a list of functions, even if it has only one element, and that the result is a list of vectors, one per function that was supplied. Hence the correct way to map a single function across the matrix is:
(means,) = ds.map([np.mean], axis=1)
Permutation
Permute the order of the rows or columns:
ordering = np.random.permutation(np.arange(ds.shape[1]))
ds.permute(ordering, axis=1)
This permutes the order of rows or columns in the file, without loading the entire file in RAM. The ordering argument should be a numpy array of ds.shape[axis] elements, in the desired order.
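The in-memory analogue of a permutation can be sketched with NumPy alone. This is an illustration rather than loompy's implementation: it assumes the ordering lists, for each new position, the index of the element to place there, and it uses small stand-in arrays instead of a file:

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)            # stand-in for the main matrix
cell_ids = np.array(["c0", "c1", "c2", "c3"])   # stand-in for a column attribute

rng = np.random.default_rng(0)
ordering = rng.permutation(matrix.shape[1])

# A valid ordering contains every index along the axis exactly once
assert np.array_equal(np.sort(ordering), np.arange(matrix.shape[1]))

# Column j of the result is column ordering[j] of the input;
# attributes are reordered the same way so that they stay aligned
permuted = matrix[:, ordering]
permuted_ids = cell_ids[ordering]
```

Checking the ordering before applying it guards against silently duplicating or dropping columns.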
Scan
For very large loom files, it’s very useful to scan across the file (along either rows or columns) in batches, to avoid loading the entire file in memory. This can be achieved using the scan() method:
for (ix, selection, view) in ds.scan(axis=1):
    ...  # do something with each view
Inside the loop, you get access to the current view into the file. It has all the attributes, graphs and data of the original loom file, but only for the columns included in selection (or rows, if axis=0).
In essence, you get a succession of slices through the loom file, corresponding to bands of columns (rows). The ix variable tells you the starting column of the band, whereas the selection gives you the list of columns contained in the current view.
You can also scan across a selected subset of the columns or rows. For example:
cells = ...  # placeholder: list or array of the column indices you want to see
for (ix, selection, view) in ds.scan(items=cells, axis=1):
    ...  # do something with each view
This works exactly the same, except that each selection and view now include only the columns you asked for.
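The meaning of the (ix, selection, view) triple can be sketched without a loom file. scan_sketch below is a hypothetical stand-in written for this illustration, not loompy's implementation: it walks over bands of columns of an in-memory array, and its view is a plain ndarray, whereas a real view also carries attributes, layers and graphs:

```python
import numpy as np

def scan_sketch(matrix, batch_size, items=None):
    """Yield (ix, selection, view) over bands of columns of an in-memory matrix."""
    n_cols = matrix.shape[1]
    items = np.arange(n_cols) if items is None else np.sort(np.asarray(items))
    for ix in range(0, n_cols, batch_size):
        # Requested columns that fall inside the current band
        selection = items[(items >= ix) & (items < ix + batch_size)]
        if selection.size == 0:
            continue  # nothing requested in this band
        yield ix, selection, matrix[:, selection]

matrix = np.arange(30, dtype=float).reshape(3, 10)

# Accumulate per-row sums band by band instead of using the whole matrix at once
totals = np.zeros(matrix.shape[0])
for ix, selection, view in scan_sketch(matrix, batch_size=4):
    totals += view.sum(axis=1)

# Restricting the scan to a subset of columns
bands = [(ix, sel.tolist()) for ix, sel, _ in scan_sketch(matrix, 4, items=[1, 2, 9])]
print(bands)  # [(0, [1, 2]), (8, [9])]
```

The accumulation pattern in the first loop is the typical reason to scan: each batch contributes a partial result, so the full matrix never needs to be processed in one piece.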