Module shoji.dimension
Dimensions represent named, shared tensor axes. When two tensors share an axis, they are constrained to have the same number of elements along that axis.
Overview
Dimensions must be defined in the Workspace before they can be used:
db = shoji.connect()
db.scRNA = shoji.Workspace()
db.scRNA.cells = shoji.Dimension(shape=None)
db.scRNA.genes = shoji.Dimension(shape=5000)
Once the dimensions have been declared, tensors can use those dimensions:
db.scRNA.Expression = shoji.Tensor("int16", ("cells", "genes"))
db.scRNA.CellType = shoji.Tensor("string", ("cells",))
db.scRNA.Length = shoji.Tensor("uint16", ("genes",))
db.scRNA.Chromosome = shoji.Tensor("string", ("genes",))
Adding data along a dimension
In order to ensure that the dimension constraints are always fulfilled, data must be added in parallel to all tensors that share a dimension, using the append() method on the dimension. For example, to add cells to a project with a cells dimension:
db.scRNA.cells.append({
"CellType": np.array(["Neuron", "Astrocyte", "Neuron", "Miroglia"], dtype=object),
"Expression": np.random.randint(0, 10, size=(4, 5000), dtype="uint16")
})
Note that if you leave out one of the tensors (of the ones that have "cells" as one of their dimensions), or supply data of inconsistent shape, the append method will raise an exception. Appending data using this method is guaranteed never to fail with a partial row written (although it may not complete all of the rows successfully), and will never leave the database in an inconsistent state (e.g. with data appended to only one of the tensors). If you need a stronger guarantee of success/failure, wrap the append() in a Transaction (see the sketch below).
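The exact Transaction API is not documented in this module, so the following is only a hedged sketch: it assumes shoji.Transaction (i.e. shoji.transaction.Transaction) can be used as a context manager around workspace operations. Check the Transaction documentation for the actual usage.
```python
import numpy as np
import shoji

db = shoji.connect()

# Hypothetical usage: the context-manager form below is an assumption,
# not confirmed by this module's documentation.
with shoji.Transaction(db.scRNA):
    db.scRNA.cells.append({
        "CellType": np.array(["Neuron", "Astrocyte"], dtype=object),
        "Expression": np.random.randint(0, 10, size=(2, 5000), dtype="int16")
    })
```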
Grouping
Calling .groupby(labels) on a dimension creates a GroupDimensionBy object. The labels should be the name of a tensor which is used to group the dimension. Calling an aggregation function such as mean() then returns a grouped tensor.
For example, suppose you have an Expression tensor with dimensions ("genes", "cells") and a ClusterID tensor with dimension ("cells",). Suppose there are 10 distinct values of ClusterID. The following code will return an np.ndarray of shape ("genes", 10) containing mean values for each ClusterID:
grouped = ws.cells.groupby("ClusterID")
(labels, values) = grouped.mean("Expression")
# labels is a list of distinct ClusterID values
# values is a np.ndarray where the "cells" dimension is replaced by the distinct cluster IDs
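For intuition, the aggregation above is equivalent to the following plain-NumPy computation (this sketch does not use the shoji API; the array names and sizes are made up for illustration):
```python
import numpy as np

n_genes, n_cells = 100, 1000
expression = np.random.randint(0, 10, size=(n_genes, n_cells))  # dims ("genes", "cells")
cluster_id = np.random.randint(0, 10, size=n_cells)             # dims ("cells",)

labels = np.unique(cluster_id)  # the distinct ClusterID values
values = np.column_stack(
    [expression[:, cluster_id == c].mean(axis=1) for c in labels]
)
# values has shape (n_genes, 10): one column of per-gene means for each cluster
```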
Expand source code
"""
Dimensions represent named, shared tensor axes. When two tensors share an axis, they are
constrained to have the same number of elements along that axis.
.. image:: assets/bitmap/tensor_dims@2x.png
## Overview
Dimensions must be defined in the `shoji.workspace.Workspace` before they can be used:
```python
db = shoji.connect()
db.scRNA = shoji.Workspace()
db.scRNA.cells = shoji.Dimension(shape=None)
db.scRNA.genes = shoji.Dimension(shape=5000)
```
Once the dimensions have been declared, tensors can use those dimensions:
```python
db.scRNA.Expression = shoji.Tensor("int16", ("cells", "genes"))
db.scRNA.CellType = shoji.Tensor("string", ("cells",))
db.scRNA.Length = shoji.Tensor("uint16", ("genes",))
db.scRNA.Chromosome = shoji.Tensor("string", ("genes",))
```
## Adding data along a dimension
In order to ensure that the dimension constraints are always fulfilled, data must be added
in parallel to all tensors that share a dimension, using the `append()` method on the
dimension. For example, to add cells to a project with a `cells` dimension:
```python
db.scRNA.cells.append({
"CellType": np.array(["Neuron", "Astrocyte", "Neuron", "Miroglia"], dtype=object),
"Expression": np.random.randint(0, 10, size=(4, 5000), dtype="uint16")
})
```
Note that if you leave out one of the tensors (of the ones that have `"cells"` as one of their
dimensions), or supply data of inconsistent shape, the append method will raise an exception.
Appending data using this method is guaranteed to never fail with a partial row written (but may not
complete all the rows successfully), and will never leave the database in an inconsistent state
(e.g. with data appended to only one of the tensors). If you need a stronger guarantee of success/failure,
wrap the `append()` in a `shoji.transaction.Transaction`.
## Grouping
Calling `.groupby(labels)` on a dimension creates a `shoji.groupby.GroupDimensionBy` object.
The `labels` should be the name of a tensor which is used to group the dimension.
Calling an aggregation function such as `mean()` then returns a grouped tensor.
For example, suppose you have an Expression tensor with dimensions ("genes", "cells") and
a ClusterID tensor with dimension ("cells",). Suppose there are 10 distinct values of
ClusterID. The following code will return an np.ndarray of shape ("genes", 10) containing
mean values for each ClusterID:
```python
grouped = ws.cells.groupby("ClusterID")
(labels, values) = grouped.mean("Expression")
# labels is a list of distinct ClusterID values
# values is a np.ndarray where the "cells" dimension is replaced by the distinct cluster IDs
```
"""
from typing import Optional, Dict, Union, List, Callable
import numpy as np
import shoji
import fdb
class Dimension:
	"""
	Class representing a named dimension, which can be shared by multiple `shoji.tensor.Tensor`s
	"""
	def __init__(self, shape: Optional[int], length: int = 0) -> None:
		"""
		Create a new Dimension

		Args:
			shape	An integer, or None to create a variable-length dimension
			length	(do not use when creating a new Dimension)
		"""
		if shape == -1:
			shape = None
		if shape is not None and shape < 0:
			raise ValueError("Length must be non-negative")
		self.shape = shape  # None means variable length (i.e. can append) or jagged
		self.length = length  # Actual length, will be set when dimension is read from db
		self.name = ""  # Will be set if the Dimension is read from the db
		self.wsm: Optional[shoji.WorkspaceManager] = None  # Will be set if the Dimension is read from the db

	# Support pickling
	def __getstate__(self):
		"""Return state values to be pickled."""
		return (self.shape, self.length)

	def __setstate__(self, state):
		"""Restore state from the unpickled state values."""
		self.shape, self.length = state

	def __getitem__(self, key) -> "shoji.View":
		if self.wsm is None:
			raise ValueError("Cannot filter unbound dimension")
		if isinstance(key, slice):
			return shoji.View(self.wsm, (shoji.DimensionSliceFilter(self, key),))
		if isinstance(key, (list, tuple, int)):
			key = np.array(key)
		if isinstance(key, np.ndarray):
			if np.issubdtype(key.dtype, np.bool_):
				return shoji.View(self.wsm, (shoji.DimensionBoolFilter(self, key),))
			elif np.issubdtype(key.dtype, np.int_):
				return shoji.View(self.wsm, (shoji.DimensionIndicesFilter(self, key),))
		raise IndexError(f"Invalid fancy index along dimension '{self.name}' (only slice, bool array or int array are allowed)")

	def __repr__(self) -> str:
		if self.shape is None:
			return "<Dimension of variable shape>"
		else:
			return f"<Dimension of shape {self.shape}>"

	def __len__(self) -> int:
		return self.length

	def groupby(self, labels: Union[str, np.ndarray], projection: Callable = None) -> "shoji.GroupDimensionBy":
		return shoji.GroupDimensionBy(self, labels, projection)

	def append(self, vals: Dict[str, Union[List[np.ndarray], np.ndarray]]) -> None:
		"""
		Append values to all tensors that have this as one of their dimensions

		Args:
			vals: Dict mapping tensor names (`str`) to tensor values (`np.ndarray`)

		Remarks:
			The method is transactional, i.e. it's guaranteed to either succeed or
			fail without leaving the database in an inconsistent state. If it fails,
			a smaller than expected number of rows may have been appended, but all
			tensors will have the same length along the dimension.
		"""
		assert self.wsm is not None, "Cannot append to unsaved dimension"
		assert self.shape is None, "Cannot append to fixed-size dimension"

		# Figure out the relevant axes
		axes: List[int] = []
		n_rows = -1
		for name, values in vals.items():
			assert isinstance(values, np.ndarray), f"Input values must be numpy ndarrays, but '{name}' was {type(values)}"
			assert values.ndim >= 1, f"Input values must be at least 1-dimensional, but '{name}' was scalar"
			tensor = self.wsm._get_tensor(name)
			assert self.name in tensor.dims, f"Input values were provided for '{name}', but '{self.name}' is not one of its dimensions"
			axis = tensor.dims.index(self.name)
			if n_rows == -1:
				n_rows = values.shape[axis]
			elif values.shape[axis] != n_rows:
				raise ValueError(f"Length (along first dimension) of tensors must be the same when appending, but '{name}' was length {len(values)} while other arrays were {n_rows} long")
			axes.append(axis)
		names = list(vals.keys())
		values = [shoji.TensorValue(x) for x in vals.values()]
		shoji.io.append_values_multibatch(self.wsm, names, values, tuple(axes))
Classes
class Dimension (shape: Union[int, NoneType], length: int = 0)
-
Class representing a named dimension, which can be shared by multiple Tensors.
Create a new Dimension
Args
shape: An integer, or None to create a variable-length dimension
length: (do not use when creating a new Dimension)
Methods
def append(self, vals: Dict[str, Union[List[numpy.ndarray], numpy.ndarray]]) ‑> NoneType
-
Append values to all tensors that have this as one of their dimensions
Args
vals: Dict mapping tensor names (str) to tensor values (np.ndarray)
Remarks
The method is transactional, i.e. it's guaranteed to either succeed or fail without leaving the database in an inconsistent state. If it fails, a smaller than expected number of rows may have been appended, but all tensors will have the same length along the dimension.
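A brief illustration of the consistency check described above (continuing the scRNA example): the Expression block below has 3 rows while CellType has 4, so the append fails rather than leaving the tensors with different lengths.
```python
import numpy as np

try:
    db.scRNA.cells.append({
        "CellType": np.array(["Neuron", "Astrocyte", "Neuron", "Microglia"], dtype=object),
        "Expression": np.random.randint(0, 10, size=(3, 5000), dtype="int16"),  # 3 rows vs 4
    })
except (ValueError, AssertionError) as e:
    print(e)  # lengths along the "cells" axis must agree across all appended tensors
```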
def groupby(self, labels: Union[str, numpy.ndarray], projection: Callable = None) ‑> GroupDimensionBy
-
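The source provides no docstring for groupby; as described in the Grouping section above, it returns a shoji.GroupDimensionBy object on which aggregation functions such as mean() can be called:
```python
# Group the cells dimension by the ClusterID tensor, then average Expression per cluster
grouped = db.scRNA.cells.groupby("ClusterID")
labels, values = grouped.mean("Expression")
# labels: the distinct ClusterID values
# values: per-group means, with the "cells" dimension replaced by the clusters
```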