Module shoji.tensor
All data in Shoji is stored as N-dimensional tensors. A tensor is a generalisation of scalars, vectors and matrices to N dimensions.
Tensors are defined by their rank, datatype, dimensions and shape. In addition, tensors can be jagged (i.e. some dimensions have non-uniform sizes).
Tensors can be extended along any of their dimensions (unless the dimension is declared fixed-length) by appending values.
Overview
Tensors are created like this:
import shoji
tissues = ... # Assume we have an np.ndarray of tissue names
db = shoji.connect() # Connect to the database
ws = db.scRNA # scRNA is a Workspace in the database, previously created
ws.Tissue = shoji.Tensor("string", ("cells",), inits=tissues)
The tensor is declared with a datatype "string"
, a tuple of dimensions ("cells",)
and an optional np.ndarray
of initial values.
Rank
The rank of a tensor is the number of dimensions of the tensor. A scalar value has rank 0, a vector has rank 1, and a matrix has rank 2. Higher ranks are possible; for example, a vector of 2D images would have rank 3, and a timelapse recording in several color channels would have rank 4 (timepoint, x, y, color).
Datatype
Tensors support the following datatypes:
"bool"
"uint8", "uint16", "uint32", "uint64"
"int8", "int16", "int32", "int64"
"float16", "float32", "float64"
"string"
The datatype of a tensor must always be declared; there is no default type.
When a tensor is created, any initial values provided (via the inits
argument)
must have the matching numpy datatype. The bool and numeric datatypes match 1:1 with numpy dtypes.
However, the Shoji string
datatype is a Unicode string of variable length,
which corresponds to a numpy array of string objects. That is, the corresponding
numpy datatype is not str
or "unicode"
. Instead, Shoji string tensors correspond
to numpy object
arrays whose elements are Python str
objects. You can cast a numpy
str
array to an object
array as follows:
import numpy as np
s = np.array(["dog", "cat", "apple", "orange"]) # s.dtype.kind == 'U'
t = s.astype(object) # t.dtype.kind == 'O'
# Or directly, using dtype
s = np.array(["dog", "cat", "apple", "orange"], dtype="object")
The reason for this discrepancy is that numpy str
arrays store only
fixed-length strings, whereas Shoji string
tensors store strings of variable length.
Dimensions
When creating a tensor, its dimensions must be declared using a tuple.
Scalars have rank zero, and are declared with the empty tuple ()
.
Vectors have rank one, and are declared with a single-element tuple, e.g.
(20,)
(note the comma, which is necessary). Matrices have rank 2, and are
declared with a two-element tuple, e.g. (20, 40)
. Higher-rank tensors are
declared with correspondingly longer tuples.
Dimensions can be fixed or variable-length. A fixed-length dimension is
declared with an integer specifying the number of elements of the dimension.
A variable-length dimension is declared as None
. For example, (10, None)
is a matrix with ten rows and a variable number of columns.
The meaning of a variable-length dimension is slightly different for regular
and jagged tensors. For a regular tensor, if a dimension is variable-length then
the tensor can be extended along that dimension by appending data. Thus a tensor declared
with dims=(None, 10)
at any point in time has a fixed number of
rows and columns, but rows can be appended (see shoji.dimension
and
Dimension.append()
).
If the tensor is jagged, then a variable-length dimension can contain individual rows (columns, etc) of different lengths.
Each dimension of a tensor can be named, and named dimensions (within a
Workspace
) are constrained to have the same number of elements.
For example, if two tensors are declared with dimensions ("cells",)
and ("cells", "genes")
, then the first dimensions are guaranteed to have
the same number of elements, which are assumed to be in the same order.
Named dimensions must be declared before they are used; see shoji.dimension
.
Shape
The shape
of a Tensor
is a tuple of integers that gives the current
shape of the tensor as stored in the database. For example, a tensor with dims=(None, 10, 20)
might have shape=(10, 10, 20)
, indicating that currently the tensor has ten rows. Since
the first dimension is variable-length (in this case), rows might be appended later, and the shape
would change to reflect the new number of rows.
Chunks
Data in Shoji is stored and retrieved as N-dimensional chunks. When you read or write from a tensor, your operations are converted to operations on chunks. For example, if you access a single element of a matrix, under the hood the whole chunk containing the element is retrieved.
When you create a tensor, you can optionally specify the chunk size along each dimension. Chunking is very important for performance. Small chunks such as (10,100) or even (1, 100) can be an order of magnitude faster for random access, but an order of magnitude slower for contiguous access, as compared to large chunks like (100, 1000) or (1000, 1000). If you know that you will only read in large contiguous blocks, use large chunks along those dimensions. If you know you will be reading many randomly placed single or few indices, use small chunks along those dimensions.
Reading from tensors
The universal method for reading data in shoji is to create a View
of the workspace. However, sometimes you just want to read from one tensor
and don't care about creating a view. Shoji supports indexing tensors similar
to numpy "fancy indexing" (and similar to how views are created):
x = ws.Expression[:] # Read the whole tensor
y = ws.Expression[10:20] # Read a slice
z = ws.Expression[(1, 2, 5, 9)] # Read specific rows
w = ws.Expression[(True, False, True)] # Read rows given by bool mask array
The above expressions are just shorthands for creating the corresponding view and immediately reading from the tensor. There is no difference in performance. For example, these two expressions below are equivalent:
x = ws.Expression[:]
x = ws[:].Expression
Jagged tensors
If a tensor is declared jagged, the size along variable-length dimensions can be different for different rows (columns, etc.). For example:
ws.cells = shoji.Dimension(shape=None)
ws.Image = shoji.Tensor("uint16", ("cells", None, None), jagged=True)
In this example, we declare a 3D jagged tensor Image
, where dimensions 2 and 3
are variable-length. This could be used to store 2D images of cells, each of which
has a different width and height. The first dimension represents the objects
(individual cells) and the 2nd and 3rd dimensions represent the images. Accessing
a single row of this tensor would return a single 2D image matrix. Accessing a set
of rows would return a list of 2D images.
In a similar way, we could store multichannel timelapse images of cells:
ws.cells = shoji.Dimension(shape=None)
ws.channels = shoji.Dimension(shape=3)
ws.timepoints = shoji.Dimension(shape=1200) # 1200 timepoints
ws.Image = shoji.Tensor("uint16", ("cells", "channels", "timepoints", None, None), jagged=True)
In this examples, Image
is a 5-dimensional tensor, where the last two dimensions
have variable length.
Expand source code
"""
All data in Shoji is stored as N-dimensional tensors. A tensor is a
generalisation of scalars, vectors and matrices to N dimensions.
Tensors are defined by their *rank*, *datatype*, *dimensions* and *shape*.
In addition, tensors can be *jagged* (i.e. some dimensions have non-uniform
sizes).
Tensors can be extended along any of their dimensions (unless the
dimension is declared fixed-length) by appending values.
## Overview
Tensors are created like this:
```python
import shoji
tissues = ... # Assume we have an np.ndarray of tissue names
db = shoji.connect() # Connect to the database
ws = db.scRNA # scRNA is a Workspace in the database, previously created
ws.Tissue = shoji.Tensor("string", ("cells",), inits=tissues)
```
The tensor is declared with a datatype `"string"`, a tuple of dimensions `("cells",)` and an optional `np.ndarray` of initial values.
### Rank
The *rank* of a tensor is the number of dimensions of the tensor. A scalar
value has rank 0, a vector has rank 1, and a matrix has rank 2. Higher ranks
are possible; for example, a vector of 2D images would have rank 3, and a
timelapse recording in several color channels would have rank 4 (timepoint, x, y,
color).
### Datatype
Tensors support the following datatypes:
```python
"bool"
"uint8", "uint16", "uint32", "uint64"
"int8", "int16", "int32", "int64"
"float16", "float32", "float64"
"string"
```
The datatype of a tensor must always be declared; there is no default type.
When a tensor is created, any initial values provided (via the `inits` argument)
must have the matching numpy datatype. The bool and numeric datatypes match 1:1 with numpy dtypes.
However, the Shoji `string` datatype is a Unicode string of variable length,
which corresponds to a numpy array of string objects. That is, the corresponding
numpy datatype is *not* `str` or `"unicode"`. Instead, Shoji string tensors correspond
to numpy `object` arrays whose elements are Python `str` objects. You can cast a numpy
`str` array to an `object` array as follows:
```python
import numpy as np
s = np.array(["dog", "cat", "apple", "orange"]) # s.dtype.kind == 'U'
t = s.astype(object) # t.dtype.kind == 'O'
# Or directly, using dtype
s = np.array(["dog", "cat", "apple", "orange"], dtype="object")
```
The reason for this discrepancy is that numpy `str` arrays store only
fixed-length strings, whereas Shoji `string` tensors store strings of variable length.
### Dimensions
When creating a tensor, its dimensions must be declared using a tuple.
Scalars have rank zero, and are declared with the empty tuple `()`.
Vectors have rank one, and are declared with a single-element tuple, e.g.
`(20,)` (note the comma, which is necessary). Matrices have rank 2, and are
declared with a two-element tuple, e.g. `(20, 40)`. Higher-rank tensors are
declared with correspondingly longer tuples.
Dimensions can be fixed or variable-length. A fixed-length dimension is
declared with an integer specifying the number of elements of the dimension.
A variable-length dimension is declared as `None`. For example, `(10, None)`
is a matrix with ten rows and a variable number of columns.
The meaning of a variable-length dimension is slightly different for regular
and jagged tensors. For a regular tensor, if a dimension is variable-length then
the tensor can be extended along that dimension by appending data. Thus a tensor declared
with `dims=(None, 10)` at any point in time has a fixed number of
rows and columns, but rows can be appended (see `shoji.dimension` and
`shoji.dimension.Dimension.append`).
If the tensor is jagged, then a variable-length dimension can contain
individual rows (columns, etc) of different lengths.
Each dimension of a tensor can be named, and named dimensions (within a
`shoji.workspace.Workspace`) are constrained to have the same number of elements.
For example, if two tensors are declared with dimensions `("cells",)`
and `("cells", "genes")`, then the first dimensions are guaranteed to have
the same number of elements, which are assumed to be in the same order.
Named dimensions must be declared before they are used; see `shoji.dimension`.
### Shape
The `shape` of a `shoji.tensor.Tensor` is a tuple of integers that gives the current
shape of the tensor as stored in the database. For example, a tensor with `dims=(None, 10, 20)`
might have `shape=(10, 10, 20)`, indicating that currently the tensor has ten rows. Since
the first dimension is variable-length (in this case), rows might be appended later, and the shape
would change to reflect the new number of rows.
## Chunks
Data in Shoji is stored and retrieved as N-dimensional chunks. When you
read or write from a tensor, your operations are converted to operations
on chunks. For example, if you access a single element of a matrix, under
the hood the whole chunk containing the element is retrieved.
When you create a tensor, you can optionally specify the chunk size along
each dimension. Chunking is **very** important for performance. Small chunks
such as (10,100) or even (1, 100) can be an order of magnitude
faster for random access, but an order of magnitude slower for contiguous
access, as compared to large chunks like (100, 1000) or (1000, 1000). If
you know that you will only read in large contiguous blocks, use large chunks
along those dimensions. If you know you will be reading many randomly
placed single or few indices, use small chunks along those dimensions.
## Reading from tensors
The universal method for reading data in shoji is to create a `shoji.view.View`
of the workspace. However, sometimes you just want to read from one tensor
and don't care about creating a view. Shoji supports indexing tensors similar
to numpy "fancy indexing" (and similar to how views are created):
```python
x = ws.Expression[:] # Read the whole tensor
y = ws.Expression[10:20] # Read a slice
z = ws.Expression[(1, 2, 5, 9)] # Read specific rows
w = ws.Expression[(True, False, True)] # Read rows given by bool mask array
```
The above expressions are just shorthands for creating the corresponding view
and immediately reading from the tensor. There is no difference in performance.
For example, these two expressions below are equivalent:
```python
x = ws.Expression[:]
x = ws[:].Expression
```
## Jagged tensors
If a tensor is declared *jagged*, the size along variable-length dimensions
can be different for different rows (columns, etc.). For example:
```python
ws.cells = shoji.Dimension(shape=None)
ws.Image = shoji.Tensor("uint16", ("cells", None, None), jagged=True)
```
In this example, we declare a 3D jagged tensor `Image`, where dimensions 2 and 3
are variable-length. This could be used to store 2D images of cells, each of which
has a different width and height. The first dimension represents the objects
(individual cells) and the 2nd and 3rd dimensions represent the images. Accessing
a single row of this tensor would return a single 2D image matrix. Accessing a set
of rows would return a list of 2D images.
In a similar way, we could store multichannel timelapse images of cells:
```python
ws.cells = shoji.Dimension(shape=None)
ws.channels = shoji.Dimension(shape=3)
ws.timepoints = shoji.Dimension(shape=1200) # 1200 timepoints
ws.Image = shoji.Tensor("uint16", ("cells", "channels", "timepoints", None, None), jagged=True)
```
In this examples, `Image` is a 5-dimensional tensor, where the last two dimensions
have variable length.
"""
from typing import Tuple, Union, List, Optional, Callable
try:
from typing import Literal
except ImportError:
from typing_extensions import Literal
import numpy as np
import shoji
import sys
import logging
FancyIndexElement = Union["shoji.Filter", slice, int, np.ndarray]
FancyIndex = Union[FancyIndexElement, Tuple[FancyIndexElement, ...]]
class TensorValue:
def __init__(self, values: Union[Tuple[np.ndarray], List[np.ndarray], np.ndarray]) -> None:
self.values = values
if isinstance(values, (list, tuple)):
self.jagged = True
self.rank = values[0].ndim + 1
self.dtype = values[0].dtype.name
if self.dtype == "object":
self.dtype = "string"
shape = np.array(values[0].shape)
for i, array in enumerate(values):
if not isinstance(array, np.ndarray):
raise ValueError("Rows of jagged tensor must be numpy ndarrays")
if self.rank != array.ndim + 1:
raise ValueError("Rows of jagged tensor cannot be mixed rank")
if self.dtype != array.dtype:
raise ValueError("Rows of jagged tensor cannot be mixed dtype")
if self.dtype == "string":
if not all([isinstance(x, str) for x in array.flat]):
raise TypeError("string tensors (numpy dtype='object') must contain only string elements")
if array.ndim != len(shape):
raise ValueError(f"Rank mismatch: shape {array.shape} of subarray at row {i} is not the same rank as shape {shape} at row 0")
shape = np.maximum(shape, array.shape)
self.shape = tuple([len(values)] + list(shape))
else:
self.jagged = False
self.rank = values.ndim
self.dtype = values.dtype.name
if self.dtype == "object":
self.dtype = "string"
if not all([isinstance(x, str) for x in values.flat]):
raise TypeError("string tensors (numpy dtype='object') must contain only string elements")
self.shape = values.shape
if self.dtype not in Tensor.valid_types:
raise TypeError(f"Invalid dtype '{self.dtype}' for tensor value")
@property
def size(self) -> int:
return np.prod(self.shape)
def __len__(self) -> int:
if self.rank > 0:
return self.shape[0]
return 1
def __iter__(self):
for row in self.values:
yield row
def __getitem__(self, slice_) -> "TensorValue":
if self.jagged:
if isinstance(slice_, slice):
slice_ = (slice_)
slice_ = slice_ + (slice(None),) * (self.rank - len(slice_))
sliced = [vals[slice_[1:]] for vals in self.values[slice_[0]]]
return TensorValue(sliced)
return TensorValue(self.values[slice_])
def size_in_bytes(self) -> int:
n_bytes = 0
if not self.jagged:
if self.dtype == "string":
n_bytes += sum([len(s) + 1 for s in self.values]) * 2
else:
n_bytes += self.values.size * self.values.itemsize # type: ignore
else:
for row in self.values:
if self.dtype == "string":
n_bytes += sum([len(s) + 1 for s in row]) * 2
else:
n_bytes += row.size * row.itemsize
return n_bytes
class Tensor:
valid_types = ("bool", "uint8", "uint16", "uint32", "uint64", "int8", "int16", "int32", "int64", "float16", "float32", "float64", "string")
def __init__(self, dtype: str, dims: Union[Tuple[Union[None, int, str], ...]], *, chunks: Tuple[int, ...] = None, jagged: bool = False, inits: Union[List[np.ndarray], np.ndarray] = None) -> None:
"""
Args:
dtype: string giving the datatype of the tensor elements
dims: A tuple of None, int, string (empty tuple designates a scalar)
chunks: Tuple defining the chunk size along each dimension, or "auto" to use automatic chunking
jagged: If true, this is a jagged tensor (and inits must be a list of ndarrays)
inits: Optional values to initialize the tensor with
Remarks:
Dimensions are specified as:
None: resizable/jagged anonymous dimension
int: fixed-shape anonymous dimension
string: named dimension
Chunking is VERY important for performance. Small chunks such as (10,100) or even (1, 100) can be an order of magnitude
faster for random access, but an order of magnitude slower for contiguous access, as compared to large chunks
like (100, 1000) or (1000, 1000). If you know that you will only read in large contiguous blocks, use large chunks
along those dimensions. If you know you will be reading many randomly placed single or few indices, use small chunks
along those dimensions.
For rank-0 tensors, use chunks=(1,)
"""
self.dtype = dtype
# Check that the type is valid
if dtype not in Tensor.valid_types:
raise TypeError(f"Invalid Tensor type {dtype}")
self.dims = dims
self.jagged = jagged
self.name = "" # Will be set if the Tensor is read from the db
self.wsm: Optional[shoji.WorkspaceManager] = None # Will be set if the Tensor is read from the db
if inits is None:
self.inits: Optional[TensorValue] = None
self.shape = (0,) * len(dims)
else:
# If scalar, convert to an ndarray scalar which will have shape ()
if np.isscalar(inits):
self.inits = TensorValue(np.array(inits, dtype=self.numpy_dtype()))
else:
self.inits = TensorValue(inits)
if self.inits.jagged and not self.jagged:
raise ValueError(f"Jagged inits cannot be used to create non-jagged tensor")
self.shape = self.inits.shape
if len(self.dims) != len(self.shape):
raise ValueError(f"Rank mismatch: shape {self.dims} declared is not the same rank as shape {self.shape} of values")
if self.dtype != self.inits.dtype:
raise TypeError(f"Tensor dtype '{self.dtype}' does not match dtype of inits '{self.inits.dtype}'")
for ix, dim in enumerate(self.dims):
if dim is not None and not isinstance(dim, int) and not isinstance(dim, str):
raise ValueError(f"Dimension {ix} '{dim}' is invalid (must be None, int or str)")
if isinstance(dim, int) and self.inits is not None:
if self.shape[ix] != dim: # type: ignore
raise IndexError(f"Mismatch between the declared shape {dim} of dimension {ix} and the inferred shape {self.shape} of values")
self.chunks: Tuple[int, ...] = ()
if chunks is None:
if dtype in ("bool", "uint8", "int8"):
byte_size = 1
if dtype in ("uint16", "int16", "float16"):
byte_size = 2
elif dtype in ("uint32", "int32", "float32"):
byte_size = 4
elif dtype in ("uint64", "int64", "float64"):
byte_size = 8
elif dtype == "string":
byte_size = 32 # This will fail for very long strings
if self.rank == 0:
self.chunks = ()
elif self.rank == 1:
self.chunks = (500 // byte_size,)
else:
desired_sizes = (300 // byte_size, 100) + (1,) * (self.rank - 2)
max_sizes = (dim if isinstance(dim, int) else sys.maxsize for dim in self.dims)
self.chunks = tuple(min(a,b) for a,b in zip(max_sizes, desired_sizes))
else:
if len(chunks) != self.rank:
raise ValueError(f"chunks={chunks} is wrong number of dimensions for rank-{self.rank} tensor" + (" (use () for rank-0 tensor)" if self.rank == 0 else ""))
self.chunks = chunks
self.initializing = False
# Support pickling
def __getstate__(self):
"""Return state values to be pickled."""
return (self.dtype, self.jagged, self.dims, self.shape, self.chunks, self.initializing, 0) # The extra zero is for future use as a version flag
def __setstate__(self, state):
"""Restore state from the unpickled state values."""
self.dtype, self.jagged, self.dims, self.shape, self.chunks, self.initializing, _ = state
def __len__(self) -> int:
if self.rank > 0:
return self.shape[0]
return 0
@property
def rank(self) -> int:
return len(self.dims)
@property
def bytewidth(self) -> int:
if self.dtype in ("bool", "uint8", "int8"):
return 1
elif self.dtype in ("uint16", "int16", "float16"):
return 2
elif self.dtype in ("uint32", "int32", "float32"):
return 3
elif self.dtype in ("uint64", "int64", "float64"):
return 4
return -1
def _fancy_indexing(self, expr: FancyIndex) -> Tuple["shoji.Filter", ...]:
if isinstance(expr, tuple):
fancyindex: Tuple[FancyIndexElement, ...] = expr
else:
fancyindex = (expr,)
# Fill in missing axes with : just like numpy does
if any(isinstance(x, type(...)) for x in fancyindex): # We can't use a simple "if ... in fancyindex" because fancyindex may contain numpy arrays which complain about ==
ix = fancyindex.index(...)
fancyindex = fancyindex[:ix] + (slice(None),) * (self.rank - len(fancyindex) - 1) + fancyindex[ix + 1:]
if len(fancyindex) < self.rank:
fancyindex += (slice(None),) * (self.rank - len(fancyindex))
filters: List[shoji.Filter] = []
for axis, (dim, fi) in enumerate(zip(self.dims, fancyindex)):
# Maybe it's a Filter?
if isinstance(fi, shoji.Filter):
if isinstance(dim, str) and isinstance(fi.dim, str) and fi.dim != dim:
raise IndexError(f"Tensor dimension '{dim}' cannot be indexed uing filter expression '{fi}' with dimension '{fi.dim}'")
filters.append(fi)
elif isinstance(fi, slice):
filters.append(shoji.TensorSliceFilter(self, fi, axis))
elif isinstance(fi, (int, np.int64, np.int32)):
filters.append(shoji.TensorIndicesFilter(self, np.array(fi), axis))
elif isinstance(fi, np.ndarray):
if np.issubdtype(fi.dtype, np.bool_):
filters.append(shoji.TensorBoolFilter(self, fi, axis))
elif np.issubdtype(fi.dtype, np.int_):
filters.append(shoji.TensorIndicesFilter(self, fi, axis))
else:
raise KeyError()
return tuple(filters)
def __getitem__(self, expr: FancyIndex) -> np.ndarray:
assert self.wsm is not None, "Tensor is not bound to a database"
return shoji.View(self.wsm, self._fancy_indexing(expr))[self.name]
def __setitem__(self, expr: FancyIndex, vals: np.ndarray) -> None:
assert self.wsm is not None, "Tensor is not bound to a database"
shoji.View(self.wsm, self._fancy_indexing(expr))[self.name] = vals
def numpy_dtype(self) -> str:
if self.dtype == "string":
return "object"
return self.dtype
def python_dtype(self) -> Callable:
if self.dtype == "string":
return str
if self.dtype == "bool":
return bool
if self.dtype in ("float16", "float32", "float64"):
return float
return int
def _compare(self, operator, other) -> "shoji.Filter":
if isinstance(other, Tensor):
return shoji.TensorFilter(operator, self, other)
elif isinstance(other, (str, int, float, bool)):
return shoji.ConstFilter(operator, self, other)
elif isinstance(other, np.integer):
return shoji.ConstFilter(operator, self, int(other))
elif isinstance(other, np.float):
return shoji.ConstFilter(operator, self, float(other))
elif isinstance(other, np.object):
return shoji.ConstFilter(operator, self, str(other))
elif isinstance(other, np.bool):
return shoji.ConstFilter(operator, self, bool(other))
else:
raise TypeError("Invalid operands for expression")
def __eq__(self, other) -> "shoji.Filter": # type: ignore
return self._compare("==", other)
def __ne__(self, other) -> "shoji.Filter": # type: ignore
return self._compare("!=", other)
def __gt__(self, other) -> "shoji.Filter": # type: ignore
return self._compare(">", other)
def __lt__(self, other) -> "shoji.Filter": # type: ignore
return self._compare("<", other)
def __ge__(self, other) -> "shoji.Filter": # type: ignore
return self._compare(">=", other)
def __le__(self, other) -> "shoji.Filter": # type: ignore
return self._compare("<=", other)
def append(self, vals: Union[List[np.ndarray], np.ndarray], axis: int = 0) -> None:
assert self.wsm is not None, "Cannot append to unbound tensor"
if self.rank == 0:
raise ValueError("Cannot append to a scalar")
tv = TensorValue(vals)
shoji.io.append_values_multibatch(self.wsm, [self.name], [tv], (axis,))
def _quick_look(self) -> str:
if self.rank == 0:
if self.dtype == "string":
s = f'"{self[:]}"'
else:
s = str(self[:])
if len(s) > 60:
return s[:56] + " ..."
return s
def look(vals) -> str:
s = "["
if not isinstance(vals, list) and vals.ndim == 1:
if self.dtype == "string":
s += ", ".join([f'"{x}"' for x in vals[:5]])
else:
s += ", ".join([str(x) for x in vals[:5]])
else:
elms = []
for val in vals[:5]:
elms.append(look(val))
s += ", ".join(elms)
if len(vals) > 5:
s += ", ...]"
else:
s += "]"
return s
s = look(self[:10])
if len(s) > 60:
return s[:56] + " ···"
return s
def __repr__(self) -> str:
return f"<Tensor {self.name} dtype='{self.dtype}' dims={self.dims}, shape={self.shape}, chunks={self.chunks}>"
Classes
class Tensor (dtype: str, dims: Tuple[Union[NoneType, int, str], ...], *, chunks: Tuple[int, ...] = None, jagged: bool = False, inits: Union[List[numpy.ndarray], numpy.ndarray] = None)
-
Args
dtype
- string giving the datatype of the tensor elements
dims
- A tuple of None, int, string (empty tuple designates a scalar)
chunks
- Tuple defining the chunk size along each dimension, or "auto" to use automatic chunking
jagged
- If true, this is a jagged tensor (and inits must be a list of ndarrays)
inits
- Optional values to initialize the tensor with
Remarks
Dimensions are specified as:
None: resizable/jagged anonymous dimension int: fixed-shape anonymous dimension string: named dimension
Chunking is VERY important for performance. Small chunks such as (10,100) or even (1, 100) can be an order of magnitude faster for random access, but an order of magnitude slower for contiguous access, as compared to large chunks like (100, 1000) or (1000, 1000). If you know that you will only read in large contiguous blocks, use large chunks along those dimensions. If you know you will be reading many randomly placed single or few indices, use small chunks along those dimensions.
For rank-0 tensors, use chunks=(1,)
Expand source code
class Tensor: valid_types = ("bool", "uint8", "uint16", "uint32", "uint64", "int8", "int16", "int32", "int64", "float16", "float32", "float64", "string") def __init__(self, dtype: str, dims: Union[Tuple[Union[None, int, str], ...]], *, chunks: Tuple[int, ...] = None, jagged: bool = False, inits: Union[List[np.ndarray], np.ndarray] = None) -> None: """ Args: dtype: string giving the datatype of the tensor elements dims: A tuple of None, int, string (empty tuple designates a scalar) chunks: Tuple defining the chunk size along each dimension, or "auto" to use automatic chunking jagged: If true, this is a jagged tensor (and inits must be a list of ndarrays) inits: Optional values to initialize the tensor with Remarks: Dimensions are specified as: None: resizable/jagged anonymous dimension int: fixed-shape anonymous dimension string: named dimension Chunking is VERY important for performance. Small chunks such as (10,100) or even (1, 100) can be an order of magnitude faster for random access, but an order of magnitude slower for contiguous access, as compared to large chunks like (100, 1000) or (1000, 1000). If you know that you will only read in large contiguous blocks, use large chunks along those dimensions. If you know you will be reading many randomly placed single or few indices, use small chunks along those dimensions. For rank-0 tensors, use chunks=(1,) """ self.dtype = dtype # Check that the type is valid if dtype not in Tensor.valid_types: raise TypeError(f"Invalid Tensor type {dtype}") self.dims = dims self.jagged = jagged self.name = "" # Will be set if the Tensor is read from the db self.wsm: Optional[shoji.WorkspaceManager] = None # Will be set if the Tensor is read from the db if inits is None: self.inits: Optional[TensorValue] = None self.shape = (0,) * len(dims) else: # If scalar, convert to an ndarray scalar which will have shape () if np.isscalar(inits): self.inits = TensorValue(np.array(inits, dtype=self.numpy_dtype())) else: self.inits = TensorValue(inits) if self.inits.jagged and not self.jagged: raise ValueError(f"Jagged inits cannot be used to create non-jagged tensor") self.shape = self.inits.shape if len(self.dims) != len(self.shape): raise ValueError(f"Rank mismatch: shape {self.dims} declared is not the same rank as shape {self.shape} of values") if self.dtype != self.inits.dtype: raise TypeError(f"Tensor dtype '{self.dtype}' does not match dtype of inits '{self.inits.dtype}'") for ix, dim in enumerate(self.dims): if dim is not None and not isinstance(dim, int) and not isinstance(dim, str): raise ValueError(f"Dimension {ix} '{dim}' is invalid (must be None, int or str)") if isinstance(dim, int) and self.inits is not None: if self.shape[ix] != dim: # type: ignore raise IndexError(f"Mismatch between the declared shape {dim} of dimension {ix} and the inferred shape {self.shape} of values") self.chunks: Tuple[int, ...] = () if chunks is None: if dtype in ("bool", "uint8", "int8"): byte_size = 1 if dtype in ("uint16", "int16", "float16"): byte_size = 2 elif dtype in ("uint32", "int32", "float32"): byte_size = 4 elif dtype in ("uint64", "int64", "float64"): byte_size = 8 elif dtype == "string": byte_size = 32 # This will fail for very long strings if self.rank == 0: self.chunks = () elif self.rank == 1: self.chunks = (500 // byte_size,) else: desired_sizes = (300 // byte_size, 100) + (1,) * (self.rank - 2) max_sizes = (dim if isinstance(dim, int) else sys.maxsize for dim in self.dims) self.chunks = tuple(min(a,b) for a,b in zip(max_sizes, desired_sizes)) else: if len(chunks) != self.rank: raise ValueError(f"chunks={chunks} is wrong number of dimensions for rank-{self.rank} tensor" + (" (use () for rank-0 tensor)" if self.rank == 0 else "")) self.chunks = chunks self.initializing = False # Support pickling def __getstate__(self): """Return state values to be pickled.""" return (self.dtype, self.jagged, self.dims, self.shape, self.chunks, self.initializing, 0) # The extra zero is for future use as a version flag def __setstate__(self, state): """Restore state from the unpickled state values.""" self.dtype, self.jagged, self.dims, self.shape, self.chunks, self.initializing, _ = state def __len__(self) -> int: if self.rank > 0: return self.shape[0] return 0 @property def rank(self) -> int: return len(self.dims) @property def bytewidth(self) -> int: if self.dtype in ("bool", "uint8", "int8"): return 1 elif self.dtype in ("uint16", "int16", "float16"): return 2 elif self.dtype in ("uint32", "int32", "float32"): return 3 elif self.dtype in ("uint64", "int64", "float64"): return 4 return -1 def _fancy_indexing(self, expr: FancyIndex) -> Tuple["shoji.Filter", ...]: if isinstance(expr, tuple): fancyindex: Tuple[FancyIndexElement, ...] = expr else: fancyindex = (expr,) # Fill in missing axes with : just like numpy does if any(isinstance(x, type(...)) for x in fancyindex): # We can't use a simple "if ... in fancyindex" because fancyindex may contain numpy arrays which complain about == ix = fancyindex.index(...) fancyindex = fancyindex[:ix] + (slice(None),) * (self.rank - len(fancyindex) - 1) + fancyindex[ix + 1:] if len(fancyindex) < self.rank: fancyindex += (slice(None),) * (self.rank - len(fancyindex)) filters: List[shoji.Filter] = [] for axis, (dim, fi) in enumerate(zip(self.dims, fancyindex)): # Maybe it's a Filter? if isinstance(fi, shoji.Filter): if isinstance(dim, str) and isinstance(fi.dim, str) and fi.dim != dim: raise IndexError(f"Tensor dimension '{dim}' cannot be indexed uing filter expression '{fi}' with dimension '{fi.dim}'") filters.append(fi) elif isinstance(fi, slice): filters.append(shoji.TensorSliceFilter(self, fi, axis)) elif isinstance(fi, (int, np.int64, np.int32)): filters.append(shoji.TensorIndicesFilter(self, np.array(fi), axis)) elif isinstance(fi, np.ndarray): if np.issubdtype(fi.dtype, np.bool_): filters.append(shoji.TensorBoolFilter(self, fi, axis)) elif np.issubdtype(fi.dtype, np.int_): filters.append(shoji.TensorIndicesFilter(self, fi, axis)) else: raise KeyError() return tuple(filters) def __getitem__(self, expr: FancyIndex) -> np.ndarray: assert self.wsm is not None, "Tensor is not bound to a database" return shoji.View(self.wsm, self._fancy_indexing(expr))[self.name] def __setitem__(self, expr: FancyIndex, vals: np.ndarray) -> None: assert self.wsm is not None, "Tensor is not bound to a database" shoji.View(self.wsm, self._fancy_indexing(expr))[self.name] = vals def numpy_dtype(self) -> str: if self.dtype == "string": return "object" return self.dtype def python_dtype(self) -> Callable: if self.dtype == "string": return str if self.dtype == "bool": return bool if self.dtype in ("float16", "float32", "float64"): return float return int def _compare(self, operator, other) -> "shoji.Filter": if isinstance(other, Tensor): return shoji.TensorFilter(operator, self, other) elif isinstance(other, (str, int, float, bool)): return shoji.ConstFilter(operator, self, other) elif isinstance(other, np.integer): return shoji.ConstFilter(operator, self, int(other)) elif isinstance(other, np.float): return shoji.ConstFilter(operator, self, float(other)) elif isinstance(other, np.object): return shoji.ConstFilter(operator, self, str(other)) elif isinstance(other, np.bool): return shoji.ConstFilter(operator, self, bool(other)) else: raise TypeError("Invalid operands for expression") def __eq__(self, other) -> "shoji.Filter": # type: ignore return self._compare("==", other) def __ne__(self, other) -> "shoji.Filter": # type: ignore return self._compare("!=", other) def __gt__(self, other) -> "shoji.Filter": # type: ignore return self._compare(">", other) def __lt__(self, other) -> "shoji.Filter": # type: ignore return self._compare("<", other) def __ge__(self, other) -> "shoji.Filter": # type: ignore return self._compare(">=", other) def __le__(self, other) -> "shoji.Filter": # type: ignore return self._compare("<=", other) def append(self, vals: Union[List[np.ndarray], np.ndarray], axis: int = 0) -> None: assert self.wsm is not None, "Cannot append to unbound tensor" if self.rank == 0: raise ValueError("Cannot append to a scalar") tv = TensorValue(vals) shoji.io.append_values_multibatch(self.wsm, [self.name], [tv], (axis,)) def _quick_look(self) -> str: if self.rank == 0: if self.dtype == "string": s = f'"{self[:]}"' else: s = str(self[:]) if len(s) > 60: return s[:56] + " ..." return s def look(vals) -> str: s = "[" if not isinstance(vals, list) and vals.ndim == 1: if self.dtype == "string": s += ", ".join([f'"{x}"' for x in vals[:5]]) else: s += ", ".join([str(x) for x in vals[:5]]) else: elms = [] for val in vals[:5]: elms.append(look(val)) s += ", ".join(elms) if len(vals) > 5: s += ", ...]" else: s += "]" return s s = look(self[:10]) if len(s) > 60: return s[:56] + " ···" return s def __repr__(self) -> str: return f"<Tensor {self.name} dtype='{self.dtype}' dims={self.dims}, shape={self.shape}, chunks={self.chunks}>"
Class variables
var valid_types
Instance variables
var bytewidth : int
-
Expand source code
@property def bytewidth(self) -> int: if self.dtype in ("bool", "uint8", "int8"): return 1 elif self.dtype in ("uint16", "int16", "float16"): return 2 elif self.dtype in ("uint32", "int32", "float32"): return 3 elif self.dtype in ("uint64", "int64", "float64"): return 4 return -1
var rank : int
-
Expand source code
@property def rank(self) -> int: return len(self.dims)
Methods
def append(self, vals: Union[List[numpy.ndarray], numpy.ndarray], axis: int = 0) ‑> NoneType
-
Expand source code
def append(self, vals: Union[List[np.ndarray], np.ndarray], axis: int = 0) -> None: assert self.wsm is not None, "Cannot append to unbound tensor" if self.rank == 0: raise ValueError("Cannot append to a scalar") tv = TensorValue(vals) shoji.io.append_values_multibatch(self.wsm, [self.name], [tv], (axis,))
def numpy_dtype(self) ‑> str
-
Expand source code
def numpy_dtype(self) -> str: if self.dtype == "string": return "object" return self.dtype
def python_dtype(self) ‑> Callable
-
Expand source code
def python_dtype(self) -> Callable: if self.dtype == "string": return str if self.dtype == "bool": return bool if self.dtype in ("float16", "float32", "float64"): return float return int
class TensorValue (values: Union[Tuple[numpy.ndarray], List[numpy.ndarray], numpy.ndarray])
-
Expand source code
class TensorValue: def __init__(self, values: Union[Tuple[np.ndarray], List[np.ndarray], np.ndarray]) -> None: self.values = values if isinstance(values, (list, tuple)): self.jagged = True self.rank = values[0].ndim + 1 self.dtype = values[0].dtype.name if self.dtype == "object": self.dtype = "string" shape = np.array(values[0].shape) for i, array in enumerate(values): if not isinstance(array, np.ndarray): raise ValueError("Rows of jagged tensor must be numpy ndarrays") if self.rank != array.ndim + 1: raise ValueError("Rows of jagged tensor cannot be mixed rank") if self.dtype != array.dtype: raise ValueError("Rows of jagged tensor cannot be mixed dtype") if self.dtype == "string": if not all([isinstance(x, str) for x in array.flat]): raise TypeError("string tensors (numpy dtype='object') must contain only string elements") if array.ndim != len(shape): raise ValueError(f"Rank mismatch: shape {array.shape} of subarray at row {i} is not the same rank as shape {shape} at row 0") shape = np.maximum(shape, array.shape) self.shape = tuple([len(values)] + list(shape)) else: self.jagged = False self.rank = values.ndim self.dtype = values.dtype.name if self.dtype == "object": self.dtype = "string" if not all([isinstance(x, str) for x in values.flat]): raise TypeError("string tensors (numpy dtype='object') must contain only string elements") self.shape = values.shape if self.dtype not in Tensor.valid_types: raise TypeError(f"Invalid dtype '{self.dtype}' for tensor value") @property def size(self) -> int: return np.prod(self.shape) def __len__(self) -> int: if self.rank > 0: return self.shape[0] return 1 def __iter__(self): for row in self.values: yield row def __getitem__(self, slice_) -> "TensorValue": if self.jagged: if isinstance(slice_, slice): slice_ = (slice_) slice_ = slice_ + (slice(None),) * (self.rank - len(slice_)) sliced = [vals[slice_[1:]] for vals in self.values[slice_[0]]] return TensorValue(sliced) return TensorValue(self.values[slice_]) def size_in_bytes(self) -> int: n_bytes = 0 if not self.jagged: if self.dtype == "string": n_bytes += sum([len(s) + 1 for s in self.values]) * 2 else: n_bytes += self.values.size * self.values.itemsize # type: ignore else: for row in self.values: if self.dtype == "string": n_bytes += sum([len(s) + 1 for s in row]) * 2 else: n_bytes += row.size * row.itemsize return n_bytes
Instance variables
var size : int
-
Expand source code
@property def size(self) -> int: return np.prod(self.shape)
Methods
def size_in_bytes(self) ‑> int
-
Expand source code
def size_in_bytes(self) -> int: n_bytes = 0 if not self.jagged: if self.dtype == "string": n_bytes += sum([len(s) + 1 for s in self.values]) * 2 else: n_bytes += self.values.size * self.values.itemsize # type: ignore else: for row in self.values: if self.dtype == "string": n_bytes += sum([len(s) + 1 for s in row]) * 2 else: n_bytes += row.size * row.itemsize return n_bytes