Package shoji
Shoji is a tensor database, suitable for storing and working with very large-scale datasets organized as vectors, matrices and higher-dimensional tensors.
Key features
- Multi-petabyte scalable, distributed, high-performance database
- Data modelled as N-dimensional tensors with boolean, string or numeric elements
- Supports both regular and jagged tensors
- Automatic chunking and compression
- Relationships expressed through shared named dimensions
- Read and write data through views created by powerful filter expressions
- Automatic indexing for fast filtering
- Data safety through transactions and ACID properties (atomicity, consistency, isolation, durability)
- Concurrent read/write access
- Elegant, convenient Python API, aligned with numpy
Oh, and it's pretty fast.
Overview
Data model
In Shoji, data is stored as tensors, and relationships are expressed using shared dimensions.
Dimensions can be named, and named dimensions express relationships and constraints between tensors. Tensors that share a named dimension must have the same length along that dimension (and this relationship is enforced when adding data).
You can think of rows as your data objects, dimensions as object types, and the tensors as object
attributes. For example, a set of vectors (e.g. SampleID, Age, Tissue, Date) defined on a
samples dimension could be seen as the attributes of samples, and an individual sample would correspond
to an individual row across all tensors.
Tensors can also be related to multiple named dimensions. For example, omics data (e.g. gene expression)
is often represented as matrices, which can be represented in Shoji as rank-2 tensors with two named
dimensions, e.g. cells and genes. Metadata about cells and genes would be stored as rank-1 tensors
(vectors) along the cells and genes dimensions, respectively. Similarly, multichannel timelapse
image data can be represented as high-rank tensors with dimensions
such as x, y, channel, and timepoint. This makes Shoji fundamentally
different from tabular (relational) databases, which struggle to represent multidimensional data.
The fundamental operations in shoji are: creating a tensor, appending values, reading values, updating values. Tensors can be deleted, but individual tensor rows cannot.
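To make the shared-dimension constraint concrete, here is a minimal sketch in plain Python. It is not the Shoji API: the helper names `shape_of` and `check_dimensions` are hypothetical, and tensors are modelled as nested lists. It only illustrates the rule that tensors sharing a named dimension must agree on its length.

```python
# Illustrative only: plain Python, not the Shoji API.
from typing import Dict, Tuple


def shape_of(values) -> Tuple[int, ...]:
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


def check_dimensions(tensors: Dict[str, Tuple[Tuple[str, ...], list]]) -> Dict[str, int]:
    """Check that tensors sharing a named dimension agree on its length.

    `tensors` maps a tensor name to (dimension names, nested-list values).
    Returns the length of each named dimension, or raises ValueError.
    """
    lengths: Dict[str, int] = {}
    for name, (dims, values) in tensors.items():
        shape = shape_of(values)
        if len(shape) != len(dims):
            raise ValueError(f"{name}: rank {len(shape)} does not match dims {dims}")
        for dim, n in zip(dims, shape):
            # The first tensor seen on a dimension fixes its length.
            if lengths.setdefault(dim, n) != n:
                raise ValueError(f"{name}: length {n} along '{dim}', expected {lengths[dim]}")
    return lengths


workspace = {
    "Age": (("cells",), [31, 12, 48]),
    "GeneLength": (("genes",), [900, 2400]),
    "Expression": (("cells", "genes"), [[0, 3], [5, 1], [2, 2]]),
}
dimension_lengths = check_dimensions(workspace)
print(dimension_lengths)  # {'cells': 3, 'genes': 2}
```

Adding a fourth tensor on cells with only two values would raise a ValueError here, which mirrors the kind of mismatch the database refuses when data is added.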
ACID guarantees
Shoji treats the slice as the atomic unit when writing data. This means that if your program crashes in the middle of an operation, you are guaranteed that there will be no half-created rows, or partially updated elements in the database.
When several tensors share their first dimension, the atomic unit for writing new data (i.e. for
Dimension.append()) is a slice across all tensors that share that first dimension.
In other words, if your program crashes in the middle of an append() operation, shoji guarantees
that some number of complete indices (or nothing at all) will have been written across all the relevant tensors,
ensuring that they stay in sync.
If you need stronger guarantees, you can wrap multiple database operations in a shoji.transaction.
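The all-or-nothing behaviour of appends can be pictured with a small in-memory model. This is a sketch under stated assumptions, not Shoji's implementation: the function `append_rows` is hypothetical, tensors are plain lists, and the real database relies on FoundationDB transactions (which also cover crashes) rather than on validation in a single process.

```python
# Illustrative only: an in-memory model of the guarantee, not Shoji's implementation.
from typing import Dict, List


def append_rows(tensors: Dict[str, List], new_values: Dict[str, List]) -> int:
    """Append the same number of rows to every tensor, or change nothing at all."""
    # Validate everything before touching any tensor.
    if set(new_values) != set(tensors):
        raise ValueError("values must be supplied for every tensor on the dimension")
    counts = {len(rows) for rows in new_values.values()}
    if len(counts) != 1:
        raise ValueError("all tensors must receive the same number of rows")
    # Commit: only reached when the whole slice is known to be valid.
    for name, rows in new_values.items():
        tensors[name].extend(rows)
    return counts.pop()


cells = {"Age": [31, 12], "Tissue": ["cortex", "liver"]}
append_rows(cells, {"Age": [48], "Tissue": ["heart"]})  # accepted as a whole

try:
    append_rows(cells, {"Age": [7, 9], "Tissue": ["lung"]})  # rejected as a whole
except ValueError as err:
    print("rejected:", err)

print(cells)  # {'Age': [31, 12, 48], 'Tissue': ['cortex', 'liver', 'heart']}
```

After the rejected call, both lists still have three entries, so the tensors on the shared dimension stay in sync.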
Limitations
Shoji is built on FoundationDB, a powerful open-source key-value store developed by Apple. It is FoundationDB that gives Shoji a solid foundation of performance, scalability and ACID guarantees. These features come with a few limitations:
- Transactions cannot exceed 5 seconds. If a transaction takes longer, it's terminated and rolled back. For Shoji, this limits the total feasible size of a slice (or a set of rows for append operations), since Shoji reads and writes slices transactionally.
- Transactions exceeding 1 MB can cause performance issues, and transactions cannot exceed 10 MB. This also limits the total feasible size of a tensor slice, since Shoji reads and writes slices transactionally to ensure consistency.
- FoundationDB is optimized to run on SSDs. Running on mechanical disks is discouraged.
For more details about these and some other limitations, see the FoundationDB docs.
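To get a feel for what the size limits mean in practice, the following back-of-the-envelope sketch counts how many rows of raw data fit in a transaction. The helper `rows_per_transaction` is hypothetical, "MB" is assumed to mean 10**6 bytes, and the estimate ignores key overhead, compression and Shoji's automatic chunking, so treat the numbers as rough upper bounds rather than tuning advice.

```python
# Back-of-the-envelope only: counts raw element bytes and ignores key overhead,
# compression and chunking. "MB" is taken as 10**6 bytes (an assumption).

def rows_per_transaction(row_shape, itemsize, budget_bytes=1_000_000):
    """Return how many rows of raw data fit within budget_bytes."""
    row_bytes = itemsize
    for n in row_shape:
        row_bytes *= n
    if row_bytes <= 0:
        raise ValueError("row_shape and itemsize must be positive")
    return budget_bytes // row_bytes


# One row of 5000 int16 values (2 bytes each) is 10,000 bytes.
comfortable = rows_per_transaction((5000,), itemsize=2)  # 1 MB guideline
hard_limit = rows_per_transaction((5000,), itemsize=2, budget_bytes=10_000_000)  # 10 MB limit
print(comfortable, hard_limit)  # 100 1000
```

For a matrix with 5000 int16 columns, this suggests staying around a hundred rows per transaction for good performance, and never exceeding roughly a thousand.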
Getting started
Installation
Shoji requires Python 3.7+ (we recommend Anaconda).
First, in your terminal, install the shoji Python package:
$ git clone https://github.com/linnarsson-lab/shoji.git
$ pip install -e shoji
Check that you can now connect to the database:
import shoji
db = shoji.connect()
db
Typing db alone at the last line above should return a representation of the contents of the database
(which might be empty at this point).
First steps
Let's create a workspace and fill it with some data:
import numpy as np  # needed for the random initial values below
db.scRNA = shoji.Workspace()
db.scRNA.cells = shoji.Dimension(shape=None)
db.scRNA.genes = shoji.Dimension(shape=5000)
db.scRNA.Expression = shoji.Tensor("int16", ("cells", "genes"), inits=np.random.randint(0, 10, size=(1000, 5000), dtype="int16"))
db.scRNA.Age = shoji.Tensor("uint16", ("cells",), inits=np.random.randint(0, 50, size=1000, dtype="uint16"))
db.scRNA.GeneLength = shoji.Tensor("uint16", ("genes",), inits=np.random.randint(0, 5000, size=5000, dtype="uint16"))
db.scRNA.Chromosome = shoji.Tensor("string", ("genes",), inits=np.full(5000, "chr1", dtype="object"))
Now we can query the database using filters (shoji.filter). We'll load the Expression matrix,
including only rows ("cells" dimension) where Age > 10 and columns ("genes" dimension)
where GeneLength < 1000:
ws = db.scRNA
view = ws[ws.GeneLength < 1000, ws.Age > 10]
view.Expression.shape
# Returns something like (780, 1000): roughly 78% of cells have Age > 10 and 20% of genes have GeneLength < 1000
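To see what such a two-part filter selects, here is a tiny plain-Python sketch that needs no database. It is not how Shoji evaluates filters; it only illustrates that each condition acts on its own dimension (Age on cells, GeneLength on genes) and that the result is the sub-matrix where both conditions hold. All values are made up for illustration.

```python
# Illustrative only: boolean masks over nested lists, not the Shoji API.
age = [5, 31, 12, 48, 9, 27]          # one value per cell
gene_length = [900, 2400, 150, 4100]  # one value per gene
expression = [[c * 10 + g for g in range(4)] for c in range(6)]  # 6 cells x 4 genes

keep_cells = [a > 10 for a in age]            # filter along "cells"
keep_genes = [g < 1000 for g in gene_length]  # filter along "genes"

# Keep only the rows and columns whose masks are True.
selected = [
    [value for value, keep in zip(row, keep_genes) if keep]
    for row, keep_row in zip(expression, keep_cells)
    if keep_row
]
shape = (len(selected), len(selected[0]) if selected else 0)
print(shape)  # (4, 2)
```

Four of the six cells pass Age > 10 and two of the four genes pass GeneLength < 1000, so the selected matrix has shape (4, 2).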
Learn more
Workspaces (shoji.workspace)
Workspaces let you organise collections of data that belong together. Tensors and dimensions are created in workspaces, and tensors and dimensions that live in different workspaces are unrelated to each other.
Tensors (shoji.tensor)
Tensors are N-dimensional arrays of numbers, booleans or strings. All data in shoji is stored as tensors.
Dimensions (shoji.dimension)
Dimensions define the relationship between tensors, and impose constraints that ensure your database is consistent.
Filters (shoji.filter)
Filters are expressions used to select tensor rows. Filters create views, and views are the only way to read and write data in shoji.
Views (shoji.view)
Views are windows into the database, created by applying filters. Views are the only way to read and write data in shoji.
Transactions (shoji.transaction)
Perform complex database operations atomically.
Expand source code
"""
Shoji is a tensor database, suitable for storing and working with very large-scale datasets
organized as vectors, matrices and higher-dimensional tensors.
## Key features
- Multi-petabyte scalable, distributed, high-performance database
- Data modelled as N-dimensional tensors with boolean, string or numeric elements
- Supports both regular and jagged tensors
- Automatic chunking and compression
- Relationships expressed through shared named dimensions
- Read and write data through views created by powerful filter expressions
- Automatic indexing for fast filtering
- Data safety through transactions and [ACID](https://en.wikipedia.org/wiki/ACID) properties (atomicity, consistency, isolation, durability)
- Concurrent read/write access
- Elegant, convenient Python API, aligned with numpy
Oh, and it's pretty fast.
## Overview
### Data model
In Shoji, data is stored as tensors, and relationships are expressed using shared dimensions.
Dimensions can be named, and named dimensions express relationships and constraints between tensors.
Tensors that share a named dimension must have the same length along that dimension (and this relationship
is enforced when adding data).
You can think of rows as your data *objects*, dimensions as object *types*, and the tensors as object
*attributes*. For example, a set of vectors (e.g. `SampleID`, `Age`, `Tissue`, `Date`) defined on a
`samples` dimension could be seen as the attributes of samples, and an individual sample would correspond
to an individual row across all tensors.
Tensors can also be related to multiple named dimensions. For example, omics data (e.g. gene expression)
is often represented as matrices, which can be represented in Shoji as rank-2 tensors with two named
dimensions, e.g. `cells` and `genes`. Metadata about cells and genes would be stored as rank-1 tensors
(vectors) along the `cells` and `genes` dimensions, respectively. Similarly, multichannel timelapse
image data can be represented as high-rank tensors with dimensions
such as `x`, `y`, `channel`, and `timepoint`. This makes Shoji fundamentally
different from tabular (relational) databases, which struggle to represent multidimensional data.
The fundamental operations in shoji are: *creating a tensor*, *appending values*, *reading values*,
*updating values*. Tensors can be deleted, but individual tensor rows cannot.
### ACID guarantees
Shoji treats the *slice* as the atomic unit when writing data. This means that if your program crashes in the
middle of an operation, you are guaranteed that there will be no half-created rows, or partially
updated elements in the database.
When several tensors share their first dimension, the atomic unit for writing new data (i.e. for
`shoji.dimension.Dimension.append`) is a slice across all tensors that share that first dimension.
In other words, if your program crashes in the middle of an `append()` operation, shoji guarantees
that some number of complete indices (or nothing at all) will have been written across all the relevant tensors,
ensuring that they stay in sync.
If you need stronger guarantees, you can wrap multiple database operations in a `shoji.transaction`.
### Limitations
Shoji is built on [FoundationDB](https://www.foundationdb.org), a powerful open-source key-value store
developed by [Apple](https://www.apple.com). It is FoundationDB that gives Shoji a solid foundation
of performance, scalability and ACID guarantees. These features come with a few limitations:
* Transactions cannot exceed 5 seconds. If a transaction takes longer, it's terminated and rolled back.
For Shoji, this limits the total feasible size of a slice (or a set of rows for append operations), since
Shoji reads and writes slices transactionally.
* Transactions exceeding 1 MB can cause performance issues, and transactions cannot exceed 10 MB. This
also limits the total feasible size of a tensor slice, since Shoji reads and writes slices transactionally
to ensure consistency.
* FoundationDB is optimized to run on SSDs. Running on mechanical disks is discouraged.
For more details about these and some other limitations, see the [FoundationDB docs](https://apple.github.io/foundationdb/known-limitations.html)
## Getting started
### Installation
Shoji requires Python 3.7+ (we recommend [Anaconda](https://www.anaconda.com/products/individual)).
First, in your terminal, install the shoji Python package:
```shell
$ git clone https://github.com/linnarsson-lab/shoji.git
$ pip install -e shoji
```
Check that you can now connect to the database:
```python
import shoji
db = shoji.connect()
db
```
Typing `db` alone at the last line above should return a representation of the contents of the database
(which might be empty at this point).
### First steps
Let's create a workspace and fill it with some data:
```python
import numpy as np  # needed for the random initial values below
db.scRNA = shoji.Workspace()
db.scRNA.cells = shoji.Dimension(shape=None)
db.scRNA.genes = shoji.Dimension(shape=5000)
db.scRNA.Expression = shoji.Tensor("int16", ("cells", "genes"), inits=np.random.randint(0, 10, size=(1000, 5000), dtype="int16"))
db.scRNA.Age = shoji.Tensor("uint16", ("cells",), inits=np.random.randint(0, 50, size=1000, dtype="uint16"))
db.scRNA.GeneLength = shoji.Tensor("uint16", ("genes",), inits=np.random.randint(0, 5000, size=5000, dtype="uint16"))
db.scRNA.Chromosome = shoji.Tensor("string", ("genes",), inits=np.full(5000, "chr1", dtype="object"))
```
Now we can query the database using `shoji.filter`s. We'll load the Expression matrix,
including only rows (`"cells"` dimension) where `Age > 10` and columns (`"genes"` dimension)
where `GeneLength < 1000`:
```python
ws = db.scRNA
view = ws[ws.GeneLength < 1000, ws.Age > 10]
view.Expression.shape
# Returns something like (780, 1000): roughly 78% of cells have Age > 10 and 20% of genes have GeneLength < 1000
```
## Learn more
.. image:: assets/bitmap/overview@2x.png
### Workspaces (`shoji.workspace`)
Workspaces let you organise collections of data that belong together. Tensors and dimensions
are created in workspaces, and tensors and dimensions that live in different workspaces
are unrelated to each other.
### Tensors (`shoji.tensor`)
Tensors are N-dimensional arrays of numbers, booleans or strings. All data in shoji is
stored as tensors.
### Dimensions (`shoji.dimension`)
Dimensions define the relationship between tensors, and impose constraints that ensure
your database is consistent.
### Filters (`shoji.filter`)
Filters are expressions used to select tensor rows. Filters create views, and views
are the only way to read and write data in shoji.
### Views (`shoji.view`)
Views are windows into the database, created by applying filters. Views
are the only way to read and write data in shoji.
### Transactions (`shoji.transaction`)
Perform complex database operations atomically.
"""
import fdb
try:
fdb.api_version(630)
except RuntimeError:
fdb.api_version(620)
from .dimension import Dimension
from .tensor import Tensor, TensorValue
from .workspace import Workspace, WorkspaceManager, create_workspace
from .connect import connect
from .filter import Filter, CompoundFilter, TensorFilter, ConstFilter, DimensionBoolFilter, DimensionIndicesFilter, DimensionSliceFilter, TensorBoolFilter, TensorIndicesFilter, TensorSliceFilter
from .transaction import Transaction
from .groupby import GroupViewBy, GroupDimensionBy, Accumulator
from .view import View
Sub-modules
- shoji.connect: Connecting to a Shoji database cluster …
- shoji.dimension: Dimensions represent named, shared tensor axes. When two tensors share an axis, they are constrained to have the same number of elements along that …
- shoji.filter: Using filters …
- shoji.groupby: Grouping tensors and applying aggregations …
- shoji.io
- shoji.tensor: All data in Shoji is stored as N-dimensional tensors. A tensor is a generalisation of scalars, vectors and matrices to N dimensions …
- shoji.tests
- shoji.transaction: Transactions, supporting atomic multi-statement operations. Usage: …
- shoji.view: Views let you work with a selected subset of a workspace. Reading from the view automatically returns values from the selected subset of the database …
- shoji.workspace: Workspaces let you organise collections of data that belong together. Workspaces can be nested, like folders in a file system …