Create from fastq

This section introduces the loompy command-line tool, and shows how it is used to create .loom files directly from fastq files.

The loompy command-line tool

After installing loompy 3, a new command-line tool loompy will be available. To verify, launch a Terminal window and type loompy, and you should see the following output:

Usage: loompy [OPTIONS] COMMAND [ARGS]...

Options:
  --show-message / --hide-message
  --verbosity [error|warning|info|debug]
  --help                          Show this message and exit.

Commands:
  fromfq

Using the loompy fromfq command

1. Install kallisto

The excellent kallisto tool performs ultra-fast pseudoalignment, which loompy uses to count reads (UMIs) on genes.

2. Download an index (or build your own)

We provide a pre-built index of the human genome.

Warning

This index is really only suitable for use with 10x Chromium data (it makes strong assumptions about the read distribution).

Unzip the index to a directory, which will have the following content:

10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
fragments2genes.txt
gencode.v31.fragments.idx
gencode.v31.metadata.tab
manifest.json
spliced_fragments.txt
unspliced_fragments.txt

To build your own index, see the build instructions.

3. Download metadata and fastq files

You need to provide the input fastq files, and metadata for your samples. For this tutorial, please download the 1k PBMC v3 reference dataset from 10x Genomics: pbmc_1k_v3_fastqs.tar.

Metadata can be provided either in the form of a tab-delimited file with a single header row, or as a sqlite3 database with a table named sample.

One metadata column must be named name and contain sample names (sample IDs). One column must be named technology and contain the technology name (currently one of 10xv1, 10xv2 or 10xv3). The technology name will be used to locate the barcode whitelist file (e.g. 10xv3_whitelist.txt). Finally, one column must be named targetnumcells and contain the target (or expected) cell number in the samples. If absent, the default will be 5000.

The metadata file (or database) will typically contain metadata for all the samples you are working with, one row per sample. You can add any number of additional columns of metadata, which will all be copied to the newly generated loom file as global attributes.

For this tutorial, create a tab-delimited file named metadata.tab with the following content:

name    technology  targetnumcells
1kPBMC  10xv3       1000

(If you copy-paste the above, make sure the fields are tab delimited)

4. Run the loompy fromfq command

Run the following command (replace human_GRCh38_gencode.v31 with the path to the human genome index you just downloaded):

loompy fromfq 1kPBMC.loom 1kPBMC human_GRCh38_gencode.v31 metadata.tab pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3_S1_L002_R2_001.fastq.gz

Note that the fastq files are listed in pairs of R1 (read 1) and R2 (read 2) files. The index files (I1) are not used.

After about half an hour, you will have a 1kPBMC.loom file with separate spliced and unspliced layers (the main matrix will be the sum of the two), and rich metadata for both genes, cells and the sample itself stored as attributes. The output should look something like this:

2019-09-29 16:28:08,186 - INFO - kallisto bus -i human_GRCh38_gencode.v31/gencode.v31.fragments.idx -o /tmp/tmp7yk3rf07 -x 10xv3 -t 56 pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3_S1_L002_R2_001.fastq.gz
2019-09-29 16:28:08,307 - INFO - [index] k-mer length: 31
2019-09-29 16:28:08,307 - INFO - [index] number of targets: 845,338
2019-09-29 16:28:08,307 - INFO - [index] number of k-mers: 178,605,364
2019-09-29 16:28:29,537 - INFO - [index] number of equivalence classes: 4,191,221
2019-09-29 16:28:40,951 - INFO - [quant] will process sample 1: pbmc_1k_v3_S1_L001_R1_001.fastq.gz
2019-09-29 16:28:40,951 - INFO -                                pbmc_1k_v3_S1_L001_R2_001.fastq.gz
2019-09-29 16:28:40,951 - INFO - [quant] will process sample 2: pbmc_1k_v3_S1_L002_R1_001.fastq.gz
2019-09-29 16:28:40,951 - INFO -                                pbmc_1k_v3_S1_L002_R2_001.fastq.gz
2019-09-29 16:31:44,144 - INFO - [quant] finding pseudoalignments for the reads ... done
2019-09-29 16:31:44,145 - INFO - [quant] processed 66,601,887 reads, 46,119,840 reads pseudoaligned
2019-09-29 16:31:52,543 - INFO - Loading gene metadata
2019-09-29 16:31:52,818 - INFO - Loading fragments-to-gene mappings
2019-09-29 16:31:53,426 - INFO - Indexing genes
2019-09-29 16:31:53,846 - INFO - Loading equivalence classes
2019-09-29 16:32:22,273 - INFO - Mapping equivalence classes to genes
2019-09-29 16:32:32,817 - INFO - Loading fragment IDs
2019-09-29 16:32:33,280 - INFO - Loading BUS records
2019-09-29 16:33:46,692 - INFO - Sorting cell IDs
2019-09-29 16:33:49,611 - INFO - Found 46,119,840 records for 60,662 genes and 551,892 uncorrected cell barcodes.
2019-09-29 16:33:49,611 - INFO - Correcting cell barcodes
2019-09-29 16:35:58,753 - INFO - Found 307,677 corrected cell barcodes.
2019-09-29 16:35:58,754 - INFO - Removing redundant reads using UMIs
2019-09-29 16:36:45,546 - INFO - 71% sequencing saturation.
2019-09-29 16:36:45,546 - INFO - Counting pseudoalignments for main matrix
2019-09-29 16:36:52,752 - INFO - Found 5,027,188 UMIs.
2019-09-29 16:36:53,536 - INFO - Counting pseudoalignments for layer 'unspliced'
2019-09-29 16:38:00,099 - INFO - Found 2,376,590 UMIs.
2019-09-29 16:38:00,706 - INFO - Counting pseudoalignments for layer 'spliced'
2019-09-29 16:39:09,718 - INFO - Found 3,231,999 UMIs.
2019-09-29 16:39:09,718 - INFO - Calling cells
2019-09-29 16:42:32,387 - INFO - Found 1189 valid cells and ~77 ambient UMIs.
2019-09-29 16:42:32,388 - INFO - Creating loom file '1kPBMC.loom'
2019-09-29 16:42:32,388 - INFO - Saving

As you can see, 46,119,840 of 66,601,887 reads pseudoaligned (~70%) which is typical. The sequencing saturation was 71%, and the cell calling algorithm found 1189 valid cells (similar to the 1,222 cells reported by cellranger). Empty beads carried a median of 77 UMIs, presumably from cell-free ambient RNA.

Connect to the loom file and examine its global attributes:

import loompy
with loompy.connect("1kPBMC.loom") as ds:
  print(ds.attrs.keys())

['AmbientPValue', 'AmbientUMIs', 'BarcodeTotalUMIs', 'CellBarcodes', 'CreationDate', 'KallistoCommand', 'KallistoVersion', 'LOOM_SPEC_VERSION', 'NumPseudoaligned', 'NumReadsProcessed', 'RedundantReadFraction', 'SampleID', 'Saturation', 'Species', 'name', 'targetnumcells', 'technology']

…column attributes…

import loompy
with loompy.connect("1kPBMC.loom") as ds:
  print(ds.ca.keys())

['CellID', 'TotalUMIs']

…row attributes (see the index build instructions for an explanation of these)…

import loompy
with loompy.connect("1kPBMC.loom") as ds:
  print(ds.ra.keys())

['Accession', 'AccessionVersion', 'Aliases', 'CcdsID', 'Chromosome', 'ChromosomeEnd', 'ChromosomeStart', 'CosmicID', 'DnaBindingDomain', 'FullName', 'Gene', 'GeneType', 'HgncID', 'IsTF', 'Location', 'LocationSortable', 'LocusGroup', 'LocusType', 'MgdID', 'MirBaseID', 'OmimID', 'PubmedID', 'RefseqID', 'RgdID', 'UcscID', 'UniprotID', 'VegaID']

…and layers:

import loompy
with loompy.connect("1kPBMC.loom") as ds:
  print(ds.layers.keys())

['', 'spliced', 'unspliced']