Skip to article frontmatterSkip to article content

Tool Options

Usage

The seqspec tool operates on seqspec files and

  1. Facilitates the standardization of preprocessing steps across different assays,
  2. Enables data management and tracking,
  3. Simplifies the interpretation and reuse of sequencing data.

seqspec consists of the following subcommands:

usage: seqspec [-h] <CMD> ...

seqspec 0.3.0: A machine-readable file format for genomic library sequence and structure.

GitHub: https://github.com/pachterlab/seqspec
Documentation: https://pachterlab.github.io/seqspec/

positional arguments:
  <CMD>
    build     Generate a complete seqspec with natural language (LLM-assisted)
    check     Validate seqspec file against specification
    find      Find objects in seqspec file
    file      List files present in seqspec file
    format    Autoformat seqspec file
    index     Identify position of elements in seqspec file
    info      Get information from seqspec file
    init      Generate a new empty seqspec file
    insert    Insert regions or reads into an existing spec
    methods   Convert seqspec file into methods section
    modify    Modify attributes of various elements in seqspec file
    onlist    Get onlist file for elements in seqspec file
    print     Display the sequence and/or library structure from seqspec file
    split     Split seqspec file by modality
    upgrade   Upgrade seqspec file to current version (hidden)
    version   Get seqspec tool version and seqspec file version

optional arguments:
  -h, --help  show this help message and exit

seqspec operates on seqspec compatible YAML files that follow the specification. All of the following examples will use the seqspec specification for the DOGMAseq-DIG assay which can be found here: seqspec/examples/specs/dogmaseq-dig/spec.yaml.

seqspec check: Validate seqspec file against specification

Check that the seqspec file is correctly formatted and consistent with the specification.

seqspec check [-h] [-o OUT] [--skip {igvf,igvf_onlist_skip}] yaml
from seqspec.seqspec_check import run_check

run_check(schema_fn: str, spec_fn: str, o: str)

A list of checks performed:

  1. Check that the spec validates against the JSON Schema.
  2. Check that modalities are unique.
  3. Check that region_ids of the first level of the library_spec correspond to modalities (one per modality).
  4. Check that onlist files exist (either as local paths or reachable URLs).
  5. Check that the read_ids in the sequence_spec are unique.
  6. Check that read files exist (either as local paths or reachable URLs).
  7. Check that read (primer_id, strand) pairs are unique across all reads.
  8. Check that the region_ids are unique across all regions.
  9. Check that each read modality exists in the assay list of modalities.
  10. Check that each read primer_id exists among the region IDs in the library_spec.
  11. Check sequence_type and region annotation consistencies:
  1. Check that the min_len is less than or equal to the max_len.
  2. Check that the length of the sequence in every region is between the min_len and max_len.
  3. Check that the number of files in each Read is the same across all reads.
  4. Check that for every region with subregions, the region min_len/max_len equals the sum of the subregions’ min_len/max_len.
  5. Check that for every region with subregions, the region sequence equals the left-to-right concatenation of the subregions’ sequences.
  6. Check that each read’s max_len does not exceed the sequence-able range of library elements after (pos strand) or before (neg strand) the primer.

Below are a list of example errors one may encounter when checking a spec:

# The "assay" value was not specified in the spec
[error 1] None is not of type 'string' in spec['assay']

# The "modalities" are not using the controlled vocabulary
[error 2] 'Ribonucleic acid' is not one of ['rna', 'tag', 'protein', 'atac', 'crispr'] in spec['modalities'][0]

# The "region_type" is not using the controlled vocabulary
[error 3] 'link_1' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'fastq', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][3]['region_type']

# The "sequence_type" is not using the controlled vocabulary
[error 4] 'linker' is not one of ['fixed', 'random', 'onlist', 'joined'] in spec['library_spec'][0]['regions'][3]['sequence_type']

# The "region_id" is not unique across the spec
[error 5] region_id 'cell_bc' is not unique across all regions

# The length of the given "sequence" is less than the "min_len" specified for the sequence
[error 6] 'sample_bc' sequence 'NNNNNNNN' length '8' is less than min_len '10'

# The "filename" for the specified "onlist" does not exist in the same location as the spec.
[error 7] i5_index_onlist.txt does not exist

# The provided "sequence" contains invalid characters (only A, C, G, T, N, and X are permitted)
[error 8] 'NNNNNNNNZN' does not match '^[ACGTNX]+$' in spec['library_spec'][0]['regions'][4]['sequence']

# The "md5" for the given "onlist" file is not a valid md5sum
[error 9] '7asddd7asd7' does not match '^[a-f0-9]{32}$' in spec['library_spec'][0]['regions'][8]['onlist']['md5']

Examples

# check the spec against the formal specification
$ seqspec check spec.yaml
[error 1] None is not of type 'string' in spec['assay']
[error 2] 'Ribonucleic acid' is not one of ['rna', 'tag', 'protein', 'atac', 'crispr'] in spec['modalities'][0]

seqspec find: Find objects in seqspec file

seqspec find [-h] [-o OUT] [-s SELECTOR] -m MODALITY [-i ID] yaml
from seqspec.seqspec_find import run_find

run_find(spec_fn: str, modality: str, id: str, idtype: str, o: str)

Examples

# Find reads by id
$ seqspec find -m rna -s read -i rna_R1 spec.yaml
- !Read
  read_id: rna_R1
  name: rna Read 1
  modality: rna
  primer_id: rna_truseq_read1
  min_len: 28
  max_len: 28
  strand: pos
  files:
  - !File
    file_id: rna_R1_SRR18677638.fastq.gz
    filename: rna_R1_SRR18677638.fastq.gz
    filetype: fastq
    filesize: 18499436
    url: fastqs/rna_R1_SRR18677638.fastq.gz
    urltype: local
    md5: 7eb15a70da9b729b5a87e30b6596b641

# Find regions with `barcode` region type
$ seqspec find -m rna -s region-type -i barcode spec.yaml
- !Region
  region_id: rna_cell_bc
  region_type: barcode
  name: Cell Barcode
  sequence_type: onlist
  sequence: NNNNNNNNNNNNNNNN
  min_len: 16
  max_len: 16
  onlist: !Onlist
    location: local
    filename: RNA-737K-arc-v1.txt
    filetype: txt
    filesize: 0
    url: RNA-737K-arc-v1.txt
    urltype: local
    md5: a88cd21e801ae6f9a7d9a48b67ccf693
    file_id: RNA-737K-arc-v1.txt
  regions: null
  parent_id: rna

seqspec file: List files present in seqspec file

seqspec file [-h] [-o OUT] [-i IDs] -m MODALITY [-s SELECTOR] [-f FORMAT] [-k KEY] [--fullpath] yaml
from seqspec.seqspec_file import run_file

run_file(spec_fn: str, m: str, ids: List[str], idtype: str, fmt: str, k: str, o: str, fp=False)

Examples

# List paired read files
$ seqspec file -m rna spec.yaml
rna_R1_SRR18677638.fastq.gz     rna_R2_SRR18677638.fastq.gz

# List interleaved read files
$ seqspec file -m rna -f interleaved spec.yaml
rna_R1_SRR18677638.fastq.gz
rna_R2_SRR18677638.fastq.gz

# List urls of all read files
$ seqspec file -m rna -f list -k url spec.yaml
rna_R1  rna_R1_SRR18677638.fastq.gz     fastqs/rna_R1_SRR18677638.fastq.gz
rna_R2  rna_R2_SRR18677638.fastq.gz     fastqs/rna_R2_SRR18677638.fastq.gz

# List all files in regions
$ seqspec file -m rna -f list -s region -k all spec.yaml
rna_cell_bc     RNA-737K-arc-v1.txt     RNA-737K-arc-v1.txt     txt     2142553 https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz        https   a88cd21e801ae6f9a7d9a48b67ccf693

# List files for barcode regions in json
$ seqspec file -m rna -f json -s region-type -k all -i barcode spec.yaml
[
    {
        "file_id": "RNA-737K-arc-v1.txt",
        "filename": "RNA-737K-arc-v1.txt",
        "filetype": "txt",
        "filesize": 2142553,
        "url": "https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz",
        "urltype": "https",
        "md5": "a88cd21e801ae6f9a7d9a48b67ccf693"
    }
]

Note: seqspec file -s read gets the files for the read, not the files contained in the regions mapped to the read.

seqspec format: Autoformat seqspec file

Automatically fill in missing fields in the spec.

seqspec format [-h] [-o OUT] yaml
from seqspec.seqspec_format import run_format
run_format(spec_fn: str, o: str)

Examples

# format the spec and print the spec to stdout
$ seqspec format spec.yaml

# note you can also overwrite the spec you are formatting
$ seqspec format -o spec.yaml spec.yaml

seqspec index: Identify position of elements in seqspec file

Identify the position of elements in a spec for use in downstream tools. Returns the 0-indexed position of elements contained in a given region in the 5’->3’ direction.

seqspec index [-h] [-o OUT] [-t TOOL] [-s SELECTOR] [--rev] [--subregion-type SUBREGIONTYPE] [--no-overlap] -m MODALITY [-i IDs] yaml
from seqspec.seqspec_index import run_index
run_index(spec_fn: str, modality: str, ids: List[str], idtype: str, fmt: str, rev: str, subregion_type: str, o)

Examples

# get the indices of the elements contained within the FASTQs specified in the spec in tab format
$ seqspec index -m atac -s file -i atac_R1_SRR18677642.fastq.gz,atac_R2_SRR18677642.fastq.gz,atac_R3_SRR18677642.fastq.gz spec.yaml
atac_R1 gdna    gdna    0       53
atac_R2 atac linker     linker  0       8
atac_R2 Cell Barcode    barcode 8       24
atac_R3 gdna    gdna    0       53

# do the same but in the kb format
$ seqspec index -m atac -t kb -s file -i atac_R1_SRR18677642.fastq.gz,atac_R2_SRR18677642.fastq.gz,atac_R3_SRR18677642.fastq.gz spec.yaml
1,8,24:-1,-1,-1:0,0,53,2,0,53

# If the files are specified in the spec then -i can be omitted
$ seqspec index -m atac -t kb -s file spec.yaml
1,8,24:-1,-1,-1:0,0,53,2,0,53

seqspec info: get info about seqspec file

seqspec info [-h] [-k KEY] [-f FORMAT] [-o OUT] yaml
from seqspec.seqspec_info import run_info
run_info(spec_fn: str, f: str, k=None, o=None)

Examples

# Get meta information in json format
$ seqspec info -f json spec.yaml
{
    "seqspec_version": "0.3.0",
    "assay_id": "DOGMAseq-DIG",
    "name": "DOGMAseq-DIG/Illumina",
    "doi": "https://doi.org/10.1186/s13059-022-02698-8",
    "date": "23 June 2022",
    "description": "DOGMAseq with digitonin (DIG) is a single-cell multi-omics assay that simultaneously measures protein, RNA, and chromatin accessibility in single cells. The assay is based on the DOGMAseq technology, which uses a DNA-barcoded antibody library to capture proteins of interest, followed by a single-cell RNA-seq protocol and a single-cell ATAC-seq protocol. The DOGMAseq-LLL assay is designed to be compatible with the 10x Genomics Chromium platform.",
    "lib_struct": "",
    ...
    # long output omitted

# Get the list of modalities
$ seqspec info -k modalities spec.yaml
protein tag     rna     atac

# Get library spec in json format
$ seqspec info -f json -k library_spec spec.yaml
{
    "protein": [
        {
            "region_id": "ghost_protein_truseq_read1",
            "region_type": "named",
            "name": "Truseq Read 1",
            "sequence_type": "fixed",
            "onlist": null,
            "sequence": "",
            "min_len": 0,
            "max_len": 0,
            "regions": []
        },
        ...
        # long output omitted

# Get sequence spec in json format
$ seqspec info -f json -k sequence_spec spec.yaml
[
    {
        "read_id": "protein_R1",
        "name": "protein Read 1",
        "modality": "protein",
        "primer_id": "protein_truseq_read1",
        "min_len": 28,
        "max_len": 28,
        "strand": "pos",
        "files": [
        ...
        # long output omitted

seqspec init: Generate a new empty seqspec draft

Create a minimal, valid draft containing only meta Regions (one per modality). This is intended as a starting point to then insert regions and reads.

seqspec init [-h] -n NAME -m MODALITIES [--doi DOI] [--description DESC] [--date YYYY-MM-DD] [-o OUT]

Example:

seqspec init -n myassay -m rna,atac -o spec.yaml

seqspec methods: Convert seqspec file into methods section

Generate a methods section from a seqspec file.

seqspec methods [-h] -m MODALITY [-o OUT] yaml
from seqspec.seqspec_methods import run_methods
run_methods(spec_fn: str, m: str, o: str)

Examples

# print methods for rna modality
$ seqspec methods -m rna spec.yaml
Methods
The rna portion of the DOGMAseq-DIG/Illumina assay was generated on 23 June 2022.

Libary structure

The library was generated using the CG000338 Chromium Next GEM Multiome ATAC + Gene Expression Rev. D protocol (10x Genomics) library protocol and Illumina Truseq Single Index library kit. The library contains the following elements:

1. Truseq Read 1: 33-33bp fixed sequence (ACACTCTTTCCCTACACGACGCTCTTCCGATCT).
2. Cell Barcode: 16-16bp onlist sequence (NNNNNNNNNNNNNNNN), onlist file: RNA-737K-arc-v1.txt.
3. umi: 12-12bp random sequence (XXXXXXXXXXXX).
4. cdna: 102-102bp random sequence (XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX).
5. Truseq Read 2: 34-34bp fixed sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC).


Sequence structure

The library was sequenced on a Illumina NovaSeq 6000 (EFO:0008637) using the NovaSeq 6000 S2 Reagent Kit v1.5 (100 cycles) sequencing kit. The library was sequenced using the following configuration:

- rna Read 1: 28 cycles on the positive strand using the rna_truseq_read1 primer. The following files contain the sequences in Read 1:
  - File 1: rna_R1_SRR18677638.fastq.gz
- rna Read 2: 102 cycles on the negative strand using the rna_truseq_read2 primer. The following files contain the sequences in Read 2:
  - File 1: rna_R2_SRR18677638.fastq.gz

seqspec modify: Modify attributes of various elements (JSON-based)

Modify objects by passing a JSON array of partial objects via --keys. Only provided fields are applied.

seqspec modify [-h] -m MODALITY -s SELECTOR -k JSON [-o OUT] yaml

Selectors: read, region, file, seqkit, seqprotocol, libkit, libprotocol, assay.

Examples:

# Update a read name
seqspec modify -m rna -s read -k '[{"read_id":"rna_R1","name":"renamed_rna_R1"}]' spec.yaml

# Update a region name
seqspec modify -m rna -s region -k '[{"region_id":"rna_cell_bc","name":"Cell Barcode"}]' spec.yaml

# Update a file url
seqspec modify -m rna -s file -k '[{"file_id":"R1.fastq.gz","url":"./fastq/R1.fastq.gz"}]' spec.yaml

Examples

# modify the read id
$ seqspec modify -m atac -o mod_spec.yaml -i atac_R1 --read-id renamed_atac_R1 spec.yaml

# modify the region id
$ seqspec modify -m atac -o mod_spec.yaml -s region -i atac_cell_bc --region-id renamed_atac_cell_bc spec.yaml

# modify the files for R1 fastq
$ seqspec modify -m atac -o mod_spec.yaml -i atac_R1 --files "R1_1.fastq.gz,fastq,0,./fastq/R1_1.fastq.gz,local,null:R1_2.fastq.gz,fastq,0,./fastq/R1_2.fastq.gz,local,null" spec.yaml

seqspec onlist: Get onlist file(s) for elements in seqspec file

seqspec onlist [-h] [-o OUT] [-s SELECTOR] [-f {product,multi}] -m MODALITY [-i ID] yaml
from seqspec.seqspec_onlist import run_onlist

run_onlist(spec_fn, modality, ids, idtype, fmt, o)

Note: If, for example, there are multiple regions with the specified region_type in the modality (e.g. multiple barcodes), then seqspec onlist will return a path to an onlist that it generates where the entries in that onlist are the cartesian product of the onlists for all of the regions found.

Examples

# Get onlist for the element in the rna_R1 read
$ seqspec onlist -m rna -s read -i rna_R1 spec.yaml
/path/to/spec/folder/RNA-737K-arc-v1.txt

# Get onlist for barcode region type
$ seqspec onlist -m rna -s region-type -i barcode spec.yaml
/path/to/spec/folder/RNA-737K-arc-v1.txt

seqspec print: Display the sequence and/or library structure from seqspec file

Print sequence and/or library structure as ascii, png, or html.

seqspec print [-h] [-o OUT] [-f FORMAT] yaml
from seqspec.seqspec_print import run_seqspec_print
run_seqspec_print(spec_fn, fmt, o)

Examples

# Print the library structure as ascii
$ seqspec print spec.yaml
           ┌─'ghost_protein_truseq_read1:0'
           ├─'protein_truseq_read1:33'
           ├─'protein_cell_bc:16'
 ┌─protein─┤
 │         ├─'protein_umi:12'
 │         ├─'protein_seq:15'
 │         └─'protein_truseq_read2:34'
 │         ┌─'tag_truseq_read1:33'
 │         ├─'tag_cell_bc:16'
 ├─tag─────┼─'tag_umi:12'
 │         ├─'tag_seq:15'
─┤         └─'tag_truseq_read2:34'
 │         ┌─'rna_truseq_read1:33'
 │         ├─'rna_cell_bc:16'
 ├─rna─────┼─'rna_umi:12'
 │         ├─'cdna:102'
 │         └─'rna_truseq_read2:34'
 │         ┌─'atac_truseq_read1:33'
 │         ├─'gDNA:100'
 └─atac────┼─'atac_truseq_read2:34'
           ├─'spacer:8'
           └─'atac_cell_bc:16'

# Print the sequence and library structure as ascii
$ seqspec print -f seqspec-ascii spec.yaml
protein
---
                                |--------------------------->(1) protein_R1
ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXTCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG
                                                             <--------------|(2) protein_R2
tag
---
                                |--------------------------->(1) tag_R1
ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXTCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG
                                                             <--------------|(2) tag_R2
rna
---
                                |--------------------------->(1) rna_R1
ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG
                                                             <-----------------------------------------------------------------------------------------------------|(2) rna_R2
atac
---
                                |---------------------------------------------------->(1) atac_R1
                                                                                                                                                                      |----------------------->(2) atac_R2
ACACTCTTTCCCTACACGACGCTCTTCCGATCTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGACGCGNNNNNNNNNNNNNNNN
TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTGGTCTGCGCNNNNNNNNNNNNNNNN
                                                                                <----------------------------------------------------|(3) atac_R3


# Print the sequence and library structure as html
$ seqspec print -f seqspec-html spec.yaml
  <!DOCTYPE html>
  <html>
    <head>
      <meta name="viewport" content="width=device-width, initial-scale=1" />
      <style>
      highlight {
      color: green;
      }
      ...
      # long output omitted

# Print the library structure as a png
$ seqspec print -o spec.png -f seqspec-png spec.yaml

seqspec split: Split seqspec file by modality

seqspec split [-h] -o OUT yaml
from seqspec.seqspec_split import run_split
run_split(spec_fn, o)

Examples

# split spec into modalities
$ seqspec split -o split spec.yaml
$ ls -1
spec.yaml
split.atac.yaml
split.protein.yaml
split.rna.yaml
split.tag.yaml

seqspec version: Get seqspec tool version and seqspec file version

seqspec version [-h] [-o OUT] yaml
from seqspec.seqspec_version import run_version
run_version(spec_fn, o)

Examples

# Get versions of tool and file
$ seqspec version spec.yaml
seqspec version: 0.3.0
seqspec file version: 0.3.0

(HIDDEN) seqspec upgrade: Upgrade seqspec file from older versions to the current version

This is a hidden subcommand that upgrades an old version of the spec to the current one. It is not intended to be used in a production environment.

seqspec upgrade [-h] [-o OUT] yaml
from seqspec.seqspec_upgrade import run_upgrade
run_upgrade(spec_fn, o)

Examples

# upgrade spec
$ seqspec upgrade -o spec.yaml spec.yaml
References
  1. Xu, Z., Heidrich-O’Hare, E., Chen, W., & Duerr, R. H. (2022). Comprehensive benchmarking of CITE-seq versus DOGMA-seq single cell multimodal omics. Genome Biology, 23(1). 10.1186/s13059-022-02698-8