
Technical Specification

Introduction

seqspec is an open-source file format specification and command-line tool for annotating sequencing libraries. It uses YAML for data representation. This document outlines the specification and explains various use cases.

Schema Overview

The seqspec schema is designed to annotate sequencing libraries through three main Pydantic models: Assay, Region, and Read. An Assay contains the library_spec (a tree of Region objects, possibly nested) and the sequence_spec (a list of Read objects). Files (e.g., FASTQ/BAM/SRA) can be associated with individual reads via a list of File objects.

Each seqspec file is associated with a sequencing run and documents the designed library structure and the designed read structure. A simple (but incomplete) example looks like the following:

library_protocol: 10xv3 Chromium scRNAseq
library_kit: Truseq dual index
sequence_protocol: Illumina Novaseq 6000
sequence_kit: Illumina Novaseq 6000 v1.5 kit
modalities:
  - Modality1
  - Modality2
sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
    - file_id: R1.fastq.gz
      ...
library_spec:
  - region_id: Modality1
    regions:
    - region_id: Region1
      ...
    - region_id: Region2
      ...
  - region_id: Modality2

Each object has clearly defined fields and helpful input variants (e.g., ReadInput, RegionInput) used by tools. The full JSON Schema is in seqspec/schema/seqspec.schema.json.

Assay Object

The Assay object contains overall metadata for the sequencing run.

Fields:

Example:

!Assay
seqspec_version: 0.3.0
assay_id: SPLiT-seq/Illumina
name: SPLiT-seq
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
  - rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
library_protocol: SPLiT-seq
library_kit: Custom
sequence_protocol: Illumina NovaSeq 6000 (EFO:0008637)
sequence_kit:
  - !SeqKit
    kit_id: "NovaSeq 6000 S2 Reagent Kit v1.5 (100\u2009cycles)"
    name: illumina
    modality: rna
sequence_spec: ...
library_spec: ...

Region Object

The library_spec contains a list of Region objects, possibly nested, that describe individual segments of the sequencing library molecule, specifying their types, sequences, and relationships to one another. The order of the Regions in the library_spec (top to bottom) corresponds to their linear order in the library molecule from the 5’ end to the 3’ end.

modalities:
- rna
library_spec:
  - region_id: rna # <-- must be a "modality" region
    regions: # <-- a list containing the linear ordering of the "regions" for the "rna" library molecule
    - region_id: illumina_p5
      ...
    - region_id: read1_primer
      ...
    - region_id: cell_bc
      ...
    - region_id: umi
      ...
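
The top-to-bottom ordering rule can be sketched in a few lines of Python. This is a minimal illustration on plain dictionaries, not the seqspec implementation: the helper name leaf_regions and the placeholder sequences are assumptions made for this example, and it assumes that a parent region's sequence is the concatenation of its children's sequences.

```python
from typing import Any

Region = dict[str, Any]  # plain-dict stand-in for a Region, for illustration only


def leaf_regions(region: Region) -> list[Region]:
    """Return the leaf regions under `region` in top-to-bottom (5' to 3') order."""
    children = region.get("regions") or []
    if not children:
        return [region]
    leaves: list[Region] = []
    for child in children:
        leaves.extend(leaf_regions(child))
    return leaves


# Hypothetical "rna" modality region; the sequences are short placeholders,
# not the real adapter or primer sequences.
rna = {
    "region_id": "rna",
    "regions": [
        {"region_id": "illumina_p5", "sequence": "ACGT"},
        {"region_id": "read1_primer", "sequence": "TTGG"},
        {"region_id": "cell_bc", "sequence": "NNNN"},
        {"region_id": "umi", "sequence": "NN"},
    ],
}

ordered_ids = [r["region_id"] for r in leaf_regions(rna)]
molecule = "".join(r["sequence"] for r in leaf_regions(rna))

print(ordered_ids)  # ['illumina_p5', 'read1_primer', 'cell_bc', 'umi']
print(molecule)     # ACGTTTGGNNNNNN
```

Because the traversal is depth-first and preserves list order, reordering entries under regions changes the molecule that the specification describes.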

Each Region has the following properties, which are useful for annotating an element of the library molecule:

Example:

!Region
region_id: barcode-1
region_type: barcode
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
  file_id: barcode-1_onlist.txt
  filename: barcode-1_onlist.txt
  filetype: txt
  filesize: 120
  url: ./
  urltype: local
  md5: 5b62453df2771f5aa856f78797f16591
regions: null
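
The length fields in the example above can be sanity-checked with a short sketch. The function name length_problems is hypothetical, and the rule it encodes (the sequence length must fall between min_len and max_len) is an assumption made for illustration, not a statement of what the seqspec tool enforces.

```python
from typing import Any


def length_problems(region: dict[str, Any]) -> list[str]:
    """Return human-readable problems with a region's length fields (assumed rules)."""
    problems = []
    seq_len = len(region["sequence"])
    if region["min_len"] > region["max_len"]:
        problems.append(f"{region['region_id']}: min_len exceeds max_len")
    if not region["min_len"] <= seq_len <= region["max_len"]:
        problems.append(
            f"{region['region_id']}: sequence length {seq_len} is outside "
            f"[{region['min_len']}, {region['max_len']}]"
        )
    return problems


# Mirrors the length-related fields of the barcode-1 example
barcode = {"region_id": "barcode-1", "sequence": "NNNNNNNN", "min_len": 8, "max_len": 8}
print(length_problems(barcode))  # []

# A deliberately inconsistent variant, for contrast
too_short = dict(barcode, sequence="NNNN")
print(length_problems(too_short))
```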

For more information about the various fields, please see the JSON schema specification (seqspec/schema/seqspec.schema.json). For consistency across assays I suggest following a standard naming convention for common regions. I’ve made a collection of “named” regions available; please see seqspec/docs/regions for a list of example regions.

Read Object

The sequence_spec contains a list of Read objects that describe the sequencing “reads” generated from sequencing the molecule described in the library_spec. A crucial concept is that each Read object contains a primer_id which maps to a single region_id in the library_spec. Reads can also contain Files, which I describe in the next section.

sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
    - file_id: R1.fastq.gz
      ...
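
The primer_id to region_id mapping can be checked with a small lookup over the nested regions. The following is a sketch on plain dictionaries that mirror the example above. The helpers find_region and unresolved_primers are hypothetical names, not part of the seqspec API, and restricting the search to the read's modality region is an assumption.

```python
from typing import Any, Optional

Region = dict[str, Any]  # plain-dict stand-in for a Region, for illustration only
Read = dict[str, Any]    # plain-dict stand-in for a Read, for illustration only


def find_region(regions: Optional[list[Region]], region_id: str) -> Optional[Region]:
    """Depth-first search of a (possibly nested) region list for `region_id`."""
    for region in regions or []:
        if region["region_id"] == region_id:
            return region
        found = find_region(region.get("regions"), region_id)
        if found is not None:
            return found
    return None


def unresolved_primers(library_spec: list[Region], sequence_spec: list[Read]) -> list[str]:
    """Return read_ids whose primer_id is not found under the read's modality region."""
    missing = []
    for read in sequence_spec:
        modality = find_region(library_spec, read["modality"])
        nested = modality.get("regions") if modality else None
        if find_region(nested, read["primer_id"]) is None:
            missing.append(read["read_id"])
    return missing


library_spec = [
    {"region_id": "Modality1",
     "regions": [{"region_id": "Region1"}, {"region_id": "Region2"}]},
    {"region_id": "Modality2"},
]
sequence_spec = [
    {"read_id": "Read1", "modality": "Modality1", "primer_id": "Region2"},
]

print(unresolved_primers(library_spec, sequence_spec))  # []

# A hypothetical read that points at a region that does not exist
broken = sequence_spec + [
    {"read_id": "ReadX", "modality": "Modality1", "primer_id": "Region9"},
]
print(unresolved_primers(library_spec, broken))  # ['ReadX']
```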

A Read object is annotated with the following attributes:

Example:

- !Read
  read_id: read_001
  name: Read 1 of Sample A
  modality: rna
  primer_id: primer_25
  min_len: 50
  max_len: 300
  strand: pos
  files:
  - !File
    file_id: read_001.fastq.gz
    ...

File Object

Files are annotated with the File object. Files can be local or remote (e.g., FASTQ, BAM, POD5, TXT, SRA). File objects contain the following attributes:

File objects are used in the Onlist object within “onlist” Regions. They are also used in the Read objects as a list of File objects.
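
The Onlist example earlier on this page records an md5 value, so one natural use is verifying a local file against that value. The sketch below uses only the Python standard library. The helpers md5_of and matches_md5 are hypothetical names, and the sketch assumes that md5 holds the hexadecimal MD5 digest of the file contents and that local files are resolved relative to a base directory chosen by the caller.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(file_entry: dict[str, Any], base_dir: Path) -> bool:
    """Compare the recorded md5 with the digest of base_dir/filename."""
    return md5_of(base_dir / file_entry["filename"]) == file_entry["md5"]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Write a tiny stand-in onlist so the example is self-contained
    (base / "barcode-1_onlist.txt").write_bytes(b"AAAAAAAA\nCCCCCCCC\n")

    entry = {
        "file_id": "barcode-1_onlist.txt",
        "filename": "barcode-1_onlist.txt",
        "urltype": "local",
        "md5": md5_of(base / "barcode-1_onlist.txt"),
    }
    ok = matches_md5(entry, base)
    # A record with a wrong checksum should fail the comparison
    bad = matches_md5(dict(entry, md5="0" * 32), base)

print(ok, bad)  # True False
```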

Python library

seqspec files can be loaded into Python as a Python object. Fields can then be accessed and manipulated with dot notation:

from seqspec.utils import load_spec

spec = load_spec("seqspec/assays/10x-RNA-v3/spec.yaml")

print(spec.get_libspec("RNA").sequence)
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG