Introduction¶
seqspec is an open-source file format specification and command-line tool for annotating sequencing libraries that uses YAML for data representation. This document outlines the specification and explains various use cases.
Schema Overview¶
The seqspec schema is designed to annotate sequencing libraries through three main Pydantic models: Assay, Region, and Read. An Assay contains the library_spec (a tree of Region objects, possibly nested) and the sequence_spec (a list of Read objects). Files (e.g., FASTQ/BAM/SRA) can be associated with individual reads via a list of File objects.
Each seqspec file is associated with a sequencing run and documents the designed library structure and the designed read structure. A simple (but incomplete) example looks like the following:
library_protocol: 10xv3 Chromium scRNAseq
library_kit: Truseq dual index
sequence_protocol: Illumina Novaseq 6000
sequence_kit: Illumina Novaseq 6000 v1.5 kit
modalities:
  - Modality1
  - Modality2
sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
      - file_id: R1.fastq.gz
        ...
library_spec:
  - region_id: Modality1
    regions:
      - region_id: Region1
        ...
      - region_id: Region2
        ...
  - region_id: Modality2

Each object has clearly defined fields and helpful input variants (e.g., ReadInput, RegionInput) used by tools. The full JSON Schema is in seqspec/schema/seqspec.schema.json.
Assay Object¶
The Assay object contains overall metadata for the sequencing run.
Fields:
- seqspec_version: String specifying the version of the seqspec specification, adhering to semantic versioning.
- assay_id: Identifier for the assay.
- name: The name of the assay.
- doi: The DOI of the paper that describes the assay.
- date: The seqspec creation date.
- description: A short description of the assay.
- modalities: The modalities the assay targets, e.g. “dna”, “rna”, “tag”, “protein”, “atac”, “crispr”.
- lib_struct: Optional link to Teichmann lab library structure page.
- library_protocol: The protocol/machine/tool to generate the library insert (can be a modality-specific list).
- library_kit: The kit used to make the library sequence_protocol compatible (can be a modality-specific list).
- sequence_protocol: The protocol/machine/tool to generate sequences (can be a modality-specific list).
- sequence_kit: The kit used with the protocol to sequence the library (can be a modality-specific list).
- sequence_spec: The spec for the sequence structure, an array of Read objects.
- library_spec: The spec for the library structure, an array of Region objects.
Example:
!Assay
seqspec_version: 0.3.0
assay_id: SPLiT-seq/Illumina
name: SPLiT-seq
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
  - rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
library_protocol: SPLiT-seq
library_kit: Custom
sequence_protocol: Illumina NovaSeq 6000 (EFO:0008637)
sequence_kit:
  - !SeqKit
    kit_id: "NovaSeq 6000 S2 Reagent Kit v1.5 (100\u2009cycles)"
    name: illumina
    modality: rna
sequence_spec: ...
library_spec: ...

Region Object¶
The library_spec contains a list of, possibly nested, Region objects which detail individual segments within the sequencing library molecule, specifying types, sequences, and relationships between segments. The order of the Regions in the library_spec (top to bottom) corresponds to their linear ordering in the library molecule from the 5’ -> 3’ end.
modalities:
  - rna
library_spec:
  - region_id: rna # <-- must be a "modality" region
    regions: # <-- a list containing the linear ordering of the "regions" for the "rna" library molecule
      - region_id: illumina_p5
        ...
      - region_id: read1_primer
        ...
      - region_id: cell_bc
        ...
      - region_id: umi
        ...

Each Region has the following properties which are useful to annotate the element of the library molecule:
- region_id is a free-form string and must be unique across all regions in the seqspec file.
  - if the assay contains multiple regions of the same region_type it may be useful to append an integer to the end of the region_id to differentiate those regions. For example, if the assay had four barcodes then each of the individual barcode regions could have the region_ids barcode-1, barcode-2, barcode-3, barcode-4.
- region_type can be one of the following:
  - atac: The modality for chromatin accessibility capture.
  - barcode: A region corresponding to a synthetic barcode sequence often associated with samples or cells.
  - cdna: Complementary DNA generated from an RNA product.
  - crispr: The modality for barcode-based CRISPR assays.
  - custom_primer: A synthesized segment of nucleic acid used to initiate DNA synthesis.
  - dna: Deoxyribonucleic acid, targets often generated for MPRA assays.
  - fastq: A region corresponding to a FASTQ file.
  - fastq_link: A region corresponding to a FASTQ file that is stored remotely (via url).
  - gdna: Genomic DNA, targets often obtained with ATACseq.
  - hic: The modality corresponding to high-throughput chromosome conformation capture, a technique for studying the three-dimensional structure of genomes.
  - illumina_p5: A sequencing primer specific to Illumina platforms, used to bind the library molecule to the flow cell.
  - illumina_p7: A sequencing primer specific to Illumina platforms, used to bind the library molecule to the flow cell.
  - index5: A barcode sequence used for multiplexing and sample identification in sequencing, associated with the P5 end.
  - index7: A barcode sequence used for multiplexing and sample identification in sequencing, associated with the P7 end.
  - linker: A short, synthetic DNA sequence used to connect two molecules or fragments.
  - ME1: Mosaic end 1, used in the Nextera Library kit for library preparation.
  - ME2: Mosaic end 2, used in the Nextera Library kit for library preparation.
  - methyl: The modality for methylation sequencing which assays the presence of a methyl group.
  - named: A custom named region for grouping other regions.
  - meta: A top-level modality placeholder used by seqspec init.
  - nextera_read1: A read sequence obtained from the first end in paired-end Nextera library sequencing.
  - nextera_read2: A read sequence obtained from the second end in paired-end Nextera library sequencing.
  - poly_A: A sequence of multiple adenine nucleotides.
  - poly_G: A sequence of multiple guanine nucleotides.
  - poly_T: A sequence of multiple thymine nucleotides.
  - poly_C: A sequence of multiple cytosine nucleotides.
  - protein: The modality corresponding to assaying cell-surface proteins.
  - rna: The modality corresponding to assaying RNA.
  - s5: A sequencing primer or adaptor typically used in the Nextera kit in conjunction with ME1.
  - s7: A sequencing primer or adaptor typically used in the Nextera kit in conjunction with ME2.
  - sgrna_target: A sequence corresponding to the guide RNA spacer region that determines the genomic target of CRISPR-based perturbations.
  - tag: A short sequence of DNA or RNA used to label or identify a sample, protein, or other grouping.
  - truseq_read1: The first read primer in a paired-end sequencing run using the Illumina TruSeq Library preparation kit.
  - truseq_read2: The second read primer in a paired-end sequencing run using the Illumina TruSeq Library preparation kit.
  - umi: Unique Molecular Identifier, a short nucleotide sequence used to tag individual molecules.
- sequence_type can be one of the following:
  - fixed: indicates that the sequence string is known and fixed in length and nucleotide composition (if specified, then sequence must contain the fixed nucleotide sequence).
  - joined: indicates that the sequence is created (joined) from nested regions (if specified, then the regions: property for that Region must contain Regions, i.e. must be non-null).
  - onlist: indicates that the sequence is derived from an onlist (if specified, then onlist must be non-null and sequence must consist entirely of Ns).
  - random: indicates that the sequence is not known a priori (if specified, then sequence must consist entirely of Xs).
- sequence: a representation of the sequence, must match the pattern ^[ACGTRYMKSWHBVDNX]+$
  - if the sequence_type is fixed then the actual sequence string is provided
  - if the sequence_type is joined then the field must be the concatenation of the nested regions
  - if the sequence_type is onlist then the field must be an N string whose length equals that of the shortest sequence on the onlist
  - if the sequence_type is random then the field must be an X string
- min_len: an integer greater than or equal to 0 and less than or equal to 2048. It represents the minimum possible length of the sequence.
- max_len: an integer greater than or equal to 0 and less than or equal to 2048. It represents the maximum length of the sequence.
- onlist: can be null or contain a File object (see File Object section below)
  - file_id: a freeform string that uniquely identifies the file.
  - filename: a freeform string that matches the name of the file being annotated.
  - filesize: an integer that represents the size of the compressed file (in bytes).
  - filetype: a freeform string that specifies the file type (usually the extension of the filename, e.g. R1.fastq.gz has filetype: fastq).
  - url: a freeform string that specifies either the url location of the file, or the local path of the file (relative to this seqspec file).
  - urltype: can be one of [“local”, “ftp”, “http”, “https”]; specifies the type of the url.
  - md5: the md5sum of the uncompressed file in filename, must match the pattern ^[a-f0-9]{32}$
- regions can either be null or contain a list of regions as specified above.
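The sequence_type rules above can be checked mechanically. The following is a minimal sketch in plain Python, not part of seqspec itself: it assumes a region has already been loaded into a dictionary with the field names listed above, and check_region is a hypothetical helper name chosen for illustration.

```python
import re

# Pattern and length bound taken from the property list above.
SEQ_PATTERN = re.compile(r"^[ACGTRYMKSWHBVDNX]+$")
MAX_LEN = 2048


def check_region(region):
    """Return a list of rule violations for a region dict (hypothetical helper)."""
    errors = []
    rid = region.get("region_id", "<missing>")
    seq = region.get("sequence") or ""
    stype = region.get("sequence_type")
    children = region.get("regions") or []

    if not SEQ_PATTERN.match(seq):
        errors.append(f"{rid}: sequence does not match the allowed alphabet")

    for key in ("min_len", "max_len"):
        value = region.get(key)
        if not isinstance(value, int) or not 0 <= value <= MAX_LEN:
            errors.append(f"{rid}: {key} must be an integer in [0, {MAX_LEN}]")

    if stype == "onlist":
        if region.get("onlist") is None:
            errors.append(f"{rid}: onlist must be non-null")
        if set(seq) != {"N"}:
            errors.append(f"{rid}: onlist sequence must be all N")
    elif stype == "random":
        if set(seq) != {"X"}:
            errors.append(f"{rid}: random sequence must be all X")
    elif stype == "joined":
        if not children:
            errors.append(f"{rid}: joined region needs nested regions")
        elif seq != "".join(child.get("sequence") or "" for child in children):
            errors.append(f"{rid}: sequence is not the concatenation of its regions")
    elif stype != "fixed":
        errors.append(f"{rid}: unknown sequence_type {stype!r}")

    # Recurse so nested regions are checked with the same rules.
    for child in children:
        errors.extend(check_region(child))
    return errors


barcode = {
    "region_id": "barcode-1", "region_type": "barcode", "sequence_type": "onlist",
    "sequence": "NNNNNNNN", "min_len": 8, "max_len": 8,
    "onlist": {"file_id": "barcode-1_onlist.txt"}, "regions": None,
}
umi = {
    "region_id": "umi", "region_type": "umi", "sequence_type": "random",
    "sequence": "XXXXXXXXXX", "min_len": 10, "max_len": 10,
    "onlist": None, "regions": None,
}
rna = {
    "region_id": "rna", "region_type": "rna", "sequence_type": "joined",
    "sequence": "NNNNNNNNXXXXXXXXXX", "min_len": 18, "max_len": 18,
    "onlist": None, "regions": [barcode, umi],
}
print(check_region(rna))  # []
```

The sketch only covers the rules stated in the list; the JSON schema remains the authoritative definition.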
Example:
!Region
region_id: barcode-1
region_type: barcode
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
  file_id: barcode-1_onlist.txt
  filename: barcode-1_onlist.txt
  filetype: txt
  filesize: 120
  url: ./
  urltype: local
  md5: 5b62453df2771f5aa856f78797f16591
regions: null

For more information about the various fields, please see the JSON schema specification (seqspec/schema/seqspec.schema.json). For consistency across assays I suggest following a standard naming convention for common regions. I’ve made a collection of “named” regions available; please see seqspec/docs/regions for a list of example regions.
Read Object¶
The sequence_spec contains a list of Read objects which describe the sequencing “reads” that are generated from sequencing the molecule described in the library_spec. A crucial concept is that Read objects contain a primer_id which maps to a single region_id in the library_spec. Importantly, Reads can contain Files which I describe in the subsequent section.
sequence_spec:
  - read_id: Read1
    modality: Modality1
    primer_id: Region2
    strand: pos
    min_len: 10
    max_len: 100
    files:
      - file_id: R1.fastq.gz
        ...

A Read object is annotated with the following attributes:
- read_id: A freeform string that functions as a unique identifier for the read.
- name: A freeform string that functions as the name of the read.
- modality: A string that matches the modality of the assay generating the read.
- primer_id: A string that matches the region id of the primer used to generate the read (in the library_spec).
- min_len: An integer greater than or equal to zero specifying the minimum length of the read.
- max_len: An integer greater than or equal to zero specifying the maximum length of the read.
- strand: One of [“pos”, “neg”], denotes the strandedness of the read.
- files: A list of File objects that contain sequences that match the structure of the parent Read.
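Because each primer_id must resolve to a region_id somewhere in the (possibly nested) library_spec, the cross-reference can be verified by walking the region tree. Below is a minimal sketch that assumes the spec has been loaded into plain dictionaries; iter_regions and check_reads are hypothetical helper names and not part of the seqspec API.

```python
def iter_regions(regions):
    """Yield every region in a possibly nested library_spec, depth first."""
    for region in regions or []:
        yield region
        yield from iter_regions(region.get("regions"))


def check_reads(spec):
    """Return a list of problems found in the reads of a spec dict (hypothetical helper)."""
    errors = []
    region_ids = [r["region_id"] for r in iter_regions(spec.get("library_spec"))]

    # region_id must be unique across all regions in the file.
    for rid in sorted({r for r in region_ids if region_ids.count(r) > 1}):
        errors.append(f"duplicate region_id: {rid}")

    known = set(region_ids)
    modalities = spec.get("modalities") or []
    for read in spec.get("sequence_spec") or []:
        read_id = read.get("read_id", "<missing>")
        if read.get("primer_id") not in known:
            errors.append(f"{read_id}: primer_id does not match any region_id")
        if read.get("modality") not in modalities:
            errors.append(f"{read_id}: modality is not listed in modalities")
        if read.get("strand") not in ("pos", "neg"):
            errors.append(f"{read_id}: strand must be 'pos' or 'neg'")
    return errors


# Mirrors the abbreviated example from the schema overview.
spec = {
    "modalities": ["Modality1", "Modality2"],
    "sequence_spec": [
        {"read_id": "Read1", "modality": "Modality1", "primer_id": "Region2",
         "strand": "pos", "min_len": 10, "max_len": 100},
    ],
    "library_spec": [
        {"region_id": "Modality1",
         "regions": [{"region_id": "Region1"}, {"region_id": "Region2"}]},
        {"region_id": "Modality2"},
    ],
}
print(check_reads(spec))  # []
```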
Example:
- !Read
  read_id: read_001
  name: Read 1 of Sample A
  modality: rna
  primer_id: primer_25
  min_len: 50
  max_len: 300
  strand: pos
  files:
    - !File
      file_id: read_001.fastq.gz
      ...

File Object¶
Files are annotated with the File object. Files can be local or remote (e.g., FASTQ, BAM, POD5, TXT, SRA). File objects contain the following attributes:
- file_id: a freeform string that uniquely identifies the file.
- filename: a freeform string that matches the name of the file being annotated.
- filesize: an integer that represents the size of the compressed file (in bytes).
- filetype: a freeform string that specifies the file type (usually the extension of the filename, e.g. R1.fastq.gz has filetype: fastq).
- url: a freeform string that specifies either the url location of the file, or the local path of the file (relative to this seqspec file).
- urltype: can be one of [“local”, “ftp”, “http”, “https”]; specifies the type of the url.
- md5: the md5sum of the uncompressed file in filename, must match the pattern ^[a-f0-9]{32}$
File objects are used in the Onlist object within “onlist” Regions. They are also used in the Read objects as a list of File objects.
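As an illustration of how the File fields relate to a file on disk, here is a minimal sketch using only the Python standard library. describe_file is a hypothetical helper, not a seqspec function. It makes several assumptions: the md5 is computed over the decompressed bytes when the name ends in .gz (one reading of "uncompressed file" above), filesize is the on-disk size, and url simply records the path that was passed in rather than a path relative to the seqspec file.

```python
import gzip
import hashlib
import os
import re
import tempfile

MD5_PATTERN = re.compile(r"^[a-f0-9]{32}$")


def describe_file(path):
    """Build a dict of File fields for a local file (hypothetical helper)."""
    filename = os.path.basename(path)
    is_gz = filename.endswith(".gz")

    # Assumption: hash the decompressed content for gzip files.
    opener = gzip.open if is_gz else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)

    # filetype is usually the extension, ignoring a trailing .gz.
    stem = filename[:-3] if is_gz else filename
    filetype = stem.rsplit(".", 1)[-1] if "." in stem else ""

    return {
        "file_id": filename,
        "filename": filename,
        "filetype": filetype,
        "filesize": os.path.getsize(path),  # size as stored on disk, in bytes
        "url": path,                        # assumption: path as given
        "urltype": "local",
        "md5": digest.hexdigest(),
    }


payload = b"@r1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "R1.fastq.gz")
    with gzip.open(path, "wb") as handle:
        handle.write(payload)
    record = describe_file(path)

print(record["filetype"])                      # fastq
print(bool(MD5_PATTERN.match(record["md5"])))  # True
```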
Python library¶
seqspec files can be loaded into Python as a Python object, after which fields can be read and manipulated with dot notation:
from seqspec.utils import load_spec
spec = load_spec("seqspec/assays/10x-RNA-v3/spec.yaml")
print(spec.get_libspec("RNA").sequence)
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG