Skip to article frontmatterSkip to article content

Getting started

Open In Colab

A seqspec file provides complete information about the structure of a sequencing library and seqeuencing reads generated from it. Herein we document common tasks that users may find useful when working with seqspec files. In this document we will work with the examples/specs/SPLiT-seq/spec.yaml file.

! git clone https://github.com/pachterlab/seqspec/
! cd seqspec && git checkout devel && pip install --quiet -e .
! ln -s seqspec/examples/specs/SPLiT-seq/*onlist* .
! ln -s seqspec/examples/specs/SPLiT-seq/spec.yaml .
Cloning into 'seqspec'...
remote: Enumerating objects: 1670, done.
remote: Counting objects: 100% (847/847), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 1670 (delta 498), reused 689 (delta 412), pack-reused 823
Receiving objects: 100% (1670/1670), 6.52 MiB | 6.10 MiB/s, done.
Resolving deltas: 100% (1019/1019), done.
Branch 'devel' set up to track remote branch 'devel' from 'origin'.
Switched to a new branch 'devel'
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 12.4 MB/s eta 0:00:00

What does a seqspec look look?

! head spec.yaml
!Assay
seqspec_version: 0.2.0
assay_id: SPLiTSeq
name: SPLiTSeq
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html

Is my seqspec formatted correctly?

! seqspec check spec.yaml
[error 1] 'read1_primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][2]['region_type']
[error 2] 'primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][4]['region_type']
[error 3] 'primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][11]['region_type']
[error 3] R1.fastq.gz file does not exist
[error 4] I1.fastq.gz file does not exist
[error 5] R2.fastq.gz file does not exist

What are the contents of my library?

! seqspec print spec.yaml
                                          ┌─'P5:29'
                                          ├─'Spacer:8'
                                          ├─'Read_1_primer:33'
                                          ├─'cDNA:100'
                                          ├─'RT_primer:15'
                                          ├─'Round_1_BC:8'
                                          ├─'linker_1:30'
──────────────────── ──rna────────────────┤
                                          ├─'Round_2_BC:8'
                                          ├─'Linker_2:30'
                                          ├─'Round_3_BC:8'
                                          ├─'UMI:10'
                                          ├─'Read_2_primer:22'
                                          ├─'Round_4_BC:6'
                                          └─'P7:24'

What modalities are annotated in my spec?

! seqspec info -k modalities spec.yaml
rna

What library kits/protocols and sequencing kit/protocols were used to generated the library?

! seqspec info -f json -k meta spec.yaml
{
    "seqspec_version": "0.2.0",
    "assay_id": "SPLiTSeq",
    "name": "SPLiTSeq",
    "doi": "https://doi.org/10.1126/science.aam8999",
    "date": "15 March 2018",
    "description": "split-pool ligation-based transcriptome sequencing",
    "lib_struct": "https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html",
    "sequence_protocol": "NextSeq 500",
    "sequence_kit": "NextSeq 550 reagents",
    "library_protocol": "SPLiTSeq RNA",
    "library_kit": "Custom/Nextera/Truseq Single Index"
}

What are the sequencing reads annotated in my spec?

! seqspec info -k sequence_spec spec.yaml
rna	R1.fastq.gz	pos	140	140	Read_1_primer	Read 1
rna	I1.fastq.gz	pos	6	6	Read_2_primer	Index 1 (i7 index)
rna	R2.fastq.gz	neg	86	86	Read_2_primer	Read 2

What does my sequencing library look like?

! seqspec info -k library_spec spec.yaml
rna	P5	illumina_p5	P5	fixed	AATGATACGGCGACCACCGAGATCTACAC	29	29	None
rna	Spacer	linker	Spacer	fixed	TAGATCGC	8	8	None
rna	Read_1_primer	read1_primer	Read_1_primer	fixed	TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG	33	33	None
rna	cDNA	cdna	cDNA	random	XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX	1	100	None
rna	RT_primer	primer	RT_primer	onlist	NNNNNNNNNNNNNNN	6	15	onlist_rt_primer.txt
rna	Round_1_BC	barcode	Round_1_BC	onlist	NNNNNNNN	8	8	onlist_round1.txt
rna	linker_1	linker	linker_1	fixed	GCTTACGAGACCGGAGAGTTCGTGCACCTA	30	30	None
rna	Round_2_BC	barcode	Round_2_BC	onlist	NNNNNNNN	8	8	onlist_round1.txt
rna	Linker_2	linker	Linker_2	fixed	TCAGCATGCGGCTACGCTTTGTAGCCGGTG	30	30	None
rna	Round_3_BC	barcode	Round_3_BC	onlist	NNNNNNNN	8	8	onlist_round1.txt
rna	UMI	umi	UMI	random	XXXXXXXXXX	10	10	onlist_round2.txt
rna	Read_2_primer	primer	Read_2_primer	fixed	TCTAGCCTTCTCGTGTGCAGAC	22	22	onlist_round3.txt
rna	Round_4_BC	barcode	Round_4_BC	onlist	NNNNNN	6	6	onlist_round4.txt
rna	P7	illumina_p7	P7	fixed	ATCTCGTATGCCGTCTTCTGCTTG	24	24	None

Which library elements are contained in my sequencing reads?

Using the tables above we specify the sequencing read and the associated modality.

Elements in R1.fastq.gz

! seqspec index -m rna -i "R1.fastq.gz" spec.yaml
R1.fastq.gz	cDNA	cdna	0	100
R1.fastq.gz	RT_primer	primer	100	115
R1.fastq.gz	Round_1_BC	barcode	115	123
R1.fastq.gz	linker_1	linker	123	140

Elements in R2.fastq.gz

! seqspec index -m rna -i "R2.fastq.gz" spec.yaml
R2.fastq.gz	UMI	umi	0	10
R2.fastq.gz	Round_3_BC	barcode	10	18
R2.fastq.gz	Linker_2	linker	18	48
R2.fastq.gz	Round_2_BC	barcode	48	56
R2.fastq.gz	linker_1	linker	56	86

Elements in I1.fastq.gz

! seqspec index -m rna -i "I1.fastq.gz" spec.yaml
I1.fastq.gz	Round_4_BC	barcode	0	6

Finding elements in seqspec

! seqspec find -m rna -i "Round_1_BC" spec.yaml
- !Region
  parent_id: rna
  region_id: Round_1_BC
  region_type: barcode
  name: Round_1_BC
  sequence_type: onlist
  sequence: NNNNNNNN
  min_len: 8
  max_len: 8
  onlist: !Onlist
    filename: onlist_round1.txt
    md5: d3818caa32bb707c98e17aa614be58ef
    location: local
  regions: null

Which elements use an onlist?

! seqspec file -m rna -f list -s onlist -k all spec.yaml
RT_primer	onlist_rt_primer.txt	local	076f32ce8a96038bfc0c618da2204c77
Round_1_BC	onlist_round1.txt	local	d3818caa32bb707c98e17aa614be58ef
Round_2_BC	onlist_round1.txt	local	d3818caa32bb707c98e17aa614be58ef
Round_3_BC	onlist_round1.txt	local	d3818caa32bb707c98e17aa614be58ef
UMI	onlist_round2.txt	local	d3818caa32bb707c98e17aa614be58ef
Read_2_primer	onlist_round3.txt	local	d3818caa32bb707c98e17aa614be58ef
Round_4_BC	onlist_round4.txt	local	39435765800482d7ff4cea4d33fc403c

How do I generate an onlist for multiple barcodes?

! seqspec onlist -m rna -s region-type -i barcode spec.yaml
/content/onlist_joined.txt
! head -2 *onlist*
==> onlist_joined.txt <==
ATCACGTTATCACGTTATCACGTTCAGATC
ATCACGTTATCACGTTATCACGTTACTTGA

==> onlist_round1.txt <==
ATCACGTT
CGATGTTT

==> onlist_round2.txt <==
ATCACGTT
CGATGTTT

==> onlist_round3.txt <==
ATCACGTT
CGATGTTT

==> onlist_round4.txt <==
CAGATC
ACTTGA

==> onlist_rt_primer.txt <==
XXXXXX
AAAAAAAAAAAAAAA

How do I initialize a new spec?

! seqspec init -o myassay.yaml -n myassay -m rna -r "rna,R1.fastq.gz,r1_primer,16,pos:rna,R2.fastq.gz,r2_primer,100,neg" "((r1_primer:0,barcode:16,umi:12,cdna:150,r2_primer:0)rna)"
! seqspec print myassay.yaml
                                  ┌─'r1_primer:10'
                                  ├─'barcode:16'
                                  ├─'umi:12'
──────────────── ──rna────────────┤
                                  ├─'cDNA:150'
                                  ├─'r2_primer:10'
                                  └─'index7:8'