A seqspec
file provides complete information about the structure of a sequencing library and seqeuencing reads generated from it. Herein we document common tasks that users may find useful when working with seqspec
files. In this document we will work with the examples/specs/SPLiT-seq/spec.yaml
file.
! git clone https://github.com/pachterlab/seqspec/
! cd seqspec && git checkout devel && pip install --quiet -e .
! ln -s seqspec/examples/specs/SPLiT-seq/*onlist* .
! ln -s seqspec/examples/specs/SPLiT-seq/spec.yaml .
Cloning into 'seqspec'...
remote: Enumerating objects: 1670, done.
remote: Counting objects: 100% (847/847), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 1670 (delta 498), reused 689 (delta 412), pack-reused 823
Receiving objects: 100% (1670/1670), 6.52 MiB | 6.10 MiB/s, done.
Resolving deltas: 100% (1019/1019), done.
Branch 'devel' set up to track remote branch 'devel' from 'origin'.
Switched to a new branch 'devel'
Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 12.4 MB/s eta 0:00:00
What does a seqspec look look?¶
! head spec.yaml
!Assay
seqspec_version: 0.2.0
assay_id: SPLiTSeq
name: SPLiTSeq
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
Is my seqspec formatted correctly?¶
! seqspec check spec.yaml
[error 1] 'read1_primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][2]['region_type']
[error 2] 'primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][4]['region_type']
[error 3] 'primer' is not one of ['atac', 'barcode', 'cdna', 'crispr', 'custom_primer', 'dna', 'fastq', 'fastq_link', 'gdna', 'hic', 'illumina_p5', 'illumina_p7', 'index5', 'index7', 'linker', 'ME1', 'ME2', 'methyl', 'named', 'nextera_read1', 'nextera_read2', 'poly_A', 'poly_G', 'poly_T', 'poly_C', 'protein', 'rna', 's5', 's7', 'tag', 'truseq_read1', 'truseq_read2', 'umi'] in spec['library_spec'][0]['regions'][11]['region_type']
[error 3] R1.fastq.gz file does not exist
[error 4] I1.fastq.gz file does not exist
[error 5] R2.fastq.gz file does not exist
What are the contents of my library?¶
! seqspec print spec.yaml
┌─'P5:29'
├─'Spacer:8'
├─'Read_1_primer:33'
├─'cDNA:100'
├─'RT_primer:15'
├─'Round_1_BC:8'
├─'linker_1:30'
──────────────────── ──rna────────────────┤
├─'Round_2_BC:8'
├─'Linker_2:30'
├─'Round_3_BC:8'
├─'UMI:10'
├─'Read_2_primer:22'
├─'Round_4_BC:6'
└─'P7:24'
What modalities are annotated in my spec?¶
! seqspec info -k modalities spec.yaml
rna
What library kits/protocols and sequencing kit/protocols were used to generated the library?¶
! seqspec info -f json -k meta spec.yaml
{
"seqspec_version": "0.2.0",
"assay_id": "SPLiTSeq",
"name": "SPLiTSeq",
"doi": "https://doi.org/10.1126/science.aam8999",
"date": "15 March 2018",
"description": "split-pool ligation-based transcriptome sequencing",
"lib_struct": "https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html",
"sequence_protocol": "NextSeq 500",
"sequence_kit": "NextSeq 550 reagents",
"library_protocol": "SPLiTSeq RNA",
"library_kit": "Custom/Nextera/Truseq Single Index"
}
What are the sequencing reads annotated in my spec?¶
! seqspec info -k sequence_spec spec.yaml
rna R1.fastq.gz pos 140 140 Read_1_primer Read 1
rna I1.fastq.gz pos 6 6 Read_2_primer Index 1 (i7 index)
rna R2.fastq.gz neg 86 86 Read_2_primer Read 2
What does my sequencing library look like?¶
! seqspec info -k library_spec spec.yaml
rna P5 illumina_p5 P5 fixed AATGATACGGCGACCACCGAGATCTACAC 29 29 None
rna Spacer linker Spacer fixed TAGATCGC 8 8 None
rna Read_1_primer read1_primer Read_1_primer fixed TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG 33 33 None
rna cDNA cdna cDNA random XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 1 100 None
rna RT_primer primer RT_primer onlist NNNNNNNNNNNNNNN 6 15 onlist_rt_primer.txt
rna Round_1_BC barcode Round_1_BC onlist NNNNNNNN 8 8 onlist_round1.txt
rna linker_1 linker linker_1 fixed GCTTACGAGACCGGAGAGTTCGTGCACCTA 30 30 None
rna Round_2_BC barcode Round_2_BC onlist NNNNNNNN 8 8 onlist_round1.txt
rna Linker_2 linker Linker_2 fixed TCAGCATGCGGCTACGCTTTGTAGCCGGTG 30 30 None
rna Round_3_BC barcode Round_3_BC onlist NNNNNNNN 8 8 onlist_round1.txt
rna UMI umi UMI random XXXXXXXXXX 10 10 onlist_round2.txt
rna Read_2_primer primer Read_2_primer fixed TCTAGCCTTCTCGTGTGCAGAC 22 22 onlist_round3.txt
rna Round_4_BC barcode Round_4_BC onlist NNNNNN 6 6 onlist_round4.txt
rna P7 illumina_p7 P7 fixed ATCTCGTATGCCGTCTTCTGCTTG 24 24 None
Which library elements are contained in my sequencing reads?¶
Using the tables above we specify the sequencing read and the associated modality.
Elements in R1.fastq.gz
! seqspec index -m rna -i "R1.fastq.gz" spec.yaml
R1.fastq.gz cDNA cdna 0 100
R1.fastq.gz RT_primer primer 100 115
R1.fastq.gz Round_1_BC barcode 115 123
R1.fastq.gz linker_1 linker 123 140
Elements in R2.fastq.gz
! seqspec index -m rna -i "R2.fastq.gz" spec.yaml
R2.fastq.gz UMI umi 0 10
R2.fastq.gz Round_3_BC barcode 10 18
R2.fastq.gz Linker_2 linker 18 48
R2.fastq.gz Round_2_BC barcode 48 56
R2.fastq.gz linker_1 linker 56 86
Elements in I1.fastq.gz
! seqspec index -m rna -i "I1.fastq.gz" spec.yaml
I1.fastq.gz Round_4_BC barcode 0 6
Finding elements in seqspec¶
! seqspec find -m rna -i "Round_1_BC" spec.yaml
- !Region
parent_id: rna
region_id: Round_1_BC
region_type: barcode
name: Round_1_BC
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
filename: onlist_round1.txt
md5: d3818caa32bb707c98e17aa614be58ef
location: local
regions: null
Which elements use an onlist?¶
! seqspec file -m rna -f list -s onlist -k all spec.yaml
RT_primer onlist_rt_primer.txt local 076f32ce8a96038bfc0c618da2204c77
Round_1_BC onlist_round1.txt local d3818caa32bb707c98e17aa614be58ef
Round_2_BC onlist_round1.txt local d3818caa32bb707c98e17aa614be58ef
Round_3_BC onlist_round1.txt local d3818caa32bb707c98e17aa614be58ef
UMI onlist_round2.txt local d3818caa32bb707c98e17aa614be58ef
Read_2_primer onlist_round3.txt local d3818caa32bb707c98e17aa614be58ef
Round_4_BC onlist_round4.txt local 39435765800482d7ff4cea4d33fc403c
How do I generate an onlist for multiple barcodes?¶
! seqspec onlist -m rna -s region-type -i barcode spec.yaml
/content/onlist_joined.txt
! head -2 *onlist*
==> onlist_joined.txt <==
ATCACGTTATCACGTTATCACGTTCAGATC
ATCACGTTATCACGTTATCACGTTACTTGA
==> onlist_round1.txt <==
ATCACGTT
CGATGTTT
==> onlist_round2.txt <==
ATCACGTT
CGATGTTT
==> onlist_round3.txt <==
ATCACGTT
CGATGTTT
==> onlist_round4.txt <==
CAGATC
ACTTGA
==> onlist_rt_primer.txt <==
XXXXXX
AAAAAAAAAAAAAAA
How do I initialize a new spec?¶
! seqspec init -o myassay.yaml -n myassay -m rna -r "rna,R1.fastq.gz,r1_primer,16,pos:rna,R2.fastq.gz,r2_primer,100,neg" "((r1_primer:0,barcode:16,umi:12,cdna:150,r2_primer:0)rna)"
! seqspec print myassay.yaml
┌─'r1_primer:10'
├─'barcode:16'
├─'umi:12'
──────────────── ──rna────────────┤
├─'cDNA:150'
├─'r2_primer:10'
└─'index7:8'