Constructing a transcriptome index with kb
This tutorial shows how to generate a transcriptome index for use with kallisto | bustools using kb.
Download reference files
Download the genomic (DNA) FASTA file and the GTF annotation file for your organism of interest from the database of your choice. This tutorial uses mouse reference files (GRCm38, release 98) downloaded from Ensembl.
%%time
!wget -q ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
!wget -q ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
CPU times: user 173 ms, sys: 37.3 ms, total: 211 ms
Wall time: 28.8 s
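Because wget runs with -q, a failed download produces no visible message. The following is a minimal sketch of a quick sanity check, assuming the two files were saved in the current working directory under the names used in the wget commands; peek_gzip is a hypothetical helper, not part of kb. It only reads the first few lines, so it catches a missing file or an invalid gzip header but does not prove the download is complete.

```python
import gzip
import os
from itertools import islice


def peek_gzip(path, n_lines=2):
    """Return the first n_lines text lines of a gzip-compressed file.

    Raises OSError (gzip.BadGzipFile) if the file does not start with a
    valid gzip header.
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# File names taken from the wget commands above (assumed to be in the
# current working directory).
for name in (
    "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz",
    "Mus_musculus.GRCm38.98.gtf.gz",
):
    if os.path.exists(name):
        print(name, os.path.getsize(name), "bytes")
        print(peek_gzip(name))
    else:
        print(name, "is missing")
```

For the FASTA file the first line should be a header starting with ">", and for the GTF file the first lines are typically comment lines starting with "#".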
Install kb
!pip install --quiet kb-python
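With --quiet, pip prints little on success, so it can be useful to confirm that the package is actually available before continuing. This is a small sketch using only the Python standard library; installed_version is a hypothetical helper name, and "kb-python" is the distribution name from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the kb-python version found in the current environment, if any.
print("kb-python:", installed_version("kb-python") or "not installed")
```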
Build the index
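kb ref uses the GTF annotation to extract the transcript (cDNA) sequences from the genome FASTA, writes them to a cDNA FASTA file, and builds a kallisto index from that file. In the command below, -i names the output index, -g names the output transcript-to-gene mapping, and -f1 names the output cDNA FASTA; the genome FASTA and the GTF are passed as positional arguments.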
%%time
!kb ref -i transcriptome.idx -g transcripts_to_genes.txt -f1 cdna.fa \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.98.gtf.gz
[2021-03-31 19:42:16,748] INFO Preparing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz, Mus_musculus.GRCm38.98.gtf.gz
[2021-03-31 19:42:16,748] INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2021-03-31 19:42:20,207] INFO Creating transcript-to-gene mapping at /content/tmp/tmp8orgc74k
[2021-03-31 19:43:01,135] INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2021-03-31 19:43:25,710] INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa to /content/tmp/tmpees_4ry7
[2021-03-31 19:50:50,364] INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf to /content/tmp/tmpssw7nu7e
[2021-03-31 19:51:51,290] INFO Splitting genome tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa into cDNA at /content/tmp/tmpg65jndu0
[2021-03-31 19:51:51,290] WARNING The following chromosomes were found in the FASTA but does not have any "transcript" features in the GTF: JH584302.1, GL456387.1, GL456396.1, GL456367.1, GL456366.1, GL456394.1, GL456383.1, GL456382.1, GL456393.1, GL456368.1, GL456379.1, GL456390.1, GL456378.1, GL456360.1, GL456389.1, JH584301.1, JH584300.1, GL456392.1, GL456370.1, GL456359.1, GL456213.1. No sequences will be generated for these chromosomes.
[2021-03-31 19:53:04,043] INFO Wrote 142446 cDNA transcripts
[2021-03-31 19:53:04,047] INFO Concatenating 1 transcript-to-gene mappings to transcripts_to_genes.txt
[2021-03-31 19:53:04,264] INFO Concatenating 1 cDNAs to cdna.fa
[2021-03-31 19:53:05,204] INFO Indexing cdna.fa to transcriptome.idx
CPU times: user 7.92 s, sys: 1.04 s, total: 8.96 s
Wall time: 21 min 57 s
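Once the run finishes, the outputs can be inspected to check that they look reasonable. The sketch below counts the records in cdna.fa and the distinct transcripts and genes in transcripts_to_genes.txt. It assumes the mapping file is tab-separated with the transcript ID in the first column and the gene ID in the second; summarize_t2g and count_fasta_records are hypothetical helper names, not part of kb.

```python
import os


def summarize_t2g(path):
    """Return (n_transcripts, n_genes) from a transcript-to-gene file.

    Assumes tab-separated columns with the transcript ID first and the
    gene ID second; lines with fewer than two columns are skipped.
    """
    transcripts, genes = set(), set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            transcripts.add(fields[0])
            genes.add(fields[1])
    return len(transcripts), len(genes)


def count_fasta_records(path):
    """Count FASTA records by counting header lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Output file names taken from the kb ref command above.
if os.path.exists("transcripts_to_genes.txt") and os.path.exists("cdna.fa"):
    n_transcripts, n_genes = summarize_t2g("transcripts_to_genes.txt")
    print("transcripts in mapping:", n_transcripts)
    print("genes in mapping:", n_genes)
    print("records in cdna.fa:", count_fasta_records("cdna.fa"))
```

The FASTA record count can be compared against the "Wrote ... cDNA transcripts" line in the log; a large discrepancy would suggest that the run was interrupted or that the files come from different runs.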