Getting Started

The short tutorial below explains how to run kallisto on bulk RNA-seq data using a small example distributed with the program. kallisto can also be used to pre-process single-cell RNA-seq, and a tutorial on that is available at the kallisto | bustools page.

Download and installation

Begin by downloading and installing the program by following instructions on the download page. The files needed to confirm that kallisto is working are included with the binaries downloadable from the download page.

After downloading and installing kallisto you should be able to type kallisto and see:

kallisto 0.44.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    pseudo        Runs the pseudoalignment step 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

Building an index

kallisto quantifies read files directly without the need for read alignment, but it does perform a procedure called pseudoalignment. Pseudoalignment requires processing a transcriptome file to create a “transcriptome index”. To begin, first change directories to where the test files distributed with the kallisto executable are located:

cd kallisto/tests

Next, build an index type:

kallisto index -i transcripts.idx transcripts.fasta.gz

Quantification

Now you can quantify abundances of the transcripts using the two read files reads_1.fastq.gz and reads_2.fastq.gz (the .gz suffix means the read files have been gzipped; kallisto can read in either plain-text or gzipped read files). To quantify abundances type:

kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz

You can also call kallisto with

kallisto quant -i transcripts.idx -o output -b 100 <(gzcat reads_1.fastq.gz) <(gzcat reads_2.fastq.gz)

or with linux, you replace gzcat with zcat or any other program that writes the FASTQ to stdout. This utilizes an additional core to uncompress the FASTQ files, and speeds up the program by 10–15%.

Single end reads

If your reads are single end only you can run kallisto by specifying the --single flag,

kallisto quant -i transcripts.idx -o output -b 100 --single -l 180 -s 20 reads_1.fastq.gz

however you must supply the length and standard deviation of the fragment length (not the read length).

Results

The results of a kallisto run are placed in the specified output directory (the -o option), and therefore the test results should be located in the subdirectory “output”. The contents of the directory should look like this:

total 568
-rw-r--r--  1 username  staff  282480 May  3 10:10 abundance.h5
-rw-r--r--  1 username  staff     589 May  3 10:10 abundance.tsv
-rw-r--r--  1 username  staff     227 May  3 10:10 run_info.json

The results of the main quantification, i.e. the abundance estimate using kallisto on the data is in the abundance.tsv file. Abundances are reported in “estimated counts” (est_counts) and in Transcripts Per Million (TPM). The abundance.tsv file you get should look like this:

target_id	length	eff_length	est_counts	tpm
ENST00000513300.5	1924	1746.98	102.328	11129.2
ENST00000282507.7	2355	2177.98	1592.02	138884
ENST00000504685.5	1476	1298.98	68.6528	10041.8
ENST00000243108.4	1733	1555.98	343.499	41944.9
ENST00000303450.4	1516	1338.98	664	94221.8
ENST00000243082.4	2039	1861.98	55	5612.36
ENST00000303406.4	1524	1346.98	304.189	42908.2
ENST00000303460.4	1936	1758.98	47	5076.85
ENST00000243056.4	2423	2245.98	42	3553.05
ENST00000312492.2	1805	1627.98	228	26609.9
ENST00000040584.5	1889	1711.98	4295	476675
ENST00000430889.2	1666	1488.98	623.628	79578.2
ENST00000394331.3	2943	2765.98	85.6842	5885.85
ENST00000243103.3	3335	3157.98	962	57879.3

The file is tab delimited so that it can easily parsed. The output can also be analyzed with the sleuth tool.

The run_info.json file contains a summary of the run, including data on the number targets used for quantification, the number of bootstraps performed, the version of the program used and how it was called. You should see this:

{
	"n_targets": 14,
	"n_bootstraps": 30,
	"n_processed": 10000,
	"n_pseudoaligned": 9413,
	"n_unique": 7174,
	"p_pseudoaligned": 94.1,
	"p_unique": 71.7,
	"kallisto_version": "0.44.0",
	"index_version": 10,
	"start_time": "Tue Jan 30 09:34:31 2018",
	"call": "kallisto quant -i transcripts.kidx -b 30 -o kallisto_out reads_1.fastq.gz reads_2.fastq.gz"
}

The h5 file contains the main quantification together with the boostraps in HDF5 format. The reason for this binary format is to compress the large output of runs with many bootstraps. The h5dump command in kallisto can be used to convert the file to plain-text.

To visualize the pseudoalignments we need to run kallisto with the --genomebam option. To do this we need two additional files, a GTF file, which describes where the transcripts lie in the genome, and a text file containing the length of each chromosome. These files are part of the test directory. To run kallisto we type

kallisto quant -i transcripts.kidx -b 30 -o kallisto_out --genomebam --gtf transcripts.gtf.gz --chromosomes chrom.txt reads_1.fastq.gz reads_2.fastq.gz

this is the same run as above, but now we supply --gtf transcripts.gtf.gz for the GTF file and the chromoeme file --chromosomes chrom.txt. For a larger transcriptome we recommend downloading the GTF file from the same release and data source as the FASTA file used to construct the index. The output now contains two additional files pseudoalignments.bam and pseudoalignments.bam.bai. The files can be viewed and processed using Samtools or a genome browser such as IGV. There is no need to sort or index the BAM file since kallisto does that directly. For windows users we recommend using the IGV browser, since there are no native Samtools releases (except using Linux Subsystem on Windows 10).

That’s it.

You can now run kallisto on your dataset of choice. For convenience, we have placed some transcriptome fasta files for human and model organisms here. Publicly available RNA-Seq data can be found on the short read archive (a convenient mirror and interface to the SRA is available here). While kallisto cannot process .sra files, such files can be converted to FASTQ with the fastq-dump tool which is part of the SRA Toolkit.