Python arguments are equivalent to long-option arguments (--arg), unless otherwise specified. Flags are True/False arguments in Python. The manual for any gget tool can be displayed from the command line using the -h --help flag.

gget mutate 🧟

Takes in nucleotide sequences and mutations (in standard mutation annotation) and returns mutated versions of the input sequences according to the provided mutations.
Return format: Saves mutated sequences in FASTA format (or returns a list containing the mutated sequences if out=None).

This module was written by Joseph Rich.

** Update: The more complex functionality of gget mutate has been ported to https://github.com/pachterlab/kvar. kvar expands on this functionality in the context of screening for variants/mutations in sequencing data. If this sounds interesting to you, please check it out! **

Positional argument
sequences
Path to the FASTA file containing the sequences to be mutated, e.g., 'path/to/seqs.fa'.
Sequence identifiers following the '>' character must correspond to the identifiers in the seq_ID column of mutations.

Example format of the FASTA file:

>seq1 (or ENSG00000106443)  
ACTGCGATAGACT  
>seq2  
AGATCGCTAG

Alternatively: Input sequence(s) as a string or list, e.g. 'AGCTAGCT'.

NOTE: Only the characters up to the first space or dot are used as sequence identifiers, so version numbers of Ensembl IDs are ignored.
NOTE: When the sequences input is a genome fasta file, also see the gtf argument below.
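
The identifier rule in the first note can be illustrated with a short sketch. This is not gget's own code; `fasta_identifier` is a hypothetical helper that only assumes the rule as stated (keep everything after '>' up to the first space or dot).

```python
def fasta_identifier(header: str) -> str:
    """Return the part of a FASTA header line used as the sequence identifier.

    Hypothetical helper: keeps the characters after '>' up to the first
    space or dot, mirroring the rule described in the note above.
    """
    text = header.lstrip(">").strip()
    for i, ch in enumerate(text):
        if ch in " .":
            return text[:i]
    return text


print(fasta_identifier(">ENSG00000106443.5 some description"))  # ENSG00000106443
print(fasta_identifier(">seq1"))  # seq1
```

Under this rule, a versioned Ensembl ID and its unversioned form map to the same identifier, which is why the seq_ID values in mutations should not contain spaces or dots.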

Required arguments
-m --mutations
Path to the csv or tsv file (e.g., 'path/to/mutations.csv') or data frame (DataFrame object) containing information about the mutations in the following format (the 'notes' and 'mut_ID' columns are optional):

mutation	mut_ID	seq_ID	notes
c.2C>T	mut1	seq1	-> Apply mutation 1 to sequence 1
c.9_13inv	mut2	seq2	-> Apply mutation 2 to sequence 2
c.9_13inv	mut2	seq4	-> Apply mutation 2 to sequence 4
c.9_13delinsAAT	mut3	seq4	-> Apply mutation 3 to sequence 4
...	...	...

'mutation' = Column containing the mutations to be performed written in standard mutation annotation
'mut_ID' = Column containing the identifier for each mutation
'seq_ID' = Column containing the identifiers of the sequences to be mutated (must correspond to the string following the '>' character in the 'sequences' FASTA file; do NOT include spaces or dots)
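
As a rough sketch of how a file in this layout could be produced, the snippet below builds the example table with Python's standard csv module. The helper name `mutations_csv` is hypothetical, and the optional 'notes' column is left out.

```python
import csv
import io

# Rows taken from the example table above (the optional 'notes' column is omitted).
rows = [
    {"mutation": "c.2C>T", "mut_ID": "mut1", "seq_ID": "seq1"},
    {"mutation": "c.9_13inv", "mut_ID": "mut2", "seq_ID": "seq2"},
    {"mutation": "c.9_13inv", "mut_ID": "mut2", "seq_ID": "seq4"},
    {"mutation": "c.9_13delinsAAT", "mut_ID": "mut3", "seq_ID": "seq4"},
]


def mutations_csv(rows):
    """Hypothetical helper: serialize mutation rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["mutation", "mut_ID", "seq_ID"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = mutations_csv(rows)
print(csv_text)
```

Writing `csv_text` to a file such as 'path/to/mutations.csv' gives a file whose column names match the defaults of the mut_column, mut_id_column, and seq_id_column arguments.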

Alternatively: Input mutation(s) as a string or list, e.g., 'c.2C>T'.
If a list is provided, the number of mutations must equal the number of input sequences.

For use from the terminal (bash): Enclose individual mutation annotations in quotation marks to prevent parsing errors.
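
To make the structure of these annotation strings concrete, here is a minimal parsing sketch. It is not how gget parses mutations; the `PATTERNS` table and `parse_mutation` function are hypothetical and cover only the three annotation shapes that appear in this manual (substitution, inversion, and deletion-insertion).

```python
import re

# Hypothetical patterns covering only the annotation shapes shown in this manual.
PATTERNS = {
    "substitution": re.compile(r"^c\.(?P<start>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"),
    "inversion": re.compile(r"^c\.(?P<start>\d+)_(?P<end>\d+)inv$"),
    "delins": re.compile(r"^c\.(?P<start>\d+)_(?P<end>\d+)delins(?P<alt>[ACGT]+)$"),
}


def parse_mutation(annotation: str) -> dict:
    """Split an annotation string into its type, 1-based positions, and bases."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(annotation)
        if match:
            fields = {
                key: (int(value) if value.isdigit() else value)
                for key, value in match.groupdict().items()
            }
            return {"type": kind, **fields}
    raise ValueError(f"Unsupported annotation: {annotation!r}")


print(parse_mutation("c.2C>T"))
print(parse_mutation("c.9_13inv"))
print(parse_mutation("c.9_13delinsAAT"))
```

The '>' character in substitutions is also the reason for the quoting advice above: unquoted, a shell would treat it as output redirection.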

Optional input-related arguments
-mc --mut_column
Name of the column containing the mutations to be performed in mutations. Default: 'mutation'.

-sic --seq_id_column
Name of the column containing the IDs of the sequences to be mutated in mutations. Default: 'seq_ID'.

-mic --mut_id_column
Name of the column containing the IDs of each mutation in mutations. Default: Same as mut_column.

Optional mutant sequence generation/filtering arguments
-k --k
Length of sequences flanking the mutation. Default: 30.
If k > total length of the sequence, the entire sequence will be kept.
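
The effect of k can be sketched for a single-base substitution. This is an illustration only, assuming the flanks are the k bases on either side of the mutated position, cut off at the ends of the sequence; `substitute_with_flanks` is a hypothetical function, not part of gget.

```python
def substitute_with_flanks(sequence: str, position: int, ref: str, alt: str, k: int = 30) -> str:
    """Apply a substitution at a 1-based position and keep up to k flanking bases.

    Assumption: flanks are the k bases before and after the mutated base,
    truncated where the sequence ends.
    """
    index = position - 1  # annotation positions are 1-based
    if sequence[index] != ref:
        raise ValueError(f"Expected {ref!r} at position {position}, found {sequence[index]!r}")
    left = sequence[max(0, index - k):index]
    right = sequence[index + 1:index + 1 + k]
    return left + alt + right


# k larger than the sequence: the whole mutated sequence is kept.
print(substitute_with_flanks("ATCGCTAAGCT", 4, "G", "T"))  # ATCTCTAAGCT
# Small k: only two bases on each side of the mutation are kept.
print(substitute_with_flanks("ATCGCTAAGCT", 4, "G", "T", k=2))  # TCTCT
```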

Optional general arguments
-o --out
Path to output FASTA file containing the mutated sequences, e.g., 'path/to/output_fasta.fa'.
Default: None -> returns a list of the mutated sequences to standard out.
The identifiers (following the '>') of the mutated sequences in the output FASTA will be '>[seq_ID]_[mut_ID]'.
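
The header convention can be sketched as follows. `format_fasta` is a hypothetical helper that only demonstrates the '>[seq_ID]_[mut_ID]' naming; it is not the code gget uses to write its output.

```python
def format_fasta(records):
    """Build FASTA text from (seq_ID, mut_ID, mutated_sequence) tuples.

    Hypothetical helper: each header is '>[seq_ID]_[mut_ID]'.
    """
    lines = []
    for seq_id, mut_id, sequence in records:
        lines.append(f">{seq_id}_{mut_id}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


fasta_text = format_fasta([
    ("seq1", "mut1", "ATCTCTAAGCT"),
    ("seq2", "mut2", "GATCTA"),
])
print(fasta_text)
```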

Optional general flags
-q --quiet
Command-line only. Prevents progress information from being displayed.
Python: Use verbose=False to prevent progress information from being displayed.

Examples

gget mutate ATCGCTAAGCT -m 'c.4G>T'

# Python
gget.mutate("ATCGCTAAGCT", "c.4G>T")

→ Returns ATCTCTAAGCT.

A list of sequences, with one mutation per sequence also provided as a list:

gget mutate ATCGCTAAGCT TAGCTA -m 'c.4G>T' 'c.1_3inv' -o mut_fasta.fa

# Python
gget.mutate(["ATCGCTAAGCT", "TAGCTA"], ["c.4G>T", "c.1_3inv"], out="mut_fasta.fa")

→ Saves 'mut_fasta.fa' file containing:

>seq1_mut1  
ATCTCTAAGCT  
>seq2_mut2  
GATCTA

One mutation applied to several sequences with adjusted k:

gget mutate ATCGCTAAGCT TAGCTA -m 'c.1_3inv' -k 3

# Python
gget.mutate(["ATCGCTAAGCT", "TAGCTA"], "c.1_3inv", k=3)

→ Returns ['CTAGCT', 'GATCTA'].

References

If you use gget mutate in a publication, please cite the following article:

Luebbert, L., & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836