Manual

Type PROBer to see the all usage commands:

usage: PROBer [-h] {prepare,estimate,simulate,iCLIP,version} ...

PROBer is a program to quantify chemical modification profiles for a general set of 'toeprinting' assays.

optional arguments:
  -h, --help            show this help message and exit

commands:
    prepare             This command is used to prepare PROBer references.
    estimate            This command is used to estimate chemical modification
                        profiles.
    simulate            This command is used to simulate reads.
    iCLIP               This command is used to allocate multi-mapping reads
                        for iCLIP data.
    version             Show version information.

PROBer has four usage commands. They are:

prepare

PROBer prepare extracts transcript sequences, prepares its reference files, and optionally builds Bowtie/Bowtie2 indices. For iCLIP data, you should turn on --genome option and this command will only build genome indices for aligners. Type PROBer prepare --help to get its usage information:

usage: PROBer prepare [-h] [--gtf <file> | --gff3 <file>]
                      [--gff3-RNA-pattern <pattern>]
                      [--transcript-to-gene-map <file>] [--genome] [--bowtie]
                      [--bowtie-path <path>] [--bowtie2]
                      [--bowtie2-path <path>] [-q]
                      reference_fasta_files reference_name

This program lets PROBer to build its references and optionally build
Bowtie/Bowtie2 indices. For iCLIP data, users can use this program to build
Bowtie/Bowtie2 indices for their genomes

positional arguments:
  reference_fasta_file(s)
                        Either a comma-separated list of Multi-FASTA formatted
                        files OR a directory name. If a directory name is
                        specified, This program will read all files with
                        suffix ".fa" or ".fasta" in this directory. The files
                        should contain either the sequences of transcripts or
                        an entire genome, depending on whether the --gtf or
                        --gff3 option is used.
  reference_name        The name of the reference used. This program will
                        generate several reference-related files that are
                        prefixed by this name. This name can contain path
                        information (e.g. /ref/mm9).

optional arguments:
  -h, --help            show this help message and exit
  --gtf <file>          <file> is in GTF format. This program will assume
                        reference_fasta_file(s) contains genome sequences and
                        extract transcript sequences using the gene annotation
                        specified in <file>. (default: None)
  --gff3 <file>         <file> is in GFF3 format. This program will assume
                        reference_fasta_file(s) contains genome sequences and
                        extract transcript sequences using the gene annotation
                        specified in <file>. (default: None)
  --gff3-RNA-pattern <pattern>
                        <pattern> is a comma-separated list of transcript
                        categories, e.g. 'mRNA,rRNA'. Only transcripts that
                        match the <pattern> will be extracted. (default: mRNA)
  --transcript-to-gene-map <file>
                        Use information from <file> to map from transcript (isoform) ids to gene ids.
                        Each line of <file> should be of the form:
                        
                        transcript_id gene_id
                        
                        with the two fields separated by a tab character. (default: None)
  --genome              This option is required and only used for iCLIP data;
                        it allows PROBer to call Bowtie/Bowtie2 to build their
                        indices. (default: False)
  --bowtie              Build Bowtie indices. (default: False)
  --bowtie-path <path>  The path to the Bowtie executables. (default: None)
  --bowtie2             Build Bowtie2 indices. (default: False)
  --bowtie2-path <path>
                        The path to the Bowtie2 executables. (default: None)
  -q, --quiet           Suppress the output of logging information. (default:
                        False)

OUTPUT:
  PROBer reference files prefixed by 'reference_name'.

estimate

PROBer estimate quantifies chemical modification profiles using sequenced toeprinting data. Its accepts both single-end and paired-end reads either as unprocessed FASTA/FASTQ files or aligned SAM/BAM/CRAM files. Type PROBer estimate --help to get its usage information:

usage: PROBer estimate [options] reference_name sample_name (--alignments input_plus.(sam|bam|cram) [input_minus.(sam|bam|cram)] | --reads plus_channel_mate1_read_file(s) [plus_channel_mate2_read_file(s)] [minus_channel_mate1_read_file(s) [minus_channel_mate2_read_file(s)]])

DESCRIPTION: This program helps users to align reads and estimate RNA
structure parameters. By default, it requires data from both control and
treatment groups. But it works with only treatment data as well.

positional arguments:
  reference_name        The name of the reference used. Users should have run
                        'PROBer prepare' with this name before running this
                        program.
  sample_name           The output name of this run. All outputs use this name
                        as their prefixes.

optional arguments:
  -h, --help            show this help message and exit
  --time                Output time consumed by each step. (default: False)
  --memory              Output memory used by each step. (default: False)
  -q, --quiet           Suppress the output of logging information. (default:
                        False)

Input:
  Input alignments or reads, options are mutually exclusive. If input are
  alignments, all alignments of a same read should group together and each
  paired-end alignment's two mates should be adjacent.

  --alignments input_plus.(sam/bam/cram) [input_minus.(sam/bam/cram)] [input_plus.(sam/bam/cram) [input_minus.(sam/bam/cram)] ...]
                        Input are alignments in SAM/BAM/CRAM formats. If only
                        one alignment file is provided, PROBer assumes control
                        data are not available. (default: None)
  --reads mate_read_file(s) [mate_read_file(s) ...]
                        Input are read files.
                        plus_channel_mate1_read_file(s) and minus_channel_mate1_read_file(s) are comma-separated lists of files containing single-end reads or first mates of paired-end reads
                        plus_mate2_read_file(s) and minus_mate2_read_file(s), present only if '--paired-end' is enabled, are comma-separated lists of files containing second mates of paired-end reads
                        By default, these files should be in FASTQ format. If '--no-quality-scores' is specified, multi-FASTA format files are expected instead.
                        Minus channel reads may be omitted if no control data are available.
                         (default: None)

Basic options:
  --no-quality-scores   Input reads do not contain quality scores. (default:
                        False)
  --paired-end          Input reads are paired-end reads. (default: False)
  -p <int>, --number-of-threads <int>
                        Number of threads this program can use. (default: 1)
  --output-bam          Output transcript BAM file. (default: False)
  --output-logMAP       Output the log MAP probability, which can be used to
                        select priors. (default: False)
  --keep-intermediate-files
                        If PROBer should keep intermediate files. (default:
                        False)

Structure-seq related:
  Set necessary parameters for generating a config file.

  --primer-length <int>
                        Random primer length. (default: 6)
  --size-selection-min <int>
                        The minimum fragment length that can pass the size
                        selection step. (default: None)
  --size-selection-max <int>
                        The maximum fragment length that can pass the size
                        selection step. (default: None)
  --gamma-init <float>  Initial value for all gammas. (default: 0.0001)
  --beta-init <float>   Initial value for all betas. (default: 0.0001)
  --read-length <int>   Read length before trimming adaptors. (default: None)
  --maximum-likelihood  Use maximum likelihood estimates. (default: False)

Alignment options:
  User can choose from Bowtie and Bowtie2. All reads with more than 200
  alignments will be filtered by this script.

  --bowtie              Use bowtie aligner to align reads, with Bowtie
                        parameters "--norc -p number_of_threads -a -m 200 -S".
                        If "--paired-end" is set, additionaly enable Bowtie
                        parameters "-I 1 -X 1000 --chunkmbs 1024". (default:
                        True)
  --bowtie-path <path>  The path to Bowtie executables. (default: None)
  --bowtie2             Use bowtie2 aligner to align reads, indel alignments
                        enabled, with Bowtie2 parameters "--norc -p
                        number_of_threads -k 201". If "--paired-end" is set,
                        additionaly enable Bowtie2 parameters "-I 1 -X 1000
                        --no-mixed --no-discordant". (default: False)
  --bowtie2-path <path>
                        The path to Bowtie2 executables. (default: None)

OUTPUTS:
  sample_name.expr
    Isoform level expression estimates. The first line contains column names separated by a tab character:
    
    transcript_id length effective_length expected_count_minus expected_count_plus TPM FPKM
    
    transcript_id gives the transcript's name. length is the transcript length. effective_length represents the number of positions that can generate a fragment. It is equal to length - primer_length + 1. expected_count_minus is the sum of posterior probabilities of reads coming from this transcript in the (-) channel. expected_count_plus is the counts from (+) channel. TPM is transcript per million. FPKM is fragment per kilobase per millon reads.

    In the rest lines of the file, each line describes a transcript according to the defined columns.

  sample_name.beta
    Estimated beta parameters for each transcript. The first line contains the total number of transcripts. Then each line describes estimated parameters for a different transcript. Within each line, the first field gives the transcript name, the second field provides the number of estimated beta parameters, which is equal to transcript length - primer length. In the end, estimated beta values at each position were given (from 5' end to 3' end).

  sample_name.gamma
    Estimated gamma parameters for each transcript. The first line contains the total number of transcripts. Then each line describes estimated parameters for a different transcript. Within each line, the first field gives the transcript name, the second field provides the number of estimated gamma parameters, which is equal to transcript length - primer length. In the end, estimated gamma values at each position were given (from 5' end to 3' end).

  sample_name_plus.bam
    Only generated when '--output-bam' option is set.

    It is a BAM-formatted file that contains annotated '+' channel read alignments in transcript coordinates. For each alignable BAM line, The MAPQ field is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the posterior probability of that alignment being the true mapping of a read. In addition, a new tag ZW:f:value is added, where the value is a single precision floating number representing the posterior probability. All filtered alignment lines has a ZF:A:! tag to identify that it is filtered. Please note that 'ZW' and 'ZF' tags are reserved for PROBer and users need to make sure the aligner output or input BAM/SAM file does not contain these two tags unless the input BAM file is produced by PROBer and alignment/filtering criteria are not changed. Because this file contains all alignment lines produced by the aligner, it can also be used as a replacement of the aligner generated BAM/SAM file.

  sample_name_minus.bam
    Only generated when '--output-bam' option is set.

    It is a BAM-formatted file that contains annotated '-' channel read alignments in transcript coordinates. The annotation format is exactly the same as the one used in 'sample_name_plus.bam'.

  sample_name.logMAP
    Only generated when '--output-logMAP' option is set.

    This file contains the log MAP probability of the observed data given current parameter settings, which can be used to select appropriate priors.

  sample_name.stat
    This folder contains learned model parameters from data. In the folder, 'sample_name_minus.theta' contains the estimated read generating probabilities from '-' channel. 'sample_name_minus.read_model' contains the estimated sequencing error model from '-' channel. 'sample_name_plus.theta' contains the estimated read generating probabilities from '+' channel. 'sample_name_plus.read_model' contains the estimated sequencing error model from '+' channel. The files contained in this folder can be used for simulation.

  sample_name.temp
    This is a temporary folder contains intermediate files. It will be deleted automatically after the program finishes unless '--keep-intermediate-files' option is on.

simulate

PROBer simulate is used to generate simulation data based on model parameters learned from real data. If the real data are paired-end reads, it will simulate paired-end reads. Otherwise, it will simulate single-end reads. Type PROBer simulate --help to get its usage information:

usage: PROBer simulate [-h] [--seed <uint32>] [--no-control]
                       reference_name config_file sample_name channel
                       number_of_reads output_name

This program simulates reads using parameters learned from real data by
program 'estimate'.

positional arguments:
  reference_name   The reference's name, should be same as the one used in
                   programs 'prepare' and 'estimate'.
  config_file      A configuration file containting primer length, size
                   selection min and max fragment size etc.
                   'sample_name.temp/sample_name_minus.config' and
                   'sample_name.temp/sample_name_plus.config' can be used
                   here.
  sample_name      This should be the 'sample_name' used in 'PROBer-estimate-
                   parameters'. No slash should be in the end of this string.
  channel          Which channel to simulate. 'minus' stands for the mock-
                   treated channel and 'plus' stands for the modification-
                   treated channel.
  number_of_reads  Number of reads to simulate.
  output_name      Output files' prefix

optional arguments:
  -h, --help       show this help message and exit
  --seed <uint32>  The seed initializing the random number generator used in
                   the simulation. (default: None)
  --no-control     Indicate if the data used to learn simulation parameters do
                   not have a control. (default: False)

OUTPUT:
  If single-end reads are simulated, this program produces 'output_name_(minus|plus).(fa|fq)'. If paired-end reads are simulated, this program produces 'output_name_(minus|plus)_1.(fa|fq)' and 'output_name_(minus|plus)_2.(fa|fq).

iCLIP

PROBer iCLIP allocates multi-mapping reads for iCLIP data. Type PROBer iCLIP --help to get its usage information:

usage: PROBer iCLIP [options] sample_name {--alignments input_alignments.[sam/bam/cram] | --reads mate1_read_file(s) [mate2_read_file(s)]}

This program allocates multi-mapping reads for iCLIP data.

positional arguments:
  sample_name           The output name of this run. All outputs use this name
                        as their prefixes.

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Suppress the output of logging information. (default:
                        False)

Input:
  Input alignments or reads, options are mutually exclusive. If input are
  alignments, all alignments of a same read should group together and each
  paired-end alignment's two mates should be adjacent.

  --alignments alignment_file.[sam/bam/cram]
                        Input are alignments in SAM/BAM/CRAM format. (default:
                        None)
  --reads mate_read_file(s) [mate_read_file(s) ...]
                        Input are comma-separated lists of files containing
                        single-end or paired-end reads. If input are single-
                        end reads, only one list is required. If input are
                        paired-end reads, i.e. '--paired-end' is set, PROBer
                        needs two lists --- one for the first mates and the
                        other for the second mates. By default, these files
                        should be in FASTQ format. If '--no-quality-scores' is
                        specified, multi-FASTA format files are expected
                        instead. (default: None)

Basic options:
  --no-quality-scores   Input reads do not contain quality scores. (default:
                        False)
  --paired-end          Input reads are paired-end reads. (default: False)
  -p <int>, --number-of-threads <int>
                        Number of threads this program can use. (default: 1)
  --keep-intermediate-files
                        If PROBer should keep intermediate files. (default:
                        False)

iCLIP options:
  --half-window-size <int>
                        PROBer will borrow information from adjacent crosslink
                        sites within plus/minus half window size to help
                        allocating multi-mapping reads. (default: 25)
  --rounds <int>        Number of EMS iterations to run. (default: 100)
  --maximum-read-length <int>
                        The maximum possible read length. You may set this
                        option only if '--no-qualities' is set. (default:
                        1000)

Alignment options:
  User can choose from Bowtie and Bowtie2. All reads with more than 100
  alignments will be filtered by this script.

  --bowtie              Use bowtie aligner to align reads, with Bowtie
                        parameters "-p number_of_threads -a -m 100 -S
                        --chunkmbs 1024". If "--paired-end" is set,
                        additionaly enable Bowtie parameters "-I 1 -X 1000".
                        (default: True)
  --bowtie-path <path>  The path to Bowtie executables. (default: None)
  --bowtie2             Use bowtie2 aligner to align reads, indel alignments
                        enabled, with Bowtie2 parameters "-p number_of_threads
                        -k 101". If "--paired-end" is set, additionaly enable
                        Bowtie2 parameters "-I 1 -X 1000 --no-mixed --no-
                        discordant". (default: False)
  --bowtie2-path <path>
                        The path to Bowtie2 executables. (default: None)
  --index-name <name>   The base name for Bowtie/Bowtie2 indices (default:
                        None)
  --keep-alignments     Turn on this option will enable PROBer to keep a copy
                        of aligner-produced alignments in
                        'sample_name.alignments.bam'. (default: False)

OUTPUT:
  sample_name.site_info
    This file contains the expected read counts at each unique crosslink site. Each line describes one site and has the following format:
    
    chr ori pos    n_unique    n_multi
    
    chr is the chromosome name, ori is the orientation (+/-) and pos gives the 0-based genomic coordinate in the '+' strand of chromosome chr. chr, ori, and pos together define the genomic location of the crosslink site and they are separated by single spaces. Then separated by single tabs, n_unique gives the number of uniquely mapped reads, and n_multi provides the expected number of multi-mapping reads at this site.

  sample_name.alignments.bam
    Only generated when '--keep-alignments' option is set.

    This file stores the aligner-produced alignments in BAM format.

  sample_name.stat
    This folder contains learned model parameters from data. In the folder, 'sample_name.model' contains the estimated sequencing model parameters.

  sample_name.temp
    This is a temporary folder contains intermediate files. It will be deleted automatically after the program finishes unless '--keep-intermediate-files' option is on.

version

PROBer version displays the current version of the software.