Paper: https://doi.org/10.1016/j.molcel.2018.08.018
Whippet v1.6 works on the long-term support release of Julia (v1.6.7) which is available here (https://julialang.org). (Note: Whippet.jl does not yet work on Julia v1.9). If you are new to julia, there is a helpful guide on how to get it up and running here
Download and install dependencies
git clone https://github.com/timbitz/Whippet.jl.git
cd Whippet.jl
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.test()'
This should tell you Testing Whippet tests passed
Notes:
git pull
Whippet.jl/bin
, you can use the -h
flag to get a list of the available command line options, their usage and defaults.You need your genome sequence in fasta, and a gene annotation file in Ensembl-style GTF format. Default GENCODE annotation supplied for hg19 and mm10 in Whippet/anno
. You can also obtain Ensembl GTF files from these direct links for Human: Ensembl_hg38_release_92 and Mouse: Ensembl_mm10_release_92. Other Ensembl GTF files can be downloaded here.
Download the genome.
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
Build an index.
julia bin/whippet-index.jl --fasta hg19.fa.gz --gtf anno/gencode_hg19.v25.tsl1.gtf.gz
Notes:
exon
lines (others are ignored). These must contain both gene_id
and transcript_id
attributes (which should not be the same as one another!). This GTF file should be consistent with the GTF2.2 specification, and should have all entries for a transcript in a continuous block. Warning: The UCSC table browser will not produce valid GTF2.2 files. Similarly, GTF files obtained from iGenomes or the Refseq websites do not satisfy these specifications.-x / --index
parameter. The default (for both whippet-index.jl and whippet-quant.jl) is a generic index named graph
located at Whippet.jl/index/graph.jls
, so you must have write-access to this location to use the default.Whippet v0.11+ allows you to build an index that includes unannotated splice-sites and exons found in a spliced RNA-seq alignment file. In order to build a BAM supplemented index, you need your BAM file sorted and indexed (using samtools):
# If using multiple BAM files (tissue1, ..., tissue3 etc), merge them first:
samtools merge filename.bam tissue1.bam tissue2.bam tissue3.bam
# If using a single BAM file start here:
samtools sort -o filename.sort filename.bam
samtools rmdup -S filename.sort.bam filename.sort.rmdup.bam
samtools index filename.sort.rmdup.bam
ls filename.sort.rmdup.bam*
filename.sort.rmdup.bam filename.sort.rmdup.bam.bai
Then build an index but with the additional --bam
parameter:
julia bin/whippet-index.jl --fasta hg19.fa.gz --bam filename.sort.rmdup.bam --gtf anno/gencode_hg19.v25.tsl1.gtf.gz
Notes:
--bam
option is sensitive to alignment strand, therefore using strand-specific alignments is recommended.--bam-both-novel
flag to override this requirement for greater Recall of unannotated splice-sites.--bam-min-reads
parameter (default is 1). Increase this parameter with large bam files to reduce artifacts and one-off cryptic splice junctions.julia bin/whippet-quant.jl file.fastq.gz
Note: Whippet only accepts standard four-line FASTQ file (described here: https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html)
Also, as of version 1.0.0, --ebi
and --url
flags have been deprecated to ease maintenance. EBI file paths can be found at the URL http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=$ebi_id&result=read_run&fields=fastq_ftp&display=txt. Use your own accession id (SRR id) in place of $ebi_id.
For example:
curl "https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=SRR1199010&result=read_run&fields=fastq_ftp&display=txt"
fastq_ftp
ftp.sra.ebi.ac.uk/vol1/fastq/SRR119/000/SRR1199010/SRR1199010.fastq.gz
julia bin/whippet-quant.jl fwd_file.fastq.gz rev_file.fastq.gz
To locate paired-end SRR id files, use the same ebi.ac.uk URL:
curl "https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERR1994736&result=read_run&fields=fastq_ftp&display=txt"
fastq_ftp
ftp.sra.ebi.ac.uk/vol1/fastq/ERR199/006/ERR1994736/ERR1994736_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR199/006/ERR1994736/ERR1994736_2.fastq.gz
To specify output location or a specific index:
julia bin/whippet-quant.jl fwd_file.fastq.gz -o outputname -x customindex.jls
You can also output the alignments in SAM format with the --sam
flag and convert to bam with samtools:
julia bin/whippet-quant.jl fwd_file.fastq.gz --sam > fwd_file.sam
samtools view -bS fwd_file.sam > fwd_file.bam
For greater stability of quantifications across multiple RNA-seq protocols, try the --biascorrect
flag, which will apply GC-content and 5’ sequence bias correction methods:
julia bin/whippet-quant.jl fwd_file.fastq.gz --biascorrect
It is also possible to pool fastq files at runtime using shell commands, and the optional (--force-gz
) for pooled gz files (files without .gz suffix)
julia bin/whippet-quant.jl <( cat time-series_{1,2,3,4,5}.fastq.gz ) --force-gz -o interval_1-5
Compare .psi.gz
files from from two samples -a
and -b
with any number of replicates (comma delimited list of files or common pattern matching) per sample.
ls *.psi.gz
#sample1-r1.psi.gz sample1-r2.psi.gz sample2-r1.psi.gz sample2-r2.psi.gz
julia bin/whippet-delta.jl -a sample1 -b sample2
#OR
julia bin/whippet-delta.jl -a sample1-r1.psi.gz,sample1-r2.psi.gz -b sample2-r1.psi.gz,sample2-r2.psi.gz
Note: comparisons of single files still need a comma: -a singlefile_a.psi.gz, -b singlefile_b.psi.gz,
The output format for whippet-quant.jl
is saved into two core quant filetypes, .psi.gz
and .tpm.gz
files.
Each .tpm.gz
file contains a simple format compatible with many downstream tools (one for the TPM of each annotated transcript, and another at the gene-level):
Gene/Isoform | TpM | Read Counts |
---|---|---|
NFIA | 2897.11 | 24657.0 |
Meanwhile the .psi.gz
file is a bit more complex and requires more explanation. Quantified nodes can fall into a number of “node-type” categories:
Type | Interpretation |
---|---|
CE | Core exon, which may be bounded by one or more alternative AA/AD nodes |
AA | Alternative Acceptor splice site |
AD | Alternative Donor splice site |
RI | Retained intron |
TS | Tandem transcription start site |
TE | Tandem alternative polyadenylation site |
AF | Alternative First exon |
AL | Alternative Last exon |
BS | Circular back-splicing (output only when --circ flag is given) |
Each node is defined by a type (above) and has a corresponding value for Psi
or the Percent-Spliced-In, see below:
The .psi.gz
file itself is outputted in the form:
Gene | Node | Coord | Strand | Type | Psi | CI Width | CI Lo,Hi | Total Reads | Complexity | Entropy | Inc Paths | Exc Paths |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NFIA | 2 | chr1:61547534-61547719 | + | AF | 0.782 | 0.191 | 0.669,0.86 | 49.0 | K1 | 0.756 | 2-4-5:0.782 | 1-5:0.218 |
NFIA | 4 | chr1:61548433-61548490 | + | CE | 0.8329 | 0.069 | 0.795,0.864 | 318.0 | K2 | 1.25 | 2-4-5:0.3342,3-4-5:0.4987 | 1-5:0.1671 |
NFIA | 5 | chr1:61553821-61554352 | + | CE | 0.99 | NA | NA | NA | NA | NA | NA | NA |
NFIA | 6 | chr1:61743192-61743257 | + | CE | 0.99 | NA | NA | NA | NA | NA | NA | NA |
Whippet outputs quantification at the node-level (one PSI value for each node, not exon). For example, Whippet CE (‘core exon’) nodes may be bounded by one or more AA or AD nodes. So when those AA or AD nodes are spliced in, the full exon consists of the nodes combined (AA+CE).
We believe that this is (and should be) the correct generalization of event-level splicing quantification for the following reasons:
node
.This must be taken into account when intersecting .psi.gz
coordinate output with other formats that only represent full exons (which can be one or more adjacent nodes combined). To ease this process, whippet-index.jl
outputs an index.exons.gz
file that contains all theoretical full exon coordinates mapped to Whippet nodes for each gene, and whether or not the exon is found in annotation.
Note:
:
and the relative expression, such that the sum of all these paths is 1.0.The raw junctions are output in the format of .jnc.gz
files, which look like:
Chrom | Donor_Coord | Acceptor_Coord | GeneID:Nodes:Type | ReadCount | Strand |
---|---|---|---|---|---|
chr1 | 150600068 | 150601890 | ENSG00000143420.17_1:2-3:CON_ANNO | 113.04 | - |
chr1 | 150598284 | 150599943 | ENSG00000143420.17_1:3-6:ALT_ANNO | 110.12 | - |
chr1 | 150595335 | 150598118 | ENSG00000143420.17_1:6-9:CON_ANNO | 35.74 | - |
chr1 | 161185160 | 161187776 | ENSG00000158869.10_1:1-2:CON_ANNO | 20.47 | + |
chr1 | 161185160 | 161188031 | ENSG00000158869.10_1:1-3:ALT_UNIQ | 3.11 | + |
The Type
can be one of three:
Type | Meaning |
---|---|
CON_ANNO | Annotated constitutive junction |
ALT_ANNO | Annotated alternative junction |
ALT_UNIQ | Unannotated alternative junction |
The output format of .diff.gz
files from whippet-delta.jl
is:
Gene | Node | Coord | Strand | Type | Psi_A | Psi_B | DeltaPsi | Probability | Complexity | Entropy |
---|---|---|---|---|---|---|---|---|---|---|
ENSG00000117448.13_2 | 9 | chr1:46033654-46033849 | + | CE | 0.95971 | 0.97876 | -0.019047 | 0.643 | K0 | 0.0 |
ENSG00000117448.13_2 | 10 | chr1:46034157-46034356 | + | CE | 0.9115 | 0.69021 | 0.22129 | 0.966 | K1 | 0.874 |
Between the set of replicates from -a and -b whippet-delta.jl outputs a mean Psi_A (from emperical sampling of the joint posterior distribution) and mean Psi_B. It also outputs the mean deltaPsi (Psi_A - Psi_B) and the probability that there is some change in deltaPsi given the read depth of the AS event, such that P( |
deltaPsi | > 0.0). To determine the significance of output in .diff.gz files, two heuristic filters are encouraged, however please note that the exact values for these filters have not yet been systematically optimized. These should be adjusted by the user according to the desired stringency, read depth, and experiment set-up: (1) In general, significant nodes should have a higher probability (e.g. >=0.90) suggesting there is likely to be a difference between the samples, and (2) significant nodes should have an absolute magnitude of change ( |
DeltaPsi | > x) where x is some value you consider to be a biologically relevant magnitude of change (we suggest | DeltaPsi | > 0.1). |
Currently, Whippet supports the output of RNA-seq alignments in SAM/BAM format when the (--sam
) flag is used. These alignment files can be displayed using a number of different downstream visualization software tools, such as the IGV browser, which also supports display in the popular “sashimi-plot” format (e.g. instructions on making sashimi-plots with IGV can be found here).
If you are building an index for another organism, there are some general guidelines that can help to ensure that the index you build is as effective as it can be. In general you should seek to:
With all of the executables in Whippet/bin
, you can use the -h
flag to get a list of the available command line options and their usage. If you are having trouble using or interpreting the output of Whippet
then please ask a question in our gitter chat!.
If you are having trouble running a Whippet executable in /bin, try opening the julia REPL within the Whippet.jl directory and testing:
]
(@v1.5) pkg> activate .
(Whippet) pkg> test
Whippet should run cleanly if your environment is working correctly (although it is mostly untested on Windows, it should still work– or at least it did the one time we tried it ;-). If you still think you have found a bug feel free to open an issue in github or make a pull request!