FLAIR protocol
Align reads to the genome
Obtain novel splice junction reference
Generate transcriptomes
Combine transcriptomes between samples
Quantify splicing relative to the combined transcriptome
Note: this is for FLAIR obtained through pulling master from GitHub. Some of these updates are not available in the current conda version
Aligning reads to the genome
FLAIR is now very flexible about read alignment strategies; please use whatever makes the most sense for your data. We now recommend aligning each sample individually. The exception is if your samples have less than 10M reads. In that case, you may want to combine replicates for better coverage. However, we have tested and obtained good results with ~1M reads.
FLAIR align
Is a wrapper around minimap that uses specific options for what we find to be good spliced alignment; will also do some additional filtering if requested
Pros: this is what FLAIR was tested on, will provide consistent results
Cons: this requires FASTA/FASTQ read files, so if your raw reads are provided as BAM files, you would need to convert them to FASTA/FASTQ, losing information stored in the BAM tags. Also, if your reads are already aligned, realigning is cumbersome.
Pbmm2 / Dorado align
These are the minimap2 wrappers developed by PacBio and ONT
Pros: will align raw BAM files directly and preserve tags, UMIs, and modification information
Cons: minimal flexibility in terms of alignment options, less stable than minimap (more frequent updates)
Minimap2
This is the basal aligner used by all of these methods, which can also be run directly
Pros: maximum flexibility and control in terms of alignment options
Cons: Similar to FLAIR align
Recommended alignment command:
minimap2 -ax splice -s 40 -G 350k -t 25 --MD --secondary=no GENOME.fa SAMPLE.fa | samtools view -hb - | samtools sort - > SAMPLE.genomealigned.bam; samtools index SAMPLE.genomealigned.bamIf your library is stranded, you may want to add
-ufto the minimap2 command (forces splice junctions to be on the same strand as the read)
Obtaining novel splice junction reference
FLAIR requires some kind of splice junction reference. Even if you have a well-annotated reference transcriptome, we highly recommend including some reference for novel splice junctions. FLAIR will not report any splice junctions not found in either the reference transcriptome or the supported junctions file. You can either use short-read junctions (from paired short-read sequencing or from some other sequencing of the same tissue type) or calculate supported junctions directly from the long-read data
NOTE: We usually recommend that you use the same reference splice junctions file for all samples within a dataset, but when that is not possible (when the dataset is growing over time), using sample-specific files is okay.
For combining files between samples, concatenating the files into one is fine.
Using short-read junctions:
Pros: provides orthogonal support, minimizes long-read specific biases/artifacts
Cons: requires additional sequencing, or if non-matched sequencing is used, could be less sensitive to sample-specific junctions
Usage:
Preferred: align short-reads with STAR. STAR will output a SJ.out.tab file by default; you can use this file directly with the –junction_tab option in FLAIR transcriptome
If your short-reads have already been aligned with another aligner, you can extract the junctions with the junctions_from_sam script included with FLAIR or using intronProspector.
Using long-read junctions:
Pros: detects and removes artifacts, direct junction detection from the same reads you’re assembling transcripts from (best match)
Cons: Potentially less sensitive to rare events with low read support, not orthogonal support (has the same lrRNA biases)
Usage:
After genomic alignment of your reads (have indexed BAM file), use intronProspector to generate a junctions.bed file:
intronProspector --genome-fasta=GENOME.fa --intron-bed6=SAMPLE.IPjunctions.bed -C 0.0 SAMPLE.genomealigned.bam
Generating transcriptomes
This is the core function of FLAIR. We have recently switched from using two modules to assemble the transcriptome (correct and collapse) to using FLAIR transcriptome. This single module allows parallelized transcriptome building directly from the aligned BAM file, reducing memory usage and increasing speed. FLAIR transcriptome is also more sensitive to novel genes and transcripts than previous versions. FLAIR transcriptome can be run with or without an annotated GTF file, but having a good reference does improve performance. Here we provide a standard recommended command, but FLAIR transcriptome has many options that can be tuned for your application, see its dedicated page to learn more.
With short-read orthogonal support:
flair transcriptome -g GENOME.fa -f ANNOTATION.gtf --junction_tab SHORTREAD.SJ.out.tab -b SAMPLE.genomealigned.bam -o SAMPLE
With junctions from long reads:
flair transcriptome -g GENOME.fa -f ANNOTATION.gtf --junction_bed LONGREAD.IPjunctions.bed --junction_support 2 -b SAMPLE.genomealigned.bam -o SAMPLE
Combine transcriptomes between samples
If you have a dataset composed of many samples that you want to compare, you want to have a reference transcriptome that works equally well for all samples. To achieve this, first generate a transcriptome for each sample or batch of samples. Next, use FLAIR combine to combine the transcriptomes.
Make a manifest file pointing to your transcriptomes for all of your samples (see FLAIR combine documentation for more details)
Run FLAIR combine:
More stringent combination (keep only spliced isoforms expressed at over 10% of the locus in at least 1 sample)
flair combine -m MANIFEST.txt -o OUTPUT
Less stringent combination (keeps anything supported in any file, just combines based on splice junctions and similar ends)
flair combine -m MANIFEST.txt -o OUTPUT -p 0 -f 1 -s
Quantify splicing relative to the combined transcriptome
For the most consistent transcript quantification, we recommend re-quantifying all of your samples together on the combined transcriptome using FLAIR quantify. FLAIR quantify is a stringent quantification method. It assigns reads to unambiguous transcript matches and does not support EM or partial support algorithms. This is excellent for downstream approaches that require unambiguous read-to-transcript matching and is a stringent quantification method, but you will likely recover only 55-90% of your reads compared to gene-level quantification.
Make a new manifest file pointing to your raw FASTA/FASTQ reads for all samples (if your raw reads are in BAM format, you need to convert them to FASTA/FASTQ).
Read FASTA/FASTQ files, can be gzipped
Run FLAIR quantify:
flair quantify -r MANIFEST.txt -i COMBINED.isoforms.fa --isoform_bed COMBINED.isoforms.bed --stringent --check_splice