FLAIR protocol

Align reads to the genome
Obtain novel splice junction reference
Generate transcriptomes
Combine transcriptomes between samples
Quantify splicing relative to the combined transcriptome

Note: this protocol applies to FLAIR installed from the master branch on GitHub. Some of these updates are not yet available in the current conda version.

Aligning reads to the genome

FLAIR is now very flexible about read alignment strategies; please use whatever makes the most sense for your data. We now recommend aligning each sample individually. The exception is if your samples have fewer than 10M reads, in which case you may want to combine replicates for better coverage. That said, we have tested and obtained good results with ~1M reads.

FLAIR align
- A wrapper around minimap2 that uses options we have found to give good spliced alignment; it will also do some additional filtering if requested
- Pros: this is what FLAIR was tested on, will provide consistent results
- Cons: this requires FASTA/FASTQ read files, so if your raw reads are provided as BAM files, you would need to convert them to FASTA/FASTQ, losing information stored in the BAM tags. Also, if your reads are already aligned, realigning is cumbersome.
Pbmm2 / Dorado align
- These are the minimap2 wrappers developed by PacBio (pbmm2) and ONT (Dorado)
- Pros: will align raw BAM files directly and preserve tags, UMIs, and modification information
- Cons: minimal flexibility in terms of alignment options, less stable than minimap2 (more frequent updates)
Minimap2
- This is the underlying aligner used by all of these methods, and it can also be run directly
- Pros: maximum flexibility and control in terms of alignment options
- Cons: same as for FLAIR align (requires FASTA/FASTQ read files, losing information stored in BAM tags)
- Recommended alignment command:
  - minimap2 -ax splice -s 40 -G 350k -t 25 --MD --secondary=no GENOME.fa SAMPLE.fa | samtools view -hb - | samtools sort - > SAMPLE.genomealigned.bam; samtools index SAMPLE.genomealigned.bam
  - If your library is stranded, you may want to add -uf to the minimap2 command (forces splice junctions to be on the same strand as the read)
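When aligning several samples individually, the recommended command above can be wrapped in a small loop. The sketch below is a dry run: it only prints one command line per sample so it can be inspected before anything is executed. The function name `print_align_cmds`, the `GENOME` and `THREADS` variables, and the `<sample>.fa` naming convention are illustrative assumptions, not part of FLAIR.

```shell
# Dry-run sketch: print the recommended minimap2 command for each sample.
# Assumes reads are named <sample>.fa; GENOME and THREADS are placeholders.
GENOME="${GENOME:-GENOME.fa}"
THREADS="${THREADS:-25}"

print_align_cmds() {
  for sample in "$@"; do
    # Same options as the recommended command; only the sample name changes.
    printf '%s\n' "minimap2 -ax splice -s 40 -G 350k -t ${THREADS} --MD --secondary=no ${GENOME} ${sample}.fa | samtools view -hb - | samtools sort - > ${sample}.genomealigned.bam; samtools index ${sample}.genomealigned.bam"
  done
}

# Example: print commands for two hypothetical samples.
print_align_cmds sampleA sampleB
```

Once the printed lines look right, they can be piped to a shell to run them. For a stranded library, -uf would be added to the minimap2 options inside the function.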

Obtaining novel splice junction reference

FLAIR requires some kind of splice junction reference. Even if you have a well-annotated reference transcriptome, we highly recommend including some reference for novel splice junctions. FLAIR will not report any splice junctions not found in either the reference transcriptome or the supported junctions file. You can either use short-read junctions (from paired short-read sequencing or from some other sequencing of the same tissue type) or calculate supported junctions directly from the long-read data.

NOTE: We usually recommend that you use the same reference splice junctions file for all samples within a dataset, but when that is not possible (for example, when the dataset is growing over time), using sample-specific files is okay.
- For combining files between samples, concatenating the files into one is fine.
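Combining per-sample junction files by concatenation might look like the sketch below. The function name `combine_junctions` and the file names in the example are hypothetical; the only operation performed is concatenation, and it assumes all inputs share one format (all BED or all SJ.out.tab).

```shell
# Sketch: concatenate per-sample junction files into one shared reference.
# Usage: combine_junctions OUTPUT INPUT [INPUT...]
combine_junctions() {
  out="$1"
  shift
  # Refuse to run with no inputs, so an empty file is never produced silently.
  if [ "$#" -eq 0 ]; then
    echo "usage: combine_junctions OUTPUT INPUT [INPUT...]" >&2
    return 1
  fi
  cat "$@" > "$out"
}

# Example with hypothetical file names:
# combine_junctions DATASET.IPjunctions.bed SAMPLE1.IPjunctions.bed SAMPLE2.IPjunctions.bed
```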
Using short-read junctions:
- Pros: provides orthogonal support, minimizes long-read specific biases/artifacts
- Cons: requires additional sequencing, or if non-matched sequencing is used, could be less sensitive to sample-specific junctions
- Usage:
  - Preferred: align short reads with STAR. STAR will output a SJ.out.tab file by default; you can use this file directly with the --junction_tab option in FLAIR transcriptome
  - If your short reads have already been aligned with another aligner, you can extract the junctions with the junctions_from_sam script included with FLAIR or with intronProspector.
Using long-read junctions:
- Pros: detects and removes artifacts, direct junction detection from the same reads you’re assembling transcripts from (best match)
- Cons: potentially less sensitive to rare events with low read support; not orthogonal support (has the same long-read RNA-seq biases)
- Usage:
  - After genomic alignment of your reads (you should now have an indexed BAM file), use intronProspector to generate a junctions BED file:
  - intronProspector --genome-fasta=GENOME.fa --intron-bed6=SAMPLE.IPjunctions.bed -C 0.0 SAMPLE.genomealigned.bam

Generating transcriptomes

This is the core function of FLAIR. We have recently switched from using two modules to assemble the transcriptome (correct and collapse) to using FLAIR transcriptome. This single module allows parallelized transcriptome building directly from the aligned BAM file, reducing memory usage and increasing speed. FLAIR transcriptome is also more sensitive to novel genes and transcripts than previous versions. It can be run with or without an annotated GTF file, but having a good reference does improve performance. Here we provide a standard recommended command, but FLAIR transcriptome has many options that can be tuned for your application; see its dedicated page to learn more.

With short-read orthogonal support:
- flair transcriptome -g GENOME.fa -f ANNOTATION.gtf --junction_tab SHORTREAD.SJ.out.tab -b SAMPLE.genomealigned.bam -o SAMPLE
With junctions from long reads:
- flair transcriptome -g GENOME.fa -f ANNOTATION.gtf --junction_bed LONGREAD.IPjunctions.bed --junction_support 2 -b SAMPLE.genomealigned.bam -o SAMPLE
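The two commands above differ only in the junction options, so a per-sample wrapper can pick them from the junction file's name. The sketch below is a dry run that only prints the command. The function name `print_transcriptome_cmd`, the `GENOME` and `ANNOTATION` variables, and the rule of selecting the option by file suffix are illustrative assumptions; the flair options themselves are the ones shown above.

```shell
# Dry-run sketch: choose the junction option from the junction file's name.
GENOME="${GENOME:-GENOME.fa}"
ANNOTATION="${ANNOTATION:-ANNOTATION.gtf}"

print_transcriptome_cmd() {
  sample="$1"
  junctions="$2"
  case "$junctions" in
    # STAR junction table (short-read support)
    *SJ.out.tab) junction_opts="--junction_tab ${junctions}" ;;
    # BED junctions, e.g. from intronProspector (long-read support)
    *.bed) junction_opts="--junction_bed ${junctions} --junction_support 2" ;;
    *) echo "unrecognized junction file: ${junctions}" >&2; return 1 ;;
  esac
  echo "flair transcriptome -g ${GENOME} -f ${ANNOTATION} ${junction_opts} -b ${sample}.genomealigned.bam -o ${sample}"
}

# Example: reproduces the two recommended commands.
print_transcriptome_cmd SAMPLE SHORTREAD.SJ.out.tab
print_transcriptome_cmd SAMPLE LONGREAD.IPjunctions.bed
```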

Combine transcriptomes between samples

If you have a dataset composed of many samples that you want to compare, you want to have a reference transcriptome that works equally well for all samples. To achieve this, first generate a transcriptome for each sample or batch of samples. Next, use FLAIR combine to combine the transcriptomes.

Make a manifest file pointing to your transcriptomes for all of your samples (see FLAIR combine documentation for more details)
Run FLAIR combine:
- More stringent combination (keeps only spliced isoforms expressed at over 10% of the locus in at least 1 sample)
  - flair combine -m MANIFEST.txt -o OUTPUT
- Less stringent combination (keeps anything supported in any file, just combines based on splice junctions and similar ends)
  - flair combine -m MANIFEST.txt -o OUTPUT -p 0 -f 1 -s

Quantify splicing relative to the combined transcriptome

For the most consistent transcript quantification, we recommend re-quantifying all of your samples together on the combined transcriptome using FLAIR quantify. FLAIR quantify is a stringent quantification method: it assigns reads only to unambiguous transcript matches and does not support EM or partial support algorithms. This is excellent for downstream approaches that require unambiguous read-to-transcript matching, but you will likely recover only 55-90% of your reads compared to gene-level quantification.

Make a new manifest file pointing to your raw FASTA/FASTQ reads for all samples (if your raw reads are in BAM format, you need to convert them to FASTA/FASTQ).
- The FASTA/FASTQ read files can be gzipped
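A reads manifest for this step could be generated as sketched below. This assumes the tab-separated four-column layout (sample ID, condition, batch, path to reads) described in the FLAIR quantify documentation; please verify the layout and any restrictions on field contents against your installed version. The function name `write_reads_manifest`, the sample names, and the paths are hypothetical.

```shell
# Sketch: write a tab-separated reads manifest, one line per sample.
# Assumed columns: sample ID, condition, batch, path to reads.
# Usage: write_reads_manifest OUTPUT ID CONDITION BATCH READS [ID CONDITION BATCH READS ...]
write_reads_manifest() {
  manifest="$1"
  shift
  : > "$manifest"   # start from an empty file
  while [ "$#" -ge 4 ]; do
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$manifest"
    shift 4
  done
  # Leftover arguments mean a sample was given fewer than four fields.
  if [ "$#" -ne 0 ]; then
    echo "each sample needs four fields: ID CONDITION BATCH READS" >&2
    return 1
  fi
}

# Example with two hypothetical samples and gzipped FASTQ reads.
write_reads_manifest MANIFEST.txt \
  sampleA control batch1 reads/sampleA.fastq.gz \
  sampleB treated batch1 reads/sampleB.fastq.gz
```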
Run FLAIR quantify:
- flair quantify -r MANIFEST.txt -i COMBINED.isoforms.fa --isoform_bed COMBINED.isoforms.bed --stringent --check_splice