Quantifying alternative splicing from RNA-seq data#

Description#

Our pipeline calls alternative splicing events from RNA-seq data using leafcutter and psichomics to call the RNA-seq data from fastq.gz data which has been mapped to a reference genome using STAR with the wasp option. It implements the GTEx pipeline for GTEx/TOPMed project. Please refer to this page for detail. The choice of pipeline modules in this project is supported by internal (unpublished) benchmarks from GTEx group.

We use two different tools to quantify the many types of splicing events which are outlined in [cf. Wang et al (2008)] and [cf. Park et al (2018)]. The first, leafcutter, quantifies the usage of alternatively excised introns. This collectively captures skipped exons, 5’ and 3’ alternative splice site usage and other complex events [cf. Li et al 2018]. This method was previously applied to ROSMAP data as part of the Brain xQTL version 2.0. The second, psichomics, quantifies specific splicing events [cf. Agostinho et al. 2019].

Input#

Various reference data needs to be prepared before using this workflow. Here we provide a module to download and prepare the reference data.

A minimal working example is uploaded in the google drive. It contains example inputs for leafcutter/psichomics, two spliing annotations for psichomics, the meta-data file list, and a example of blacklist chromosome file for leafcutter.

The leafcutter and psichomics tools may be run in parallel and are not dependent on one another.

The product of this workflow can be used in generating phenotype tables using /molecular_phenotyles/QC/splicing_normalization.ipynb.

Both leafcutter and psichomics section, a meta-data file, white space delimited, containing 4 columns: sample ID, RNA strandness and path to the BAM files input for leafcutter section and to SJ.out.tab files for psichomics section:

sample_id       strand          bam_list                                SJ_list
sample_1        rf              sample_1.Aligned.sortedByCoord.out.bam  sample_1.SJ.out.tab
sample_2        fr              sample_2.Aligned.sortedByCoord.out.bam  sample_2.SJ.out.tab
sample_3        strand_missing  sample_3.Aligned.sortedByCoord.out.bam  sample_3.SJ.out.tab

All files listed in this meta-data file should be available in the path provided to the --data-dir argument when running either leafcutter or psichomics.

If only one type of input files is prepared, one of the bam_list column and SJ_list column can be left empty.

leafcutter#

The bam files can be generated by the STAR_align workflow from our RNA_calling.ipynb module.

All the BAM files should be available under specified folder (default assumes the same folder as where the meta-data file is).

If intend to blacklist some chromosomes and not analyze it, add one text file named black_list.txt with one chromosome name per line in the same directory of the meta-data file.

psichomics#

The SJ.out.tab files can be generated by the STAR_align workflow from our RNA_calling.ipynb module.

All the SJ.out.tab files should be available under specified folder (default assumes the same folder as where the meta-data file is).

Output#

leafcutter#

{sample_list} below refers to the name of the meta-data file input.

Main output include:

  • {sample_list}_intron_usage_perind.counts.gz file with row id in format: “chromosome:intron_start:intron_end:cluster_id”, column labeled as input sample names and each type of intron usage ratio under each sample (i.e. #particular intron in a sample / #total introns classified in the same cluster in a sample) in each cells.

  • {sample_list}_intron_usage_perind_numers.counts.gz file with the same row and column label but the count of each intron in each cells.

psichomics#

  • psi_raw_data.tsv A dataframe of PSI values (quantification of the alternative splicing events) with first column splicing event identifier (for instance, SE_1_-_2125078_2124414_2124284_2121220_C1orf86) is composed of:

                 Event type (SE stands for skipped exon)
                 Chromosome (1)
                 Strand (-)
                 Relevant coordinates depending on event type (in this case, the first constitutive exon’s end, the                            alternative exon’ start and end and the second constitutive exon’s start)
                 Associated gene (C1orf86)
    

Splicing Event Type

Abbreviation

Coordinates

Skipped Exon

SE

constitutive exon 1 end, alternative exon (start and end) and constitutive exon 2 start

Mutually exclusive exon

MXE

constitutive exon 1 end, alternative exon 1 and 2 (start and end) and constitutive exon 2 start

Alternative 5’ splice site

A5SS

constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start

Alternative 3’ splice site

A3SS

constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start

Alternative first exon

AFE

constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start

Alternative last exon

ALE

constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start

Alternative first exon (exon-centered - less reliable)

AFE_exon

constitutive exon 1 end, alternative exon 1 end and constitutive exon 2 start

Alternative last exon (exon-centered - less reliable)

ALE_exon

constitutive exon 1 end, alternative exon 1 start and constitutive exon 2 start

Minimal Working Example Steps#

i. Splicing Quantification#

a. Leafcutter#

Timing: <30min

Quantify the usage of alternatively excised introns.

!sos run splicing_calling.ipynb leafcutter \
    --cwd ../../output_test/leafcutter \
    --samples ../../PCC_sample_list_subset_leafcutter \
    --data-dir ../../output_test/star_output_wasp \
    --container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
    -c ../csg.yml -q neurology
INFO: Running leafcutter_1: 
INFO: leafcutter_1 (index=0) is ignored due to saved signature
INFO: leafcutter_1 (index=1) is ignored due to saved signature
INFO: leafcutter_1 (index=2) is ignored due to saved signature
INFO: leafcutter_1 (index=3) is ignored due to saved signature
INFO: leafcutter_1 (index=4) is ignored due to saved signature
INFO: leafcutter_1 (index=5) is ignored due to saved signature
INFO: leafcutter_1 (index=6) is ignored due to saved signature
INFO: leafcutter_1 (index=7) is ignored due to saved signature
INFO: leafcutter_1 (index=8) is ignored due to saved signature
INFO: leafcutter_1 (index=9) is ignored due to saved signature
INFO: leafcutter_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/leafcutter/1000-PCC.bam.Aligned.sortedByCoord.out_wasp_qc.md.junc /restricted/projectnb/xqtl/xqtl_protocol/output_test/leafcutter/1001-PCC.bam.Aligned.sortedByCoord.out_wasp_qc.md.junc... (10 items in 10 groups)
INFO: Running leafcutter_2: 
INFO: t53ce7f911566d133 submitted to neurology with job id Your job 2489088 ("job_t53ce7f911566d133") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: leafcutter_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz
INFO: Workflow leafcutter (ID=w2d4888af18ce85b3) is executed successfully with 1 completed step, 1 ignored step, 10 ignored substeps and 1 completed task.

b. Psichomics#

Timing: ~30min

Quantify specific splicing events.

!sos run splicing_calling.ipynb psichomics \
    --cwd ../../output_test/psichomics/ \
    --samples ../../PCC_sample_list_subset_leafcutter \
    --data-dir ../../output_test/star_output_wasp \
    --splicing_annotation ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds \
    --container oras://ghcr.io/cumc/psichomics_apptainer:latest \
    -c ../csg.yml -q neurology
INFO: Running psichomics_1: 
INFO: t9842e29b618d1fad restart from status failed
INFO: t9842e29b618d1fad submitted to neurology with job id Your job 2490344 ("job_t9842e29b618d1fad") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: psichomics_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/psichomics/psichomics_junctions.txt
INFO: Running psichomics_2: 
INFO: t4908c2bee8ddf317 submitted to neurology with job id Your job 2490387 ("job_t4908c2bee8ddf317") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: psichomics_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/psichomics/psi_raw_data.tsv
INFO: Workflow psichomics (ID=w2c5d88bef5022bf9) is executed successfully with 2 completed steps and 2 completed tasks.

Troubleshooting#

Step

Substep

Problem

Possible Reason

Solution

Command interface#

sos run splicing_calling.ipynb -h
usage: sos run splicing_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  leafcutter
  psichomics

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --samples VAL (as path, required)
                        Sample meta data list
  --data-dir  path(f"{samples:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option

Sections
  leafcutter_1:
    Workflow Options:
      --anchor-len 8 (as int)
                        anchor length (default 8)
      --min-intron-len 50 (as int)
                        minimum intron length to be analyzed (default 50)
      --max-intron-len 500000 (as int)
                        maximum intron length to be analyzed (default 500000)
  leafcutter_2:
    Workflow Options:
      --min-clu-reads 50 (as int)
                        minimum reads in a cluster (default 50 reads)
      --max-intron-len 500000 (as int)
                        maximum intron length to be analyzed (default 500000)
  psichomics_1:
  psichomics_2:

Setup and global parameters#

[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# Sample meta data list
parameter: samples = path
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{samples:d}")

# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')

def get_samples(fn, dr):
    import os
    import pandas as pd
    
    samples = pd.read_csv(fn, sep='\t')
    names = []
    strandness = []
    bam_list = []
    bam_files = []
    SJtab_list = []
    SJtab_files = []
    
    samples = samples.fillna("NA")
    names = samples['sample_id'].tolist()
    strandness = samples['strand'].tolist()
    bam_list = samples['coord_bam_list'].tolist()
    SJtab_list = samples['SJ_list'].tolist()
    
    if ((len(bam_list) == sum(x == "NA" for x in bam_list)) & (len(SJtab_list) == sum(x == "NA" for x in SJtab_list))):
        raise ValueError("At least one type of input should be ready")
        
    for j in range(len(strandness)):
        # for regtools command usage, replace 0 = unstranded/XS, 1 = first-strand/RF, 2 = second-strand/FR
        if strandness[j] == 'rf':
            strandness[j] = 'RF'
        if strandness[j] == 'fr':
            strandness[j] = 'FR'
        if strandness[j] == 'strand_missing':
            strandness[j] = 'XS'
            
    if (len(bam_list) != 0) & (len(bam_list) != sum(x == "NA" for x in bam_list)):
        for y in bam_list:
            y = os.path.join(dr, y)
            if not os.path.isfile(y):
                raise ValueError(f"File {y} does not exist")
            bam_files.append(y)
        
    if len(bam_list) != len(set(bam_list)):
        raise ValueError("Duplicated files are found (but should not be allowed) in BAM file list")
    
    if (len(SJtab_list) != 0) & (len(SJtab_list) != sum(x == "NA" for x in SJtab_list)):
        for y in SJtab_list:
            y = os.path.join(dr, y)
            if not os.path.isfile(y):
                raise ValueError(f"File {y} does not exist")
            SJtab_files.append(y)
        
    if len(SJtab_list) != len(set(SJtab_list)):
        raise ValueError("Duplicated files are found (but should not be allowed) in SJ.tab file list")
        
    return names, strandness, bam_files, SJtab_files

sample_id, strandness, bam_data, SJtab_data = get_samples(samples, data_dir)

leafcutter#

Documentation: leafcutter. The choices of regtool parameters are discussed here.

Other clustering options:#

  • “-q”, “–quiet” : don’t print status messages to stdout, default=True.

  • “-p”, “–mincluratio” : minimum fraction of reads in a cluster that support a junction, default 0.001.

  • “-c”, “–cluster” : refined cluster file when clusters are already made, default = None.

  • “-k”, “–nochromcheck” : Don’t check that the chromosomes are well formated e.g. chr1, chr2, …, or 1, 2, …, default = False.

  • “-C”, “–includeconst” : also include constitutive introns, default = False.

The default parameter we used are:

--min_clu_ratio 0.001 --max_intron_len 500000 --min_clu_reads 30

These parameter is based on GTEX’s sQTL discovery pipeline (Section 3.4.3)

Things to keep in mind:#

  • If .bam.bai index files of the .bam input are ready before using leafCutter, it can be placed in the same directory with input .bam files and the “samtools index ${_input}” line can be skipped.

[leafcutter_1, leafcutter_preprocessing_1]
# anchor length (default 8)
parameter: anchor_len = 8
# minimum intron length to be analyzed (default 50)
parameter: min_intron_len = 50
# maximum intron length to be analyzed (default 500000)
parameter: max_intron_len = 500000
input: bam_data, group_by = 1, group_with = "strandness"
output: f'{cwd}/{_input:bn}.junc' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint = entrypoint
    samtools index ${_input}
    regtools junctions extract -a ${anchor_len} -m ${min_intron_len} -M ${max_intron_len} -s ${_strandness} ${_input} -o ${_output}
[leafcutter_2]
# minimum reads in a cluster (default 50 reads)
parameter: min_clu_reads = 30 
# maximum intron length to be analyzed (default 500000)
parameter: max_intron_len = 500000 
# minimum fraction of reads in a cluster that support a junction (default 0.001)
parameter: min_clu_ratio = 0.001
input: group_by = 'all'
output: f'{cwd}/{samples:bn}_intron_usage_perind.counts.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
    rm -f ${_output:nn}.junc
    for i in ${_input:r}; do
    echo $i >> ${_output:nn}.junc ; done
    leafcutter_cluster_regtools.py -j ${_output:nn}.junc -o ${f'{_output:bnn}'.replace("_perind","")} -m ${min_clu_reads} -l ${max_intron_len} -r ${cwd} -p ${min_clu_ratio}
[leafcutter_preprocessing_2]
input: group_by = 'all'
output: f'{cwd}/{samples:bn}_intron_usage_perind.junc'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
    rm -f ${_output:r}
    for i in ${_input:r}; do
    echo $i >> ${_output:r} ; done

psichomics#

Documentation: psichomics

Other options#

quantifySplicing( annotation, junctionQuant, eventType = c(“SE”, “MXE”, “ALE”, “AFE”, “A3SS”, “A5SS”), minReads = 10, genes = NULL )

In function quantifySplicing, arguments eventType (Character: splicing event types to quantify), minReads (Integer: values whose number of total supporting read counts is below minReads are returned as NA) and genes (Character: gene symbols for which to quantify splicing events. If NULL, events from all genes are quantified.) can be specified. Usage and default values are shown above.

Alternative Splicing Annotation Information#

Two alternative splicing annotations will be provided in this pipeline which can be download here. The hg38_suppa.rds is created Via SUPPA using the gtf file of the xqtl-protocol, and the modified_psichomics_hg38_splicing_annotation.rds is modified from the default Human hg38 (2018-04-30) annotation provided by psichomics package. Description of the database can be found in the Alternative splicing annotation section in the MATERIALS AND METHODS part. Gene names of the original annotation are replaced by Ensembl ids for format unifying. The Ensembl IDs used in modifiction are matched from the gtf file, HGNC database, SUPPA and VASTTOOL records within the original annotation.

Theoretically the annotation created using the gtf file only will give results more consistent with other part of the pipeline. The annotation modified from psichomics original hg38 annotation can identify more events since it was build based on information maximizing principle, however there will be risk of containing outdated information too.

For details of generation method of the gtf file and the two splicing annotations, please check the GFF3 to GTF formatting, Generation of SUPPA annotation for psichomics, and Modification of psichomics default Hg38 splicing annotation sections in reference_data.ipynb.

[psichomics_1]
input: SJtab_data
output: f'{cwd}/psichomics_junctions.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint = entrypoint
    library("psichomics")
    library("dplyr")
    library("tidyr")
    library("purrr")

    df_replace <- function (df, old, new){
          if (is.na(old)) {
              df[is.na(df)] <- new
          } else {
              df[df == old] <- new
          }
      return (df)
      }

    options(scipen = 15) # to avoid writing in scientific notation
    files = list()
  
    # Get the structure of input data
    for (f in c(${_input:ar,})){
      filename = gsub("^.*/", "", f)
      directory = gsub(filename, "", f)
      if (length(files[[directory]]) == 0){
        files[[directory]] = filename
        } else {
      files[[directory]] = append(files[[directory]], filename)
      }
    }

    batch_junction_list = list()
    res = list()
  
    # Since prepareJunctionQuant() somehow don't read file names with path, if orinigal data are from the same directory, 
    #       just run it and write to the output folder, else run on each directory seperately and merge the junction tables.
    if (length(files) == 1) {
        setwd(names(files)[1])
        prepareJunctionQuant(files[[1]], output = '${cwd}/psichomics_junctions.txt')
    } else {
        for (i in 1: (length(files))) {
        d = names(files[i])
        setwd(d)
        output_name = sprintf("${cwd}/psichomics_junctions_%s.txt",
                               tail(as.list(strsplit(d, '/')[[1]]), 1))
        prepareJunctionQuant(files[[d]], output = output_name)
        batch_junction_list = append(batch_junction_list, output_name)
        }
  
        for (file_name in batch_junction_list) {
        res[[file_name]] = read.table(file_name, sep = '\t', header = TRUE)
        }
  
        # save the information of originally created NA if any
        res = lapply(res, df_replace, old = NA, new = "original_na")

        # merge the junction tables and pad with 0
        res = res %>% reduce(full_join, by = "Junction.ID")
        res[is.na(res)] <- 0

        # fill the original NAs back
        res = df_replace(res, "original_na", NA)
        write.table(res, file = '${cwd}/psichomics_junctions.txt', quote = FALSE, sep = '\t', row.names = FALSE)
  
    }
  
    options(scipen = 0)
[psichomics_2]
# splicing annotation for psichomics
parameter: splicing_annotation = path
output: f'{cwd}/psi_raw_data.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint = entrypoint
    library("psichomics")
    library("dplyr")
    library("tidyr")
    library("purrr")    
    data <- read.table("${cwd}/psichomics_junctions.txt", sep = '\t', header = TRUE)
    
    names(data) <- sub('X', '', names(data))
    data <- data[- grep("chrM", data$Junction.ID),]
    data <- data[- grep("chrUn", data$Junction.ID),]
    data <- data[- grep("random", data$Junction.ID),]
  
    junctionQuant <- data[,-1]
    rownames(junctionQuant) <- data[,1]

    # Record of how original psichomics example did, we mimicked the format change loadLocalFiles() function did on our
    #       data to avoid bugs.
    #data <- loadLocalFiles("${cwd}")
    #junctionQuant <- data[[1]]$`Junction quantification`
    
    annotation = readRDS("${splicing_annotation:a}")
    psi <- quantifySplicing(annotation, junctionQuant)
    write.table(psi, file='${cwd}/psi_raw_data.tsv', quote=FALSE, sep='\t')