# Alternative splicing from RNA-seq data



#### Miniprotocol Timing

Timing <2 hours

## Overview

Before quantifying splicing, run the following modules to prepare the data:
1. `molecular_phenotypes/calling/RNA_calling.ipynb` (step i): Generate data quality summary with fastqc
2. `molecular_phenotypes/calling/RNA_calling.ipynb` (step ii): Trim adaptors
3. `molecular_phenotypes/calling/RNA_calling.ipynb` (step iii): Align RNA-seq reads with STAR, using the WASP option specifically for splicing data

This miniprotocol covers the modules for splicing quantification, normalization, and post-processing:
1. `molecular_phenotypes/calling/splicing_calling.ipynb` (step i): Quantify splicing with leafcutter or psichomics
2. `molecular_phenotypes/QC/splicing_normalization.ipynb` (step ii): Quality control and normalization of splicing data
3. `data_preprocessing/phenotype/gene_annotation.ipynb` (step iii): Process splicing data for use in TensorQTL

## Steps

## i. Splicing Quantification 
### a. LeafCutter

### Intron usage ratio quantification via `leafCutter`
*  `input`: a metadata file listing the locations of all Aligned.sortedByCoord.out.bam files to be analyzed.
*  `output`: a file of intron usage ratios, with a name ending in "_intron_usage_perind.counts.gz"
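
Before launching the quantification, it can help to confirm that every file named in the sample list actually exists. The sketch below is a hypothetical helper (`check_listed_files` is not part of the pipeline). It assumes only that the list holds whitespace-separated fields, that some of those fields are paths ending in a known suffix, and that the paths contain no spaces and resolve from the current directory.

```shell
# Hypothetical pre-flight helper; not part of the xQTL pipeline.
# Usage: check_listed_files <sample_list> <suffix>
# Scans every whitespace-separated field in <sample_list> and, for fields
# ending in <suffix>, reports any that do not exist on disk.
check_listed_files() {
    list="$1"
    suffix="$2"
    rc=0
    # Word splitting is intentional: assumes paths contain no spaces.
    for token in $(cat "$list"); do
        case "$token" in
            *"$suffix")
                if [ ! -e "$token" ]; then
                    echo "missing: $token" >&2
                    rc=1
                fi
                ;;
        esac
    done
    return "$rc"
}

# Example (sample list path taken from the command below):
# check_listed_files output/rnaseq/xqtl_protocol_data_bam_list .bam
```

The function returns a non-zero status when any listed file is missing, so it can gate the `sos run` call in a wrapper script.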

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/leafcutter.sif 
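
Once the run finishes, a quick look at the first few lines confirms that the ratio file was written and is readable. This is a minimal sketch; `peek_gz` is a hypothetical helper name, and the example path is assumed from the `--ratios` argument used in the normalization step.

```shell
# Hypothetical helper: print the first N lines (default 5) of a gzipped file.
peek_gz() {
    gzip -dc "$1" | head -n "${2:-5}"
}

# Example (path assumed from the --ratios argument in step ii):
# peek_gz output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz
```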

### b. Psichomics

### Percent Spliced In (PSI) quantification for alternative splicing events via `Psichomics`
*  `input`: a metadata file listing the locations of all SJ.out.tab files to be analyzed.
*  `output`: psi_raw_data.tsv, which contains percent spliced in (PSI) values for each alternative splicing event

In [None]:
sos run pipeline/splicing_calling.ipynb psichomics \
    --cwd output/psichomics/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container containers/psichomics.sif
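
A quick dimension check on the resulting PSI table can catch an empty or truncated run. The sketch below makes no assumption about what the rows and columns mean; it only counts lines and tab-separated fields. `tsv_dims` is a hypothetical helper, and the example path assumes the table is written under the `--cwd` directory.

```shell
# Hypothetical helper: print "<rows> <columns>" for a tab-separated file,
# where <columns> is the number of fields on the first line.
tsv_dims() {
    awk -F'\t' 'NR == 1 { cols = NF } END { print NR, cols + 0 }' "$1"
}

# Example (location assumed from --cwd above):
# tsv_dims output/psichomics/psi_raw_data.tsv
```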

## ii. Splicing QC and Normalization
### a. LeafCutter

### QC and Normalization of LeafCutter outputs
*  `input`: the "_intron_usage_perind.counts.gz" file from the previous step
*  `output`: a QC'd and normalized phenotype table with a name ending in "qqnorm.txt"

Note that the ratio file passed to `leafcutter_norm` via `--ratios` is the one without the `numers` tag in its filename.
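
The quantification step leaves two similarly named per-individual count files in the output directory: the ratio file, and a file tagged `_numers` (the one passed to `--intron_count` in step iii). A minimal sketch for picking the ratio file programmatically follows; `find_ratio_file` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: list per-individual count files in a directory,
# skipping the one tagged "_numers" so only the ratio file remains.
find_ratio_file() {
    for f in "$1"/*_perind*.counts.gz; do
        [ -e "$f" ] || continue      # glob matched nothing
        case "$f" in
            *_numers*) ;;            # skip the "numers" count file
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Example: find_ratio_file output/leaf_cutter
```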

In [None]:

sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/leaf_cutter/ \
    --ratios output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz \
    --container containers/leafcutter.sif 



### b. Psichomics

### QC and Normalization of psichomics outputs
*  `input`: the "psi_raw_data.tsv" file from the previous step
*  `output`: a QC'd and normalized phenotype table with a name ending in "qqnorm.txt"

In [None]:
sos run pipeline/splicing_normalization.ipynb psichomics_norm \
    --cwd psichomics_output \
    --ratios psichomics_output/psi_raw_data.tsv \
    --container containers/psichomics.sif

## iii. Post Processing for TensorQTL
### a. LeafCutter

### Post-processing of LeafCutter outputs to make them TensorQTL-ready
*  `input`: the outputs of the previous two steps (the `numers` intron count file and the "qqnorm.txt" table) and the GTF annotation file.
*  `output`: a BED-format file with a name ending in "formated.bed.gz"

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/leaf_cutter/ \
    --intron_count output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind_numers.counts.gz \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/bioinfo.sif \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

### b. Psichomics

### Post-processing of psichomics outputs to make them TensorQTL-ready
*  `input`: the "qqnorm.txt" output from the previous step and the GTF annotation file.
*  `output`: a BED-format file with a name ending in "formated.bed.gz"

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_psichomics_isoforms \
    --cwd psichomics_output \
    --phenoFile psichomics_output/psichomics_raw_data_bedded.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/bioinfo.sif

## Anticipated Results

The final outputs are QC'd and normalized splicing phenotype tables from LeafCutter and psichomics, in BED format and ready for use in TensorQTL.
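
As a final sanity check, the non-empty BED outputs can be listed by their filename suffix. This is a sketch only; `list_bed_outputs` is a hypothetical helper, and the directories in the example are the `--cwd` values used in step iii.

```shell
# Hypothetical helper: list non-empty files under a directory whose names
# end in "formated.bed.gz" (the suffix given for the step iii outputs).
list_bed_outputs() {
    find "$1" -type f -name '*formated.bed.gz' -size +0 | sort
}

# Example:
# list_bed_outputs output/leaf_cutter
# list_bed_outputs psichomics_output
```

An empty listing suggests that the annotation step did not complete or wrote its output elsewhere.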