RNA-seq expression#

Miniprotocol Timing#

Timing <3.5 hours

Overview#

This miniprotocol shows the use of various modules to prepare reference data, perform RNA-seq calling, quantify expression, conduct quality control and normalize data. The modules are as follows:

  1. RNA_calling.ipynb (steps i-v): Quantifying expression from RNA-seq data

  2. bulk_expression_QC.ipynb (step vi): Sample level RNA-seq quality control

  3. bulk_expression_normalization.ipynb (step vii): Bulk RNA-seq counts normalization

Steps#

i. Perform data quality summary via fastqc#

sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --container containers/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

ii. Cut adaptor (Optional)#

sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat

iii. Read alignment via STAR and QC via Picard#

sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat --uncompressed 

iv. Call gene-level RNA expression via rnaseqc#

sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist    \
    --data-dir input_data/RNASeq/fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list

v. Call transcript level RNA expression via RSEM#

sos run pipeline/RNA_calling.ipynb rsem_call  \
    --cwd output/rnaseq   \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist  \
    --data-dir protocol_data/input_data/RNASeq/fastq/   \
    --RSEM-index reference_data/RSEM_Index/   \
    --container containers/rna_quantification.sif   \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list   

vi. Multi-sample RNA-seq QC#

sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \ \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/rna_quantification.sif 

vii. Multi-sample read count normalization#

sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/rna_quantification.sif \
    --count-threshold 1 --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

viii. Region list generation (optional)#

By default, the first 4 column of the bed.gz file contains the chr,start (TSS), end (TSS+1), and gene id of each gene. User can extract these information with a simple zcat {phenoFile}.bed.gz | cut -f 1,2,3,4 command. However, when a region list with both gene id and gene name are needed, following utilities are provided.

sos run pipeline/gene_annotation.ipynb region_list_generation \
    --cwd output/rnaseq  \
    --phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup reference_data/sample_participant_lookup.rnaseq \
    --container containers/bioinfo.sif \
    --phenotype-id-type gene_id

Anticipated Results#

The final output contained QCed and normalized expression data in a bed.gz file. This file is ready for use in TensorQTL.

Figure 1A. Bulk RNA-Seq Quality Control D-Statistic Distribution.

Figure 1B. Bulk RNA-Seq Quality Control Relative Log Expression Residuals.

Figure 1C. Bulk RNA-Seq Quality Control Mahalanobis Distance P-Value Clustering.