RNA-seq expression#
Miniprotocol Timing#
Timing <3.5 hours
Overview#
This miniprotocol shows the use of various modules to prepare reference data, perform RNA-seq calling, quantify expression, conduct quality control and normalize data. The modules are as follows:
RNA_calling.ipynb
(steps i-v): Quantifying expression from RNA-seq databulk_expression_QC.ipynb
(step vi): Sample level RNA-seq quality controlbulk_expression_normalization.ipynb
(step vii): Bulk RNA-seq counts normalization
Steps#
i. Perform data quality summary via fastqc
#
sos run pipeline/RNA_calling.ipynb fastqc \
--cwd output/rnaseq/fastqc \
--samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
--data-dir protocol_data/input_data/RNASeq/fastq \
--container containers/rna_quantification.sif \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
ii. Cut adaptor (Optional)#
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
--cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
--data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--container containers/rna_quantification.sif \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat
iii. Read alignment via STAR and QC via Picard#
sos run pipeline/RNA_calling.ipynb STAR_align \
--cwd output/rnaseq --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
--data-dir protocol_data/input_data/RNASeq/fastq --STAR-index reference_data/STAR_Index/ \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--container containers/rna_quantification.sif \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat --uncompressed
iv. Call gene-level RNA expression via rnaseqc#
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd output/rnaseq \
--samples input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
--data-dir input_data/RNASeq/fastq \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
--container containers/rna_quantification.sif \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--bam_list output/rnaseq/xqtl_protocol_data_bam_list
v. Call transcript level RNA expression via RSEM#
sos run pipeline/RNA_calling.ipynb rsem_call \
--cwd output/rnaseq \
--samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
--data-dir protocol_data/input_data/RNASeq/fastq/ \
--RSEM-index reference_data/RSEM_Index/ \
--container containers/rna_quantification.sif \
--bam_list output/rnaseq/xqtl_protocol_data_bam_list
vi. Multi-sample RNA-seq QC#
sos run pipeline/bulk_expression_QC.ipynb qc \
--cwd output/rnaseq \ \
--tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
--counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
--container containers/rna_quantification.sif
vii. Multi-sample read count normalization#
sos run pipeline/bulk_expression_normalization.ipynb normalize \
--cwd output/rnaseq \
--tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--container containers/rna_quantification.sif \
--count-threshold 1 --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq
viii. Region list generation (optional)#
By default, the first 4 column of the bed.gz file contains the chr,start (TSS), end (TSS+1), and gene id of each gene. User can extract these information with a simple zcat {phenoFile}.bed.gz | cut -f 1,2,3,4
command. However, when a region list with both gene id and gene name are needed, following utilities are provided.
sos run pipeline/gene_annotation.ipynb region_list_generation \
--cwd output/rnaseq \
--phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
--annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--sample-participant-lookup reference_data/sample_participant_lookup.rnaseq \
--container containers/bioinfo.sif \
--phenotype-id-type gene_id
Anticipated Results#
The final output contained QCed and normalized expression data in a bed.gz file. This file is ready for use in TensorQTL.
Figure 1A. Bulk RNA-Seq Quality Control D-Statistic Distribution.
Figure 1B. Bulk RNA-Seq Quality Control Relative Log Expression Residuals.
Figure 1C. Bulk RNA-Seq Quality Control Mahalanobis Distance P-Value Clustering.