Reference Data

Miniprotocol Timing

Timing: ~4 hours

Overview

This miniprotocol uses three modules to download, index, and preprocess the reference data used throughout the pipeline. The modules are as follows:

  1. reference_data_preparation.ipynb (steps i-viii): Download and format reference files

  2. generalized_TADB.ipynb (step ix): Generate topologically associated domain files and their boundaries

  3. notebook_for_LD_block_reference_panel.ipynb (step x): Produce LD blocks and the reference panel

Steps

i. Download Reference Data

sos run reference_data.ipynb download_hg_reference --cwd ../reference_data
sos run reference_data.ipynb download_gene_annotation --cwd ../reference_data
sos run reference_data.ipynb download_ercc_reference --cwd ../reference_data
sos run reference_data.ipynb download_dbsnp --cwd ../reference_data
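Before moving on, it can help to confirm that the downloads completed. The following is a minimal sketch, not part of the pipeline: the helper `check_files` and the `REF_DIR` variable are hypothetical, and the file names are assumed from the inputs that steps ii, iii and viii expect rather than from the download workflows themselves.

```shell
# Hypothetical helper: report which expected files are missing or empty.
# Usage: check_files DIR FILE...
check_files() {
    local dir=$1; shift
    local missing=0
    local f
    for f in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# File names assumed from the inputs of steps ii, iii and viii.
ref_dir=${REF_DIR:-../reference_data}
check_files "$ref_dir" \
    GRCh38_full_analysis_set_plus_decoy_hla.fa \
    ERCC92.fa ERCC92.gtf \
    Homo_sapiens.GRCh38.103.chr.gtf \
    00-All.vcf.gz \
    && echo "all downloads present" \
    || echo "some downloads are missing; re-run the corresponding download step"
```

The check only tests that each file exists and is non-empty; it does not validate checksums or file contents.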

ii. Format Reference Data

sos run reference_data.ipynb hg_reference \
    --cwd ../reference_data \
    --ercc-reference ../reference_data/ERCC92.fa \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest 

iii. Format Gene Feature Data

sos run reference_data.ipynb gene_annotation \
    --cwd ../reference_data \
    --ercc-gtf ../reference_data/ERCC92.gtf \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest 

iv. Generate STAR Index

sos run reference_data.ipynb STAR_index \
    --cwd ../reference_data \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest 

v. Generate RSEM Index

sos run reference_data.ipynb RSEM_index \
    --cwd ../reference_data \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest 
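The file names passed between steps follow a visible pattern: each later step takes an input whose name is the raw file name plus a suffix. The sketch below reconstructs those names with shell parameter expansion so they can be reused in scripts. The variable names are hypothetical, and the suffixes are copied from the commands in this protocol, not read from the workflow code, so treat the mapping as an assumption.

```shell
# Raw inputs as named in steps ii and iii.
raw_fasta=GRCh38_full_analysis_set_plus_decoy_hla.fa
raw_gtf=Homo_sapiens.GRCh38.103.chr.gtf

# Strip the extension, then append the suffixes seen in the protocol commands.
fasta_stem=${raw_fasta%.fa}
gtf_stem=${raw_gtf%.gtf}

filtered_fasta=${fasta_stem}.noALT_noHLA_noDecoy.fasta      # --hg-reference in step iii
spiked_fasta=${fasta_stem}.noALT_noHLA_noDecoy_ERCC.fasta   # --hg-reference in steps iv and v
spiked_gtf=${gtf_stem}.reformatted.ERCC.gtf                 # --hg-gtf in step v and later

printf '%s\n' "$filtered_fasta" "$spiked_fasta" "$spiked_gtf"
```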

vi. Generate RefFlat Annotation for Picard

sos run reference_data.ipynb RefFlat_generation \
    --cwd ../reference_data \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest 

vii. Generate SUPPA Annotation for Psichomics

sos run reference_data.ipynb SUPPA_annotation \
    --cwd ../reference_data \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/psichomics_apptainer:latest 

viii. Extract rsIDs for Known Variants

sos run VCF_QC.ipynb dbsnp_annotate \
    --genoFile ../reference_data/00-All.vcf.gz \
    --cwd ../reference_data \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest 

ix. Generate Topologically Associated Domains and Their Boundaries

# interactive notebook
generalized_TAD.ipynb

x. Produce LD Blocks and Reference Panel

FIXME

Anticipated Results

Our pipeline uses the following reference data for RNA-seq expression quantification:

  1. GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}

  2. Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for unstranded protocol.

  3. Everything under the STAR_Index folder

  4. Everything under the RSEM_Index folder

  5. Optionally, for quality control, gtf_ref.flat
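Because the gene-level GTF differs between stranded and unstranded protocols, a downstream script may want to pick the file from a single setting. A minimal sketch follows; the function `select_gtf` is hypothetical and simply encodes the two file names from the list above.

```shell
# Hypothetical helper: print the gene-level GTF for a given library protocol.
select_gtf() {
    case "$1" in
        stranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf" ;;
        unstranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf" ;;
        *)
            # Reject anything else rather than guessing a default.
            echo "usage: select_gtf stranded|unstranded" >&2
            return 1 ;;
    esac
}

gtf=$(select_gtf stranded)
echo "$gtf"
```

Rejecting unknown values, instead of silently falling back to one protocol, avoids quantifying with the wrong annotation.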

The following reference files are used for methylation:

  1. To be added by Alexandre

The following reference files are used for alternative splicing:

  1. Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds for psichomics.

The following reference files describe topologically associated domains and their boundaries:

  1. generalized_TAD.tsv

  2. generalized_TADB.tsv

  3. TADB_enhanced_cis.bed

  4. extended_TADB.bed