Reference Data#
Miniprotocol Timing#
Timing ~4 hours
Overview#
This miniprotocol shows the use of various modules to download, index and preprocess reference data for use throughout the pipeline. The modules are as follows:
reference_data_preparation.ipynb (steps i-viii): download and format reference files
generalized_TADB.ipynb (step ix): generate topologically associated domain files and their boundaries
notebook_for_LD_block_reference_panel.ipynb (step x): production of LD blocks and reference panel
Steps#
i. Download Reference Data#
sos run reference_data.ipynb download_hg_reference --cwd ../reference_data
sos run reference_data.ipynb download_gene_annotation --cwd ../reference_data
sos run reference_data.ipynb download_ercc_reference --cwd ../reference_data
sos run reference_data.ipynb download_dbsnp --cwd ../reference_data
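The four download commands above are independent of each other, so they can be wrapped in a small loop. The sketch below is only an illustration: `download_all_references` and the `DRY_RUN` variable are hypothetical names, not part of the pipeline, and the workflow names and `--cwd` value are copied from the commands above.

```shell
# Hypothetical wrapper (not part of the pipeline): run the four download
# workflows from step i in order, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
download_all_references() {
    dar_cwd="${1:-../reference_data}"
    for dar_workflow in download_hg_reference download_gene_annotation \
                        download_ercc_reference download_dbsnp; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Preview only: show the command that would be run
            echo "sos run reference_data.ipynb $dar_workflow --cwd $dar_cwd"
        else
            sos run reference_data.ipynb "$dar_workflow" --cwd "$dar_cwd" || return 1
        fi
    done
}

# Example: preview the commands without running them
# DRY_RUN=1 download_all_references ../reference_data
```

Running the downloads one at a time, as in the original commands, is equivalent; the loop only saves retyping and stops early if one download fails.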
ii. Format Reference Data#
sos run reference_data.ipynb hg_reference \
--cwd ../reference_data \
--ercc-reference ../reference_data/ERCC92.fa \
--hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
iii. Format Gene Feature Data#
sos run reference_data.ipynb gene_annotation \
--cwd ../reference_data \
--ercc-gtf ../reference_data/ERCC92.gtf \
--hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
--stranded \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
iv. Generate STAR Index#
sos run reference_data.ipynb STAR_index \
--cwd ../reference_data \
--hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
v. Generate RSEM Index#
sos run reference_data.ipynb RSEM_index \
--cwd ../reference_data \
--hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
vi. Generate RefFlat Annotation for Picard#
sos run reference_data.ipynb RefFlat_generation \
--cwd ../reference_data \
--hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
vii. Generate SUPPA Annotation for Psichomics#
sos run reference_data.ipynb SUPPA_annotation \
--cwd ../reference_data \
--hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--container oras://ghcr.io/cumc/psichomics_apptainer:latest
viii. Extract rsIDs for known variants#
sos run VCF_QC.ipynb dbsnp_annotate \
--genoFile ../reference_data/00-All.vcf.gz \
--cwd ../reference_data \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
ix. Generation of topologically associated domains and their boundaries#
# interactive notebook
generalized_TAD.ipynb
x. Production of LD blocks and reference panel#
FIXME
Anticipated Results#
Our pipeline uses the following reference data for RNA-seq expression quantification:
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}
Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for unstranded protocol.
Everything under the STAR_Index folder
Everything under the RSEM_Index folder
Optionally, for quality control, gtf_ref.flat
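Before moving on to quantification, it can help to confirm that the files listed above are present and non-empty. The sketch below assumes they were written to ../reference_data (the `--cwd` used in the steps); `missing_reference_files` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): print every file that is
# missing or empty under a directory; the return status is 1 if any is missing.
# Usage: missing_reference_files DIR FILE [FILE ...]
missing_reference_files() {
    mrf_dir="$1"
    shift
    mrf_status=0
    for mrf_file in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$mrf_dir/$mrf_file" ]; then
            printf '%s\n' "$mrf_file"
            mrf_status=1
        fi
    done
    return "$mrf_status"
}

# Example (output location ../reference_data is an assumption):
# missing_reference_files ../reference_data \
#     GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
#     GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta.fai \
#     GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.dict \
#     Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf
```

An empty output with a zero exit status means every listed file was found.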
The following reference files are used for methylation:
To be added by Alexandre
The following reference files are used for alternative splicing:
Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds for psichomics.
The following reference files are used for topologically associated domain and boundary files:
generalized_TAD.tsv
generalized_TADB.tsv
TADB_enhanced_cis.bed
extended_TADB.bed