Here we show an example of our LDpred2 pipeline for height PRS on an example public data-set from github user @choishingwan.
Obtained via download_1000G()
in bigsnpr
.
We extracted unrelated European individuals (~500 samples) and ~1.7M SNPs in common with either HapMap3 or the UK Biobank. Classification of European population can be found at IGSR. European individuals ID are from IGSR data portal.
Height.QC.gz
, from public data-set provided by github user @choishingwan, of height GWAS in European samples.
EUR.height
, EUR.cov
, and EUR.eigenvec
contain phenotypes, covariates and genotype principle components of samples.
Auto model runs the algorithm for 30 different $p$ (the proportion of causal variants) values range from 10e-4 to 0.9, and heritability $h^2$ from LD score regression as initial value.
Grid model tries a grid of parameters $p$, ranges from 0 to 1 and three $h^2$ which are 0.7/1/1.4 times of initial $h^2$ estimated by LD score regression.
Please download the GWAS data and pipeline script to your computer. We have pre-downloaded and extracted the European genotypes from 1000 Genomes. We have preprocessed the GWAS data as follows, to fit in our pipeline.
%cd ~/Downloads/tutorial
sumstats <- bigreadr::fread2("GWAS_data/Height.QC.gz")
# LDpred2 require the header to follow the exact naming
names(sumstats) <-
c("chr",
"pos",
"rsid",
"a1",
"a0",
"n_eff",
"beta_se",
"p",
"OR",
"INFO",
"MAF")
# Transform the OR into log(OR)
sumstats$beta <- log(sumstats$OR)
saveRDS(sumstats, "GWAS_data/Height.QC.rds")
options(stringsAsFactors=F)
fam = read.table("GWAS_data/EUR.QC.fam", header=F)
colnames(fam) = c("FID", "IID", "PID", "MID", "S", "D")
pheno = read.table("GWAS_data/EUR.height", header=T)
covar = read.table("GWAS_data/EUR.cov", header=T)
pcs = read.table("GWAS_data/EUR.eigenvec", header=F)
colnames(pcs) = c("FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6")
require(dplyr)
pheno_out = left_join(fam, pheno, by = c("FID", "IID"))
pheno_out = left_join(pheno_out, covar, by = c("FID", "IID"))
pheno_out = left_join(pheno_out, pcs, by = c("FID", "IID"))
names(pheno_out)
write.table(pheno_out[, "Height"], "GWAS_data/EUR.height.matched.txt", col.names="Height",row.names=F)
write.table(pheno_out[, c("Sex",'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6')], "GWAS_data/EUR.cov.matched.txt", col.names=T,row.names=F)
The directory should have the following:
tree
We set the work directory to height_results
folder (to be created by the workflow). This will also be used as part of the filenames in the outputs to identify this analysis.
work_dir="height_results"
Here we assume the GWAS genotype data EUR.*
has already been QC-ed. We perform here QC for reference panel,
sos run ldpred.ipynb snp_qc \
--cwd $work_dir \
--genoFiles 1000G.EUR/1000G.EUR.bed
SNPs shared between summary statistics, reference panels and target genotype data (for which PRS will be computed) are extracted.
sos run ldpred.ipynb snp_intersect \
--cwd $work_dir \
--ss GWAS_data/Height.QC.rds \
--genoFiles $work_dir/1000G.EUR.$work_dir.bed GWAS_data/EUR.QC.bed
cat $work_dir/Height.QC.intersect.stdout
To handle major/minor allele, strand flips and consequently possible flips in sign for summary statistics.
sos run ldpred.ipynb snp_match \
--cwd $work_dir \
--reference_geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds \
--ss GWAS_data/Height.QC.rds
Please refer to documentation in the pipeline notebook ldpred.ipynb
for an explanation of summary statistics QC.
sos run ldpred.ipynb sumstats_qc \
--cwd $work_dir \
--reference_geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds \
--ss $work_dir/Height.QC.snp_matched.rds \
--sdy 1
%preview height_results/Height.QC.snp_matched.qc.png
As an illustration, hereafter all analysis are performed on both summary statistics before and after QC in Step 4.
sos run ldpred.ipynb ldsc \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.rds \
--reference-geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds
sos run ldpred.ipynb ldsc \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.qc.rds \
--reference-geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds
For QC-ed data, we perform 3 PRS models implemented in ldpred2
.
sos run ldpred.ipynb inf_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.qc.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds
cat $work_dir/Height.QC.snp_matched.qc.inf_prs.stdout
sos run ldpred.ipynb auto_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.qc.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds
sos run ldpred.ipynb grid_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.qc.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
For original data,
sos run ldpred.ipynb inf_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.ld.rds
cat $work_dir/Height.QC.snp_matched.inf_prs.stdout
sos run ldpred.ipynb auto_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.ld.rds
sos run ldpred.ipynb grid_prs \
--cwd $work_dir \
--ss $work_dir/Height.QC.snp_matched.rds \
--target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
--ldsc $work_dir/Height.QC.snp_matched.ld.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
For grid
model, in practice we should use another subset of individuals to train the model, independent from the subset to make PRS predictions. Here we use the same target genotype data only to illustrate the workflow.
Baseline model: Trait ~ Sex + PCs
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
res = readRDS("height_results/UKB.hdl.baseline.rds")
summary(res$fitted)
res$summary
Inf/grid/auto model: Traits ~ Sex + Age + PRS
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.qc.inf_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.inf_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.inf_prs.rds")
summary(res$fitted)
res$summary
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.inf_prs.rds")
summary(res$fitted)
res$summary
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.qc.grid_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.grid_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.grid_prs.rds")
summary(res$fitted)
res$summary
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.grid_prs.rds")
summary(res$fitted)
res$summary
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.qc.auto_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
sos run ldpred.ipynb pred_eval \
--cwd $work_dir \
--prs $work_dir/gwas_hdl.snp_matched.auto_prs.rds \
--phenoFile GWAS_data/EUR.height.matched.txt \
--covFile GWAS_data/EUR.cov.matched.txt \
--response continuous
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.auto_prs.rds")
summary(res$fitted)
res$summary
res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.auto_prs.rds")
summary(res$fitted)
res$summary
Following table shows adjusted R squared of HDL prediction model for original betas and posterior betas. QC is quality control in step 4.
Compared to baseline model, higher r-squared in ldpred model implies PRS explains part of the variation of HDL in train dataset (n=1600).
Posterior betas have greater R squared compared to originial betas which implies PRS derived from posterior betas can explain much variance of HDL compared to PRS derived from original betas.
Among Inf, Grid and Auto, phynotype model with PRS derived using Grid model has the largest R squared which implies that Ldpred-grid model can find a hyper-parameter combination in prior distribution so that the PRS derived based on this prior distribution can explain much variance of HDL.
QC? | # of SNPs | Baseline | Inf | Grid | Auto |
---|---|---|---|---|---|
Yes | 232,552 | 0.176 | 0.20773 | 0.23512 | 0.21034 |
No | 384,249 | 0.176 | 0.23821 | 0.26274 | 0.24341 |
Exported from ldpred/ldpred2_example.ipynb
committed by Gao Wang on Sun Oct 24 15:39:21 2021 revision 2, 55ca1d9