Polygenic Risk Score Analysis on Height GWAS data¶

Here we show an example of our LDpred2 pipeline for height PRS on an example public data-set from github user @choishingwan.

Data used¶

Reference panel¶

Obtained via download_1000G() in bigsnpr.

We extracted unrelated European individuals (~500 samples) and ~1.7M SNPs in common with either HapMap3 or the UK Biobank. Classification of European population can be found at IGSR. European individuals ID are from IGSR data portal.

GWAS summary statistics data¶

Height.QC.gz, from public data-set provided by github user @choishingwan, of height GWAS in European samples.

Target test data¶

EUR.height, EUR.cov, and EUR.eigenvec contain phenotypes, covariates and genotype principle components of samples.

PRS Models applied¶

Auto model runs the algorithm for 30 different $p$ (the proportion of causal variants) values range from 10e-4 to 0.9, and heritability $h^2$ from LD score regression as initial value.

Grid model tries a grid of parameters $p$, ranges from 0 to 1 and three $h^2$ which are 0.7/1/1.4 times of initial $h^2$ estimated by LD score regression.

Data preparation¶

Please download the GWAS data and pipeline script to your computer. We have pre-downloaded and extracted the European genotypes from 1000 Genomes. We have preprocessed the GWAS data as follows, to fit in our pipeline.

%cd ~/Downloads/tutorial

/home/gw/Downloads/tutorial

Summary statistics formatting¶

sumstats <- bigreadr::fread2("GWAS_data/Height.QC.gz") 
# LDpred2 require the header to follow the exact naming
names(sumstats) <-
    c("chr",
    "pos",
    "rsid",
    "a1",
    "a0",
    "n_eff",
    "beta_se",
    "p",
    "OR",
    "INFO",
    "MAF")
# Transform the OR into log(OR)
sumstats$beta <- log(sumstats$OR)

saveRDS(sumstats, "GWAS_data/Height.QC.rds")

Phenotype and covariates formatting¶

options(stringsAsFactors=F)
fam = read.table("GWAS_data/EUR.QC.fam", header=F)
colnames(fam) = c("FID", "IID", "PID", "MID", "S", "D")
pheno = read.table("GWAS_data/EUR.height", header=T)
covar = read.table("GWAS_data/EUR.cov", header=T)
pcs = read.table("GWAS_data/EUR.eigenvec", header=F)
colnames(pcs) = c("FID", "IID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6")

require(dplyr)
pheno_out = left_join(fam, pheno, by = c("FID", "IID"))
pheno_out = left_join(pheno_out, covar, by = c("FID", "IID"))
pheno_out = left_join(pheno_out, pcs, by = c("FID", "IID"))

names(pheno_out)

write.table(pheno_out[, "Height"], "GWAS_data/EUR.height.matched.txt", col.names="Height",row.names=F)
write.table(pheno_out[, c("Sex",'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6')], "GWAS_data/EUR.cov.matched.txt", col.names=T,row.names=F)

The directory should have the following:

tree

.
├── 1000G.EUR
│   ├── 1000G.EUR.bed
│   ├── 1000G.EUR.bim
│   └── 1000G.EUR.fam
├── 1000G.EUR...bed
├── 1000G.EUR...bim
├── 1000G.EUR...fam
├── 1000G.EUR...log
├── 1000G.EUR...stdout
├── GWAS_data
│   ├── EUR.cov
│   ├── EUR.eigenvec
│   ├── EUR.height
│   ├── EUR.QC.bed
│   ├── EUR.QC.bim
│   ├── EUR.QC.fam
│   ├── Height.QC.gz
│   └── Height.QC.rds
└── ldpred.ipynb

2 directories, 17 files

Analysis of height GWAS data¶

We set the work directory to height_results folder (to be created by the workflow). This will also be used as part of the filenames in the outputs to identify this analysis.

work_dir="height_results"

Step 1: QC on reference panel¶

Here we assume the GWAS genotype data EUR.* has already been QC-ed. We perform here QC for reference panel,

sos run ldpred.ipynb snp_qc \
    --cwd $work_dir \
    --genoFiles 1000G.EUR/1000G.EUR.bed

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.bed
INFO: Workflow snp_qc (ID=w015cfd3cc381496b) is executed successfully with 1 completed step.

Step 2: Intersect SNPs among summary stats, reference panel and target data¶

SNPs shared between summary statistics, reference panels and target genotype data (for which PRS will be computed) are extracted.

sos run ldpred.ipynb snp_intersect \
    --cwd $work_dir \
    --ss GWAS_data/Height.QC.rds \
    --genoFiles $work_dir/1000G.EUR.$work_dir.bed GWAS_data/EUR.QC.bed

INFO: Running snp_intersect_1: SNP intersect of summary stats and genotype data
INFO: snp_intersect_1 is completed.
INFO: snp_intersect_1 output:   /home/gw/Downloads/tutorial/height_results/Height.QC.intersect.rds /home/gw/Downloads/tutorial/height_results/Height.QC.intersect.snplist
INFO: Running snp_intersect_2: 
INFO: snp_intersect_2 is completed (pending nested workflow).
INFO: Running preprocess_1: Filter SNPs and select individuals
INFO: preprocess_1 (index=1) is completed.
INFO: preprocess_1 (index=0) is completed.
INFO: preprocess_1 output:   /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.snp_intersect.extracted.bed /home/gw/Downloads/tutorial/height_results/EUR.QC.snp_intersect.extracted.bed in 2 groups
INFO: Running convert PLNIK to bigsnpr format with missing data mean imputed: 
INFO: convert PLNIK to bigsnpr format with missing data mean imputed (index=0) is completed.
INFO: convert PLNIK to bigsnpr format with missing data mean imputed (index=1) is completed.
INFO: convert PLNIK to bigsnpr format with missing data mean imputed output:   /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.snp_intersect.extracted.bk /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.snp_intersect.extracted.rds... (4 items in 2 groups)
INFO: snp_intersect_2 output:   /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.snp_intersect.extracted.bed /home/gw/Downloads/tutorial/height_results/1000G.EUR.height_results.snp_intersect.extracted.rds... (4 items)
INFO: Workflow snp_intersect (ID=w9482ac12438e4cc8) is executed successfully with 4 completed steps and 6 completed substeps.

cat $work_dir/Height.QC.intersect.stdout

[1] "There are 400400 shared SNPs."

Step 3: Harmonize alleles for shared SNPs¶

To handle major/minor allele, strand flips and consequently possible flips in sign for summary statistics.

sos run ldpred.ipynb snp_match \
    --cwd $work_dir \
    --reference_geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds \
    --ss GWAS_data/Height.QC.rds

INFO: Running snp_match: 
INFO: snp_match is completed.
INFO: snp_match output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.rds /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.snplist
INFO: Workflow snp_match (ID=waa8f5f3cb5665126) is executed successfully with 1 completed step.

Step 4: Summary statistics quality control (optional)¶

Please refer to documentation in the pipeline notebook ldpred.ipynb for an explanation of summary statistics QC.

sos run ldpred.ipynb sumstats_qc \
    --cwd $work_dir \
    --reference_geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds \
    --ss $work_dir/Height.QC.snp_matched.rds \
    --sdy 1

INFO: Running sumstats_qc: 
INFO: sumstats_qc is completed.
INFO: sumstats_qc output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.qc.png /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.qc.snplist... (3 items)
INFO: Workflow sumstats_qc (ID=wdbbf3e7458e2f8a0) is executed successfully with 1 completed step.

%preview height_results/Height.QC.snp_matched.qc.png

Step 5: Calculate LD matrix and fit LDSC model¶

As an illustration, hereafter all analysis are performed on both summary statistics before and after QC in Step 4.

sos run ldpred.ipynb ldsc \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.rds \
    --reference-geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds

INFO: Running ldsc: 
INFO: ldsc is completed.
INFO: ldsc output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.ld.rds
INFO: Workflow ldsc (ID=we9a3f5af1a3a1a9c) is executed successfully with 1 completed step.

sos run ldpred.ipynb ldsc \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.qc.rds \
    --reference-geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds

INFO: Running ldsc: 
INFO: ldsc is completed.
INFO: ldsc output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.qc.ld.rds
INFO: Workflow ldsc (ID=wbacc9e382c830fa4) is executed successfully with 1 completed step.

Step 6: Estimate posterior effect sizes and PRS¶

For QC-ed data, we perform 3 PRS models implemented in ldpred2.

sos run ldpred.ipynb inf_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.qc.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds

INFO: Running inf_prs: 
INFO: inf_prs is completed (pending nested workflow).
INFO: Running prs_core: 
INFO: prs_core is completed.
INFO: prs_core output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.qc.inf_prs.rds
INFO: inf_prs output:   /home/gw/Downloads/tutorial/height_results/Height.QC.snp_matched.qc.inf_prs.rds
INFO: Workflow inf_prs (ID=w3fac9d56baf0b108) is executed successfully with 2 completed steps.

cat $work_dir/Height.QC.snp_matched.qc.inf_prs.stdout

[1] "378890 SNPs are used for PRS calculations"

sos run ldpred.ipynb auto_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.qc.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds

INFO: Running auto_prs: 
INFO: auto_prs is completed (pending nested workflow).
INFO: Running prs_core:

sos run ldpred.ipynb grid_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.qc.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.qc.ld.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

For original data,

sos run ldpred.ipynb inf_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.ld.rds

cat $work_dir/Height.QC.snp_matched.inf_prs.stdout

sos run ldpred.ipynb auto_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.ld.rds

sos run ldpred.ipynb grid_prs \
    --cwd $work_dir \
    --ss $work_dir/Height.QC.snp_matched.rds \
    --target-geno $work_dir/EUR.QC.snp_intersect.extracted.rds \
    --ldsc $work_dir/Height.QC.snp_matched.ld.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

For grid model, in practice we should use another subset of individuals to train the model, independent from the subset to make PRS predictions. Here we use the same target genotype data only to illustrate the workflow.

Step 7: predict phenotypes¶

Baseline model: Trait ~ Sex + PCs

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.baseline.rds
INFO: Workflow pred_eval (ID=w2415704657bde76b) is executed successfully with 1 completed step.

res = readRDS("height_results/UKB.hdl.baseline.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7973 -0.6148 -0.0072  0.5943  3.2185 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.083025   0.177526  -0.468   0.6401    
AGE          0.007225   0.003028   2.386   0.0171 *  
SEX         -0.857958   0.046395 -18.492   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9257 on 1597 degrees of freedom
Multiple R-squared:  0.177,	Adjusted R-squared:  0.176 
F-statistic: 171.7 on 2 and 1597 DF,  p-value: < 2.2e-16

Inf/grid/auto model: Traits ~ Sex + Age + PRS

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.qc.inf_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.inf_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

INFO: Running pred_eval: 
INFO: pred_eval (index=0) is ignored due to saved signature
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.inf_prs.rds
INFO: Workflow pred_eval (ID=we0f848b067c83826) is ignored with 1 ignored step.
INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.inf_prs.rds
INFO: Workflow pred_eval (ID=wbb35457749543c1e) is executed successfully with 1 completed step.

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.inf_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9274 -0.5944  0.0028  0.5898  3.5610 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.515490   0.182145  -2.830  0.00471 ** 
AGE          0.006138   0.002972   2.065  0.03907 *  
SEX         -0.857887   0.045493 -18.858  < 2e-16 ***
PRS         -0.664444   0.082400  -8.064 1.44e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9076 on 1596 degrees of freedom
Multiple R-squared:  0.2092,	Adjusted R-squared:  0.2077 
F-statistic: 140.8 on 3 and 1596 DF,  p-value: < 2.2e-16

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.inf_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9751 -0.5807 -0.0083  0.5798  3.7434 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.433701   0.173408  -2.501   0.0125 *  
AGE          0.006043   0.002913   2.074   0.0382 *  
SEX         -0.857926   0.044609 -19.232   <2e-16 ***
PRS         -0.787219   0.068654 -11.466   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.89 on 1596 degrees of freedom
Multiple R-squared:  0.2396,	Adjusted R-squared:  0.2382 
F-statistic: 167.7 on 3 and 1596 DF,  p-value: < 2.2e-16

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.qc.grid_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.grid_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.grid_prs.rds
INFO: Workflow pred_eval (ID=w572e9a7b247e7f40) is executed successfully with 1 completed step.
INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.grid_prs.rds
INFO: Workflow pred_eval (ID=w91862006b36b43f9) is executed successfully with 1 completed step.

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.grid_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9274 -0.6107 -0.0004  0.5825  3.7149 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.316030   0.172306  -1.834   0.0668 .  
AGE          0.006228   0.002919   2.134   0.0330 *  
SEX         -0.852849   0.044702 -19.079   <2e-16 ***
PRS         -0.802795   0.071950 -11.158   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8918 on 1596 degrees of freedom
Multiple R-squared:  0.2366,	Adjusted R-squared:  0.2351 
F-statistic: 164.8 on 3 and 1596 DF,  p-value: < 2.2e-16

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.grid_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0014 -0.5953 -0.0181  0.5609  3.8511 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.367020   0.169185  -2.169   0.0302 *  
AGE          0.006092   0.002865   2.126   0.0337 *  
SEX         -0.847968   0.043891 -19.320   <2e-16 ***
PRS         -1.046833   0.076154 -13.746   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8756 on 1596 degrees of freedom
Multiple R-squared:  0.2641,	Adjusted R-squared:  0.2627 
F-statistic: 190.9 on 3 and 1596 DF,  p-value: < 2.2e-16

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.qc.auto_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/gwas_hdl.snp_matched.auto_prs.rds \
    --phenoFile GWAS_data/EUR.height.matched.txt \
    --covFile GWAS_data/EUR.cov.matched.txt \
    --response continuous

INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.auto_prs.rds
INFO: Workflow pred_eval (ID=w9bf9aef80a3a5b7d) is executed successfully with 1 completed step.
INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/gw/tmp/18-May-2021/mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.auto_prs.rds
INFO: Workflow pred_eval (ID=w099507f42af13c80) is executed successfully with 1 completed step.

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.qc.auto_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9925 -0.6067  0.0001  0.5854  3.5452 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.491939   0.180478  -2.726  0.00649 ** 
AGE          0.006317   0.002966   2.130  0.03336 *  
SEX         -0.853595   0.045421 -18.793  < 2e-16 ***
PRS         -1.333720   0.158835  -8.397  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9061 on 1596 degrees of freedom
Multiple R-squared:  0.2118,	Adjusted R-squared:  0.2103 
F-statistic:   143 on 3 and 1596 DF,  p-value: < 2.2e-16

res = readRDS("mvp_gwas/UKB.hdl.gwas_hdl.snp_matched.auto_prs.rds")
summary(res$fitted)
res$summary

Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0331 -0.5738 -0.0088  0.5593  3.7989 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.405491   0.172225  -2.354   0.0187 *  
AGE          0.006289   0.002903   2.167   0.0304 *  
SEX         -0.851866   0.044459 -19.161   <2e-16 ***
PRS         -1.332495   0.111293 -11.973   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.887 on 1596 degrees of freedom
Multiple R-squared:  0.2448,	Adjusted R-squared:  0.2434 
F-statistic: 172.5 on 3 and 1596 DF,  p-value: < 2.2e-16

Results summary¶

Following table shows adjusted R squared of HDL prediction model for original betas and posterior betas. QC is quality control in step 4.

Compared to baseline model, higher r-squared in ldpred model implies PRS explains part of the variation of HDL in train dataset (n=1600).

Posterior betas have greater R squared compared to originial betas which implies PRS derived from posterior betas can explain much variance of HDL compared to PRS derived from original betas.

Among Inf, Grid and Auto, phynotype model with PRS derived using Grid model has the largest R squared which implies that Ldpred-grid model can find a hyper-parameter combination in prior distribution so that the PRS derived based on this prior distribution can explain much variance of HDL.

QC?	# of SNPs	Baseline	Inf	Grid	Auto
Yes	232,552	0.176	0.20773	0.23512	0.21034
No	384,249	0.176	0.23821	0.26274	0.24341

model	R2	MSE
<chr>	<dbl>	<dbl>
model	0.17597	0.95412

model	R2	MSE
<chr>	<dbl>	<dbl>
model.inf_prs	0.20773	0.91876

model	R2	MSE
<chr>	<dbl>	<dbl>
model.inf_prs	0.23821	0.89215

model	R2	MSE
<chr>	<dbl>	<dbl>
model.grid_prs	0.23512	0.89653

model	R2	MSE
<chr>	<dbl>	<dbl>
model.grid_prs	0.26274	0.86741