Independent list of variants using LD clumping#

Description#

This notebook creates a list of randomly selected variants that are in weak LD with each other, for given reference data-set. This is useful for the downstreams mashr analysis where a list of independent variants are used.

Input#

  • Reference genotype, in this case, our ROSMAP European ancestry genotype of ~1,000 individuals, ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed

  • A list of approximately independent LD blocks, which is optional but when it is included, we can parallel the LD clumping to all LD blocks which will speed up the process. This list is available from: gaow/LDblocks_GRCh38 (a forked repo created from work described at https://www.biorxiv.org/content/10.1101/2022.03.04.483057v2)

Output#

  • A list of independent variants in bed.gz format

Minimal Working Example Steps#

i. Extract genotypes per LD block#

Using the genotype_formatting.ipynb pipeline,

FIXME: this is using the original bed file for blocks, which is a lot cleaner. Please test this

sos run pipeline/genotype_formatting.ipynb genotype_by_region \
    --region-list data/references/deCODE_EUR_LD_blocks.bed \
    --genoFile data/genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed \
    --cwd output/geno_by_LDblock \
    --container /mnt/vast/hpc/csg/containers_xqtl/bioinfo.sif \
    --mem 200G -s build
INFO: Running genotype_by_region_1: 

ii. LD pruning within each LD block#

We create a small pipeline for the purpose,

[global]
parameter: cwd = path("output")
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 1
parameter: walltime = "1h"
parameter: mem = "80G"
parameter: numThreads = 10
[LD_pruning_1]
parameter: genotype_list = path
parameter: window = 50
parameter: shift = 10
parameter: r2 = 1e-3

import pandas as pd
geno_path = pd.read_csv(genotype, sep = "\t")
input_df = geno_path.values.tolist()
input_blocks = [x[0] for x in input_df]
input_files = [x[1:] for x in input_df] 

input: input_files, group_by = 1, group_with = "input_blocks"
output: prune = f'{cwd}/{step_name[:-2]}/{_input_blocks}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    plink2 \
        --bfile  ${_input:anr} \
        --indep-pairwise  ${window} ${shift} ${r2} \
        --rm-dup force-first \
        --out ${_output["prune"]:annr} \
        --threads ${numThreads}
[LD_pruning_2]
input: group_by = "all"
output: f'{cwd}/{step_name[:-2]}/LD_pruned_variants.txt'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    library(data.table)
    library(dplyr)
    library(stringr)
    per_block_variants = c(${_input:ar,})
    uncorrelated_variants = as.data.frame(rbindlist(lapply(per_block_variants, fread, header = FALSE)))
    random_variants  = str_split(gsub("_",":",uncorrelated_variants$V1),":",simplify=TRUE)%>%data.frame()%>%
                    rename("chrom" = "X1","pos" = "X2","alt" = "X3","ref"="X4")%>%
                    mutate(variant_id = paste(chrom, pos, alt, ref, sep = ":"))
    fwrite(random_variants, ${_output:r}, row.names=F,sep="\t",quote=F)
sos run ld_prune_reference.ipynb LD_pruning \
   --cwd output/ROSMAP_LD_pruned \

FIXME: please note that I uused LD pruning step 2 to merge them. Please test it. Also you did not show how to actually run this. Please add it above

random_variants = data.table::fread("LD_pruned_variants.txt")
head(random_variants)
A data.frame: 6 × 5
chromposaltrefvariant_id
<chr><chr><chr><chr><chr>
1chr116206TAchr1:16206:T:A
2chr116433CGchr1:16433:C:G
3chr116471CTchr1:16471:C:T
4chr116542CAchr1:16542:C:A
5chr116570CTchr1:16570:C:T
6chr116635GAchr1:16635:G:A