Independent list of variants using LD clumping

Independent list of variants using LD clumping#

Description#

This notebook creates a list of randomly selected variants that are in weak LD with each other, for given reference data-set. This is useful for the downstreams mashr analysis where a list of independent variants are used.

Input#

Reference genotype, in this case, our ROSMAP European ancestry genotype of ~1,000 individuals, ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed
A list of approximately independent LD blocks, which is optional but when it is included, we can parallel the LD clumping to all LD blocks which will speed up the process. This list is available from: gaow/LDblocks_GRCh38 (a forked repo created from work described at https://www.biorxiv.org/content/10.1101/2022.03.04.483057v2)

Output#

A list of independent variants in bed.gz format

Minimal Working Example Steps#

i. Extract genotypes per LD block#

Using the genotype_formatting.ipynb pipeline,

FIXME: this is using the original bed file for blocks, which is a lot cleaner. Please test this

sos run pipeline/genotype_formatting.ipynb genotype_by_region \
    --region-list data/references/deCODE_EUR_LD_blocks.bed \
    --genoFile data/genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed \
    --cwd output/geno_by_LDblock \
    --container /mnt/vast/hpc/csg/containers_xqtl/bioinfo.sif \
    --mem 200G -s build

INFO: Running genotype_by_region_1:

ii. LD pruning within each LD block#

We create a small pipeline for the purpose,

[global]
parameter: cwd = path("output")
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 1
parameter: walltime = "1h"
parameter: mem = "80G"
parameter: numThreads = 10

[LD_pruning_1]
parameter: genotype_list = path
parameter: window = 50
parameter: shift = 10
parameter: r2 = 1e-3

import pandas as pd
geno_path = pd.read_csv(genotype, sep = "\t")
input_df = geno_path.values.tolist()
input_blocks = [x[0] for x in input_df]
input_files = [x[1:] for x in input_df] 

input: input_files, group_by = 1, group_with = "input_blocks"
output: prune = f'{cwd}/{step_name[:-2]}/{_input_blocks}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    plink2 \
        --bfile  ${_input:anr} \
        --indep-pairwise  ${window} ${shift} ${r2} \
        --rm-dup force-first \
        --out ${_output["prune"]:annr} \
        --threads ${numThreads}

[LD_pruning_2]
input: group_by = "all"
output: f'{cwd}/{step_name[:-2]}/LD_pruned_variants.txt'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    library(data.table)
    library(dplyr)
    library(stringr)
    per_block_variants = c(${_input:ar,})
    uncorrelated_variants = as.data.frame(rbindlist(lapply(per_block_variants, fread, header = FALSE)))
    random_variants  = str_split(gsub("_",":",uncorrelated_variants$V1),":",simplify=TRUE)%>%data.frame()%>%
                    rename("chrom" = "X1","pos" = "X2","alt" = "X3","ref"="X4")%>%
                    mutate(variant_id = paste(chrom, pos, alt, ref, sep = ":"))
    fwrite(random_variants, ${_output:r}, row.names=F,sep="\t",quote=F)

sos run ld_prune_reference.ipynb LD_pruning \
   --cwd output/ROSMAP_LD_pruned \

FIXME: please note that I uused LD pruning step 2 to merge them. Please test it. Also you did not show how to actually run this. Please add it above

random_variants = data.table::fread("LD_pruned_variants.txt")
head(random_variants)

A data.frame: 6 × 5
	chrom	pos	alt	ref	variant_id
	<chr>	<chr>	<chr>	<chr>	<chr>
1	chr1	16206	T	A	chr1:16206:T:A
2	chr1	16433	C	G	chr1:16433:C:G
3	chr1	16471	C	T	chr1:16471:C:T
4	chr1	16542	C	A	chr1:16542:C:A
5	chr1	16570	C	T	chr1:16570:C:T
6	chr1	16635	G	A	chr1:16635:G:A