Genomic Relationship Matrices#

This workflow generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme.

Description#

For each chromosome, we compute a GRM using data excluding this chromosome. Computation is implemented using GCTA software package.

Input#

This workflow expects per chromosome genotype data in PLINK 1.0 format. We provide some genotype formatting workflows to help convert other genotype formats to the required PLINK format. Notice that common variants (minor allele frequency > 1%) are recommended to use for computing GRM.

Output#

GRM per LOCO, in formats compatible with downstreams analysis including LMM using APEX.

Minimal Working Example#

sos run GRM.ipynb grm \
    --cwd output \
    --genoFile data/genotype/mwe_genotype.list \
    --container container/bioinfo.sif

Command interface#

sos run GRM.ipynb -h
usage: sos run GRM.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  grm

Global Workflow Options:
  --cwd VAL (as path, required)
                        Work directory & output directory
  --container ''
                        The filename name for output data
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --genotype-list  paths


Sections
  grm_1:                Generate LOCO file list
  grm_2:
  grm_3:
  grm_4:
[global]
# Work directory & output directory
parameter: cwd = path("output")
# The filename name for output data
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: genoFile = paths
cwd = f"{cwd:a}"

## Code below are copied from genotype_formatting.ipynb. Change should be made to codes in that document and sync with this chunk.
from sos.utils import expand_size
import os

def get_genotype_file(geno_file_paths):
    #
    def valid_geno_file(x):
        suffixes = path(x).suffixes
        if suffixes[-1] == '.bed':
            return True
        if len(suffixes)>1 and ''.join(suffixes[-2:]) == ".vcf.gz":
            return True
        return False
    #
    def complete_geno_path(x, geno_file):
        if not valid_geno_file(x):
            raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
        if not os.path.isfile(x):
            # relative path
            if not os.path.isfile(f'{geno_file:ad}/' + x):
                raise ValueError(f"Cannot find genotype file {x}")
            else:
                x = f'{geno_file:ad}/' + x
        return x
    # 
    def format_chrom(chrom):
        if chrom.startswith('chr'):
            chrom = chrom[3:]
        return chrom
    # Inputs are either VCF or bed, or a vector of them 
    if len(geno_file_paths) > 1:
        if all([valid_geno_file(x) for x in geno_file_paths]):
            return paths(geno_file_paths)
        else: 
            raise ValueError(f"Invalid input {geno_file_paths}")
    # Input is one genotype file or text list of genotype files
    geno_file = geno_file_paths[0]
    if valid_geno_file(geno_file):
        return path(geno_file)
    else: 
        units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
        if all([len(x) == 1 for x in units]):
            return paths([complete_geno_path(x[0], geno_file) for x in units])
        elif all([len(x) == 2 for x in units]):
            genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
        else:
            raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
        return genos
                        
genotypes = get_genotype_file(genoFile)

Generate LOCO file list#

This step generates a list of bfiles that are used by gcta to generate the GRM. eg,:

/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr1
/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr2
/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr3
/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr4
/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr5
/mnt/mfs/statgen/xqtl_workflow_testing/ROSMAP/data_preprocessing/genotype/AC_per_chrom_plink/AC_chr6

without .bed extension.

# Generate LOCO file list
[grm_1]
chrom = list(genotypes.keys())
input: for_each = 'chrom'
output: f'{cwd}/{genotypes[_chrom]:bn}.loco.txt'
with open(_output, 'w') as f:
    f.write('\n'.join([str(genotypes[x].with_suffix('')) for x in genotypes if x != _chrom]))

Compute LOCO-GRM#

This step can take a while. We recommend using genotype data of common variants to perform this analysis.

# Compute LOCO-GRM
[grm_2]
output: f'{_input:nn}.grm.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
   gcta64 \
   --mbfile $[_input] \
   --make-grm-gz \
   --out $[_output:nn]

Format output for APEX#

# Format output for APEX
[grm_3]
output: f'{_input:nn}.apex.grm'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint

    # Load necessary libraries
    library(dplyr)
    library(readr)
    library(purrr)

    # Read data
    grm <- read_delim(gzfile("$[_input]"), '\t', col_names = FALSE) %>%
      select(1, 2, 4) %>%
      set_names(c("#id1", "id2", "kinship"))

    id <- read_delim("$[_input:n].id", '\t', col_names = FALSE)[, 2] %>%
      unlist()

    # Modify grm data
    grm_updated <- grm %>%
      mutate(`#id1` = map_chr(`#id1`, ~id[.x]),
             `id2` = map_chr(`id2`, ~id[.x]))

    # Write to output
    grm_updated %>%
      write_delim("$[_output]", '\t')

Generate the list of LOCO-GRM file for APEX input,

# Generate APEX input list
[grm_4]
input: group_by = "all"
output: f'{cwd}/{path(list(genotypes.values())[1]):bnn}.loco_grm_list.txt'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint

    # Load necessary libraries
    library(dplyr)
    library(readr)
    library(purrr)

    # Define chromosomal names
    chromosomes <- paste0("chr", c($[paths(genotypes.keys()):r,]), collapse = "")

    # Define input files and directory
    input_files <- c($[_input:r,])

    # Create and sort the tibble
    grm_tibble <- tibble(`#chr` = chromosomes, dir = input_files) %>%
      arrange(`#chr`)

    # Write to output
    grm_tibble %>%
      write_delim("$[_output]", "\t")