Multimodal QTL mapping | Sishir Subedi

Multimodal analysis for discovery of complex trait genetics

In this project, I am interested in exploring the integration of multimodal (scATAC + scRNA + SNPs) data to reveal novel disease mechanisms.

I will use data from Zhang et al 2023 paper. The paper analyzes data from COVID-19 patients and examines the connections between genetic variations, epigenetic factors, and immune responses at the cellular level.

I will extend the original analyses to approximate cell-type-resolved chromatin accessibility (caQTL) profiles. This approach will link genetic, epigenetic, and transcriptional data to identify regulatory mechanisms in patients with COVID-19.

scRNAseq analysis

library(Seurat)
library(ArchR)
library(ggplot2)

sc_data <- readRDS("data/Monocytes_scRNAseq.rds")
sc_data <- UpdateSeuratObject(sc_data)

colnames(sc_data@meta.data)
unique(sc_data@meta.data$celltype.idL0)
pdf("UMAP_scRNA.pdf",height=4,width=8) 
p1 <- DimPlot(sc_data,label=T,group.by = "celltype.idL0")
p2 <- DimPlot(sc_data,label=T,group.by = "Severity")
p <- p1 + p2
plot(p)
dev.off()

The scRNAseq data consists of 25901 genes/features across 56569 cells within one assay. A total of 1,000 variable features were used for the analysis. Harmony was employed for data integration across COVID-19 patients.

UMAP of scRNAseq

scATACseq analysis

proj <- readRDS("data/Monocyte_scATACseq/Save-ArchR-Project.rds")

colnames(getCellColData(proj))

umapEmbedding <- getEmbedding(proj, embedding = "UMAP")
cellColData <- getCellColData(proj)

umapData <- data.frame(
  Cell = rownames(umapEmbedding),
  UMAP_1 = umapEmbedding[, 1],
  UMAP_2 = umapEmbedding[, 2],
  Clusters = cellColData[rownames(umapEmbedding), "Clusters"],
  predictedGroup = cellColData[rownames(umapEmbedding), "predictedGroup"],
  Severity = cellColData[rownames(umapEmbedding), "Severity"]
)

pdf("UMAP_scATAC.pdf",height=4,width=8)

p1 <- ggplot(umapData, aes(x = UMAP_1, y = UMAP_2, color = predictedGroup)) +
  geom_point(size = 0.5) +
  theme_minimal() +
  labs(title = "UMAP Colored by Clusters", x = "UMAP 1", y = "UMAP 2")

p2 <- ggplot(umapData, aes(x = UMAP_1, y = UMAP_2, color = Severity)) +
  geom_point(size = 0.5) +
  theme_minimal() +
  labs(title = "UMAP Colored by Clusters", x = "UMAP 1", y = "UMAP 2")

p <- p1 + p2
plot(p)
dev.off()

The ATACseq dataset comprises 24,919 genes/features and 10,596 single cells that passed basic quality control in the arrow files. TSS enrichment measures the degree to which the transcription start site (TSS) is more accessible compared to its flanking regions. A score>10 is considered good accessibility. The dataset score is 13.9. The median number of unique fragments per cell is 5,389.5, indicating high-quality data. Also, doublet has been calculated and filtered.

UMAP of scATACseq

Integration was conducted using addGeneIntegrationMatrix function in archR. Briefly, both RNAseq and ATACseq chromatin accessiblility data (peaks) are projected onto low-dimensional latent space ( LSI for ATACseq and PCA for RNAseq). Cell matching across datasets based on similarity in this shared space, followed by gene expression imputation and cell type label transfer.

projDm <- addGeneIntegrationMatrix(
     ArchRProj = projDm, 
     useMatrix = "GeneScoreMatrix",
     matrixName = "GeneIntegrationMatrix",
     reducedDims = "IterativeLSI",
     seRNA = scRNA,
     addToArrow = FALSE,
     groupRNA = "celltypeL0",
     nameCell = "predictedCell_Un",
     nameGroup = "predictedGroup_Un",
     nameScore = "predictedScore_Un"
)

Now, let’s outline the analysis plan:

Step 1: Use scATACseq analysis to extract the gene score matrix, which provides gene x cell chromatin accessibility data.

Step 2: Create pseudobulk data based on cell types for each patient. We transfer cell-type labels from single-cell data integration analysis. Here, we generate gene x celltype_patient pseudobulk data.

Step 3: Use publicly available COVID-19 GWAS summary statistics and select significantly disease-associated variants.

Step 4: Obtain patient-level genotype data.

Once we generate three main data inputs:

Pseudobulk chromatin accessibility data for each cell type and patient
Significantly COVID-19-associated variants data
Patient-level genotype data

Next, we focus on our main data analysis question:

Cell-type specific gene - chromatin accessibility association

Now, lets follow the analysis plan:

Step 1: Use scATACseq analysis to extract the gene score matrix, which provides gene x cell chromatin accessibility data.

proj <- readRDS("data/Monocyte_scATACseq/Save-ArchR-Project.rds")

# Update Arrow file paths
oldArrowPaths <- getArrowFiles(proj)
newArrowPaths <- gsub(
  "/vol/projects/BIIM/Covid_50MHH/scATAC/analysis_new2/projMono_uniq/",
  "/data/sishir/projects/qtl_wrksh/data/Monocyte_scATACseq/",
  oldArrowPaths,
  fixed = TRUE
)
names(newArrowPaths) <- names(oldArrowPaths)
newSampleColData <- DataFrame(
  ArrowFiles = newArrowPaths[rownames(proj@sampleColData)],
  row.names = rownames(proj@sampleColData)
)
proj@sampleColData <- newSampleColData
getArrowFiles(proj)


# Extract GeneScoreMatrix
geneScoreMat <- getMatrixFromProject(
  ArchRProj = proj,
  useMatrix = "GeneScoreMatrix"
)
counts <- assay(geneScoreMat)
rownames(counts) <- rowData(geneScoreMat)$name

Here, counts matrix is a cell-by-gene matrix containing gene activity scores, not peak counts or direct gene expression. These scores are inferred from chromatin accessibility (e.g., promoter and enhancer regions) in scATAC-seq data, serving as a proxy for gene expression before scRNA-seq integration.

Rows are genes (24919), columns are cells (10596)

Here, we focus on three cell types - classical, non-classical monocytes, and CD163+ classical monocytes.

colnames(getCellColData(proj))

# Get cell type assignment from single cell integration analysis 
celltypes <- getCellColData(proj)$predictedGroup
patients <- getCellColData(proj)$patient

# Create unique patient-cell type combinations
combinations <- paste0(patients, "_", celltypes)
unique_combinations <- unique(combinations)

# Aggregate counts by patient and cell type (pseudobulk)
pseudobulk <- do.call(cbind, lapply(unique_combinations, function(combo) {
  idx <- which(combinations == combo)
  if (length(idx) > 0) {
    Matrix::rowSums(counts[, idx, drop = FALSE])
  } else {
    Matrix::Matrix(0, nrow = nrow(counts), ncol = 1)
  }
}))
colnames(pseudobulk) <- unique_combinations

gz <- gzfile("data/pseudobulk.csv.gz", "w")
write.csv(pseudobulk, file = gz, row.names = TRUE, col.names = TRUE)
close(gz)

Step 3: Use publicly available COVID-19 GWAS summary statistics and select significantly disease-associated variants.

Here, we download COVID-19 GWAS summary statistics from COVID-19 host genetics initiative database. We use COVID19-hg GWAS meta-analyses round 7 for all population data.

The VCF file contains summary statistics for genetic variants, including columns like CHR, POS, REF, ALT, SNP, effect sizes (all_inv_var_meta_beta), standard errors (all_inv_var_meta_sebeta), p-values (all_inv_var_meta_p), allele frequencies (all_meta_AF), etc.

library(dplyr)
library(data.table)
library(qqman)

# Load GWAS data
gwas <- fread("data/summ_gwas.tsv")

# Rename columns if needed
colnames(gwas)[colnames(gwas) == "#CHR"] <- "CHR"
colnames(gwas)[colnames(gwas) == "POS"] <- "BP"
colnames(gwas)[colnames(gwas) == "all_inv_var_meta_p"] <- "P"

# Remove rows with NA or P=0
gwas <- gwas %>% filter(!is.na(P), P > 0)
gwas$SNP <- rownames(gwas)
gwas[, SNP := paste0("rs", SNP)]
gwas <- gwas[,c("SNP", "CHR", "BP", "P")]

pdf("GWAS_summary.pdf",height=4,width=8) 
manhattan(gwas,
          chr = "CHR",
          bp = "BP",
          p = "P",
          snp = "SNP",
          col = c("blue4", "orange3"),
          genomewideline = -log10(5e-8),
          suggestiveline = -log10(1e-5),
          main = "COVID-19 GWAS Manhattan Plot - filtered SNPs")
dev.off()

Filtered SNPs associated with COVID-19

Ok, now we have:

Summary GWAS data from COVID 19 patients where significant chr, position, and pvalues are estimated
Chromatin accessibility data from pseudobulk where gene accesibility score is aggregated by cell type and patient
We are missing donor level genotype data, we simulate the data as the original data has restricted access

library(data.table)

pseudobulk <- fread("data/pseudobulk.csv.gz")
rownames <- pseudobulk[[1]]
pseudobulk <- pseudobulk[, -1] 
pseudobulk <- as.matrix(pseudobulk) 
rownames(pseudobulk) <- rownames

# Normalize pseudobulk for a specific cell type 
celltype <- "cMono"
cols <- grep(paste0("_", celltype, "$"), colnames(pseudobulk), value = TRUE)
pseudobulk_cmono <- pseudobulk[, cols]
pseudobulk_cmono <- t(scale(t(pseudobulk_cmono)))

celltype <- "ncMono"
cols <- grep(paste0("_", celltype, "$"), colnames(pseudobulk), value = TRUE)
pseudobulk_ncmono <- pseudobulk[, cols]
pseudobulk_ncmono <- t(scale(t(pseudobulk_ncmono)))


# Map significant SNPs to genes within 1Mb
gene_pos <- as.data.table(rowData(geneScoreMat))[, .(name, seqnames, start, end)]
setnames(gene_pos, c("gene", "chr", "start", "end"))
gene_pos[, `:=`(start_window = start - 1e6, end_window = end + 1e6)]

# Filter GWAS for significant SNPs
gwas_sig <- gwas[P < 5e-8, ]
gwas_sig[, CHR := as.character(CHR)] # Ensure CHR is character
gene_pos[, chr := as.character(chr)]
gwas_sig[, CHR := paste0("chr", CHR)]
gwas[, CHR := paste0("chr", CHR)]

# Join GWAS SNPs with genes
genes_overlapping <- gwas_sig[gene_pos, 
                             .(SNP, gene, chr = CHR, pos = BP, P),
                             on = .(CHR == chr, BP >= start_window, BP <= end_window),
                             nomatch = 0]

genes_unique <- unique(genes_overlapping[, .(gene, chr)])

Association analysis: For each gene present in both summary GWAS data and ATACseq data, we estimate beta values for all the SNPs associated with the gene. Then we use cell-type specific chromatic accessibility for all donors and fit a linear model

chromatin_accessibilty_score ~ SNPs.

all_results <- list()

for ( gene in genes_unique$gene) {

  print(gene)

  # Get gene position
  gene_pos <- rowData(geneScoreMat)[rowData(geneScoreMat)$name == gene, ]
  chr <- as.character(gene_pos$seqnames)
  start <- gene_pos$start - 1e6
  end <- gene_pos$end + 1e6

  # Subset GWAS SNPs in gene region
  gwas_subset <- gwas[gwas$CHR == chr & gwas$BP >= start & gwas$BP <= end, ]

  # Estimate summary stats from pvalues
  gwas_subset[, beta := qnorm(P/2, lower.tail = FALSE) * sign(runif(.N, -1, 1))]
  gwas_subset[, se := abs(beta / qnorm(P/2, lower.tail = FALSE))]
  gwas_subset[, freq := 0.5] # Default MAF
  gwas_subset[, N := 1000] # Default sample size


  # now we have two vectors: 
  #(1) beta vector from gwas_subset data calculated based on pvalues from summary statistics
  #(2) chromatin accesibility vector for gene in diffrent patients as pseudobulk_ct[gene,]

  ## simulate genotype using estimation from gwas summary
  n_donors_ncmono <- length(pseudobulk_ncmono[gene,])
  n_donors_cmono <- length(pseudobulk_cmono[gene,])
  n_donors <- max(n_donors_ncmono, n_donors_cmono) 

  if (any(is.nan(pseudobulk_ncmono[gene, ]) | is.na(pseudobulk_ncmono[gene, ])) || any(is.nan(pseudobulk_cmono[gene, ]) | is.na(pseudobulk_cmono[gene, ]))) next

  geno <- matrix(NA, nrow = nrow(gwas_subset), ncol = n_donors)
  rownames(geno) <- gwas_subset$SNP
  for (i in 1:nrow(gwas_subset)) {
    maf <- gwas_subset$freq[i]
    # Simulate genotypes (0, 1, 2) under Hardy-Weinberg equilibrium
    geno[i, ] <- rbinom(n_donors, 2, maf)
  }
  geno <- t(geno)  # Donors x SNPs


# Fit linear model
fit_lm <- function(scores, geno, n_donors) {
  results <- lapply(1:ncol(geno), function(i) {
    snp <- colnames(geno)[i]
    valid_idx <- 1:min(n_donors, nrow(geno))  # Match donor count
    lm_fit <- lm(scores ~ geno[valid_idx, i])
    coef_summary <- summary(lm_fit)$coefficients
    if (nrow(coef_summary) > 1) {
      beta <- coef_summary[2, 1]
      se <- coef_summary[2, 2]
      pval <- coef_summary[2, 4]
    } else {
      beta <- NA
      se <- NA
      pval <- 1
    }
    list(beta = beta, se = se, pval = pval)
  })
  list(
    beta = sapply(results, function(x) x$beta),
    se = sapply(results, function(x) x$se),
    pval = sapply(results, function(x) x$pval)
  )
}

results_ncmono <- fit_lm(pseudobulk_ncmono[gene,], geno, n_donors_ncmono)
results_cmono <- fit_lm(pseudobulk_cmono[gene,], geno, n_donors_cmono)

gene_results <- data.table(
    gene = gene,
    SNP = rep(gwas_subset$SNP, 2),
    cell_type = rep(c("ncMono", "cMono"), each = nrow(gwas_subset)),
    beta = c(results_ncmono$beta, results_cmono$beta),
    se = c(results_ncmono$se, results_cmono$se),
    pvalue = c(results_ncmono$pval, results_cmono$pval),
    gwas_beta = rep(gwas_subset$beta, 2),
    chr = chr,
    pos = gwas_subset$BP
  )

all_results[[gene]] <- gene_results

}

results <- do.call(rbind, all_results)

# Step 4: Identify cell-type-specific associations
results[, sig := pvalue < 0.05]

results[, beta_diff := abs(beta - shift(beta, n = sum(cell_type == "ncMono"), type = "lead")), by = gene]
results[, cell_specific := sig & beta_diff > 0.1]  # Threshold for specificity
results <- results[sig == TRUE | cell_specific == TRUE]
write.csv(results, "data/snp_association_significant_genes_ncMono_cMono.csv")

Now, we select genes from the results in the original paper.

We compare whether our analyses identified the same genes as highly associated with disease variants.

library(data.table)
library(ggplot2)
results <- fread("data/snp_association_significant_genes_ncMono_cMono.csv")

gene_importance <- results[sig == TRUE, .(importance = sum(abs(beta * gwas_beta))), 
                          by = .(gene, cell_type)]
gene_importance <- gene_importance[cell_type == "ncMono"][order(-importance)]

# Filter for one gene
gene <- "CXCR6"
plot_data <- results[gene == "CXCR6" & sig == TRUE ]

common_snps <- plot_data[, .N, by = SNP][N == 2, SNP]
plot_data <- plot_data[SNP %in% common_snps]



# Check for duplicate columns in plot_data
if (any(duplicated(colnames(plot_data)))) {
  plot_data <- plot_data[, !duplicated(colnames(plot_data)), with = FALSE]
}

top_snps <- plot_data[1:min(3, .N), SNP]
plot_data <- plot_data[SNP %in% top_snps]


ggplot(plot_data, aes(x = SNP, y = beta, fill = cell_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = paste("Cell-Type-Specific SNP Associations for ",gene),
        x = "SNP", y = "Beta (Effect Size)", fill = "Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("ncMono" = "#1f77b4", "cMono" = "#ff7f0e"))

ggsave(paste("cell_type_diff_",gene,".png"), width = 6, height = 10)

According to the paper’s results, the analyses identified genes such as CCR1, CCR1, CXCR6, and IFNGR2 as associated with GWAS-mapped peaks.

In our analysis also, we can verify that SNPs in CCR1, CCR1, CXCR6, and IFNGR2 genes are associated with cell-type specific chromatin accessibility.

In conclusion, we used the published research study/data, repeated the analyses, and conducted GWAS summary-based SNP-to-chromatin accessibility association analysis to show how genetic variations are associated with chromatin accessibility in a cell-type specific manner. Our results identified the same group of genes and cell types as published in the paper. This analysis presents an approach to approximating a cell-type-resolved chromatin accessibility (caQTL) analysis using public GWAS summary data and chromatin accessibility profiles.

The code used is available caQTL_analysis.