Package 'GWASbyCluster' reference manual

Title:	Identifying Significant SNPs in Genome Wide Association Studies (GWAS) via Clustering
Description:	Identifies disease-associated significant SNPs using a clustering approach. This package implements the method proposed in Xu et al. (2019) <DOI:10.1038/s41598-019-50229-6>.
Authors:	Yan Xu, Li Xing, Jessica Su, Xuekui Zhang <[email protected]>, Weiliang Qiu <[email protected]>
Maintainer:	Li Xing <[email protected]>
License:	GPL (>= 2)
Version:	0.1.7
Built:	2025-03-25 03:49:02 UTC
Source:	https://github.com/cran/GWASbyCluster

An ExpressionSet Object Storing Simulated Genotype Data

Description

An ExpressionSet object storing simulated genotype data. The minor allele frequency (MAF) of cases has the same prior as that of controls.

Usage

data("esSim")

Details

In this simulation, we generate additive-coded genotypes for 3 clusters of SNPs based on a mixture of 3 Bayesian hierarchical models.

In cluster $+$ , the minor allele frequency (MAF) $\theta_{x+}$ of cases is greater than the MAF $\theta_{y+}$ of controls.

In cluster $0$ , the MAF $\theta_{0}$ of cases is equal to the MAF of controls.

In cluster $-$ , the MAF $\theta_{x-}$ of cases is smaller than the MAF $\theta_{y-}$ of controls.

The proportions of the 3 clusters of SNPs are $\pi_{+}$ , $\pi_{0}$ , and $\pi_{-}$ , respectively.

We assume a “half-flat shape” bivariate prior for the MAF in cluster $+$

$2h_{+}\left(\theta_{x+}\right)h_{+}\left(\theta_{y+}\right) I\left(\theta_{x+}>\theta_{y+}\right),$

where $I(a)$ is the indicator function taking value $1$ if the event $a$ is true, and value $0$ otherwise. The function $h_{+}$ is the probability density function of the beta distribution $Beta\left(\alpha_{+}, \beta_{+}\right)$ .
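One way to draw from the half-flat prior above is sketched here; it is not taken from the package's own simulation code. The density $2h(x)h(y)I(x>y)$ is the joint density of the larger and the smaller of two independent draws from $h$, so sorting each pair of Beta draws is enough. The helper name `r_half_flat` is hypothetical.

```r
# Sketch: draw n pairs (theta_x, theta_y) from 2 * h(x) * h(y) * I(x > y),
# where h is the Beta(alpha, beta) density.
# r_half_flat() is a hypothetical helper, not a GWASbyCluster function.
r_half_flat <- function(n, alpha, beta) {
  a <- rbeta(n, alpha, beta)
  b <- rbeta(n, alpha, beta)
  data.frame(
    theta_x = pmax(a, b),  # larger draw: MAF of cases in cluster +
    theta_y = pmin(a, b)   # smaller draw: MAF of controls in cluster +
  )
}

set.seed(1)
draws <- r_half_flat(5, alpha = 2, beta = 5)
print(draws)
```

For cluster $-$ the same draws can be used with the two columns swapped, so that the MAF of cases is the smaller one.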

We assume $\theta_{0}$ has the beta prior $Beta(\alpha_0, \beta_0)$ .

We also assume a “half-flat shape” bivariate prior for the MAF in cluster $-$

$2h_{-}\left(\theta_{x-}\right)h_{-}\left(\theta_{y-}\right) I\left(\theta_{x-}<\theta_{y-}\right).$

The function $h_{-}$ is the probability density function of the beta distribution $Beta\left(\alpha_{-}, \beta_{-}\right)$ .

Given a SNP, we assume Hardy-Weinberg equilibrium holds for its genotypes. That is, given MAF $\theta$ , the probabilities of genotypes are

$Pr(geno=2) = \theta^2$

$Pr(geno=1) = 2\theta\left(1-\theta\right)$

$Pr(geno=0) = \left(1-\theta\right)^2$

We also assume the genotypes $0$ (wild-type), $1$ (heterozygote), and $2$ (mutation) follow a multinomial distribution $Multinomial\left\{1, \left[ \left(1-\theta\right)^2, 2\theta\left(1-\theta\right), \theta^2 \right]\right\}$ .
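The Hardy-Weinberg probabilities and the single-trial multinomial draw can be illustrated as follows. This is a minimal sketch: `hwe_probs` and `r_geno` are hypothetical helper names, and the MAF value 0.3 is chosen only for illustration.

```r
# Genotype probabilities under Hardy-Weinberg equilibrium for a given MAF.
# Names "0", "1", "2" are the additive-coded genotypes.
hwe_probs <- function(theta) {
  c("0" = (1 - theta)^2, "1" = 2 * theta * (1 - theta), "2" = theta^2)
}

# Draw n genotypes; each column of rmultinom() is one Multinomial(1, p) draw,
# i.e. a one-hot vector over the genotypes 0, 1, 2.
r_geno <- function(n, theta) {
  onehot <- rmultinom(n, size = 1, prob = hwe_probs(theta))
  as.integer(c(0, 1, 2) %*% onehot)
}

set.seed(1)
p <- hwe_probs(0.3)
print(p)
geno <- r_geno(100, theta = 0.3)
print(table(geno))
```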

We set the number of cases as $100$ , the number of controls as $100$ , and the number of SNPs as $1000$ .

The hyperparameters are $\alpha_{+}=2$ , $\beta_{+}=5$ , $\pi_{+}=0.1$ , $\alpha_{0}=2$ , $\beta_{0}=5$ , $\pi_{0}=0.8$ , $\alpha_{-}=2$ , $\beta_{-}=5$ , $\pi_{-}=0.1$ .

Note that when we generate MAFs from the half-flat shape bivariate priors, we might get very small MAFs or MAFs $>0.5$ . In these cases, we delete the SNP.

So the final number of SNPs generated might be less than the initially-set number $1000$ of SNPs.
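The filtering step can be sketched as below. The text does not say how small "very small" is, so the lower bound `maf_min = 0.05` is an assumption, as is the choice to test both the case and the control MAF; `keep_snp` is a hypothetical helper and the MAF values are made up.

```r
# Keep a SNP only if both MAFs lie in [maf_min, maf_max].
# maf_min = 0.05 is an assumed cut-off; the source only says "very small".
keep_snp <- function(theta_x, theta_y, maf_min = 0.05, maf_max = 0.5) {
  ok <- function(theta) theta >= maf_min & theta <= maf_max
  ok(theta_x) & ok(theta_y)
}

theta_x <- c(0.30, 0.62, 0.20, 0.45)  # illustrative case MAFs
theta_y <- c(0.10, 0.40, 0.01, 0.35)  # illustrative control MAFs
keep <- keep_snp(theta_x, theta_y)
print(keep)
print(sum(keep))  # number of SNPs retained
```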

For the dataset stored in esSim, there are $872$ SNPs. $83$ SNPs are in cluster $-$ , $714$ SNPs are in cluster $0$ , and $75$ SNPs are in cluster $+$ .

References

Xu Y, Xing L, Su J, Zhang X, Qiu W. Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies. Scientific Reports 9, Article number: 13686 (2019). https://www.nature.com/articles/s41598-019-50229-6

Examples

data(esSim)
print(esSim)

pDat=pData(esSim)
print(pDat[1:2,])
print(table(pDat$memSubjs))

fDat=fData(esSim)
print(fDat[1:2,])
print(table(fDat$memGenes))
print(table(fDat$memGenes2))

An ExpressionSet Object Storing Simulated Genotype Data

Description

An ExpressionSet object storing simulated genotype data. The minor allele frequency (MAF) of cases has a different prior from that of controls.

Usage

data("esSimDiffPriors")

Details

In this simulation, we generate additive-coded genotypes for 3 clusters of SNPs based on a mixture of 3 Bayesian hierarchical models.

In cluster $+$ , the minor allele frequency (MAF) $\theta_{x+}$ of cases is greater than the MAF $\theta_{y+}$ of controls.

In cluster $0$ , the MAF $\theta_{0}$ of cases is equal to the MAF of controls.

In cluster $-$ , the MAF $\theta_{x-}$ of cases is smaller than the MAF $\theta_{y-}$ of controls.

The proportions of the 3 clusters of SNPs are $\pi_{+}$ , $\pi_{0}$ , and $\pi_{-}$ , respectively.

We assume a “half-flat shape” bivariate prior for the MAF in cluster $+$

$2h_{x+}\left(\theta_{x+}\right)h_{y+}\left(\theta_{y+}\right) I\left(\theta_{x+}>\theta_{y+}\right),$

where $I(a)$ is the indicator function taking value $1$ if the event $a$ is true, and value $0$ otherwise. The function $h_{x+}$ is the probability density function of the beta distribution $Beta\left(\alpha_{x+}, \beta_{x+}\right)$ . The function $h_{y+}$ is the probability density function of the beta distribution $Beta\left(\alpha_{y+}, \beta_{y+}\right)$ .
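When the two Beta components differ, a simple way to draw pairs satisfying the constraint is rejection sampling: draw $\theta_{x+}$ and $\theta_{y+}$ independently and keep the pair only if $\theta_{x+}>\theta_{y+}$ . The sketch below assumes this approach; it is not taken from the package, and `r_half_flat_diff` is a hypothetical helper. Accepted pairs have density proportional to $h_{x+}h_{y+}I(\theta_{x+}>\theta_{y+})$ .

```r
# Rejection sampler: independent Beta draws, keep pairs with theta_x > theta_y.
# r_half_flat_diff() is a hypothetical helper, not a GWASbyCluster function.
r_half_flat_diff <- function(n, ax, bx, ay, by) {
  out_x <- numeric(0)
  out_y <- numeric(0)
  while (length(out_x) < n) {
    x <- rbeta(n, ax, bx)   # candidate MAFs of cases
    y <- rbeta(n, ay, by)   # candidate MAFs of controls
    ok <- x > y             # ordering constraint for cluster +
    out_x <- c(out_x, x[ok])
    out_y <- c(out_y, y[ok])
  }
  data.frame(theta_x = out_x[seq_len(n)], theta_y = out_y[seq_len(n)])
}

set.seed(1)
plus <- r_half_flat_diff(5, ax = 2, bx = 3, ay = 2, by = 8)
print(plus)
```

For cluster $-$ the same idea applies with the acceptance condition reversed.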

We assume $\theta_{0}$ has the beta prior $Beta(\alpha_0, \beta_0)$ .

We also assume a “half-flat shape” bivariate prior for the MAF in cluster $-$

$2h_{x-}\left(\theta_{x-}\right)h_{y-}\left(\theta_{y-}\right) I\left(\theta_{x-}<\theta_{y-}\right).$

The function $h_{x-}$ is the probability density function of the beta distribution $Beta\left(\alpha_{x-}, \beta_{x-}\right)$ . The function $h_{y-}$ is the probability density function of the beta distribution $Beta\left(\alpha_{y-}, \beta_{y-}\right)$ .

Given a SNP, we assume Hardy-Weinberg equilibrium holds for its genotypes. That is, given MAF $\theta$ , the probabilities of genotypes are

$Pr(geno=2) = \theta^2$

$Pr(geno=1) = 2\theta\left(1-\theta\right)$

$Pr(geno=0) = \left(1-\theta\right)^2$
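Under additive coding the genotype is the count of minor alleles among two alleles, so the Hardy-Weinberg probabilities above coincide with a Binomial$(2, \theta)$ distribution. The short check below illustrates this with an arbitrary MAF of 0.3; it is an illustration only and does not reproduce the package's data-generating code.

```r
theta <- 0.3  # illustrative MAF

# Hardy-Weinberg probabilities for genotypes 0, 1, 2
hwe <- c((1 - theta)^2, 2 * theta * (1 - theta), theta^2)

# Binomial(2, theta) probabilities for 0, 1, 2 minor alleles
binom <- dbinom(0:2, size = 2, prob = theta)
print(rbind(hwe, binom))

# Equivalent way to simulate additive-coded genotypes
set.seed(1)
geno <- rbinom(10, size = 2, prob = theta)
print(geno)
```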

We set the number of cases as $100$ , the number of controls as $100$ , and the number of SNPs as $1000$ .

The hyperparameters are $\alpha_{x+}=2$ , $\beta_{x+}=3$ , $\alpha_{y+}=2$ , $\beta_{y+}=8$ , $\pi_{+}=0.1$ ,

$\alpha_{0}=2$ , $\beta_{0}=5$ , $\pi_{0}=0.8$ ,

$\alpha_{x-}=2$ , $\beta_{x-}=8$ , $\alpha_{y-}=2$ , $\beta_{y-}=3$ , $\pi_{-}=0.1$ .
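To see what these hyperparameters imply, the means $\alpha/(\alpha+\beta)$ of the unconstrained Beta components can be compared. This is a small worked calculation on the values listed above; it ignores the ordering constraint and the later removal of SNPs, so it describes the components only, not the final simulated MAFs.

```r
# Mean of a Beta(alpha, beta) distribution
beta_mean <- function(alpha, beta) alpha / (alpha + beta)

prior_means <- c(
  "cases, cluster +"    = beta_mean(2, 3),
  "controls, cluster +" = beta_mean(2, 8),
  "cluster 0"           = beta_mean(2, 5),
  "cases, cluster -"    = beta_mean(2, 8),
  "controls, cluster -" = beta_mean(2, 3)
)
print(round(prior_means, 3))
```

In cluster $+$ the component for cases has the larger mean, and in cluster $-$ the component for controls does, which matches the direction of the MAF difference in each cluster.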

Note that when we generate MAFs from the half-flat shape bivariate priors, we might get very small MAFs or MAFs $>0.5$ . In these cases, we delete the SNP.

So the final number of SNPs generated might be less than the initially-set number $1000$ of SNPs.

For the dataset stored in esSimDiffPriors, there are $838$ SNPs. $64$ SNPs are in cluster $-$ , $708$ SNPs are in cluster $0$ , and $66$ SNPs are in cluster $+$ .

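References

Xu Y, Xing L, Su J, Zhang X, Qiu W. Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies. Scientific Reports 9, Article number: 13686 (2019). https://www.nature.com/articles/s41598-019-50229-6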

Examples

data(esSimDiffPriors)
print(esSimDiffPriors)

pDat=pData(esSimDiffPriors)
print(pDat[1:2,])
print(table(pDat$memSubjs))

fDat=fData(esSimDiffPriors)
print(fDat[1:2,])
print(table(fDat$memGenes))
print(table(fDat$memGenes2))

Estimate SNP cluster membership

Description

Estimate SNP cluster membership. Only the cluster mixture proportions are updated. The 3 clusters are assumed to have different sets of hyperparameters.

Usage

estMemSNPs(es, 
           var.memSubjs = "memSubjs", 
           eps = 0.001, 
           MaxIter = 50, 
           bVec = rep(3, 3), 
           pvalAdjMethod = "fdr", 
           method = "FDR",
           fdr = 0.05,
           verbose = FALSE)

Arguments

`es`	An ExpressionSet object storing SNP genotype data. It contains 3 matrices. The first matrix, which can be extracted by the `exprs` method (e.g., `exprs(es)`), stores genotype data, with SNPs as rows and subjects as columns. The second matrix, which can be extracted by the `pData` method (e.g., `pData(es)`), stores phenotype data describing subjects. Rows are subjects, and columns are phenotype variables. The third matrix, which can be extracted by the `fData` method (e.g., `fData(es)`), stores feature data describing SNPs. Rows are SNPs and columns are feature variables.
`var.memSubjs`	character. The name of the phenotype variable indicating subject's case-control status. It must take only two values: 1 indicating case and 0 indicating control.
`eps`	numeric. A small positive number used as the convergence threshold of the EM algorithm.
`MaxIter`	integer. A positive integer giving the maximum number of iterations of the EM algorithm.
`bVec`	numeric. A vector of 3 elements. Indicates the parameters of the symmetric Dirichlet prior for the mixture proportions.
`pvalAdjMethod`	character. Indicating p-value adjustment method. c.f. `p.adjust`.
`method`	character. Method used to obtain SNP cluster membership from the responsibility matrix. The default value is “FDR”. The other possible value is “max”. See Details.
`fdr`	numeric. A small positive FDR threshold used to call SNP cluster membership.
`verbose`	logical. Indicating if intermediate and final results should be output.
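For orientation on `pvalAdjMethod` and `fdr`, the sketch below applies base R's `p.adjust` with method "fdr" to made-up p-values and compares the adjusted values with a 0.05 threshold. How `estMemSNPs` derives and uses p-values internally is not described in this section, so this illustrates only the adjustment step, not the function's algorithm. The p-values and the variable names are hypothetical.

```r
# Hypothetical raw p-values for five SNPs
pvals <- c(0.001, 0.01, 0.02, 0.03, 0.5)

# Benjamini-Hochberg adjustment, as selected by pvalAdjMethod = "fdr"
padj <- p.adjust(pvals, method = "fdr")
print(round(padj, 4))

# SNPs whose adjusted p-value falls below the FDR threshold
fdr <- 0.05
called <- padj < fdr
print(which(called))
```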