Some useful R functions for processing GWAS output
Installation
To install the development version from GitHub use the remotes
package:
remotes::install_github("lcpilling/gwasRtools")
remotes::install_github("lcpilling/gwasRtools@*release") # To install the latest release
remotes::install_github("lcpilling/gwasRtools@v0.1.3") # To install a specific version (see tags)
Example dataset
The package includes a subset of variants from the Graham et al. 2021 GWAS of LDL in 1,320,016 Europeans (GWAS catalog GCST90239658). I will use this throughout.
library(gwasRtools)
head(gwas_example)
#> SNP CHR BP A1 A2 MAF BETA SE P
#> 1 rs370704455 1 104275751 CATCT C 7.21e-02 -0.00350057 0.0040782 0.391
#> 2 rs545633289 1 104275758 C G 5.25e-04 -0.36329200 0.2903640 0.211
#> 3 rs367651119 1 104275797 G A 5.99e-03 -0.01142350 0.0164861 0.488
#> 4 rs560138913 1 104296901 T G 3.67e-03 -0.05503450 0.0318479 0.084
#> 5 rs533558585 1 104296979 T C 7.57e-05 0.00173860 0.1826840 0.992
#> 6 rs551694811 1 104297001 T A 1.97e-03 -0.01075230 0.0316984 0.734
get_loci()
Determine loci from GWAS summary statistics. Use distance from lead significant SNP to estimate independent loci [default distance = 500kb]. Uses -log10(p) derived from BETA/SE so does not need P as input. Example below with default input.
gwas_loci = get_loci(gwas_example)
#>
#> Locus size (bases) = 5e+05
#> P-value threshold = 5e-08
#>
#> N variants = 319732
#> N variants p<threshold = 4132
#> N loci = 15
head(gwas_loci)
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead
#> 57882 rs12046439 1 107536799 T C 0.248 0.0099 0.0017 5.01e-09 1 FALSE
#> 57900 rs143849791 1 107537916 CATG C 0.325 0.0128 0.0016 5.85e-15 1 FALSE
#> 57922 rs113329442 1 107539252 A G 0.330 0.0110 0.0014 1.27e-13 1 FALSE
#> 57987 rs3861909 1 107544176 G A 0.327 0.0118 0.0015 3.52e-15 1 FALSE
#> 58025 rs17496332 1 107546375 A G 0.331 0.0111 0.0014 8.70e-14 1 FALSE
#> 58091 rs2878349 1 107549245 G A 0.327 0.0118 0.0014 2.33e-15 1 FALSE
gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead
#> 1 rs111232683 1 107566149 G C 0.343 0.0135 0.0016 5.43e-17 1 TRUE
#> 2 rs114254196 1 108635400 C T 0.008 -0.0448 0.0081 4.38e-08 2 TRUE
#> 3 rs115292790 1 109310728 G A 0.013 -0.0563 0.0060 2.01e-20 3 TRUE
#> 4 rs12740374 1 109817590 G T 0.219 -0.1482 0.0016 4.73e-305 4 TRUE
#> 5 rs140266316 1 110326545 G A 0.016 -0.0577 0.0059 4.73e-22 5 TRUE
#> 6 rs657801 1 111736389 T C 0.315 0.0090 0.0015 1.88e-09 6 TRUE
- Loci are numbered. Variants within a locus (i.e., significant below the
p_threshold
and less thann_bases
from last significant variant). - Lead variant for each locus is highlighted where
lead==TRUE
(i.e., smallest p-value for any variant within a locus)
The HLA region can be treated as one continuous locus by setting single_hla_locus
to TRUE.
Use LD clumping to identify independent SNPs at the same locus
Setting option ld_clump=TRUE
will use {ieugwasr} package ld_clump()
function to run Plink LD clumping.
Default is to use a local Plink installation (this is faster) with EUR reference panel. But setting option ld_clump_local
to FALSE will use the online IEU API. See the {ieugwasr} docs for details. Default R2 threshold for LD pruning is 0.01 (modify with ld_pruning_r2
option).
gwas_loci = get_loci(gwas_example, ld_clump=TRUE)
#> ** Performing LD clumping. Can take a few minutes
#> ** Local Plink installation will be called -- output appears in your R terminal
#>
#> Locus size (bases) = 5e+05
#> P-value threshold = 5e-08
#>
#> N variants = 319732
#> N variants p<threshold = 4132
#> N loci = 15
#> N independent variants (LD R2 threshold 0.01) = 153
head(gwas_loci)
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead lead_dist lead_ld
#> 57882 rs12046439 1 107536799 T C 0.248 0.0099 0.0017 5.01e-09 1 FALSE FALSE FALSE
#> 57900 rs143849791 1 107537916 CATG C 0.325 0.0128 0.0016 5.85e-15 1 FALSE FALSE FALSE
#> 57922 rs113329442 1 107539252 A G 0.330 0.0110 0.0014 1.27e-13 1 FALSE FALSE FALSE
#> 57987 rs3861909 1 107544176 G A 0.327 0.0118 0.0015 3.52e-15 1 FALSE FALSE FALSE
#> 58025 rs17496332 1 107546375 A G 0.331 0.0111 0.0014 8.70e-14 1 FALSE FALSE FALSE
#> 58091 rs2878349 1 107549245 G A 0.327 0.0118 0.0014 2.33e-15 1 FALSE FALSE FALSE
gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead lead_dist lead_ld
#> 1 rs111232683 1 107566149 G C 0.343 0.0135 0.0016 5.43e-17 1 TRUE TRUE TRUE
#> 2 rs114254196 1 108635400 C T 0.008 -0.0448 0.0081 4.38e-08 2 TRUE TRUE TRUE
#> 3 rs140300970 1 109020060 A T 0.022 -0.0278 0.0049 2.13e-08 3 TRUE FALSE TRUE
#> 4 rs148503795 1 109166178 C G 0.010 -0.0423 0.0070 2.47e-09 3 TRUE FALSE TRUE
#> 5 rs74896173 1 109167705 T C 0.009 -0.0429 0.0076 2.08e-08 3 TRUE FALSE TRUE
#> 6 rs111751551 1 109242056 G A 0.010 -0.0505 0.0069 4.32e-13 3 TRUE FALSE TRUE
Where before, locus 3 would only have had one lead SNP (based on distance/lowest p-value) ld_clump()
has identified multiple independent variants in the region.
Note that there are now three lead
columns: - lead_dist
is the original lead
column, simply based on distance and p-values - lead_ld
is the direct results from ld_clump()
- The lead
column combines the two (some SNPs are missing from LD panel so it does not always choose the lowest p-value if only 1 variant identified at a locus).
** Note that ld_clump()
only considers R^2 when defining independent variants, not D’ – you should perform additional checking/conditional analysis where relevant for lead
variants in close proximity.
get_nearest_gene()
Get nearest gene from a set of variants using GENCODE data. Need to provide a data.frame of variant IDs (e.g., rsids), CHR and POS. Default column names are the same as for get_loci()
. Default max distance from variant to gene is 100kb.
gwas_loci = get_nearest_gene(gwas_loci, build=37)
#> Using human genome build 37
#> Getting nearest gene for 4131 unique variants
#> (Removed 1 duplicated or missing variant IDs/positions)
head(gwas_loci)
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead gene dist
#> 1 rs12046439 1 107536799 T C 0.248 0.0099 0.0017 5.01e-09 1 FALSE PRMT6 62468
#> 2 rs143849791 1 107537916 CATG C 0.325 0.0128 0.0016 5.85e-15 1 FALSE PRMT6 61351
#> 3 rs113329442 1 107539252 A G 0.330 0.0110 0.0014 1.27e-13 1 FALSE PRMT6 60015
#> 4 rs3861909 1 107544176 G A 0.327 0.0118 0.0015 3.52e-15 1 FALSE PRMT6 55091
#> 5 rs17496332 1 107546375 A G 0.331 0.0111 0.0014 8.70e-14 1 FALSE PRMT6 52892
#> 6 rs2878349 1 107549245 G A 0.327 0.0118 0.0014 2.33e-15 1 FALSE PRMT6 50022
gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#> SNP CHR BP A1 A2 MAF BETA SE P locus lead gene dist
#> 1 rs111232683 1 107566149 G C 0.343 0.0135 0.0016 5.43e-17 1 TRUE PRMT6 33118
#> 2 rs114254196 1 108635400 C T 0.008 -0.0448 0.0081 4.38e-08 2 TRUE SLC25A24 41258
#> 3 rs140300970 1 109020060 A T 0.022 -0.0278 0.0049 2.13e-08 3 TRUE NBPF6 6436
#> 4 rs148503795 1 109166178 C G 0.010 -0.0423 0.0070 2.47e-09 3 TRUE FAM102B -63467
#> 5 rs74896173 1 109167705 T C 0.009 -0.0429 0.0076 2.08e-08 3 TRUE FAM102B -64994
#> 6 rs111751551 1 109242056 G A 0.010 -0.0505 0.0069 4.32e-13 3 TRUE PRPF38B -7111
- If
dist
is positive, the variant is intergenic, and this is the distance to the closest gene. - If
dist
is negative, the variant is within a gene, and this is the distance to the start of the gene. - If
dist
is NA, the variant is not withinn_bases
of a gene in GENCODE.
lambda_gc()
Estimate inflation of test statistics. Lambda GC compares the median test statistic against the expected median test statistic under the null hypothesis of no association. For well-powered GWAS of traits with a known polygenic inheritance, we expect inflation of lambda GC. For traits with no expected association, we expect lambda GC to be around 1.
lambda_gc(gwas_example$P)
#> [1] 1.41112