Introduction

The genome-wide association study (GWAS) era has improved our understanding of disease aetiology by identifying genetic variants associated with complex human traits and disease phenotypes. However, to fully evaluate the data emerging from these studies, researchers need convenient ways to access and visualize the totality of investigations so far completed, while not compromising any individual’s privacy or informed consent. To this end, we herein describe GWAS Central, a comprehensive genetic association database, designed to enable multiple study integration via graphical displays and extensive textual content.

Other GWAS depositories, such as dbGaP1 (http://www.ncbi.nlm.nih.gov/gap/) and EGA (http://www.ebi.ac.uk/ega/), act as archival systems that provide controlled access to individual-level GWAS data and open access to some categories of summary-level data. This approach is merited, given that it is possible to identify the participation of a research subject within the full range of summary-level data.2 Smaller amounts of summary-level GWAS data are available from resources such as the NHGRI GWAS Catalog3 and the Open Access Database of Genome-wide Association Results (OADGAR),4 with their content being restricted to marker signals that exceed predefined P-value thresholds. The semi-arbitrary imposition of such cut-offs is unfortunate, in that it prevents direct comparison across the totality of signals (within and between related studies), in order to identify consistently positive markers.

Convenient, dedicated resources that provide unfettered access to all GWAS summary-level data are therefore needed, powered by user-friendly tools for instant interrogation and visualization of unified views of the data. Such displays also need to incorporate information about the tested markers, such as chromosome location, alleles and the 5′ and 3′ flanking sequences of SNPs. Ideally, the investigated phenotypes will be represented by standardized terminologies, thus allowing meaningful cross-study searches to be conducted.

On the basis of the above considerations, we designed and created the GWAS Central resource (http://www.gwascentral.org). Here we describe the ways in which this database enables experimental biologists to explore and compare data in the GWAS domain, from either a genotype or phenotype starting point.

Implementation

GWAS Central collates association data and study metadata from many disparate sources whose data are available in different formats and to differing degrees of detail. These diverse GWAS data are integrated in a flexible and coherent data model that was described previously for an earlier incarnation of the database, named HGVbaseG2P.5 GWAS Central builds upon core genomic variation visualization and comparison concepts from HGVbaseG2P to provide new features, such as downloadable detailed data reports, semantically standardized phenotype ontology searching, optimized data visualizations, private upload and comparison of user data, and tools for remote data interrogation. The resources collated by GWAS Central include data sets from other sites, such as the NHGRI GWAS Catalog and OADGAR, and complete association data sets from the 10 trait-based investigations of the 1958 Birth Cohort.6 In addition, a substantial amount of data has been obtained through direct requests to researchers and consortia, and through numerous unsolicited submissions from researchers who wish their newly published data to be included in GWAS Central. All data submitters are fully acknowledged, with the contributing resources and the original authors of each study cited on the website.

The gathered and submitted data are extensively curated to maximize quality and completeness. This includes checking that all genetic markers have valid dbSNP rs numbers, assessing whether the alleles and strand representation of these are correct, eliminating duplicate markers, combining multiple data sets for discrete studies and populating extensive metadata. In addition, we manually evaluate each study for its range of phenotype content and apply appropriately chosen ontology terms to ensure that the phenotype descriptions are standardized across all studies. In this task, we identify for each phenotype an equivalent or most appropriate term from the National Library of Medicine’s MeSH controlled vocabulary. MeSH is used because it offers familiarity to biologists as a result of PubMed MEDLINE indexing and it also provides good generalized descriptions of phenotypes. The Human Phenotype Ontology (HPO) is also used to annotate phenotypes in cases where HPO offers a more specific description.7
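Two of the curation checks described above, rs number validation and duplicate removal, can be illustrated with a minimal Python sketch. This is not the GWAS Central curation pipeline: the function name clean_markers and the rejection labels are hypothetical, and the pattern only tests that an identifier has the syntactic shape of a dbSNP rs number. Confirming that an rs number is actually current in dbSNP, or that alleles and strand are correct, would require a lookup against dbSNP itself.

```python
import re

# Syntactic shape of a dbSNP rs number: "rs" followed by a positive integer.
# Assumption: this checks form only, not whether the ID exists in dbSNP.
RS_PATTERN = re.compile(r"^rs[1-9][0-9]*$")


def clean_markers(rows):
    """Split (rs_id, p_value) rows into kept and rejected records.

    The first occurrence of each rs number is kept; later occurrences
    and malformed identifiers are rejected with a reason.
    """
    seen = set()
    kept, rejected = [], []
    for rs_id, p_value in rows:
        rs_id = rs_id.strip().lower()
        if not RS_PATTERN.match(rs_id):
            rejected.append((rs_id, p_value, "malformed rs number"))
        elif rs_id in seen:
            rejected.append((rs_id, p_value, "duplicate marker"))
        else:
            seen.add(rs_id)
            kept.append((rs_id, p_value))
    return kept, rejected
```

Keeping the rejected rows together with a reason, rather than silently dropping them, makes it easier for a curator to report problems back to the data submitter.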

To allow flexible access and data discovery, GWAS Central queries are structured into three types, namely, genotype, phenotype or keyword orientated. Genotype searches can be based on HGNC gene symbols, genomic region coordinates or dbSNP rs numbers. Phenotype searches are linked to MeSH and HPO annotations, as well as to the original free-text descriptions used in publications. Keyword searches interrogate text contained in study titles and abstracts, PubMed IDs and author names.
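As an illustration of how the three kinds of genotype search term could be told apart, the following Python sketch classifies a query string as an rs number, a genomic region or a gene symbol. It is a hypothetical example, not the GWAS Central search code: the function name classify_genotype_term and the region syntax (for example chr6:31000000-32000000) are assumptions, and anything that is neither an rs number nor a region is simply treated as a candidate gene symbol without checking it against HGNC.

```python
import re

RS_RE = re.compile(r"^rs\d+$", re.IGNORECASE)
# Assumed region syntax: optional "chr", chromosome name, ":", start "-" end.
REGION_RE = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|MT?):(\d+)-(\d+)$", re.IGNORECASE)


def classify_genotype_term(term):
    """Return (kind, value) where kind is 'marker', 'region' or 'gene'."""
    term = term.strip()
    if RS_RE.match(term):
        return ("marker", term.lower())
    # Allow thousands separators in coordinates, e.g. 31,000,000.
    match = REGION_RE.match(term.replace(",", ""))
    if match:
        chrom = match.group(1).upper()
        start, end = int(match.group(2)), int(match.group(3))
        if start > end:
            raise ValueError(f"region start {start} exceeds end {end}")
        return ("region", (chrom, start, end))
    # Fallback: treat the term as a gene symbol (not validated against HGNC).
    return ("gene", term)
```

For example, classify_genotype_term("HLA-DRB1") falls through to the gene branch because the term contains no colon-separated coordinates.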

Results

GWAS Central content is organized and displayed using three levels, representing the core aspects of each GWAS (Supplementary Figure 1). These levels are as follows: 1) the top-level study summary, presenting the cohorts used to carry out the study and the association result sets comprising P-values, odds ratios, frequency data and lists of associated markers; 2) the phenotypes observed during each experiment within a study, with both free-text descriptions and ontology-derived annotations, along with the phenotyping methods; 3) detailed information for the markers identified in the study, such as current genomic coordinates, genic relationships, revision history (eg, whether a marker has been merged or deleted) and specific sequence information. A summary of each study can be exported in XML and JSON file formats, and the ‘top’ 100 association results for each experiment per study can be exported as an Excel spreadsheet and in text file (CSV and TSV), news feed (RSS and Atom) and semantic web compatible (RDF) formats.

Reflecting the goal of GWAS Central to allow multiple GWAS to be compared, an integrated browser tool allows up to 16 association data sets to be correlated and visually interrogated. In addition, association data can be anonymously uploaded to the resource for comparison alongside data sets contained in the database. The real-time upload feature requires a simple delimiter-separated value file consisting of two columns containing the dbSNP rs number and the associated P-value (the online documentation provides further details: http://www.gwascentral.org/info/using-the-browser/custom-upload/). All user data are automatically removed from our servers after 48 h. The browser allows interesting signals to be examined in detail by switching between the ‘genome’ view of all chromosomes and the ‘region’ view of increasing resolution down to individual nucleotide resolution (Figure 1). Association data can be viewed in the context of genomic features such as genes, HGMD variants, HapMap SNPs and linkage disequilibrium maps. The data in regions selected by the user can also be opened in the UCSC and Ensembl genome browsers or exported in standard BED or GFF file formats for further analysis. Furthermore, all displays and outputs allow P-value thresholds to be applied, and hence only those associations a researcher deems significant are presented.
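The two-column upload format and the effect of a P-value threshold can be sketched in Python as follows. This is an illustrative example only, under stated assumptions: the file is tab-delimited with no header row, the function names read_upload and below_threshold are hypothetical, and a marker is retained when its P-value is at or below the chosen threshold. The online documentation cited above remains the authority on what the upload feature actually accepts.

```python
import csv
import io


def read_upload(text, delimiter="\t"):
    """Parse two-column text (rs number, P-value) into a list of tuples.

    Assumes no header row; blank lines are skipped.
    """
    records = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        rs_id = row[0].strip()
        p_value = float(row[1])
        if not 0.0 <= p_value <= 1.0:
            raise ValueError(f"line {line_no}: P-value {p_value} outside [0, 1]")
        records.append((rs_id, p_value))
    return records


def below_threshold(records, threshold):
    """Keep only markers whose P-value is at or below the threshold."""
    return [(rs_id, p) for rs_id, p in records if p <= threshold]
```

Validating the column count and the P-value range before upload catches the most common formatting problems, such as a stray header line or a mixed delimiter.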

Figure 1

Data integration example using the GWAS Central graphical browser. (a) The ‘genome view’ of four distinct rheumatoid arthritis studies showing regions of association. The coloured bars denote counts of marker P-values below a selected and adjustable threshold. Selecting the peak on chromosome 6 adjacent to the red asterisk zooms in on that section of the genome to give b. (b) The higher resolution ‘region view’ of the data provides UCSC RefSeq genes, HapMap SNPs for the various cohorts and Human Genetic Mutation Database rare disease mutations.

GWAS Central also provides a BioMart-based system,8 named GWAS Mart, for advanced data interrogation and deeper data downloads. The standard ‘MartView’ interface (http://www.gwascentral.org/biomart/martview) allows flexible querying against the complete database and download of large data sets consisting of either a complete study or up to 1000 markers and associated data per download. Query results can be viewed and exported in HTML and text file (CSV and TSV) formats. Larger data downloads are made available to researchers who agree with GWAS Central’s data sharing policy (available at http://www.gwascentral.org/info/data/data-sharing-statement/). In addition, BioMart web services make GWAS Central data available for use in remote data analyses and bioinformatics workflows.
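For remote use, BioMart servers of this generation typically accept a query expressed as an XML document. The Python sketch below builds such a document and the corresponding request URL without contacting any server. The dataset, filter and attribute names shown in the usage note are hypothetical placeholders, and the existence of a martservice endpoint alongside the MartView page is an assumption based on standard BioMart installations; the real names would need to be read from the GWAS Mart configuration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_mart_query(dataset, filters, attributes, formatter="TSV"):
    """Return a BioMart XML query string for one dataset.

    filters: dict mapping filter name to value.
    attributes: list of attribute (output column) names.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def request_url(service_url, query_xml):
    """Encode the XML query as the 'query' parameter of a GET request."""
    return service_url + "?" + urlencode({"query": query_xml})
```

A call such as build_mart_query("hypothetical_markers", {"hypothetical_pvalue_max": "1e-5"}, ["hypothetical_rs_id", "hypothetical_pvalue"]) would produce a query requesting two columns for markers that pass one filter; the resulting string could then be passed to request_url together with the service address of the BioMart installation.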

As of September 2013, 1605 studies and 67 723 637 P-values are available, corresponding to 2 935 163 unique dbSNP rs numbers. The myriad phenotype terms encompassed by these studies have been additionally grouped into 22 upper-level MeSH disease categories (Supplementary Figure 2), with common areas extensively covered (eg, neoplasms, nutritional and metabolic diseases, nervous system illness and immune system disorders).

Discussion

GWAS Central provides a valuable toolkit for the storage, mining and display of summary-level association data. This resource is substantially more comprehensive than other openly available projects with a similar focus (ie, tens of millions of P-values vs thousands; Table 1). To fully preclude participant re-identification from the open content of this resource, all association signals are presented in a non-directional manner (ie, risk alleles not stated). GWAS Central also provides a range of user tools and interfaces that were not previously available from a single resource. Table 1 compares GWAS Central with other related resources (GWAS Catalog, OADGAR and SNPedia9) with regard to the features provided.

Table 1 Comparison of features between GWAS Central and related resources

Towards the goal of bringing all GWAS data conveniently together, we have collated data from several sources that are not openly available elsewhere; for example, an imminent release will include data from the International Serious Adverse Event Consortium. Further, we have engineered and optimized the software to deal with all available summary-level data generated during a study, instead of limiting the content to only a small number of ‘top’ P-values.

Researchers have reported that their GWAS publications have received increased attention and citations as a result of inclusion in GWAS Central. We actively encourage researchers to submit their complete summary-level findings to GWAS Central, and to this end we provide an Excel template to help researchers organize and supply single-study data. To further lower the barrier to sharing large data sets or multiple studies, researchers can supply data in whichever format they have available and we will apply our automated conversion pipelines to ensure that the findings are correctly represented in the database (the online documentation provides further details on how to submit data: http://www.gwascentral.org/info/how-to/submit-data/). Researchers are also encouraged to ensure their findings are properly and optimally displayed in GWAS Central, and we will prioritize all requests to modify or update existing data with additional P-values (the online documentation provides further details: http://www.gwascentral.org/info/how-to/how-to-modify-data/).

For consortia and research groups that wish to play a more active role in displaying their GWAS findings, we make the source of the GWAS Central platform available as part of a collaboration. Institutes, consortia, teams and even whole countries working together on a common area of interest can thus serve their data to the research community on their own terms and meet the expectations of funders. Indeed, a group based in India has recently adopted this system (GWAS Central India: http://www.vigeyegpms.in/gwascentralindia/). In time, these implementations may well be federated to allow searching across different data sets. The provision of local versions of GWAS Central will therefore reduce the effort involved in creating and maintaining GWAS summary-level databases while making important information in the disease genetics field available to the global research community.