Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data
Abstract. Bio2RDF currently provides the largest network of Linked Data for
the Life Sciences. Here, we describe a significant update to increase the overall
quality of RDFized datasets generated from open scripts powered by an API to
generate registry-validated IRIs, dataset provenance and metrics, SPARQL
endpoints, downloadable RDF and database files. We demonstrate federated
SPARQL queries within and across the Bio2RDF network, including semantic
integration using the Semanticscience Integrated Ontology (SIO). This work
forms a strong foundation for increased coverage and continuous integration of
data in the life sciences.
1 Introduction
With the advent of the World Wide Web, journals have increasingly augmented their
peer-reviewed publications with downloadable experimental data. While the
increase in data availability should be cause for celebration, the potential for biomedi-
cal discovery across all of these data is hampered by access restrictions, incompatible
formats, lack of semantic annotation and poor connectivity between datasets [1]. Al-
though organizations such as the National Center for Biotechnology Information
(NCBI) and the European Bioinformatics Institute (EBI) have made great strides to
extract, capture and integrate data, the lack of formal, machine-understandable seman-
tics results in ambiguity in the data and the relationships between them. With over
1500 biological databases, it becomes necessary to implement a more sophisticated
scheme to unify the representation of diverse biomedical data so that it becomes easi-
er to integrate and explore [2]. Importantly, there is a fundamental need to capture the
provenance of these data in a manner that will support experimental design and repro-
ducibility in scientific research. Providing data also presents real practical challenges,
including ensuring persistence, availability, scalability, and providing the right tools
to facilitate data exploration including query formulation.
* These authors contributed equally to this work.
P. Cimiano et al. (Eds.): ESWC 2013, LNCS 7882, pp. 200–212, 2013.
© Springer-Verlag Berlin Heidelberg 2013
1 http://www.w3.org/DesignIssues/Principles.html
2 https://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Banff_Manifesto
3 http://www.w3.org/blog/hcls/
202 A. Callahan et al.
to generate validated IRIs. We further generate provenance and statistics for each
dataset, and provide public SPARQL endpoints, downloadable database files and
RDF files. We demonstrate federated SPARQL queries within and across the
Bio2RDF network, including queries that make use of the Semanticscience Integrated
Ontology (SIO) 4, which provides a simple model with a rich set of relations to coor-
dinate ontologies, data and services.
2 Methods
In the following section we will discuss the procedures and improvements used to
generate Bio2RDF R2 compliant Linked Open Data including entity naming, dataset
provenance and statistics, ontology mapping, query and exploration.
For data with a source assigned identifier, entities are named as follows:
http://bio2rdf.org/namespace:identifier
where ‘namespace’ is the preferred short name of a biological dataset as found in our
dataset registry and the ‘identifier’ is the unique string used by the source provider to
identify any given record. For example, the HUGO Gene Nomenclature Committee
identifies the human prostaglandin E synthase gene (PIG12) with the accession num-
ber “9599”. This dataset is assigned the namespace “hgnc” in our dataset registry,
thus, the corresponding Bio2RDF IRI is
http://bio2rdf.org/hgnc:9599
For data lacking a source assigned identifier, entities are named as follows:
http://bio2rdf.org/namespace_resource:identifier
where ‘namespace’ is the preferred short name of a biological dataset as found in our
dataset registry and ‘identifier’ is uniquely created and assigned by the Bio2RDF
script. This pattern is often used to identify objects that arise from the conversion of
n-ary relations into an object with a set of binary relations. For example, the Compar-
ative Toxicogenomics Database (CTD) describes associations between diseases and
drugs, but does not specify identifiers for these associations, and hence we assign a
new stable identifier for each, such as
http://bio2rdf.org/ctd_resource:C112297D029597
for the chemical-disease association between 10,10-bis(4-pyridinylmethyl)-9(10H)-
anthracenone (mesh:C112297) and the Romano-Ward Syndrome (mesh:D029597).
Finally, dataset-specific types and relations are named as follows:
http://bio2rdf.org/namespace_vocabulary:identifier
4 http://code.google.com/p/semanticscience/wiki/SIO
where ‘namespace’ is the preferred short name of a biological dataset as found in our
dataset registry and ‘identifier’ is uniquely created and/or assigned by the Bio2RDF
script. For example, the NCBI’s HomoloGene resource provides groups of homolog-
ous eukaryotic genes and includes references to the taxa from which the genes were
isolated. Hence, the Homologene group is identified as a class
http://bio2rdf.org/homologene_vocabulary:HomoloGene_Group
while the taxonomic relation is specified with:
http://bio2rdf.org/homologene_vocabulary:has_taxid
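The three naming patterns above can be sketched as small helper functions. This is only an illustration of the IRI shapes described in the text, not the Bio2RDF API itself: the function names are hypothetical, and the rule for deriving the CTD association identifier (concatenating the two MeSH identifiers) is inferred from the single example given above.

```python
# Illustrative helpers for the three Bio2RDF R2 IRI patterns.
# Function names are hypothetical; only the IRI shapes come from the text.
BASE = "http://bio2rdf.org/"


def record_iri(namespace: str, identifier: str) -> str:
    """IRI for a record that has a source-assigned identifier."""
    return f"{BASE}{namespace}:{identifier}"


def resource_iri(namespace: str, identifier: str) -> str:
    """IRI for an entity whose identifier is assigned by the conversion script."""
    return f"{BASE}{namespace}_resource:{identifier}"


def vocabulary_iri(namespace: str, identifier: str) -> str:
    """IRI for a dataset-specific type or relation."""
    return f"{BASE}{namespace}_vocabulary:{identifier}"


def association_identifier(chemical_id: str, disease_id: str) -> str:
    """Assumed rule: concatenate the identifiers of the associated entities."""
    return chemical_id + disease_id


print(record_iri("hgnc", "9599"))
print(resource_iri("ctd", association_identifier("C112297", "D029597")))
print(vocabulary_iri("homologene", "has_taxid"))
```

Keeping the namespace as an explicit argument means every generated IRI can be checked against the dataset registry before it is written out.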
2.4 Provenance
Bio2RDF scripts now generate provenance using the Vocabulary of Interlinked Data-
sets (VoID), the Provenance vocabulary (PROV) and Dublin Core vocabulary. As
illustrated in Fig. 1, each item in a dataset is linked using void:inDataset to a prove-
nance object (typed as void:Dataset). The provenance object represents a Bio2RDF
dataset, in that it is a version of the source data whose attributes include a label, the
creation date, the creator (script URL), the publisher (Bio2RDF.org), the Bio2RDF
license and rights, the download location for the dataset and the SPARQL endpoint in
which the resource can be found. Importantly, we use the W3C PROV relation
‘wasDerivedFrom’ to link this Bio2RDF dataset to the source dataset, along with its
licensing and source location.
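A minimal sketch of such a provenance record, written out as N-Triples, is given below. The choice of properties (void:inDataset, void:dataDump, void:sparqlEndpoint, dcterms:created, dcterms:creator, dcterms:publisher, prov:wasDerivedFrom) is an assumption based on the vocabularies named above; the properties actually emitted by the Bio2RDF scripts may differ. The dataset IRI, source IRI and creation date are hypothetical example values.

```python
# Sketch of a dataset provenance record serialised as N-Triples.
# Property selection and example values are assumptions, not the scripts' output.
VOID = "http://rdfs.org/ns/void#"
PROV = "http://www.w3.org/ns/prov#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    return f"<{value}>"


def literal(value: str) -> str:
    # Escape backslashes and double quotes so the literal stays valid N-Triples.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def provenance_triples(item, dataset, source, script_url, created, endpoint, dump):
    """Return N-Triples lines linking an item to its dataset and the dataset to its source."""
    rows = [
        (item, VOID + "inDataset", iri(dataset)),
        (dataset, RDF_TYPE, iri(VOID + "Dataset")),
        (dataset, DCT + "created", literal(created)),
        (dataset, DCT + "creator", iri(script_url)),
        (dataset, DCT + "publisher", iri("http://bio2rdf.org")),
        (dataset, VOID + "dataDump", iri(dump)),
        (dataset, VOID + "sparqlEndpoint", iri(endpoint)),
        (dataset, PROV + "wasDerivedFrom", iri(source)),
    ]
    return [f"{iri(s)} {iri(p)} {o} ." for s, p, o in rows]


for line in provenance_triples(
    item="http://bio2rdf.org/hgnc:9599",
    dataset="http://bio2rdf.org/hgnc_resource:example-dataset",  # hypothetical
    source="http://example.org/hgnc-source",                     # hypothetical
    script_url="https://github.com/bio2rdf/bio2rdf-scripts",
    created="2013-01-01",                                        # hypothetical
    endpoint="http://hgnc.bio2rdf.org",
    dump="http://download.bio2rdf.org",
):
    print(line)
```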
5 http://opensource.org/licenses/MIT
6 https://github.com/bio2rdf/bio2rdf-scripts
A set of nine dataset metrics are computed for each dataset that summarize their contents:
where namespace is the preferred short name for the Bio2RDF dataset. While the
values for metrics 1-4 are provided via suitably named datatype properties, metrics 5-9
require a more complex, typed object. For instance, a SPARQL query to retrieve all
type-predicate-type links and their frequencies from the CTD endpoint is:
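As an illustration of what a type-predicate-type summary contains (independent of the SPARQL query referred to above), the following sketch computes the same kind of frequency table over a small in-memory list of triples. It is not the Bio2RDF implementation: the function name and the `ex:` terms in the toy data are hypothetical.

```python
# Sketch: count (subject type, predicate, object type) links in a set of triples.
# The toy triples use made-up "ex:" names and do not come from any Bio2RDF dataset.
from collections import Counter

RDF_TYPE = "rdf:type"


def type_predicate_type_counts(triples):
    """Return a Counter keyed by (subject type, predicate, object type)."""
    # First pass: collect the declared types of every typed node.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    # Second pass: count each non-type link once per pair of endpoint types.
    counts = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for subject_type in types.get(s, ()):
            for object_type in types.get(o, ()):
                counts[(subject_type, p, object_type)] += 1
    return counts


toy_triples = [
    ("ex:c1", RDF_TYPE, "ex:Chemical"),
    ("ex:c2", RDF_TYPE, "ex:Chemical"),
    ("ex:d1", RDF_TYPE, "ex:Disease"),
    ("ex:g1", RDF_TYPE, "ex:Gene"),
    ("ex:c1", "ex:associated_with", "ex:d1"),
    ("ex:c2", "ex:associated_with", "ex:d1"),
    ("ex:c1", "ex:interacts_with", "ex:g1"),
]

for key, frequency in sorted(type_predicate_type_counts(toy_triples).items()):
    print(key, frequency)
```

Links whose subject or object has no declared type are simply not counted in this sketch.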
7 http://vocab.sindice.net/analytics#
8 https://github.com/bio2rdf/bio2rdf-mapping
hosts to reuse the same configuration documents in slightly different ways. For exam-
ple, the Bio2RDF Web Application R2 profile has been configured to resolve queries
that include the new ‘_resource’ and ‘_vocabulary’ namespaces (section 2.1), as well as
existing query types used by the base Bio2RDF profile, and to resolve these queries
using the R2 SPARQL endpoints.
The Bio2RDF Web Application reads the requested RDF format from the HTTP Accept
request header and, unlike most Linked Data providers, does not use URL suffixes for
Content Negotiation, as suffixes would make it difficult to reliably distinguish
identifiers across all of the namespaces that are resolved by Bio2RDF. Specifically,
there is no guarantee that a namespace will not contain identifiers ending in the same
suffix as a file format. For example, if a namespace had the identifier “plants.html”,
the Bio2RDF Web Application would not be able to resolve the URI consistently to
non-HTML formats using suffix-based Content Negotiation. For this reason, the
Bio2RDF Web Application directive to resolve a specific format is a prefixed path,
which is easy for any scriptable User Agent to generate. In the example above the
identifier could be resolved to an RDF/XML document using
“/rdfxml/namespace:plants.html”, without any ambiguity as to the meaning of the
request, as the file format prefix is stripped from the path by the web application,
based on the web application configuration.
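The prefixed-path scheme can be sketched as follows. This is a simplified illustration, not the Bio2RDF Web Application code: the set of format prefixes is hypothetical (only “rdfxml” appears in the text) and the function name is made up.

```python
# Sketch: separate an optional format prefix from a "namespace:identifier" path.
# KNOWN_FORMAT_PREFIXES is a hypothetical configuration; only "rdfxml" is from the text.
KNOWN_FORMAT_PREFIXES = {"rdfxml", "n3", "html"}


def parse_request_path(path: str):
    """Return (format prefix or None, namespace, identifier) for a request path."""
    trimmed = path.lstrip("/")
    head, slash, tail = trimmed.partition("/")
    if slash and head in KNOWN_FORMAT_PREFIXES:
        # The leading segment names a format; strip it from the path.
        fmt, rest = head, tail
    else:
        # No recognised prefix: the whole path is the entity reference.
        fmt, rest = None, trimmed
    namespace, colon, identifier = rest.partition(":")
    if not (colon and namespace and identifier):
        raise ValueError(f"expected 'namespace:identifier', got {rest!r}")
    return fmt, namespace, identifier


print(parse_request_path("/rdfxml/namespace:plants.html"))
print(parse_request_path("/namespace:plants.html"))
```

Because the format is taken from the front of the path, an identifier such as “plants.html” is passed through untouched, whatever its suffix looks like.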
3 Results
We also have 10 additional updated scripts that are currently generating updated datasets
and SPARQL endpoints to be available with the next release: ChEMBL, DBpedia,
GenBank, PathwayCommons, the RCSB Protein Databank, PubChem, PubMed, Ref-
Seq, UniProt (including UniRef and UniParc) and UniSTS.
Dataset SPARQL endpoints are available at http://[namespace].bio2rdf.org. For
example, the Saccharomyces Genome Database (SGD) SPARQL endpoint is availa-
ble at http://sgd.bio2rdf.org. All updated Bio2RDF Linked Data and their correspond-
ing Virtuoso DB files are available for download at http://download.bio2rdf.org.
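As a small sketch of the endpoint naming convention, the helpers below build an endpoint address from a namespace and encode a query as a GET request URL. The “/sparql” path is an assumption (the text only gives the host name), the function names are hypothetical, and the `query` parameter follows the SPARQL Protocol. No network request is made.

```python
# Sketch: derive a dataset endpoint address and a SPARQL GET request URL.
# The "/sparql" path is an assumption; only the host pattern comes from the text.
from urllib.parse import urlencode


def endpoint_url(namespace: str) -> str:
    """Endpoint host for a dataset, following http://[namespace].bio2rdf.org."""
    return f"http://{namespace}.bio2rdf.org"


def sparql_request_url(namespace: str, query: str, path: str = "/sparql") -> str:
    """URL-encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint_url(namespace) + path + "?" + urlencode({"query": query})


print(endpoint_url("sgd"))
print(sparql_request_url("sgd", "SELECT * WHERE { ?s ?p ?o } LIMIT 5"))
```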
Table 1. Bio2RDF Release 2 datasets with select dataset metrics. The asterisks indicate
datasets that are new to Bio2RDF.
Table 3. Partial results from a query to obtain drug-target interactions from the Bio2RDF
DrugBank SPARQL endpoint
Dataset metrics can also facilitate federated queries over multiple Bio2RDF end-
points in a similar manner. For example, the following query retrieves all biochemical
reactions from the Bio2RDF Biomodels endpoint that are kinds of “protein catabolic
process”, as defined by the Gene Ontology in the NCBO Bioportal endpoint:
The mappings between Bio2RDF dataset vocabularies and SIO make it possible to
formulate queries that can be applied across all Bio2RDF SPARQL endpoints, and
can be used to integrate data from multiple sources, as opposed to a priori formula-
tion of dataset specific queries against targeted endpoints. For instance, we can ask for
chemicals that affect the ‘Diabetes II mellitus’ pathway and that are available in tablet
form using the Comparative Toxicogenomics Database (CTD) and the National Drug
Codes (NDC) Bio2RDF datasets, and the mappings of their vocabularies to SIO:
4 Discussion
Bio2RDF Release 2 marks several important milestones for the open source Bio2RDF
project. First, the consolidation of scripts into a single GitHub repository will make it
easier for the community to report problems, contribute code fixes, or contribute new
scripts to add more data into the Bio2RDF network of linked data for the life sciences.
Already, we are working with members of the W3C Linking Open Drug Data
(LODD) to add their code to this GitHub repository, identify and select an open
source license, and improve the linking of Bio2RDF data. With new RDF generation
guidelines and example queries that demonstrate use of dataset metrics and prove-
nance, we believe that Bio2RDF has the potential to become a central meeting point
for developing the biomedical semantic web. Indeed, we welcome those that think
Bio2RDF could be useful to their projects to contact us on the mailing list and partici-
pate in improving this community resource.
A major aspect of what makes Bio2RDF successful from a Linked Data perspec-
tive is the use of a central registry of datasets in order to normalize generated IRIs.
Although we previously created a large aggregated namespace directory, the lack of
extensive curation meant that the directory contained significant overlap and omis-
sions. Importantly, no script specifically made use of this registry, and thus adhe-
rence to the namespaces was strictly in the hands of developers at the time of writing
the code. In consolidating the scripts, we found significant divergence in the use of a
preferred namespace for generating Bio2RDF IRIs, either because of the overlap in
9 http://semanticscience.org/resource/SIO_011126
directory content, or in the community adopting another preferred prefix. With the
addition of an API to automatically generate the preferred Bio2RDF IRI from any
number of dataset prefixes (community-preferred synonyms can be recorded), all
Bio2RDF IRIs can be validated such that unknown dataset prefixes must be defined in
the registry. Importantly, our registry has been shared with maintainers of identifi-
ers.org in order for their contents to be incorporated into the MIRIAM registry [17]
which powers that URL resolving service. Once we have merged our resource list-
ings, we expect to make direct use of the MIRIAM registry to list new entries, and to
have identifiers.org list Bio2RDF as a resolver for most of its entries. Moreover, since
the MIRIAM registry describes regular expressions that specify the identifier pattern,
Bio2RDF scripts will be able to check whether an identifier is valid for a given na-
mespace, thereby improving the quality of data produced by Bio2RDF scripts.
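The registry-based normalization and the anticipated identifier check can be sketched as below. This is a toy model under stated assumptions, not the Bio2RDF registry API: the registry entries, synonyms and regular expressions are hypothetical, as are the function names.

```python
# Sketch: normalise dataset prefixes through a registry and validate identifiers.
# REGISTRY content (synonyms and patterns) is hypothetical example data.
import re

REGISTRY = {
    "hgnc": {"synonyms": {"hgnc", "hugo"}, "pattern": r"\d+"},
    "mesh": {"synonyms": {"mesh", "msh"}, "pattern": r"[A-Z]\d+"},
}


def preferred_prefix(prefix: str) -> str:
    """Map any registered synonym to the preferred namespace; reject unknown prefixes."""
    wanted = prefix.lower()
    for namespace, entry in REGISTRY.items():
        if wanted in entry["synonyms"]:
            return namespace
    raise KeyError(f"prefix {prefix!r} is not defined in the registry")


def validated_iri(prefix: str, identifier: str) -> str:
    """Build a Bio2RDF IRI, checking the identifier against the namespace pattern."""
    namespace = preferred_prefix(prefix)
    if re.fullmatch(REGISTRY[namespace]["pattern"], identifier) is None:
        raise ValueError(f"{identifier!r} is not a valid {namespace} identifier")
    return f"http://bio2rdf.org/{namespace}:{identifier}"


print(validated_iri("HUGO", "9599"))
```

Rejecting unknown prefixes outright, rather than passing them through, is what forces new dataset prefixes to be defined in the registry first.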
The dataset metrics that we now compute for each Bio2RDF dataset have signifi-
cant value for users and providers. First, users can get fast and easy access to basic
dataset metrics (number of triples, etc.) as well as more sophisticated summaries such
as which types are in the dataset and how they are connected to one another. This data
graph summary is the basis for SparQLed, an open source tool to assist in query
composition through context-sensitive autocomplete functionality. Use of these
summaries also reduces the load on data provider servers, which in turn frees up
resources to more quickly respond to interesting domain-specific queries. Second, we
anticipate that these metrics may be useful in monitoring dataset flux. Bio2RDF now
plans to provide bi-annual releases of data, and as such, we will develop infrastructure
to monitor change in order to understand which datasets are evolving, and how
they are changing. Thus, users will be better able to focus on content changes and
providers will be able to make informed decisions about the hardware and software
resources required to provision the data to Bio2RDF users.
Our demonstration of using SIO to map Bio2RDF dataset vocabularies helps facili-
tate the composition of queries for the basic kinds of data or their relationships. Since
SIO contains unified and rich axiomatic descriptions of its classes and properties, in
the future we intend to explore how these can be automatically reasoned about to
improve query answering with newly entailed facts as well as to check the consisten-
cy of Bio2RDF linked data itself.
References
1. Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D.P., Kania,
R., Schaeffer, M., St. Pierre, S., et al.: Big data: The future of biocuration. Na-
ture 455(7209), 47–50 (2008)
2. Goble, C., Stevens, R.: State of the nation in data integration for bioinformatics. J. Bio-
med. Inform. 41(5), 687–693 (2008)
3. Cerami, E.G., Bader, G.D., Gross, B.E., Sander, C.: cPath: open source software for col-
lecting, storing, and querying biological pathways. BMC Bioinformatics 7, 497 (2006)
4. Chen, H., Yu, T., Chen, J.Y.: Semantic Web meets Integrative Biology: a survey. Brief
Bioinform. (2012)
5. Ruebenacker, O., Moraru, I.I., Schaff, J.C., Blinov, M.L.: Integrating BioPAX pathway
knowledge with SBML models. IET Syst. Biol. 3(5), 317–328 (2009)
6. Sansone, S.A., Rocca-Serra, P., Field, D., Maguire, E., Taylor, C., Hofmann, O., Fang, H.,
Neumann, S., Tong, W., Amaral-Zettler, L., et al.: Toward interoperable bioscience data.
Nat. Genet. 44(2), 121–126 (2012)
7. Berlanga, R., Jimenez-Ruiz, E., Nebot, V.: Exploring and linking biomedical resources
through multidimensional semantic spaces. BMC Bioinformatics 13(suppl. 1), S6 (2012)
8. Gennari, J.H., Neal, M.L., Galdzicki, M., Cook, D.L.: Multiple ontologies in action: com-
posite annotations for biosimulation models. J. Biomed. Inform. 44(1), 146–154 (2011)
9. Hoehndorf, R., Dumontier, M., Gennari, J.H., Wimalaratne, S., de Bono, B., Cook, D.L.,
Gkoutos, G.V.: Integrating systems biology models and biomedical ontologies. BMC Syst.
Biol. 5, 124 (2011)
10. Hoehndorf, R., Dumontier, M., Oellrich, A., Rebholz-Schuhmann, D., Schofield, P.N.,
Gkoutos, G.V.: Interoperability between biomedical ontologies through relation expansion,
upper-level ontologies and automatic reasoning. PLoS One 6(7), e22006 (2011)
11. Jonquet, C., Lependu, P., Falconer, S., Coulet, A., Noy, N.F., Musen, M.A., Shah, N.H.:
NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources.
Web Semant. 9(3), 316–324 (2011)
12. Ruttenberg, A., Rees, J.A., Samwald, M., Marshall, M.S.: Life sciences on the Semantic
Web: the Neurocommons and beyond. Brief Bioinform. 10(2), 193–204 (2009)
13. Momtchev, V., Peychev, D., Primov, T., Georgiev, G.: Expanding the Pathway and Inte-
raction Knowledge in Linked Life Data. In: Semantic Web Challenge: 2009, Amsterdam
(2009)
14. Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.: Chem2Bio2RDF: a
semantic framework for linking and data mining chemogenomic and systems chemical bi-
ology data. BMC Bioinformatics 11, 255 (2010)
15. Campinas, S., Perry, T.E., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing RDF
Graph Summary with Application to Assisted SPARQL Formulation, pp. 261–266 (2012)
16. Ansell, P.: Model and prototype for querying multiple linked scientific datasets. Future
Generation Computer Systems 27(3), 329–333 (2011)
17. Juty, N., Le Novere, N., Laibe, C.: Identifiers.org and MIRIAM Registry: community re-
sources to provide persistent identification. Nucleic Acids Res. 40(Database issue),
D580–D586 (2012)