Building a network of open linked data is hard when URIs and URI patterns can change over time and when multiple URIs refer to the same entity. To simplify linking, we suggest the development of a linked data hub in which a registry of namespaces is coupled with data providers that can respond to requests for identifiers in those namespaces. By additionally specifying the format in which information is made available, semantic web agents can mine the network of their preferred linked data and discover new knowledge. This proposal was put forward to the Shared Names Initiative.
1. Identifiers are local to the datasets they are part of - we need to be able to redirect an identifier request with respect to the dataset. This requires a registry.
2. Source documents may be transformed into a number of different representations (HTML, XML, RDF, etc). We should support multiple representations.
3. Multiple data providers can provide any of the different representations. We should indicate which provider provides which representation.
4. Each data provider will likely create their own identifier that is a resolvable dataset identifier. We should indicate that these identifiers denote the same entity.
A public database of 1826 bioinformatic namespaces has been made available on Freebase. It is composed of (overlapping) namespace references in BioMoby (400), EMBL (30), Nucleic Acids Research (846), GO (179), IPI (18), LinkDB (33), NCBI (5), PATHGUIDE (213), SRS (277), LSRN (157) and Uniprot (100). Each entry can be annotated with a namespace, a URL to the data provider homepage, an example identifier and a URL pattern, among other fields. This registry can be easily queried using the Freebase MQL query language via a URL. While the list may be edited by some members of the Bio2RDF group, Sharedname may also choose to maintain it.
Data providers will publish supported namespaces
Data providers will publish an RDF document on their site, detailing necessary contact information (in case of service complaints or failure), and information about each namespace that they are capable of resolving, including their rank (see below), supported formats and URI patterns. Providers will add to the registry by registering their RDF document through some RESTful API. Our service will cache the published data and regularly poll the providers to ensure that their service is functioning.
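The registration, caching and polling steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, field names and the injected `is_alive` check are hypothetical, and a real registry would read the provider's published RDF document rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NamespaceSupport:
    """What a provider declares about one namespace (hypothetical shape)."""
    namespace: str
    rank: str                 # e.g. "primary", "secondary", "tertiary"
    formats: List[str]        # e.g. ["rdf", "html"]


@dataclass
class ProviderDescription:
    """In-memory stand-in for the RDF document a provider publishes."""
    name: str
    contact: str
    namespaces: List[NamespaceSupport] = field(default_factory=list)


class Registry:
    """Caches provider descriptions and tracks whether each one responds."""

    def __init__(self, is_alive: Callable[[ProviderDescription], bool]):
        # is_alive is injected so the sketch does not depend on network access.
        self._is_alive = is_alive
        self._providers: Dict[str, ProviderDescription] = {}
        self.status: Dict[str, bool] = {}

    def register(self, description: ProviderDescription) -> None:
        """Cache a description; it is assumed functional until the next poll."""
        self._providers[description.name] = description
        self.status[description.name] = True

    def poll(self) -> None:
        """Re-check every cached provider and record the result."""
        for name, description in self._providers.items():
            self.status[name] = self._is_alive(description)

    def providers_for(self, namespace: str) -> List[str]:
        """Names of functioning providers that declare the given namespace."""
        return [
            d.name
            for d in self._providers.values()
            if self.status[d.name]
            and any(ns.namespace == namespace for ns in d.namespaces)
        ]
```

Injecting the liveness check keeps the polling policy (HTTP HEAD request, test query, and so on) open, since the proposal does not specify one.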
Data provider rank
Every agent that contributes at least one document in a namespace is considered a data provider of the namespace's dataset. Furthermore, those who are able to resolve the entire namespace may fall into one of three specialized categories: primary (original data source), secondary (mirrors), tertiary (alternative data source). Primary data providers created the knowledge (in whatever format), and make it available. All other designations require that a primary data provider has been identified. Secondary data providers mirror the releases of the primary data providers. Tertiary data providers may act as secondary data providers with other information such as self-referential URIs, augmented reverse links and other linked data. Tertiary data providers may also add new information to the existing data model, and in some cases, they can even provide their own model of the information that may or may not be compatible with that of a primary data provider (if one exists).
Naming
On the RDF Semantic Web, everything is named with a URI reference, a literal or a blank node (having a unique, but local name). For our purposes, we would like to use URIs to name things, whether they are physical or abstract. A common syntax for naming would simplify building links between different kinds of data, but is not strictly necessary. All that is required is: 1) the namespace and 2) the identifier. The namespace should be specified in the directory. The identifier should remain opaque with respect to what it denotes (e.g. no inclusion of words like document, page, record, etc. in the identifier). There is also a need to provide information about a URI in different formats (e.g. HTML, RDF, OWL, JSON, etc.). Thus, data providers should indicate what formats they support for the given namespaces. Data providers should be able to specify a global URI pattern for all supported formats, or specify individual URI patterns for each supported format.
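As a rough illustration of the global versus per-format URI patterns, the sketch below expands a name of the form namespace:identifier into a provider URI. The `{namespace}` and `{identifier}` placeholders, the `"*"` key for the global pattern and the function names are assumptions made for this example; the proposal does not fix a pattern syntax.

```python
from typing import Dict, Tuple


def split_name(name: str) -> Tuple[str, str]:
    """Split 'namespace:identifier' at the first colon."""
    namespace, sep, identifier = name.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected 'namespace:identifier', got {name!r}")
    return namespace, identifier


def build_uri(patterns: Dict[str, str], name: str, fmt: str) -> str:
    """Expand the pattern for fmt, falling back to the global '*' pattern."""
    namespace, identifier = split_name(name)
    pattern = patterns.get(fmt) or patterns.get("*")
    if pattern is None:
        raise LookupError(f"no URI pattern for format {fmt!r}")
    # The identifier is inserted as-is: it stays opaque to the registry.
    return pattern.format(namespace=namespace, identifier=identifier)


# Hypothetical provider entry: one global pattern plus an HTML-specific one.
example_patterns = {
    "*": "http://bio2rdf.org/{namespace}:{identifier}",
    "html": "http://example.org/html/{namespace}/{identifier}",
}
```

With these patterns, `build_uri(example_patterns, "taxonomy:9606", "rdf")` falls back to the global pattern, while the `"html"` format uses its own.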
SPARQL Endpoint
The federation of RDF-based knowledge is possible via SPARQL endpoints. A SPARQL endpoint should be offered as a service of the registry, but should also be specified for each dataset. This enables dynamic queries over RDF-based knowledge bases.
Sharedname URIs
There will be four types of Sharedname URIs:
Redirection Services
The CoolURIs note advises that things that are not web documents, when identified with an HTTP URI without a hash fragment, should dereference to a document about them. The note suggests using the HTTP 303 redirect status code to indicate that another URI, that of a document, will provide a description of the requested resource. Further, the representation may depend on content negotiation: a client can ask for application/rdf+xml in preference to text/html. However, if we are seeking to develop a global registry, then we must accommodate the possibility of multiple data providers, each of which may be capable of providing a set of representations.
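A minimal sketch of how a redirection service might combine the Accept header with the set of available providers is given below. It is not a specification: the document URI layout is inferred from the single http://sharedname.org/doc/bio2rdf/rdf/taxon:9606 example on this page, and the MIME-to-format table, the function names, the 406 fallback and the provider ordering are assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed mapping from MIME types to the format segment of a document URI.
MIME_TO_FORMAT = {"application/rdf+xml": "rdf", "text/html": "html"}


def parse_accept(header: str) -> List[str]:
    """Return acceptable MIME types, highest q-value first (ties keep order)."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        mime = fields[0]
        if not mime:
            continue
        q = 1.0
        for f in fields[1:]:
            if f.startswith("q="):
                try:
                    q = float(f[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            prefs.append((-q, position, mime))
    return [mime for _, _, mime in sorted(prefs)]


def redirect(name: str, accept: str,
             offers: Dict[str, Set[str]]) -> Tuple[int, Optional[str]]:
    """Pick a (status, Location) pair for a Sharedname entity request.

    offers maps provider name -> formats it serves; dict order is taken
    as the registry's provider preference (an assumption of this sketch).
    """
    for mime in parse_accept(accept):
        fmt = MIME_TO_FORMAT.get(mime)
        if fmt is None:
            continue
        for provider, formats in offers.items():
            if fmt in formats:
                return 303, f"http://sharedname.org/doc/{provider}/{fmt}/{name}"
    return 406, None
```

Wildcards such as `*/*` are deliberately not handled here; a fuller version would also need a policy for choosing among several providers that offer the same format.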
Each of the different types of Sharedname URIs resolves to a different service. Each service outputs RDF or its corresponding HTML, depending on content-negotiation. These services are:
Free and Open Registry
Sharedname should never charge for its services, but it may seek out donations to support it. Sharedname will make no attempt to control the list of data providers, but may choose instead to create a community ranking system for users to rank providers (e.g. wiki). This information can then be published along with the list of data providers for a requested name.
The suggested data model for Sharedname entities (e.g. http://sharedname.org/taxon:9606) is defined by the following ontology (available in OWL). Data providers may define a Sharedname entity but they will not necessarily use the Sharedname URI (e.g. http://bio2rdf.org/taxonomy:9606). We suggest the use of owl:sameAs to assert that the provider's resource is equivalent to the Sharedname entity. Each resource must be linked with the documents that define it via the rdfs:isDefinedBy property, and each document is typed according to the syntactic representation used. A taxonomy of document types is provided by the ontology. For instance, the RDF for the bio2rdf example will look like this:
<http://sharedname.org/taxon:9606>
owl:sameAs <http://bio2rdf.org/taxonomy:9606> .
<http://bio2rdf.org/taxonomy:9606>
rdfs:isDefinedBy <http://sharedname.org/doc/bio2rdf/rdf/taxon:9606> .
<http://sharedname.org/doc/bio2rdf/rdf/taxon:9606>
rdf:type <http://sharedname.org/RdfXmlDocument>;
sharedname:mimetype "application/rdf+xml" .
Every document is part of (dc:isPartOf) the Dataset which defines the namespace of the resource. The dc:isPartOf property is defined as transitive, and it can also be used to define sub-datasets. Documents are associated with their providers via the dc:publisher property.
A dataset is a collection of documents about a particular domain. Real world entities (resources) are not part of a dataset; the documents that describe them are. Any publisher of a document in a dataset is automatically a provider of the dataset. However, more specialized relations can exist between the provider and the dataset (i.e. primary, secondary, tertiary). These define the level of commitment of the provider towards the dataset, and should be used by agents to pick the most appropriate document for an entity.
The following property chain rules can be used to derive the proper relationship between the document and the resource. This relationship follows from the provider's rank towards the dataset.
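One way an agent might use the provider rank to choose among candidate documents is sketched below. The preference order (primary before secondary before tertiary, with any other provider last) is an assumption of this example; the proposal says only that rank should inform the choice. The function and constant names are hypothetical.

```python
from typing import List, Tuple

# Assumed preference order; lower numbers are preferred.
RANK_ORDER = {"primary": 0, "secondary": 1, "tertiary": 2}


def pick_document(candidates: List[Tuple[str, str]]) -> str:
    """Return the document URI whose provider has the most preferred rank.

    candidates is a list of (document_uri, provider_rank) pairs. Providers
    with no specialized rank sort after all ranked ones. On ties, the
    first candidate listed wins.
    """
    if not candidates:
        raise ValueError("no candidate documents")
    best = min(candidates, key=lambda c: RANK_ORDER.get(c[1], len(RANK_ORDER)))
    return best[0]
```

An agent with different priorities, for instance one that wants the augmented links offered by tertiary providers, could simply supply a different ordering.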
isDefinedBy ° isPartOf ⇒ isDefinedBy
publisher ° is(Primary|Secondary|Tertiary)Provider ° inv(rdfs:isDefinedBy) ⇒ is(Primary|Secondary|Tertiary)DocumentFor
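To make the two chain rules concrete, the sketch below applies them to a small set of triples until nothing new can be derived. It is an illustrative forward-chaining loop, not an OWL reasoner. Prefixes are dropped, only the Primary variant of the second rule is spelled out, and the reading of is(Primary)Provider as pointing from provider to dataset is an assumption.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Step = Tuple[str, bool]  # (predicate, follow_inverse)


def apply_chain(triples: Set[Triple], chain: List[Step], conclusion: str) -> Set[Triple]:
    """Derive (a, conclusion, z) for every path a -> ... -> z along chain."""
    def pairs(pred: str, inverse: bool) -> Set[Tuple[str, str]]:
        found = {(s, o) for s, p, o in triples if p == pred}
        return {(o, s) for s, o in found} if inverse else found

    paths = pairs(*chain[0])
    for pred, inverse in chain[1:]:
        nxt = pairs(pred, inverse)
        paths = {(a, c) for a, b in paths for b2, c in nxt if b == b2}
    return {(a, conclusion, c) for a, c in paths}


def closure(triples: Set[Triple], rules) -> Set[Triple]:
    """Apply all rules repeatedly until no new triple appears."""
    known = set(triples)
    while True:
        derived = set()
        for chain, conclusion in rules:
            derived |= apply_chain(known, chain, conclusion)
        if derived <= known:
            return known
        known |= derived


RULES = [
    # isDefinedBy o isPartOf => isDefinedBy
    ([("isDefinedBy", False), ("isPartOf", False)], "isDefinedBy"),
    # publisher o isPrimaryProvider o inv(isDefinedBy) => isPrimaryDocumentFor
    ([("publisher", False), ("isPrimaryProvider", False), ("isDefinedBy", True)],
     "isPrimaryDocumentFor"),
]
```

Note that the second rule only fires after the first has linked the resource to the dataset, which is why the rules are iterated to a fixed point rather than applied once.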
The Sharedname ontology should reuse some of the existing Semantic Web vocabularies. As an initial idea, we suggest extending the FOAF and SIOC ontologies.
sharedname:Document
rdfs:subClassOf foaf:Document
sharedname:Provider
rdfs:subClassOf foaf:Agent
sharedname:Dataset
rdfs:subClassOf sioc:Space
sharedname:Document
rdfs:subClassOf sioc:Container
It would be ideal to reuse the dataset concept coined by the voiD vocabulary, but the ontologies are likely incompatible because the voiD dataset is restricted to a unique provider.
A primary data provider is the original creator of a dataset. For instance, Cycorp, Inc. is the primary provider of the opencyc dataset. Documents that are published by a primary provider are related to the entity by the isPrimaryDocumentFor relation. The RDF graph below partially shows what one would get when requesting the http://sharedname.org/opencyc:Mx4rvVnZe5wpEbGdrcN5Y29ycA (the Internet) resource.
A secondary data provider publishes documents which are exact copies of the primary provider's documents, or assigns a new URI but “only” asserts an owl:sameAs to the original resource. Let's assume that example.org downloads the complete DBpedia dataset and hosts it using URIs of the form http://example.org/[identifier]. The following RDF graph partially illustrates what the dbpedia:Berlin entity would look like in the presence of this mirror.
A tertiary data provider acts as secondary data providers do, with other information such as self-referential URIs, augmented reverse links, and other linked data. For example, Bio2RDF is a tertiary data provider for the Uniprot protein dataset, since it augments the information of the Uniprot resource, and also changes the representation (e.g. eliminates blank nodes). The following RDF graph is an example of a Uniprot resource.
Another example of a tertiary provider is DBpedia.org. DBpedia provides RDF linked data about Wikipedia articles, therefore DBpedia is a tertiary provider of the wikipedia namespace.
[1] Belleau, F., Nolin, M. A., Tourigny, N., Rigault, P. and Morissette, J. (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform. 41, 706-16.
Discussion
Hi Michel,
I think that due to similarity of project name and aims, and your use of the sharedname.org domain, visitors to this page - such as the ones I'll direct to it - might be confused about the relation of this proposal to the Shared Names proposal (http://sharedname.org/). Do you think you could add a mention of the Shared Names project, together with a bit of description of the differences, e.g. that Shared Names is not looking (yet) to coin URIs for entities other than documents?
Nice to see that you're thinking about this problem. The Shared Names discussion group http://groups.google.com/group/shared-names and wiki http://sharedname.org/ are at your disposal.
Thanks Jonathan