This hack was originally available as an article on my personal website (the original link, http://tyrelle.net/2004/03/gbrdf, is no longer available; use the PURL purl.oclc.org/NET/gbrdf instead). I have added it to the wiki in the hope that it will be updated and expanded. –Greg
The so-called “data explosion” in biology has not led to an equal “metadata explosion”. Little RDF/XML metadata describing biological file formats is currently available, although many databases provide XML data output. RDF/XML formatted biological metadata will allow semantically enabled software agents to reason about biologically relevant resources on the web. For example, an agent could automatically determine a file's format, and the parser necessary to read the file, based on RDF/XML metadata associated with the resource.
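The agent scenario above can be illustrated with a minimal Python sketch. It assumes the resource's metadata carries a dc:format property pointing at a format URI (the same http://formats.bioid.info/genbank URI used in the example output later in this article). The `PARSERS` registry and the parser name `genbank-parser` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical registry: format URI -> name of the parser an agent would use.
PARSERS = {
    "http://formats.bioid.info/genbank": "genbank-parser",
}

def pick_parser(rdf_xml):
    """Return the parser name registered for the resource's dc:format, or None."""
    root = ET.fromstring(rdf_xml)
    fmt = root.find(f"{{{RDF}}}Description/{{{DC}}}format")
    if fmt is None:
        return None
    # The format is given as a resource, so its URI is in rdf:resource.
    return PARSERS.get(fmt.get(f"{{{RDF}}}resource"))

METADATA = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:format rdf:resource="http://formats.bioid.info/genbank"/>
  </rdf:Description>
</rdf:RDF>"""

print(pick_parser(METADATA))  # prints "genbank-parser"
```

The agent never opens the Genbank file itself; it only needs the metadata record to decide how the file should be read.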
Metadata is “data about data”, or a description of a data set. The distinction between metadata and data is often arbitrary. In the context of a Genbank sequence file, the raw sequence data can be considered the “data” and the sequence annotations in the file are sequence metadata. However, when considering metadata about a Genbank file there are really two distinct resources being described. The first is the biological sequence and the second is the representation of that sequence in a Genbank formatted text file. Discovering metadata about biological resources, such as Genbank files, will be important for the evolution of the semantic web for the life sciences.
This area of metadata is about providing information about, or descriptions of, datasets so that agents or users do not have to download and investigate the data itself. This is not a new problem. However, now that many biological databases provide XML output, an XSL stylesheet can be used to generate RDF/XML metadata about their files, which makes the problem easier to solve.
The NCBI Genbank database provides XML output which can potentially be transformed into RDF/XML metadata. This report describes a proof-of-principle Extensible Stylesheet Language (XSL) transformation of the Genbank XML file format that produces RDF/XML metadata about that file. To achieve this, Genbank data was mapped to the Dublin Core vocabulary, and a small Genbank-specific RDF vocabulary was used for values which did not map appropriately.
An example of the Genbank to RDF/XML transform is shown below. A small section of the Genbank XML format from the file SERPINC1 is used as the input file:
<?xml version="1.0"?>
<!DOCTYPE GBSet PUBLIC "-//NCBI//NCBI GBSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
<GBSet>
  <GBSeq>
    <GBSeq_create-date>18-JAN-1993</GBSeq_create-date>
    <GBSeq_definition>H.sapiens gene for antithrombin III</GBSeq_definition>
    <GBSeq_primary-accession>X68793</GBSeq_primary-accession>
    <GBSeq_accession-version>X68793.1</GBSeq_accession-version>
    <GBSeq_other-seqids>
      <GBSeqid>emb|X68793.1|HSAT3</GBSeqid>
      <GBSeqid>gi|28906</GBSeqid>
    </GBSeq_other-seqids>
    ...
  </GBSeq>
</GBSet>
The file is transformed to RDF/XML metadata:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:gb="http://www.ncbi.nlm.nih.gov/">
  <rdf:Description rdf:about="">
    <dc:format rdf:resource="http://formats.bioid.info/genbank"/>
    <dc:creator>NCBI - GenBank Database</dc:creator>
    <dc:publisher>National Center for Biotechnology Information</dc:publisher>
    <dc:description>H.sapiens gene for antithrombin III</dc:description>
    <rdfs:seeAlso>emb|X68793.1|HSAT3</rdfs:seeAlso>
    <rdfs:seeAlso>gi|28906</rdfs:seeAlso>
    <dc:hasVersion>X68793.1</dc:hasVersion>
    <dc:created>18-JAN-1993</dc:created>
    <dc:source>Homo sapiens (human)</dc:source>
    <gb:accession>X68793</gb:accession>
  </rdf:Description>
</rdf:RDF>
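The element-to-property mapping visible in the two listings above can be sketched in a few lines of Python. This is not the genbank2rdf.xsl stylesheet itself, only an approximation of the same mapping using the standard library. The `FIELD_MAP` table and the function name `genbank_to_rdf` are illustrative assumptions, and the fixed dc:format, dc:creator and dc:publisher values, as well as dc:source (which comes from the elided part of the input), are left out for brevity.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
GB = "http://www.ncbi.nlm.nih.gov/"

# Assumed mapping table: Genbank XML path -> (namespace, property name).
# Property names are copied from the example output as-is.
FIELD_MAP = [
    ("GBSeq_definition", DC, "description"),
    ("GBSeq_other-seqids/GBSeqid", RDFS, "seeAlso"),
    ("GBSeq_accession-version", DC, "hasVersion"),
    ("GBSeq_create-date", DC, "created"),
    ("GBSeq_primary-accession", GB, "accession"),
]

def genbank_to_rdf(gbset_xml):
    """Build an rdf:RDF element describing the first GBSeq record."""
    gbseq = ET.fromstring(gbset_xml).find("GBSeq")
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    for path, ns, name in FIELD_MAP:
        # findall copes with repeated elements such as GBSeqid.
        for node in gbseq.findall(path):
            ET.SubElement(desc, f"{{{ns}}}{name}").text = node.text
    return root

SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_create-date>18-JAN-1993</GBSeq_create-date>
    <GBSeq_definition>H.sapiens gene for antithrombin III</GBSeq_definition>
    <GBSeq_primary-accession>X68793</GBSeq_primary-accession>
    <GBSeq_accession-version>X68793.1</GBSeq_accession-version>
    <GBSeq_other-seqids>
      <GBSeqid>emb|X68793.1|HSAT3</GBSeqid>
      <GBSeqid>gi|28906</GBSeqid>
    </GBSeq_other-seqids>
  </GBSeq>
</GBSet>"""

# Register prefixes so the serialized output uses rdf:, rdfs:, dc: and gb:.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("dc", DC), ("gb", GB)):
    ET.register_namespace(prefix, uri)
print(ET.tostring(genbank_to_rdf(SAMPLE), encoding="unicode"))
```

One detail worth checking before reusing the mapping: DCMI defines `created` and `hasVersion` in the http://purl.org/dc/terms/ namespace rather than in http://purl.org/dc/elements/1.1/. The sketch keeps the names from the example output unchanged.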
The main intent of this transformation was to show that metadata can easily be extracted from existing file formats. This approach can be used to create local stores of metadata about Genbank files. While the mapping was relatively straightforward, a few issues were encountered.
The RDF specification requires that the rdf:about attribute be a URI reference. Currently there is no widely accepted specification for minting new URIs for the life sciences. The Life Sciences Identifier (LSID) proposal is one solution to this problem, but it is not in keeping with the current web architecture draft.
The Dublin Core metadata terms were used as the basis for mapping the Genbank data. This mapping needs to be made normative. For example, for dc:creator, what is the “entity primarily responsible for making the content of the resource” in the case of a Genbank file?
Only a small portion of the Genbank file was dumped as RDF/XML. It would be relatively easy to map the entire contents of the file. Would this be desirable?
Some questions: is the actual sequence data metadata? Is there a suitable RDF vocabulary for citation data?
In the case of dc:publisher, should the object of the predicate be a literal or a resource? If the answer is resource, does the URI “http://www.ncbi.nlm.nih.gov/” represent the NCBI or the NCBI's homepage?
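To make the literal-versus-resource question concrete, the following Python sketch builds both forms of dc:publisher side by side. The helper names `publisher_literal` and `publisher_resource` are hypothetical. The sketch only shows how the two options differ in RDF/XML and does not settle which one is appropriate.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def publisher_literal(desc, name):
    """Add dc:publisher with a plain-text (literal) object."""
    ET.SubElement(desc, f"{{{DC}}}publisher").text = name

def publisher_resource(desc, uri):
    """Add dc:publisher whose object is a resource identified by a URI."""
    ET.SubElement(desc, f"{{{DC}}}publisher", {f"{{{RDF}}}resource": uri})

desc = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
publisher_literal(desc, "National Center for Biotechnology Information")
publisher_resource(desc, "http://www.ncbi.nlm.nih.gov/")
print(ET.tostring(desc, encoding="unicode"))
```

The literal form is just a string with no identity of its own, while the resource form lets other statements be made about the same URI, which is exactly why it matters whether that URI denotes the organisation or its homepage.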
The stylesheet genbank2rdf.xsl is available for download.
The following Genbank file can be used as a source file: x68793-gb.xml.