In this module, David Wild describes how semantic technologies are used to integrate data from many different domains and datasets, enabling searches that cut across traditional silos and span cheminformatics, bioinformatics, and medicine. These technologies are thus an enabler of translational medicine. Some very recent advanced prediction and network data mining tools are also described.



Resources:

Background - semantic technologies


XML (extensible markup language)


XML is a technology specification that enables you to create highly structured documents. The ML in XML stands for Markup Language. A markup language is any language that takes raw text and adds annotations. What is special about XML? It focuses on document semantics. This means that you can identify specific document parts and assign them specific meaning.
For example, if you are representing biological sequence data, you can clearly identify which portion of the document contains sequence identifiers and cross-references to public databases, and which portion contains raw sequence data. These sections are clearly marked and organized in a hierarchical document structure. A human reader or a computer program can therefore easily traverse a complete document and extract individual pieces of data.

Here is an example of biological sequence data recorded in HTML and XML.

<html>
<body>
<h1>NM-171533</h1>
Organism: <b>Caenorhabditis elegans</b>
<p>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaaggg
aacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgca
aacatgtttattctg
</p>
<p>
[Full sequence has been omitted for brevity.]
</p>
</body>
</html>
                                 HTML Code
 
 
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Sequence>
<accession>NM-171533</accession>
<organism>Caenorhabditis elegans</organism>
<sequence_data>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaagggaacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgcaaacatgtttattc
[Full sequence has been omitted for brevity.]
</sequence_data>
</Sequence>
                                XML Code
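As a rough illustration of why this structure helps a program, the sketch below reads a shortened copy of the XML record above with Python's standard xml.etree.ElementTree module and pulls out each field by tag name. The variable names are illustrative, and the sequence is truncated to its first line.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML record above (sequence truncated for brevity).
record = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Sequence>
  <accession>NM-171533</accession>
  <organism>Caenorhabditis elegans</organism>
  <sequence_data>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaaggg
  </sequence_data>
</Sequence>"""

# Parse from bytes so the encoding declaration is honoured by the parser.
root = ET.fromstring(record.encode("utf-8"))

# Each piece of data is addressed by the name of the tag that wraps it.
accession = root.findtext("accession")
organism = root.findtext("organism")

# Sequence text may be wrapped over several lines; strip all whitespace.
sequence = "".join(root.findtext("sequence_data").split())

print(accession)
print(organism)
print(sequence)
```

Doing the same with the HTML version would mean guessing that the heading holds the accession and the bold text holds the organism, because the tags there describe appearance rather than meaning.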
So what's wrong with HTML?
  • One tag set for all applications
  • Predefined semantics for each tag
  • Predefined data structures
  • No formal validation
  • HTML is well suited to simple applications, but poorly suited to more demanding ones, such as large or complex collections of data, or data intended to drive scripts or Java applets.
What Does XML Provide?
  • Extensibility: Users can define new tags as needed.
  • Structure: Hierarchical data can be modeled at any level of complexity.
  • Validation: Data can be checked for structural correctness (DTD, XSD).
  • Media Independence: The same content can be published in multiple media.

The XML declaration on the first line of the XML example has the following parts:

1. <?xml ... ?>
<?xml marks the beginning of the XML declaration, and ?> marks its end.
2. version="1.0"
Gives the XML version, stating that this document follows the W3C XML 1.0 standard.
3. encoding="UTF-8"
Names the character encoding of the document, such as UTF-8, UTF-16, or GB2312. The default is UTF-8.
4. standalone="yes"
"yes" means the document does not depend on an external DTD (any DTD is included in the document itself); "no" means an external DTD may be referenced. The default is "no".
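The parts of the declaration can also be inspected programmatically. The sketch below uses the XmlDeclHandler callback of Python's standard xml.parsers.expat module; the helper name read_declaration is made up for this example.

```python
import xml.parsers.expat


def read_declaration(xml_bytes):
    """Return the version, encoding and standalone values of an XML declaration."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (not given).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk of input
    return found


declaration = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Sequence/>'
)
print(declaration)
```

If the standalone clause is left out, the handler receives -1, which corresponds to the default behaviour described above.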

Differences between an RDBMS and XML for storing data:

             RDBMS                      XML
Structure    Tables                     Hierarchical tree, graph
Schema       Fixed schema in advance    Flexible, "self-describing"
Queries      SQL (Simple..)             XPath, XQuery
Ordering     None                       Has inherent ordering

Let's see an example of how to model relational data in XML.
Students Table
id     name          age
111    Michael R.    21
112    John D.       20

Majors Table
id     major
111    Biology
112    Physics
112    Computer Science

Grades Table
id     course             grade
111    Math 101           B
111    Biology 101        B+
111    Statistics 101     A
112    Physics 101        A
112    Math 101           A
112    Programming 101    B+

The above tables can be represented in XML as follows:

<Students>
<student id="111">
<name>Michael R.</name>
<age>21</age>
<major>Biology</major>
<results>
<result course="Math 101" grade="B"/>
<result course="Biology 101" grade="B+"/>
<result course="Statistics 101" grade="A"/>
</results>
</student>
<student id="112">
<name>John D.</name>
<age>20</age>
<major>Physics</major>
<major>Computer Science</major>
<results>
<result course="Math 101" grade="A"/>
<result course="Physics 101" grade="A"/>
<result course="Programming 101" grade="B+"/>
</results>
</student>
</Students>
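The mapping from the three tables to the nested document can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is only one possible way to do the conversion; the function name tables_to_xml is hypothetical, and the rows are copied from the tables above. The last lines show the kind of path query (a small XPath subset supported by ElementTree) that replaces a multi-table join.

```python
import xml.etree.ElementTree as ET

# Rows copied from the Students, Majors and Grades tables above.
students = [(111, "Michael R.", 21), (112, "John D.", 20)]
majors = [(111, "Biology"), (112, "Physics"), (112, "Computer Science")]
grades = [
    (111, "Math 101", "B"),
    (111, "Biology 101", "B+"),
    (111, "Statistics 101", "A"),
    (112, "Physics 101", "A"),
    (112, "Math 101", "A"),
    (112, "Programming 101", "B+"),
]


def tables_to_xml(students, majors, grades):
    """Nest each student's majors and grades under a single <student> element."""
    root = ET.Element("Students")
    for student_id, name, age in students:
        student = ET.SubElement(root, "student", id=str(student_id))
        ET.SubElement(student, "name").text = name
        ET.SubElement(student, "age").text = str(age)
        # The shared id column plays the role of the foreign key.
        for major_id, major in majors:
            if major_id == student_id:
                ET.SubElement(student, "major").text = major
        results = ET.SubElement(student, "results")
        for grade_id, course, grade in grades:
            if grade_id == student_id:
                ET.SubElement(results, "result", course=course, grade=grade)
    return root


root = tables_to_xml(students, majors, grades)

# Majors of student 112, found by walking the hierarchy instead of joining tables.
majors_112 = [m.text for m in root.findall("./student[@id='112']/major")]

# Courses in which any student earned an A, in document order.
a_courses = [r.get("course") for r in root.findall(".//result[@grade='A']")]

print(majors_112)
print(a_courses)
```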

RDF, Semantic Web, and Semantic Databases


RDF (Resource Description Framework) is a model for representing data, and more specifically meaning, on the web. It differs from XML in that XML is a markup language for adding data tags to unstructured data, whereas RDF is a model for expressing data, entities, and their relationships. It is possible to express RDF in XML format (RDF/XML), although many other serializations are possible. For a discussion about the differences between XML and RDF see here. To a first approximation, XML is for data and RDF is for meaning.

R (Resource)
  • Can be anything
  • Must be uniquely identified and referenceable
  • Identified simply by a URI (Uniform Resource Identifier)
D (Description)
  • Description of resources
  • Represents properties of resources and relationships among them
  • Relationships can be represented as graphs
F (Framework)
  • A combination of web-based protocols (URI, HTTP, XML...)
  • Based on formal models (semantics)
  • Defines all allowed relationships among resources

At the heart of RDF is the assertion, a subject-predicate-object relationship. For example, some facts about Bloomington could be expressed very simply in RDF text format as:
Bloomington is_a City
Bloomington has_population 81381
 
We can then have assertions that relate to these, for instance
David_Wild lives_in Bloomington
Each of these assertions is called an RDF triple (subject-predicate-object). But note that there is a lot of ambiguity here: the data is still too unstructured to be useful. How do we define a city (versus a town, etc.)? What if we really meant Bloomington, Illinois, not Bloomington, Indiana, in some of these? What if there are two David Wilds?

This relates strongly to our problem in relational databases of needing a primary key to uniquely identify each entity/tuple. In the web world, we also need to uniquely identify each entity, and we do that using Uniform Resource Identifiers (URIs). URIs usually look a lot like URLs, but they need not map to an actual web address. Let's say I own the domain uri123.com. I could then use URIs to uniquely identify each of the entities, and use these in the RDF. A URI may "point" to a description of the resource. So we may define the following:
http://uri123.com/popcenters/City
http://uri123.com/popcenters/Town
http://uri123.com/states/Indiana
http://uri123.com/popcenters/has_population
http://uri123.com/popcenters/lives_in
http://uri123.com/cities/Indiana/Bloomington
http://uri123.com/people/David_Wild
So we can then make an unambiguous statement (within a particular namespace) such as:
http://uri123.com/people/David_Wild   http://uri123.com/popcenters/lives_in   http://uri123.com/cities/Indiana/Bloomington
To make this less verbose, we can specify a default RDF namespace. For more on this, follow through the RDF-Primer example.
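The idea of shortening full URIs with a namespace can be sketched in plain Python. The prefix table and the expand helper below are hypothetical and exist only for this illustration; the URIs are the uri123.com examples from the text.

```python
# Hypothetical prefix table built from the example uri123.com URIs above.
PREFIXES = {
    "pop": "http://uri123.com/popcenters/",
    "city": "http://uri123.com/cities/Indiana/",
    "person": "http://uri123.com/people/",
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'pop:lives_in' into a full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


# Each triple is a (subject, predicate, object) tuple of unambiguous identifiers.
triples = [
    (expand("person:David_Wild"), expand("pop:lives_in"), expand("city:Bloomington")),
    (expand("city:Bloomington"), expand("pop:has_population"), 81381),
]

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Because URIs are compared character by character, spelling and capitalisation must match exactly for two statements to refer to the same entity.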
Let's take an example.
David Wild has the email address djwild@indiana.edu and writes a blog at http://allhazards.blogspot.com. Let's see how this looks as a graph in the RDF Validator.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:pers="http://cheminfov.informatics.indiana/Personal#">
<rdf:Description rdf:about="http://cheminfov.informatics.indiana/DavidWild">
<pers:hasEmail rdf:resource="djwild@indiana.edu" />
</rdf:Description>
<rdf:Description rdf:about="http://cheminfov.informatics.indiana/DavidWild">
<pers:writesBlog rdf:resource="http://allhazards.blogspot.com"/>
</rdf:Description>
</rdf:RDF>
 
or
 
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:pers="http://cheminfov.informatics.indiana/Personal#">
<rdf:Description rdf:about="http://cheminfov.informatics.indiana/DavidWild">
<pers:hasEmail rdf:resource="djwild@indiana.edu" />
<pers:writesBlog rdf:resource="http://allhazards.blogspot.com/"/>
</rdf:Description>
</rdf:RDF>
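To check that the two RDF/XML forms above say the same thing, one can reduce each to its set of triples. The sketch below does this with Python's standard xml.etree.ElementTree module. It is a deliberately limited reader that handles only flat rdf:Description blocks with rdf:resource values, not a full RDF/XML parser, and the function name simple_triples is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def simple_triples(rdf_xml):
    """Extract (subject, predicate, object) triples from flat rdf:Description blocks."""
    triples = set()
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores tags as "{namespace}local"; joining the two
            # parts gives the predicate URI.
            namespace, local_name = prop.tag[1:].split("}")
            obj = prop.get(f"{{{RDF_NS}}}resource")
            triples.add((subject, namespace + local_name, obj))
    return triples


HEADER = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:pers="http://cheminfov.informatics.indiana/Personal#">'
)
ABOUT = 'rdf:about="http://cheminfov.informatics.indiana/DavidWild"'

# First form: one rdf:Description per statement.
split_form = HEADER + f"""
<rdf:Description {ABOUT}>
  <pers:hasEmail rdf:resource="djwild@indiana.edu"/>
</rdf:Description>
<rdf:Description {ABOUT}>
  <pers:writesBlog rdf:resource="http://allhazards.blogspot.com"/>
</rdf:Description>
</rdf:RDF>"""

# Second form: both statements grouped under one rdf:Description.
merged_form = HEADER + f"""
<rdf:Description {ABOUT}>
  <pers:hasEmail rdf:resource="djwild@indiana.edu"/>
  <pers:writesBlog rdf:resource="http://allhazards.blogspot.com"/>
</rdf:Description>
</rdf:RDF>"""

same_graph = simple_triples(split_form) == simple_triples(merged_form)
print(same_graph)
```

Both forms reduce to the same two triples about the same subject, which is why either can be used.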
 

Making RDF Useful


The "Semantic Web" is really, from a technical perspective, RDF, plus three other technologies that make RDF really useful:

Triple Stores, for storing databases of RDF (equivalent of an RDBMS)
SPARQL, for searching RDF (equivalent of SQL)
Ontologies (in OWL), for describing and mapping RDF data
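To make the first two items concrete, here is a toy in-memory triple store in Python with a pattern-matching method that behaves loosely like a single SPARQL triple pattern. It is only a sketch: the class and method names are invented for this example, and real triple stores add indexing, persistence, and a full query language.

```python
class TripleStore:
    """A toy in-memory triple store (illustrative only)."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples; None acts like an unbound query variable."""
        return sorted(
            t
            for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
store.add("Bloomington", "is_a", "City")
store.add("Bloomington", "has_population", "81381")
store.add("David_Wild", "lives_in", "Bloomington")

# "Everything we know about Bloomington", i.e. Bloomington ?p ?o
about_bloomington = store.match(subject="Bloomington")

# "Who lives in Bloomington?", i.e. ?s lives_in Bloomington
residents = [s for s, _, _ in store.match(predicate="lives_in", obj="Bloomington")]

print(about_bloomington)
print(residents)
```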

How Semantic Databases relate to Relational Databases


There are two main differences between relational databases and triple stores:

First, semantic databases separate the data from the structure of the data, whilst relational databases tightly couple the two. This makes it easy to add new cross-silo structure in semantics, and also to develop tools and algorithms that are not tied to the structure of a particular silo. For instance, you can merge together several datasets and easily map dataset-level descriptions to higher-level ontologies (e.g. "this is an Amazon book; this is a Google Book; they are both books"). We can then issue intuitive queries in one statement that are not dataset dependent, e.g. "find me all of the books written by J.K. Rowling".

Second, a semantic database is a network database, meaning that all the RDF triples, in aggregate, form a (usually huge) network, or graph, of nodes (subjects and objects) and edges (predicates). This is a hugely important property, as it enables us to do all kinds of interesting searching and prediction on the graph (for example, shortest path, subgraph isomorphism, and so on).
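As a small illustration of the graph view, the sketch below turns a handful of triples into an adjacency list and finds a shortest path with breadth-first search. The entity and predicate names are made up for this example and do not come from any real dataset.

```python
from collections import deque

# Illustrative triples only; the names are invented for this sketch.
triples = [
    ("DrugA", "binds", "ProteinX"),
    ("ProteinX", "part_of", "PathwayY"),
    ("PathwayY", "implicated_in", "DiseaseZ"),
    ("DrugA", "has_side_effect", "Nausea"),
]


def build_graph(triples):
    """Treat subjects and objects as nodes and each predicate as an undirected edge."""
    graph = {}
    for subject, predicate, obj in triples:
        graph.setdefault(subject, []).append((predicate, obj))
        graph.setdefault(obj, []).append((predicate, subject))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes on one shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _predicate, neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(build_graph(triples), "DrugA", "DiseaseZ")
print(path)
```

A path like this, linking a drug to a disease through intermediate entities, is the kind of connection that network data mining tools look for across merged datasets.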

See article: Will triple stores replace relational databases?

Searching with SPARQL

Let's look at the power of SPARQL.
Suppose I want to find the actors in the movies in which two popular male actors, Arnold Schwarzenegger and Sylvester Stallone, worked together.
Try googling it? Not so easy. Now let's see what the Semantic Web offers.

Go to the SPARQL query editor and run the search with the query given below.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
 
SELECT * WHERE {
?film dbpedia2:starring :Arnold_Schwarzenegger.
?film dbpedia2:starring :Sylvester_Stallone.
?film dbpedia2:starring ?actors.
}
 
ORDER by ?film
Now, what do you get?

Welcome to the Semantic Web.
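The logic of the three-line graph pattern in the query can be mimicked in plain Python. The sketch below uses a small hand-written sample of "starring" triples (not DBpedia output) and performs the same join: find films that star both actors, then list every actor in those films.

```python
# Small hand-written sample of (film, predicate, actor) triples; not DBpedia output.
triples = [
    ("Escape_Plan", "starring", "Sylvester_Stallone"),
    ("Escape_Plan", "starring", "Arnold_Schwarzenegger"),
    ("Escape_Plan", "starring", "Jim_Caviezel"),
    ("Rocky", "starring", "Sylvester_Stallone"),
    ("Rocky", "starring", "Talia_Shire"),
    ("The_Terminator", "starring", "Arnold_Schwarzenegger"),
    ("The_Terminator", "starring", "Linda_Hamilton"),
]


def films_starring(actor):
    """Films matching the pattern: ?film starring <actor>."""
    return {
        film
        for film, predicate, star in triples
        if predicate == "starring" and star == actor
    }


# The first two patterns share ?film, so their matches are intersected.
shared_films = films_starring("Arnold_Schwarzenegger") & films_starring("Sylvester_Stallone")

# The third pattern binds ?actors for each shared film; sorting mirrors ORDER BY ?film.
rows = sorted(
    (film, star)
    for film, predicate, star in triples
    if predicate == "starring" and film in shared_films
)

for film, star in rows:
    print(film, star)
```

As in the SPARQL query, the two named actors themselves appear in the results, because they also match the final pattern.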

We will work through a couple of DBpedia examples taken from the W3C site, using the DBpedia SPARQL endpoint.

Find 50 example concepts in DBPedia
SELECT DISTINCT ?concept
WHERE {
    ?s a ?concept .
} LIMIT 50
Find all landlocked countries with a population greater than 15 million
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    FILTER (?population > 15000000) .
}
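Queries like these can also be sent to an endpoint from a program. The sketch below builds a standard SPARQL protocol GET request with Python's urllib. The endpoint URL https://dbpedia.org/sparql is an assumption about where the DBpedia service lives, the function names are hypothetical, and run_query needs network access, so it is defined here but not called.

```python
import json
import urllib.parse
import urllib.request

# Assumed URL of DBpedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
SELECT DISTINCT ?concept
WHERE {
    ?s a ?concept .
} LIMIT 50
"""


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query):
    """Send the query and return the list of result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        results = json.load(response)
    return results["results"]["bindings"]


request = build_request(DBPEDIA_ENDPOINT, QUERY)
print(request.full_url)
```

Calling run_query(DBPEDIA_ENDPOINT, QUERY) would return one dictionary per result row, keyed by variable name, provided the endpoint is reachable and returns the standard JSON results format.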
Let's try some other examples using IU's own Chem2Bio2RDF SPARQL Endpoint. Note the use of the OWL ontology.

What are the side effects of the diabetes drug Troglitazone?
PREFIX c2b2r: <http://chem2bio2rdf.org/chem2bio2rdf.owl#>
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
 
SELECT *
FROM <http://chem2bio2rdf.org/owl#>
WHERE
{
?chemical rdfs:label "Troglitazone"^^xsd:string ;
          c2b2r:causeSideEffect [bp:name ?side_effect] .
}
What are the diseases that can be treated by Troglitazone?
PREFIX c2b2r: <http://chem2bio2rdf.org/chem2bio2rdf.owl#>
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
 
SELECT *
FROM <http://chem2bio2rdf.org/owl#>
WHERE
{
?chemical rdfs:label "Troglitazone"^^xsd:string ;
          c2b2r:treatDisease [bp:name ?disease] .
}
What drugs interact with Troglitazone, and what are their effects?
PREFIX c2b2r: <http://chem2bio2rdf.org/chem2bio2rdf.owl#>
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
 
SELECT *
FROM <http://chem2bio2rdf.org/owl#>
WHERE
{
?chemical rdfs:label "Troglitazone"^^xsd:string ;
          c2b2r:hasDrugDrugInteraction [c2b2r:hasPart [bp:name ?name];
                                        c2b2r:description ?description] .
FILTER (str(?name)!="Troglitazone") .
}

Here is Tim Berners-Lee on the next generation of the Web.