# 9.4: Principal Components

The theoretical details are quite involved, and relevant papers span hundreds of pages. In essence, there are two principal aspects to it:

1. Ontology-Based Data Access systems (static components):
• An ontology language, for representing the ontology
• A mapping language, for declaring the mappings between vocabulary in the ontology and the data
• The data

2. Query answering in Ontology-Based Data Access systems:

• Reasoning over the TBox
• Query rewriting
• Query unfolding
• Relational database technology

The mapping language and the items under 2 are new notions compared Blocks I and II and a regular database course, which will be described in the following subsections.

## The Ontology Language

For the language, we remain, roughly, within the Semantic Web setting, and take a closer look at the OWL 2 QL profile and similar languages in the “DL-lite” family that is at the basis of OWL 2 QL, and $$DL-Lite_{\mathcal{A}}$$ in particular [CGL+07]. Most significantly, the trade-off between expressive power and computational complexity of the reasoning services leans strongly towards the scalability of reasoning (including query answering) over large amounts of data; or: one can’t say a lot with either OWL 2 QL or $$DL-Lite_{\mathcal{A}}$$.

### Syntax of $$DL-Lite_{\mathcal{A}}$$

Syntax of $$DL-Lite_{\mathcal{A}}$$. As is common in DLs and OWL, we distinguish between (abstract) objects and (data) values. A class expression denotes a set of objects, a datatype denotes a set of values, an object property denotes a binary relationship between objects, and a data property denotes a binary relation between objects and values. We assume to have a set $$\{T_{1}, . . . , T_{n}\}$$ of pairwise disjoint and unbounded datatypes, each denoting a set $$val(T_{i})$$ of values (integers, strings, etc.), and $$\top_{d}$$ denotes the set of all values. Class expressions, $$C$$, and object property expressions, $$R$$, are formed according to the following syntax, where $$A$$ denotes a class, $$P$$ an object property, and $$U$$ a data property:

$$C \longrightarrow A | \exists R| \delta (U),$$ $$R \longrightarrow P | P^{−}.$$

Observe that $$\exists R$$ denotes an unqualified existential class expression, $$\delta (U)$$ denotes the domain of $$U$$, and $$P^{−}$$ denotes the inverse of $$P$$.

A $$DL-Lite_{\mathcal{A}}$$ ontology $$\mathcal{O} = \langle \mathcal{T , A}\rangle$$, consists of a TBox $$\mathcal{T}$$ and an ABox $$\mathcal{A}$$, where the TBox is constituted by a set of axioms of the form

$$C_{1}\sqsubseteq C_{2}$$, $$\rho (U)\sqsubseteq T_{i}$$ , $$R_{1}\sqsubseteq R_{2}$$, $$U_{1}\sqsubseteq U_{2}$$,

$$(\texttt{disj} C_{1} C_{2})$$, $$(\texttt{disj} R_{1} R_{2})$$ , $$(\texttt{disj} U_{1} U_{2})$$,

$$(\texttt{funct} R)$$, $$(\texttt{funct} U)$$.

The axioms in the first row denote inclusions, with $$\rho (U)$$ the range of $$U$$. The axioms in the second row denote disjointness; note that distinct datatypes are assumed to be disjoint. The axioms in the third row denote functionality (at most one) of an object property expression and of a data property expression, respectively. The ABox is constituted by a set of assertions of the form $$A(a), P(a, a^{'})$$, and $$U(a, \mathcal{l})$$, where $$a, a^{'}$$ are individuals denoting objects and $$\mathcal{l}$$ is a literal (denoting a value). To ensure that $$DL-Lite_{\mathcal{A}}$$ maintains the computationally wellbehaved computational properties of the $$DL-Lite$$ family [CGL+07], the form of the TBox has to be restricted (as we have seen for OWL 2 QL in Chapter 3). In particular, object and data properties occurring in functionality assertions cannot be specialized, i.e., the cannot appear in the right hand side of an inclusion axiom.

### Semantics of $$DL-Lite_{\mathcal{A}}$$

Semantics of $$DL-Lite_{\mathcal{A}}$$. The semantics of $$DL-Lite_{\mathcal{A}}$$ is based on first-order interpretations $$\mathcal{I} = (\Delta^{\mathcal{I}} , \cdot^{\mathcal{I}})$$, where $$\Delta^{\mathcal{I}}$$ is a nonempty interpretation domain, which is partitioned into a $$\Delta^{\mathcal{I}}_{O}$$ of objects, and a $$\Delta^{\mathcal{I}}_{V}$$ of values. The interpretation function $$\cdot^{\mathcal{I}}$$ maps each individual $$a$$ to $$a^{\mathcal{I}} \in \Delta^{\mathcal{I}}_{O}$$, each class $$A$$ to $$A^{\mathcal{I}}\subseteq\Delta^{\mathcal{I}}_{O}$$, each object property $$P$$ to $$P^{\mathcal{I}}\subseteq\Delta^{\mathcal{I}}_{O}\times\Delta^{\mathcal{I}}_{O}$$, and each data property $$U$$ to $$U^{\mathcal{I}}\subseteq\Delta^{\mathcal{I}}_{O}\times\Delta^{\mathcal{I}}_{V}$$ , whereas each literal $$\mathcal{l}$$ is interpreted as the value $$\mathcal{l}^{\mathcal{I}} = val(\mathcal{l})$$, each datatype $$T_{i}$$ as the set of values $$T^{\mathcal{I}}_{i} = val(T_{i})$$, and $$\top^{\mathcal{I}}_{d} = \Delta^{\mathcal{I}}_{V}$$ . The semantics of expressions:

$$(\exists R)^{\mathcal{I}} = \{o | \exists o^{'}. (o, o^{'})\in R^{\mathcal{I}}\}$$, $$(P ^{−})^{\mathcal{I}} = \{(o, o^{'} ) | (o^{'} , o)\in P^{\mathcal{I}} \}$$,

$$(\delta (U))^{\mathcal{I}} = \{o | \exists v. (o, v)\in U^{\mathcal{I}} \}$$, $$(\rho (U))^{\mathcal{I}} = \{v | \exists o. (o, v)\in U^{\mathcal{I}} \}$$.

Contrary to OWL, $$DL-Lite_{\mathcal{A}}$$ (and its implementation) adopts the unique name assumption, meaning that for every interpretation $$\mathcal{I}$$ and distinct individuals or values $$c_{1}, c_{2}$$, we have that $$c^{\mathcal{I}}_{1}\neq c^{\mathcal{I}}_{2}$$ (which is the norm in the database setting).

As for other DLs and OWL species, $$\mathcal{I}$$ satisfies $$\alpha_{1}\sqsubseteq\alpha_{2}$$ if $$\alpha^{\mathcal{I}}_{1}\subseteq\alpha^{\mathcal{I}}_{2}$$ , it satisfies $$(\texttt{disj} \alpha_{1}\alpha_{2})$$ if $$\alpha^{\mathcal{I}}_{1}\cap\alpha^{\mathcal{I}}_{2} = ∅$$, and it satisfies $$(\texttt{funct} S)$$ if $$S^{\mathcal{I}}$$ is a function (that is, if $$(o, z_{1})\in S^{\mathcal{I}}$$ and $$(o, z_{2})\in S^{\mathcal{I}}$$ , then $$z_{1} = z_{2}$$)). $$\mathcal{I}$$ satisfies $$A(a)$$ if $$a^{\mathcal{I}}\in A^{\mathcal{I}}$$ , it satisfies $$P(a, a^{'} )$$ if $$(a^{\mathcal{I}} , a^{' \mathcal{I}})\in P^{\mathcal{I}}$$ , and it satisfies $$U(a, \mathcal{l})$$ if $$(a^{\mathcal{I}} , val(\mathcal{l}))\in U^{\mathcal{I}}$$ .

## Mappings

Here, a few definitions and an example is included; the chapter literature contains further technical details and more examples of mappings.

Definition $$\PageIndex{1}$$

(Mapping assertion between a database and a TBox). A mapping assertion between a database $$\mathcal{D}$$ and a TBox $$\mathcal{T}$$ has the form $$\Phi\rightsquigarrow\Psi$$ where

• $$\Phi$$ is an arbitrary SQL query of arity $$n > 0$$ over $$\mathcal{D}$$;
• $$\Psi$$ is a conjunctive query over $$\mathcal{T}$$ of arity $$n^{'} > 0$$ without non-distinguished variables, possibly involving variable terms.

Definition $$\PageIndex{2}$$

(Mapping assertion in $$\mathcal{M}$$ in an OBDA system). A mapping assertion between a database $$\mathcal{D}$$ and a TBox $$\mathcal{T}$$ in $$\mathcal{M}$$ has the form $$\Phi (\overrightarrow{x} )\rightsquigarrow \Psi(\overrightarrow{t}, \overrightarrow{y})$$ where

• $$\Phi$$ is an arbitrary SQL query of arity $$n > 0$$ over $$\mathcal{D}$$;
• $$\Psi$$ is a conjunctive query over $$\mathcal{T}$$ of arity $$n^{'} > 0$$ without non-distinguished variables;
• $$\overrightarrow{x}, \overrightarrow{y}$$ are variables with $$\overrightarrow{y}\subseteq\overrightarrow{x}$$;
• $$\overrightarrow{t}$$ are variable terms of the form $$f(\overrightarrow{z})$$, with $$f\in\Lambda$$ and $$\overrightarrow{z}\subseteq\overrightarrow{x}$$.

Concerning the the semantics of mappings, intuitively: $$\mathcal{I}$$ satisfies $$\Phi\rightsquigarrow\Psi$$ with respect to $$\mathcal{D}$$ if all facts obtained by evaluating $$\Phi$$ over $$\mathcal{D}$$ and then propagating answers to $$\Psi$$, hold in $$\mathcal{I}$$.

Definition $$\PageIndex{3}$$

(Satisfaction of a mapping assertion with respect to a database). An interpretation $$\mathcal{I}$$ satisfies a mapping assertion $$\Phi (\overrightarrow{x})\Psi (\overrightarrow{t},\overrightarrow{y})$$ in $$\mathcal{M}$$ with respect to a database $$\mathcal{D}$$, if for each tuple of values $$\overrightarrow{v}\in Eval(\Phi,\mathcal{D})$$, and for each ground atom in $$\Psi [\,\overrightarrow{x}/ \overrightarrow{v}]\,$$ we have that:

• If the ground atom is $$A(s)$$, then $$s^{\mathcal{I}}\in A^{\mathcal{I}}$$ ;
• If the ground atom is $$P(s_{1}, s_{2})$$, then $$(s^{\mathcal{I}}_{1} , s^{\mathcal{I}}_{2} )\in P^{\mathcal{I}}$$

Note $$\PageIndex{1}$$

($$Eval(\Phi, \mathcal{D})$$ denotes the result of evaluating $$\Phi$$ over $$\mathcal{D}$$, $$\Psi [\, \overrightarrow{x} / \overrightarrow{v} ]\,$$ denotes $$\Psi$$ where each $$x_{i}$$ is substituted with $$v_{i}$$

An example is shown in Figure 8.4.1 with the OBDA plugin for Protégé. There is an ontology that happens to have a class PromiscuousBacterium, among other things, and a relational database (HGT) with several tables, such as organisme and flexcount. Now we have to link the two with a mapping, which means (i) constructing a database query such that it retrieves only the promiscuous bacteria, and (ii) solving the ‘impedance mismatch’ (recollect Chapter 7) with a functor so that the values returned by the database query become objects in the ontology, which is what getPromBact does. Informally, the functor can be considered as a URI building mechanism for individuals in the ontology taken from the database (theoretically, they are skolem functions).

Recall the outline of the ADOLENA ontology from the exercise and Figure 7.7.7. A query could be “retrieve the devices that ameliorate paraplegia”

q(x) :- Device(x), ameliorates(x,y), Paraplegia(y)

For this to work, we have to introduce three aspects. First, we need a computer-processable serialization of the query, a notion of what kind of queries we can pose, and a way how it will do the answering. The query language is SPARQL (see footnote 4). The kind of queries are (unions of) conjunctive queries. A conjunctive query (CQ) $$q$$ over an ontology $$\mathcal{O}$$ is an expression of the form $$q(\overrightarrow{x})\leftarrow\exists\overrightarrow{y} .conj(\overrightarrow{x}, \overrightarrow{y})$$, where $$q(\overrightarrow{x})$$ the head, $$conj(\overrightarrow{x},\overrightarrow{y})$$ the body, the variables in $$\overrightarrow{x}$$ are distinguished variables and $$\overrightarrow{y}$$ the non-distinguished variables, and where $$conj(\overrightarrow{x},\overrightarrow{y})$$ is a conjunction

Figure 8.4.1: An example of a mapping; see text for explanation. (Source: based on [Kee10c])

of atoms of the form $$D(z), S(z, z^{'} ), z = z^{'}$$ , where $$D$$ denotes a class or a datatype, $$S$$ an object property or data property in $$\mathcal{O}$$, and $$z, z^{'}$$ are individuals or literals in $$\mathcal{O}$$ or variables in $$\overrightarrow{x}$$ or $$\overrightarrow{y}$$. Given an interpretation $$\mathcal{I} = (\Delta^{\mathcal{I}} ,\cdot^{\mathcal{I}})$$, then $$q^{\mathcal{I}}$$ is the set of tuples of $$\Delta^{\mathcal{I}}$$ that, when assigned to the variables $$\overrightarrow{x}$$, make the formula $$\exists\overrightarrow{y} .conj(\overrightarrow{x},\overrightarrow{y})$$ true in $$\mathcal{I}$$. The set $$cert(q, \mathcal{O})$$—certain answers to $$q$$ over $$\mathcal{O}$$—is the set of tuples $$\overrightarrow{a}$$ of individuals or literals appearing in $$\mathcal{O}$$ such that $$\overrightarrow{a} ^{\mathcal{I}} \in q^{\mathcal{I}}$$ , for every model $$\mathcal{I}$$ of $$\mathcal{O}$$.

Regarding query answering, consider Figure 8.4.2, where $$q$$ is our query, $$\mathcal{T}$$ the TBox (ontology or formal conceptual data model), and $$\mathcal{A}$$ the ABox (our instances, practically stored in the relational database). Somehow, this combines to produce the answers, using all components. First, there is a “reformulation” (or: rewriting) step, which computes the perfect reformulation (rewriting), $$q_{pr}$$, of the original query $$q$$ using the inclusion assertions of $$\mathcal{T}$$ so that we have a union of conjunctive queries. That is, it uses the knowledge of the ontology to come up with the ‘real’ query. For instance, recollect Figure 7.7.7 about the ontology of abilities and disabilities, then a query “retrieve all Devices that assistWith UpperLimbMobility”, i.e.,

q(x) :- Device(x), assistsWith(x,y), UpperLimbMobility(y)

or, in SPARQL notation:

SELECT $device  WHERE {$device rdf:type :Device.

$device :assistsWith$y.

\$y rdf:type :UpperLimbMobility}

will traverse the hierarchy of devices until it finds those devices that have an object property declared with as range UpperLimbMobility. In this case, this is MotorizedWheelchair, hence, the ‘real’ query concerns only the retrieval of the motorized wheelchairs, not first retrieving all devices and then making the selection.

Second, the “unfolding” step computes a new query, $$q_{unf}$$ , by using the (split version of) the mappings that link the terms of the ontology to queries over the database.

Third, the “evaluation” step delegates the evaluation of $$q_{unf}$$ to the relational DBMS managing the data, which subsequently returns the answer.

Figure: 8.4.2: Intuition of the top-down approach to query answering.

## Examples

Early trials and tribulations of trying to set up an OBDA system (and of which I know the details) are described in [KAGC08] and [CKN+10], which provided useful learning moments [CCKE+17]. The ontology in [KAGC08] practically was developed separately from the existing database, which caused some mismatches; or: the content in the ontology and database really should be compatible (as an aside: the query answering examples in the previous section were based on that system). This was much better aligned in the experiment described in [CKN+10]: the ontology was a reverse engineered physical (SQL) schema and dressed up with neater modeling, having following the steps described in Section 7.1: Relational Databases and Related ‘legacy’ KR. A downside with the knowledge of this particular scenario was that there was a considerable hierarchy, so the system ended up creating very big conjunctive queries, which slowed down the database side in actually answering the SQL query (albeit still faster than the original database). This has been optimized in the meantime, among other things [CCKE+17].

More recent examples from industry and with the ‘next generation’ OBDA, i.e., the extended Ontop setting, can be found in, e.g., [KHS+17], describing industry’s motivations, mappings, a rather large SQL query, and streaming (i.e., temporal) data, and there are several references in [CCKE+17] that describe applications and usage of Ontop. It is, perhaps, also worth mentioning that OBDA has seen many people involved since the early days in 2005, and by now has been shown to be robust, albeit not a trivial matter to understand and set up from a system’s development viewpoint. You can practice with this in the Chapter’s exercises, as all the code and software is freely available online.