The starting position for leveraging the knowledge encoded in a relational database to develop an ontology could be its conceptual data model. However, despite academics’ best efforts to teach good design and maintenance methodologies in a degree program, it is not uncommon in organizations that if there was a conceptual model for the database at all, it is outdated by the time you would want to use it for ontology development. New columns and tables may have been added in the database, constraints removed, tables joined (further denormalized) for better performance or vice versa for cleaner data, and so on, and no-one may have bothered to go back to the original conceptual, or even relational, model and update it with the changes made. Practically, there likely will be a database with multiple tables that have many (15-50) columns. This is represented at the bottom of Figure 7.1.2.
Figure 7.1.2: Denormalized relational database (bottom), where each table is reverse engineered into an entity in a ‘flat’ EER diagram (middle), and subsequently refined with respect to the hidden entity types and annotations, such as the Environment ontology (top), which then finally can be transformed/translated into an ontology.
If one were to simply convert that SQL schema into an OWL ontology, the outcome would be a bunch of classes with many data properties and a unnamed object property between a subset of the classes based on the foreign key constraints. This won’t do as an ontology. Let us have a look at the additional steps.
Reverse Engineering the Database
There are several reverse engineering tools for SQL schemas of relational databases, where a first pass results in one of the possible logical models (i.e., the relational model for an RDBMSs), and another iteration brings one up to the conceptual data model (such as ER, ORM) [HCTJ93]. Such a first draft version of the EER model is depicted in EER bubble notation in Figure 7.1.2, where each table (relation) has become an entity type and each column an attribute. The main problematic consequence for reverse engineering the conceptual data model to feed into an OWL ontology is that the database structure has been ‘flattened out’, which, if simply reverse engineered, ends up in the ‘ontology’ as a class with umpteen attributes with which one can do minimal (if at all) automated reasoning (see the first diagram above the table in Figure 7.1.2).
To avoid this, should one perform some normalization steps to try to get some structure back into the conceptual view of the data alike in the diagram at the top in Figure 7.1.2, and if so, how? Whether done manually or automatically, it can be cleaned up, and original entity types (re-)introduced, relationships added, and the attributes separated accordingly, thereby making some knowledge implicit in the database schema explicit, which is depicted in the top-half of Figure 7.1.2. A tried and tested strategy to do this semi-automatically is by discovering functional dependencies in the data stored in the database tables. Such reverse engineering opens up other opportunities, for one could use such a procedure to also establish some mechanism to keep a ‘link’ between the terms in the ontology and the source in the database so that the ontology can be used to enhance data analysis through conceptual model or ontology-based querying. A particular algorithm up to obtaining a DL-formalized conceptual data model based on a fully normalized database can be found in, e.g., [LT09]. Most of the reverse engineering achievements up to conceptual models were obtained in the 1990s.
Figure 7.1.2 may give the impression that it is easy to do, but it is not. Difficulties have to do with the formal apparatus of the representation languages1, and the static linking between the layers and the procedures—conveniently depicted with the three arrows—hide the real complexity of the algorithms. Reverse engineering is not simply running the forward algorithm backwards, but has a heuristics component to second-guess what the developers’ design decisions may have been along the stages toward implementation and may have a machine learning algorithm to find constraints among instances. Most solutions to date set aside data duplication, violations of integrity constraints, hacks, outdated imports from other databases and assume to have a well-designed relational database in at least 3NF or BCNF, and, thus, the results are imperfect.
In addition to this procedure, one has to analyze the data stored in the database on its exact meaning. In particular, one may come across data in the database that are actually assumed to be concepts/universals/classes, whereas others represent real instances (typically, a tuple represents an instance). For instance, a Content Management System, such as Joomla, requires the content provider to store a document under a certain category that is considered a class by its user, which, however, is stored in a cell of a row in the back-end database, hence, mathematically an instance in the software. Somehow, we need to find that and extract it for use in the ontology in a way that they will become classes. Another typical case is where a structured controlled vocabulary, such as the Gene Ontology we have seen in Section 1.3: What is the Usefulness of an Ontology?, has been used in the database for annotation. This is depicted on the right-hand side with
Env:444 and so on. Knowing this, one can reverse engineer that section of the database into a taxonomy in the conceptual data model (shown in Figure 7.1.2 in the top figure on the right-hand side). Finally, there is a so-called ‘impedance mismatch’ between database values and ABox objects, but this is relevant mainly for ontology-based data access (see Chapter 8). Thus, we end up having to process the case that some, or all, data where the values are actually concepts, should become OWL classes and values that should become OWL individuals.
Enhancing and Converting the Conceptual Model
Having completed all the reverse engineering and data analysis to obtain the conceptual data model, one can commence with the ontological analysis. For instance, whilst improving the conceptual data model, one could add a section of another ontology for use or interoperability, alike the GO, improve on the naming and meaning of the relationships as perhaps some of them have the same meaning as those in a foundational ontology, add constraints (notably: disjointness), and so forth. Subsequently, it will have to be converted to a suitable logic.
There are several tools that convert a conceptual model, especially UML Class Diagrams, into OWL, but they have only partial coverage and its algorithms are unclear; for instance, on how one should transform ternaries and what to do with the attributes (recall Section 6.1: Foundational Ontologies "Typical Content of a Foundational Ontology"). In addition, they work only with a subset of UML diagrams due to the differences in UML tool implementations (which is due to ambiguity emanating from the OMG standard and differences across versions); hence, a careful post-transformation analysis will have to be carried out.
One also could switch these steps by first converting a schema to OWL and then perform the ontological analysis.
Other Languages and OWL
Imperfect transformations from other languages, such as the common OBO format [GH07, HOD+10] and a pure frames-based approach [ZBG06], are available, which also describe the challenges to create them.
OBO is a Directed Acyclic Graph mainly for classes and a few relationships (mainly
is a and
part of), which relatively easily can be mapped into OWL, and the extras (a.o., date, saved by, remark) could go in OWL’s annotations. There are a few mismatches and ‘work-arounds’, such as the
inverse-necessary, and a non-mappable antisymmetry (cannot be represented in OWL). As a result, there are several OBO-in-OWL mappings, of which some are more comprehensive than others. The latest/official mapping available from oboformat.org (superseding the earlier mapping by [GH07]), which is also implemented in the OWL API. Most OBO ontologies now also have an OWL version (consult OBO Foundry, BioPortal), but keep both, for each has their advantages (at present). There is one salient difference between OWL and OBO ontologies—more precisely: the approach to modelling—which also affects multilingual ontologies (Section 9.1: Toward Multilingual Ontologies), and how an OBO ontology in Protégé is displayed. In OWL, you typically give a class a human readable name, whereas in OBO, a class is assigned an identifier and the name is associated to that with a label (OBO people who moved to OWL maintain that practice, so numbers as class names do not imply it was natively an OBO ontology). Newer versions of ontology editors lets the user choose how to render the ontology in the interface, by name or by label. If you find an ontology online and the class names are something alike
IAO12345, then it was likely an OBO ontology converted into OWL, and you’ll have to change the view so that it will show the labels instead of those meaningless numbers.
While OBO and the older frames-based Protégé do serve a user base, their overall impact on widespread bottom-up ontology development for the Semantic Web is likely to be less than the potential that might possibly be unlocked with leveraging knowledge of existing (relational) databases to start developing ontologies.
1For conceptual data modeling languages, among others, the Object Management Group’s Ontology definition metamodel (http://www.omg.org/spec/ODM/1.0/) is exploring interactions between UML and OWL & RDF, and there are various results on mapping ER, EER, UML, ORM and/or ORM2 into a suitable or convenient DL language. This ‘application of Ontology and ontologies’ areas are deemed outside the book’s current scope.