Besides a top-down approach, another option for developing an ontology without starting from a blank slate is to reuse existing data, information, or knowledge. One motivation for considering this comes from the results obtained by Simperl et al. [SMB10]: their survey of 148 ontology development projects found that “domain analysis was shown to have the highest impact on the total effort” of ontology development, that “tool support for this activity was very poor”, and that the “participants shared the view that process guidelines tailored for [specialised domains or in projects relying on end-user contributions] are essential for the success of ontology engineering projects”. In other words: the knowledge acquisition bottleneck is still an issue. Methods and tools have been, and are being, developed to make it less hard to get the subject domain knowledge out of the experts and into the ontology, e.g., through natural language interfaces and diagrams, and to make it less taxing on the domain experts by reusing the ‘legacy’ material they may already have for managing their information and knowledge. It is the latter that we are going to look at in this chapter: bottom-up ontology development to get the subject domain knowledge represented in the ontology. We approach it from the other end of the spectrum compared to what we have seen in Chapter 6: we start from more or less reusable non-ontological sources and try to develop an ontology from them.
Techniques to carry out bottom-up ontology development range from manual to (almost) fully automated. They differ according to their focus:
- Ontology learning to populate the TBox, where the strategies can be subdivided into:
– transforming information or knowledge represented in one logic language into an OWL species;
– transforming somewhat structured information into an OWL species;
– starting at the base.
- Ontology learning to populate the ABox.
The latter is typically carried out with either natural language processing (NLP) or one or more data mining or machine learning techniques. In the remainder of this chapter, however, we shall focus primarily on populating the TBox. Practically, this means taking some ‘legacy’ material (i.e., non-Semantic Web and, mostly, non-ontological) and converting it into an OWL file with some manual pre- and/or postprocessing. Input artefacts may be, but are not limited to:
- Conceptual data models (ER, UML)
- Frame-based systems
- OBO format ontologies
- Biological models
- Excel sheets
- Tagging, folksonomies
- Output of text mining, machine learning, clustering
These artefacts are not equally easy (or difficult) to transform into a domain ontology. Figure 7.1.1 gives an idea of how far one has to ‘travel’ from the legacy representation to a ‘Semantic Web compliant’ one. The further the starting point is to the left of the figure, the more effort one has to put into the ontology learning for the result to be actually usable without the need for a full redesign. Given that this is an introductory textbook, not all variants will be covered. We shall focus on using a database as source material to develop an ontology (Section 7.1: Relational Databases and Related ‘legacy’ KR), spreadsheets (Section 7.2: From Spreadsheets to OWL), thesauri (Section 7.3: Thesauri), and a little bit of NLP (Section 7.4: Text Processing to Extract Content for Ontologies). Lastly, we will introduce ontology design patterns in Section 7.6: Ontology Design Patterns, which sit somewhere between bottom-up and top-down.
Figure 7.1.1: Various types of less and more comprehensively formalised ‘legacy’ resources.
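To make the idea of converting a ‘legacy’ artefact into an OWL file a little more concrete, the following is a minimal sketch in Python. It assumes a toy, hand-written conceptual data model and a deliberately naive mapping: entity types become classes, attributes become data properties, and binary relationships become object properties. The model, the names in it (`Student`, `Course`, `enrolledIn`), the `http://example.org/university` namespace, and the function `model_to_turtle` are all hypothetical and chosen for illustration only; the output is OWL serialised in Turtle syntax.

```python
# Hypothetical stand-in for a small conceptual data model (e.g., a fragment
# of an ER diagram). None of these names come from an actual legacy source.
LEGACY_MODEL = {
    "entities": {
        "Student": {"attributes": {"studentNumber": "xsd:string"}},
        "Course": {"attributes": {"credits": "xsd:integer"}},
    },
    # Each relationship: (name, source entity type, target entity type)
    "relationships": [("enrolledIn", "Student", "Course")],
}

# Prefix declarations; the default namespace is an assumed example IRI.
PREFIXES = """@prefix : <http://example.org/university#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def model_to_turtle(model):
    """Naively map entity types to classes, attributes to data properties,
    and binary relationships to object properties (Turtle syntax)."""
    lines = [PREFIXES]
    lines.append("<http://example.org/university> a owl:Ontology .")
    for entity, spec in sorted(model["entities"].items()):
        lines.append(f":{entity} a owl:Class .")
        for attribute, datatype in sorted(spec["attributes"].items()):
            lines.append(
                f":{attribute} a owl:DatatypeProperty ; "
                f"rdfs:domain :{entity} ; rdfs:range {datatype} ."
            )
    for name, source, target in model["relationships"]:
        lines.append(
            f":{name} a owl:ObjectProperty ; "
            f"rdfs:domain :{source} ; rdfs:range :{target} ."
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(model_to_turtle(LEGACY_MODEL))
```

Even this tiny example hints at why manual pre- and/or postprocessing is usually needed. In a conceptual data model, the types attached to a relationship act as constraints on the data, whereas domain and range axioms in OWL are used to infer class membership. A purely syntactic translation can therefore produce an ontology whose meaning differs from what the modeller intended.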