Design, Optimization, and Evaluation of a Path Query System

(1)

Design, Optimization, and Evaluation of a

(2)

(3)

Design, Optimization, and Evaluation of a

Path Query System

PROEFSCHRIFT

ter verkrijging van de graad van doctor

aan de Technische Universiteit Delft,

op gezag van de Rector Magnificus prof. dr. ir. J.T. Fokkema,

voorzitter van het College voor Promoties,

in het openbaar te verdedigen op maandag 9 mei 2005 om 10.30 uur

door

Iv´

an SZ ´

ANT ´

O

m´

ern¨

ok, TU Budapest

geboren te Budapest, Hongarije

(4)

Dit proefschrift is goedgekeurd door de promotor: Prof. Dr.-Ing. habil. W. Gerhardt-H¨ackl

Samenstelling promotiecommissie:

Rector Magnificus voorzitter

Prof. Dr.-Ing. habil. W. Gerhardt-H¨ackl Technische Universiteit Delft, promotor Prof. dr. ir. J.L.G. Dietz Technische Universiteit Delft

Prof. dr. ir. M. Looijen Technische Universiteit Delft Prof. dr. H. Koppelaar Technische Universiteit Delft Prof. dr. ir. Th.P. van der Weide Radboud Universiteit Nijmegen Prof. dr. ir. P. van Oosterom Technische Universiteit Delft Dr. ir. M.N. Sim Bitbybit Information Systems, Delft

Copyright c 2005 by I. Sz´ant´o

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

(5)

Preface

1

1 General Query Principles

3

1.1 Query definitions . . . 3

1.2 Query systems in context . . . 4

1.2.1 Information retrieval . . . 4

1.2.2 Query history in database systems . . . 6

1.3 Query system architecture . . . 7

1.3.1 Query system components . . . 7

1.3.2 Query parser, tree, engine and optimizer . . . 8

1.3.3 Query system output: query result . . . 10

1.3.4 Views — more query input and output . . . 10

1.4 Classification of database languages . . . 10

1.4.1 Language aspects . . . 12

1.4.2 Data aspects . . . 16

1.4.3 Query languages . . . 20

1.5 Maintenance of database systems and queries . . . 22

1.5.1 Avoiding the double maintenance problem . . . 22

1.5.2 Reuse . . . 23

1.5.3 Database design . . . 23

1.5.4 Consistency control of data . . . 24

1.6 Query system requirements . . . 25

1.6.1 Query language requirements . . . 25

1.6.2 Requirements for the parser, engine and optimizer . . . 29

1.6.3 Requirements for object query systems . . . 29

1.6.4 Requirements for XML query systems . . . 31

1.7 Overview . . . 32 i

(6)

ii CONTENTS

2 Research Goal, Questions, Method

33

2.1 Motivation . . . 33

2.2 Research problem, goal, and framework . . . 34

2.3 Research method . . . 35

2.4 Research requirements . . . 35

2.4.1 Requirement: support the object model . . . 36

2.4.2 Requirement: support and use the object-oriented paradigm 37 2.4.3 Requirement: expressive power . . . 38

2.4.4 Requirement: support the user . . . 39

2.5 Research scope . . . 40

3 Perspective-DB ODBMS

43 3.1 Object model . . . 43

3.2 System architecture . . . 46

3.2.1 Generic interfaces: virtual objects and TOML files . . . 47

3.2.2 OS: operating system and services . . . 49

3.2.3 VOIS: virtual object interface system . . . 50

3.2.4 VOMS: virtual object management system . . . 52

3.2.5 OMS: object management system . . . 52

3.2.6 GUI: graphical user interface . . . 55

3.2.7 Frame: framework . . . 57

3.3 Basic data manipulation, backup and restore, export and import . 59 3.3.1 Syntax of Instantiation TOML files . . . 59

3.3.2 Processing and using Instantiation TOML files . . . 61

3.3.3 The syntax of SOML files . . . 62

3.3.4 Processing and using SOML files . . . 62

3.4 Revisiting the research requirements . . . 64

4 The Perspective Query System

67 4.1 The role of the query subsystem . . . 67

4.2 Initial query system architecture . . . 68

4.3 Initial PQL syntax . . . 69

4.3.1 Syntax description . . . 69

4.3.2 Illustrative examples . . . 70

4.3.3 PQL is intuitive and user-friendly . . . 71

4.4 Query result: nested tables . . . 72

4.4.1 The columns and rows of the query tables . . . 72

4.4.2 Query table examples . . . 72

4.5 Query tree . . . 74

(7)

CONTENTS iii

5 Incremental View Maintenance

77

5.1 Materialized views . . . 77

5.1.1 State of the art maintenance policies . . . 78

5.1.2 Our variant compared to other work . . . 78

5.1.3 Issues to solve . . . 80

5.2 Our view maintenance algorithm . . . 80

5.2.1 Algorithm overview - notifications . . . 80

5.2.2 Algorithm details — finding the row to be changed . . . 81

5.2.3 Further details — refresh and propagate . . . 82

5.2.4 The issue of objects that cannot register with any table . . 84

5.3 Performance issues . . . 85

5.3.1 Missing metadata in the notifications . . . 85

5.3.2 A lesson learned from View Cache . . . 85

5.4 Conclusion and future work . . . 86

6 Combining the results of path expressions

89 6.1 Arithmetic and logical operators . . . 90

6.1.1 Extending PQL with arithmetic and logical operators . . . 90

6.1.2 Adding the scheduler . . . 92

6.1.3 String operations . . . 92

6.1.4 Automatic type conversion . . . 94

6.2 Calling methods . . . 94

6.2.1 Syntax extension for calling methods . . . 95

6.2.2 System extension for calling methods . . . 95

6.3 Aggregate functions . . . 96

6.3.1 Syntax extension for aggregate functions . . . 96

6.3.2 Implementation of aggregate functions . . . 97

6.3.3 Integration of aggregate functions and the view mainte-nance mechanism . . . 98

7 Filtered path queries

101 7.1 Simple filters . . . 102

7.1.1 Extending PQL with filters . . . 102

7.1.2 Scheduling simple filters: correctness and optimization . . . 103

7.2 Integration of filters and of the view maintenance mechanism . . . 104

7.2.1 Filtering one relation and nested relations . . . 105

7.2.2 Active results vs. optimization . . . 108

7.3 Combined filters . . . 108

7.3.1 Combined filters: filtering at the highest level (correctness) 108 7.3.2 Combined filters: filtering at every level (optimization) . . . 109

7.4 Integration of filters and aggregation . . . 110

(8)

iv CONTENTS

7.4.2 Duplicating nodes to enable two iterations . . . 111

8 Recursive and reusable path queries

113 8.1 Recursiveness in schema and in queries . . . 113

8.2 Motivation and goal of the chapter . . . 114

8.3 State of the art . . . 115

8.4 Recursive path queries and ownership-hierarchies . . . 116

8.5 Reusable queries . . . 117

8.6 Ownership-hierarchies . . . 118

8.7 Combining recursive path queries with filters . . . 122

8.8 Conclusion . . . 124

9 Extending export and import

127 9.1 Get OML . . . 127

9.1.1 System architecture and user interface . . . 128

9.1.2 Syntax description: Get OML and TOML Get . . . 128

9.1.3 Syntax description: SOML Get . . . 130

9.2 Set OML . . . 131

9.2.1 System architecture and user interface . . . 131

9.2.2 Syntax description: Set OML and TOML Set . . . 131

9.3 Summary, evaluation . . . 132

10 Further features and future improvements

133 10.1 Passing arguments to the query . . . 133

10.1.1 Syntax extension . . . 133

10.1.2 GUI integration . . . 134

10.1.3 A problem with the syntax . . . 134

10.2 Local filters and variables . . . 135

10.2.1 Local filters . . . 135

10.2.2 Variables . . . 135

10.3 Quantification . . . 136

10.4 Restructuring the results . . . 136

10.4.1 Nest, unnest, order, and group . . . 136

10.4.2 Mappings for the user and for the developer . . . 136

10.5 Optimization . . . 137

10.5.1 Reusing the query tree . . . 137

10.5.2 Reusing the query result . . . 137

10.5.3 Reusing parts of the query result . . . 137

10.6 Visualization . . . 138

(9)

CONTENTS v

10.6.2 Graphical query composition . . . 138

10.7 Further open problems . . . 139

11 Functional description

141 11.1 Object types . . . 141

11.2 Object instances . . . 147

11.3 Object creation . . . 150

11.4 Path expression types . . . 151

11.4.1 Simple path expressions . . . 151

11.4.2 Branching path expressions . . . 155

11.5 Path expression instances . . . 157

11.6 Extending the query . . . 159

11.6.1 Arithmetical and logical expressions . . . 159

11.6.2 Filters . . . 162 11.6.3 Operations . . . 164 11.6.4 Meta-members . . . 164 11.6.5 Recursive queries . . . 165 11.7 Summary . . . 166

12 Implementation details

167 12.1 Overview of object types . . . 167

12.2 Navigators . . . 169

12.2.1 Its . . . 169

12.2.2 Its Operator . . . 169

12.3 Query tree nodes . . . 170

12.3.1 Meta nodes . . . 170

12.3.2 Query execution . . . 171

12.3.3 Building the query tree . . . 175

12.3.4 Scheduler . . . 176

12.4 Query tables . . . 177

12.4.1 Query Table . . . 177

12.4.2 Query Result . . . 178

12.4.3 Query Mapped Result . . . 186

13 PQS maintenance

187 13.1 Error handling and trace messages . . . 188

13.1.1 In case of an error, log it and return an error . . . 188

13.1.2 Tracing program execution . . . 188

13.2 Visualization tools . . . 189

13.2.1 Visualizing the query tree . . . 189

13.2.2 Visualizing the query tables . . . 189

(10)

vi CONTENTS

13.4 Extending PQS . . . 192

13.5 PQS test suite . . . 193

13.5.1 Simple test cases . . . 194

13.5.2 Filtered test cases including arithmetical and logical ex-pressions . . . 194

13.5.3 Recursive test cases . . . 195

13.6 A debugging session . . . 195

14 Conclusion

197 14.1 PQS generic evaluation . . . 197 14.1.1 System architecture . . . 198 14.1.2 Query result . . . 198 14.1.3 View support . . . 198 14.1.4 Language aspects . . . 199

14.2 Revisiting the research requirements . . . 201

14.2.1 Support the object model . . . 201

14.2.2 Support and use the object-oriented paradigm . . . 202

14.2.3 Expressive power . . . 203

14.2.4 Support the user . . . 204

14.3 Reusable concepts and features . . . 206

14.3.1 Concepts also found in other query systems . . . 206

14.3.2 Features unique to PQS . . . 207

Bibliography

209

A PQL syntax (BNF description)

219

B Description of the Hospital case

221

C UML class diagram

223

D Example code

225 D.1 Example Instance.toml . . . 225

D.2 Example SOML Instance . . . 228

D.3 Example TOML Get . . . 229

D.4 Example TOML Set . . . 229

D.5 Example SOML Get . . . 231

D.6 Example SOML Set . . . 232

E XML Query requirements

233 E.1 Goals . . . 235

E.2 Usage Scenarios . . . 235

(11)

CONTENTS vii

E.2.2 Data-oriented documents . . . 236

E.2.3 Mixed-model documents . . . 236

E.2.4 Administrative data . . . 236

E.2.5 Filtering streams . . . 236

E.2.6 Document Object Model (DOM) . . . 236

E.2.7 Native XML repositories and web servers . . . 236

E.2.8 Catalog search . . . 236

E.2.9 Multiple syntactic environments . . . 237

E.3 Requirements . . . 237

E.3.1 Terminology . . . 237

E.3.2 General Requirements . . . 238

E.3.3 XML Query Data Model . . . 239

E.3.4 XML Query Functionality . . . 240

E.4 Relationship to Other Activities . . . 244

E.5 References (non-normative) . . . 246

F Test suite

249 F.1 Company schema . . . 249

F.2 Simple test cases . . . 249

F.3 Filtered test cases . . . 250

F.4 Recursive test cases for Company . . . 251

F.5 Recursive test cases for Department . . . 252

Index

255

Summary

259

Samenvatting

263

Acknowledgments

267

(12)

(13)

Preface

This work describes a Path Query System. A Path Query System is a software system which offers query services based on path expressions defined primarily by the relationships among database schema objects and by the attributes of the same database schema objects. This work describes the research results and the evaluation of their practicability.

The work was motivated by the observation that an object data manage-ment system based on a powerful object model has certain advantages over other database management systems. Therefore it was interesting to find out what op-portunities such an object data management system offers to its query subsystem and what consequences the object model has for the query subsystem.

The examination of the generic requirements of query systems and of the usability and maintenance aspects of existing query systems revealed that for query systems it is important to encourage the formulation of semantically sound queries, to support the formulation of reusable queries, to facilitate the usage of recursive queries through intuitive syntax and semantics, and to enable flexible, online, user-defined consistency control of the entered data.

To fulfill these requirements, it proved to be a solution to use the visualized object schema for supporting query formulation. The visualized object schema enables the user formulating the query to understand the semantics of the objects in the database schema. We have put this understanding to use in the query system by making the query language syntax follow the object semantics and by supplying it with properties that allow a mapping to the object representation concepts.

The designed and realized query system uses a path query language to navigate from object to object, and respects the encapsulation paradigm by using the methods of objects to access their members. This feature is unique, because to the author’s best knowledge, no other system is built with this requirement and because this requirement resulted in an optimization algorithm different from the others. Further optimization potential has been found in the use of a partial view

(14)

2 PREFACE

maintenance algorithm applied to the query results in several situations, which also enables flexible, online, user-defined consistency control of the entered data. Other unique features are query reusability, and that the query system is reused to achieve more control over the backup-restore and export-import procedures.

The Path Query System has been developed as a subsystem of the object database management system Perspective-DB. As such, the system is called Per-spective Query System. On the one hand, PerPer-spective-DB was chosen as research and evaluation environment for the Path Query Language because of its proper-ties which provided the preconditional basic concepts. On the other hand, there was a lack of a query system in Perspective-DB itself.

(15)

Chapter

1

General Query Principles

This chapter gives a comprehensive overview of the general principles of query systems, in order to elaborate on the motivations of this work, and to present a number of requirements that were used to develop and to evaluate the path query system. The general principles of query systems discussed here include query defi-nitions, the context of query systems including information retrieval and database systems, query languages and their classification, the architecture of query sys-tems, the query result, filtering, optimization, and query system requirements.

1.1 Query definitions

The word query has two definitions. First, query is an information system service that enables its user to retrieve and to manipulate information in a high-level, declarative way. Second, query is a high-level, declarative request that is sent to the query service by its user, which causes the query service to return information to the user or to manipulate information. We will make a distinction between the two meanings by using the term “query service” for the first meaning, and by using the word “query” or the term “query request” for the second meaning.

Although the meaning of the English word query implies read-only access, many query services support modification of information as an effect of a query.

A query language is a high-level, declarative computer language designed to formulate query requests. It is a notion frequently associated with query; many query services have a query language. A query system is a system built to provide a query service. A query subsystem is a synonym to query system, but this term also indicates that it works as a component of a more comprehensive system, usually of a database system or other information retrieval system. If a query system returns information to the user, that information is referred to as the query result.

(16)

4 GENERAL QUERY PRINCIPLES 1.2

Some query languages use path expressions as building blocks of queries. A path expression defines the steps to take on the path toward the objects to be retrieved. A query language based on path expressions is a path query language, and a query system using such a language is a path query system.

1.2 Query systems in context

In order to understand query systems, it is essential to see what they are sur-rounded with. Therefore, this section zooms out to show the context of query systems. Its goal is to give an overview by highlighting the most important land-marks, rather than to explain the details, which is postponed to the following sections.

The notion of query is related to the field of database systems as well as to the field of information retrieval. Databases contain formatted data (or facts); while information retrieval systems deal with unformatted data, i.e. texts. It is known [Fuh96] that the two fields have little overlap, therefore their integration is still actively researched.

1.2.1 Information retrieval

This subsection highlights some areas of the field of information retrieval through one of its most important subfields: document retrieval, which deals with the retrieval of documents and document parts from a document collection. The user specifies a query, e.g. in the form of one or more keywords, and the document retrieval system will find the matching documents and present them to the user. In order for such a system to work efficiently, it first has to analyze, describe and organize the documents in the collection. Information retrieval includes document retrieval, but in addition to documents and parts of documents, it also deals with retrieving other types of information, including images, sounds, movies, and so forth.

The paradigm of information retrieval has started long before the computer age. People needed to find information in a document collection as soon as there were documents. Even in the libraries of ancient times (e.g. the Library of Alexandria [Bru95], [Can89]), it was not enough to have many books, but the books had to be categorized and organized, in order to allow users to find those they were interested in. A library with an advising librarian can be thought as an information retrieval system, because users can direct questions (queries) to the librarian, who will try to locate books in the library to formulate and present an answer.

Automatic indexing Automating information retrieval by using computers involves automating each phase [Sal80]. The task of information analysis, also known as indexing, consists of the assignment of one or more identifiers, or index terms,

(17)

1.2 QUERY SYSTEMS IN CONTEXT 5

to each information item. Authors and librarians assign keywords and index words to documents, but there are known techniques to assign them automatically as well. Such a well-known technique is the vector-space model [Sal73] [Sal83], in which each document is represented by a vector, and each vector dimension represents the number of times the corresponding term occurs in the corresponding document. The textural features proposed by Haralick et al. [Har73] can be used to construct a vector-space for images, consequently, the vector-space model can be applied in image retrieval. Other than textural features can be used as well, including the outcome of object recognition or face recognition. For indexing sound, similar techniques are available, including speech recognition and using the words as index terms.

In video indexing, it is necessary to use more fine grained information items than whole films or video sequences, therefore, these are first segmented. Video segmentation is typically based on cut detection. The outcome of the segmen-tation process are scenes. For each scene, a representative frame, a so-called key-frame is selected. To facilitate dealing with the high number of scenes in a film, scenes are hierarchically clustered, and for each cluster, a representative key-frame is selected. After all this preprocessing, the clusters have to be in-dexed, for which several options are available, including using the features of the images found on the key frames, using the outcome of speech recognition if audio is available, or using the words in the subtitles if that is available.

Organization In addition to answering queries using keywords, documents, video and other kind of information can also be retrieved by browsing, which activity can be facilitated by cluster analysis [Sta04].

Search After the index has been created, the user may specify a query. The part of an information retrieval system that answers the user query is called query engine or search engine. The term search engine sounds familiar because search engines are widely used to search the Internet. These return a list of matching documents as response to their users’ queries that can take the form of a number of keywords or of logical expressions. Search engines are based on two components, the indexing component that scans the web pages on the Internet, puts the addresses of the visited pages along with their extracted index terms and sometimes with the cached copies of the documents into the database of the search engine, and the query component that uses the database to answer the user queries.

Presentation Once the search engine has found the results (hits) to the query, they have to be presented to the user. This does not only involve considerations as to the number of results to be presented at a time and to the description of each retrieved item, but ranking and changing the display adaptively to fit the user’s needs.

(18)

Question answering [Voo01] is an advanced way to present retrieved informa-tion, because instead of showing each retrieved document as a hit, it is trying to formulate an answer to the query request formulated as a question.

1.2.2 Query history in database systems

This subsection gives a brief historic overview of the world of query systems, when they are used as a subsystem of a database system.

Past In the context of database systems, queries were first used in relational database systems, where the query system expects the user to formulate high-level query requests in a declarative way, which are then interpreted by the system. A query is said to be a high-level request, because it requests data access at the set level, which is convenient for the user, since that enables higher productivity, that is, shorter queries for the same job. A query is declarative, because it specifies what is to be done, but it does not specify how it is to be done. This is in contrast to the low level, procedural data manipulation languages that were used in pre-relational systems, whose users had to deal with the detailed specification of how data is to be accessed. For the user, it is convenient to formulate a query in a declarative way, since then she does not need to deal with the details of how the query is executed. For the system, a declarative query enables to choose an execution plan for the query to provide better performance.

It was the relational data model along with the relational algebra and with the relational calculus that enabled the production of the first query systems, because they provided a firm basis for the optimization and scheduling of declarative queries.

Most of the query systems used to be built on the concept where the query is formulated as a string that conforms to a query language, such as SQL. Later, visual query systems were devised that are more intuitive and more attractive to end-users (such as QBE).

Present Today, query systems are widely used in database systems, and in ap-plications that use database systems as underlying technology. Database systems offer query services as an interface to retrieve and to manipulate data stored in the database. Such query services can be used by the various user groups of the database, including end-users, administrators, application developers, database designers, and so forth.

Query systems for semistructured databases are the subject of research, be-cause of the appearance and fast expansion of the Internet, as well as for integra-tion of data from heterogeneous data sources.

Future We anticipate that query systems will develop further. One development direction is to make query systems that try to combine various sources of

(19)

in-1.3 QUERY SYSTEM ARCHITECTURE 7

formation to answer the users’ queries. Other development directions include the enhancement of user interfaces by more sophisticated visualization or by a thorough understanding of queries formulated in natural languages.

1.3 Query system architecture

This section discusses the internal architecture of query systems, but first it states that every query system is either a subsystem of a database management system or a database management system itself, so that it can put query systems in the context of databases systems. This section also discusses views.

The term database (DB) is defined [Elm00] as a collection of related data, and the term database management system (DBMS) as a collection of programs that enables users to create and maintain a database (for more explanation and for the definition of other database terms, please refer to the book [Elm00]). To maintain a database, data retrieval and data manipulation are necessary. Since every query system (QS) is made for accessing a collection of related data, every QS is a part of a DBMS (provided the QS is a collection of programs), or if the DBMS contains no other system components the QS is a DBMS.

1.3.1 Query system components

The components common to all query systems are shown in Figure 1.1.

Input to the query system is a query request as well as the database and metadata describing the database. Output of the query system is the query result. The query request is either a query string, or a request formulated using a visual user interface. In the former case, the string is passed to a parser that translates the query to an initial query tree, in the latter case the visual user interface outputs the initial query tree. The initial query tree may in both cases be passed on to an optional optimizer that transforms the initial query tree to a query tree that serves as execution plan. The optionally optimized query tree is then used by the query engine to access the database and to generate the query result.

The following deviations are possible from this architecture. The optimizer is usually implemented, but sometimes it is omitted, because it is not necessary or for other reasons. The query engine is always there, but sometimes it has access to multiple databases, which may be taken into account in the query language, and in the optimizer. Figure 1.2 shows a generic query system that works with a query language, because this is a frequently used variation.

The following subsections give more details about each system component and about the query result, whereas the next section discusses query languages.

(20)

Figure 1.1: The architecture of a generic query system

1.3.2 Query parser, tree, engine and optimizer

Once the query system receives a query string, it is first passed to the query parser. The query parser is based on the syntax of the query language; it accepts only query strings that conform to the query language. Moreover, the parser checks the query string against the database schema (if any), and it outputs an initial query tree.

The query tree is an internal representation of the query request. It is used throughout the query system, as described previously: the parser produces its initial form that is given to the optimizer, the optimizer rewrites it to a query tree that is in many cases better than the initial one, but it is still equivalent with it with respect to the query result, and this rewritten query tree is then called execution plan and is used by the query engine to access the database instance and to execute the query. Although, the query tree is initially produced from the query string, they are different, because the query string is the declarative specification of the query, whereas the query tree is the procedural description of the steps that have to be taken to the fulfill the query request.

In most cases, the query engine is able to execute the same query in several ways, because the query engine has various access methods and execution algo-rithms, which perform the same task in different ways. Some ways of executing the same query are faster than the others, but all of them have the same result. The same query means the same query input, the same query result, and the several ways can be represented as different query trees. The set of these different but equivalent query trees constitutes the search space of a specific query for the

(21)

1.3 QUERY SYSTEM ARCHITECTURE 9

Figure 1.2: The architecture of a generic query system with a query language

query optimizer.

The query optimizer [Ioa96] inputs the initial query tree constructed by the parser, and it has the task to transform it to a query tree that is equivalent to the initial query tree, but that can be executed by the query engine faster than some other equivalent query trees, faster than the initial query tree, and in the ideal situation, faster than all other equivalent query trees. To perform its task, it is useful for the optimizer to have access to the database schema (if any), to the available access paths that can be used by the query engine to access the various parts of the database instance, and to statistical information about the database instance.

The optimizer can work in several ways, such as based on heuristic rules or based on cost functions. A rule-based optimizer uses heuristic rules to gradually transform the initial query tree so as to arrive at the execution plan. A cost-based optimizer uses the initial query tree to generate the search space. Each point in the search space is a query tree for which it can calculate the cost. Consequently, it has to perform a search to find the query tree that has the smallest cost. If the optimizer performs an exhaustive search it may take a long time, even longer than executing the query by using a sub-optimal query tree. Consequently, the optimizer usually does not perform an exhaustive search and does not find the optimal query tree, only a sub-optimal one.

(22)

1.3.3 Query system output: query result

In some respect, the query result conforms to the database and in some respect it does not: it is usually conforms to the data model of the DBMS, but it does usually not conform to the database schema. Moreover, the advantage of the query result conforming to the data model is that it enables queries to be nested in other queries: subqueries.

For example, the result of an SQL query is a table that conforms to the data model of the RDBMS, but there may not be any table in the database that has the same columns as the query result. Similarly, the result of an OQL query is an object, but it is not necessarily of an object type defined in the database schema. Both languages support the formulation of queries with subqueries.

1.3.4 Views — more query input and output

Many database systems support the definition of database views. So do many query languages, e.g. SQL and OQL. A view is a derived table or a derived object that is based on stored tables or stored objects. The stored tables or stored objects that the view is based on are called base tables or base objects. A view can be defined using a view definition language, or a query language. Indeed, the result of every read-only query may be regarded as a view definition. After a view has been defined, its definition is placed into the metadata; consequently, it can be used as input to subsequent queries. The book [Elm00] even discusses some cases when it is possible for the user to use views to update the underlying data, however, it also mentions that this is not possible for every view.

A view is a materialized view if its data is stored in the database otherwise the view is virtual. If the base tables or base objects of a materialized view change, the view has to be maintained so that its data is consistent with the base tables or base objects.

If a query refers to a virtual view defined in the same query language, the view definition can be placed in the query as a subquery, and the optimizer can optimize the whole query as one. If a query refers to a materialized view, the optimizer will have the choice between (1) placing the view definition in the query and instructing the engine to access the base tables of the view, and (2) instructing the engine to access the materialized view. Levy surveyed answering queries using views materialized earlier in [Hal01], various authors contributed to [Gup99], a book about relational materialized views, Kuno and Rundensteiner deal with materialized object views in [Kun98].

1.4 Classification of database languages

If a query system has a query language they will influence each other. This is because the query system has to support all the query requests that can be formulated in the language, and vice versa, the query language has to have the

(23)

1.4 CLASSIFICATION OF DATABASE LANGUAGES 11

expressive power to enable the formulation of all the query requests that are to be supported by the query system in the query language. This is why it is usual that query systems are studied by studying their languages.

A database language (DBL) is a language used in a DBMS, and as the previous section stated, every QS is a part of a DBMS (provided the QS is a collection of programs), or if there are no other system components in the DBMS, the QS is a DBMS. Consequently, every query language (QL) is a DBL, and since a data manipulation language (DML) is a DBL that is used to manipulate and to retrieve data, every QL is a DML.

Studying the classification of database languages facilitates placing query sys-tems into a context, more specifically, to place the Perspective Query System into a context, and to compare it to other query systems. A comprehensive study of database languages is [Bla87], but the authors chose to make this classification to pay special attention to DMLs and QLs, and because [Bla87] is rather old, consequently, it does not contain the new developments.

Before examining the various classes of database languages, this section gives a brief overview of interfaces to DBS and QS.

In the sequel, we will use a number of terms from the field of database systems, in the sense defined by the book [Elm00].

Interfaces to database systems Database systems provide various interfaces for their various users, including [Elm00] database languages, visual interfaces, such as menu-based interfaces for browsing, forms-based interfaces, graphical user in-terfaces, interfaces for parametric users, and for database administrators (DBAs), as well as natural language interfaces.

Interfaces for DBAs include backup-restore and export-import facilities. Backup-restore facilities are different from export-import facilities in that the former use their own internal file format, while the latter use an external file format understood by other systems.

The query service provides an interface or interfaces to a database system, or to several database systems. In the simplest situation, a database system has a query subsystem that provides one interface to the database system, which is a query language, that can be classified as a high-level, declarative DML (first clas-sification), and set-oriented interface (second classification). In other situations, a database system has a query subsystem with multiple interfaces, or multiple database systems are federated using a common query system that provides a unified interface to those database systems.

Query system interfaces A visual query system uses a visual representation to depict the domain of interest and to express related requests [Cat97]. Such a system may be more intuitive and more attractive to end-users; it may help the user to formulate a request by enabling a two-way interaction as opposed to the one-way communication of query languages. Examples for visual query systems

(24)

are QBE and ABF [Row85]. Visual query system interfaces are sometimes also called visual query languages. In the sequel, we will use the term query languages to mean textual query languages, but if it is necessary, we will use the terms visual query languages and textual query languages to emphasize the difference between them.

The user of a natural language interface may use a natural language to express queries. This is advantegeous, because there is no need for the users to learn the query language, they can start asking questions right away. However, the answers of such systems are usually not as precise and as reliable as the answers of query systems using a well-defined query language [Voo01], because that would require the computer to understand every question that can be asked by humans, which is currently not possible for a computer program. To focus on the issues related to query languages, we limit the sequel to textual and visual query languages and systems that use those kinds of languages.

Database language aspects To put query languages in context, this section presents a classification of database languages along the following aspects:

language aspects

• the operations that the database language is able to perform (data definition, data manipulation, view definition, etc) • the level at which the language accesses data

(attribute-level, record-level, set-level, etc) • declarative vs. procedural

• standalone vs. embedded data aspects

• the extent to which the data to access is structured (structured, semistructured, no structure)

• the data model that the data to access conforms to (if any)

(hierarchical, network, relational, semantic, object-oriented, semistruc-tured, etc)

1.4.1 Language aspects

Operations performed The book [Elm00] mentions the following database lan-guages as interfaces to database systems: data definition lanlan-guages (DDLs), data manipulation languages (DMLs), storage definition languages, view definition lan-guages (VDLs), and mapping specification lanlan-guages. Most DMLs are able to perform not only data retrieval but also to modify the database instance.

(25)

It is notable that the standard SQL contains DDL elements, DML elements and view definition language elements. However, we will focus on DMLs, because the set of query languages is a subset of the set of DMLs. In that respect, SQL is able to perform data retrieval and to modify the database instance. OQL is a DML that is designed for data retrieval, but it can also modify the database instance by invoking methods.

[Sto77] explains that the categories DDL and DML are relative, since infor-mation can be moved back and forth between a DDL and a DML, in the following way. Assume a query with a subquery. The query will look simpler if the sub-query is defined as a view. The sub-query will look even more simple if the whole query is defined as a view. Consequently, the DML cannot be considered without the corresponding DDL. SQL is a DDL and a DML in one, OQL is a DML, and its corresponding DDL is ODL. Xplain [Bek99] has a DDL and a DML, Lorel [Abi97] and XML-QL [Deu99] are DMLs.

High-level or low-level [Loc87] presents a classification of database interfaces by defining the extended layer model that is the extension of the ANSI/SPARC architecture. It comprises 6 layers where interfaces may reside (see Figure 1.3), including set-oriented interfaces (I5), tuple-oriented interfaces (I4), internal interfaces (I3), segment interfaces (I2), file interfaces (I1), and device in-terfaces (I0). A database language may operate at one or more of these inin-terfaces, but a query language must be able to operate at I5, according to the definition, so does SQL and OQL.

For an RDBMS, I5 is an interface to sets of tuples, or relational tables, and I4 is an interface to tuples. For non-relational query languages, I5 is an interface to sets of objects, defined by extents (Xplain, OQL) or by path expressions (Xplain, OQL, Lorel, XML-QL), and I4 is an interface to individual objects.

Declarative or procedural When using a declarative language, the user can describe what she wants to achieve with the program, as opposed to using a procedural language, when the user must describe how she wants to achieve it. That is, the procedural description of a request must contain step-by-step instructions, whereas the declarative description contains the goal of the request in an unam-biguous and concise manner.

The advantage of declarative languages is their conciseness, and there is exper-imental proof for their increasing of user productivity. Also, declarative languages leave space for choosing the algorithm that is used to execute the request, conse-quently, for such a language, an optimizer program can be built that chooses an algorithm that is potentially better than one written in a procedural language. This is true, because once a new efficient algorithm element is found, and built into the system using a declarative language, it can be used by all requests, as opposed to a system using a procedural language, where every request that wishes to use this new efficient algorithm element has to specify it explicitly.

(26)

14 GENERAL QUERY PRINCIPLES 1.4 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... . ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

Physical system storage File management,

mapping of files to memory System buffer management, page addressing

Storage of tuples in memory, access path implementation, access control, synchronization log and recovery Groups of tuples, access path selection, transaction management, access control, and integrity control

Access mapping, access optimization, access control, and integrity control

I0 — Device interface I1 — File interfaces: files,

blocks, file access

I2 — Segment interface: pages, segments, page addresses, preparation of pages I3 — Internal interface: tuples,

groups of tuples, access paths

I4 — Tuple-oriented interface: tuples, logical access paths, navigational access

I5 — Set-oriented interface: relations, layers, tuples, non-procedural languages

(27)

The advantage of procedural languages is that they are usually computationally complete (they can express any computable function [Atk89]), whereas declara-tive languages may have limits in their expressive power1_{. Moreover, procedural}

languages can usually get closer to the computer hardware, because the computer hardware is accessed by procedural machine code.

General-purpose programming languages are usually procedural languages, such as Pascal, PL/I, or C. Query languages are declarative languages by defini-tion, such as SQL and OQL.

Relational calculus is declarative as opposed to relational algebra that is proce-dural, because a certain order among the operations is always explicitly specified in a relational algebra expression [Elm00]. QUEL is declarative and SQL is proce-dural, because QUEL is based on the tuple relational calculus, and SQL is based on relational algebra, and SQL also supports subqueries that are processed before the query in which they are nested [Sto77]. However, it has been proven that any query expressed in relational algebra can be expressed in relational calculus as well, consequently, from the aspect of optimization, these two are equivalent. Similarly, subqueries can usually be unnested, consequently, from the aspect of optimization SQL and QUEL are equivalent. From the aspect of the user, sub-queries are an option that give more freedom to the user to express query requests. Some query languages, e.g. OQL, Xplain, and Lorel use path expressions as building blocks of queries. Path expressions are declarative, because they define a set of objects by not telling which exact steps lead to them, but they are also procedural to some extent, because path expressions define the steps to take on the path toward the objects. Xplain, as mentioned, works with path expressions, but it has further procedural characteristics: it enables the user to define variables (value command) and to extend the base tables with temporary columns (extend command) before actually telling what to get (get command) from the database. The other mentioned query languages that use path expressions will be dealt with later.

Embedded database languages The limited expressive power of declarative lan-guages and the advantages of declarative lanlan-guages over procedural lanlan-guages together raise the need for a combined use of declarative and procedural lan-guages. This can be achieved by using a query language that is embedded in a general-purpose programming language. In this context, the query language is called data sublanguage, and the general-purpose programming language is called host language.

By generalizing the idea of embedding a query language, a database language can also be embedded in a host language. This is a way to enable the users of the host language to access the DB and to invoke DBMS functionality.

1_{Especially declarative query languages (such as SQL or OQL) have limited expressive power.}

(28)

Embedded SQL has limited functionality as opposed to standalone SQL, since embedded SQL has no DDL elements.

1.4.2 Data aspects

The architecture of a database language is influenced highly by the characteristics of the data that the database language is designed to access.

Data with or without structure Data to be accessed can be classified according to the extent to which it is structured. Structured data has a structure that conforms to a database schema defined in some data model.

The next level is semistructured data, which may have a predefined structure, or a structure discovered or inferred by the system, but no solid schema. We will come back to semistructured data as one of the data models below.

When the data has no predefined structure, and the system cannot determine any structure then the possibilities of a database language are limited. A special case may be that the data is text, and a search/replace facility is working on it. However, we do not deal with this class.

Data model

Query subsystems of database systems depend on the data model of the database system, such as SQL depends on the relational model, OQL depends on the ODMG object model, and Xplain depends on its semantic data model. Here is an overview of the data models that are important for the later discussions or because of their widespread usage.

Pre-relational data models The well-known pre-relational data models are the hi-erarchical model and the network model. The IMS DBMS (IBM) is based on the hierarchical model, and it has its own data definition language, and it uses DL/I as the data manipulation. The CODASYL DBTG has proposed a DDL, a Sub-schema DDL, and a DML for databases with a Sub-schema conforming to the network model. DML and DL/I are low-level, procedural database languages, the other mentioned languages are data definition languages, to which these terms do not apply. These data models certainly have further associated database languages, but we do not deal with them here, and we do not discuss the details of the mentioned languages.

Relational data model The standard relational database language is SQL. It was first used in the IBM prototype System R [Ast76], and later it was standard-ized by ANSI. The standard had multiple versions (e.g. 86, 89, SQL-92, SQL:1999) and its current version is still under construction [Eis00], many RDBMS products include additional features, many do not support the whole standard.

(29)

Entity-relationship data model The entity-relationship data model [Che76] is widely used for making graphical models of the universe of discourse. To the best knowl-edge of the author, there is no database system and no DML based on the entity-relationship model.

Extensions to the relational data model There have been various proposals to extend the relational data model, including RM/T [Cod79], GEM [Zan83] and [Tsu84]. A survey of such extensions is [Sch92]. The proposal to extend the relational systems with object-oriented features has the most influence today [Sto96], because today, several relational DBMSs do have object-oriented features that were accepted in the SQL standard. Another notable extension is the nested relational model (e.g. [ ¨Ozs85]), research on which served as a basis for the development of object databases.

Semantic data models Semantic data models were designed to enable the database schema to capture more semantics than a relational schema can. Hull and King have prepared a survey of the semantic data models [Hul87] in 1987, which is now rather old, as are many research projects on semantic data models, most of which are not in use any longer. It describes the entity-relationship model, and various extensions to the relational model, as well as the functional data model [Shi81] with its DAPLEX data language. An example that is still being used is Xplain [Bek92], a data model, a DBMS, and a database language, which is shown [Bek97] in many situations and in many respects to be better than SQL and an RDBMS. The survey counts the entity-relationship model as a semantic data model, which may be disputed, but that model is also being used, mainly for modeling.

Object data models Object data models describe the universe of discourse in terms of objects, which is very close to what the human mind does. Object data models also enable the database schema to capture more semantics than a relational schema can. They were inspired by the flexibility and anticipated growth in use of the object-oriented programming languages. The first agreement reached on the requirements of object models is [Atk89]. The current object data standard is ODMG 3.0 [Cat00] defined by OMG, which contains the description of the Object Definition Language (ODL) and of the Object Query Language (OQL), as well as of the language bindings that are also data manipulation interfaces for object databases. More recent work [Tri02] has addressed the semantic optimization of OQL queries.

Semistructured data models Query systems for semistructured databases are usu-ally not the subsystem of any database system, but they also depend on a data model, a semistructured data model.

(30)

According to [Suc98], research on semistructured data started from the ob-servation that much of today’s electronic data does not conform to traditional relational or object-oriented data models. It defines semistructured data as data that could be modeled as object-oriented data, but that has an irregular structure: some objects may have missing attributes, others may have multiple occurrences of the same attribute, the same attribute may have different types in different ob-jects, semantically related information may be represented differently in various objects. The application field of semistructured data includes HTML, SGML and XML documents, as well as the integration of heterogeneous data sources.

Semistructured data is usually modeled as a labeled graph. Known semistruc-tured data model variants are OEM, XML, UnQL, Lorel. Semistrucsemistruc-tured data often exhibits some regularities that can be captured in a schema. This schema can either be specified by an expert or be inferred from the data. In both cases, the schema can either be flexible by enabling more variations, which can go as far as imposing no constraints on data at all, or it can be rigid by fixing data structures, which can go as far as imposing a schema of an object database.

Research on semistructured data resulted in a number of query language pro-posals for semistructured data, some of which were also implemented, such as Lorel, XML-QL, and UnQL. These query languages provide read-only access to semistructured databases. More recent query languages for XML data in-clude XPath [XP99], [XP05], [Mik04], [Gup03a], [Got03], [Bun03b], [Got02], and XQuery [XML03], [Fer03].

Summary and examples Table 1.1 and Table 1.2 summarize the classification of a number of database languages along the previously discussed aspects.

language data model operations

DML DDL VDL

retrieve modify

SQL relational + + + +

embedded SQL relational + + -

-QUEL relational + + + +

relational algebra relational + - -

-relational calculus relational + - -

-ODL object - - +

-OQL object + - -

-Lorel semistructured + + -

-Xplain semantic + + +

-XML-QL semistructured + - -

-Table 1.1: Query languages and their characteristics, part 1

(31)

language level declarative

SQL I3, I5 + (except subqueries)

embedded SQL I4, I5 + (but host language is procedural)

QUEL I5 +

relational algebra I5 + (the order of the operations is a procedure) relational calculus I5 +

ODL I5 n/a

OQL I4, I5 + (except subqueries) Lorel I4, I5 + (except subqueries)

Xplain I4, I5 +

XML-QL I4, I5 +

Table 1.2: Query languages and their characteristics, part 2

database: SQL, PL/SQL and embedded SQL. All of these languages are interfaces to the same query subsystem, however, this query subsystem can work in a dis-tributed database system as well. The languages are built around the Oracle ver-sion of SQL, the Structured Query Language that is a hybrid language, because it contains both DDL and DML elements, as well as language elements for mapping specification. Furthermore, SQL provides interfaces to multiple layers: I1, I2, I4, and I5, although Oracle does not enforce clear separation of these layers. Oracle also provides further interfaces, such as export/import utilities, backup/restore programs, and performance tuning applications, but custom applications must use some form of SQL to retrieve or manipulate data in the database.

Another example is any ODMG compliant object DBMS (ODBMS), that pro-vides ODL and OQL among its interfaces to the database. ODL is a DDL, OQL is a DML, a query language. Such an ODBMS probably has language bindings that support access to the database from a general purpose programming lan-guage such as C++. The lanlan-guage binding along with the programming lanlan-guage is also a DML, therefore, custom applications may use that to retrieve or manip-ulate data in the database, it is not compulsory for them to use a query language. OQL is a query language that supports retrieval of data directly, and manipula-tion of data indirectly, by calling methods. OQL works at the I4 and I5 layers, the language bindings may support access to the lower layers.

Interfaces to query systems Query system interfaces can be classified as query languages, visual query system interfaces and natural language interfaces. As mentioned by the previous section, a query language is a computer language that is used to formulate queries, in the form of query strings. Examples for query languages are SQL and OQL.

(32)

1.4.3 Query languages

After the general classification of database languages, we now direct our attention to query languages and describe some of them in more detail.

First, textual query languages are discussed then visual query languages. The standard SQL query language is based on relational theory, therefore, first we discuss relational algebra and relational calculus so as to be able to present SQL and the select-from-where syntax from their point of view.

Relational algebra and calculus

The relational algebra was introduced by Codd’s classical paper [Cod70]. Since relations are based on sets, it includes set operations: union, intersection, set difference, Cartesian product, and division. Additionally, it introduced select, project and join.

The relational calculus has two flavors: tuple relational calculus and domain relational calculus. Both have the same expressive power as relational algebra, that is, any query that can be expressed in relational algebra, can also be expressed in relational calculus and vice versa.

Languages using the select-from-where syntax

Several textual query languages are based on the select-from-where syntax, such as SQL, OQL, and Lorel.

In the SELECT clause, the user defines the things she wants to appear in the result, in the FROM clause she defines the entry point to the database, usually the name of a table (for relational databases), of an extent or of an object (for object databases and for semistructured data). In the WHERE clause, the user defines filters to filter out data that is of no interest for the user in the current query.

The SELECT clause of SQL corresponds to the project operation of the re-lational algebra, the WHERE clause of SQL corresponds to the select operation and to the specification of the join condition of the relational algebra. The FROM clause of SQL specifies the tables that participate in the query, that is, those that are the subject of any of the other operations. The FROM clause of OQL speci-fies object instances and extents, and the FROM clause of Lorel specispeci-fies object instances. It is notable that the FROM clause has some controversies:

• In Oracle SQL, any table enumerated in the FROM clause is to be joined with the other tables, even if this is unnecessary for computing the query result.

• In Lorel, the FROM clause defines the scope of the query. It may be omitted, in which case it is generated from the SELECT clause. This means, that the FROM clause is implicitly specified by the SELECT clause, which can differ from the FROM clause the user had in mind.

(33)

These operations are the relational algebra operations or their counterparts in object models or semistructured object models.

Apart from these operations, these query languages usually support post-processing of the data retrieved from the database, using arithmetical and logical expressions in the select clause, as well as in the where clause. Some of these query languages have syntactical constructs to sort and group the results. Also, these query languages usually enable the formulation of nested queries, where one query is embedded in another.

The use of variables The query languages based on the select-from-where syntax usually support the use of variables to refer to data items. This is done by defining and binding the variables in the FROM clause, which is useful because this supports the differentiation between two occurrences of the same data item. This is sometimes also useful to give a local name to a data item in a query, which is shorter than its global name (user convenience).

For example, in SQL, variables can be used to differentiate between two oc-currences of the same table in one query.

Subqueries The query languages based on the select-from-where syntax usually support the use of subqueries. A subquery is a query nested in the from clause or in the where clause of a parent query, to which it supplies data items. In some cases, the result of the subquery is used as read-only data to the parent query, in other cases, the result of the subquery is a set of data items, some of which is modified by the parent query, in yet other cases, the subquery and the parent query are correlated, that is, they have common variables and cannot be interpreted without each other.

Other textual query languages

Apart from the query languages based on the select-from-where syntax there are other textual query languages, such as QUEL, and Xplain.

QUEL is not based on the select-from-where syntax, because it was developed at the same time as SQL, the first query language based on the select-from-where syntax. Also, QUEL is based on the relational calculus, as opposed to SQL that is based on the relational algebra, which has the consequence that a different syntax suits it better. QUEL is based on the range-retrieve-where syntax. The RANGE clause is used to specify the range of the variables that are to be used in the query. The RETRIEVE clause specifies the expressions the user wishes to get in the result of the query. The WHERE clause specifies the filter expressions that need to be fulfilled by the results retrieved by the query.

Xplain is a query language for the Xplain semantic database system. It has a get-where syntax, instead of a select-from-where syntax. It is based on path expressions. The requested path expressions are specified in the GET clause, and the filters are specified in the WHERE clause.

(34)

The author cannot see any particular reasons that explain why the select-from-where syntax is accepted as the standard for query languages. However, it is remarkable that all the query languages known to the author have a place to specify the data items the user wants to appear in the result (SELECT, RE-TRIEVE, and GET clause), and the filter is specified in the WHERE clause.

Visual query languages

Visual query languages are intended to make it easier for the user to ask queries. A notable example is QBE that is a query language for the relational model, therefore, it shows tables on the screen to represent the database, and to enable the user to formulate queries. To formulate a query, the user first asks for the tables that she will use. The system responds by putting the columns of the requested tables on screen. The user can then request values from the database by putting the data request operator into any of the columns shown on screen. Filters and variables can also be specified by putting them into the columns shown.

1.5 Maintenance of database systems and queries

This section addresses the problems the engineers need to deal with in practice when they use databases and queries.

1.5.1 Avoiding the double maintenance problem

It is well known that every file on which people are working will have many ver-sions, and this is true for source files of programs and queries as well. Maintaining these versions will cause many headaches to the engineers unless they know about software configuration management:

”Configuration management is the art of identifying, organizing, and control-ling modifications to the software being built by a programming team.” [Bab86]

One of the problems that usually arises is the double maintenance problem. Whenever a programmer copies and pastes a piece of code, it is potentially dan-gerous, because once this piece of code needs to be changed, it will have to be changed twice. The programmer, who has to change it, needs to remember, that it has to be changed twice, and people tend to forget things. It may even happen that the programmer who has to change it is not the same as the programmer who duplicated it. In this case, unless the first programmer has documented du-plication, it is impossible for the second programmer to know that the piece of code appears twice, and has therefore to be changed in both places.

To avoid all these problems, the solution is: no code should be duplicated. Since a query is a piece of code, this holds for queries, too.

(35)

1.5 MAINTENANCE OF DATABASE SYSTEMS AND QUERIES 23

1.5.2 Reuse

Code reuse is a concept related to the double maintenance problem. It is one of the concepts taught by the object oriented paradigm.

Once a piece of code has been completed, it is worth using it again and again, because completing a piece of code involves design, implementation, optimization, throrough testing, correcting the bugs and maintenance. Consequently, reusing a piece of code means reusing this investment again and again. It means that this investment has to be done once, and can be used many times. Trough reuse, the piece of code gets even more testing, even more bugs get corrected, thus it gets even more robust.

In order to achieve code reuse, the code has to be written in a way that allows it to be reused. The object-oriented paradigm facilitates reuse by encapsulation and inheritance.

Reusing queries would probably result in the same advantages as reusing pro-gram code written in propro-gramming languages, but it has not received much at-tention.

Query reuse can be achieved only if the queries can be made generic enough to enable them to be reused in other parts of the database or even in another database.

1.5.3 Database design

Professional design of a database schema needs to start by constructing the model of the Universe of Discourse (UoD), i.e. the part of the real world, in which the user of the information system to be constructed is interested. For constructing the model of the UoD, a suitable modeling language needs to be selected, such as the Entity-Relationship model or the object-oriented UML (Unified Modeling Language). Once the model of the UoD has been created, it has to be mapped onto the data model of the database, which results in the conceptual schema of the database.

Once the conceptual schema and the mapping between the model of the UoD and the conceptual schema has been created, the mapping needs to be used for cre-ating the applications of the information system. According to the ANSI/SPARC architecture [ANS75], [Jar77], each of the applications will see the portion of the database through each of the external schemas created for them, which are derived from the conceptual schema. In order to achieve a good performance, the con-ceptual schema will have to be mapped onto the internal schema which describes how the data will be stored physically.

Consequently, mappings will have to be constructed between the conceptual schema and the external schemas as well as between the conceptual schema and the internal schema. There should exist languages to describe these mappings. Current database management systems usually do not have such languages, but

(36)

provide at lease some kind of support for creating and maintaining these map-pings.

However, current database management systems usually do not have any sup-port for creating and maintaining the mapping between the model of the UoD and the conceptual schema. Although, there are tools that support the creation of such mappings, it still requires a lot of time and effort to use the mapping for creating the applications, and for maintaining the applications. Furthermore, every UoD will change, and this has the consequence that its model and the map-ping needs to be changed. Since the mapmap-ping was used to create the applications, the impact of the change on the applications will have to be carefully analyzed and propagated to applications.

Consequently, the need for the mapping between the model of the UoD and the conceptual schema has the drawback that it costs a lot of time and effort to create, to use and to maintain it.

There are some object database systems that have an object model powerful enough to be used to model the UoD. If this object model is used to construct the model of the UoD, no mapping needs to be created, because the model of the UoD can already be used as the conceptual schema. This is a pragmatic way of solving the problems caused by the mapping explained above.

This is why the author has chosen to do research in the field of object database systems with powerful object models.

1.5.4 Consistency control of data

One of the reasons people use database systems to store data is that database systems ensure the consistency of the entered data. Database systems achieve this as follows.

Interfaces

The database system has a number of well-defined interfaces (see Section 1.4.1), through which it accepts transaction requests. There is no other way to change the data in the database but going through one of these interfaces. Consequently, the consistency of the data in the database can be controlled through these interfaces.

Database schema

The database system ensures that the data it stores always complies to the database schema as defined by the database designer. This is achieved by check-ing each request to change the data by the users and by the programs against the database schema and by disallowing those that would not comply to the database schema.

(37)

1.6 QUERY SYSTEM REQUIREMENTS 25

Rules and triggers

Since it seems impossible to define every possible data consistency rule in the database schema, many database systems provide triggers or rules, through which the user can define data consistency rules that complement the rules coded into the database schema.

These are often defined as queries that are executed after each transaction request and based on their result the transaction is either accepted or rolled back. Such a rule has the problem that it may become too strict in some situations, for example when the user is entering data, and it seems easier to temporarily deactivate the rule. Another problem with these predefined rules and triggers can be that it can be difficult to remember all of them.

1.6 Query system requirements

This section presents the requirements that all query systems have to fulfill, some of which are really fulfilled by all query systems, and some of which are not. Each requirement is either a strong requirement that has to be fulfilled, or a less strong requirement that should be useful but need not be fulfilled. The subsections present the requirements classified by the system components that the requirement applies to: query language, parser, optimizer, engine. This section is based on various pieces from the query literature.

1.6.1 Query language requirements

High-level, declarative Query languages are defined as high-level, declarative lan-guages to enable the user to retrieve and to manipulate data in the database, therefore, it is a general strong requirement for query languages that they are high-level and declarative. There is no precise measure of the grade to which a language is high-level or declarative, but we can place the various computer languages on a scale. On one end of the scale, we can place the general-purpose programming languages, such as Pascal, C, or C++, which are procedural lan-guages, therefore, they are not declarative.

As next on the scale, we can place functional languages, such as LISP. The online FAQ [fun02] defines functional languages as follows.

Functional programming is a style of programming that emphasizes the evaluation of expressions, rather than execution of commands. The expressions in these languages are formed by using functions to com-bine basic values. A functional language is a language that supports and encourages programming in a functional style.

This style of programming results in the declaration of lots of expressions, hence, it is more declarative than procedural languages, but functional programs

Design, Optimization, and Evaluation of a Path Query System

Design, Optimization, and Evaluation of a

Design, Optimization, and Evaluation of a

Path Query System

PROEFSCHRIFT

ter verkrijging van de graad van doctor

aan de Technische Universiteit Delft,

op gezag van de Rector Magnificus prof. dr. ir. J.T. Fokkema,

voorzitter van het College voor Promoties,

in het openbaar te verdedigen op maandag 9 mei 2005 om 10.30 uur

door

Iv´

an SZ ´

ANT ´

O

m´

ern¨

ok, TU Budapest

geboren te Budapest, Hongarije

Contents

Preface

1 General Query Principles

2 Research Goal, Questions, Method

3 Perspective-DB ODBMS

4 The Perspective Query System

5 Incremental View Maintenance

6 Combining the results of path expressions

7 Filtered path queries

8 Recursive and reusable path queries

9 Extending export and import

10 Further features and future improvements

11 Functional description

12 Implementation details

13 PQS maintenance

14 Conclusion

Bibliography

A PQL syntax (BNF description)

B Description of the Hospital case

C UML class diagram

D Example code

E XML Query requirements

F Test suite

Index

Summary

Samenvatting

Acknowledgments

Preface

Chapter

1

General Query Principles

1.1

Query definitions

1.2

Query systems in context

1.2.1

Information retrieval

1.2.2

Query history in database systems

1.3

Query system architecture

1.3.1

Query system components

1.3.2

Query parser, tree, engine and optimizer

1.3.3

Query system output: query result

1.3.4

Views — more query input and output

1.4

Classification of database languages

1.4.1

Language aspects

1.4.2

Data aspects

1.4.3

Query languages

1.5

Maintenance of database systems and queries

1.5.1

Avoiding the double maintenance problem