hoaxula CSDL

Màu nền
Font chữ
Font size
Chiều cao dòng

Keyword Search over Relational Databases: A Metadata Approach

Sonia Bergamaschi

University of Modena and Reggio Emilia, Italy

[email protected]

Elton Domnori

University of Modena and Reggio Emilia, Italy

[email protected]

Francesco Guerra

University of Modena and Reggio Emilia, Italy

[email protected]

Raquel Trillo Lado

University of Zaragoza, Spain

[email protected]

Yannis Velegrakis

University of Trento, Italy

[email protected]

ABSTRACT

Keyword queries offer a convenient alternative to traditional SQL in querying relational databases with large, often unknown, schemas and instances. The challenge in answering such queries is to discover their intended semantics, construct the SQL queries that describe them and used them to retrieve the respective tuples. Existing approaches typically rely on indices built a-priori on the database content. This seriously limits their applicability if a-priori access to the database content is not possible. Examples include the on-line databases accessed through web interface, or the sources in information integration systems that operate behind wrappers with specific query capabilities. Furthermore, existing literature has not studied to its full extend the inter-dependencies across the ways the different keywords are mapped into the database values and schema elements. In this work, we describe a novel technique for translating keyword queries into SQL based on the Munkres (a.k.a. Hungarian) algorithm. Our approach not only tackles the above two limitations, but it offers significant improvements in the identification of the semantically meaningful SQL queries that describe the intended keyword query semantics. We provide details of the technique implementation and an extensive experimental evaluation.

1.

INTRODUCTION

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Search process; Retrieval models; Query formulation

General Terms

Algorithms, Experimentation, Performance

Keywords

Semantic Keyword Search, Intensional Knowledge, Relational Databases, Metadata

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'11, June 12­16, 2011, Athens, Greece. Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.

The more the relational data complexity is increasing and the user base is shifting towards the less technically skilled, the more the keyword searching is becoming an attractive alternative to traditional SQL queries, mainly due to its simplicity. Unfortunately, this simplicity comes with the price of inherent ambiguity. Thus, the challenge of answering a keyword query over a relational database is to discover the database structures that contain the keywords and explore how these structures are inter-connected to form an answer. The discovered structures, alongside their inter-connections, are actually representing in relational terms the semantic interpretation of the keyword query. Numerous studies and tools can already be found in the scientific literature. They include DISCOVER [16], DBXplorer [2], BANKS [1], SPARK [22], SQAK [31], and many others [21, 29, 27, 33, 34, 38]. Generally, these works consider the database as a network of interconnected tuples, they detect those containing the keywords in the query, they generate connected components based on how these tuples are associated, and they return these connected tuples as an answer to the query. To do so, specialized structures that index the database content [2] are used. By using these indices, they may directly retrieve the tuples of interest, or they may instead construct the queries expressions that retrieve these tuples when evaluated. This is the basic idea followed by the modern commercial database management systems supporting full-text search over their relational database. Unfortunately, existing techniques suffer from two main limitations. The first is that they require a-priori access to the data instance in order to build the indices that will locate the tuples related to the given keywords at run time. This seriously limits their applicability if such access is not possible. Examples of such situations include databases on the hidden web and sources located behind wrappers in data integration systems [19] that typically expose only their schema information and lack notification mechanisms for their data updates. The second limitation is that no considerable attention has been paid to the inter-dependencies among the query keywords. The likelihood that a specific data structure represent the same semantics as a keyword in a user query does not only depend on the relationship between the keyword and the data structure, but also on the data to which the other keywords in the query are mapped. This is because despite the fact that a keyword query is a flat list of keywords, the meaning of each keyword is not independent of the meaning of the others, but they all collectively represent the intended concepts the user had in mind posing the query. Furthermore, not all the keywords represent instance values. Many are used as meta-data specification of the adjacent

565

keywords. Although there are already keyword based approaches on relational data that take into consideration metadata [21][31], they provide only partial solutions to the problem, and they only use the metadata as a way to improve their technique. In this work, we propose a novel technique for answering keyword queries over relational databases. The queries are translated into a number of SQL queries that capture the possible semantics of the keyword query. The generated SQL queries can be evaluated on the database, and their results serve as the answer to the keyword query. One of the novelties of the technique is that it is not based on an a-priori access to the database instances. Moreover, our approach exploits the relative positions of the keywords in the query alongside auxiliary external knowledge in order to make a more educated guess of the semantics that most likely represent those of the keyword query, and then rank them accordingly. The strategy can not only be easily incorporated to many relational database management systems, but it can also be used as an enhancement of existing keyword searching techniques that utilize the database instance, offering them significant improvements over effectiveness and efficiency. An advantage of our approach is that it can be used to assist users browsing databases with large unknown schemas. It is often the case that users formulate keyword queries without some specific semantics in mind but in an exploratory manner, mainly when they are neither fully aware of the type of information that is stored in a database, nor of the way this information is stored. The possible interpretations of a keyword query according to the schema and domain information of the database will be generated by our approach. In contrast to other keyword searching techniques on relational data that return sets of linked tuples, we can return the interpretations expressed in SQL. The study of these queries can reveal tables, attributes and join paths, providing the user with enough information to understand the kind of data that is stored in the database and the way this data is structured. Our key contributions are as follows: (i) we formally define the problem of keyword querying over relational databases that lack apriori access to the database instance; (ii) we introduce the notion of a weight as a measure of the likelihood that the semantics of a keyword are represented by a database structure, i.e., a table, an attribute, or a value. We further distinguish the weights to intrinsic and contextual, to emphasize that this likelihood does not depend only on the meaning of the keyword semantics when the keyword is considered in isolation (intrinsic), but also on the way the semantics of the remaining, especially the neighbouring, keywords are represented in the data (contextual). (iii) we extend and exploit the Hungarian (a.k.a., Munkres) algorithm [7] to develop a technique for the systematic computation of the contextual weights that leads into to the generation and ranking of the different interpretations of a keyword query in terms of SQL; finally, (iv) we experimentally evaluate our approach on real application scenarios. The remainder of the paper is structured as follows: Section 2 provides a motivating example and Section 3 formally defines the problem. Section 4 describes our technique. Sections 5 - 7 provides details on our technical contributions, i.e. the computation of the intrinsic weights, the contextualization and the selection of the best mappings. The relationship of our approach to the related work is discussed in Section 8 and Section 9 describes our experimental evaluation and discusses our findings. Finally some conclusions are presented in Section 10.

2.

MOTIVATING EXAMPLE

Consider the database illustrated in Figure 1, containing information about academic researchers, departments, publications, and

publication databases, all modeled by the respective tables. The tables Affiliated and Author materialize many-to-many relationships between Person-Department and Person-Publication, respectively. Let "Date Database" be a keyword query posed over this database. To answer this query we need to understand the meaning of each keyword and build an SQL query that offers an interpretation of the keyword query semantics in terms of the relational database. The first problem of this translation is to decide what role each keyword plays in the query, i.e., it represents a value or describes some meta-information (specification) about another keyword or the element of interest. In the above query, the keyword Date may represent the value of a name, and the keyword Database the value of a research area. In that case, a possible interpretation of the keyword query is information about a person called Date that has done work in the Database area. If the intended meaning of the keyword query was instead to find the online databases containing Date's publications, then the keyword Database is actually representing meta-information about the element of interest. Knowing what kind of information each keyword represents, the next critical step is to decide which part of the database actually models the intended keyword meaning. Consider the keyword query "Director Watson Address" and assume that the first and last keyword represent meta information while the second represents a value. The intended information of the keyword Director is most probably the one modeled by the attribute with the respective name. It is not clear, however, whether the keyword Address refers to the homonym attribute in the table Person or to the one in the table Department. Choosing one over another leads to completely different interpretations of the keyword query. Furthermore, the keyword Watson, even though we know it is a value, might be the name of a director, the name of a street, or even a value of some other attribute. Deciding which database structures model the meaning of the different keywords in the query is not enough. It is also important to decide how these structures relate to each other. In relational databases, such relationships are expressed either through the table-attribute-value hierarchy or through join paths, i.e., chains of key/foreign key relationships. Between two database structures there may be multiple different join paths. Each such path leads to a different interpretation of the query keywords. For instance, consider the keyword query "email CS", and assume that the keyword email corresponds to attribute Email and keyword CS to a value of the attribute DName. Between DName and Email, there are two different paths, one that goes through the table Affiliated and one through the Director. If the semantics of the keyword query were to find emails of the CS affiliated persons, then the database query describing this semantics is one that uses the first path. If instead the semantics were to find the email of the CS department director, the query is one that uses the second path. Finding the different semantic interpretations of a keyword query is a combinatorial problem which can be solved by an exhaustive enumeration of the different ways that mappings can be associated to database structures and values. The challenging task is to develop a mechanism that is able to significantly reduce the possible associations that most likely represent the intended keyword semantics. One way to do this is to exploit syntactic knowledge or other forms of auxiliary information. For instance, a keyword "(320)463-1463" is more likely to represent a phone number due to its format, while a keyword Everest, given the publicly available geo-name knowledge, is very likely to represent the mount Everest. Similarly, the keyword "article" is likely to cor-

566

Person Name Area Phone Watson Database (320) 4631234 Lenzerini Database (390) 6987654 Date Database (817) 1937842 Hunt Inf. Systems (343) 2920812 Author Name Publication Lenzerini Data Integration Date Foundation Matters

Affiliated Address Email Professor 30 Bloor [email protected] Watson Ariosto 25 [email protected] Lenzerini 107 GACB [email protected] Date 17 Helix [email protected] Hunt Publication Title Year Resource Data Integration 2002 ACM DL Foundation Matters 2002 DBLP

Department Department id DName Address x123 x123 CS 25 Blicker cs34 cs34 IE 15 Tribeca cs34 ee67 EE 5 Charles m111 m111 ME 2 Cottle Database Name Address DBLP http://www.informatik.uni-trier.de ACM DL http://portal.acm.org/dl.cfm

Director Watson Hunt Date Hunt

Figure 1: A fraction of a database schema with its data. respond to the table Publication, based on some lexical database knowledge like WordNet1 . The quest for the database structure that a keyword corresponds should not be an isolated task. Keywords are not independent entities, but it should be seen in the context of the presence of the other keywords in the query. To illustrate this fact, consider the keyword query "Name Date Database". A possible interpretation of the query is that the user is looking for the person called Date who works in the area of databases. This means that keywords Name and Date may correspond to the attribute Name of the table Person and one of its values, respectively, while the keyword Database to one of the values of the attribute Area. Consider now the query "Name Date Database DBLP". In this case, a possible interpretation is that the user is looking for data items related to Date in the DBLP database, hence, the meaning of the keyword Database is represented by the table Database and not by some value of the Area attribute. A natural and challenging question that comes to mind is which of the different semantic interpretations of a query should be considered as the correct one. It is a well-known and documented fact that keyword queries are under-specified queries [18]. As such, any of the possible interpretations may be correct, and all have to be considered. However, based on some known patterns of human behavior [18], we know that the order of keywords is important and that correlated keywords are typically close. This means that certain interpretations of a keyword query are actually more likely than others. Thus, instead of returning a flat list of possible interpretations to the user, one can return a ranked list based on the likelihood that they represent the intended keyword semantics. We have made the natural assumption that each keyword cannot have more than one meaning in the same configuration, i.e., it can be mapped to only one database term. Furthermore, we have assumed that no two keywords can be mapped into the same database term based on the fact that overspecified queries is only a small fraction of the queries that are typically met in practice [18]. However, our technique can be extended to work without this assumption with only minor modifications. We have also made the natural assumption that every keyword plays some role in the query, i.e., there are no unjustified keywords. Answering a keyword query over a database D means finding a set of SQL queries, with each of these queries using all the database elements that belong to the range2 of a specific configuration. Such an SQL query is referred to as an interpretation of the keyword query, since it provides a possible meaning of the keyword query in terms of the database vocabulary. In the current work, we consider only select-project-join (SPJ) interpretations, that are typically the queries of interest in similar works [2, 16]. Nevertheless, interpretations involving aggregations [31] are also in our list of future extensions. D EFINITION 3.2. An interpretation of a keyword query q={k1 , k2 , . . . , kn } on a database D using a configuration C is an SQL query: select A1 , A2 , . . ., Aj from R1 JOIN R2 JOIN . . . JOIN Rn where A1 =v1 AND A2 =v2 AND . . . AND Am =vm such that kq one of the following is true: · C(k){R1 , R2 , . . ., Rn } · C(k){A1 , A2 , . . ., Aj } · C(k)=dom(Ai ){dom(A1 ), dom(A2 ), . . ., dom(Aj )} and Ai , k  { A1 , v1 , A2 , v2 , . . . , Am , vm } To eliminate redundancies, we require any use of a database term in an interpretation to be justified either by belonging to the range of the configuration, or by participating in a join path connecting two database terms that belong in the range of the configuration. Note that even with that restriction, due to the multiple join paths in a database D, it is still possible to have multiple interpretations of a given keyword query q and a configuration C of it. We will use the notation I(q, C, D) to refer to the set of these interpretations. and I(q, D) for the union of all these sets for all the possible configurations of the query q. E XAMPLE 3.1. Consider the keyword query "email CS" over the database of Figure 1, and a configuration that maps email to Email and CS to DName. At least two different interpretations can be generated from this. One is the: select Email from Person P JOIN Affiliated A ON (P.name=A.professor) JOIN Department D ON (A.Department=D.id) where D.DName='CS' AND D.id=A.Department AND A.Professor=P.Name

2

3.

PROBLEM STATEMENT

A database D is a collection of relational tables. A relational table is denoted as R(A1 , A2 , . . . , An ), where R is the name of the table and A1 , A2 , . . . , An its attributes. The vocabulary of the database D, denoted as VD , is the set VD ={X | R(A1 , A2 , . . . , An )D, s.t. X=R  X=Ak  X=Dom(Ak ) with 1kn}. In other words, the vocabulary of a database is the set of all its relation names, their attributes and their respective domains. A database term is a member of its vocabulary. A keyword query q is an ordered list of keywords {k1 , k2 , . . . , kn }. Each keyword is a specification about the element of interest. The specification may have been modeled in the database as a relational table, an attribute, or a value of an attribute. A configuration is a mapping function that describes a specification for each query keyword in terms of database terms. D EFINITION 3.1. A configuration C of a keyword query q on a database D is an injective mapping from the keywords in q to database terms in the vocabulary

Bạn đang đọc truyện trên: Truyen2U.Net

#hoaxula
Ẩn QC