Precision and Recall of GlOSS Estimators for Database Discovery
Abstract or Description
On-line information vendors offer access to multiple databases. In addition, the advent of a variety of Internet tools has provided easy, distributed access to many more databases. The result is thousands of text databases from which a user may choose for a given information need (a user query). This paper, an abridged version, presents a framework for (and analyzes a solution to) the problem of choosing among them, which we call the text-database discovery problem (see the full version for a survey of related work).
Our solution to the text-database discovery problem is to build a service that can suggest potentially good databases to search. A user's query goes through two steps. In the first step, the query is presented to our server (dubbed GlOSS, for Glossary-Of-Servers Server), which selects a set of promising databases to search. In the second step, the query is actually evaluated at the chosen databases. GlOSS hints at which databases might be useful for the user's query, based on word-frequency information for each database. This information indicates, for each database and each keyword in the database vocabulary, how many documents at that database actually contain the keyword, for each field designator (Sections 2 and 3). For example, a Computer-Science library could report that "Knuth" (keyword) occurs as an author (field designator) in 180 documents, the keyword "computer" in the title of 25,548 documents, and so on. This information is orders of magnitude smaller than a full index, since for each keyword/field-designator pair we only need to keep its frequency, not the identities of the documents that contain it. To evaluate the set of databases that GlOSS returns for a given query, Section 4 presents a framework based on the precision and recall metrics of information-retrieval theory.
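The word-frequency summaries described above can be illustrated with a minimal Python sketch. This abridged text does not spell out how GlOSS turns frequencies into hints, so the estimator below is an assumption: it treats the keywords of a conjunctive query as occurring independently within a database. The names `DatabaseSummary`, `estimate_matches`, and `suggest` are hypothetical, as are the database sizes and the second library. Only the two frequencies for "Knuth" and "computer" come from the text.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Dict, List, Tuple

# A query term is a (field designator, keyword) pair, e.g. ("author", "knuth").
Term = Tuple[str, str]


@dataclass
class DatabaseSummary:
    """Per-database summary: one document count per (field, keyword) pair."""
    name: str
    num_docs: int  # total documents; hypothetical, not given in the text
    freq: Dict[Term, int] = field(default_factory=dict)


def estimate_matches(summary: DatabaseSummary, query: List[Term]) -> float:
    """Estimate how many documents satisfy every term of a conjunctive query.

    Assumption (not from the text): terms occur independently, so the
    expected match count is num_docs times the product of the per-term
    document fractions.
    """
    if summary.num_docs == 0:
        return 0.0
    fractions = [summary.freq.get(term, 0) / summary.num_docs for term in query]
    return summary.num_docs * prod(fractions)


def suggest(summaries: List[DatabaseSummary], query: List[Term]) -> List[str]:
    """Return names of databases with a nonzero estimate, best first."""
    scored = [(estimate_matches(s, query), s.name) for s in summaries]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Frequencies 180 and 25,548 are from the text; everything else is made up.
cs_library = DatabaseSummary(
    name="cs_library",
    num_docs=100_000,
    freq={("author", "knuth"): 180, ("title", "computer"): 25_548},
)
other_library = DatabaseSummary(
    name="other_library",
    num_docs=50_000,
    freq={("title", "computer"): 1_200},
)

query = [("author", "knuth"), ("title", "computer")]
print(suggest([cs_library, other_library], query))
```

The summary stores a single integer per keyword/field-designator pair, which is why it can stay far smaller than a full index that would list the matching documents.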
In information-retrieval theory, for a given query q and a given set S of relevant documents for q, precision is the fraction of documents in the answer to q that are in S, and recall is the fraction of S that is in the answer to q. We borrow these notions to define metrics for the text-database discovery problem: for a given query q and a given set S of "relevant databases," P is the fraction of databases in the answer to q that are in S, and R is the fraction of S that is in the answer to q. We further extend our framework by offering different definitions of a "relevant database" (Section 4). We have performed experiments using query traces from the FOLIO library information-retrieval system at Stanford University, involving six databases available through FOLIO. As we will see, the results obtained for different variants of GlOSS are very promising (Section 5). Even though GlOSS keeps only a small amount of information about the contents of the available databases, this information proved sufficient to produce very useful hints on where to search.
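The database-level metrics P and R defined above can be computed directly from two sets. The following is a small sketch only: the function name `precision_recall` and the database names are hypothetical, and returning 1.0 when a set is empty is an assumed convention that the text does not specify.

```python
from typing import Set, Tuple


def precision_recall(answer: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compute (P, R) for a set of suggested databases.

    answer:   databases returned for the query
    relevant: the set S of relevant databases for the query
    """
    hits = len(answer & relevant)
    # Assumed convention: an empty answer has no wrong suggestions (P = 1.0),
    # and an empty S leaves nothing to miss (R = 1.0).
    p = hits / len(answer) if answer else 1.0
    r = hits / len(relevant) if relevant else 1.0
    return p, r


# Hypothetical example: three databases suggested, four actually relevant.
answer = {"db1", "db2", "db3"}
relevant = {"db2", "db3", "db4", "db5"}
p, r = precision_recall(answer, relevant)
print(p, r)
```

In this example two of the three suggested databases are relevant, so P is 2/3, and two of the four relevant databases are suggested, so R is 1/2. Changing the definition of a "relevant database" only changes how the set `relevant` is built, not the computation itself.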