Chapter XII - Mining Free Text for Structure
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


Vladimir A. Kulyukin, Utah State University, USA

Robin Burke, DePaul University, USA

Knowledge of the structural organization of information in documents can be of significant assistance to information systems that use documents as their knowledge bases. In particular, such knowledge is of use to information retrieval systems that retrieve documents in response to user queries. This chapter presents an approach to mining free-text documents for structure that is qualitative in nature. It complements the statistical and machine-learning approaches, insomuch as the structural organization of information in documents is discovered through mining free text for content markers left behind by document writers. The ultimate objective is to find scalable data mining (DM) solutions for free-text documents in exchange for modest knowledge-engineering requirements. The problem of mining free text for structure is addressed in the context of finding structural components of files of frequently asked questions (FAQs) associated with many USENET newsgroups. The chapter describes a system that mines FAQs for structural components. The chapter concludes with an outline of possible future trends in the structural mining of free text.

INTRODUCTION

When the manager of a mutual fund sits down to write an update of the fund's prospectus, he does not start his job from scratch. He knows what the fund's shareholders expect to see in the document and arranges the information accordingly. An inventor, ready to register his idea with the Patent and Trademark Office of the U.S. Department of Commerce, writes it up in accordance with the rules specifying the format of patent submissions. A researcher who wants to submit a paper to a scientific conference must be aware of the format specifications set up by the conference committee. Each of these examples suggests that domains of human activity that produce numerous documents are likely to have standards specifying how information must be presented in them.

Such standards, or presentation patterns, are a matter of economic necessity; documents whose visual structure reflects their logical organization are much easier to mine for information than unconstrained text. The ability to find the needed content in a document by taking advantage of its structural organization allows readers to deal with large quantities of data efficiently. For example, when one needs to find out whether a person's name is mentioned in a book, one does not have to read the book from cover to cover; consulting the index is a more sensible solution.

Knowledge of the structural organization of information in documents^[1] can be of significant assistance to information systems that use documents as their knowledge bases. In particular, such knowledge is of use to information retrieval systems (Salton & McGill, 1983) that retrieve documents in response to user queries. For example, an information retrieval system can match a query against the structural components of a document, e.g., sections of an article, and make a retrieval decision based on some combination of matches. More generally, knowledge of the structural organization of information in documents makes it easier to mine those documents for information.
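The idea of matching a query against structural components can be illustrated with a minimal Python sketch. It is not the system described in this chapter: the Section type, the term-overlap measure, and the per-component weights are all assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One structural component of a document (hypothetical type)."""
    name: str
    text: str


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:?!()\"'") for w in text.lower().split())
    return {w for w in words if w}


def section_overlap(query, section):
    """Fraction of the query's terms that occur in the section's text."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(section.text)) / len(query_terms)


def document_score(query, sections, weights):
    """Weighted sum of per-section overlaps.

    The weights are an assumed way of combining matches; a component
    missing from the mapping gets weight 1.0.
    """
    return sum(
        weights.get(s.name, 1.0) * section_overlap(query, s) for s in sections
    )


doc = [
    Section("title", "Caring for houseplants"),
    Section("body", "Water the fern weekly and keep it out of direct sun."),
]
score = document_score("fern water", doc, {"title": 2.0, "body": 1.0})
```

A retrieval decision could then compare the score against a threshold, or rank documents by it. The point of the sketch is only that each component contributes separately, so a match in one component can count for more than a match in another.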

The advent of the World Wide Web and the Internet has resulted in the creation of millions of documents containing unstructured, structured, and semi-structured data. Consequently, research on the automated discovery of the structural organization of information in documents has come to the forefront of both information retrieval and natural language processing (Freitag, 1998; Hammer, Garcia-Molina, Cho, Aranha, & Crespo, 1997; Hsu & Chang, 1999; Jacquemin & Bush, 2000; Kushmerick, Weld, & Doorenbos, 1997). Most researchers adhere to the numerical approaches of machine learning and information retrieval. Information retrieval approaches view texts as sets of terms, each of which exhibits some form of frequency distribution. By tracking the frequency distributions of terms, one can attempt to partition the document into smaller chunks, thus claiming to have discovered a structural organization of information in a given document. Machine-learning approaches view texts as objects with features whose combinations can be automatically learned by inductive methods.
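As a rough illustration of the frequency-based view, the following Python sketch places a boundary wherever the term-frequency vectors of two adjacent paragraphs have low cosine similarity. The function names and the threshold value are assumptions chosen for the example; this is one simple instance of the general idea, not a method taken from the works cited above.

```python
import math
from collections import Counter


def term_counts(text):
    """Term-frequency vector of a paragraph as a Counter."""
    words = (w.strip(".,;:?!()\"'") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def boundaries(paragraphs, threshold=0.1):
    """Indices i such that a chunk boundary falls just before paragraphs[i].

    The threshold is an assumed value; real text would also need
    stop-word removal, or common words would dominate the similarity.
    """
    vectors = [term_counts(p) for p in paragraphs]
    return [
        i
        for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


paragraphs = [
    "cats purr cats sleep",
    "cats sleep often",
    "stocks fell stocks rose",
    "stocks rose sharply",
]
cuts = boundaries(paragraphs)  # boundary before the third paragraph
```

With paragraphs this short, a single shared or missing word changes the similarity substantially, which is why such estimates become unreliable on small documents.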

Powerful as they are, these approaches to mining documents for structure have two major drawbacks. First, statistical computations are based on the idea of statistical significance (Moore & McCabe, 1993). Achieving statistical significance requires large quantities of data. The same is true for machine-learning approaches that require large training sets to reliably learn needed regularities. Since many documents are small in size, the reliable discovery of their structural components using numerical methods alone is problematic. Second, numerical approaches ignore the fact that document writers leave explicit markers of content structure in document texts. The presence of these markers in document texts helps the reader digest the information contained in the document. If these markers are ignored, document texts become much harder to navigate and understand.

This chapter presents an approach to mining free-text documents for structure that is qualitative in nature. It complements the statistical and machine-learning approaches insomuch as the structural organization of information in documents is discovered through mining free text for content markers left behind by document writers^[2]. The ultimate objective is to find scalable data-mining solutions for free-text documents in exchange for modest knowledge-engineering requirements. The approach is based on the following assumptions:

Economic Necessity. The higher the demand for a class of documents, the greater the chances that the presentation of information in those documents adheres to a small set of rigid standards. Mutual fund prospectuses, 10-Q forms, and USENET files of frequently asked questions (FAQs) are but a few examples of document classes that emerged due to economic necessity and whose formats were standardized by consumer demand.
Texts As Syntactic Objects. As a source of information, text can be viewed as a syntactic object whose structural organization obeys certain constraints. It is often possible to find the needed content in a document by using the structural organization of its text.
Presentation Consistency. Document writers are consistent in their presentation patterns. They do not change the chosen pattern within a single document. Many of them stick with the same pattern from document to document.
Presentation Similarity. The logical components of a document that have the same semantic functionality are likely to be marked in the same or similar fashion within a presentation pattern. For example, many document writers tend to mark headers, tables, sections, bibliographies, etc., in the same or similar ways in document texts.
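The last two assumptions suggest a simple way to look for content markers, sketched below in Python. The marker classes and regular expressions are hypothetical and deliberately small; they are not the inventory used by the system described in this chapter, and a real tool would need a richer, domain-specific set.

```python
import re
from collections import Counter

# Hypothetical marker classes (assumed for illustration only).
MARKER_PATTERNS = [
    ("numbered", re.compile(r"^\s*\d+(\.\d+)*[.)]\s+")),      # "1. ", "2.3) "
    ("q_prefix", re.compile(r"^\s*Q\d*:\s+", re.IGNORECASE)),  # "Q: ", "Q12: "
    ("a_prefix", re.compile(r"^\s*A\d*:\s+", re.IGNORECASE)),  # "A: ", "A12: "
]


def marker_class(line):
    """Name of the first marker class the line starts with, or None."""
    for name, pattern in MARKER_PATTERNS:
        if pattern.match(line):
            return name
    return None


def marker_profile(text):
    """Count how often each marker class opens a line in the text."""
    classes = (marker_class(line) for line in text.splitlines())
    return Counter(c for c in classes if c is not None)


def lines_with_marker(text, name):
    """All lines that open with the given marker class, stripped."""
    return [ln.strip() for ln in text.splitlines() if marker_class(ln) == name]


faq = (
    "Q: How do I water a fern?\n"
    "A: Once a week.\n"
    "\n"
    "Q: How much light does it need?\n"
    "A: Indirect light.\n"
)
profile = marker_profile(faq)
questions = lines_with_marker(faq, "q_prefix")
```

If writers are consistent within a document, the marker classes that recur in the profile are plausible candidates for the document's presentation pattern, and lines sharing a class are likely to play the same logical role.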

These assumptions form the theoretical basis of the approach. Collectively, they act as guidelines for researchers and developers who are interested in building free-text data-mining tools for individual domains. The rest of the chapter illustrates how these assumptions were applied to mine newsgroups' expertise.

The rest of the chapter is organized as follows. The next section provides the necessary background and a review of relevant literature. The following three sections constitute the main thrust of the chapter. First, we describe the problem of mining newsgroups' expertise for answers to frequently asked questions. Second, we state our solution to the problem of mining free text for structure. The problem is addressed in the context of finding structural components of FAQs associated with many USENET newsgroups. Third, we describe an evaluation of our mining approach. In the last two sections, we outline possible future trends in mining text for structure and present our conclusions.

^[1]We use the terms logical structure of documents and structural organization of information in documents interchangeably.

^[2]Since the approach presented in this chapter complements the existing approaches, it cannot be easily compared to them, because it is based on different assumptions. A comparison, by definition, is possible only among competing approaches.
