User:KYPark/004

'''A DIRECT APPROACH TO INFORMATION RETRIEVAL'''

'''Table of Contents'''
 * WHAT
 * WHY
 * HOW
 * 1. INTRODUCTION
 * 2. THE LINE OF ATTACK
 * 3. SYSTEMS VS. USERS
 * 3.1 Discrimination
 * 3.2 Prediction
 * 4. DOCUMENTS VS. SURROGATES
 * 5. THE THEORY OF INTERPRETATION
 * 5.1 Denotation and Connotation
 * 5.2 The Theory of Ogden and Richards
 * 5.3 Implications for Information Retrieval
 * 6. PROPOSAL FOR FILE ORGANIZATION
 * 6.1 Incentives
 * 6.2 Extracts as Indexing Sources
 * 6.3 Extracts as Review Sources
 * 7. CONCLUSION
 * 8. REFERENCES

4. DOCUMENTS VS. SURROGATES
A group of documents can be said to be similar to each other when they have in common a set of identical properties A; they are similar with respect to the shared properties A. In general, each document in a similarity group has some other (different) properties B in addition to A. Therefore, the content C of a document may be represented as:

C = A * B.

This equation may apply somewhat analogously to the document surrogate, too.
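The relation C = A * B can be pictured with document content as a set of properties, * read as the union of the shared part A and the specific part B. A minimal sketch (all property names below are illustrative, not from the text):

```python
# Toy reading of C = A * B: document content as a set of properties,
# A shared across a similarity group, B specific to each document.
doc1 = {"indexing", "retrieval", "abstracts", "recall"}
doc2 = {"indexing", "retrieval", "extracts", "precision"}

A = doc1 & doc2    # shared properties: the ground of the similarity
B1 = doc1 - A      # properties specific to doc1
B2 = doc2 - A      # properties specific to doc2

assert doc1 == A | B1   # C = A * B, with * read as set union
assert doc2 == A | B2
```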

Because of the repetitive nature of the shared properties A, a group of similar documents is characterized by semantic redundancy, even if not by textual redundancy. This characteristic carries over, somewhat analogously, to the corresponding document surrogates. That is to say, the identical properties A are repeated not only in similar documents but also in their surrogates. This repetition or redundancy in a group of similar surrogates appears to be inevitable, because without it there would be no grouping of similar documents or surrogates. From the point of view of file organization, however, it need not simply be accepted. For one thing, the idea of inverted files is worth recalling in this connection; that idea, however, is likely to raise another kind of redundancy, namely repetition of the name of a surrogate that belongs to many similarity groups, e.g., under many index terms.
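The inverted-file idea mentioned above can be sketched briefly; note how it trades one redundancy for another, repeating each surrogate's name under every term the surrogate shares. Surrogate names and terms here are invented for illustration:

```python
from collections import defaultdict

# Minimal inverted file: map each index term to the surrogates
# carrying it. A surrogate's name reappears under every term it
# shares, which is the second kind of redundancy noted in the text.
surrogates = {
    "S1": {"indexing", "retrieval", "recall"},
    "S2": {"indexing", "retrieval", "precision"},
    "S3": {"classification", "retrieval"},
}

inverted = defaultdict(set)
for name, terms in surrogates.items():
    for term in terms:
        inverted[term].add(name)

# All three surrogate names are repeated under "retrieval":
assert inverted["retrieval"] == {"S1", "S2", "S3"}
```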

An abstract file as a retrieval tool is no exception to such redundancy. The comparative efficiency of abstracts in retrieval is still controversial. The low efficiency of abstracts, if real, may stem from difficulties in formalization and in machine processing. Formalization, however, does not matter so much in human processing. And we can reasonably assert that abstracts carry much greater "semantic information" than other kinds of surrogates such as titles, sets of index terms, and classification codes. Therefore, leaving aside the time consumed, human searching of abstracts should perform better than searching of other surrogates in judging similarity, at least in principle.

Suppose that abstracting processes are formalized to such an extent that the above equation holds well. Then it will be possible to exclude the identical part A from all but one of the similar abstracts, letting them refer to A in the retained abstract. Alternatively, we can list all the similar abstracts in one of them. By doing so, we need not search for them one by one but can search them as a group whenever a search request falls upon A.

Once the existence of A is accepted, a model abstract of the identical part A may be desired for all the documents that have A in common. A collection of such models will look like a classification scheme. This can be applied to individual abstracts: each abstract may then consist of the prescriptive code for A and the descriptive text for B, the differing part. (This approach may parallel combining a hierarchical classification system with a descriptive indexing system.) In practice, the prescriptive code may or may not be substituted for the text corresponding to A in an abstract. What is implied in this idea is not merely a reduction of the textual or semantic redundancy involved in a group of similar abstracts.
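A hypothetical sketch of this factoring, assuming each abstract is stored as a prescriptive class-like code for A plus descriptive text for B (the code, model text, and abstract texts below are all invented for illustration):

```python
# The shared part A is stored once as a "model abstract" under a
# class-like code; each abstract keeps only that code plus its own
# differing text B.
models = {
    "621.3": "Shared background common to this similarity group ...",
}

abstracts = [
    ("621.3", "Findings specific to document 1 ..."),
    ("621.3", "Findings specific to document 2 ..."),
]

def expand(code, b_text):
    """Recover a full abstract: model text for A plus specific text B."""
    return models[code] + " " + b_text

# One lookup on the code retrieves the whole similarity group at once,
# instead of searching the similar abstracts one by one.
group = [b for code, b in abstracts if code == "621.3"]
assert len(group) == 2
```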

In general, document surrogates include errors of various kinds. Let us take as an example just one kind of error: inconsistency in surrogation. Many inconsistencies can hardly be called errors in the strict sense, for the surrogates are fairly correct individually. The cause of these inconsistencies may be attributable to difficulty in, or lack of, formalization.

In this respect, abstracting systems, particularly those based on author abstracts, seem hopeless to control. However, this is not the whole point. By default, the failure caused by inconsistency is left to be repeated each time the abstracts are searched. Certainly, this failure can be prevented or reduced by careful examination and grouping of similar abstracts prior to a series of searches.

This prior grouping process implies retrieval that ensures high recall even at the cost of low precision. One thing that matters here is the manageable number of abstracts to be examined for their similarity. The greater the number, the more preventive work there is to be done. What makes matters worse is the possible multiplicity of similarity groups to which an abstract belongs at the same time. We may not even be certain which groups will be more significant or more likely to be requested by the user. This situation will eventually demand enormous effort; our ideal of ruling out inconsistencies may require prohibitive effort.

We all know something about abstracts and extracts, modestly speaking. However, this general kind of knowledge may not suffice for a critical discussion of their characteristics, merits, snags, and so on. An abstract has been defined as an abbreviated, accurate representation of a document; and an extract as consisting of one or more portions of a document selected to represent the whole. Were they defined with accuracy? Were the definitions intended to make clearer how to make abstracts and extracts? Are there any really working standards for making them?

Any document surrogate, however small and biased its content, may be justified, because it is not the document itself but a representation, description, or prescription. It is sometimes mistakenly assumed that the content of a surrogate is the same as the content of the corresponding document, or that the equation C = A * B holds equally in both cases. Distinguishing between intensional aboutness and extensional aboutness, Fairthorne7 says that:


 * Parts of a document are not always about what the entire document is about, nor is a document usually about the sum of the things it mentions. A document is a unit of discourse, and its component statements must be considered in the light of why this unit has been acquired or requested.

Even with the great flexibility and elasticity of language, it seems almost impossible to make an abstract of about two hundred words exactly analogous to the content C of the corresponding document. In other words, selection and bias are more or less unavoidable in abstracting. If the paraphrasing of a selection is considered semantically superficial, then the difference between an abstract and an extract is somewhat marginal. Both are biased selections from, or parts of, the content C.

Roughly speaking, an abstract is intended to balance its selection uniformly over C, aiming at inductive information effects. An extract, by contrast, is intended to spot its selection (perhaps the conclusive part) eccentrically within C, aiming at immediate rather than inductive information effects. Yet no formal procedures beyond conventions of a vague nature are available for deciding what to select.

Considering the power of meta-language and its use in retrieval, Goffman et al.13 note that an abstract is given in meta-language whereas an extract is given in object-language. They further note that many abstracts, being written in "trivial" meta-language, should more accurately be called extracts.

A selection from, or part of, a document, whether balancing or spotting, presupposes that it can do without the rest of the text as context. In other words, it should be an independent unit of discourse. Truly, abstracts, extracts, titles, even index terms all tell us something on their own account. Fairthorne7 paraphrases Bohnert's notion of data as:


 * parts of a document that, in the given environment, will be read in isolation from the rest of the text.

This phrase seems to be worth careful scrutiny. Perhaps, we can raise several questions such as:


 * What is the given environment?
 * What happens to a reader when he reads the parts in isolation from the rest?
 * What is the relationship between a document D and its part d in terms of effects on the reader?

We shall discuss these and other questions in the next chapter. Meanwhile, Belzer14 calculates "the entropies of the various surrogates of error-free information," by assigning one bit of information to a full document. For five different types of surrogates - citation, abstract, first paragraph, last paragraph, first and last paragraph - he observes the 2 x 2 contingency of:


 * P = relevant as predicted from surrogates,
 * P' = non-relevant as predicted from surrogates,
 * R = relevant as evaluated from full documents,
 * R' = non-relevant as evaluated from full documents.

By showing the calculated results in Table 1, and by calling attention to the fact that only the production of abstracts requires extensive professional effort, he in effect revives the superiority of extracts over abstracts. Comparison of a document with its surrogates is also interesting.
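One plausible entropy calculation over such a 2 x 2 contingency can be sketched as follows; this computes the mutual information between prediction and evaluation, with invented counts, and is only an illustration, not a reproduction of Belzer's method or data:

```python
from math import log2

# Illustrative counts for one surrogate type (invented numbers):
# keys pair a prediction (P/P') with an evaluation (R/R').
table = {("P", "R"): 40, ("P", "R'"): 10,
         ("P'", "R"): 5,  ("P'", "R'"): 45}
n = sum(table.values())

def H(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal distributions of prediction and evaluation, and the joint.
p_pred = [sum(v for (p, _), v in table.items() if p == x) / n for x in ("P", "P'")]
p_eval = [sum(v for (_, r), v in table.items() if r == x) / n for x in ("R", "R'")]
p_joint = [v / n for v in table.values()]

# I(P;R) = H(P) + H(R) - H(P,R): the bits the surrogate's prediction
# carries about relevance, to be compared with the one bit assigned
# to an error-free full document.
mi = H(p_pred) + H(p_eval) - H(p_joint)
assert 0.0 <= mi <= 1.0
```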

'''Table 1. The Entropies of Various Surrogates.'''