Cranfield experiments

The Cranfield experiments were a series of experimental studies in information retrieval conducted by Cyril W. Cleverdon at the College of Aeronautics, today known as Cranfield University, in the 1960s to evaluate the efficiency of indexing systems. The experiments were broken into two main phases, neither of which was computerized. The collection of abstracts, the resulting indexes, and the results were later distributed in electronic format and were widely used for decades.

In the first series of experiments, several existing indexing methods were compared to test their efficiency. The queries were generated by the authors of the papers in the collection and then translated into index lookups by experts in those systems. In this series, one method went from least efficient to most efficient after minor changes to how the data was arranged on the index cards. The results suggested that the underlying methodology was less important than specific details of the implementation. This led to considerable debate on the methodology of the experiments.

These criticisms also led to the second series of experiments, now known as Cranfield 2. Cranfield 2 attempted to gain additional insight by reversing the methodology. Whereas Cranfield 1 tested the ability of experts to find a specific resource using the index system, Cranfield 2 studied the results of asking human-language questions and seeing whether the indexing system provided a relevant answer, regardless of whether it was the original target document. It too was the topic of considerable debate.

The Cranfield experiments were extremely influential in the information retrieval field, itself a subject of considerable interest in the post-World War II era when the quantity of scientific research was exploding. They were the topic of continual debate for years and led to several computer projects to test their results. Their influence was considerable over a forty-year period before natural language indexes like those of modern web search engines became commonplace.

Background
The now-famous July 1945 article "As We May Think" by Vannevar Bush is often pointed to as the first complete description of the field that became information retrieval. The article describes a hypothetical machine known as "memex" that would hold all of mankind's knowledge in an indexed form that would allow it to be retrieved by anyone.

In 1948, the Royal Society held the Scientific Information Conference that first explored some of these concepts on a formal basis. This led to a small number of experiments in the field in the UK, US, and the Netherlands. The only major effort to compare different systems was led by Gull using the collection of works from the Armed Services Technical Information Agency, which had started as a collection of aeronautics reports captured in Germany at the end of World War II. Judging of the results was carried out by experts in the two systems, and they never agreed on whether various retrieved documents were relevant to the search, with each group rejecting over 30% of the results as wrong. Further testing was cancelled as there appeared to be no consensus.

A second conference on the topic, the International Conference on Scientific Information, was held in Washington, DC in 1958, by which time computer development had reached the point where automatic index retrieval was possible. It was at this meeting that Cyril W. Cleverdon "got the bit between his teeth" and managed to arrange for funding from the US National Science Foundation to start what would later be known as Cranfield 1.

Cranfield 1
The first series of experiments directly compared four indexing systems that represented significantly different conceptual underpinnings. The four systems were:


 * 1) the Universal Decimal Classification, a hierarchical system being widely introduced in libraries,
 * 2) the Alphabetical Subject Catalogue, which alphabetized subject headings in classic library index card collections,
 * 3) the Faceted Classification Scheme, which allows combinations of subjects to produce new subjects,
 * 4) and Mortimer Taube's Uniterm system of co-ordinate indexing, where a reference may be found on any number of separate index cards.

In an early series of experiments, participants were asked to create indexes for a collection of aerospace-related documents. Each index was prepared by an expert in that methodology. The authors of the original documents were then asked to prepare a set of search terms that should return that document. The indexing experts were then asked to generate queries into their index based on the authors' search terms. The queries were then used to examine the index to see if it returned the target document.
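The test procedure can be sketched in a few lines of code. The following is a minimal illustration only, not a reconstruction of the Cranfield tooling (the original tests used no computer). It models a Uniterm-style co-ordinate index as a mapping from each term to the set of document numbers on that term's card, and counts a search as a success only if the source document is returned. The documents, terms, and function names are all hypothetical.

```python
from collections import defaultdict

def build_coordinate_index(doc_terms):
    """Map each term to the set of document ids listed on its 'card'."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def coordinate_search(index, query_terms):
    """Return the documents that appear on the card of every query term."""
    cards = [index.get(term, set()) for term in query_terms]
    if not cards:
        return set()
    return set.intersection(*cards)

def known_item_success_rate(index, searches):
    """Fraction of searches that return their own source document."""
    hits = sum(1 for source_doc, terms in searches
               if source_doc in coordinate_search(index, terms))
    return hits / len(searches)

# Hypothetical toy collection: document id -> terms assigned by the indexer.
DOC_TERMS = {
    1: {"landing gear", "fatigue", "aluminium"},
    2: {"landing gear", "shock absorber"},
    3: {"boundary layer", "supersonic"},
}

# Hypothetical searches: (source document, terms chosen by the searcher).
SEARCHES = [
    (1, ["landing gear", "fatigue"]),
    (2, ["landing gear", "hydraulics"]),  # "hydraulics" is on no card
    (3, ["supersonic"]),
]

index = build_coordinate_index(DOC_TERMS)
print(known_item_success_rate(index, SEARCHES))
```

In this toy run the second search fails because the searcher picked a term the indexer never used, which is the kind of mismatch between indexer and searcher vocabulary that a known-item test exposes.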

In these tests, all but the faceted system produced roughly equal numbers of "correct" results, while the faceted concept lagged. After studying these results, the researchers re-indexed the faceted system using a different format on the cards and re-ran the tests. In this series of tests, the faceted system was now the clear winner. This suggested the underlying theory behind the system was less important than specifics of the implementation.

The outcome of these experiments, published in 1962, generated enormous debate, both among the supporters of the various systems and among researchers who complained about the experiments as a whole. Nevertheless, one conclusion appeared to be clearly supported: simple systems based on keywords worked just as well as complex classificatory schemes. This is important, as the former are dramatically easier to implement.

Cranfield 2
In the first series of experiments, experts in the use of the various techniques were tasked with both the creation of the index and its use against the sample queries. Each system had its own concept about how a query should be structured, which would today be known as a query language. Much of the criticism of the first experiments focused on whether the experiments were truly testing the systems, or the user's ability to translate the query into the query language.

This led to the second series of experiments, Cranfield 2, which considered the question of converting the query into the query language. To do this, instead of considering the generation of the query as a black box, each step was broken down. The outcome of this approach was revolutionary at the time; it suggested that the search terms be left in their original format, what would today be known as a natural language query.

Another major change was how the results were judged. In the original tests, a success occurred only if the index returned the exact document that had been used to generate the search. However, this was not typical of an actual query; a user looking for information on aircraft landing gear might be happy with any of the collection's many papers on the topic, but Cranfield 1 would consider such a result a failure in spite of returning relevant materials. In the second series, the results were judged by third parties who gave a qualitative answer on whether the query generated a relevant set of papers, as opposed to returning a specified original document.
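The difference between the two judging schemes can be made concrete with a small example. The sketch below is illustrative only: the document numbers and judgments are invented, and precision and recall are used here as common ways of summarizing relevance judgments. That choice is an assumption of this example, not a description of how the Cranfield 2 judges recorded their answers.

```python
def exact_match_success(retrieved, source_doc):
    """Cranfield 1 style: success only if the source document is returned."""
    return source_doc in retrieved

def precision_recall(retrieved, relevant):
    """Relevance-based scoring against the set of documents judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical landing-gear query: document 7 was used to generate the query,
# but the judges consider documents 7, 12, 15 and 31 all relevant.
source_doc = 7
relevant = {7, 12, 15, 31}
retrieved = [12, 15, 40]

print(exact_match_success(retrieved, source_doc))  # False
print(precision_recall(retrieved, relevant))
```

Here the search misses the source document, so exact-match scoring records a failure, yet two of the three documents returned were judged relevant.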

Continued debate
The results of the two test series continued to be a subject of considerable debate for years. In particular, they led to a running debate between Cleverdon and Jason Farradane, one of the founders of the Institute of Information Scientists in 1958. Each would invariably appear at meetings where the other was presenting and then, during the question and answer period, explain why everything the presenter was doing was wrong. The debate has been characterized as "...fierce and unrelenting, sometimes well beyond the boundaries of civility." This chorus of criticism was joined by Don R. Swanson in the US, who published a critique of the Cranfield experiments a few years later.

In spite of these criticisms, Cranfield 2 set the bar by which many subsequent experiments were judged. In particular, Cranfield 2's methodology, starting with natural language terms and judging the results by relevance rather than exact matches, became almost universal in later experiments in spite of many objections.

Influence
With the conclusion of Cranfield 2 in 1967, the entire corpus was published in a machine-readable form. Today, this is known as the Cranfield 1400, or by any of several variations on that name. The name approximates the number of documents in the collection, which consists of 1398 abstracts. The collection also includes 225 queries and the relevance judgments of all query:document pairs that resulted from the experimental runs. The main database of abstracts is about 1.6 MB.

The experiments were carried out in an era when computers had a few kilobytes of main memory and network access to perhaps a few megabytes. For instance, the mid-range IBM System/360 Model 50 shipped with 64 to 512 kB of core memory (tending toward the lower end) and its typical hard drive stored just over 80 MB. As the capabilities of systems grew through the 1960s and 1970s, the Cranfield document collection became a major testbed corpus that was used repeatedly for many years.
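A quick back-of-the-envelope calculation puts these sizes in proportion. The snippet uses only the figures quoted in the text; treating the units as decimal (1 kB = 1,000 bytes) is an assumption, and the variable names are purely illustrative.

```python
# Figures quoted in the text; decimal units (1 kB = 1,000 bytes) are an assumption.
ABSTRACTS = 1398
QUERIES = 225
CORPUS_BYTES = 1.6e6      # "about 1.6 MB"
CORE_MIN_BYTES = 64e3     # Model 50, smallest quoted core memory
CORE_MAX_BYTES = 512e3    # Model 50, largest quoted core memory

pairs = ABSTRACTS * QUERIES                     # possible query:document pairs
bytes_per_abstract = CORPUS_BYTES / ABSTRACTS   # average abstract size
ratio_min = CORPUS_BYTES / CORE_MIN_BYTES       # corpus vs. smallest core
ratio_max = CORPUS_BYTES / CORE_MAX_BYTES       # corpus vs. largest core

print(f"query:document pairs: {pairs:,}")
print(f"average abstract size: {bytes_per_abstract:.0f} bytes")
print(f"corpus / smallest core memory: {ratio_min:.1f}x")
print(f"corpus / largest core memory: {ratio_max:.1f}x")
```

On these figures the abstracts alone are several times larger than even the largest quoted core memory, so a program working with the full collection on such a machine could not hold it all in memory at once.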

Today the collection is too small to use for practical testing beyond pilot experiments. Its place has mostly been taken by the TREC collection, which contains 1.89 million documents across a wider array of subjects, or the even more recent GOV2 collection of 25 million web pages.