User:RichTreston/Conceptsearch

Concept Search

This document will address Concept Search: how it is being defined, and some of the major methodologies or solutions that are typically used to provide Concept Search.

Definition of Concept Search

In short, there are many definitions of concept search, and most of them are driven by individual companies and their desire to "define" the term for their own marketing use. Worse, users are so conditioned by keyword or Boolean search strategies (virtually everyone knows how to use Google) that defining the difference between keyword search and concept search quickly gets philosophical.

The easiest way to define concept search is to first define what a concept is. A dictionary definition of "concept" is a general notion, idea, or construct. The "concept" of terminating someone from a job, for example, might be expressed by a variety of keywords: being fired, being let go, included in a reduction in workforce, job termination, etc.

The challenge with concepts is that they can usually be expressed using a variety of different keywords. This is also the promise of concept search, or searching for "concepts" - if you get the concept properly conceived, then you will find valid, relevant information whether or not you include particular keywords. Conversely, concept search generally permits a user to easily narrow in on only relevant results, eliminating much of the chaff that keywords generate.

Another way to look at this is to regard concept search as search based on Meaning, not Spelling. Boolean search finds documents based on specific shared character strings. If your search term doesn’t exist in a document, you never see it. Concept search returns relevant documents regardless of shared terms or even a common language. It gracefully handles degraded documents like those processed by OCR where Boolean search quickly breaks.
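The limitation of string matching described above can be illustrated with a minimal sketch. The function name boolean_and_search and the three sample memos are hypothetical, made up for this illustration; a real Boolean engine would use an inverted index and richer operators.

```python
def boolean_and_search(documents, terms):
    """Return ids of documents that contain every query term as a whole word."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(term.lower() in words for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical sample collection: three memos about the same concept.
documents = {
    "memo1": "the employee was fired on friday",
    "memo2": "she was let go after the reorganization",
    "memo3": "his job termination takes effect next month",
}

# Only the memo that literally contains "fired" is returned.
fired_hits = boolean_and_search(documents, ["fired"])
print(fired_hits)
```

All three memos concern the termination of an employee, yet only one shares the character string in the query. That gap is what concept search sets out to close.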

Types of Concept Search

There are two major types of concept search engines that are commercially available. One type is centered around word and sentence construction, sometimes referred to as Natural Language Processing. The other type uses mathematics and either probability or geometry. Each has its benefits and drawbacks. This is a hot area of research right now, so any discussion of these two types of concept search will doubtless neglect some of the newer strategies and solutions; readers are directed to look at companies and universities involved in primary research and development in Search and Search Engines, as well as research companies such as IDC who devote an entire department to Search.

Linguistic or Language-based Concept Search

At its core, linguistic or language-based concept search uses techniques first developed at PARC in the 1980s regarding language and linguistics. The idea behind Natural Language Processing, or NLP, was to identify how sentences were constructed and how the combination of the words used and the way they were used (i.e., linguistics) could be interpreted by computers as accurately conveying meaning. One of the early providers of NLP-based concept search was Inxight (now Business Objects/SAP). In many ways this mirrors how we all learn speech and verbal communications: we identify a subject, a verb, and an object, as in "I want an ice cream cone." If we know that "I" refers to the first person, and "want" is a verb conveying desire, and we know that "ice cream cone" is an object, then a similar statement like "I want coffee" conveys a similar desire for a different object. Overly simplistic, granted, but one way to imagine the ideas behind NLP.

Challenge to Linguistic Concept Search

Language is both fluid and complicated, which confounds any search engine. In the case of NLP, first, somewhere there is a need for dictionaries that explain what words mean. Second, there are two vexing characteristics of language that must be overcome: synonymy and polysemy. Synonyms are different words that have the same meaning: for example, car and automobile. In addition to a dictionary, such systems typically need a thesaurus to map these synonyms to one another.
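The role of the thesaurus can be sketched as simple query expansion. Everything named here (THESAURUS, expand_query, search_any, and the sample ads) is a hypothetical illustration, not the method of any particular product.

```python
# Hypothetical hand-built thesaurus; real systems maintain far larger ones.
THESAURUS = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
}


def expand_query(terms, thesaurus):
    """Add every known synonym of each query term to the query."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= thesaurus.get(term, set())
    return expanded


def search_any(documents, terms):
    """Return ids of documents containing at least one of the terms."""
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]


documents = {
    "ad1": "used car for sale",
    "ad2": "vintage automobile in good condition",
    "ad3": "mountain bike for sale",
}

plain_hits = search_any(documents, {"car"})
expanded_hits = search_any(documents, expand_query(["car"], THESAURUS))
print(plain_hits, expanded_hits)
```

The expanded query also finds the ad that says "automobile", but only because someone entered that synonym into the thesaurus beforehand.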

Polysemy is a more difficult challenge: polysemic terms are ones whose meaning changes depending on how they are used. A well-used example is the word "bank." A bank can be a financial institution, the edge of a river, or can describe the way an airplane turns. Since polysemy is totally dependent upon context, additional wordmaps may be needed to guide these search engines down the right path to perform a conceptual search.
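One very rough way to picture such a word map is a table of context cues for each sense of a term. The sketch below picks the sense whose cue words overlap most with the surrounding sentence. The SENSE_CONTEXT table and the disambiguate function are assumptions made for this example; production systems use far richer linguistic resources.

```python
# Hypothetical word map: context cues for each sense of "bank".
SENSE_CONTEXT = {
    "financial institution": {"money", "loan", "deposit", "account"},
    "river edge": {"river", "water", "fishing", "shore"},
    "aircraft turn": {"airplane", "pilot", "turn", "wing"},
}


def disambiguate(sentence, sense_context):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & cues) for sense, cues in sense_context.items()}
    best = max(scores, key=scores.get)
    # With no cue words present, the context gives no basis for a choice.
    return best if scores[best] > 0 else None


print(disambiguate("she opened an account at the bank", SENSE_CONTEXT))
print(disambiguate("we went fishing on the bank of the river", SENSE_CONTEXT))
print(disambiguate("the pilot made the airplane bank to the left", SENSE_CONTEXT))
```

When no cue word appears in the sentence, the function returns None, which mirrors the point that polysemy cannot be resolved without context.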

Finally, language is very organic - it changes over time. If anything, the prevalence of the internet and electronic communications has accelerated how language morphs. A term like "google" didn't exist except as a company name a decade ago, yet now it's a part of everyday speech. This means that the "external resources" such as thesauri, dictionaries, ontologies, etc., that support linguistic concept search engines need continual modification and enhancement. This may also mean that using modern terminology to search older information could yield unsatisfactory results - the current definitions of words and their usage simply don't match older terminology. A good example is the term "gay," which was regularly used to describe happy and carefree people in the early 20th century, yet has taken on a completely different meaning today.

Mathematical Concept Search

Language has a common, repeatable structure - if it didn't, we couldn't communicate along a common framework. Even foreign languages share much of the same structure, i.e. the notion of nouns and objects and verbs and how we string these together to form thoughts and communications. The notion behind mathematical search engines is that if you can identify the structure (versus the actual words and their meaning) AND you have enough volume of information to index, you can identify patterns, relationships, etc., simply based on how various words, phrases, etc., are used.

There are two major strategies in the market today that utilize mathematics to perform concept search. One is based on an 18th-century result called Bayes' Theorem, the basis of Bayesian Inference, and the other is based on vector geometry and is called Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA). The two are very different, yet both are capable of arriving at highly valid conceptual similarities and conclusions.

In very simplistic terms, the notion behind Bayesian Inference is probability. If the same words and terms appear together in enough documents, the probability that those words and phrases are related is very high. Given sufficient volume, these phrases turn into concepts and the relationships into concept search. Autonomy is one company that successfully employs Bayesian Inference in its concept search solutions.
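The co-occurrence idea can be sketched by estimating, from a small collection, how likely one term is to appear given that another appears, and then checking the estimate against Bayes' Theorem. The sample documents and the functions doc_probability and conditional are hypothetical; this is a toy illustration of the probability involved, not a description of how Autonomy or any other vendor implements it.

```python
def doc_probability(documents, *terms):
    """Fraction of documents that contain all of the given terms."""
    hits = sum(1 for text in documents
               if all(term in text.split() for term in terms))
    return hits / len(documents)


def conditional(documents, target, given):
    """Estimate P(target appears | given appears) from co-occurrence counts."""
    p_given = doc_probability(documents, given)
    if p_given == 0:
        return 0.0
    return doc_probability(documents, target, given) / p_given


# Hypothetical five-document collection.
documents = [
    "layoff notice sent to staff",
    "staff layoff and severance terms",
    "severance package after layoff",
    "quarterly sales report",
    "sales staff meeting",
]

p_sev_given_layoff = conditional(documents, "severance", "layoff")
p_sales_given_layoff = conditional(documents, "sales", "layoff")

# Bayes' Theorem: P(layoff | severance) = P(severance | layoff) * P(layoff) / P(severance)
p_layoff_given_sev = (p_sev_given_layoff * doc_probability(documents, "layoff")
                      / doc_probability(documents, "severance"))
print(p_sev_given_layoff, p_sales_given_layoff, p_layoff_given_sev)
```

In this collection "severance" appears in two of the three documents that mention "layoff", while "sales" appears in none of them, so the first pair looks related and the second does not. With only five documents the estimates are crude, which is why the approach depends on volume.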

The second approach, LSA or LSI, is based on the notion that if you can assign a numeric value to every unique word in every document in a collection (in practice, by building a term-by-document matrix and reducing it with singular value decomposition), then given sufficient volume of documents, the numeric plot of those values will show relationships. Like Bayesian Inference, those relationships are a highly valid measurement of conceptual similarity. Two companies that successfully employ LSA or LSI are Recommind and Content Analyst (which is the original patent holder on LSI).
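A minimal sketch of the LSA idea follows, written in plain Python so that it has no dependencies. It builds a term-by-document matrix from four tiny hypothetical documents, finds the two strongest "concept" directions (the leading singular vectors, computed here by power iteration with deflation), and compares a query with the documents in that reduced space. All names, the sample documents, and the choice of two dimensions are assumptions made for illustration. Real systems use optimized numerical libraries, term weighting, and hundreds of dimensions.

```python
import math

DOCS = ["car engine", "automobile engine", "river water", "river bank"]
VOCAB = sorted({word for doc in DOCS for word in doc.split()})


def term_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    words = text.split()
    return [float(words.count(term)) for term in VOCAB]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def top_eigenpairs(matrix, k, iterations=200):
    """Leading k eigenpairs of a symmetric matrix: power iteration plus deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for p in range(k):
        # Arbitrary start vector, varied per pass so it is not orthogonal
        # to the direction being sought in this small example.
        v = [float((i + 1) ** (p + 1)) for i in range(n)]
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            length = math.sqrt(dot(w, w))
            v = [x / length for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        pairs.append((eigenvalue, v))
        # Remove the component just found so the next pass finds the next one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return pairs


def concept_axes(doc_vectors, k):
    """Top-k left singular vectors of the term-by-document matrix."""
    n = len(doc_vectors)
    gram = [[dot(doc_vectors[i], doc_vectors[j]) for j in range(n)]
            for i in range(n)]
    axes = []
    for eigenvalue, v in top_eigenpairs(gram, k):
        sigma = math.sqrt(eigenvalue)
        axes.append([sum(v[j] * doc_vectors[j][t] for j in range(n)) / sigma
                     for t in range(len(VOCAB))])
    return axes


def project(vector, axes):
    """Coordinates of a term vector in the reduced concept space."""
    return [dot(vector, axis) for axis in axes]


doc_vectors = [term_vector(doc) for doc in DOCS]
axes = concept_axes(doc_vectors, k=2)
query = term_vector("car")

# Query "car" against the document "automobile engine" (no shared words).
raw_similarity = cosine(query, doc_vectors[1])
lsa_similarity = cosine(project(query, axes), project(doc_vectors[1], axes))
# Query "car" against the unrelated document "river bank".
unrelated_similarity = cosine(project(query, axes), project(doc_vectors[3], axes))
print(raw_similarity, lsa_similarity, unrelated_similarity)
```

In the raw word space the query "car" has zero similarity to "automobile engine" because the two share no terms. In the reduced space the two line up, because "car" and "automobile" both co-occur with "engine", while the "river" documents remain unrelated.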

The benefit of mathematical approaches is that they are highly resistant to the challenges of word meanings (synonymy and polysemy) and language evolution which can plague NLP-based solutions. The downside to mathematical approaches is twofold. First, they are deriving all their meanings in a vacuum, which can be subjective, narrow, and incomplete. For this reason, larger collections yield more valid results because those collections are more inclusive. Second, mathematical approaches can have specific processing or hardware requirements. In the case of LSA or LSI, the index of pointers typically needs to be resident in a computer server's memory. For large collections this can mean many MB of memory are needed to achieve rapid response times.

Uses for Concept Search

In either form (linguistic or mathematical), Concept Search has found its way into a variety of markets. In the public sector, concept search is an obvious technology for intelligence gathering. In the commercial market, Concept Search is employed in human resource applications (to help match resumes and qualifications with job positions), in the legal world (in knowledge management at major law firms, allowing those firms to quickly determine which attorneys have relevant experience for upcoming cases), in the world of litigation (specifically eDiscovery, where concept search offers a more efficient way of sorting through documents to find the ones most relevant to the case at hand), in education (where concept search engines have been successfully employed to "score" essay-type tests by determining how relevant those essays were to a set of accurate or benchmark essays), and in customer service or research (where concept search permits faster retrieval of service information, and where concept search permits easier identification of customer queries or complaints).

Most applications of concept search are not in the "public domain" (i.e., Google, MSN, etc.) - rather, they are deployed into specific industries and/or specific applications.

Concept Search Engines not being used for Searching

One aspect about concept search engines that some folks find confusing is that one of their most popular uses is for applications that aren't really doing any search. This is because the capability to identify "concepts" has broader applicability than just search.

One of the most popular applications of concept search is for classification or categorization. In this deployment, the same software that can identify concepts is used instead to compare a set of examples against a larger collection of documents. This notion of conceptual categorization is used to classify email correspondence, to identify scanned and OCR'd documents, and to organize electronic files or documents, among other uses. The benefit of using concept search engines in a classification or categorization scheme is that they are not term-dependent: they will identify all documents which discuss similar or conceptually related topics even if particular words are misspelled, missing, or deliberately avoided.
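The classify-by-example workflow can be sketched as follows. Note the loud assumption: plain word-overlap cosine similarity stands in here for the concept engine's similarity score, so this toy version is term-dependent in a way a real concept engine would not be. The category names, example texts, and threshold are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical example documents supplied for each category.
EXAMPLES = {
    "billing": ["invoice payment overdue", "refund for duplicate payment"],
    "support": ["cannot login to account", "password reset not working"],
}


def classify(text, examples, threshold=0.1):
    """Assign the category whose closest example is most similar to the text."""
    vec = vectorize(text)
    best_label, best_score = None, threshold
    for label, samples in examples.items():
        score = max(cosine(vec, vectorize(sample)) for sample in samples)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(classify("payment for invoice 1234", EXAMPLES))
print(classify("password reset link expired", EXAMPLES))
print(classify("lunch menu today", EXAMPLES))
```

Swapping the cosine function for a similarity score from a concept engine would leave the classification loop unchanged, which is one reason the same software serves both purposes.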

Another application of concept search engines is clustering. This is almost the reverse of classification: in a clustering application, the concept search engine is used to evaluate a collection of documents, determine which are conceptually related and to what degree, and finally to organize them into appropriate folders and subfolders.
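Clustering can be sketched with a simple single-pass grouping: each document joins the first existing group whose seed document is similar enough, or else starts a new group. As an assumption for this example, Jaccard word overlap stands in for a concept engine's similarity measure, and the documents, threshold, and function names are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def cluster(documents, threshold=0.2):
    """Single-pass clustering: compare each document with each group's seed."""
    clusters = []  # each cluster is a list whose first item is its seed
    for doc in documents:
        for group in clusters:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            # No existing group was similar enough, so start a new one.
            clusters.append([doc])
    return clusters


documents = [
    "contract renewal terms",
    "server outage report",
    "renewal of supplier contract",
    "outage on mail server",
]

groups = cluster(documents)
for group in groups:
    print(group)
```

Unlike the classification case, no examples or category names are supplied in advance: the groups emerge from the collection itself.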