Talk:Search engine indexing

About this article
The goal is to provide an authoritative resource on the architecture, behavior, major processes, and challenges of search engine indexing, written for a general web audience. Please refrain from adding commercial references, in order to maintain a neutral point of view (NPOV).

TODO

 * fill out the list of references using correct formats
 * add back some of the content removed on the 9th, in corrected form
 * remove annotation clutter and opinionated content
 * integrate with templates
 * get rid of weasel words
 * harmonize with information extraction, controlled vocab

Search engine sizes
This comes from http://blog.searchenginewatch.com/blog/041111-084221, so I have not included it in the article for now, as I do not want to do anything illegal and am not sure this is the best reference. The goal is to show the number of pages indexed, at least at some point in time, to give a feel for the scale. Understanding index sizes in practice is important for appreciating the technological challenge and the rationale behind the intense research into compression, forms of indexing, and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)


 * Also see Overture press release Josh Froelich 16:52, 15 December 2006 (UTC)


 * The size of an Internet search index isn't necessarily particularly useful information. The problem has never been how to store lots of entries in the index, as even simple solutions scale O(1) best-case and O(log(n)) worst-case. In some space-constrained world where you really had to save the bytes, you could probably fit a billion-webpage reverse index in a few terabytes of data. If you crunch the numbers, you'll see that allows for a few kilobytes of information per document, which is plenty. The hard problem in search engine design is how to retrieve the *relevant* entries. 176.10.248.194 (talk) 08:34, 13 September 2021 (UTC)
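To make the "crunch the numbers" step in the comment above concrete, here is the arithmetic with my own round figures (not measurements from any real engine):

```python
# Back-of-envelope check of the storage claim above.
pages = 1_000_000_000          # one billion web pages (round figure)
index_size_bytes = 4 * 10**12  # assume a 4 TB index (my own assumption)

bytes_per_doc = index_size_bytes / pages  # storage budget per document
print(bytes_per_doc)  # 4000.0, i.e. about 4 KB of index data per page
```

Even with generous per-document metadata, a few kilobytes per page leaves plenty of room, which supports the point that relevance, not storage, is the hard problem.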

Controlled vocab

 * maybe provide link to full text search topic, harmonize with its contents
 * explain controlled-vocabulary-style indexing, start lists, weighted lists, and other techniques, which are indexed differently in the sense that a specialized inverted index is created that is not data-driven. The keywords (in keyword-based controlled vocabulary searching) are like classes in a classification model, or an associative array mapping keywords to specific full-text terms/articles.

Notes on WikiPedia as an Example
Wikipedia's search option is powered by a search engine that involves an indexing process. The search engine technology is a program called Lucene, which is integrated into MediaWiki, the supporting software that Wikipedia runs on.

Some of these same concepts are employed by Wikipedia, and you can learn more about them by researching MediaWiki and the architecture of Lucene.

Wikipedia articles are stored in a database, which serves as the corpus. For indexing, primarily only the wikitext part of each article is considered; the other metadata is mostly ignored. The wikitext is periodically indexed by the Lucene indexer, and this index is used when you click the search button on Wikipedia. From the wiki help page "Article does not appear when searching": the database for searching is updated approximately every 4-6 weeks, so changes do not appear there right away. Josh Froelich 02:34, 19 December 2006 (UTC)

I am considering adding Wikipedia itself as an example of the innards of search engine indexing. For the Wikipedia Lucene example, looking at the SVN source code on MediaWiki's website, we can see that:


 * In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/EnglishAnalyzer.java?revision=6911&view=markup, when parsing documents, each token is lowercased and stemmed.


 * In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWDaemon.java?revision=8447&view=markup, searchers operate off the index asynchronously in multiple Java threads.


 * In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWSearch.java?revision=8991&view=markup, using a rebuild command, the Wikipedia article corpus is indexed sequentially, in its entirety.


 * Lots of info about SearchState - http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/SearchState.java?revision=8992&view=markup
 * Wikipedia uses a separate indexer for German, Esperanto, and Russian; everything else uses the English tokenizer, chosen based on the language.
 * We can see that incremental changes are first written to an in-memory index, which is later written to a file-based index.
 * The wiki text is first parsed and wiki syntax is removed in the stripWiki function, so that the document is treated like a normal English document.

Josh Froelich 20:21, 7 December 2006 (UTC)
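The analysis steps noted in the bullets above (strip wiki syntax, lowercase each token, then stem) could be sketched roughly as follows. The function names, the crude regular expressions, and the toy stemmer are my own illustration, not the actual lucene-search code:

```python
import re

def strip_wiki(text: str) -> str:
    """Remove a few common wiki constructs so the text reads like plain English.
    (Crude illustration; the real stripWiki handles far more syntax.)"""
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote runs
    return text

def crude_stem(token: str) -> str:
    """Toy suffix-stripping stemmer standing in for Lucene's English stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(wikitext: str) -> list[str]:
    """Strip wiki markup, then lowercase and stem each token."""
    plain = strip_wiki(wikitext)
    return [crude_stem(t.lower()) for t in re.findall(r"[A-Za-z]+", plain)]

print(analyze("The [[Dog|dogs]] were '''barking'''"))
# ['the', 'dog', 'were', 'bark']
```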

Token Hashing
Storing the token 'dog' repeatedly is redundant, and storing such a list for all the documents in a large corpus is not technically feasible. Instead, many search engine designs incorporate a token hash, which assigns a unique identifying value to each word. This word hash is a separately managed data structure commonly referred to as a lexicon. The lexicon is, in effect, a table mapping each distinct word to its identifier.

After hashing the words, the algorithm stores a list of each word's hash key (the key is the id or value identifying the word) and its position (and potentially other characteristics, such as the part of speech). This listing consumes less computer memory (takes up less space) given the smaller key size.
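A minimal sketch of the lexicon idea described above, assuming a simple incrementing integer id per distinct token (real engines vary in layout and hashing scheme; all names here are illustrative):

```python
# Lexicon: each distinct token gets one integer id, and the document
# listing stores (word_id, position) pairs instead of repeating the
# token string itself.
lexicon: dict[str, int] = {}

def word_id(token: str) -> int:
    """Assign the next integer id the first time a token is seen."""
    if token not in lexicon:
        lexicon[token] = len(lexicon)
    return lexicon[token]

def index_document(tokens: list[str]) -> list[tuple[int, int]]:
    """Replace each token with its id, keeping its position."""
    return [(word_id(tok), pos) for pos, tok in enumerate(tokens)]

listing = index_document(["the", "dog", "chased", "the", "dog"])
print(lexicon)   # {'the': 0, 'dog': 1, 'chased': 2}
print(listing)   # [(0, 0), (1, 1), (2, 2), (0, 3), (1, 4)]
```

Note that the repeated tokens 'the' and 'dog' are stored only once in the lexicon; the listing carries only the small integer keys.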

Token Conflation
In the listing above, word id 2 appears twice. This is a redundancy, which means there is room for optimization. As such, the list is conflated, or aggregated, by id: the position of each token occurrence, commonly referred to as a 'hit' (citation needed), is stored in a list for each key.

In addition to grouping words based on the exact characters in each word (dog is the same as dog...), some search engines also group 'similar words', or words which share something in common, such as grammatical form or meaning. Using linguistic analysis, stemming is performed to identify the root lexeme of each word. If two words share the same lexeme, they are grouped together during conflation.
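Conflation by id might be sketched like this, aggregating (word id, position) pairs into one hit list per key (an illustrative structure, not any particular engine's):

```python
from collections import defaultdict

def conflate(listing: list[tuple[int, int]]) -> dict[int, list[int]]:
    """Group the positions ('hits') of each word id into a single list."""
    hits: dict[int, list[int]] = defaultdict(list)
    for word_id, pos in listing:
        hits[word_id].append(pos)
    return dict(hits)

# Pairs for the tokens: the, dog, chased, the, dog
print(conflate([(0, 0), (1, 1), (2, 2), (0, 3), (1, 4)]))
# {0: [0, 3], 1: [1, 4], 2: [2]}
```

Each id now appears exactly once, with its occurrences collapsed into a hit list; grouping stem-equal words would simply mean assigning them the same id before this step.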

Stop and Start List Data Structures
A stop list is a list of words (or numbers or other symbols), typically for a specific language, that should not be indexed. Alternatively, the stop list is the list of words that the search engine user cannot search with (and that, as such, need not always be indexed). At this point in the process, or earlier during initial tokenization or hashing (or simply by looking at a property of the word in the lexicon), words may be ignored and removed, preventing their entries from being stored in the later data structures of the indexing process, such as the forward index.

Stop lists are also referred to as black lists, or ignore lists. Entries in the stop list are typically referred to as stop words or stopwords.

Stop lists typically contain function words, sometimes referred to as noise (see information theory, Claude Shannon). Common English stop words include:


 * the
 * it
 * she
 * he
 * and
 * or

These words carry little semantic value for the search effort. Removing words with a limited value ratio (the disk space or memory consumed in the index data structures, and the time spent processing them, versus how frequently users search for them or need to search for them at all) is common practice, as it reduces the disk size of later data structures, which in turn improves processing speed.
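A minimal sketch of stop-word removal, using the example stop words listed above (a filter applied before tokens reach the later index structures):

```python
# Stop words from the example list in the text above.
STOP_WORDS = {"the", "it", "she", "he", "and", "or"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop stop-listed tokens so they never reach the forward index."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "dog", "and", "the", "cat"]))
# ['dog', 'cat']
```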

A start list is a list of words or symbols that should be indexed. Some indexing processes only index words from a start list. Others provide a weighting method in which start words are treated differently, so that results containing these words rank higher in the search results.
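One possible weighting method along the lines just described, purely as illustration: terms on the start list contribute more to a document's score than ordinary terms. The start-list contents and weights here are hypothetical:

```python
# Hypothetical start list for a domain-specific engine.
START_LIST = {"patent", "statute"}
START_WEIGHT = 2.0    # start words count double (arbitrary choice)
DEFAULT_WEIGHT = 1.0  # ordinary words

def score(query_terms: list[str], doc_tokens: list[str]) -> float:
    """Sum per-term weights for query terms present in the document."""
    return sum(
        (START_WEIGHT if t in START_LIST else DEFAULT_WEIGHT)
        for t in query_terms
        if t in doc_tokens
    )

print(score(["patent", "office"], ["the", "patent", "office"]))  # 3.0
```

Documents matching start words thus outscore documents matching only ordinary words, pushing them higher in the result ordering.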

Lexicographic Conflation
Lexicography commonly refers to the practice of making dictionaries. Search engine designers may compile dictionaries, or one or more thesauri, and incorporate these into the indexing process to introduce human bias. For example, words which are listed in a thesaurus as synonyms may be indexed as only a single entry.

In practice, document authors may use lexically distinct symbols to denote a single concept. Given that search engines pursue quality search results, and that a user enters words which he or she considers representative of the concept, a goal of incorporating the dictionary is to increase the accuracy with which the search engine matches documents, even though the symbols (the physical words) differ.

Certain search engines do not incorporate dictionaries, or involve them at a later point as part of query expansion.
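A sketch of thesaurus-based conflation as described above, assuming a simple synonym-to-canonical-form mapping (the thesaurus entries are my own examples):

```python
# Hand-compiled thesaurus: synonym -> canonical index entry.
THESAURUS = {"car": "automobile", "auto": "automobile"}

def canonical(token: str) -> str:
    """Map a synonym to its canonical entry; unknown words pass through."""
    return THESAURUS.get(token, token)

print([canonical(t) for t in ["car", "auto", "automobile", "truck"]])
# ['automobile', 'automobile', 'automobile', 'truck']
```

Applied before hashing, this makes lexically distinct words for the same concept share a single lexicon entry, so a query for any of the synonyms matches documents using any of the others.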

The Term Document Matrix
The class of search engines and natural language processing software that employ latent semantic analysis use the term-document matrix data structure, which is similar in nature to the forward and inverted indices described above. The term-document matrix is a table in which every row is a document, every column is a word, and each intersection (or data cell) contains a frequency count (or sometimes other information).

Queries are performed by locating the column containing the word and obtaining the set of non-null rows in the table.

Note that the term-document matrix is a sparse matrix, as not every document in the corpus contains every token, and not every token appears in every document. The forward and inverted indices are also forms of this same sparse matrix.
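A sketch of a sparse term-document matrix and the column query described above, storing only the non-null cells (an illustrative layout, not any specific system's):

```python
from collections import Counter

# Toy corpus; document names and contents are made up.
docs = {
    "d1": ["dog", "chased", "cat"],
    "d2": ["cat", "sat"],
    "d3": ["dog", "dog", "barked"],
}

# Sparse representation: one row per document, holding only the
# non-zero frequency cells {word: count}.
matrix = {doc: dict(Counter(tokens)) for doc, tokens in docs.items()}

def query(word: str) -> set[str]:
    """Return the documents whose row has a non-null cell for the word's column."""
    return {doc for doc, row in matrix.items() if word in row}

print(sorted(query("dog")))  # ['d1', 'd3']
print(matrix["d3"]["dog"])   # 2, the frequency count in that cell
```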

Order of Sections
This article would flow better if the sections were placed in the general order in which they are executed to create the inverted index (the final step). So something like:


 * 1) Document Parsing
 * 2) Forward keyword index
 * 3) Inverted Keyword index

With the smaller sections in between some of these. This would also involve changing the lead to include a statement of the goal of indexing: "Indexing aims to produce a map from unique keywords or terms to a list of documents containing that keyword. This data structure is referred to as an inverted keyword index."

Thoughts from anyone?

Drmadskills (talk) 18:16, 21 February 2008 (UTC)

Biased?
This article seems too biased to Internet web search engines. I came here to find information on how full-text indexes work in general (hash tables?) and found lots of stuff about... identifying languages? meta tags? Web 3.0?? —Preceding unsigned comment added by 200.68.94.105 (talk) 19:06, 22 August 2008 (UTC)

I could not agree more: there should be a separate article that describes the universal concept of a search index, providing examples of search index types and their applications, such as DBMS, text, and image search engines. Not necessarily an Internet search engine! The article should contain a link to an article that describes (Internet) full-text search engines. Itman (talk) 19:53, 24 February 2009 (UTC)

Suggested change
One of the previous subject authors wrote: "Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science." The word "informatics" is linked to another page titled Information technology. Note: Another page titled Informatics (academic field) exists. The Informatics (academic field) page specifies "Not to be confused with Information technology". This discussion requires an SME to identify (1) whether index design incorporates informatics or information technology, and (2) to link the appropriate page to the designated word. araffals 17:11, 22 January 2013 (UTC) — Preceding unsigned comment added by Araffals (talk • contribs)

"Search engine optimisation indexing"
This article's lead section gives "search engine optimisation indexing" as a synonym for search engine indexing. Is this a mistake? Jarble (talk) 03:10, 29 September 2020 (UTC)