User:Vrohitg/sandbox/Q21. Why is Google considered superior to Bing?

Introduction
When we think of search engines, we immediately think of Google. It has become our one-stop shop for queries. No matter the field, and no matter how naive or complex the question, Google handles it all and answers it all. We all know how Google became such a phenomenon: it is a beast at analyzing data and has one of the best web-search algorithms, backed by reputation, word of mouth among users and a strong record of reliability. The statistics agree: about 67% of internet users (two-thirds) use Google as their preferred search engine. In second place is Bing, Microsoft's answer to Google. Ever since its search alliance with Yahoo, Bing has established itself as Google's prime competitor, with 18% of internet users preferring it. For a search engine introduced in 2009, that is a very promising number against a rival as big as Google. Still, the general perception is tough to change. Our own professors and orators frequently say 'Google it' when a question goes unanswered, but we never hear 'Bing it', or the fancier 'Bing It On', from the general public. This led to the question I intend to explore: why this preferential treatment for Google? Is it well placed? How do the two measure up against each other? Is Google indeed better, or is its reputation simply too high for Bing to scale?

A Short Answer
Google and Bing are both search engines that respond to queries from users. Each uses an intensive, high-performance algorithm every time it searches the web; no matter how simple the query, the engine runs this complex algorithm for every search.

The given image gives a step-by-step description of the algorithm a search engine implements. Leaving the specifics aside for now, any search engine a user comes across on the web roughly performs these steps, to varying degrees. Engines such as Google and Bing obviously have multi-layered algorithms that are too complex for the novice user, but every search engine performs three main operations:
 * Web Crawling
 * Indexing
 * Index Matching

Web Crawling
One of the main functions of a search engine is to accumulate a set of web pages selected from the World Wide Web (WWW). A web crawler, or more commonly a spider (as Google likes to call it), finds links in these web pages; from those links the crawler finds more links to other pages, and so on. In this way the crawler accumulates an enormous number of pages. One other important factor in crawl-ability is the robots.txt file, which must be present at the root of the website's domain. This text file instructs the crawler where to look on the site, and may also specify any areas the site does not want the crawler to visit. Handling this file correctly is important, since a mistake in it may prevent the crawler from fully searching the site.
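As a concrete illustration of how a crawler honors robots.txt, the sketch below uses Python's standard urllib.robotparser on a made-up rules file; the crawler name and URLs are hypothetical:

```python
# Parse a hypothetical robots.txt and check which URLs a crawler may fetch.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks every candidate URL before fetching it.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/data")) # False
```

A real crawler would fetch robots.txt from the site root (for example via RobotFileParser.set_url and read) before queuing any page from that domain.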

Indexing
The next job of a search engine is to index the accumulated web pages. This is where the ranking algorithm comes into play: the algorithm finds specific features in each web page, based on which the pages are indexed.

Index Matching
The final job of a search engine is to match the indexed web pages against the search query specified by the user. The matching filter does this job, and the results presented to the user are the indexed pages this filter produces.

Search Engine Optimization
Search Engine Optimization is the technique by which a web page can be modified to gain a higher ranking, or more visits, from a search engine. The search engine looks at a number of signals in a web page and assigns a rating to that page; by optimizing for these signals, any given page may achieve a higher place in the search engine's ranking.

A search engine looks at the following signals when indexing a web page:

Site-Level Signals
These include the trust level and reputation of the website on which the web page resides, the location of the page within the site, the domain name the page uses and the popularity of that domain.

Page-Level Signals
The search engine also looks for page-level signals. These include how the page is organized into sections, the content presented and the quality of that content. The presentability of the page also plays an important role.

Off-Site Signals
Some factors beyond the face value of a website are also considered: for example, the number of links on the page, counting not just links within the same page but also links to other websites and web pages. Other factors include the social standing of a web page and its rating on neighboring websites.

Any indexing algorithm also utilizes graphs connecting groups of web pages. Some of the most important graphs a search engine looks at are:

Link Graphs
These graphs represent the links connected to a web page, both incoming and outgoing. Each web page is considered a node, and the degree of a node is important: it is defined as the number of links connected to that node at any point.
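A link graph and its node degrees can be sketched in a few lines; the pages and links below are hypothetical:

```python
# Each (source, target) pair is a hyperlink; each page is a node.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

# The degree of a node counts every link touching it, incoming or outgoing.
degree = {}
for src, dst in links:
    degree[src] = degree.get(src, 0) + 1  # outgoing link from src
    degree[dst] = degree.get(dst, 0) + 1  # incoming link to dst

print(degree)  # {'A': 3, 'B': 2, 'C': 3}
```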

Social Graph
The social graph represents the social connections a web page has with other web pages. The page's social standing, its interactions with other pages and the marketing tactics it uses all come under this graph. The pages are mapped under each criterion to form the graph.

Entities
This graph covers the places, events and organizations that the website or its pages deal with, along with the independent activities each web page has taken on.

Ranking Mechanisms
The search engine uses all the factors specified above, and incorporates further layers, before arriving at a ranked list.

Scoring
The score is the initial value the ranking algorithm assigns to each web page found by the crawler. It is a preliminary value evaluated in a fraction of a second, based on the signals and graphs discussed earlier. There are then ways in which the rank may be further influenced:

Boosting
This is a way in which the ranking of a web page may be bumped up a place or two through search-engine-specific optimization. It may involve things as large as keywords or as small as accessibility. For example, a page may load faster on mobile internet than on a desktop PC, which may raise its ranking by one place on mobile compared with its actual ranking. Some search engines assign high importance to keywords on the page, which can bump up the ranking of a page designed that way for that particular engine.

Dampening
Similar to boosting, there are ways in which a web page may receive a lower ranking than its initial value, due to factors such as spamming or cloaking.

Spamming: loading content that may be irrelevant multiple times, or forcing the user to do the same thing over and over.

Cloaking: as we know, web pages are indexed by a robot (Google's is named Googlebot). There are ways in which a page designer may fool this robot. One is for the site to serve two different versions of a page, one for the robot and one for the user. This is called cloaking.
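Boosting and dampening can be pictured as multiplicative adjustments applied on top of the preliminary score. The factors and triggers below are illustrative assumptions, not any engine's real values:

```python
# Adjust a preliminary score with hypothetical boosting/dampening factors.
def adjust_score(base_score, mobile_friendly=False, keyword_rich=False,
                 spamming=False, cloaking=False):
    score = base_score
    if mobile_friendly:
        score *= 1.1   # boost: the page loads fast on mobile
    if keyword_rich:
        score *= 1.2   # boost: engine-specific keyword optimization
    if spamming:
        score *= 0.5   # dampen: repeated irrelevant content
    if cloaking:
        score *= 0.1   # dampen heavily: robot and user see different pages
    return score

print(round(adjust_score(100, mobile_friendly=True, cloaking=True), 2))  # 11.0
```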

Ranking Function
As discussed earlier, it is common for each search engine to have its own algorithm. But since users expect much the same thing from every search engine, there should be a generalized formula that serves anyone looking to design their own. Such a formula may be represented as:

Ranking = F(Topical relevance, Content Quality, Context),

meaning the ranking value is a function of the three specified variables.
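As a rough sketch, a weighted sum can stand in for the function F; the weights and the example scores below are illustrative assumptions, not a real engine's values:

```python
# Ranking = F(topical relevance, content quality, context), sketched as
# a weighted sum with made-up weights.
def ranking(topical_relevance, content_quality, context,
            weights=(0.5, 0.3, 0.2)):
    scores = (topical_relevance, content_quality, context)
    return sum(w * s for w, s in zip(weights, scores))

# A highly relevant page whose content is good but somewhat dated:
print(round(ranking(topical_relevance=0.9, content_quality=0.7, context=0.4), 2))  # 0.74
```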

Topical Relevance
This specifies how close the topic is to the user's query. It is handled by the matching filter in the search engine, typically by matching keywords.

Content Quality
Probably the most important factor in the ranking algorithm is the quality of the content a page produces. Content quality is quantified using the following three factors:
 * 1) Authority: the trust level associated with the content (recommendations and so on), the reliability and bias of the author, the reputation of the website, etc.
 * 2) Utility: the accuracy of the content, whether the content and the title of the page match, the level of detail taken into account, etc.
 * 3) Presentation: whether the content is presented in an eye-friendly way, how the text is organized, the accessibility of the content, etc.

Context
How valuable is this content to the query at hand? How old is the post? Is the content outdated or up to date? These are some of the questions the search engine ponders when evaluating this factor.

A Long Answer
Having looked at the general mechanisms of a search engine, let us now look at the specifics:

Google's Page-Rank Algorithm
Page-Rank is an algorithm used by Google Search to rank websites in their search engine results. Page-Rank was named after Larry Page, one of the founders of Google. Page-Rank is a way of measuring the importance of website pages. According to Google:

''Page-Rank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.''

It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that was used by the company, and it is the best-known.

The Page-Rank algorithm works on the basis of the back-link count for a node (web page). Page-Rank is a link-analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, and it may be applied to any collection of entities with reciprocal quotations and references. Note that Page-Rank ranks web pages, not websites.

The original Page-Rank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by,

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where,


 * PR(A) is the Page-Rank of page A,
 * PR(Ti) is the Page-Rank of pages Ti which link to page A,
 * C(Ti) is the number of outbound links on page Ti and
 * d is a damping factor which can be set between 0 and 1.

Page-Rank works like a vote: every link to a particular page is a vote for that page. The more pages vote for you, the higher your ranking; and if a page that itself has a large number of votes points to you, you become more important. The Page-Rank of a page A cannot be computed until the Page-Rank of a page B pointing to A is calculated, and so on. If two pages point to each other, this becomes a recursive process, which can be confusing. But Google explains it:

Page-rank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.

The link matrix may be formed by considering each page as a node and assigning values for the nodes a given node points to. The matrix is n x n, where n is the number of pages.

High-points of Google's Algorithm

 * Web Crawler: Google has a robust, highly efficient web crawler that can gather a humongous number of web pages in a very short period. The more pages crawled, the better the quality of the ranking.
 * Back-link Count: the Page-Rank algorithm's prime focus is the back-link count. The more incoming links to your page, the more important you are; and the more incoming links to the pages that link to you, the more important you become.
 * Anchor Text: the anchor text is the clickable text of a link that redirects to a page. The beauty of Google's algorithm is that the anchor text it presents links to the page the user actually wants to view, so Google interprets the user's search and responds with the most popular result.

Bing's Algorithm
Microsoft ventured into creating the Bing search engine as a competitor to Google, implementing its own ranking algorithm to rank web pages. Bing refers to its search-engine-optimization problem as 'whole-page relevance': the algorithm blends blocks of content, web pages and results together into a single result set, a process called Answer Ranking.

The underlying argument of Bing's algorithm is that a block of content or a web page is well placed if it receives at least as many clicks as the equivalently sized block of content below it; or, as Bing calls it, if its win rate is greater than 0.5. So the algorithm promotes an answer upward on the page as long as its win rate is greater than 0.5. Armed with this metric, Bing ran multiple online experiments and compared the results of competing blending algorithms, yielding a realistic data set. Along with this, Bing uses three kinds of additional inputs: confidence scores from the answer provider, query characterizations, and features extracted from other answers and web pages that will be shown on the page. For example, to learn where to place an image answer in the search results for a given query, Bing considers the confidence score returned by the image-search provider, the ranking scores of nearby web pages, and whether the query is marked as referring to the sort of entities that are well described by images (people, places, etc.). Finally, Bing has built offline and online infrastructure for creating and running a blending function. It uses a robust, high-performance learning method called boosted regression trees to automatically produce a ranking function from training data, which allows Bing to use many signals with the confidence that each additional signal will incrementally improve the blending function.
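The win-rate rule described above can be sketched as repeated pairwise promotions; the click counts are hypothetical, and a simple click share between adjacent blocks stands in for Bing's measured win rate:

```python
# Promote a result block upward while its win rate against the block
# directly above it exceeds 0.5.
def blend(blocks):
    """blocks: list of (name, clicks) pairs in current page order."""
    order = list(blocks)
    moved = True
    while moved:
        moved = False
        for i in range(1, len(order)):
            upper_clicks = order[i - 1][1]
            lower_clicks = order[i][1]
            # share of clicks the lower block wins against the upper one
            win_rate = lower_clicks / (lower_clicks + upper_clicks)
            if win_rate > 0.5:
                order[i - 1], order[i] = order[i], order[i - 1]
                moved = True
    return [name for name, _ in order]

print(blend([("web", 40), ("images", 90), ("news", 10)]))  # ['images', 'web', 'news']
```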

Although the algorithm as described by Bing sounds complicated and complex, it may be simplified into a basic mathematical model consisting of two components:

Scoring Based on Relevancy
This measure has similarities to Google's relevancy score, but where Google's depends on the number of incoming links, Bing's is based on keywords:

Bing takes every document on the web and parses it for word frequency. Words are reduced to their roots (e.g., 'eating' is reduced to 'eat') and useless articles are ignored (e.g., 'as', 'the', 'is'). Each word generates a hash value (basically an ID number), which corresponds to an entry in a term-frequency table detailing the frequency of every term in the document.

When a user enters a query, the query is separated into words, which are reduced to their roots. A hash value is calculated for each word and looked up in the term-frequency table. Documents in which the term is present are called essential pages. If the selection algorithm finds a new document in its database containing the term, it takes the essential page with the lowest term frequency, compares it with the new document and, if the new document has a greater term frequency, tosses out the old essential page and replaces it with the new document. This continues until only the most essential pages remain in the subset (subject to an undetermined threshold). Refining this process at scale, Bing finally arrives at a list of essential pages that is put in front of the end user.
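A much-simplified sketch of this term-frequency step might look as follows; the toy stemmer and stop-word list are stand-ins for Bing's real processing:

```python
# Build a term-frequency table for a document: lowercase, drop stop
# words, crudely reduce words to their roots, then count.
from collections import Counter

STOP_WORDS = {"as", "the", "is", "a", "an", "of"}

def stem(word):
    # toy root reduction, e.g. "eating" -> "eat"
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    words = document.lower().split()
    terms = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(terms)

print(term_frequencies("The cat is eating as the cats watched"))
```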

Click Distance/URL Depth
Once Bing has identified a list of essential pages based on the relevancy score, it proceeds to the next step in the algorithm: click distance/URL depth. The click distance is the number of clicks it takes to get from one web page to another; indirectly, it may be considered a link distance, since the number of clicks corresponds to the number of links that must be followed to reach the desired page. The lower the click distance, the higher the visibility, or rank, of the page; the higher the click distance, the lower its importance.

URL depth is similar to click distance, except that it depends on the URL's position within a website: it is the number of slashes it takes to reach a given page as we walk through the site. The larger the number of slashes, the lower the importance, and vice versa. Bing uses URL depth to adjust the weighting calculated from click distance.
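URL depth is easy to compute: count the path segments after the domain. The sketch below does so with Python's standard urllib, using URLs of the kind discussed in Example 1 below:

```python
# Count URL depth as the number of path segments after the domain.
from urllib.parse import urlparse

def url_depth(url):
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

print(url_depth("https://www.amazon.com/gp/css/order-history"))  # 3
print(url_depth("https://www.amazon.com/Search/my%20order"))     # 2
```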

Once the relevancy score and the click distance/URL depth have been found, the Bing algorithm assigns a weight to each, and incorporates further complex measures, before arriving at the results. Although Bing's algorithm in its current state is far more advanced and complex, this was the basic mathematical model implemented as its ranking function. Bing has also added further inputs, such as search history and the user's click-through rate, to improve the user experience considerably.

High-points of Bing's Algorithm

 * Multimedia Content: Bing's algorithm ranks web pages with rich multimedia content (images, Flash videos, etc.) much higher than other search engines do. While Google gives higher precedence to raw HTML text, Bing's algorithm also takes the multimedia factor into account.
 * Domain Names: Bing's algorithm gives higher importance to web pages whose URLs use domains such as .gov, .edu and so on. One might argue that in most cases Google also ranks these pages highly, but Google does so because of the many back-links such pages have, not because of the domain names themselves.

Crawling
The main difference is the crawler used by the two search engines. The Google spider can crawl a wide range of pages in a short span without compromising the measurement of each page. Bing's crawler, despite covering a lot of pages, bases its measurement on the first 100 KB of each page, making it a top-heavy algorithm. A page designer could exploit this particular quirk to manipulate the ranking of a page.

Back-link Count
The back-link count is reliable because it depends on other pages pointing at you and, in turn, on you pointing at other pages. There is very little chance of manipulation, and the results are invariably legitimate and to the point. Bing, on the other hand, does not treat this as its primary measure; in fact, the top 10 results from Bing in most cases seem to weight the back-link count to a much lower degree than Google does.

Keywords
The use of specific keywords also plays a role in Bing's algorithm: it matches the keywords in the user's query against the frequency of keywords in a web page. Once again, this can drive certain pages to be designed specifically to emphasize heavy use of such keywords.

Re-indexing Pages
Both Google and Bing index pages, store them and update the indexes periodically. But Google, with its updated search index Caffeine (rolled out in 2010), seems to be much faster at re-indexing pages. The best illustration is the time each engine takes to pick up a newly created page on the web: Google, with its new technology, appears to be much faster at this than Bing.

Caffeine
As noted above, Google in 2010 updated its search index from a layered indexing procedure, which updated upward layer by layer, to a continuously updating global procedure that is much more complex but delivers higher performance. Reportedly, this has helped Google bring 50% fresher results to its searches; the previous layered indexing in some cases took weeks to surface new pages. Caffeine has also helped Google scale to many more web pages, improving the efficiency of its algorithm.

Example 1
Bing's algorithm: the use of click distance.

In Bing, let us search for 'Amazon my orders'. What do you expect? Bing's first result is an Amazon web page returning the results of a search for the keywords 'My orders' in Amazon's search box. The most logical step would have been to take us to our orders, if we are logged in to Amazon, or else to the login page; but the first non-ad result in Bing is neither. Of course, it is easy to navigate from the search-results page to the My Orders page on Amazon, but the point here is to test the algorithm. Now let us look at the reason for this occurrence:

The URL of the 'My orders' page on Amazon is https://www.amazon.com/gp/css/order-history, a URL depth of 3 and a longer click distance, while the URL of the first result in Bing is www.amazon.com/Search/my order, a URL depth of 2 and a slightly lower click distance.

Hence, it is clear that Bing gives importance to click distance and URL depth, even though the page with the longer URL may contain a link to the home page while the one with the shorter URL may not.

Example 2
The characteristics of Page-Rank can be illustrated by a small example.

Consider a small web consisting of three pages A, B and C, where page A links to pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of d admittedly affects the Page-Rank values, but it does not change the fundamental principles of Page-Rank. We therefore get the following equations for the Page-Rank calculation:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved, giving the following Page-Rank values for the individual pages:

PR(A) = 14/13 = 1.07692308

PR(B) = 10/13 = 0.76923077

PR(C) = 15/13 = 1.15384615

Notice that the sum of all pages' Page-Ranks is 3, equal to the total number of web pages; this is a general property of Page-Rank, not something specific to our simple example. For our three-page example it is easy to solve the corresponding system of equations directly, but in practice the web consists of billions of documents and the solution cannot be found by inspection.
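The simple iterative algorithm Google mentions can be sketched directly from the Page-Rank formula; iterating the equations for this three-page web converges to the values computed above:

```python
# Iterative Page-Rank for the three-page example with d = 0.5.
# links[p] lists the pages p links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5
pr = {page: 1.0 for page in links}  # arbitrary starting values

for _ in range(100):  # iterate well past convergence
    new_pr = {}
    for page in links:
        # sum PR(T)/C(T) over every page T that links to `page`
        inbound = sum(pr[t] / len(out) for t, out in links.items() if page in out)
        new_pr[page] = (1 - d) + d * inbound
    pr = new_pr

for page, value in sorted(pr.items()):
    print(page, round(value, 8))
# A 1.07692308
# B 0.76923077
# C 1.15384615
```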

Conclusion
In a broad sense, when Google and Bing are viewed as two search engines in the market delivering search results, the results more or less match, and similar queries return results that do not vary too much. But with these two being giants of the field, and with the importance given to the finer details of the ranking algorithm, differences do eventually arise. The importance of reputation and word of mouth in both cases should not be ruled out either. Still, after studying both search engines with the data available to us, it is fair to conclude that the public perception is in fact well placed, at least for now: Google's Page-Rank algorithm, thanks to a variety of factors (mainly its crawler and back-linking), seems to generate highly reliable, information-rich results for the user. Bing, on the other hand, is not too far behind, with many users already preferring it to Google, and the fact that Bing is relatively new also needs to be factored into the discussion. Maybe with some tweaking and patience, Bing might yet become the giant that Google is now.