User:Mathglot/sandbox/Using search engines effectively

In Talk page discussions at Wikipedia, many editors do not know how to use search engines effectively to find data about a discussion topic, how to interpret the search results counter ("we found 7 million results"), or how to properly analyze and assess the result list.

Engines

 * web search: Google, Bing, Wolfram Alpha, DuckDuckGo, Baidu, others
 * specialized: Books, Ngrams, Scholar, Trends
 * links to history of, etc.

Queries
How to build queries


 * quoted or not
 * plus and minus
 * site:
 * disallowing alt spellings
 * forcing required tokens
 * Wolfram Alpha - can give rich data, but building the right query is troublesome
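
As a minimal sketch of how these pieces combine, the Python below assembles a query string from a quoted phrase, excluded terms, and a site: restriction. The function name build_query and its parameters are hypothetical, invented for illustration. Operator support varies between engines, so treat the syntax as an assumption and check the engine's own documentation.

```python
from typing import Iterable, Optional


def build_query(
    phrase: Optional[str] = None,
    terms: Iterable[str] = (),
    exclude: Iterable[str] = (),
    site: Optional[str] = None,
) -> str:
    """Assemble a search query string (hypothetical helper, not an engine API).

    phrase  -- searched as an exact, quoted phrase
    terms   -- plain, unquoted words
    exclude -- words prefixed with a minus sign to exclude them
    site    -- domain used in a site: restriction
    """
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')          # quoted: match the exact phrase
    parts.extend(terms)                      # unquoted: engine may match variants
    parts.extend(f"-{word}" for word in exclude)
    if site:
        parts.append(f"site:{site}")         # restrict results to one domain
    return " ".join(parts)


query = build_query(
    phrase="washington senators",
    exclude=["baseball"],
    site="en.wikipedia.org",
)
print(query)
```

The result is only the text typed into the search box; it still has to be url-encoded before it is placed in a link.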

Search engine result counts
After you build and execute a query, how do you interpret the "We found 10.5 million results" tally?

Users commonly see results of web searches posted by other users, and they can tell from their own previous searches, or just from common sense, that something is very wrong with the search counts being given (e.g., a claim of 7.2 million results that everybody knows can't be right). But they don't know why it's wrong, so they give up in frustration, figuring that you can't possibly determine anything from search engine result counts. That conclusion is wrong, but understandable.

Filtered results
Even though the number stated may be large, search engines typically eliminate pages that are very similar to each other, to avoid showing the same mirrored or copied page many times. You can add a param to the search url to request unfiltered results, or you can do it in search settings... (how)  Normally, you want filtered results (the default), in order to avoid counting the same web document more than once.

Finding the actual count
The tally is an initial estimate, and it's not at all unusual for it to be off by several orders of magnitude (stating seven million results when there are actually 155). Finding the actual result count involves finding the last result page that actually contains your query terms.

Some search engines will just stop showing more result pages, so there won't be any more after the umpteenth, and others may post a notice. Here's a notice from Google, on the 13th page of a two-word, quoted search. The url for this page had a start param of 120; that is, start=120 was one of the url query param-value pairs in the url in the address bar of the Google results search page: "In order to show you the most relevant results, we have omitted some entries very similar to the 130 already displayed. If you like, you can repeat the search with the omitted results included."
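
To make the relation between the start param and the page number concrete, here is a small Python sketch that reads start out of a results url and writes a new value back in. The example url and the helper names are hypothetical, and the arithmetic assumes ten results per page, which is what the 13th-page example above implies.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

RESULTS_PER_PAGE = 10  # assumption: ten results per page


def page_number(url: str) -> int:
    """Return the 1-based results page implied by the url's start param."""
    params = parse_qs(urlsplit(url).query)
    start = int(params.get("start", ["0"])[0])  # no start param means page 1
    return start // RESULTS_PER_PAGE + 1


def with_start(url: str, start: int) -> str:
    """Return the same url with its start param set to a new offset."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    params["start"] = [str(start)]
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


# Hypothetical url, for illustration only.
url = "https://www.google.com/search?q=%22washington+senators%22&start=120"
print(page_number(url))      # page 13 when start is 120
print(with_start(url, 130))  # the same search, one page further on
```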


 * using the "next page" feature
 * skipping ahead pages
 * using the start param to search quickly (guessing, or binary search)
 * the
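
The binary-search idea can be sketched in Python as follows. This is an illustration under stated assumptions, not a tool that talks to any search engine: has_results is a hypothetical stand-in for "load the results page at this start offset and check whether it actually lists hits", and the sketch assumes ten results per page and that once one page is empty, all later pages are empty too.

```python
from typing import Callable


def last_page_start(
    has_results: Callable[[int], bool],
    page_size: int = 10,
    max_pages: int = 1_000_000,
) -> int:
    """Return the start offset of the last non-empty results page, or -1.

    has_results(start) is a hypothetical check supplied by the caller.
    """
    if not has_results(0):
        return -1
    # Guessing phase: keep doubling the page index until a page is empty.
    lo, hi = 0, 1  # page lo is known to have results
    while has_results(hi * page_size):
        lo, hi = hi, hi * 2
        if hi > max_pages:
            raise ValueError("no empty page found below max_pages")
    # Binary search: page lo has results, page hi is empty.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_results(mid * page_size):
            lo = mid
        else:
            hi = mid
    return lo * page_size


# Simulated engine whose query really has 155 results.
actual_count = 155
start = last_page_start(lambda s: s < actual_count)
print(start)  # offset of the last page that still shows results
```

With the last page's start offset in hand, the actual count is that offset plus the number of results shown on that page. Doing this by hand means editing the start value in the address bar a handful of times rather than clicking "next" on every page.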

Interpreting results

 * false positives (accidental collocations, etc.)
 * not bolded/not in snippet

Displaying and linking search engine urls
How can you include your search url in your Talk page discussion economically and successfully? After you've generated a query that you believe has useful data, you want to link it into the Talk discussion, but it doesn't display properly.

Urls with a query string may need to have certain reserved characters url-encoded, in order to display properly.

If you copy an url out of the address bar of a browser into the Talk page and attempt to make an external link out of it, it may not display correctly. If you don't escape ("url-encode") your url, you're likely to get a "display error". Here's an example of a search url, with all the extraneous search params removed, escaped, and placed into a wikilink: "washington senators" [p.14]. You have two options to fix this:
 * for a simple search, use a template like Google; e.g., (normally, should be subst'ed; not substed here, so you can see it in operation)
 * for a more complex search that includes additional features in the query string not supported by the Wikipedia template, you have to use an escaped url and link it yourself.
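
For the second option, the escaping can be done mechanically. The Python sketch below url-encodes a query with the standard library and wraps the result in external-link wikitext. The helper names and the base url are assumptions for illustration; writing square brackets in the link label as the HTML entities &#91; and &#93; keeps a literal "]" from closing the link early.

```python
from urllib.parse import urlencode


def search_url(query: str, base: str = "https://www.google.com/search") -> str:
    """Build a search url with the query url-encoded (base url is an assumption)."""
    # urlencode escapes reserved characters: '"' becomes %22, a space becomes '+'.
    return f"{base}?{urlencode({'q': query})}"


def wiki_external_link(url: str, label: str) -> str:
    """Wrap an already-escaped url and a label in external-link wikitext."""
    # Square brackets in the label would end the link, so use HTML entities.
    safe_label = label.replace("[", "&#91;").replace("]", "&#93;")
    return f"[{url} {safe_label}]"


url = search_url('"washington senators"')
print(url)
print(wiki_external_link(url, '"washington senators" [p.14]'))
```

An unescaped space or double quote in the url is the usual reason the link breaks, since the wikitext parser treats the first space as the end of the url.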

Note that the browser may play a role, too, because it might encode certain metacharacters under the hood, so you might think an url is good and works for you, but someone else pasting the url into another browser might not see what you do.

Scholar queries
Google Scholar searches academic journals, and only returns the first 100 pages of results. So, if the initial tally of search results says "About 2,300 results", you're not going to get to the last page of results, because it will stop at page 100 (results 991-1000).

To compare two queries and find their actual numbers, the search has to be narrowed somehow. Unchecking patents and citations limits it. Another way is to choose time windows that each generate fewer than 100 pages (1000 results); the counts for the windows can then be summed to provide the actual count.
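
The time-window approach amounts to splitting a date range until every window falls under the cap, then adding up the pieces. Here is a minimal Python sketch of that bookkeeping. The count_results callable is hypothetical: it stands for the manual step of running the query restricted to a range of years and finding the actual last result, and the simulated per-year figures are made up for the demonstration.

```python
from typing import Callable

CAP = 1000  # 100 pages of 10 results, per the limit described above


def total_count(
    count_results: Callable[[int, int], int],
    first_year: int,
    last_year: int,
    cap: int = CAP,
) -> int:
    """Sum actual counts over year windows that each stay under the cap.

    count_results(first, last) is a hypothetical stand-in that returns the
    number of results reachable for that window, never more than the cap.
    """
    n = count_results(first_year, last_year)
    if n < cap:
        return n  # window is under the cap, so the count is trustworthy
    if first_year == last_year:
        # A single year at the cap cannot be split by year any further.
        raise ValueError(f"{first_year} reaches the cap; narrow the query another way")
    mid = (first_year + last_year) // 2
    return (total_count(count_results, first_year, mid, cap)
            + total_count(count_results, mid + 1, last_year, cap))


# Simulated data (invented): true per-year counts totalling 2,300.
per_year = {2000: 300, 2001: 400, 2002: 500, 2003: 600, 2004: 500}


def fake_count(first: int, last: int) -> int:
    true_total = sum(per_year[y] for y in range(first, last + 1))
    return min(true_total, CAP)  # the engine never shows more than the cap


print(total_count(fake_count, 2000, 2004))
```

A window that lands exactly on the cap is treated as truncated and split again, which errs on the side of not trusting a capped number.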

Trends data
Google Trends data should never be used for questions about common name or notability. RfCs or discussions sometimes get bogged down by incorrect application of Google Trends data as if it were a reliable source, but it is not. (In theory, one could use it for one thing: to illustrate what people are searching for.)

Google Trends data show the terms people use in their online searches, and have no connection to the proportion of reliable sources on a subject. User searches are not reliable sources, and do not provide useful information on how to decide an issue like this. To see why this is so, consider the results of these two Google Trends data analyses, which try to determine whether Elvis is alive or dead, and whether the moon landing was real or faked. It is, of course, absurd; but that is the point: what people are searching for has no relation to what sources say. When thousands of people search for "Elvis is alive", that doesn't mean it's true (or false), it doesn't mean there are many (or any) reliable sources that make that claim, and it doesn't even mean that the person searching believes that Elvis is alive. It only means that they are searching for that expression, and nothing more. We really don't know why they searched for that term.

Trends data may be used cautiously, when intended to show what users were searching for during a particular interval; for instance, see Latinx (permalink).

Ngrams

 * very brief intro; notion of threshold; link to methodology;
 * incomparability of n to n+1 grams
 * false positives due to context
 * interpretation traps: part of speech, capitalization, hyphen and other punctuation; accidental collocations

Templates aiding search
One search template available is Google Wikipedia, but it leaves some stray text; e.g., (Arg2 will add your own anchor text.) If you mean to search for yourself when you're looking for something, use Advanced search. For example, search for templates called Quote or similar like this. See also Help:Search. Searching Google directly is fine. Rather than add the word wikipedia as a simple search term, though, encode it as a domain restriction: search, i.e.,. Looks like result #1 is the one you want.

What this essay is not
This essay is about using search engines effectively in Talk page discussions. It's not about describing how search engines are constructed, how an index is built, how the index is searched, what PageRank is or how it works, why some search results appear higher up in the results list than others, how to make your page appear in search or come up higher in the result set, or anything else related to search.

Sources to develop

 * discussion at a closed RfC, about a Google Trends data issue