Wikipedia:Historical archive/Search engine commentary

The following is some old stuff that probably should be deleted. It doesn't apply to Wikipedia's current search mechanism, which is completely different.

Many things came up on Friday, so I wasn't able to clean up and post the code. :-( On the weekends, I try not to work, because I have an 8-month-old baby at home, and talking to her is more important. So, Monday. Or during naptime tomorrow. :-)

This is a complete rewrite of this page. I'm going to work through everyone's comments. I will simply delete a comment if I've taken care of it, or reply if there's some reason why I haven't or don't intend to take care of it.

To keep this all very simple, I'm going to unattribute all of the comments and questions and just list them, like an FAQ or something.


 * What's the basic status of the new search engine?

The current version returns results from the full text of all the articles in Wikipedia. It is currently updated when I run a script, which I do frequently while I'm working on it. After today, it will be updated either every few hours or every night, depending on what I decide based on the server load.

The new version that I have written is fast -- it uses FastCGI and a btree file. It also has a semi-crude but semi-clever ranking algorithm for helping to push the best match to the top. The algorithm may be tweaked if we notice major empirical problems with it. It counts words in the title of an article much more strongly than words in the body of the article.
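To make the title-versus-body weighting concrete, here is a minimal sketch. It is not the actual search code, which has not been posted yet; the weights, the tokenizer, and the function names (`tokenize`, `score`, `rank`) are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical weights: the text above only says title words count
# "much more strongly" than body words, not by how much.
TITLE_WEIGHT = 10
BODY_WEIGHT = 1


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, title, body):
    """Sum weighted occurrence counts of each query word in title and body."""
    terms = set(tokenize(query))
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    return sum(
        TITLE_WEIGHT * title_counts[t] + BODY_WEIGHT * body_counts[t]
        for t in terms
    )


def rank(query, articles):
    """Return titles with a nonzero score, best score first (ties by title)."""
    scored = [(score(query, title, body), title) for title, body in articles.items()]
    return [title for s, title in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
```

With weights like these, an article whose title matches the query outranks one that merely repeats the word a few times in its body.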

The code will be released tomorrow morning, I hope. I need to clean it up a bit; it's sloppy.


 * How about a google search box?

Google search box added, simplified. Ignore the placement, I'll rearrange later. Do we still need this, given the fact that I'm doing fulltext? Perhaps this should be an option with the main search box, or perhaps I should just link to this search?


 * What about REDIRECT pages?

REDIRECT pages are completely ignored. Empirically, most of them are simple respellings that clutter the results. This has a cost, as per someone's "mountain lion" and "puma" example, but since we're doing fulltext, that cost has been minimized.


 * I see a formatting problem. I already mentioned it, but you didn't fix it?

Please tell me again. It might have gotten lost or maybe I thought I fixed it. Unless it is mentioned here, I think I fixed it.


 * How about full access to the traditional search using a search box?

I plan to add this later today. Actually what I plan to do is make this a radiobutton option.


 * Could we remove the "More" link and list everything on one page? Less hassle that way for everybody.

The reason for paging usually isn't to cause more pageviews for more ads; it's good practice not to put too much on a single page for people with slow modems (i.e. most people). After I get things stabilized, I will look carefully at what the optimal default should be (10? 15? 50?). Ideally, we should store your preference in the preferences cookie and respect that. So you can get 200 at a time if you want, and other people can get the standard default.


 * It wouldn't hurt to re-install the "edit this page" link on the top of the screen.

Oh, I see what you mean. Where I had it in the code before, it was showing up on pages that you can't really edit. Now it is not at the top of pages that you can edit. I'll study this.

(It isn't really about the search engine, though.)


 * What about /Talk pages?

Tough call. I'm thinking about removing them, but having the text on them still work and point to the main page. So if someone mentions a word on a /Talk page, you'll be sent to the main page instead when you search on that word.

That's problematic, of course. But currently, we return a lot of talk pages that are probably unnecessary.

The other thing to do is simply exclude all /Talk pages, period. I actually prefer this solution myself, but...

Anyhow, I think we should identify /Talk strictly as pages which are named such-and-such/Talk.
 * If you are going to exclude talk pages, please make it optional. (i.e. have an "Exclude Talk Pages" checkbox, either on a search form or on user preferences.) -- Simon J Kissane


 * I think the text excerpts could stand to be a bit longer--perhaps twice as long. I don't think it helps particularly to have that text italicized.

I disagree. I can make them a little longer, but remember that we want the page to load quickly for people. The idea is not to read the page here, but to just get a quick idea of whether this is your context. Look at what Google does -- I'm already returning significantly more.

I think the italics look nice. :-)


 * It would be great to have a publicly-editable list of pages that are deliberately excluded from search results (so we can exclude personal pages from encyclopedia search results, for example)

I think that a better solution will come when we move to a MySQL solution. Certain pages can be flagged as personal, for example, and then handled differently. For now, this is a lot of work for minimal benefit.


 * "Search again using these other engines:" could stand to be reworded

All of the things listed there will change soon, so as to lean more towards encyclopedias. That was just cut and pasted from another site I own.


 * Will substring searches be allowed? For instance, if someone wanted to search for 'rquez' to find both 'Marquez' and 'Márquez' and convert some of the entries?

I can't do substring searches with my current setup, period. I should emphasize that I can't do it, which is not to say that it can't be done.

However, the right thing for a search engine to do with your example is to automatically list all the Márquez entries as such, but to ALSO list them under Marquez by squashing the fancy 'a' down to a regular 'a'. In this way, people can type either and get decent results. Right now, I don't do that. Anything that people type goes into the system as-is. This is good in a way but bad in a way.

One thing to keep in mind is that at least on the English language wikis, most people won't have the least clue how to type in those fancy foreign letters. I don't. (I had to cut and paste yours to include it above!)
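The accent squashing described above is not implemented yet, but a minimal sketch of one way to do it could look like this. It uses Unicode decomposition from Python's standard library; the function names `fold_accents` and `index_keys` are hypothetical, and indexing each word under both spellings is an assumption about how the index would be built.

```python
import unicodedata


def fold_accents(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(word):
    """Keys to file a word under: the spelling as typed plus the folded one."""
    lowered = word.lower()
    return {lowered, fold_accents(lowered)}
```

Filing 'Márquez' under both keys means a search for either 'Márquez' or 'Marquez' finds it, while a page that only ever says 'Marquez' keeps its single key.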


 * Ok, but look at this: a search for "rock and roll" turns up Japanese cuisine: "Being an island nation, Japanese cuisine is heavily into seafood. Beef is extremely expensive in Japan, it is a luxury ..." From that I discovered that searching for "rquez" (to find Gabriel Garcia Marquez, both with the accent and without) will return 0 results but searching for "gabriel rquez" will return both the ones with the accent over the "a" and those without. The problem is that it also turns up anything with "Gabriel." But searching for a string of nonsense (e.g. "xrlfsdytq") followed by "rquez" will not turn up results.... Also, searching for a word that appears on only one page ("Naqoyqatsi") followed by "rquez" will not turn up one irrelevant result and then all the ones you're really looking for; it will only turn up the one irrelevant result. Searching for an uncommon but relevant word (such as "solitude") will for some reason not turn up the author's own page. At any rate, I thought you might like to know that the potential for substring searches is there in your code already, if someone can just hash through how. :-) --Koyaanis Qatsi


 * I don't think you stumbled on a hidden substring search here: "and" is a stop word and will be ignored in the search; if you search for "rock roll" you will still find Japanese cuisine. The code that highlights the matches in returned result fragments is slightly buggy: it will highlight all substrings that match any of the submitted search terms, even if the search code didn't use those substrings. --AxelBoldt


 * Hm. I hadn't thought of that.  Good point.  --KQ


 * I just searched for "ring" and got all sorts of results back that contain a single r, but the one I was looking for, mathematical ring, didn't show up.

Aha. That's funny. Bomis, which is my main site, and the site that pays the bills for all our Wikipedia and Nupedia fun and games, is a web 'ring' search engine. So 'ring' is a stopword there.

The cause of the problem you identified is that I did a cute trick with 'ing', basically causing the search engine to treat 'thinking' and 'think' in the same way. I do this with 's' at the end, too. With 's', this cute trick eliminates all the woes of singulars and plurals, so that "horses" and "horse" return the same thing, which is good.

There are some funny side-effects, I see. I'll make an exception for 'ring' and 'thing' on the next revision!
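For anyone following along, here is a rough sketch of the suffix trick with an exception list. It is an illustration only, not the Bomis code: the name `crush`, the order of the two rules, and the idea of passing the exceptions as a set are all assumptions.

```python
# Words that should never be shortened. Only the two exceptions promised
# above are listed; more can be added to the set as they turn up.
EXCEPTIONS = {"ring", "thing"}


def crush(word, exceptions=EXCEPTIONS):
    """Strip a trailing 'ing' or 's' unless the word is a listed exception."""
    word = word.lower()
    if word in exceptions:
        return word
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("s"):
        return word[:-1]
    return word
```

Without the exception list, 'ring' would shrink to a bare 'r', which fits the symptom reported above. Note that a naive rule like this still shortens words such as 'gauss' to 'gaus', so the same function has to be applied to titles, bodies, and queries alike for matches to line up.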


 * Yup, and maybe also for "wing", "sing", "is", "us", "loss", "gross", "mass", "was", "class" etc.


 * And another thing: I searched for "gauss" and the article about Gauss appeared after several articles which contain the word "Gauss" only once. --AxelBoldt

See the previous entry for a clue as to the cause. My immediate thought is that I have a bug -- in the body, I silence a closing 's', and in the title, I don't. So your search for 'Gauss' actually searches for 'Gaus' which, in the body, is equivalent to 'Gauss'. But I didn't do the trick correctly in the title.

This is kind of fun. I'm giving away all my "secrets" from the Bomis search engine. I've never thought they were all that valuable as secrets, but they have been secret for a few years now. Cute tricks, mostly. :-)

If you look at the history of an article and then use the search box on that page, you get "Invalid URL".

I really don't like the stripping of final -s and -ing. If I do a search for "horses" it's because I want "horses" -- I don't want to have to wade through masses of entries which contain only "horse". At the very least there should be some way of switching this annoying behaviour off, and it should be clearly explained somewhere. --Zundark

I agree that it should be explained somewhere. But I think that it's not annoying behavior. Like many other design choices, it's really an empirical matter. Does it help more often than it hurts? In my experience, it helps on the vast majority of searches, but hurts only sometimes.

The horse example is a good one. It seems very unlikely to me that someone would really care much about 'horse' versus 'horses'. Other examples, though, illustrate the downside more clearly. For a few words, chopping off the 's' smashes together two words of very different meanings, thus cluttering the results.

But more generally, and this is particularly true of a site with an overall smallish set of data (and Wikipedia still is, despite our fast progress, pretty small as compared to the web as a whole!), it helps a *lot*. If you're searching for information about Aleutian indians or Aleutians, you're searching for the same general sort of thing, and there's room in our search results to show both. (Basically, we don't have an article on either, so we'd better show you 'Aleutian Islands', which we do have.)

A better thing to do, and perhaps this is a useful compromise that I can think about, is to keep the two separate in the database, but upon searching, to search for both forms at the same time, and then to blend the results. Exact matches are given more weight than inexact matches. This should drive your 'horses' article to the top, if it exists, while also returning the 'horse' articles ranked lower. Or vice-versa as the case may be.
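A minimal sketch of that compromise, assuming two in-memory indexes (one keyed by the exact word, one by the shortened form) and made-up weights. The names `build_indexes` and `search`, the 3-to-1 weighting, and the simplified `crush` that only drops a final 's' are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weights: an exact match counts more than a shortened-form match.
EXACT_WEIGHT = 3
STEM_WEIGHT = 1


def crush(word):
    """Simplified for this sketch: drop one trailing 's' only."""
    return word[:-1] if word.endswith("s") else word


def build_indexes(articles):
    """Map each exact word and each shortened word to the titles containing it."""
    exact = defaultdict(set)
    stemmed = defaultdict(set)
    for title, text in articles.items():
        for word in text.lower().split():
            exact[word].add(title)
            stemmed[crush(word)].add(title)
    return exact, stemmed


def search(term, exact, stemmed):
    """Look up both forms, then blend so that exact matches rank first."""
    term = term.lower()
    scores = defaultdict(int)
    for title in stemmed.get(crush(term), ()):
        scores[title] += STEM_WEIGHT
    for title in exact.get(term, ()):
        scores[title] += EXACT_WEIGHT
    return sorted(scores, key=lambda t: (-scores[t], t))
```

Searching for 'horses' would then list pages containing 'horses' first and pages containing only 'horse' after them, and the reverse for a search on 'horse'.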

And I also agree that there should be a way to turn off any behavior that anyone doesn't like. But this is somewhat advanced for now. Still, and this is especially true once I publish the code (Tuesday, I bet), we can all pitch in to polish it up.

The main thing to remember is that search engines need to return what people are really looking for, even when they aren't good at formulating a proper request. So we want to 'fail gracefully'. If someone enters 'Aleutians' we want to give them something potentially useful, and not pretend that 'Aleutian' isn't relevantly similar. So crushing 's' is surely a part of any valid search strategy.


 * There's also a need for a power search mode where you have a complete, up-to-date (albeit slow), regular expression search without any heuristics, like the old search box. This is mainly useful for authors who want to fix links and other things. --AxelBoldt


 * Searches for these words should turn up results but do not:
 * "group", "word", "problem", "history", "time", "day", "second", "refer", "science", "number", "rule", "theory", "force"


 * as of 30-September-2001, most of these are fixed, except: "group", "history", "time", "day", "number".

Searching from an article history page or from a traditional search results page gives "Invalid URL." --AxelBoldt

I was looking for a comment I had made recently about Hitchcock's The Man Who Knew Too Much. "man who knew too much" turned up a load of irrelevant results--I forgot it didn't search for the exact string, but for the occurrence of the words--so then I looked for Hitchcock, which turned up only relevant results but not that comment. Then I searched for imdb, since I also mentioned the IMDb; that search also did not turn up the comment (which is about 5 days old, I think). So my question is: how often is the search info updated? --KQ

Search for "database" returns several relevant pages, but not the article with title "Database". --AxelBoldt

Search for "Paris" does not return the Paris page, but does return articles with the word "paring". This would appear to be because the search engine is dropping the final 's' in the word "Paris". - Tim

So if I want to search for cooking and I don't want to wade through a billion matches for cook, how do I do that? Maybe we could have "do what I actually want" and "do what Larry thinks I want" as checkboxes ;) -- Greg Lindahl

The search database needs to be updated more often. For instance, it is clear that the Friends of Wikipedia page is not in the search database, and it was created on September 6, two weeks ago. --AxelBoldt

Near the top of this page, there is a paragraph:

 The current version returns results from the full text of all the articles in Wikipedia. It is currently updated when I run a script, which I do frequently while I'm working on it. After today, it will be updated either every few hours or every night, depending on what I decide based on the server load. 

The "today" mentioned was quite some time back. Apparently the search index has not been updated for weeks. Is the cron job broken or something?

The lettered globe in the upper right corner of the search results page should be turned into a link to the home page.

The RecentChanges link on the search results page needs changing to Recent Changes.