Help:Searching/Regex

To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/.

Indexed search
All pages on Wikipedia are scanned and indexed by Wikipedia's own search engine. The entire wiki is treated as one "full text" kept in a separate database (an "index") built just for searching. It's like the index in a book, but practically every word and every number is indexed to every page.

Since each word in the prebuilt search index already points to the pages that contain it, a keyword search usually corresponds to a single record lookup in the index. (This is also true for phrases, to a certain extent.) "Index searches" take basically no time to execute. They are cheap and plentiful.

There are separate indexes kept updated for:
 * titles
 * visual content
 * wikitext
 * templates

Any text transcluded from a template is indexed as if it were really present on its target page. In other words, by default, a keyword search is done on the text of the rendered Wikipedia page, not on the page source itself. However, you can change this by using insource: to search the source markup instead of the rendered page.

Wikipedia's servers prepare and maintain the search indexes in the background, in near real time. A few seconds after you save a page, you can search for the changes you just made. For templates that are transcluded onto many pages, propagating those changes to every affected page in the index might take a while.

The index is based on alphanumeric characters; it stores no information on non-alphanumeric characters. If you type any punctuation or brackets into the search box when doing an indexed search, those characters will be silently discarded.
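The lookup described in this section can be illustrated with a toy inverted index. This is a minimal sketch, not Wikipedia's actual search engine: the function names, the sample pages, and the ASCII-only tokenizer are all assumptions made for illustration, and stemming and ranking are not modeled.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Keep only runs of ASCII letters and digits, lowercased.
    # Punctuation and brackets are silently dropped, as in an indexed search.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    # Map each word to the set of page titles that contain it.
    index = defaultdict(set)
    for title, text in pages.items():
        for word in tokenize(text):
            index[word].add(title)
    return index


def keyword_search(index, query):
    # Each word is a single lookup; multiple words are intersected.
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample pages, for illustration only.
pages = {
    "Page A": "C-3PO is a droid.",
    "Page B": "The droid (R2-D2) beeps!",
    "Page C": "No robots here.",
}
index = build_index(pages)

print(sorted(keyword_search(index, "droid")))     # ['Page A', 'Page B']
print(sorted(keyword_search(index, "(droid)!")))  # ['Page A', 'Page B']
```

Because the punctuation in "(droid)!" never reaches the index, both queries return the same pages.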

A basic indexed search
 * searches only article space by default.
 * matches only letters and numbers. This is usually not a problem.
 * returns a lot of search results, so you rely heavily on page ranking rules. You then refine the results based on the topmost pages, using the "not" filter: a minus sign attached to the front of an unwanted word filters out page-hit noise you could not have predicted.
 * is an "aggressive matcher" including as many pages as it can by matching all forms of each word you enter.

Regular expression search
Instead of doing a basic indexed search on keywords, you can perform a regex search, which bypasses the index. A regex search scans the text of each page on Wikipedia in real time, character by character, to find pages that match a specific sequence or pattern of characters. Unlike keyword searching, regex searching is by default case-sensitive, does not ignore punctuation, and operates directly on the page source (MediaWiki markup) rather than on the rendered contents of the page.

To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/. Here, regex denotes a regular expression in MediaWiki-flavored regular expression syntax.

Use regexes responsibly
Because regex searching scans each page character by character, it is generally much slower than an index search. You can, and should, add additional search terms to a regex search to reduce the amount of text being processed. For example:


 * finds pages that match a case-insensitive stemmed keyword search for "polish" (including "polished" or "polishing"); then does a case-sensitive regex search within those pages. Only pages that match both filters are returned.


 * is similar, but starts with a case-insensitive search of the source markup instead of the rendered page (so it will find usages like, and not find transclusions).


 * ,, and   are excellent filters.


 * is a good filter.

Adding an index-based search term to reduce the amount of text being scanned is important simply to make your own regex search finish in a reasonable amount of time. Regex searches that take too long will "time out" and return only partial results. Overuse of slow regex searches might cause temporary throttling of the feature for yourself, for everyone on Wikipedia, or both. (However, you cannot affect the site performance of Wikipedia as a whole simply by abusing regex search.) Remember that a single regex search can take multiple seconds, and that Wikipedia has a very large number of registered users. Use regex search responsibly.
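The benefit of combining a cheap filter with a regex can be sketched as a two-stage search. This is only an illustration under stated assumptions: the function name, the sample pages, and the regex Polish are hypothetical, and the case-insensitive substring test merely stands in for a real index lookup.

```python
import re


def filtered_regex_search(pages, keyword, pattern):
    # Stage 1: cheap, case-insensitive keyword filter (stand-in for the index).
    # Stage 2: case-sensitive regex scan, run only on pages that passed stage 1.
    regex = re.compile(pattern)
    keyword = keyword.lower()
    scanned = 0
    hits = []
    for title, source in pages.items():
        if keyword not in source.lower():
            continue  # never reaches the expensive regex scan
        scanned += 1
        if regex.search(source):
            hits.append(title)
    return hits, scanned


# Hypothetical page sources, for illustration only.
pages = {
    "Poland": "The Polish language is spoken in Poland.",
    "Furniture care": "Apply polish to the wood and buff.",
    "Shoes": "Leather shoes need regular care.",
}

print(filtered_regex_search(pages, "polish", r"Polish"))  # (['Poland'], 2)
```

Only two of the three pages are scanned by the regex, and only the page matching both filters is returned.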

Metacharacters
MediaWiki's regular expression syntax works like this:


 * Most characters represent themselves. For example, insource:/C-3p0/ will search for pages containing the literal string "C-3p0" (case-sensitive).
 * Certain metacharacters, described in the items below, are treated specially. Any metacharacter can be escaped by preceding it with a backslash (\). Preceding any other character with a backslash is harmless. For example, the regex yes\.no matches pages containing the literal string "yes.no" (case-sensitive). Regex experts should note that \n does not mean "newline," \d does not mean "digit," and so on: in MediaWiki syntax, the only use of \ is to escape metacharacters.
 * is special because it indicates the end of the regex. For example,  is treated the same as   (because the keyword search for   ignores punctuation). The   character must be backslash-escaped everywhere it appears inside a regex – even inside square brackets or quotation marks.
 * The period (.) matches any single character. For example,  is matched by ,  ,  , etc.
 * Parentheses ( ) group a sequence of characters into an atomic unit.
 * The vertical bar (|) goes between two sequences and matches either of them. For example,  matches either   or.
 * The plus sign (+) matches the preceding character or group one or more times. For example,  is matched by ,  ,  , etc.   matches  ,  , etc.
 * The asterisk (*) matches the preceding character or group any number of times (including zero). For example,  is matched by ,  ,  , etc.
 * The question mark (?) matches the preceding character or group exactly zero or one times.
 * Curly braces ({ }) match the preceding character or group a fixed number of times. For example, [a-z]{2} matches exactly 2 lowercase letters in a row. [a-z]{2,4} matches any string of 2, 3, or 4 lowercase letters. [a-z]{2,} matches any string of 2 or more lowercase letters.
 * Square brackets ([ ]) introduce a character class, which matches a single instance of any of the characters in the class. For example,  matches both   and  . Characters inside square brackets generally don't have to be escaped, although escaping them remains harmless, and the slash (/) still needs to be escaped everywhere. For example,   matches a single instance of ,  ,  , or.
 * Inside a character class, the character ^ (if it appears first of all) represents negation, and the character - (unless it appears first or last) represents a range. For example, [A-Za-z0-9_] matches any alphanumeric character or underscore, and [^A-Za-z] matches any non-alphabetic character.
 * Angle brackets (< >) stand for numbers treated as numbers, not characters. For example,  is matched by ,  , ...  ,  , but not  . (But it will also match the first six characters of  .)
 * The tilde (~) "looks ahead" and negates the next character or group. For example,  should match the first five characters of   but not the first five characters of.
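Several of the metacharacters in the list above behave the same way in Python's re module, so they can be tried out locally. This is a hedged sketch only: Python's regex flavor is not the one used by Wikipedia's search, the helper name and test strings are hypothetical, and the operators that Python lacks or treats differently (such as numeric ranges and the negating "look ahead") are deliberately left out.

```python
import re


def matches(pattern, text):
    # True if the whole of `text` is matched by `pattern`.
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected) triples covering the shared subset of syntax.
examples = [
    (r"yes\.no", "yes.no", True),      # escaped period is a literal period
    (r"yes\.no", "yesXno", False),
    (r"c.t", "cat", True),             # period matches any single character
    (r"(ab)+", "ababab", True),        # group repeated one or more times
    (r"cat|dog", "dog", True),         # alternation
    (r"ab*", "a", True),               # zero repetitions are allowed
    (r"colou?r", "color", True),       # optional character
    (r"[a-z]{2}", "ab", True),         # exactly 2 lowercase letters
    (r"[a-z]{2}", "abc", False),
    (r"[a-z]{2,4}", "abcd", True),     # 2, 3, or 4 lowercase letters
    (r"[a-z]{2,}", "a", False),        # 2 or more lowercase letters
    (r"[A-Za-z0-9_]", "_", True),      # alphanumeric character or underscore
    (r"[^A-Za-z]", "7", True),         # negated class: non-alphabetic
    (r"[^A-Za-z]", "q", False),
]

for pattern, text, expected in examples:
    assert matches(pattern, text) == expected, (pattern, text)
print("all examples behave as described")
```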

There are a few additional quirks of the syntax:


 * The metacharacter @ is a synonym for .* (match any sequence of characters at all).
 * A search  fails, although   and   both succeed.
 * are an escape mechanism, like square brackets or the backslash. For example,  means the same thing as.
 * The character  is also a metacharacter and must be escaped.
 * Regex experts should note that \n does not mean "newline," \d does not mean "digit," and so on.
 * Regex experts should note that ^ does not mean "beginning of text" and $ does not mean "end of text." Searching from the beginning or end of a Wikipedia page is not generally useful.

Workarounds for some character classes
Although character classes,  ,   are not supported, you may use these workarounds:

In these ranges, " " (space) is the character immediately following the ASCII control characters, "!" is the character immediately following space, and "􏿽" is U+10FFFF, the last code point in Unicode. Thus, the range from " " to "􏿽" includes all characters except for the control characters below U+0020 (of which articles may contain newlines and tabs), while the range from "!" to "􏿽" includes all characters except for those control characters and space.
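The ranges described above can be checked with Python's re module, which also accepts character-class ranges up to U+10FFFF. This is a sketch for experimentation only: the constant names are hypothetical, and Python's regex engine is merely standing in for the one used by Wikipedia's search.

```python
import re

LAST = "\U0010FFFF"  # the last code point in Unicode

# Range from space to U+10FFFF: everything except control characters below U+0020.
NOT_CONTROL = re.compile("[ -" + LAST + "]")
# Range from "!" to U+10FFFF: additionally excludes the space character.
NOT_CONTROL_OR_SPACE = re.compile("[!-" + LAST + "]")
# Negated ranges: match only what the ranges above exclude.
CONTROL = re.compile("[^ -" + LAST + "]")           # e.g. newline, tab
CONTROL_OR_SPACE = re.compile("[^!-" + LAST + "]")  # e.g. newline, tab, space

print(repr(CONTROL.search("line1\nline2").group()))  # '\n'
print(CONTROL_OR_SPACE.findall("a b\tc"))            # [' ', '\t']
print(NOT_CONTROL_OR_SPACE.findall("a b"))           # ['a', 'b']
```

The negated forms act as stand-ins for "newline or tab" and "any whitespace", while the plain "!" range acts as a stand-in for "any non-whitespace character".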