User:HervGerv/Internet linguistics

Article body
The Web is clearly a multilingual corpus. It is estimated that 71% of the pages (453 million out of 634 million Web pages indexed by the Excite engine) were written in English, followed by Japanese (6.8%), German (5.1%), French (1.8%), Chinese (1.5%), Spanish (1.1%), Italian (0.9%), and Swedish (0.7%).

A test to find contiguous words like ‘deep breath’ revealed 868,631 Web pages containing the terms in AlltheWeb. The number found through the search engines are more than three times the counts generated by the British National Corpus, indicating the significant size of the English corpus available on the Web.

The massive size of text available on the Web can be seen in the analysis of controlled data in which corpora of different languages were mixed in various proportions. The estimated Web size in words by AltaVista saw English at the top of the list with 76,598,718,000 words. The next is German, with 7,035,850,000 words along with 6 other languages with over a billion hits. Even languages with fewer hits on the Web such as Slovenian, Croatian, Malay, and Turkish have more than one hundred million words on the Web. This reveals the potential strength and accuracy of using the Web as a Corpus given its significant size, which warrants much additional research such as the project currently being carried out by the British National Corpus to exploit its scale.

Edits:

The Web is a multilingual corpus. Of the languages used online, English is the most dominant, accounting for 25.9% of Internet users, followed by Chinese (19.4%), Spanish (7.9%), Arabic (5.2%), Indonesian and Malaysian (4.3%), Portuguese (3.7%), French (3.3%), Japanese (2.6%), Russian (2.5%), and German (2%).

The number of words found through search engines is more than three times the count generated by the British National Corpus, indicating the significant size of the English corpus available on the Web. This, along with the sheer number of words recorded in many different languages, reveals the potential strength and accuracy of using the Web as a corpus.

Notes:

The body paragraph draws on figures from several search engines (Excite, AlltheWeb, and AltaVista) that are no longer in service, whereas today's Internet landscape is dominated by Google. Such comparisons are no longer relevant. Additionally, I think the paragraph reads more like a persuasive essay than an informational entry, so I shortened it and tried to give it a more neutral tone.