Wikipedia:Modelling Wikipedia extended growth

This essay considers issues in projecting the long-term growth of Wikipedia in terms of article count, including growth driven by many other types of articles beyond the traditional encyclopedia and major pop-culture articles. The extended-growth model considers factors that will create new articles beyond the millions of articles that currently exist (live count).

The growth of Wikipedia, although reduced, is not slowing as much as was predicted in 2007, nor is it skyrocketing as was predicted in 2005. The extended model predicted that Wikipedia would exceed a total of 3 million articles in mid-August 2009, rather than at year-end (this occurred on 17 August 2009). The model predicted that the 3.5 millionth article would be added in mid-September 2010, but that occurred on 12 December 2010 instead.

Wikipedia extended growth model


An alternative possibility for the growth of Wikipedia is a more protracted, long-term decline in new articles: neither the original exponential burst that doubled each year, nor a balanced bell curve that peaked in late 2006. Instead, an extended-growth model should be considered in which the midpoint, or mean size, occurs during 2010–2011 at 4 or 5 million articles, with that size doubling to nearly 9–10 million articles in the long term. The additional millions of articles will be various types of follow-on articles, created after the major articles have mostly become stable.

The psychological motivation for the follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about any notable topic that anyone can think about. That motivation is probably much stronger than refining the existing articles to be a comprehensive treatment of each topic (see below: Psychological motivation).

Graphical model fits overall pattern
The model was developed as a graphical curve to fit the overall pattern of the data, which does not follow a simple mathematical model because many batches of new articles are added by wiki-bots and short-term groups, rather than as "random" additions by the general public. Hence, there is no simple equation which could fit the actual data, which fluctuates wildly when robotic bot-programs are triggered to load numerous new articles in some months, such as for numerous protein-sequence articles. There is no simple mathematical "process generator" to simulate new-article growth. A detailed operational model would not be an equation, but rather a logical, procedural computerized model. However, the growth impact of articles from the general public has been much greater than that of the short-term group efforts, so that the overall pattern appears to be a somewhat linear decline in the growth rate for new articles, when averaged over several months, or over 3 future years, at a time. Perhaps a rough equation would reduce the new-article growth by 11% each year, with the understanding that new-article creation slows further each June/July but rises in August, each year, probably tied to school vacations in the Northern Hemisphere. Bear in mind that if a massive bot were triggered to load 700,000 new articles from a "Who's Who in Science", then the new-article rate would soar for months, and would always appear as an anomaly, an upward bump, in the declining overall curve during the next 20–65 years.
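The rough 11% rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the graphical model itself: the starting figure of 500,000 new articles per year and the function names are hypothetical, while the 11% annual decline and the 700,000-article bot upload are taken from the text. The seasonal June/July and August effects are not modelled.

```python
def project_new_articles(first_year_count, annual_decline=0.11, years=5):
    """Projected new-article count for each year, assuming the count
    shrinks by a constant fraction (11% by default) every year."""
    counts = []
    current = float(first_year_count)
    for _ in range(years):
        counts.append(current)
        current *= 1.0 - annual_decline
    return counts


def add_bot_bump(counts, year_index, extra_articles):
    """Return a copy of the projection with a one-off mass upload
    added to a single year, leaving the other years unchanged."""
    bumped = list(counts)
    bumped[year_index] += extra_articles
    return bumped


# Hypothetical starting point: 500,000 new articles in the first year.
HYPOTHETICAL_FIRST_YEAR = 500_000

baseline = project_new_articles(HYPOTHETICAL_FIRST_YEAR, years=5)
# A 700,000-article bot upload (figure from the text) landing in year 3.
with_bump = add_bot_bump(baseline, year_index=2, extra_articles=700_000)

for year, (plain, bumped) in enumerate(zip(baseline, with_bump), start=1):
    print(f"year {year}: baseline {plain:>10,.0f}   with bot upload {bumped:>10,.0f}")
```

In the printed comparison, the bump appears in only one year, after which the curve returns to the declining baseline, which is why such an upload would show up as an anomaly rather than as a change in the trend.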

Continued growth for follow-on articles
The initial base of Wikipedia articles covered the traditional encyclopedia and mainstream pop-culture articles, including historical figures, world events, catalogs of scientific terms, celebrities, entertainment topics, and famous sports figures. Those topics, after 6 years of expansion were thought to be saturated, so that the primary growth of Wikipedia would quickly decline and end within 5 years.

However, growth can be expected for several other types of articles, listed below. (Even the fan-cruft articles will be notable, because thousands or millions of people might be affected, however briefly, and the topic will be covered by some mainstream media sources.)
 * unresolved redlink articles – linked because authors expected notability (someday);
 * spinoffs – sets of sub-articles created when large articles are split;
 * disambiguation pages – whenever 2 or more articles have similar titles, expect a page to separate them;
 * unseen-hand articles – these are the supporting cast & crew, or assistant leaders, as the power behind the throne that made things happen;
 * lost-world articles – these are the long-lost, buried civilizations, failed inventions, secret societies, or forgotten heroes;
 * also-ran articles – these are the contenders, or losing players, just outside the limelight;
 * technical artifacts – such as cars, consumer electronics, electrical parts, scientific instruments, software, weapons. Thousands (millions) of new models enter the market each year, and millions were notable from the past (e.g. IBM 1620).
 * chemicals – it is estimated that some 10 million substances like 2,2-dimethylbutane have been described non-trivially in the literature (with some other information besides their mere existence and formula).
 * species – estimates for the number of species range in the millions, all with some nontrivial information published somewhere.
 * stars – there are several star catalogues with millions of stars listed, yet only a fraction are covered on Wikipedia.
 * fan-cruft articles – these are detailed or pop-culture topics, such as one-event clothing designs, that get mentioned (briefly) in mainstream news.
 * additional articles for new things of established sorts: new books, new films, newly notable performers, newly elected politicians, new major athletes, new scientific discoveries, new major products. Where major prizes are notable, there will be new people in that group every year. This portion can never become saturated, though the growth can become linear.
 * expansion of the English Wikipedia (enWP) into fuller coverage of other cultural areas; for example, we are much more saturated for UK railroad stations than for ones in India.

Because of the large array of follow-on articles, there seems to be great potential for creating masses of new articles, beyond the millions of traditional encyclopedia and major pop-culture articles.

Annual growth rate of new articles
The table below shows the increasing article counts for the English Wikipedia:

Articles as resolved redlinks
If Wikipedia's growth were nearing an end, then most articles would have their major redlinks already resolved with the intended linked articles. However, many articles still contain 6 or more redlinks. Improbable redlinks are often removed from articles, so the remaining redlinks typically point to notable topics. They will include: nearby mountain names, wildlife reserves, rivers, bays, towns, key personnel, book/film titles, special varieties, etc. Such topics are easily defended as being notable, so the redlinks are a major influence on creating new notable articles.

Articles as disambiguation pages
A common type of new article is a disambiguation page, which offers a choice of articles related to the same title. Originally, the choice was between items having exactly the same name, such as "John Smith" or "Mary Jones" or "Leonardo". However, variations of a title were added as potential matches, in a manner similar to word-prefix searches. As a result, disambiguation pages began listing organized groups of potential matches for a partial title, carefully grouping people, companies, towns, films (etc.) with a short description of each.

A disambiguation page can be so comprehensive, and descriptive, that it acts like search-engine results "on steroids", as a structured, informative scan that would be a lofty goal for a search engine to attain. Because of the exceptional information distilled by the disambiguation pages, they can be valuable additions to Wikipedia, and hence a major source of welcome new pages. In February 2009, Wikipedia had nearly 108,000 disambiguation pages, more than the entire size of Wikipedia back in early 2003. In early 2009, perhaps 1–2% of the daily growth in new articles consisted of disambiguation pages. By 2014 the count of disambiguation pages had grown to over 250,000.

Articles as lost-world topics
The search for knowledge often illuminates the worlds of yesteryear. Archaeologists have excavated for decades at Emperor Qin's Terracotta Army in Xian (China), at the fields of Ephesus, the hills of Copan, the ruins at Carchemish, inside many Caribbean shipwrecks, under lava flows near Pompeii, and in ancient temples at Edfu, Abydos or Kom Ombo along the Nile. As new discoveries are pieced together, thousands of ancient topics gain the details to become full articles.

The world of antiques, with furniture and household items, instantly provides many thousands of topics for new articles.

Paleontologists are expanding the fossil record in many areas: as the arctic glaciers melt, numerous fossils are sometimes found on the surface that was formerly under the ice; and even in Africa, where dinosaur remains were rarely seen, numerous fossils are being discovered.

Many thousands of articles can be expected on lost-world topics.

Articles as unseen-hand issues
Behind, or beneath, the major, popular topics are the "unseen hand" articles. The supporting cast and crew (sometimes with a "cast of thousands") eventually become known well enough to fill new articles.

These articles include the people, with their inventions, who sold their novel ideas to Thomas Edison.

Psychological motivation for new articles
The English Wikipedia, since early 2005, has added over 1,000 new articles every day. However, the number of articles being refined and polished to meet featured-article status is only a few a day. Clearly, the ratio of 1 featured article for every thousand indicates some key psychological factors are involved.

The psychological motivation for creating so many new, follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about almost any notable topic that anyone can imagine. For example, there are over 33,000 English Wikipedia articles about professional footballers (soccer players), and many of those articles are read daily, by someone somewhere. In contrast, for the more traditional field of mathematics, there are perhaps 21,000 total articles. However, new articles are still being added.

Meanwhile, the process of refining articles to reach featured-article status, as a comparison, involves weeks of changes and reviews. Plus, the criteria used to screen articles can become severe: some reviewers even request that the phrasing within an article be made more diverse, by eliminating repetition of ordinary phrases. It is not enough to just describe all major aspects of a topic; those articles must also meet certain literary standards. During 2008, over 100 articles lost their featured-article status, as criteria became perhaps more strict about the quality required for featured level.

As a consequence, the motivation is probably much stronger to create new (brief) articles which provide a general introduction to each topic than to refine or polish the existing articles into comprehensive treatments of their topics, according to a carefully defined set of high-quality criteria.

Growth as a percentage of prior year
There has been a decline in the daily growth of articles for 17 years, after the initial spurt phase. If the annual decline from 2025 is about 5% fewer new daily articles each year, then each year adds about 95% of the prior year's new-article count. Under that form of model, total articles will continue to grow. The following table shows the daily new-article count, reducing by 5% annually:

From 2024, the daily new-article count is a guesstimate: the purpose of the table is to show how article growth might continue. However, the actual new-article counts will differ. The actual counts could increase, especially if AI is used to auto-generate stub articles for redlinks, such as by auto-searching for matching source webpages, then auto-generating footnotes and inserting key phrases or infobox details within each stub article.
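The arithmetic behind such a table can be sketched as follows. The starting value of 500 new articles per day is a placeholder, not a measured figure, and the function names are illustrative; only the 5% annual decline comes from the text. The sketch also brings out a property of this form of model: the total keeps growing every year, but if the 5% decline continued indefinitely, the cumulative additions would approach a ceiling of 1/(1 − 0.95) = 20 times the first year's additions.

```python
DAYS_PER_YEAR = 365


def yearly_additions(daily_start, retain=0.95, years=10):
    """Rows of (year_offset, daily_count, articles_added_that_year),
    where each year keeps `retain` (95%) of the prior year's daily count."""
    rows = []
    daily = float(daily_start)
    for offset in range(years):
        rows.append((offset, daily, daily * DAYS_PER_YEAR))
        daily *= retain
    return rows


def cumulative_added(daily_start, retain=0.95, years=10):
    """Closed-form geometric sum of the articles added over `years` years."""
    first_year = daily_start * DAYS_PER_YEAR
    return first_year * (1 - retain ** years) / (1 - retain)


def long_run_ceiling(daily_start, retain=0.95):
    """Limit of the cumulative additions if the decline never stopped."""
    return daily_start * DAYS_PER_YEAR / (1 - retain)


# Placeholder value for illustration only, not an actual Wikipedia statistic.
HYPOTHETICAL_DAILY_START = 500

rows = yearly_additions(HYPOTHETICAL_DAILY_START, years=10)
for offset, daily, added in rows:
    print(f"year +{offset}: {daily:7.1f} new articles per day, {added:10,.0f} added")

print(f"added over 10 years: {cumulative_added(HYPOTHETICAL_DAILY_START, years=10):,.0f}")
print(f"long-run ceiling:    {long_run_ceiling(HYPOTHETICAL_DAILY_START):,.0f}")
```

The loop and the closed form give the same ten-year total, so either can be used to rebuild the table from a different starting count or a different rate of decline.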

Projections could change radically
Beware that the ongoing projections assume that the prior types of new articles will continue to be created. Any drastic change in mass uploads or new-article restrictions could radically alter the rate of new-article creation. For example:
 * If some WikiProject decided to auto-upload new articles, generated as stubs, from a huge database of "Who's who in science", then a massive upsurge would occur for new articles.
 * In contrast, if Wikipedia policies were quickly changed to demand sources, such as requiring 2 independent sources per new stub, then new-article creation could fall to just a few dozen a day.

Because of the widespread impact of mass uploads or new-article restrictions, the actual growth figures could veer widely from the projected levels, within only weeks of the current time.

Also, article count as a sole parameter does not take into account the ongoing work of merging articles that are redundant or too small into larger and more comprehensive articles, for which a reduction in total article count is a sign of healthy development.