User:Piotrus/Wikipedia interwiki and specialized knowledge test

One constantly hears claims that Wikipedia has "enough articles" and is unlikely to grow, and those predictions are constantly proven wrong. In summer 2006, there were about 2 million articles in need of translation from non-English Wikipedias, and more than 50 million specialized topics in need of creation (I justify those numbers below). In summer 2011, Wikipedia boasted 3.5 million articles, still covering less than 10% of what would be, roughly, comprehensive coverage of the world's notable subjects. Wikipedia is just in its infancy...

Introduction
One of the interesting questions about Wikipedia is: "how much more information is there for Wikipedia to assimilate?"

Part of the answer, if we look at English Wikipedia, is the number of articles from non-English Wikipedias that need to be translated. I was somewhat surprised that we have no such statistics; at least, I was unable to find information on how many articles on a given wiki (for example, Polish wiki) have interwiki links to a specific (English) wiki.

I checked the pages of User:YurikBot, Interwikimedia link, Interlanguage links (shouldn't those two be merged?), and Multilingual coordination, but they don't seem to have the answer (or I can't find it :>).

Note: while the initial comparison (Polish Wikipedia, PSB) was done by me (Piotr Konieczny aka Prokonsul Piotrus talk), please don't hesitate to edit this page and add more information (from the 'to do' lists or whatever you deem appropriate). But let's discuss it at the discussion page, not here.

Polish Wikipedia interwiki test
So I decided to run a little test: take a random sample of 100 pages from Polish Wikipedia (4th largest Wikipedia with over 250,000 articles) and see how many have interwiki links to en wiki. The sample was taken by clicking the 'random page' button and noting whether each article had an interwiki link to en wiki or not.

Results: out of 100 pages randomly selected on Polish Wikipedia, 72 had no interwiki links to en Wikipedia. (test as of 22 July 2006; Wikipedia at that time had about 1,350,000 articles.)

Notes:
 * 1) The number of pages with an English counterpart may be slightly undercounted, as it is possible some pages have an unlinked equivalent on en-wiki. If I were to check, I expect that around 10 pages would have equivalents on en wiki.
 * 2) I did count disambig pages.
 * 3) Interestingly, quite a few of the pages that have no interwiki to English wiki have interwikis to wikis in another language. I did not keep track of that, just noted the general trend.
 * 4) Many of the missing pages were towns, villages, administrative districts. Interestingly, many of those were foreign (French, Italian, etc.) and belonged in the category 'has interwikis to non-English wiki but not to English wiki'.
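
Since the sample is only 100 pages, the 72% figure carries some sampling error. As a rough illustration (this calculation is not part of the original test, and the choice of the Wilson score interval and the 95% level are my assumptions), the uncertainty can be sketched as:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 72 of the 100 sampled Polish pages had no interwiki link to en wiki.
low, high = wilson_interval(72, 100)
print(f"share without English interwiki: {low:.1%} to {high:.1%}")
```

With a sample this small the interval is wide (a bit over eight percentage points either side), which is one more reason to treat the percentages here as rough.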

Conclusion: While generalizing from Polish wiki to other wikis is not recommended and we should do similar tests for other wikis, it appears that between 60% and 70% of articles on Polish wiki have not yet been translated to en wiki. According to the Wikipedia article, Wikipedia had more than 4,600,000 articles in many languages, including more than 1,200,000 in the English-language version. Therefore it is possible that just by translating articles from non-English Wikipedias into English Wikipedia we would increase the size of English Wikipedia by ~2,000,000 articles.
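
The arithmetic behind the ~2,000,000 figure can be sketched as follows. This is a minimal illustration that applies the Polish 60-70% share to all non-English Wikipedias at once; it assumes, as the text does, that the Polish result generalizes, and it ignores the fact that the same topic may exist on several non-English wikis.

```python
# Figures quoted in the text (mid-2006).
total_all_languages = 4_600_000
english_articles = 1_200_000

non_english_articles = total_all_languages - english_articles

# Share of non-English articles assumed to lack an English counterpart (percent).
untranslated = {
    pct: non_english_articles * pct // 100
    for pct in (60, 70)
}

for pct, count in untranslated.items():
    print(f"{pct}% untranslated -> about {count:,} articles")
```

Both ends of the range land slightly above two million, consistent with the rounded figure above.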

To do

 * see if the same number holds true for other Wikipedias. Analysis of German (second largest) and French (third largest) would be quite useful. The other 7 of the 'top ten' would further increase the statistical significance of this analysis across Wikipedia's projects.
 * get a bot to run those tests every month or so
 * record data on interwikis to non-English languages if English is not present
 * record data on how many articles should have interwikis but don't

Update

 * 1) 7 August 2007: out of a hundred random pages, 59 pages had no English interwiki. 21 of them, however, had interwikis to other wikis. This would indicate that coverage of Polish Wikipedia subjects improved by 13 percentage points (last year, 72% had no English equivalent; now, 59%).
 * 2) From September-October 2007, a bot (User:Kotbot) has started to fill the gaps in Polish gminas, villages, and similar content on en-wiki. This may impact the interwiki test.
 * 3) 9 May 2008:  52 had no interwiki, 48 had interwiki.
 * 4) 8 September 2008: 58 had, 42 hadn't. Have some trends stabilized?
 * 5) 25 December 2008: 58 had, 42 hadn't (exactly the same as last time??). This time, I've also counted articles with non-English interwiki: out of 42 articles that didn't have interwiki to English ones, 8 had interwikis to foreign ones (German and French seem most popular).
 * 6) 23 March 2009: 70 had, 30 didn't. Out of those 30, 8 had interwikis to non-English ones. The sample seems too small for a good distribution, but there were 4 links to be/ru/uk wikis, 1 to de wiki, 3 to es/pt wikis, 1 to fr wiki and 2 to it wiki.
 * 7) 16 June 2009: 67 had, 33 didn't. Out of those 33, 7 had interwikis to non-English ones. Distribution: fr wiki 1, it wiki 1, Estonian wiki 1, Slovak wiki 1, Dutch wiki 2, multiple non-en wikis 2. Observations: 1) en wiki seems to be missing articles on astronomical objects (asteroids, galaxies); 2) missing Polish articles seem to center on the following areas: biographies, subdivisions below settlement level (urban districts/neighborhoods), buildings (churches, train stations).
 * 8) 1 December 2009: 65 had, 34 didn't. Out of those which had, 52 had links to multiple wikis, only 13 had links to en wiki only. Out of those that didn't, 25 had no iwiki links (Stara Synagoga w Narwi, Haemodipsus ventricosus, Eothalassoceras, Moshic, Wish You Were Here (Marillion), Starzec karpacki, Osiedle Ogrody (Siedlce), Robert Sobociński, Herb powiatu ostrowieckiego, Jezioro Gamerskie, Poziom próchniczny, Martin Grosser, Sternik jachtowy, Adam Krzysztofiak, Malinka (disambig), Bojowe środki promieniotwórcze, Międzynarodowy Festiwal Szkół Filmowych i Telewizyjnych MEDIASCHOOL, Mieczysław Bilski, Kazakhoceras, Txema, DSP56000, Huba (disambig), 3 Pułk Haubic Polowych Austro-Węgier, Dolnośląski biały łapciaty, Pieszkowo (gmina Miłakowo)), 10 had links to non-English wikis (3 to only one: pt, ca, fi, 7 to multiple: NGC 6953, NGC 1684, NGC 1461, Wasyl Busłajewicz, (74361) 1998 WK16, Keskranna, Rumcajs). The sample included 5 disambigs; in three instances I found and added English / Polish interwikis that were missing. All examples, including disambigs, were notable for inclusion on en wikipedia. Interestingly, non-en Wikipedias seem to have articles on general knowledge (galaxies, species) that are still missing from en wikipedia.
 * 9) 10 May 2010: 76 had, 24 didn't. Out of those which had, 58 had links to multiple wikis, and 16 to en wiki only. The following articles had no interwikis: Bisclaveret, Konfraternie pokutne, Brent Bailey, Prezentacja sceniczna, Synagoga Agudat Szloma we Lwowie, Kartaczownica Claxtona, 7 Pulk Szwolezerów-Lansjerów, Kurowscy herbu Nalecz, Parafia Swietego Krzyza w Rzeszowie, Ewa Hutny, Okreg Kujawsko-Pomorski ZHR, PocketDOS, Janusz Szablicki, Wydzial Chemii Uniwersytetu Gdanskiego, Maszyny i wilki, Speedway Championships, Andrzej Szalewicz, Maria Zmarz-Koczanowicz, Waldemar Roszkiewicz, Przemiana austenityczna, Kosciól sw. Malgorzaty w Graboszewie, Sobkowy Zleb, Herb Zelowa, Muzeum sw. Faustyny w Plocku. The following articles had interwikis to non-en Wikipedias: Landrat (Prusy), Celia, 11 Czechoslowacki Batalion Piechoty - Wschód, Jan z Wislicy, Syntagma, NGC 6386. In two instances I found and added English / Polish interwikis that were missing.
 * 10) 10 Sept 2010: 76 had, 24 did not. Out of those which had, 54 had links to multiple wikis including the English one, 10 to English wiki only, 3 to multiple wikis but not the English one, and 9 to a single non-English wiki. The following had no interwiki: Grzegorz Cwajg, Waldemar Wegrzyn, Wyroby hutnicze, Nawuj, Bartosz Konopka, Lubelska Szkola Biznesu, Slask Wroclaw w europejskich pucharach sezon 2000/2001, Jezioro Dziewicze (dolnoslaskie), Grzegorz Kuzniewicz, Alina Rossman, Ksiestwo czerskie, Terapia logopedyczna, Marek Boral, Stary Ujków, Gmina Góra (powiat warszawski), Towarzystwo Milosników Historii, Kapitan zeglugi wielkiej, Witold Tokarski, Intentional grounding, Jan Wyszkowic, Thiago Gomes, Hemicordulia tau, Kamieniec Zabkowicki (stacja kolejowa), Mieczyslaw Poznanski, Gmina Lagiewniki (ujednoznacznienie), Juan Antonio Albizu. The following had a single non-English interwiki: Powiat sulechowski - Esperanto, Maurycy Stefanowicz - Finnish, Les Ordres - Catalan, Cyryl (Dmitriew) - Russian, Roztoka Odrzanska - Danish, Brendan Doran - German, Renat Ibragimow - Russian, Ron Richards - German; and the following, multiple non-English: Hovs kommuna, Naczyniak limfatyczny torbielowaty, Lubusz (ujednoznacznienie). The sample included 4 disambigs; in four instances I added a missing interwiki between Polish and English wikis (those instances are counted as having an interwiki).
 * 11) 23 May 2011: 74 had, 26 did not. Out of the 74 which did, 56 had links to multiple wikis including the English one, 9 to English wiki only, 4 to multiple one but not English and 5 to one non English wikis. The following one had no interwiki: Kosciól sw. Antoniego Padewskiego w Gdansku, Swietlik nadobny, Bóstwa astralne, Ognisko (pismo), Rafal Szukiel, Galeria pod Sufitem, Ciri, Julian Kalowski, Kosc luskowa, Nowa figuracja, Fengxue Yanzhao, ORP Jastrzab (1963), Zbór Kosciola Ewangelicznych Chrzescijan w Lublinie, Albin Jasinski, Litoriusz, Adam Sobieraj, Renato Salvatore, Julija Sorokowska, SIERRA (program), Antoni Maciej Sierakowski, Ugoda polsko-ukrainska z 1935 roku, Kazimierz Dlugopolski, Yaro, Wskaznik protrombinowy, Artur Mkrtczjan, Osiedle Sloneczne (Wloclawek). The following one had a single non-English interwiki: de - Ernst Eckstein, Dialekt malopolski, Herman David, Johannes Theodor Suhr, Klasztor kartuzów we Frankfurcie nad Odra, lt - Giemzy II. The following ones, multiple non-English: NGC 5892, Towarzystwo im. Mychajla Kaczkowskiego, Sferosom, Karmnik rozeslany. The sample included 3 disambigs; in 2 instances I added a missing interwiki between Polish and English wikis (those instances are counted as having an interwiki).
 * 12) 25 Oct 2011: 74 had, 26 did not. Out of 74 that did, 54 had  links to multiple wikis including the English one, 13 to English wiki only, 5 to a single non-English one and 2 to multiple non English wikis. The following had no interwiki: Herb Łobza, Mistrzostwa Europy Juniorów w Lekkoatletyce 1989, Nasza ulica, Błeszyński (herb szlachecki), Pałac w Brzyskach, Ryszard "Skiba" Skibiński 1951-1983, Osiedle Miłocin (Rzeszów), Gry liczbowe, Killer (MUD), Michaił Okuń, Edward Rybarz, Dorota Brodowska, Most Siechnice-Łany, Obniżenie Mieroszowskie, Pomnik Piotra Skrzyneckiego w Krakowie (ul. Skawińska), Liceum, Sióstr Niepokalanek w Szymanowie k. Warszawy, Grupa mostowa, Ignacy Zatorski, Tarcie ruchowe, 3 Pułk Zmechanizowany, Indywidualne mistrzostwa Anglii na żużlu, Medal Stanisława Kostaneckiego, Krawędź tkaniny, Turniej Czterech Narodów w Bangkoku 1999, Szymon Czapiewski, Al-Dżarkana. The following ones, multiple non-English: Droga N14 (Holandia), Katja Wächter. The following one had single link to non-English wikis: 1 German (Oberharz), 2 Lithuanian (Parafia św. Jana Teologa w Dąbrowie Białostockiej, Kaletnik Mały) and 2 Dutch (Bartąg (przystanek kolejowy), Bieruń Stary). The sample included 4 disambigs, in 3 instances I added a missing interwiki between Polish and English wikis (those instances are counted as having an interwiki).
 * 13) 28 March 2012: 66 had, 34 did not. Out of 66 that did, 47 had links to multiple wikis including the English one, 8 to English wiki only, 4 to a single non-English one and 7 to multiple non English wikis. The following had no interwiki: Królowe Wirtembergii, Nokia DCT3, Lista mistrzów Formuly 1, Urszula Garwolinska, Swielub, Jan Olszanski, Trójkat (polaczenie), Palac w Szalszy, Aleksander Goluchowski, Towarzystwo Milosników Miasta Bydgoszczy, SN 2001by, Sznajderman, Miloslaw (ujednoznacznienie), 10 Pulk Artylerii Lekkiej, Slonce Arizon, Nowy Józefów (osiedle w Lodzi), Sizzo, Reprezentacja Kirgistanu na Mistrzostwach Swiata w Narciarstwie Klasycznym 2011, Wojciech Zalinski, Skaly melanokratyczne, Klasztor Karmelitów Bosych w Zawoi, Marek Kryda, Zoogoneticus, Linia kolejowa nr 92, Andrzej Malczewski, Skarzynski, Stacja Monitoringu Srodowiska Przyrodniczego UAM w Bialej Górze, Centralna Skladnica Marynarki Wojennej, Ihor Czeredniczenko, Moczylki Waskotorowe, Spóldzielcza Grupa Bankowa, Mieczyslaw Piotrowski (pisarz), Wyleganie, Naftali Lau-Lavie, Sari Ska Band. The following ones, multiple non-English: Club Voleibol Cuesta Piedra, NGC 495, Sankt-Georg-Schanze, Frankendorf, Paolo De Nicolò, Wyszeslaw Wlodzimierzowic, Rosyjska Akademia Sluzby Panstwowej. The following one had single link to non-English wikis: Slavonín (Czech), Åge Sten Nilsen (Norwegian), KAB-500 (Russian), Ukrainskie Regionalne Muzeum "Strywihor" w Przemyslu (Ukrainian). The sample included 8 disambigs.
 * 14) 26 August 2012: 77 had, 23 did not. Out of 77 that did, 68 had links to multiple wikis including the English one, 4 to English wiki only, 3 to a single non-English one and 2 to multiple non-English wikis. The following had no interwiki: Roztoka (Góry Leluchowskie), Tribute to Rejestracja, Robert Sikorski, Kanada na Igrzyskach Imperium Brytyjskiego 1934, Metropolia Kansas City, Kreznica Okragla (kolonia w gminie Belzyce), Nowe Siolo (ujednoznacznienie), Kosciól sw. Jana Chrzciciela w Leszczawie Dolnej, Sluz gestagenny, Hotel Polonia we Wroclawiu, Parahomoceras, Parafia sw. Karola Boromeusza w Poznaniu, Urszula Zybura, Czeslaw Robakowski, Dynamo Tarnopol, Województwo slasko-dabrowskie, 2 Front, EXPAL, Szubieniczna (Kotlina Klodzka), Kaplica Swietego Krzyza w Lukowicy, Aleksander Dobrzanski (biskup), Herb Labiszyna, SN 1989R. The following ones, multiple non-English: NGC 2028, Blanc guenar. The following one had single link to non-English wikis: Indaeschna grubaueri (Dutch), Oleksij Hatin (French), Barak Baba (Turkish). The sample included 3 disambigs.
 * 15) 13 February 2013: 83 had, 13 did not. Out of 83 that did, 70 had links to multiple wikis including the English one, 7 to English wiki only, 5 to a single non-English one and 1 to multiple non English wikis. The following had no interwiki:  Vápenná jaskyna, Kiczora (839), Rozwiniecie Herbranda, Bohdan Kurowski, Kynoforia, Najasnica, Siódme wtajemniczenie, Rotunda Najswietszej Marii Panny na Wawelu, Wulkan eksplozywny, Klimat podrównikowy, Cud Matki Boskiej Snieznej, Berlinka (statek), Bartlomiej Kwiatkowski, Wiktoria Quintana Argos, Dekanat Mogilany, I Liceum Ogólnoksztalcace im. Juliusza Slowackiego w Czestochowie. The following ones, multiple non-English: NGC 6147. The following one had single link to non-English wikis: Bilgoraj LHS (Dutch), Izaak Brudny (Russian), Antoni (Zawgorodny) (Russian), Olszyna Lubanska (Dutch), Østjyske Motorvej (Dutch). The sample included 5 disambigs.
 * 16) 15 May 2014: 78 had, 22 did not. Out of 78 that did, 60 had links to multiple wikis inc. the English one, 10 to English wiki only, 5 to a single non-English one and 3 to multiple non-English ones. The following had no interwiki: Pomnik przyrody w gminie Czorsztyn, Inbentos, Polska Bibliografia Narodowa, Józef Jagielski (komunista), Skala CEAP, Instytut Wiedzy i Innowacji, Jan Wislocki, Medalisci mistrzostw Polski seniorów w pchnieciu kula, Buczyna (powiat olkuski), Zwiazek Hodowców Psów Rasowych, Kajetan Proskura Suszczanski, Krzysztof Michalik (informatyk), Teatr Dramatyczny im. Jana Kochanowskiego w Opolu, Gromada Pogrzebien, Rózany Potok, Palac biskupi w Ciazeniu, Petecki, Memorial Józefa Dominika, Parafia sw. Marcina BW w Siemkowicach, Ruch Przyszlosci, Bractwo sw. Lukasza (Polska), Tylkowy Zleb, Gmina Kowala (województwo krakowskie). The following ones, multiple non-English: NGC 691 (19), Droga krajowa nr 14 (Polska) (3), Kulewcza (obwód Szumen) (5), Aleš Hanák (2). The following one had single link to non-English wikis: Radzic (Russian), Maccabi Nazaret YMCA (Israeli), Bitwa pod Labiszynem (Russian), Halina Dorda (Czech), Alfred Piper (German). The sample included 2 disambigs.
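
For the 2006 baseline and the six follow-up samples that state the count without an English interwiki directly (through June 2009), the trend can be tabulated as below. This is only a sketch: the ISO date strings and variable names are mine, later updates are left out because their "had" counts appear to include non-English interwikis, and since each sample has 100 pages, counts and percentages coincide.

```python
# (sample date, pages out of 100 with no interwiki link to en wiki)
samples = [
    ("2006-07-22", 72),
    ("2007-08-07", 59),
    ("2008-05-09", 52),
    ("2008-09-08", 42),
    ("2008-12-25", 42),
    ("2009-03-23", 30),
    ("2009-06-16", 33),
]

# Change between consecutive samples, in percentage points.
changes = [
    later[1] - earlier[1]
    for earlier, later in zip(samples, samples[1:])
]

total_change = samples[-1][1] - samples[0][1]

for (date, missing), change in zip(samples[1:], changes):
    print(f"{date}: {missing}% missing ({change:+d} points)")
print(f"overall: {total_change:+d} points since {samples[0][0]}")
```

The differences are expressed in percentage points rather than as relative percentages, which keeps them directly comparable between samples.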

Specialized knowledge test
Next, I decided to run a comparison of 'how many articles from a random encyclopedic publication' are missing on Wikipedia. The publication I selected, Polski Słownik Biograficzny (encyclopedia of famous Poles), was not completely random, but as far as I know there is no project dedicated to creating relevant stubs on en-wiki, and thanks to one of my past projects there is a nice index at User:Piotrus/List of Poles. Note also that PSB is not a general knowledge encyclopedia but a specialized knowledge encyclopedia.

Results: as of 22 July 2006, out of the selected 1000 entries of User:Piotrus/List of Poles/Kisielinski-Korzelinski, about 30 entries have blue links (I ignored entries in need of disambiguation, like the 10 entries for Konrad). Wikipedia at that time had about 1,350,000 articles.

Notes:
 * 1) The bot we used to generate this index was not perfect, and an entry with diacritics will show as a red link if there is no redirect from the spelling without them. The blue links may therefore be somewhat undercounted, but I would be very surprised if by as much as 50%.
 * 2) Due to the nature of PSB (print publication, not easily updated) and my own observation (how many Poles who have articles on Wikipedia are not listed in PSB), it should be noted that PSB does not represent perfect coverage in its area. But for now I assume that the number of articles about Poles that Wikipedia has and PSB lacks, and the number of PSB entries missing from Wikipedia, even out.

Conclusion (as of June '06): assuming PSB represents the average coverage of specialized knowledge on (English) Wikipedia, Wikipedia currently covers about 3% of such knowledge. If 3% = 1,350,000 articles, then 100% would equal, roughly, 40,000,000 (40 million) articles. Therefore Wikipedia will be approaching somewhat comprehensive coverage of specialized knowledge when we have about 40,000,000 articles. This is a very rough estimate, but it is my reply to some people who said there is not enough encyclopedic knowledge to merit 2,000,000 articles, as well as to the very optimistic estimates of WikiProject Missing encyclopedic articles (Biographies - 92.6% done ?? who are they kidding? :D )
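
The scaling step can be sketched as below; the function name `estimate_full_size` is just illustrative. Straight division of the quoted figures gives about 45 million, which the text above rounds to "roughly 40 million"; given how rough the 3% input is, either figure should be read as an order-of-magnitude estimate.

```python
def estimate_full_size(current_articles: int, blue_links: int, sample_size: int):
    """Scale the current article count by the sampled coverage share.

    Assumes (as the text does) that coverage of the sampled source
    is representative of Wikipedia's coverage in general.
    """
    coverage = blue_links / sample_size
    full_size = current_articles * sample_size / blue_links
    return coverage, full_size


# July 2006: about 30 blue links among 1000 PSB index entries,
# and about 1,350,000 articles on English Wikipedia.
coverage, full_size = estimate_full_size(1_350_000, 30, 1000)
print(f"coverage: {coverage:.0%}, implied full size: {full_size:,.0f} articles")
```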

To do

 * run similar comparisons on some other specialized knowledge databases
 * run similar comparisons on general knowledge databases
 * given statistical data (see Modelling Wikipedia's growth), do a trend estimation to see when we reach ~40m articles

Updates
Preliminary analysis suggests coverage improvement of ~1 percentage point per year, with estimated completion around the turn of the century, assuming a linear growth model...
 * 1) 8 August 2007. I counted 34 blue links in 'Kisielinski-Korzelinski'. I counted two more for better stats: 'Olbrycht-Pawleta' - 37; 'Ebenberger-Gembicki' - 28 - so the ~3% still holds.  Wikipedia at that time had about 1,800,000 articles.
 * 2) 16 May 2008. 'Jesionowski-Kisielewski': 47. 'Skowron-Spiczakow': 23. 'Biergel-Bzowski': 36. Some interesting outliers, but it is safe to say ~3% still holds. Wikipedia at that time had about 2,250,000 to 2,300,000 articles.
 * 3) 25 December 2008. 'Biergel-Bzowski': 36, 'Hoser-Jerzykowski': 46, 'Majnert-Michiels': 44. ~4%? Wikipedia at that time had about 2,600,000 articles.
 * 4) 23 March 2009. 'Danielski-Dzwonkowski': 52. 'Lichtenstein-Majkowski': 67. 'Rutowicz-Schreiber': 58. ~5%? Wikipedia at that time had about 2,750,000 articles.
 * 5) 16 June 2009. 'Skowron-Spiczakow': 28. 'Przyalgowski-Retke': 65. 'Grodecki-Hoscki': 48. ~5%? Wikipedia at that time had about 2,950,000 articles.
 * 6) 8 Dec 2010. 'Kisielinski-Korzelinski' 58.  'Olbrycht-Pawleta' 66; 'Ebenberger-Gembicki' 60. ~6%, and double the coverage of 2007. Wikipedia at that time had about 3,500,000 articles.
 * 7) 23 May 2011. 'Gemma-Groddeck' 58; 'Rutowicz-Schreiber' - 70; 'Krzesinski-Lichtarowicz' - 61. Keeping at ~6%
 * 8) 25 Oct 2011. 'Abakanowicz-Bienkowski' 57, 'Korzeniewski-Krzesimowski' 67, 'Skowron-Spiczakow' - 37. No change.
 * 9) 29 March 2012. I counted 63 blue links in 'Kisielinski-Korzelinski'. 'Olbrycht-Pawleta' - 70; 'Ebenberger-Gembicki' - 50. No change. This time I decided to repeat the first sample.
 * 10) 26 August 2012. 'Jesionowski-Kisielewski': 72. 'Skowron-Spiczakow': 38. 'Biergel-Bzowski': 59. ~6% still seems to be the rough estimate. Repeated second sample. Individual samples seem to suggest rough doubling within their populations.
 * 11) 13 February 2013. 'Biergel-Bzowski': 58, 'Hoser-Jerzykowski': 66, 'Majnert-Michiels': 72. ~6.5%?
 * 12) 15 May 2014. 'Rettel-Rutkowski': 46, 'Gemma-Groddeck': 59, 'Lichtenstein-Majkowski': 89. ~6.5%?
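
The per-update percentages and the linear projection mentioned above can be reproduced with a short sketch. It assumes that every index sub-list holds 1000 entries (stated above only for 'Kisielinski-Korzelinski') and uses the ~3% starting point in 2006 and the ~1 point per year rate quoted in the text; the function names are illustrative.

```python
def coverage_percent(blue_link_counts, entries_per_list=1000):
    """Average coverage across several index sub-lists, in percent.

    entries_per_list=1000 is an assumption for lists other than
    'Kisielinski-Korzelinski'.
    """
    total_entries = len(blue_link_counts) * entries_per_list
    return 100 * sum(blue_link_counts) / total_entries


def linear_completion_year(start_year, start_percent, points_per_year):
    """Year at which coverage reaches 100% under constant linear growth."""
    return start_year + (100 - start_percent) / points_per_year


aug_2007 = coverage_percent([34, 37, 28])   # update 1
dec_2010 = coverage_percent([58, 66, 60])   # update 6
may_2014 = coverage_percent([46, 59, 89])   # update 12
completion = linear_completion_year(2006, 3, 1)

print(f"2007: {aug_2007:.1f}%  2010: {dec_2010:.1f}%  2014: {may_2014:.1f}%")
print(f"linear projection reaches 100% around {completion:.0f}")
```

The projection is very sensitive to the assumed rate: the sampled coverage rose by only about three points between 2007 and 2014, so a slower rate would push the date out considerably.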

Updated conclusion (as of February '11): It appears that Wikipedia is growing faster in some other areas than in Polish biographies. Wikipedia's coverage of Polish biographies has doubled between June '06 and December '10, but its total number of articles has grown almost threefold in that period (well, around 2.6 times). If we were to take the June '09 or Dec '10 numbers and try to estimate the size of a complete Wikipedia, we would get ~60 million instead of the 40 million that the June '06 data would suggest. Of course, as the growth in Polish biographies has not kept pace with the growth of Wikipedia, it is obvious that it is hardly a perfect estimator. Assuming it is some kind of an estimator, we might as well take an average of those two results and call the ultimate, comprehensive size 50 million.
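
A quick check of the numbers in this updated conclusion, using the article counts and coverage shares quoted in the updates above (variable names are illustrative, and the 40 and 60 million inputs to the average are the rounded figures from the text):

```python
articles_2006 = 1_350_000
articles_2010 = 3_500_000
coverage_2010_percent = 6  # Dec 2010 sample, ~6% blue links

# How much the total article count grew between the two samples.
growth_factor = articles_2010 / articles_2006

# Implied full size if 6% coverage corresponds to 3.5 million articles.
estimate_2010 = articles_2010 * 100 / coverage_2010_percent

# Average of the two rounded estimates quoted in the text.
averaged_estimate = (40_000_000 + 60_000_000) / 2

print(f"growth 2006-2010: x{growth_factor:.2f}")
print(f"Dec 2010 estimate: {estimate_2010:,.0f} articles")
print(f"averaged estimate: {averaged_estimate:,.0f} articles")
```

The December 2010 figure comes out a little under 60 million, which the text rounds up before averaging.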