Wikipedia talk:Article size/Archive 3

New talk
-- Mojska  666  – Leave your message here 15:52, 5 May 2008 (UTC)

Readable prose numbers?
It seems section "A rule of thumb" has been edited to change "Article size" to "readable prose" in this edit, as has been pointed out in the Talk above. Has this been done with the consensus of the editors? I humbly suggest that this preceding edit was OK (There is no need for haste, and the readable prose size should be considered separately from references and other overhead.), but the next edit was not. The rule of thumb (I thought) was to be based on the actual number seen when editing. Editors would then look further at the number of bytes in the readable prose (they would need to determine this themselves in an editor) and possibly also at the use of templates to help decide how and what should be reduced or split. So, please could someone re-address this issue? It seems the Talk above brought it up, but left it unresolved. My opinion is that the byte count should appear in the Rule of thumb and that still better wording be used in the No need for haste. -84.223.78.86 (talk) 17:21, 14 May 2008 (UTC)


 * Actually the discussion above focused only on the table and ignored the very first section WP:Article size which first raises the issue of "readable prose".  This section has been there well before the change in the table, which appears to me to have been a logical step to make the article consistent.  If it was intended all along, as some claim, to only use the concept of total bytes then the concept of "readable prose" never would have been included at all.


 * There was, and still is, a proposal to change the table. This proposal has not secured a majority of support, let alone a consensus.  If you have an alternative proposal, then spell it out and we can discuss it. Tom (North Shoreman) (talk) 17:40, 14 May 2008 (UTC)


 * Readable prose is only one factor to consider. Getting back to the "No need for haste section", do you have a suggestion?

"'Do not take precipitous action the very instant an article exceeds 32 KB overall. There is no need for haste, and the readable prose size should be considered separately from references and other overhead. Discuss the overall topic structure with other editors. Determine whether the topic should be treated as several shorter articles and, if so, how best to organize them. Sometimes an article simply needs to be big to give the subject adequate coverage. Certainly, size is no reason to remove valid and useful information.'"


 * Add, "without moving it to a subarticle" at the end? Oakwillow (talk) 17:45, 14 May 2008 (UTC)


 * I think it was pretty well resolved in the above proposal where the IP proposed that the article size be used to determine when an article should be split up and all of the responders to that survey responded as being opposed to that change. As has been noted several times, using the article size (per the message that appears when you edit the article) is a bit of a misnomer in that it includes references and other non-readable text. Heck, it doesn't even accurately reflect the actual total size of an article in that the size of the images on the article are not included and those are the things that eat up the most bandwidth when it comes to downloading the page. It may be unfortunate, but if Wikipedia wants to have well referenced articles and maintain the inline references, then the size that needs to be used is the readable prose, not the article size. --Bobblehead (rants) 17:53, 14 May 2008 (UTC)
 * Concur. Sandy Georgia  (Talk) 17:56, 14 May 2008 (UTC)
 * That's not necessary. When it comes to some things edit byte count is most important, when it comes to other things, readable prose is most important. I can assure you that on dial-up, edit byte count is hugely dominant. There is no one size fits all way to choose which to use. However, as pointed out above, if you do wish to change "article size" to "readable prose size" you need to divide all the numbers by two or three to make them work. Forcing people to "have to" look up the readable prose metric is unnecessary. Editors should feel free to use whatever grain of salt they deem appropriate. Oakwillow (talk) 18:08, 14 May 2008 (UTC)
 * Edit byte size may be the easiest to come up with, but it is also the worst of all the options we have to determine when it may or may not be appropriate to split up an article. This is because of the possible measures that could be used it is the one that doesn't actually reflect anything important. Readable prose reflects the attention span of the average reader and total html size reflects the size that is actually downloaded when viewing the page. The only thing edit byte size reflects is the amount of text in the edit field, which only impacts someone if they want to edit an article. Even then, the impact is minimal because of the advent of sectional editing. --Bobblehead (rants) 18:31, 14 May 2008 (UTC)

Table of bytes downloaded
From the responses above I do not have a suggestion as the issue appears more complex than a simple rule of thumb can address. In fact Oakwillow makes the important point that both can be relevant. Meanwhile to help (me, mostly) understand what causes the bloated size of some large Wikipedia articles (which causes slow loads for those with slower computers and dial-up network access), here is a table showing a breakdown of bytes used in two of the longest articles (Barack is Barack Obama and Hilary is Hillary Rodham Clinton presidential campaign, 2008). ( added Barack campaign = Barack Obama presidential campaign, 2008 ) -84user (talk) 20:28, 14 May 2008 (UTC)

* Not including references. ** using http://www.websiteoptimization.com/services/analyze/index.html

The large amount in Hillary's article surprised me, but even without the Table of Contents there are 122,926 characters of readable text there (each newline counts as one). I also tried some load tests on different browsers, one under an emulator to exaggerate the slowness (I am on a fast PC with fast network), and also used the www.octagate.com Site Timer (which reports 5.4 seconds for Barack and 11.5 seconds for Hillary's page - these times match my fastest times on a 2.6 GHz PC with over a gigabyte of RAM, so I can well believe that dial-up users find these pages "unloadable"). -84.223.78.86 (talk) 18:37, 14 May 2008 (UTC) (I added Russia.-84user (talk) 18:25, 18 May 2008 (UTC))


 * You left out an important component (images), which the last time I looked, is what slows down Clinton, and has nothing to do with any of the measures we're looking at. Sandy Georgia  (Talk) 18:43, 14 May 2008 (UTC)
 * Added. Please correct if they are wrong, or change from k to bytes. Another metric I would like to suggest is no article with a printed page count of more than 10 pages, not including the references. The references in Barack are another 10 pages, so 18 total and 12 for Hillary, 39 total. Oakwillow (talk) 19:18, 14 May 2008 (UTC)

Actually SandyGeorgia, you are right, the images are significant and rather large for some articles (both democrat campaigns have over 420 kilobytes with three large ones). The numbers in my table exclude image sizes, they are the raw byte count of the html page. I am adding approximations now. (I just created this username). -84user (talk) 19:44, 14 May 2008 (UTC)
 * Thanks for telling me; considering the work I've done on looking at the load time of this article relative to other articles, its prose size, and its images, I never would have guessed I might be right ... I actually look for opportunities to spout off and be wrong :-))) No, we didn't need to add printed page size; the relevant measures (word count and prose size) deal with reader attention span. Layout is another matter, affected by tables and such. Sandy Georgia (Talk) 19:57, 14 May 2008 (UTC)
 * Sorry, I did not mean it in any sarcastic way, I really did realise after reading your comment and checking images, yes I'd forgotten about the images, and two, that they have a big effect. -84user (talk) 20:28, 14 May 2008 (UTC)
 * I was just joshing :-) But I've spent a lot of time looking at these issues, as I'm often forced to a dialup when I travel, and I believe the issues in many of the slow-loading articles will resolve to images, not prose, although I haven't spent enough time sorting out what factors affect load time wrt images.  I think our measure of readable prose is fine.  Sandy Georgia  (Talk) 20:47, 14 May 2008 (UTC)
 * However if you do think of switching to readable prose, don't forget to divide all the numbers by two or three, or 2.5 or something. See the dramatic comparison above. Images don't bother me in most articles, because most articles use thumbs which resolve to about 26 kB each, not a factor. Once in a great while someone insists on putting in three or four 400 pixel images or a hundred flag symbols, both of which take like forever to load, but other than that the images are not a big factor. It is true that there is a fairly close relationship between readable prose and printed pages, but since I sometimes do print out articles to show people I always cringe when they run into 10 or more pages, knowing that it really isn't ever going to be read. Oakwillow (talk) 21:47, 14 May 2008 (UTC)
 * I think the more interesting question is why editors insist on shooting themselves in the foot by writing articles so long that no one will read them anyway, or adding so many images and slowing down the loadtime so much that no one will even click on the article. Sandy Georgia  (Talk) 21:56, 14 May 2008 (UTC)

Evidently nobody wanted to create any subarticles for the campaign. Oh well, to each their own. I would like it if someone could make a table of 50 articles of various sizes so that we could plot them and compare to see if there is a strong correlation between edit byte count and readable prose size. So far the three examples above range from a ratio of 1.95 to 3.47. So if it was determined that the ratio always was in that range, would you prefer dividing all the numbers by some mid value, or would you prefer just changing the guideline to say edit byte count? Or should I wait for the data before asking the question? Oakwillow (talk) 01:18, 15 May 2008 (UTC)
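The comparison requested above can be sketched in a few lines of Python. This is only an illustration of the method: the sample pairs below are hypothetical placeholders, not measurements of any real article, and the helper name `pearson` is invented for this sketch. It computes the per-article ratio of edit byte count to readable prose size and the Pearson correlation between the two columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL (edit byte count, readable prose bytes) pairs.
# Replace these with real figures from the articles being compared.
samples = [(40_000, 20_000), (90_000, 30_000), (120_000, 50_000), (15_000, 7_000)]

edit_bytes = [e for e, _ in samples]
prose_bytes = [p for _, p in samples]

# Ratio for each article, plus one pooled ratio over the whole sample.
ratios = [e / p for e, p in samples]
overall_ratio = sum(edit_bytes) / sum(prose_bytes)
r = pearson(edit_bytes, prose_bytes)

print(f"ratio range: {min(ratios):.2f} to {max(ratios):.2f}")
print(f"pooled ratio: {overall_ratio:.2f}, correlation: {r:.3f}")
```

A high correlation with a wide ratio range would mean the two measures move together on average while still differing a lot for individual articles, which is the distinction the discussion turns on.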

Table
The following articles were chosen randomly to give a cross section of FA, GA, and unrated articles. The FA articles were chosen from a broad cross section of categories, the GA articles were chosen randomly through the alphabet, and the unrated ones were just what came up using random article. The political campaigns and their countries were added separately.

* Article does not cite any references or sources. ** List.

Ratio of readable prose to word count = 6.204 with a correlation of 0.9995

Ratio of edit byte count to readable prose text = 2.07 with a correlation of 0.965

Ratio of edit byte count to word count = 12.8, with FA articles tending to have a lower ratio than GA or unrated articles. For practical purposes an easy rule of thumb to guesstimate word count is to just divide edit byte count by 12. This will not work for lists and articles with many illustrations or tables. A more accurate word count can be obtained using a text editor or installing User:Dr pda/prosesize.js in your monobook.js. Oakwillow (talk) 17:49, 8 June 2008 (UTC)
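As a rough sketch of how the rule of thumb above could be applied, the following Python snippet turns an edit byte count into estimated word count and readable prose size. The constants are the ratios reported in this section; the function names are hypothetical, and, as noted above, the estimates should not be trusted for lists or for articles with many illustrations or tables.

```python
# Ratios reported in the comment above; treat them as rough averages.
EDIT_BYTES_PER_WORD = 12       # rule-of-thumb divisor (the measured ratio was 12.8)
PROSE_BYTES_PER_WORD = 6.204   # readable prose bytes per word
EDIT_TO_PROSE = 2.07           # edit byte count / readable prose bytes


def estimate_word_count(edit_byte_count):
    """Guess the word count from the edit byte count (rule of thumb)."""
    return edit_byte_count / EDIT_BYTES_PER_WORD


def estimate_prose_bytes(edit_byte_count):
    """Guess the readable prose size in bytes from the edit byte count."""
    return edit_byte_count / EDIT_TO_PROSE


# Example: an article whose edit box reports 60,000 bytes.
print(round(estimate_word_count(60_000)), "words (estimate)")
print(round(estimate_prose_bytes(60_000)), "bytes of readable prose (estimate)")
```

The three reported ratios are mutually consistent: 6.204 × 2.07 ≈ 12.8, which is the reported ratio of edit byte count to word count.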

Poll
It is clear that there has been some confusion about the measurement of article size. Most people interpret "Article size" to be "edit byte count", although many experienced editors think of it as "Readable prose size". The current table, however, is a legacy from when it did mean edit byte count. Therefore, there are three options, please choose one or more:

A. Change "Article size" to "Readable prose size"
Change "Article size" to "Readable prose size" and adjust all quantities by a factor of approximately 2.5 correspondingly. This will have the effect of forcing editors to ignore the 32 kB warnings and artificially figure out how to count readable bytes to determine appropriate article length.


 * 1) Well, aside from this poll being one of the more biased polls I've seen in quite a while, readable text is the most important measurement as far as an encyclopedia goes.  The average attention span of a reader is what we should be measuring the articles against, not some random measurement, like edit byte size, that is not applicable to any issue that causes problems for readers or editors. Edit byte size is impacted by the number of references an article has and whether or not the sources are formatted using cite templates. This becomes especially problematic for controversial topics where it is not uncommon for editors to fight over the most minor of information if they are not cited properly. In articles related to politicians, it is not uncommon for the amount of Kb references take up to be larger than the amount of Kb the actual text of the article takes up, due to the tendencies of the editors to fight over the smallest of nits. --Bobblehead (rants) 15:27, 16 May 2008 (UTC)
 * The article already uses the concept of readable prose and has for years -- as at least four editors have already pointed out. Readable prose should stay as the operative term.  There is already a proposal on the table covering this subject, and this appears to be nothing but an attempt to further confuse issues. Tom (North Shoreman) (talk) 15:50, 16 May 2008 (UTC)
 * The proposal that was on the table is the same as option B, below. This poll supersedes that proposal, as it is more comprehensive. Oakwillow (talk) 16:34, 16 May 2008 (UTC)
 * So YOU say. How do you, a single editor, have the authority to say that your proposal takes precedence and an existing proposal is no longer open for consideration or debate? People have registered their opinions above and are under no obligation to participate in your "biased poll" in order for their expressed opinions to remain valid in determining consensus or lack of consensus. Tom (North Shoreman) (talk) 16:43, 16 May 2008 (UTC)
 * The original proposal was broken because it did not provide any valid choices for editors such as yourself who were opposed to option B. By the way, if you could help fill in the table with the readable prose numbers I can do the ratio calculations. It is extremely tedious for me to obtain readable prose size. I noticed above you indicated that it takes you less than a minute. Oakwillow (talk) 17:06, 16 May 2008 (UTC)
 * If you had read the "How do you find the readable page size?" topic above, you'd have seen that the User:Dr pda/prosesize.js tool allows you to find readable prose sizes and readable prose word counts instantly. Wasted Time R (talk) 21:33, 16 May 2008 (UTC)
 * I'm on dialup. Nothing is instant on dialup. WP is a collaborative project. If someone else could fill in the prose column, I can do the rest. My computer often locks up long before any of those long files load, but I don't have to load them to get the byte size, I just look at the history. Oakwillow (talk) 00:35, 17 May 2008 (UTC)

B. Change "Article size" to "Edit byte count"
Change Article size to Edit byte count and make no changes to quantities. Readable prose is considered separately as a measure of article size, but since there is a strong correlation between the two measures, both are equivalent.


 * 1) This is the simplest option, this quantity is displayed for all to see every time an article or even a section that is over 32 kB is edited. Staying within guidelines guarantees readable prose is also within reasonable limits. Oakwillow (talk) 14:59, 16 May 2008 (UTC)
 * It is very clear that "readable prose" and "article size" are not "equivalent". As the discussion above (as well as the article itself) demonstrates, these are two very different concepts.  Basic inaccuracies such as this invalidate any results that may come from this poll.  The existing proposal provides two very clear alternatives -- this "biased poll" adds nothing to the ongoing debate. Tom (North Shoreman) (talk) 16:51, 16 May 2008 (UTC)
 * Equivalent in the sense that there is a one to one, two to one, three to one, in other words a linear relationship between them. Knowing one you know the other. See the table above. Oakwillow (talk) 17:27, 16 May 2008 (UTC)

C. Make no changes. Leave it saying "Article size"
Make no changes. Leave it saying "Article size" instead of "Readable prose size". This will mean that to some editors this will mean "edit byte count", and to others, who will tend to encourage articles that are two to three times as long, it will mean "readable prose bytes". Since both metrics are important in determining article length, readable prose and byte count, everyone is happy.

Validity of this poll??
The poll makes a number of assumptions and generalizations that are either debatable, inaccurate, or unverifiable. While I responded, the main debate should remain focused on the original proposal made above -- as of this date there is a clear majority opposed to changing "readable prose" to "article size." Tom (North Shoreman) (talk) 15:59, 16 May 2008 (UTC)
 * I think you meant to say changing "Readable prose size" to "Edit byte count". Changing to article size is not one of the options. Oakwillow (talk) 16:28, 16 May 2008 (UTC)
 * NB I apologize for the bias that was originally in option C, and have removed it (by changing "continue to mean 'edit byte count', which it is" to "edit byte count"). Oakwillow (talk) 16:44, 16 May 2008 (UTC)
 * The main bias is your claim that the article currently says "article size" when it very clearly does not. You are mistaking your minority opinion on what the article should say for what it actually does say. Tom (North Shoreman) (talk) 16:58, 16 May 2008 (UTC)
 * Someone, not to mention any names, improperly recently changed it, but that is easily fixed. The article to all intents and purposes says "Article size", but I'm not going to get into an edit war about it. Oakwillow (talk) 17:30, 16 May 2008 (UTC)

Very biased poll. All the options make the implicit assumption that the current numbers reflect "edit byte count", when clearly they were meant to mean "readable prose size". I might as well make the following options for balance:

D. Change "Readable prose size" to "Edit byte count"

Change "Readable prose size" to "Edit byte count" and adjust all quantities by a factor of approximately 2.5 correspondingly.

E. Change "Readable prose size" to "Article size"

Change "Readable prose size" to "Article size" and make no changes to quantities. The description will be ambiguous and every user can choose for it to mean what they think it should mean.

F. Make no changes. Leave it saying "Readable prose size"

Make no changes. Leave it saying "Readable prose size" instead of "Article size". This will mean that to all editors this will mean the same thing and be absolutely clear.

HermanHiddema (talk) 22:42, 16 May 2008 (UTC)
 * Look at the history. The word prose does not even appear in the first three and a half years, other than to say that the numbers do not refer to prose. If they did mean prose they would have been adjusted downward by 2 or 3 when the switch was made from byte count to prose, and that clearly never happened. Oakwillow (talk) 03:44, 17 May 2008 (UTC)


 * This version from March 6, 2004, 1 year after the page was created, already contains the phrase: "Readers may also tire of reading a page in excess of 20-30 KB of readable prose (tables, lists and markup excluded)." HermanHiddema (talk) 14:13, 17 May 2008 (UTC)


 * Which is compatible with a limit of 40-100 KB edit byte count. I'm still waiting for someone to add in the numbers for the readable prose column. Here is a compromise table. Oakwillow (talk) 14:57, 17 May 2008 (UTC)

Compromise table #1
You call that a compromise table? Please look up the word compromise in a dictionary. This table is one that completely reflects your view of the issue and makes absolutely no compromises. I might as well make this table and call it a compromise table:

See. It completely reflects my view of the issue, so it must be a great compromise...
 * Totally wrong. This totally ignores the fact that "Readers may also tire of reading a page in excess of 20-30 KB of readable prose". Oakwillow (talk) 15:24, 17 May 2008 (UTC)

Now an actual compromise table would be one that compromises between the ones above, eg:

See? That's what we call a compromise. HermanHiddema (talk) 15:16, 17 May 2008 (UTC)
 * You are compromising the numbers to compromise the point. Compromise in the sense that it includes both choices. It is by far the best thing to do, just include both columns, and make everyone happy. Oakwillow (talk) 15:24, 17 May 2008 (UTC)


 * If you feel that way, I am fine with using the second table, which most accurately reflects what the article has been saying for years. HermanHiddema (talk) 15:26, 17 May 2008 (UTC)
 * Ah, that is what you "think" it has been saying, but that is not the case, and that is why there have been so many complaints about long articles. Oakwillow (talk) 15:56, 17 May 2008 (UTC)
 * Yes, that is what I "think" it has been saying. You "think" it has been saying something else. Which means we both have an "opinion". To find a compromise between two opinions, you try to find some half-way point. Which is what I did. You, however, keep asserting that your own opinion is somehow "fact" while mine is "false". That has nothing to do with compromise. HermanHiddema (talk) 16:39, 19 May 2008 (UTC)

Prose size stats
I don't know what's going on here, but it seems to fit the WP:TLDR bill. Here are some stats on prose size on featured articles, measuring prose size exactly as this article recommends, and as has been done for several years; Barack Obama, Hillary Clinton and John McCain are all well within guidelines and aren't even close to being as long as many featured articles. Oh, and polls are evil, and there is no consensus to change this guideline. Also, this may help:
 * User:Dr pda/Featured article statistics
 * Wikipedia:Miscellany for deletion/Wikipedia:WikiProject Extra-Long Article Committee.  Sandy Georgia  (Talk) 04:08, 17 May 2008 (UTC)
 * What is going on here is that I am fixing a serious problem with the guidelines, which crept in because some people shifted to thinking about readable prose, but failed to change the dividing lines, effectively multiplying the acceptable article size by from 2 to 3. This means that there are continual complaints about articles being too long and continual pointing to oh no it's not too long, it's well within the guidelines, even though it is so long that it locks up your computer and is totally inaccessible. The fix is simple. There are three ways to fix it: one, choice A above, recognize that the numbers need to be adjusted if they are to mean readable prose; two, choice B above, simply use the numbers as edit byte count because that is the easiest metric to obtain; or three, leave the article as is but ensure that it says "article size" for the table and not "readable prose size" and recognize that a percentage will treat it as byte count, which it really is, and a percentage as readable prose, artificially allowing articles to be 2 to 3 times as long. As this is a guideline, editors are free to do whatever they wish, and if they want a 100kB or 450kB article, that is their prerogative. And they will also know that it's a problem. Oakwillow (talk) 04:52, 17 May 2008 (UTC)
 * I notice that this is what you wrote in 2006 from that discussion: "There are 50KB articles that are too long (because they're all prose, no references), and there are well-cited 80KB articles that aren't too long (the KB is mostly in references)." Since references are not included in calculating readable prose, am I to conclude that you meant "edit byte count"? Oakwillow (talk) 16:24, 17 May 2008 (UTC)

These stats are very interesting. They show that most FA quality articles are in the range 10-30k of readable prose, with sizable minorities under 10k and in the range 30-50k. Articles of over 50k are rare, only about 3%. This is reasonably in line with the text of this article, which gives upper limits of 30-50k of readable prose. They would, in my opinion, form an excellent source on which to base the numbers in this article. HermanHiddema (talk) 16:48, 19 May 2008 (UTC)
 * I adjusted the numbers accordingly (see below). I notice that you recently installed the prose tool, can you fill in the numbers above? Also the ones I filled in, change them if you get very different results - you don't need to calculate the ratio, but erase the ratio if you change the prose size, ok? For the last article I only included the first sentence which is why I put in a question mark. I could use the November 2007 article list, but I think that more recent articles would be better to use. It is very interesting to me, that out of a million articles less than 10 were over 65 kB readable prose. Would you agree with saying ">50 kB Almost certainly should be divided up"? (see above proposal) Oakwillow (talk) 07:52, 20 May 2008 (UTC)
 * Note that those stats are only about featured articles as of November 2007, not wikipedia totals (See also: Special:Longpages). These statistics are therefore over 1721 articles, which means:


 * What constitutes a large article?
 * about 1% are > 60k
 * about 3% are > 50k
 * about 10% are > 40k
 * about 30% are > 30k
 * What constitutes a small article?
 * about 40% are <= 20k
 * about 20% are <= 15k
 * about 7% are <= 10k
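To give a sense of scale, the approximate shares listed above can be converted into article counts. This is only a back-of-the-envelope sketch in Python: the 1721 total and the percentages come from the comment above, the percentages are stated there as "about", so the resulting counts are rough, and the variable names are made up for this illustration.

```python
TOTAL_FEATURED_ARTICLES = 1721  # featured articles covered by the November 2007 stats

# Approximate shares of featured articles by readable prose size, as listed above.
approx_share = {
    "> 60k": 0.01,
    "> 50k": 0.03,
    "> 40k": 0.10,
    "> 30k": 0.30,
    "<= 20k": 0.40,
    "<= 15k": 0.20,
    "<= 10k": 0.07,
}

# Convert each share into an approximate number of articles.
approx_count = {
    label: round(TOTAL_FEATURED_ARTICLES * share)
    for label, share in approx_share.items()
}

for label, count in approx_count.items():
    print(f"{label:>7}: roughly {count} articles")
```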

Your proposal was the following: (I undid that one to restore the context of the original discussion)

Compromise table #2

 * The first compromise table used 2.5 for the ratio between edit byte count and readable prose. This one uses 2.0, and anything from 2 to 3 could be used. Oakwillow (talk) 16:41, 20 May 2008 (UTC)

I do not think it is right to say that about 30% of FA class articles "probably should be divided", I would reserve that for the 10% mark. I would then start the sliding scale of "May eventually need to be divided (likelihood goes up with size)" at that 30% point. Further, I have removed the "edit byte count" column in my proposal. As yet no clear reliable ratio between "edit byte count" and "readable prose" has been established, in the above results it varies from 1.27 to 5.15 in small articles, and from 1.73 to 3.47 in larger ones. With the disappearance of technical limitations of browsers, and the availability of section editing, I think edit byte count is a far less important measure than readable prose. Readable prose is a factor in content quality, while edit byte count only had technical impacts. HermanHiddema (talk) 11:32, 20 May 2008 (UTC)


 * Wouldn't it be fair to assume that those 30% that are bigger are that way because the editors felt that "the scope of [ the topic justifies ] the added reading time"? Oakwillow (talk) 17:47, 20 May 2008 (UTC)

So, my proposal:

I feel these numbers are in line with the stats. HermanHiddema (talk) 11:32, 20 May 2008 (UTC)


 * I do not agree that your analysis of the statistics is conclusive or even terribly relevant. The statistics do not show, for example, what the difference is regarding size between FA articles and non-FA articles.  I would guess that the average FA article is longer than the average non-FA article.  This would counter the "shorter is better" theory with a "longer is better" argument -- neither one of which tells anything close to the whole story. The purposes of this article are clearly listed at the start and "creating featured articles" is not one of them. Also Featured article criteria lists article length as only one of ten highlighted factors and specifically says this:


 * Length. It stays focused on the main topic without going into unnecessary detail (see summary style)


 * "Length" of course must be balanced with:


 * comprehensive: it neglects no major facts or details


 * I fail to see the proof that cutting in half the existing rule of thumb regarding the readable prose level at which the article "Almost certainly should be divided up" would add quality to virtually all articles. The quality of any article as well as the appropriateness of subdividing it is best handled by a discussion of the content of the particular article. I suggest we stick to the stated purposes of the size guidelines in our discussion over changing this article. I do not feel that Wikipedia's articles should be restricted in size based primarily on the poorest connection speeds available. If anything, we should be increasing the rule of thumb numbers to recognize the advances that have occurred since the numbers were originally calculated. Tom (North Shoreman) (talk) 12:53, 20 May 2008 (UTC)
 * It isn't cutting in half the existing rule of thumb because the numbers are for edit byte count not for readable prose in the existing article. The fact that they say readable prose is simply an error, as shown above in the edit summary, which can either be corrected in one of three ways, A, B, or C, above. It isn't going to kill anyone to see both numbers, edit byte count and readable prose, so the compromise solution is probably the best thing to do. However, instead of guessing about the difference between FA and other articles, how about filling in the stats, since you are the one that pointed out that you can "instantly" determine the readable prose size. I filled in the short ones, although some of them may need to be corrected. Oakwillow (talk) 16:04, 20 May 2008 (UTC)
 * Your claim that "readable prose [in the current article] is simply an error" is not true, and it has been adequately rebutted above. I see no purpose in filling in a table that I feel has little relevance to this discussion, but if you want to do so, go ahead. Tom (North Shoreman) (talk) 16:14, 20 May 2008 (UTC)

PS I went back to October 6, 2005 which had the following sentence:
 * "However, do note that readers may tire of reading a page in excess of 20-30 KB of readable prose (tables, lists and markup excluded)."

As you should be able to see, the concept of readable prose as the most relevant count has been with this article a long time. Tom (North Shoreman) (talk) 16:21, 20 May 2008 (UTC)
 * However, 20-30 kB means "no articles longer than 20 kB are going to be read by a lot of readers". I'm on dial-up. It would be nearly impossible to fill it in, and my numbers would not be the same because I don't do it the same way you do (I am not willing to use javascript). The allegation of it being an error has most certainly not been rebutted, it has been strengthened. Wouldn't "although the scope of a topic can sometimes justify the added reading time" apply to those 30% that are over 30 kB? Would you prefer to say "although 30% of the time the scope of a topic may justify the added reading time"? Bear in mind that it isn't a matter of us providing more reading time, because what it is doing is simply guaranteeing that the entire article is not going to be read most of the time - we can't change someone's attention span just by creating a longer article. It just doesn't get read. Oakwillow (talk) 16:27, 20 May 2008 (UTC)
 * I think that your October 6, 2005 version also said ">20KB - may need to be divided (make sure sections are <20K - preferably much smaller)", and was clearly referring to edit byte count, because while readable prose was discussed, it was clearly delineated from the rest of the article in the sentence you quoted (it didn't say, by the way all of the numbers in this article are for readable prose, it said, in effect, oh by the way these two numbers are for readable prose, not edit byte count, like all the rest). Oakwillow (talk) 16:48, 20 May 2008 (UTC)


 * I provided the old edit information simply to show that the phrase has been in use for a long time. The 20-30 KB reference however has been changed in the current revision to:
 * "Readers may tire of reading a page much longer than about 6,000 to 10,000 words, which roughly corresponds to 30 to 50 KB of readable prose. If an article is significantly longer than that, it may benefit the reader to move some sections to other articles and replace them with summaries"
 * So this suggests that the STARTING POINT for considering whether subdivision MAY be appropriate is 50 KB of readable prose -- considerably higher than is being proposed but consistent with the EXISTING table. As far as the likelihood of an article being read in full, this is a choice to be made by the reader based on their own needs and interests.  I would guess they are more likely to find something interesting and useful that they were not expecting in a longer article that they have on their screen in front of them, rather than if they have to switch to another screen in order to get the full availability of the information.
 * I am all for WP:Summary style -- I just think it should be driven by content and context rather than conjectures about attention spans. Tom (North Shoreman) (talk) 16:56, 20 May 2008 (UTC)
 * Actually, I would contend that saying that readers may tire from pages longer than 30 to 50 KB sets 30 KB as the upper limit - you don't want to leave anyone out, do you? So 30 KB prose becomes the ending point, not the starting point. Summary style is essential on all in depth articles - we have about 2 Megabytes in the United States article, when you count all the subarticles that stuff has been split off into. Oakwillow (talk) 17:56, 20 May 2008 (UTC)
 * Actually, there are plenty of people that will tire from reading 100 words of prose, so the "not leaving anyone out" argument doesn't hold. HermanHiddema (talk) 07:55, 30 May 2008 (UTC)

Just use the 80/20 rule - for 20% of the effort you get 80% of the results. Anytime you are dealing with statistics you have a bell shaped curve. By "anyone" I mean 80%. The other alternative is to use the Ivory soap rule - include 99 44/100%. Either one you choose you end up with a whole lot less than the previous guidelines. Oakwillow (talk) 06:15, 5 June 2008 (UTC)


 * Do you have any proof that the 80/20 rule actually supports an upper limit of 30kb? That is a rather bold assertion. HermanHiddema (talk) 09:08, 5 June 2008 (UTC)
 * I take it that you are not familiar with that rule? To quote from the 80/20 article, "The Pareto principle (also known as the 80-20 rule, the law of the vital few and the principle of factor sparsity) states that, for many events, 80% of the effects come from 20% of the causes." Who said that 30 was an upper limit? The article says that "30 KB [readable prose size] Probably should be divided (although the scope of a topic can sometimes justify the added reading time)" - that's not an upper limit, it's a guideline, which is a little too high for those who tire after reading 20 kB of prose - remember readers may tire of reading 20 to 30k?
 * Uhm, you did. Read your own comment from May 20 (two comments back), where you say "I would contend that saying that readers may tire from pages longer than 30 to 50 KB sets 30 KB as the upper limit". HermanHiddema (talk) 15:36, 5 June 2008 (UTC)
 * It is a limit, however this article is a guideline, and states simply that beyond 30kB "Probably should be divided", as a guideline, and not a limit, as is 50k, a guideline, not a limit. There is a difference. For some the limiting factor is their attention span of about 8 seconds. Everyone has a limit. Oakwillow (talk) 16:51, 5 June 2008 (UTC)

A rule of thumb (Proposal #3)
Some useful rules of thumb for splitting articles, and combining small pages:

These guidelines apply somewhat less to lists or disambiguation pages, and naturally do not apply to redirects.
 * Please note :

In this change, and I see no reason for not implementing it, 15k was reduced to 10k - in other words the lower division used 3:1 while the upper divisions used 2:1. It would still be helpful if someone would fill in the above table with readable prose sizes, by the way. Oakwillow (talk) 15:32, 5 June 2008 (UTC)


 * The reason for not implementing it is that people have simply not bought in to your agenda for mandating smaller articles. The majority of people who have commented have favored the status quo and tinkering with the numbers does not change the views of the majority.  Tom (North Shoreman) (talk) 15:37, 5 June 2008 (UTC)
 * Where on earth do you get the idea that I am mandating smaller articles? This is a guideline, and even says for the upper most division "there is no mandate, however; these are guidelines only". That is there because there had been discussion of creating a "size police" to run around and chop up large articles, which was soundly rejected. The purpose, however is to prevent editors from pointing to an invalid table and saying "see this article at 93 kB readable prose is well within the WP:SIZE guidelines", totally neglecting the fact that readers may tire of only 20 kB of readable prose. Oakwillow (talk) 16:39, 5 June 2008 (UTC)
 * The language is contradictory. As proposed it reads, "Almost certainly should be divided up (there is no mandate, however; these are guidelines only)".  Some will emphasize the first part while others the second part -- it's an edit war waiting to happen anytime someone tries to apply the guideline in a particular article. What you are in effect saying is that your numbers are "almost certainly a mandate" -- way too close to an absolute mandate.


 * As far as folks getting "tired", let them learn to skim. Why should we favor folks with low attention spans as opposed to those who want more information at one site?  My concern is editors who have little or no knowledge or interest in the article who, nevertheless, want to chop up the article simply because it is some arbitrary size. Tom (North Shoreman) (talk) 17:13, 5 June 2008 (UTC)

Ok, what would you suggest as an improvement? As I see it, it is up to individual editors to make their own choices - it has been pointed out above that one article has gone to 122kB. I see no danger of this guideline creating or stopping edit warring. As to "let them learn to skim", it is definitely not our job to try to change the way people read. We just shouldn't be showing people a table of edit byte count and calling it a table of readable prose. Oakwillow (talk) 18:17, 5 June 2008 (UTC)
 * My suggestion is to drop your insistence on including edit byte count as part of the rule of thumb. It means absolutely nothing as far as readability or download speed for users on dial-up are concerned. The only thing it means is that when you select to edit the article, there is a certain amount of text in the edit window in which to edit. As an example, the Barack Obama article has 120kb of editable text and 34kb of readable text, so above the edit byte size, but well below the readable text size. The 120kb of editable text is due primarily to the text of the 177 references (35k), the templates and links to other languages (9k), section headers and related stuff (1k), images (2k), and the remainder being the wiki markup and cite templates. --Bobblehead (rants) 18:41, 5 June 2008 (UTC)
 * So change the table to more realistic numbers. Edit byte count, though, is just as important a measure as readable prose (and the more important of the two if you are on dialup), and is the easier of the two to find - it stares you in the face every time you click edit. What's it there for, to amuse you? No, it's there to help you, and as can be seen, there is a strong correlation between the two numbers. The BA article at 34 kb is not below the readable text size. Readers may tire of 20 kB, and for them, it is way beyond what they can read. Here, I read pretty fast, let's see how long it takes me to read the BA article, although I have to quit if it takes longer than 20 min. Oakwillow (talk) 19:20, 5 June 2008 (UTC)
 * Edit byte count isn't a good measure for dial up users, it's merely a remnant of the days when 32kb was the maximum size that many browsers could load that has been repurposed to mean something it shouldn't. Edit byte count may be a larger number than readable prose, but it certainly doesn't tell you if the page is going to be problematic for dialup users. This discussion page is 140k in size and I guarantee you that dial up users don't have a problem loading this page, while Barack Obama is 120k and dial up users have problem loading that one. Long story short, there isn't a meaningful number for edit byte count that can be used to determine whether or not an article should be broken up or not. Every article is going to have a different threshold of edit byte count depending on how many images they use, what kind of templates and how many are included in the article, and how many references the article has. --Bobblehead (rants) 19:59, 5 June 2008 (UTC)
 * Short story long, it's a usable number, which is only thrown off if there are a lot of images or any big images. This version took 12 minutes to read, not counting the 32 seconds staring at a blank screen while the text downloaded (absurd, useless, pointless and stupid), although I have to say that at 6 minutes I lost interest and at 11 minutes I was just passing my eyes over the words hoping that I would finally come to the end. I would estimate that I read about twice as fast as the average reader, and at least 3 times as fast as a slow reader. The article length is absurd. Fill out the table above and we will see just how close the correlation between edit byte count and readable text really is. Oakwillow (talk) 20:16, 5 June 2008 (UTC)
 * There's no reason to fill out the table above. You are the only one that is insisting that edit byte size is a usable number. You're beating a dead horse here. Drop it and move on. --Bobblehead (rants) 20:19, 5 June 2008 (UTC)
 * On the other hand it is pointless to make false suppositions that are easily refuted or supported. The purpose of filling out the table is to make an intelligent assessment of the proper readable byte count. 50 random articles from all three groups, FA, GA, and unrated should be sufficient. Oakwillow (talk) 21:57, 5 June 2008 (UTC)
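As a rough check of the Barack Obama breakdown quoted by Bobblehead above (18:41, 5 June 2008), the sketch below subtracts the itemized components from the 120 kB edit size to see what is left for wiki markup and cite templates. It assumes the quoted figures are additive parts of the same 120 kB, which the comment implies but does not state; the variable names are illustrative only.

```python
# Figures quoted in the comment, all in kB.
# Treating them as additive parts of the edit size is an assumption.
edit_size_kb = 120
components_kb = {
    "readable prose": 34,
    "reference text": 35,
    "templates and interlanguage links": 9,
    "section headers": 1,
    "images": 2,
}

# Whatever is not itemized is attributed to wiki markup and cite templates.
remainder_kb = edit_size_kb - sum(components_kb.values())

# Ratio of edit byte count to readable prose for this one article.
edit_to_prose_ratio = edit_size_kb / components_kb["readable prose"]

print(f"left for wiki markup and cite templates: {remainder_kb} kB")
print(f"edit size / readable prose: {edit_to_prose_ratio:.2f}")
```

For this single article the ratio comes out well above the roughly 2:1 figure discussed elsewhere in the thread, which is why it is treated as an outlier.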

Edit byte count is one of the least useful metrics when it comes to article size. It only affects users when they edit the article, and only if they do not use the section editing feature. As such, it affects an exceedingly small percentage. For example: The Barack Obama article was edited 763 times in May, and a substantial percentage of the edits used the section editing feature. In the same period, it was viewed 770849 times. Which means that about 1 in a thousand page views is an edit, and only a part of those is affected by the edit byte count. The remaining 770000 page views are not affected by edit byte count at all. They are affected by the download size of the HTML and the size of the readable prose. HermanHiddema (talk) 20:37, 5 June 2008 (UTC)
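The "about 1 in a thousand" figure in the comment above can be checked directly from the two numbers it quotes. This is only a sketch of that arithmetic; it does not try to estimate how many of the edits used section editing, since the comment gives no figure for that.

```python
# Figures quoted in the comment (Barack Obama article, May 2008).
edits = 763
page_views = 770_849

# Share of page views that were edits, and the inverse.
edit_share = edits / page_views
views_per_edit = page_views / edits

print(f"edits as a share of page views: {edit_share:.4%}")
print(f"roughly one edit per {views_per_edit:.0f} page views")
```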
 * I would suggest doing some homework before making any brash statements like that. For example, were you to fill in the table above, now that you have installed the prosesize tool, we could actually find out if there is any correlation between the two numbers. By the way, I believe that hitting show preview is going to at least double the byte count that has to be downloaded, for all but the very smallest articles. However, I agree that our priority should be on our readers, not on our editors. If each of the three of you took a third of the table I would expect you could finish in less than 5 minutes, although I see that Tom doesn't have prosesize installed (and what on earth are you even complaining about then, do you just want to put 17 million as the proposed article split size so that no one can ever look to the guidelines for assistance?). Assuming that none of you are on dialup. Oakwillow (talk) 21:20, 5 June 2008 (UTC)


 * Excuse me, but I did my homework above, when I showed how irrelevant edit byte count is. Why should I now do your homework when all it does is show a possible correlation between prose size and a number I have already shown to be mostly meaningless anyway? HermanHiddema (talk) 21:35, 5 June 2008 (UTC)
 * What on earth are you talking about? Having a linear correlation means that there is a specific ratio between the two numbers. So far it seems highly likely that this is the case, as only the BA article above has a significantly different ratio from 2.06, and is only 68% away, and if that is the case you can just throw out readable prose from the table completely. However, to make everyone happy I would suggest keeping both numbers. Oakwillow (talk) 21:50, 5 June 2008 (UTC)
 * There is no correlation between edit byte size and readable text/total html size. Remove everything except the images on Barack Obama and the article is less than 2kb of readable prose, but it is still over 350kb in total HTML size and still takes over 70 seconds to load on a 56k modem. Edit byte size means absolutely nothing.--Bobblehead (rants) 21:54, 5 June 2008 (UTC)

Your statement is mathematically false. No correlation would mean that for any random group of articles the ratio would be all over the place from close to zero to close to infinity. I don't think that is what you had in mind. What I am seeing is a strong correlation. Certainly there are some articles which are only a gallery of images that are the exception, we have already seen a discussion of this, for example on an international page that lists hundreds of flag icons. However, what is it like for the vast majority of articles? Correlation is a mathematical term for the goodness of fit of a straight line, and can be calculated from the data. Oakwillow (talk) 22:10, 5 June 2008 (UTC)
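To make the term concrete, here is a minimal sketch of how a Pearson correlation coefficient could be calculated for pairs of edit byte count and readable prose size. The sample values are hypothetical and are NOT the articles from the table discussed on this page; the function name `pearson_r` is likewise just an illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical sample (kB), made up for illustration only.
edit_kb = [10, 20, 40, 60, 80]
prose_kb = [5, 9, 21, 29, 40]

r = pearson_r(edit_kb, prose_kb)

# Per-article ratios, which is what a "2:1" rule of thumb would rely on.
ratios = [e / p for e, p in zip(edit_kb, prose_kb)]

print(f"r = {r:.3f}")
print("edit/prose ratios:", [round(x, 2) for x in ratios])
```

A high coefficient on a sample like this only says the points lie close to some straight line; it says nothing about whether either number is the right one to put in a guideline.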


 * I am quite aware of what a correlation is, thank you. What I am saying is that edit byte count is extremely unimportant when compared to either readable prose size or total html size. As such, I see no point in including it in the table. Perhaps a note under the table to the effect of "The byte count reported at the top of the edit window is usually about 1.5 to 3.5 times that of the readable prose size" may be useful to give editors a rough way to estimate readable prose size. I've given you some numbers in your table, you can do the ratios. HermanHiddema (talk) 22:05, 5 June 2008 (UTC)
 * Thanks. Ratios I can calculate. The number is going to be closer to 2 to 3 though, and probably 2 will be close enough to use. The readable prose table still needs to be corrected with realistic numbers. Your numbers seem specious - they all end in an extremely unlikely three zeros. For the purpose of careful analysis it would help to indicate the version of the article used and the exact byte count. Oakwillow (talk) 22:19, 5 June 2008 (UTC)


 * The prose size tool reports the prose size in KB, so more precise numbers are not available with that tool. I do not think that the extra precision is very significant anyway. HermanHiddema (talk) 11:39, 6 June 2008 (UTC)


 * I don't understand this rationale for edit byte count as the easiest metric to use. Once the prosesize tool is installed, you get the readable prose size count with one click.  You also get the readable prose word count, and word count is the size metric most familiar to those who have done writing in other fields.  Wasted Time R (talk) 22:37, 5 June 2008 (UTC)


 * Not everyone has prosesize installed, but everyone sees the byte count when they either click edit (over 32kB) or click history. I have no objection to a greater emphasis on word count than prose byte count. It would certainly help reduce the confusion. I assume that is actual words, not typing words (characters divided by five)? So far I'm seeing a 97% correlation between edit byte count and readable prose characters, which is what I would call very high. Oakwillow (talk) 22:50, 5 June 2008 (UTC)


 * Let's keep in mind that correlation does not imply linearity (see also Correlation). HermanHiddema (talk) 12:11, 6 June 2008 (UTC)

Hello... Speaking of correlations you guys might consider a checkuser on this page because this discussion has implications for all of Wikipedia and I think there's a good chance it has been compromised. My favorite statement in the above discussion was, "My concern is making it too easy to justify slicing up appropriately long articles (because of the nature of the material) by editors with little grasp or interest in the underlying subject." Amen... On the mischievous side of things there are people who revel in disrupting good articles. Don't give them any more tools. Strengthen the protection of good content if anything. I'm an advocate of the readable prose guideline. Technical or biographical articles can easily have half or more of the edit byte count in refs, lists, tables, pictures etc. The current guidelines are SAT. Mrshaba (talk) 17:48, 6 June 2008 (UTC)

A rule of thumb (Best fit)
Some useful rules of thumb for splitting articles, and combining small pages:

These guidelines apply somewhat less to lists or disambiguation pages, and naturally do not apply to redirects.
 * Please note :

The above is a best fit using a power function to convert from edit byte count to readable prose size. Oakwillow (talk) 22:55, 5 June 2008 (UTC)


 * Sigh. You are making the assumption, again, that the current numbers in the table refer to edit byte count, but they do not. As such, the above table is utter nonsense and you should move the numbers from the edit byte count column to the readable prose size column, then use your power function to fill the edit byte count column with the appropriate larger numbers. HermanHiddema (talk) 07:30, 6 June 2008 (UTC)
 * Just to support HermanHiddema here. The readable text levels are completely ridiculous. The current rule of thumb in the guideline is perfectly acceptable and you are the only one arguing for its replacement, Oakwillow, which should pretty much tell you that consensus is against you. Drop it and move on. You're wasting your time and the time of every editor that comes by here and feels the need to respond to your incessant demands for change to this guideline. --Bobblehead (rants) 19:00, 6 June 2008 (UTC)
 * Wrong. I am pointing out an obvious error which needs to be corrected. A table that was established using edit byte counts has been mislabeled readable prose. What are the facts that we know? 1) that readers may tire from reading 20-30 kB (and by extension anything greater, such as 30-50, or 10-17 million), 2) that using the erroneous numbers leads to articles twice as long and frequent complaints about article size, and 3) that the edit byte count is more accessible and is commonly confused with readable character count, although there is typically a 2:1 ratio between the two. Therefore, the table needs to be fixed. It is simply wrong. The three ways it could be fixed are choices A, B, or C above. Choose one and move on. We can also add 4) that most people in other writing areas use word count and not character count, so that is what we should also use. By the way, it is a little amusing that the only editors participating in this discussion are P.E.'s (Primary Editors of the article they have edited the most) of horrendously long articles (over 80 kB edit byte count, although one is right at 80 kB). WTR of course sets the record, for being the P.E. of HRC (Hillary Clinton) at 158 kB edit byte count. Trust me I couldn't care less how long you make your articles, just when someone complains that it is too long don't point to a bogus number and say it is within guidelines (HRC in the FAQ appropriately says - Q: This article is long! A: Yes.). Oakwillow (talk) 22:32, 6 June 2008 (UTC)
 * Please refrain from personal attacks. HermanHiddema (talk) 21:25, 8 June 2008 (UTC)
 * Not intended. It was simply an observation. Most articles are much smaller. Oakwillow (talk) 00:55, 9 June 2008 (UTC)
 * The table was established using source size at a time when the difference was rarely of significance. The intent is to refer to the size of the readable text as reflected on the page. Christopher Parham (talk) 23:42, 8 June 2008 (UTC)
 * Not a problem, however, there is a 2 to 1 difference in the numbers, so they should be adjusted accordingly. It would probably make sense though to just switch to word count. Oakwillow (talk) 00:55, 9 June 2008 (UTC)
 * There is no need to adjust anything. The existing numbers have always referred to readable prose size, as they do today, and there is no need to change them. The only necessary adjustment has already been made long ago, by ending the use of source size as a good analogue to prose size, which it no longer is. Hence your proposals are unnecessary, which explains the lack of support for them. Christopher Parham (talk) 04:05, 9 June 2008 (UTC)

Proposal #4 (Use word count instead of readable character count)
Some useful rules of thumb for splitting articles, and combining small pages:

These guidelines apply somewhat less to lists or disambiguation pages, and naturally do not apply to redirects.
 * Please note :

This proposal is to switch from readable prose which inherently creates confusion between edit byte count and readable character count, to word count, which is more normally used as a measure of how much to write about a subject. Much of the rest of the guideline would also need to be changed slightly to reflect this emphasis. Oakwillow (talk) 18:18, 8 June 2008 (UTC)


 * Sigh. This table makes the same mistake that all your other proposals have made. It falsely assumes that the current numbers in the table refer to edit byte count instead of readable prose. HermanHiddema (talk) 21:24, 8 June 2008 (UTC)
 * I see no indication that that is a false assumption. I would, however like to hear what others have to add. Whatever numbers are used, if they are too large there will be more complaints about size, if they are too small, the articles will simply ignore them. Right now what I am seeing is complaints about size. As a percentage of articles, there are very few that are ignoring the above suggestion, of staying at less than 100 kB edit byte count, zero of the 38 random articles above (the only ones larger are the added country and campaign articles). Of all 1721 FA articles at User:Dr pda/Featured article statistics, only 14, or less than 1%, are greater than 10,000 words, so I would posit that that is a good number to use. None are greater than 15,000 words, so that or 20,000 would be a useless number to use (it would be like setting a 500 kph speed limit on a highway; while every kid would love it, it would serve no purpose). Oakwillow (talk) 01:18, 9 June 2008 (UTC)

Other editors have, so far, agreed with my assertion that your assumption is false. See eg the comment by Tom (North Shoreman) at 16:14, 20 May 2008 (UTC) and the comment by Christopher Parham at 04:05, 9 June 2008 (UTC). This is not a popular vote, of course, but apparently other editors have found my arguments convincing or have reached the same conclusion separately. I will try to explain how I think the current numbers came about:

At one time, in 2003 and before, there was a hard limit of 32 KB on "edit byte count", because some browsers had issues with editing texts larger than that. This was, of course, purely a technical issue. The number 32KB was not based on considerations of readability or the like. By 2004, the number of browsers that still had this issue had become quite small. This version from March 2004 links to a page of browsers that have the issue and how to upgrade them. It explicitly mentions that section editing for logged in users exists, and mostly invalidates the technical 32KB limitation. By the end of 2004 the section editing feature was available to all users, whether logged in or not, further invalidating the 32KB limit. As 32KB was no longer a technical limitation, but there was still a desire to say something about limiting article size, the concept of "readable prose" was introduced to the page. The version above includes the text "Readers may also tire of reading a page in excess of 20-30 KB of readable prose (tables, lists and markup excluded)". This 20-30KB seems, initially, to have been a convenient number because it resulted in limits similar to the outdated 32KB technical limit. As time moved on, however, the numbers for "readable prose" were adjusted upward, after discussion, to "30-50 KB", which was apparently felt to be more in line with actual limits on attention span. The text, and the "rule of thumb" table were updated accordingly. And as "readable prose" has been the most important measure since 2004, the table has referred to that since that time. What has happened since is that Wikipedia became stricter in its referencing policy, causing "edit byte count" and "readable prose" to drift apart. So numbers could at that time have referred to either "readable prose" or "edit byte count" without any significant difference, but this is no longer the case.

Now personally, I feel that the current numbers in the table are somewhat too high, and could benefit from being adjusted downward. But to do that, we should simply examine what reasonable limits on "readable prose" are. To assume that the current numbers refer to "edit byte count" and that they are set in stone and can only be adjusted downward by dividing them by some 2.06 correlation number is counter-productive. HermanHiddema (talk) 08:14, 9 June 2008 (UTC)
 * Edit byte count=crap. For the exact reasons described above by HermanHiddema and multiple times by myself. Personally, before I'll seriously consider any modification to the rule of thumb table, the use of edit byte count must be removed. I'm open to reducing the readable text size, but not by much. Certainly not to 50k being the "Almost certainly ..." level. Realistically, the only issue I see with the current readable prose limits is that the 100k is too high for the "Almost certainly..." level. Drop it down to 80k and make the divisions a little clearer and I think the table would be fine:

--Bobblehead (rants) 20:20, 9 June 2008 (UTC)

Good, now we're getting somewhere. This seems like a reasonable proposal. I do think that using word count might be a good idea as well, given that that measure is familiar to writers in many fields, as Wasted Time R mentions above. The table above shows a ratio of slightly over 6 bytes per word, so how about something like:

-- HermanHiddema (talk) 14:17, 11 June 2008 (UTC)


 * It is useless to include word count and readable character count, as there is a 99.95% correlation between the two. However it is necessary to include both edit byte count and word count because edit byte count is more accessible than word count and there is not a more than 99.9% correlation between them. There is a very high correlation, but it is only about 96.5%. A simple rule of thumb should be included to divide edit byte count by 12 to get approximate word count. These are guidelines, and nowhere should it say that articles must be smaller than x bytes or y words. I still maintain that the present table refers to edit byte count and therefore it should be divided by 12 to get suggested word counts. As a guideline, what works best? What is the ideal size for an article, making the assumption that there is at least twice that you can write about the subject? In other words, if it is a given that you are using summary style and subarticles, how big should each subarticle and the main article be, ideally? That's what a guideline is, it gives you that number. I would suggest about 2,500 words to 3,500 words as the ideal size for each article. Something that an average reader can get through in about 10 to 15 minutes, and a fast reader in 5 or 6 minutes. You will notice that I generously rounded up 8.3k to 10 k, for simplicity. Oakwillow (talk) 20:16, 17 June 2008 (UTC)
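The "divide by 12" rule of thumb proposed in the comment above appears to combine two ratios mentioned in this discussion: roughly 2 bytes of edit size per byte of readable prose, and roughly 6 bytes of prose per word. The sketch below spells that out. The function name and default parameters are illustrative assumptions, both ratios are disputed in the thread, and 1 kB is taken as 1,000 bytes for simplicity.

```python
def estimate_word_count(edit_bytes, edit_to_prose=2.0, bytes_per_word=6.0):
    """Rough word-count estimate from edit byte count.

    edit_to_prose:  assumed edit bytes per byte of readable prose (about 2 in the thread)
    bytes_per_word: assumed readable-prose bytes per word (about 6 in the thread)
    """
    return edit_bytes / (edit_to_prose * bytes_per_word)

# 100 kB of edit byte count, taking 1 kB as 1,000 bytes (an assumption).
words = estimate_word_count(100_000)
print(f"about {words:,.0f} words")
```

With the defaults, 100 kB of edit size maps to roughly 8.3 thousand words, which matches the "8.3k" that the comment says was rounded up to 10k. An article with an unusual share of references or templates would need a different `edit_to_prose` value.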