User:IkeB108/sandbox

= Do Larger Wikipedia Articles Get More Views? =

Introduction
I wanted to see whether larger Wikipedia articles get more or fewer page views than smaller ones. I predicted that larger articles would get more views. The explanatory variable in this observational study is the size, in bytes, of an article’s most recent version. The response variable is the total number of views the article has received since its creation.

Data Collection
I took a simple random sample of 20 articles from the population of all Wikipedia articles. To do this, I used Wikipedia’s “Random article” button, which does not favor articles based on size, popularity, or any other potentially confounding variable. In the “View History” tab, one can find how many page views the article has received since its creation, and the size of the most current version of the article in bytes. I measured articles in bytes rather than by word or character count because the byte count covers the article’s entire source text, including the markup for images, sound files, and other non-textual elements.
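A similar sample could in principle be drawn programmatically. The sketch below assumes the MediaWiki Action API’s `list=random` query, which is not how the sample for this study was drawn (it was drawn by hand with the “Random article” button). The function name `random_article_query_url` is hypothetical, and the code only builds the request URL without contacting Wikipedia.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def random_article_query_url(sample_size=20):
    """Build a MediaWiki API URL asking for `sample_size` random articles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,        # namespace 0 = ordinary articles only
        "rnlimit": sample_size,  # how many random pages to return
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = random_article_query_url(20)
print(url)
```

Fetching the URL and reading each page’s size and view count would be a separate step that is not sketched here.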

Results

 * Equation of least-squares regression line: y = 0.0284x + 38.9 (x = article size in bytes, y = total page views)
 * r (correlation coefficient): 0.5385
 * r squared (coefficient of determination): 0.29
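For readers who want to see how statistics like these are obtained, here is a minimal sketch of a least-squares fit in plain Python. The helper name `least_squares` is hypothetical, and the data in it is made up for illustration; it is not the 20 sampled articles, so its output will not match the figures above. The last line only checks that the reported correlation coefficient and coefficient of determination agree with each other.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Made-up illustrative values, NOT the sampled articles.
sizes_bytes = [1000, 2000, 3000, 4000, 5000]
page_views = [50, 120, 90, 200, 160]

slope, intercept, r = least_squares(sizes_bytes, page_views)
print(f"y = {slope:.4f}x + {intercept:.1f}, r = {r:.4f}, r^2 = {r * r:.4f}")

# Consistency check on the reported figures: 0.5385 squared rounds to 0.29.
reported_r_squared = round(0.5385 ** 2, 2)
print(reported_r_squared)
```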

Analysis
The scatterplot shows a weak-to-moderate positive correlation (r = 0.5385) between article size in bytes and page view count. A linear model is a poor fit for the data: with a coefficient of determination of 0.29, only about 29% of the variation in page views is accounted for by article size. Two articles stand out as outliers with very large residuals. If these outliers were removed from the data set, the coefficient of determination would rise from 0.29 to 0.674.

It would not be appropriate to extrapolate this trendline, nor to use it to predict an article’s page views from its size in bytes. The coefficient of determination of 0.29 is too low for the line to give reliable predictions: about 71% of the variation in page views is left unexplained by article size.

Interpreting the LSRL Equation
The equation of the least-squares regression line is y = 0.0284x + 38.9, where x is the article’s size in bytes and y is its total page views. If the line were a good fit for the data, here is what that equation would represent: The b value (the y-intercept), 38.9, would mean that we should expect an article of zero bytes (an empty article) to have received 38.9 views since its creation. The a value (the slope), 0.0284, would mean that each additional byte is associated with an increase of 0.0284 in the expected view count. Because this is an observational study, that is an association and not evidence that adding bytes causes more views. It makes sense that the a value is extremely low, because bytes are very small units of data, so a single byte would not correspond to much of a difference in view count. One byte is a string of eight ones and zeros, so one byte would translate to one character of text if the article were encoded in ASCII. Wikipedia articles, like most websites today, are encoded in UTF-8, a Unicode encoding in which each ASCII character still takes one byte while other characters take two to four bytes.
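To make the slope and intercept concrete, the short sketch below plugs a few article sizes into the reported line. The function name `predicted_views` and the example sizes are assumptions chosen for illustration. Since the line is a poor fit, the numbers show only how the equation is read, not what any real article should be expected to receive.

```python
SLOPE = 0.0284    # a value: change in predicted views per additional byte
INTERCEPT = 38.9  # b value: predicted views for a zero-byte article


def predicted_views(size_bytes):
    """Evaluate the reported regression line y = 0.0284x + 38.9."""
    return SLOPE * size_bytes + INTERCEPT


# Arbitrary example sizes in bytes.
for size in (0, 1_000, 10_000):
    print(size, "bytes ->", round(predicted_views(size), 1), "views")
```

Going from 1,000 to 10,000 bytes adds 9,000 × 0.0284 = 255.6 predicted views, which is the slope at work.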

What can be inferred from the Residuals plot?
From the residuals plot, it can be inferred that the two scatter plot points with the highest residuals are outliers. In other words, two articles in the data set had an unusually high page view count for their size. The residuals of a least-squares regression line always sum to zero, and when the line fits well they are scattered fairly evenly above and below it. In the residuals plot, far more data points have a negative residual than a positive one, because the two large positive residuals must be balanced by many smaller negative ones. If the two points with the highest residuals are removed, the positive and negative residuals are much more balanced, and the coefficient of determination rises from 0.29 to 0.674.
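The effect of a couple of high outliers on the residuals can be reproduced on a small scale. The sketch below uses made-up data (six points near a line plus two hand-picked high outliers), not the sampled articles, and the helper names `fit_line`, `residuals`, and `sign_counts` are hypothetical. It counts positive and negative residuals and compares the coefficient of determination before and after the outliers are dropped.

```python
def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


def residuals(xs, ys):
    """Observed minus predicted y, using the line fitted to (xs, ys)."""
    slope, intercept, _ = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sign_counts(values):
    """Return (number of positive values, number of negative values)."""
    return sum(v > 0 for v in values), sum(v < 0 for v in values)


# Made-up data, NOT the sampled articles.
sizes = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
views = [65, 75, 105, 420, 135, 165, 480, 195]
outlier_indices = {3, 6}  # chosen by hand for this illustration

res_all = residuals(sizes, views)
r2_all = fit_line(sizes, views)[2]

# Refit after dropping the two outliers.
kept = [i for i in range(len(sizes)) if i not in outlier_indices]
sizes_trim = [sizes[i] for i in kept]
views_trim = [views[i] for i in kept]
res_trim = residuals(sizes_trim, views_trim)
r2_trim = fit_line(sizes_trim, views_trim)[2]

print("sum of residuals (all points):", round(sum(res_all), 6))
print("positive/negative (all points):", sign_counts(res_all))
print("positive/negative (outliers removed):", sign_counts(res_trim))
print("r^2 before and after:", round(r2_all, 3), round(r2_trim, 3))
```

With the outliers included, most points fall below the fitted line; after they are removed, the signs even out and the fit tightens, the same pattern described above for the real sample.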