User:Pinkfrog22/Interquartile range

Article Draft
NOTE: Citations 1,7,12 are the same -- fix this

Lead
In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data (CITE HERE). The IQR may also be called the midspread, middle 50%, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts, via linear interpolation. These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds to the 25th percentile and the upper quartile corresponds to the 75th percentile, so IQR = Q3 − Q1.

The IQR is an example of a trimmed estimator, defined as the 25% trimmed range, which enhances the accuracy of dataset statistics by dropping lower-contribution, outlying points (CITE: 10.1007/978-3-642-23502-3). It is also used as a robust measure of scale (CITE: 10.1007/978-3-642-23502-3). It can be clearly visualized as the box in a box plot.

Use
The primary use of the IQR is to represent the difference between the upper and lower quartiles of a data set (Dekking). This can be used as an indicator for variability of the dataset (Dekking).

It is also used to build box plots, which are a graphical representation of a probability distribution. In the box plot, the IQR is the height of the box itself, and the whiskers extend up to 1.5*IQR beyond the box (Dekking). Any data point located outside of the whiskers is referred to as an outlier (Dekking) (see below).

The IQR is often preferred to the total range as a measure of variability because it has a higher breakdown point: 25%, compared to 0% for the total range. The median absolute deviation (MAD) is more robust still, with a breakdown point of 50%. (rousseaux)

The IQR has been applied in a number of recent studies. Some of these uses include:

Note: from original article but have not yet been able to corroborate with authoritative source
 * Sampling for Design Space Exploration (https://doi.org/10.1115/1.4044432)
 * Predicting Stock Returns (10.1155/2021/9911986)
 * Image Denoising (edsarx.1302.1007)

For a symmetric distribution (where the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).

The median is the corresponding measure of central tendency.

The IQR also may indicate the skewness of the dataset.

The quartile deviation or semi-interquartile range is defined as half the IQR.

Discrete Variables
The IQR of a set of values is calculated as the difference between the upper and lower quartiles, Q3 and Q1. Each quartile is a median calculated as follows.

Given an even (2n) or odd (2n+1) number of values:


 * first quartile Q1 = median of the n smallest values
 * third quartile Q3 = median of the n largest values

The second quartile Q2 is the same as the ordinary median.
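The median-of-halves rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `quartiles` and `iqr` are hypothetical, and the middle value is left out of both halves when the number of values is odd, as the 2n+1 case implies. Other quartile conventions (for example, interpolation-based ones) may give slightly different results.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data) // 2  # size of each half; the middle value is skipped for odd counts
    if n == 0:
        raise ValueError("need at least two values")
    q1 = median(data[:n])   # median of the n smallest values
    q3 = median(data[-n:])  # median of the n largest values
    q2 = median(data)       # ordinary median of the whole set
    return q1, q2, q3


def iqr(values):
    """Return the interquartile range Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


sample = [7, 15, 36, 39, 40, 41]
print(quartiles(sample))  # (15, 37.5, 40)
print(iqr(sample))        # 25
```

For the six-value sample, the lower half is 7, 15, 36 and the upper half is 39, 40, 41, so Q1 = 15, Q3 = 40, and the IQR is 25.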

Continuous Variables
The interquartile range of a continuous distribution can be calculated by integrating the probability density function (PDF) over specific intervals. The lower quartile, Q1, is a number such that the integral of the PDF from −∞ to Q1 equals 0.25, while the upper quartile, Q3, is a number such that the integral from −∞ to Q3 equals 0.75. Writing f for the PDF, these integrals are as follows:

$$\int_{-\infty}^{Q_1} f(x)\,dx = 0.25, \qquad \int_{-\infty}^{Q_3} f(x)\,dx = 0.75$$

In terms of the cumulative distribution function (CDF), the quartiles can be defined as follows, where CDF−1 is the quantile function:

$$Q_1 = \text{CDF}^{-1}(0.25), \qquad Q_3 = \text{CDF}^{-1}(0.75)$$
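As a rough illustration of the quantile-function definition, the sketch below evaluates Q3 − Q1 for a standard normal distribution using `statistics.NormalDist.inv_cdf` from the Python standard library. The helper name `iqr_from_quantile_function` is hypothetical, and the normal distribution is chosen here only as a convenient example.

```python
from statistics import NormalDist


def iqr_from_quantile_function(inv_cdf):
    """IQR = CDF^-1(0.75) - CDF^-1(0.25) for any quantile function."""
    return inv_cdf(0.75) - inv_cdf(0.25)


standard_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = standard_normal.inv_cdf(0.25)
q3 = standard_normal.inv_cdf(0.75)
normal_iqr = iqr_from_quantile_function(standard_normal.inv_cdf)

# Sanity check: the CDF evaluated at each quartile recovers 0.25 and 0.75.
print(standard_normal.cdf(q1), standard_normal.cdf(q3))
print(normal_iqr)  # roughly 1.349 for sigma = 1
```

Any distribution with a known quantile function can be passed in the same way; for a normal distribution with standard deviation σ the result scales to about 1.349σ.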

Distributions
Note: No citations provided for this section yet - also not discussed in Dekking

Whaley thesis corroborates this algorithm --> need to merge into algorithm section

The interquartile range and median of some common distributions are shown below

Interquartile range test for normality of distribution
The IQR, mean, and standard deviation of a population P can be used in a simple test of whether or not P is normally distributed, or Gaussian. If P is normally distributed, then the standard score of the first quartile, z1, is −0.67, and the standard score of the third quartile, z3, is +0.67. Given mean = μ and standard deviation = σ for P, if P is normally distributed, the first quartile

$$Q_1 = \mu + \sigma z_1$$

and the third quartile

$$Q_3 = \mu + \sigma z_3$$

If the actual values of the first or third quartiles differ substantially[clarification needed] from the calculated values, P is not normally distributed. However, a normal distribution can be trivially perturbed to maintain its Q1 and Q3 standard scores at −0.67 and 0.67 and yet not be normally distributed (so the above test would produce a false positive). A better test of normality, such as a Q–Q plot, would be indicated here.
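The comparison described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sample quartiles come from `statistics.quantiles` with its default method (which is not the median-of-halves rule), and no threshold for "substantially" is applied because the text does not define one. The code only reports how far each observed quartile lies from the value expected under normality, in units of the standard deviation.

```python
from statistics import NormalDist, fmean, pstdev, quantiles


def expected_normal_quartiles(mean, sigma):
    """Quartiles a normal distribution with this mean and sigma would have."""
    z3 = NormalDist().inv_cdf(0.75)  # about +0.67; by symmetry z1 = -z3
    return mean - z3 * sigma, mean + z3 * sigma


def quartile_gaps(data):
    """Return (Q1 gap, Q3 gap): observed minus expected quartile, in sigma units."""
    mean = fmean(data)
    sigma = pstdev(data)
    q1, _, q3 = quantiles(data, n=4)  # default quartile convention (an assumption)
    e1, e3 = expected_normal_quartiles(mean, sigma)
    return (q1 - e1) / sigma, (q3 - e3) / sigma


# Example with simulated normal data; both gaps should be close to zero.
data = NormalDist(mu=10, sigma=2).samples(10_000, seed=1)
print(quartile_gaps(data))
```

Small gaps are consistent with normality but do not establish it, for the reason given above.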

Outliers
The interquartile range is often used to find outliers in data. A fence is used to identify and categorize types of outliers in the data or on a box plot (https://doi.org/10.18434/M32189). There are four relevant fences:


 * Lower Inner Fence: Q1 - 1.5*IQR
 * Upper Inner Fence: Q3 + 1.5*IQR
 * Lower Outer Fence: Q1 - 3*IQR
 * Upper Outer Fence: Q3 + 3*IQR

Any data points that fall between the inner and outer fences are called mild outliers. Points that fall beyond the outer fences are called extreme outliers (https://doi.org/10.18434/M32189).
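The four fences and the mild/extreme classification can be sketched as below. The function names `fences` and `classify` are hypothetical, and the quartiles are passed in directly so that the sketch does not depend on any particular quartile convention. Treating a point that lies exactly on a fence as inside it is an assumption, since the text does not specify the boundary case.

```python
def fences(q1, q3):
    """Return the inner and outer fences for the given quartiles."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }


def classify(x, q1, q3):
    """Label x as 'extreme outlier', 'mild outlier', or 'not an outlier'."""
    f = fences(q1, q3)
    # Beyond an outer fence: extreme outlier.
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "extreme outlier"
    # Beyond an inner fence but within the outer fences: mild outlier.
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "mild outlier"
    return "not an outlier"


# With Q1 = 10 and Q3 = 20, the IQR is 10, so the inner fences are
# -5 and 35 and the outer fences are -20 and 50.
for value in (20, 36, 51):
    print(value, classify(value, q1=10, q3=20))
```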