
In the comparison of various statistical procedures, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance. This article primarily deals with efficiency of estimators.

The relative efficiency of two procedures is the ratio of their efficiencies, although often this concept is used where the comparison is made between a given procedure and a notional "best possible" procedure. The efficiencies and the relative efficiency of two procedures theoretically depend on the sample size available for the given procedure, but it is often possible to use the asymptotic relative efficiency (defined as the limit of the relative efficiencies as the sample size grows) as the principal comparison measure.

An efficient estimator is characterized by a small variance or mean squared error, indicating that there is only a small deviation between the estimated value and the "true" value.


Efficient estimator

In general, the spread of an estimator around the parameter θ is a measure of estimator efficiency and performance. This performance can be calculated by finding the mean squared error:

Let T be an estimator for the parameter θ. The mean squared error of T is the value $$MSE(T)=E[(T-\theta)^2]$$.

Here, $$MSE(T)=E[(T-\theta)^2]=E[(T-E[T]+E[T]-\theta)^2]=E[(T-E[T])^2]+2(E[T]-\theta)E[T-E[T]]+(E[T]-\theta)^2=Var(T)+(E[T]-\theta)^2,$$ where the middle term vanishes because $$E[T-E[T]]=E[T]-E[T]=0$$. Therefore, an estimator T1 performs better than an estimator T2 if $$MSE(T_1)<MSE(T_2)$$.
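The decomposition above can be checked numerically. The following Monte Carlo sketch simulates a deliberately biased estimator and verifies that its mean squared error equals its variance plus its squared bias; the normal model, the true value θ = 5, and the 0.9 shrinkage factor are all illustrative assumptions, not part of the derivation.

```python
import random
import statistics

random.seed(42)
theta = 5.0        # "true" parameter value (illustrative assumption)
n = 20             # sample size per experiment
trials = 20000     # Monte Carlo repetitions

# A deliberately biased estimator: 0.9 times the sample mean.
estimates = []
for _ in range(trials):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]
    estimates.append(0.9 * statistics.fmean(sample))

mse = statistics.fmean((t - theta) ** 2 for t in estimates)
var = statistics.pvariance(estimates)
bias_sq = (statistics.fmean(estimates) - theta) ** 2

# MSE(T) = Var(T) + (E[T] - theta)^2, up to floating-point error
assert abs(mse - (var + bias_sq)) < 1e-6
```

Note that the identity holds exactly for the empirical quantities (it is an algebraic fact about any set of estimates), so the assertion does not depend on the number of trials.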

For a more specific case, if T1 and T2 are two unbiased estimators for the same parameter θ, then the variance can be compared to determine performance.

T2 is more efficient than T1 if the variance of T2 is smaller than the variance of T1, i.e. $$Var(T_2)<Var(T_1)$$ for all values of θ.

This relationship can be determined by simplifying the more general case above for mean squared error. Since the expected value of an unbiased estimator is equal to the parameter value, $$E[T]=\theta$$.

Therefore, $$MSE(T)=Var(T)$$ as the $$(E[T]-\theta)^2$$ term drops out from being equal to 0.
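For unbiased estimators, then, comparing efficiency reduces to comparing variances. As an illustration (the normal model and the specific parameter values are assumptions), the sample mean and the sample median are both unbiased for the center of a normal distribution, and simulation shows the mean has the smaller variance:

```python
import random
import statistics

random.seed(0)
theta = 10.0       # true center (illustrative assumption)
n = 25
trials = 10000

means, medians = [], []
for _ in range(trials):
    sample = [random.gauss(theta, 3.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Both estimators are unbiased for theta here (the normal distribution
# is symmetric), so comparing variances compares efficiency directly.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
assert var_mean < var_median   # the sample mean is the more efficient estimator
```

For normal data the variance of the sample median is roughly π/2 times that of the sample mean, which the simulated values reflect.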

Efficiency in Statistics
Efficiency in statistics is important because it allows one to compare the performance of various estimators. Although an unbiased estimator is usually favored over a biased one, a more efficient biased estimator can sometimes be more valuable than a less efficient unbiased estimator. For example, this can occur when the values of the biased estimator gather around a number closer to the true value. Thus, estimator performance can be compared easily by examining mean squared errors or variances.
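A classic case of a biased estimator beating an unbiased one on mean squared error is the sample variance that divides by n rather than n − 1: for normal data the biased version has the smaller MSE. A minimal simulation sketch (the normal model and parameter values are assumptions for illustration):

```python
import random
import statistics

random.seed(1)
sigma2 = 4.0       # true variance (illustrative assumption)
n = 10
trials = 50000

unbiased, biased = [], []
for _ in range(trials):
    sample = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    unbiased.append(statistics.variance(sample))    # divides by n - 1 (unbiased)
    biased.append(statistics.pvariance(sample))     # divides by n (biased)

mse_unbiased = statistics.fmean((v - sigma2) ** 2 for v in unbiased)
mse_biased = statistics.fmean((v - sigma2) ** 2 for v in biased)

# The biased estimator has the smaller mean squared error for normal data.
assert mse_biased < mse_unbiased
```

Dividing by n shrinks every estimate toward zero, trading a small bias for a larger reduction in variance, which is exactly the trade-off described above.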

_______________________________________________________________________________________________________________________________________________

Example
A series of hourly temperatures were measured throughout the day in degrees Fahrenheit. The ordered set is: 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

A box plot of the data can be generated by calculating the five relevant values: minimum, maximum, median, first quartile, and third quartile.

The minimum is the smallest number of the set. In this case, the minimum day temperature is 57°F.

The maximum is the largest number of the set. In this case, the maximum day temperature is 81°F.

The median is the "middle" number of the ordered set, meaning that about half of the elements are less than the median and about half are greater. Since this set has an even number of elements, the median is the average of the two middle values, here (70°F + 70°F)/2. The median of this ordered set is therefore 70°F.

The first quartile value is the number that marks one quarter of the ordered set: about 25% of the elements are less than the first quartile and about 75% are greater. The first quartile value can easily be determined by finding the "middle" number between the minimum and the median. For the hourly temperatures, the "middle" number between 57°F and 70°F is 66°F.

The third quartile value is the number that marks three quarters of the ordered set: about 75% of the elements are less than the third quartile and about 25% are greater. The third quartile value can easily be determined by finding the "middle" number between the median and the maximum. For the hourly temperatures, the "middle" number between 70°F and 81°F is 75°F.
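The five values above can be reproduced in a few lines. The quartiles here are computed as the medians of the lower and upper halves of the ordered set, matching the "middle number" approach used in this example (other quartile conventions can give slightly different values):

```python
import statistics

# Hourly temperatures (°F) from the example above, already ordered
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

minimum, maximum = temps[0], temps[-1]
median = statistics.median(temps)

# Quartiles as medians of the lower and upper halves of the ordered set
half = len(temps) // 2
q1 = statistics.median(temps[:half])
q3 = statistics.median(temps[-half:])

assert (minimum, q1, median, q3, maximum) == (57, 66, 70, 75, 81)
```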

The interquartile range, or IQR, can be calculated:

$$IQR = Q3 - Q1 = q_n(0.75) - q_n(0.25)=75^\circ F-66^\circ F=9^\circ F$$

Hence, $$1.5IQR=1.5\times 9^\circ F=13.5^\circ F$$.

1.5IQR above the third quartile is:

$$Q3+1.5IQR=75^\circ F+13.5^\circ F=88.5^\circ F$$.

1.5IQR below the first quartile is:

$$Q1-1.5IQR=66^\circ F-13.5^\circ F=52.5^\circ F$$.

The upper whisker of the box plot is the smaller of two numbers: the maximum or 1.5IQR above the third quartile. Here, the maximum is 81°F and 1.5IQR above the third quartile is 88.5°F. Therefore, the upper whisker is drawn at the value of the maximum, 81°F.

Similarly, the lower whisker of the box plot is the greater of two numbers: the minimum or 1.5IQR below the first quartile. Here, the minimum is 57°F and 1.5IQR below the first quartile is 52.5°F. Therefore, the lower whisker is drawn at the value of the minimum, 57°F.
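The fence and whisker arithmetic above can be collected into a short sketch. It follows the simplified rule stated in this example; a full box plot routine would instead extend each whisker to the most extreme data point inside the fence, which yields the same result for these data.

```python
# Quartiles and extremes from the temperature example above (°F)
q1, q3 = 66.0, 75.0
minimum, maximum = 57, 81

iqr = q3 - q1                   # 9 °F
upper_fence = q3 + 1.5 * iqr    # 88.5 °F
lower_fence = q1 - 1.5 * iqr    # 52.5 °F

# Whiskers per the rule above: the upper whisker is the smaller of the
# maximum and the upper fence; the lower whisker is the larger of the
# minimum and the lower fence.
upper_whisker = min(maximum, upper_fence)
lower_whisker = max(minimum, lower_fence)

assert (lower_fence, upper_fence) == (52.5, 88.5)
assert (lower_whisker, upper_whisker) == (57, 81)
```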

History of the Box Plot
The box and whiskers plot was first introduced in 1970 by John Tukey, but it did not become widely known until his formal publication in 1977. The basic graphic form of the box plot, the range-bar, was established in the early 1950s by Mary Eleanor Spear. Tukey's initial description of the box plot contained 5 components:

 * median
 * two hinges, the upper and lower fourths (quartiles)
 * data values adjacent to the upper and lower fences, which lie 1.5 times the inter-fourth range (the interquartile range) beyond the hinges
 * two whiskers that connect the hinges to the fences, and
 * outliers, individual points lying farther out than the fences