User:Dragons flight/work

Introduction
Current approaches to temperature analysis generally have two key limitations:


 * 1) They fit all data to a grid of latitude and longitude.  Because equally spaced latitude and longitude cells do not represent equal areas on the Earth's surface, this leads to non-uniform treatment of high latitudes relative to low latitudes.  It also causes the weight given to any single station to vary in an ad hoc way depending on the grid size and the number of other stations that happen to fall within the same grid cell.
 * 2) They require that most or all of the data share a common time interval which is used to define a baseline temperature.  For example, all temperatures may be expressed as anomalies relative to the local average temperature from 1960 to 1990.  This adjusts for differences in mean climate at different locations, but it generally limits the analysis to only those records that cover the baseline interval nearly completely.

From the beginning, our approach to temperature analysis is intended to avoid both of these limitations. In particular, we wish to create a system for defining global mean temperature change that is not affected by arbitrary choices of gridding and that allows the maximum amount of data to be used.

In the following document there are a number of details yet to be determined. In some cases, the precise approach will be empirically guided by the structure of the data itself. Where necessary we generally intend to use a small subset of the data (e.g. 5% of stations) to guide development of the algorithms. This should help minimize the impact of any possible design bias on the final result.

Initial Outline
Let us associate the set of stations with the index i, their locations with $$\vec {x_i}$$, and their measured data with $$y_i(t_j)$$, where the index j runs over all possible times.

We will describe the proposed analysis approach by building it up as a series of steps.

To simplify the discussion, we will assume that the large effects of seasonality have already been removed from the data series $$y_i(t_j)$$. This can be accomplished either by using only annual averages in the analysis, or by creating an estimate of mean seasonality at each station location and subtracting it from the data. In principle, the estimate of mean seasonality could instead be introduced directly into the global average estimation, and we may ultimately take that approach; however, doing so would add additional terms to the fits that would make the present discussion harder to follow, so we omit such considerations here.
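
For concreteness, the following is a minimal sketch (in Python with NumPy) of the second option, subtracting an estimated monthly climatology station by station. The array layout and names are illustrative assumptions, not part of the proposal.

```python
import numpy as np

def remove_seasonality(y, month):
    """Subtract each station's mean monthly climatology from its record.

    y     : array of shape (n_stations, n_times), temperatures with NaN for gaps
    month : array of shape (n_times,), calendar month (1..12) of each time step
    """
    anomalies = np.full_like(y, np.nan)
    for m in range(1, 13):
        cols = (month == m)
        # Long-term mean of this calendar month, station by station, ignoring gaps
        climatology = np.nanmean(y[:, cols], axis=1, keepdims=True)
        anomalies[:, cols] = y[:, cols] - climatology
    return anomalies
```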

Let $$\hat y(t_j)$$ be a to-be-determined estimator of the global average temperature anomaly at time $$t_j$$. Further, let $$\bar y_i$$ denote the local mean baseline temperature at station i. These baseline values play the role of relating the difference in mean temperature at one station (for example at the equator) to another station (for example near the poles).

An initial approach to estimating the global average temperature would be to minimize:


 * $$\chi^2 = \sum_i \sum_j (\hat y(t_j) - ( y_i(t_j) - \bar y_i ) )^2$$

This has free parameters $$\hat y(t_j)$$ and $$\bar y_i$$. Here and in what follows we adopt the convention that the sums run only over combinations of i and j for which $$y_i(t_j)$$ exists; in other words, the sum is limited to those combinations of places and times where data are available.

By using the above to determine $$\bar y_i$$ empirically as a property of the fit, we avoid the requirement that stations share a common overlapping interval. The requirement of a common baseline interval has been a major restriction on the data that could be used in the existing temperature analyses done by the Hadley Centre and NOAA. [Note: The fit contains a degeneracy: adding a constant to every $$\hat y(t_j)$$ while subtracting the same constant from every $$\bar y_i$$ leaves $$\chi^2$$ unchanged, so the fit is underdetermined unless an additional constraint is placed on $$\bar y_i$$. A natural choice is to insist that the mean of $$\bar y_i$$ be zero.]
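
As a sketch of how this fit could be carried out, the minimization can be solved by alternating between the two sets of parameters, since each update has a closed form. The code below is only illustrative (names, array layout, and the fixed iteration count are assumptions) and applies the zero-mean constraint at the end.

```python
import numpy as np

def fit_global_anomaly(y, n_iter=50):
    """Alternating least-squares minimization of the unweighted chi^2.

    y : (n_stations, n_times) de-seasonalized temperatures, NaN where no data.
        Assumes every time step has at least one reporting station and every
        station reports at least once.
    Returns (y_hat, y_bar): the global anomaly estimate and station baselines.
    """
    mask = ~np.isnan(y)
    y0 = np.where(mask, y, 0.0)
    y_bar = np.zeros(y.shape[0])              # baselines  \bar y_i
    y_hat = np.zeros(y.shape[1])              # global anomaly  \hat y(t_j)
    for _ in range(n_iter):
        # d chi^2 / d y_hat(t_j) = 0: average of (y_i - y_bar_i) over reporting stations
        y_hat = np.where(mask, y0 - y_bar[:, None], 0.0).sum(axis=0) / mask.sum(axis=0)
        # d chi^2 / d y_bar_i = 0: average of (y_i - y_hat) over times with data
        y_bar = np.where(mask, y0 - y_hat[None, :], 0.0).sum(axis=1) / mask.sum(axis=1)
    # Remove the additive degeneracy by forcing the baselines to average to zero.
    shift = y_bar.mean()
    return y_hat + shift, y_bar - shift
```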

Because this is a linear fit, we could use standard techniques to estimate the uncertainty on each of $$\hat y(t_j)$$ and $$\bar y_i$$, based on the assumption that the $$\chi^2$$ per degree of freedom is unity. For brevity, we omit these considerations, but we note that it is a feature of our approach that an estimate of uncertainty is always available.

Spatial correction
In the above it is readily apparent that the sum weights each station equally. Since many regions of the globe are sampled much more comprehensively than others, the result is a bias towards changes occurring in regions of dense measurement.

To correct for this, we consider the following. Given any choice of a continuously defined spatial weighting function $$w( \vec{a}, \vec{b} )$$, we may define an interpolated temperature field at all locations on the Earth:


 * $$y(\vec{x}, t_j) = {{\sum_i w(\vec{x_i},\vec x) y_i(t_j)} \over {\sum_i w(\vec{x_i},\vec x)}}$$

Here $$\vec x$$ (without a subscript) denotes an arbitrary location on the surface of the Earth.
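
A minimal sketch of evaluating this interpolated field at a set of target locations for a single time step follows; the coordinate convention (latitude/longitude pairs) and the idea of passing the weighting function as a callable are assumptions made only for illustration.

```python
import numpy as np

def interpolate_field(x_targets, x_stations, y_t, weight_fn):
    """Evaluate y(x, t_j) at target locations for one time step t_j.

    x_targets  : (n_targets, 2) latitude/longitude of target locations (degrees)
    x_stations : (n_reporting, 2) locations of stations with data at t_j
    y_t        : (n_reporting,) their values at t_j
    weight_fn  : callable w(a, b) giving the weight between two sets of locations
    """
    # w[k, i] = w(x_i, x_k): influence of station i on target point k
    w = weight_fn(x_stations[None, :, :], x_targets[:, None, :])
    # Weighted average of the reporting stations at each target point
    return (w * y_t[None, :]).sum(axis=1) / w.sum(axis=1)
```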

Given such an interpolated field $$y(\vec{x}, t_j)$$, the natural global average is simply the average over the whole surface of the Earth, which implies an alternative definition of $$\hat y( t_j )$$ as:


 * $$\hat y(t_j) = {{\int y(\vec x, t_j) d\vec x} \over {\int d\vec x}}$$

Exchanging the order of the sum over stations and the integral over the Earth's surface, this can be simplified to:


 * $$\hat y(t_j) = \sum_i W_i(t_j) \, y_i(t_j)$$

where


 * $$W_i(t_j) = {\int {{{w(\vec{x_i},\vec x)} \over {\sum_m {w(\vec{x_m},\vec x)}}} d\vec x} \over {\int d\vec x}}$$

The dependence on j is implicit in the sum, as only stations where data exists at time $$t_j$$ would be considered possible values for m.
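
In practice the area integral defining $$W_i(t_j)$$ would be evaluated numerically. One possible sketch, using Monte Carlo sampling of points uniformly distributed over the sphere (an equal-area quadrature grid would work just as well), is shown below; the sample size and function names are illustrative only.

```python
import numpy as np

def station_area_weights(x_stations, weight_fn, n_samples=20000, seed=0):
    """Estimate W_i(t_j) for the stations reporting at time t_j.

    Approximates the area integral by averaging over points drawn uniformly
    on the sphere.  Returns weights that sum to one.
    """
    rng = np.random.default_rng(seed)
    # Uniform sampling on the sphere: longitude uniform, sine of latitude uniform
    lon = rng.uniform(-180.0, 180.0, n_samples)
    lat = np.degrees(np.arcsin(rng.uniform(-1.0, 1.0, n_samples)))
    x_samples = np.column_stack([lat, lon])

    # w[k, i] = w(x_i, x_k) for sample point k and reporting station i
    w = weight_fn(x_stations[None, :, :], x_samples[:, None, :])
    # Normalize across stations at each sample point, then average over the sphere
    normalized = w / w.sum(axis=1, keepdims=True)
    return normalized.mean(axis=0)
```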

The above shows that $$W_i(t_j)$$ is a natural estimate of the weight each station should have in the global average given its proximity to or isolation from other stations. Hence it is natural to replace the original $$\chi^2$$ with a spatially weighted form by using:


 * $$\chi^2 = \sum_i \sum_j W_i(t_j) (\hat y(t_j) - ( y_i(t_j) - \bar y_i ) )^2$$

This reduces the weight of each station roughly in proportion to the number of other stations reporting from the same region at the same time, so that densely sampled regions do not dominate the average.
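
Relative to the earlier alternating-fit sketch, the only change needed is that each term in the updates is multiplied by $$W_i(t_j)$$; a hedged illustration (same assumed array layout as before) follows.

```python
import numpy as np

def fit_global_anomaly_weighted(y, W, n_iter=50):
    """Alternating minimization of the spatially weighted chi^2.

    y : (n_stations, n_times) de-seasonalized data, NaN where missing
    W : (n_stations, n_times) area weights W_i(t_j), zero where no data
    """
    mask = ~np.isnan(y) & (W > 0)
    y0 = np.where(mask, y, 0.0)
    W0 = np.where(mask, W, 0.0)
    y_bar = np.zeros(y.shape[0])
    y_hat = np.zeros(y.shape[1])
    for _ in range(n_iter):
        # Weighted average of station anomalies at each time step
        y_hat = (W0 * (y0 - y_bar[:, None])).sum(axis=0) / W0.sum(axis=0)
        # Weighted average offset of each station from the global curve
        y_bar = (W0 * (y0 - y_hat[None, :])).sum(axis=1) / W0.sum(axis=1)
    shift = y_bar.mean()
    return y_hat + shift, y_bar - shift
```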

The details of $$W_i(t_j)$$ will depend on the details of $$w( \vec{a}, \vec{b} )$$. The exact choice of a spatial weighting function should be driven by the characteristics of the data and is still under discussion. However, likely choices are such that w depends only on the distance between the two points, i.e.:


 * $$w(\vec{x_i},\vec x) = w(|\vec{x_i} - \vec x|)$$

and is a function that peaks at zero distance and decreases continuously with increasing distance, such as an exponential or Gaussian. The correlation properties of the data as a function of distance will be used to guide the choice of spatial weighting function.
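
As one concrete possibility, a Gaussian function of great-circle distance could serve as the weighting function; the 1200 km scale below is a placeholder only, and the actual scale would be chosen from the observed decay of inter-station correlation with distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(a, b):
    """Great-circle distance between points given as (latitude, longitude) in degrees."""
    lat1, lon1 = np.radians(a[..., 0]), np.radians(a[..., 1])
    lat2, lon2 = np.radians(b[..., 0]), np.radians(b[..., 1])
    # Haversine formula
    h = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

def gaussian_weight(a, b, scale_km=1200.0):
    """w(|a - b|): peaks at zero separation and falls off smoothly with distance."""
    d = great_circle_km(a, b)
    return np.exp(-0.5 * (d / scale_km) ** 2)
```

A function of this form could be passed directly as the weight_fn argument in the interpolation and area-weight sketches above.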

Record Quality
Thus far in the description, the proposed analysis has regarded all data as being of equal quality and reliability. In the real world this is unlikely to be true. This section discusses the first of several proposed steps for dealing empirically with data of varying quality.

First, the quality of individual records can be estimated from their consistency with the overall fit. For example, after an initial fit is produced:


 * $$\epsilon_i^2 = {{\sum_j (\hat y(t_j) - ( y_i(t_j) - \bar y_i ) )^2} \over {\sum_j 1}}$$

This provides an estimate of the mean squared difference between a station's local temperature anomalies and the global anomaly.

Alternatively, one might choose to compare:


 * $$\eta_i^2 = {{\sum_j (y(\vec{x_i}, t_j) - ( y_i(t_j) - \bar y_i ) )^2} \over {\sum_j 1}}$$

This compares a station to the interpolated field evaluated at its own location.

Either or both of these can be used as an estimator of a station's accuracy. We expect that the latter is likely to be more useful given the spatial variability of climate; however, we need to be careful in using it, due to the possibility of weighting isolated stations in inappropriate ways.
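
For illustration, both quantities can be computed from a completed fit as follows; the variable names and the assumption that the interpolated field has already been evaluated at the station locations are placeholders.

```python
import numpy as np

def record_consistency(y, y_hat, y_bar, y_interp):
    """Per-station RMS misfit against the global fit (epsilon_i) and against
    the interpolated field at the station's own location (eta_i).

    y        : (n_stations, n_times) data with NaN gaps
    y_hat    : (n_times,) fitted global anomaly
    y_bar    : (n_stations,) fitted baselines
    y_interp : (n_stations, n_times) interpolated field at each station location
    """
    anom = y - y_bar[:, None]                      # local anomalies, NaN where no data
    eps_i = np.sqrt(np.nanmean((y_hat[None, :] - anom) ** 2, axis=1))
    eta_i = np.sqrt(np.nanmean((y_interp - anom) ** 2, axis=1))
    return eps_i, eta_i
```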

For the moment, we will simply introduce a placeholder and say that there can exist a station-based weighting function $$S(\epsilon_i, \eta_i)$$ such that:


 * $$\chi^2 = \sum_i \sum_j S(\epsilon_i, \eta_i) W_i(t_j) (\hat y(t_j) - ( y_i(t_j) - \bar y_i ) )^2$$

Initially S would be set to unity, but the approach could be iterated to seek convergence. It is also likely that the expressions above may be modified to account for variations in uncertainty over time, especially those due to variations in the number of active stations.
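
One way the iteration could look in practice is sketched below. The specific form chosen for S (a smooth down-weighting based on each station's misfit relative to the median) is purely a placeholder, since the proposal deliberately leaves that choice open.

```python
import numpy as np

def iterate_station_weights(y, W, fit_fn, interp_fn, n_rounds=5):
    """Repeat the weighted fit, down-weighting stations that disagree locally.

    fit_fn    : performs the weighted chi^2 fit given per-term weights,
                returning (y_hat, y_bar)
    interp_fn : evaluates the interpolated anomaly field at the station locations
    """
    S = np.ones(y.shape[0])                        # start with all stations equal
    for _ in range(n_rounds):
        y_hat, y_bar = fit_fn(y, S[:, None] * W)   # weighted fit with current S
        y_interp = interp_fn(y - y_bar[:, None])   # anomaly field at station sites
        eta2 = np.nanmean((y_interp - (y - y_bar[:, None])) ** 2, axis=1)
        # Placeholder form of S: smoothly shrink the weight of noisy records
        S = 1.0 / (1.0 + eta2 / np.nanmedian(eta2))
    return y_hat, y_bar, S
```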

Outliers
In addition to problems with the quality of individual records, there are also likely to be questions about the reliability of individual measurements within otherwise reliable records. In general, when studying mean behavior one can often improve the robustness of a fit by reducing the weight given to the most extreme outliers (e.g. the outer 1% of data). There are several well-accepted approaches for doing this, though we have not settled on a particular one. It is also unclear at this time whether it will be more effective to detect outliers globally or on a local or regional scale. Regardless of the approach, we will test whether removing outliers makes a statistically significant difference at all. Given the abundance of the raw data and the nature of the averaging, it is possible that isolated outliers will prove to have a statistically insignificant effect on the overall construction.
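
As one of the simplest possibilities, outliers could be handled by zeroing the weight of the most extreme residuals of the current fit; the global 1% cutoff below is only illustrative, and a regional version would follow the same pattern.

```python
import numpy as np

def outlier_weights(resid, trim_fraction=0.01):
    """Zero the weight of the most extreme residuals (the outer 1% by default).

    resid : (n_stations, n_times) residuals of the current fit, NaN where no data
    """
    cutoff = np.nanquantile(np.abs(resid), 1.0 - trim_fraction)
    # Points beyond the cutoff (and missing points) get zero weight
    return np.where(np.abs(resid) <= cutoff, 1.0, 0.0)
```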

Record discontinuities
Baseline discontinuities are quite common in weather records. Such effects may be caused by changes in station location, instrument type, or changes in the micro-environment of the station, such as nearby construction. In general, our approach to record discontinuities is simple: when a discontinuity is found, we will break the record into two fragments that are fit and parameterized independently. This will be done by default whenever the station metadata indicates a station move.

In other cases we will have to identify such breaks empirically. Qualitatively, the natural approach is to detect statistically significant abrupt changes in the mean state of a record. This can be done by looking at a single record in isolation, but due to interannual weather variability, a test is likely to be more sensitive when looking at differences between a record and its neighbors. This could be done either with pairwise comparisons between neighboring records (as is currently standard in some temperature analyses) or by comparing the record to the interpolated field at the same location, e.g. $$y(\vec{x_i}, t_j)$$.

We will consider several approaches for detecting abrupt record discontinuities and test their effectiveness.
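
As a hedged illustration of the second option, a simple single-breakpoint scan over the difference between a record and the interpolated field at its location might look like the following; the minimum segment length and the use of a plain two-sample statistic are assumptions, not a settled choice.

```python
import numpy as np

def best_breakpoint(diff, min_seg=24):
    """Scan a difference series for the most significant single step change.

    diff    : 1-D array (NaNs allowed) of y_i(t_j) minus the interpolated field
              at the station's location
    min_seg : minimum number of valid points required on each side of a break
    Returns (index, statistic) for the strongest candidate break.
    """
    valid = np.where(~np.isnan(diff))[0]
    best = (None, 0.0)
    for split in valid[min_seg:-min_seg]:
        left = diff[valid[valid < split]]
        right = diff[valid[valid >= split]]
        # Two-sample t-like statistic for a shift in the mean level
        se = np.sqrt(left.var(ddof=1) / left.size + right.var(ddof=1) / right.size)
        t = abs(left.mean() - right.mean()) / se if se > 0 else 0.0
        if t > best[1]:
            best = (split, t)
    return best
```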