User:Somethingforstats/sandbox


 * For homogeneity of variance see homoscedasticity.

In statistics, homogeneity and its opposite, heterogeneity, describe the properties of datasets. Homogeneity implies similarity, while heterogeneity refers to disparity. It is often convenient to assume homogeneity, where statistical properties throughout the overall dataset remain similar. In meta-analysis, which combines the data from several studies, homogeneity measures the differences or similarities between the various studies (see also Study heterogeneity).

Homogeneity can be studied at several levels of complexity and applies to all aspects of a statistical distribution, including the location parameter and the variability of observations throughout a dataset. More detailed studies examine changes to the marginal distribution, skewness, or joint distributions.

The concept of homogeneity can be applied in many different ways. For certain types of statistical analysis, it is used to look for further properties that might need to be treated as varying within a dataset, once some initial types of non-homogeneity have been addressed.

Regression
Differences in the response values across a dataset might initially be dealt with through a regression model, which uses explanatory variables to account for variation in the response value. Homogeneity and heterogeneity of variance in regression models are typically referred to as homoscedasticity and heteroscedasticity, respectively.



It is important to examine whether the prediction errors behave in the same way across the dataset, that is, whether the distribution of the residuals remains homogeneous as the explanatory variables change. If the errors are not homogeneous, there may be an underlying trend or confounding variable that the model has not properly accounted for. See regression analysis.
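As a rough illustration of checking residuals for homogeneity, the sketch below fits a straight line by least squares and compares the mean squared residual in the upper half of the explanatory variable with that in the lower half. The data are synthetic and purely hypothetical, and the function names `fit_line` and `residual_spread_ratio` are invented for this example. It is a crude diagnostic under these assumptions, not a formal test.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_spread_ratio(x, y):
    """Mean squared residual for the larger x values divided by
    the mean squared residual for the smaller x values."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order the observations by x
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(residuals) // 2
    low = sum(r ** 2 for r in residuals[:half]) / half
    high = sum(r ** 2 for r in residuals[half:]) / (len(residuals) - half)
    return high / low


# Hypothetical data: the size of the error grows in proportion to x,
# so the residuals are deliberately not homogeneous.
x = list(range(1, 21))
y = [2.0 * xi + (0.1 * xi if xi % 2 == 0 else -0.1 * xi) for xi in x]

ratio = residual_spread_ratio(x, y)
print(f"spread ratio (upper half / lower half): {ratio:.2f}")
```

A ratio close to 1 is what homogeneous residuals would tend to produce; a ratio far from 1, as in this constructed example, suggests that the spread of the errors changes with the explanatory variable.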

Time series
The initial stages in the analysis of a time series may involve plotting values against time to examine homogeneity of the series in various ways, such as investigating whether or not there is stability across time as opposed to a trend.

Homogeneity across data sources
In many studies, the data being analyzed are collected from multiple sources. Because a specific source may itself influence the data, it is necessary to examine homogeneity across sources.

For example, in hydrology, data series from a number of river-flow sites are analyzed. A common model assumes that the distributions of these values are the same for all sites apart from a simple scaling factor, which links location to scale. The question is then whether the distribution of the scaled values is homogeneous across sites.

In meteorology, weather datasets are acquired over many years, and measurements from stations can be inconsistent. In a given year, some stations can be expected to stop taking measurements while others start. The underlying weather phenomenon therefore has to be distinguished from possible heterogeneity across the stations: when the station measurements are combined, it must be considered whether or not the records are homogeneous over time. An example of homogeneity testing of wind speed and direction data can be found in Romanić et al., 2015.

Homogeneity within populations
Simple population surveys may start from the idea that responses will be homogeneous across the whole of a population. Assessing the homogeneity of the population involves looking to see whether the responses of certain identifiable subpopulations differ from those of others. For example, car-owners may differ from non-car-owners, or there may be differences between various age-groups.

Tests
A test for homogeneity, in the sense of exact equivalence of statistical distributions, can be based on an E-statistic. A location test tests the simpler hypothesis that distributions have the same location parameter. For example, a hypothesis test could be used to examine whether the distributions for different datasets are the same; in other words, whether the combined data would be homogeneous. An example of a chi-squared test for categorical data follows, examining whether or not male and female college students have the same distribution of allergies. Suppose that we randomly select 220 male college students and 217 female college students, and ask about their most dominant allergy: Pollen, Animals, Food, Other.

Our hypotheses are:


 * H0: The distribution of allergies for male college students is the same as the distribution of allergies for female college students


 * H1: The distribution of allergies for male college students is not the same as the distribution of allergies for female college students

The results of our survey of the college students are shown in the table below:

Our chi-squared statistic is 12.9128 which, with (2 − 1) × (4 − 1) = 3 degrees of freedom, gives a p-value of 0.004829. This is statistically significant at the 0.05 level. As such, we would reject the null hypothesis (H0) in favor of the alternative hypothesis (H1), and conclude that the distributions of allergies for male and female college students are not the same.
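The calculation behind such a test can be sketched in a few lines of Python using only the standard library. The function names `chi2_statistic` and `chi2_sf` are invented for this example, and the small table of counts is hypothetical; it is not the survey data. The last lines take the statistic quoted above (12.9128) and recompute its p-value for 3 degrees of freedom.

```python
import math


def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    table of counts (rows = groups, columns = categories)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all groups shared one distribution.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution with a
    positive integer number of degrees of freedom (closed form)."""
    z = x / 2.0
    if df % 2 == 0:
        return math.exp(-z) * sum(
            z ** k / math.factorial(k) for k in range(df // 2)
        )
    tail = math.erfc(math.sqrt(z))
    tail += math.exp(-z) * sum(
        z ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return tail


# Hypothetical 2 x 2 table of counts, used only to show the call.
hypothetical_counts = [[10, 20], [20, 10]]
stat, df = chi2_statistic(hypothetical_counts)
print(f"hypothetical table: statistic = {stat:.4f}, df = {df}")

# Statistic quoted in the text: 2 groups x 4 categories gives df = 3.
p_value = chi2_sf(12.9128, 3)
print(f"p-value for 12.9128 with 3 df: {p_value:.6f}")  # about 0.0048
```

In practice a statistics library would normally be used for the tail probability; the closed form here is only meant to keep the sketch self-contained.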