Wikipedia:Reference desk/Archives/Mathematics/2020 September 3

= September 3 =

MLM - weight or random effect
A comprehension question:

Assume I have a mixed linear model of the form

count ~ date + (1|site)

(specifics not too important here) I also know that the individual counts were taken by different survey methods which have different degrees of reliability. I want to integrate the information about survey methods into the model. AFAIK this could be done either in the form of weights or as another random effect. To my understanding, the choice between these approaches would depend on what I know/expect the impact of difference in survey method to be:
 * - if survey method affects the precision of the count, then I should define class weights for each method and weight the counts with these, because I want the less precise counts to have less impact on the model.
 * - if survey method affects the magnitude of the count (i.e. a constant bias), then I should include method as another random effect in the model, because I want the intercept to be able to shift based on method.

Does that sound reasonable? -- Elmidae (talk · contribs) 17:54, 3 September 2020 (UTC)


 * Can you explain your notation a bit? Does $count$ stand for a dependent variable and $date$ for an independent variable? Does the tilde here stand for "is proportional to"? Further, what is the meaning of the term $(1|site)$? I am not familiar with this notation. I assume it stands for the random component of the model, but presumably it is an instance of a more general term $(r|s)$. Does it represent a choice between two models, where $site$ stands for a constant bias depending on the survey method? The latter should not be modelled as a random component, and even when introducing such biases in the model, you'll still need a random component, so this is not either–or. Introducing a constant bias that depends on the survey method is theoretically possible, but only practical if there are a fair number of observations for each method and it is not the case that the values of the independent variable cluster at one end of the scale for some and at the other end for some others, but are reasonably spread out for each participating method.  --Lambiam 20:55, 3 September 2020 (UTC)
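 * To illustrate the notation being asked about: `count ~ date + (1|site)` is R/lme4 model-formula syntax, and `(1|site)` requests a random intercept for each level of `site` (not a choice between models). A minimal Python sketch of the same structure, on invented synthetic data; note that statsmodels' `mixedlm` takes the grouping factor as a separate `groups` argument rather than inside the formula:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 'count' trends with 'date', plus a random per-site
# shift -- mimicking the lme4 formula  count ~ date + (1|site).
rng = np.random.default_rng(0)
n_sites, n_obs = 8, 40
site = np.repeat(np.arange(n_sites), n_obs)
date = np.tile(np.linspace(0.0, 10.0, n_obs), n_sites)
site_effect = rng.normal(0.0, 2.0, n_sites)[site]   # random intercepts
count = 5.0 + 0.8 * date + site_effect + rng.normal(0.0, 1.0, site.size)

df = pd.DataFrame({"count": count, "date": date, "site": site})

# (1|site) == a random intercept for each level of 'site';
# in statsmodels this is the 'groups' argument of mixedlm.
model = smf.mixedlm("count ~ date", df, groups=df["site"])
fit = model.fit()
print(fit.params["date"])   # fixed-effect slope; true value is 0.8
```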


 * Okay, just disregard the model notation, and the parameter names; it was merely a simplified example and is quite beside the point. I have a mixed effects model with a number of fixed and random effects, and am wondering whether to account for an as yet unmodeled, further source of variation in the dependent variable by way of weighting or by way of introducing a further random effect. My reading of the above is that if levels of that source correlate with a bias, modeling it as a random effect makes sense if there is a large number of observations and a reasonable spread of values. What about if levels of the source correlate with degrees of random variance rather than bias? My take is that that would make more sense as a weight, similar to the common approach of weighting by 1/variance. (I don't know yet of what type the impact of that source is, still missing some data.)
 * I guess my underlying conceptual problem might be that "weighting by inverse of variance" is something that I'd almost always do when modeling repeated measures vs some possible driver, so thinking of differences in survey method as indicative of likely variance in the observations (even if I don't have any measure of that variance) makes me want to use it as a weight. But then the standard way of accounting for drivers of unknown function in a MLM would be to model them as a random factor. So I'm trying to figure out how to select between these approaches. -- Elmidae (talk · contribs) 22:28, 3 September 2020 (UTC)
 * If you have sufficient sample numbers to estimate model parameters reliably, the most robust approach would be to assume that both mean and variance will change with survey method and estimate both as part of model inference. If the methods are very different, it may be better to estimate the model for each method separately and perform a meta-analysis over the models. -- 22:39, 3 September 2020 (UTC)
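 * A minimal numpy sketch of the split-and-combine suggestion, assuming the "meta-analysis over the models" is a fixed-effect (inverse-variance) pooling of the per-method slope estimates; the data, noise levels, and true slope are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_slope(x, y):
    """OLS fit of y = a + b*x; returns the slope and its sampling variance."""
    X = np.column_stack([np.ones_like(x), x])
    beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, p = X.shape
    sigma2 = res[0] / (n - p)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)  # coefficient covariance matrix
    return beta[1], cov[1, 1]

# Three survey methods with different noise levels but the same true trend.
true_slope = 0.5
slopes, variances = [], []
for noise_sd in (0.5, 1.0, 2.0):
    x = rng.uniform(0, 10, 200)
    y = 2.0 + true_slope * x + rng.normal(0, noise_sd, x.size)
    b, v = fit_slope(x, y)
    slopes.append(b)
    variances.append(v)

# Fixed-effect meta-analysis: inverse-variance weighted mean of the slopes,
# so the noisier methods contribute less to the pooled estimate.
w = 1.0 / np.asarray(variances)
pooled = np.sum(w * np.asarray(slopes)) / np.sum(w)
print(pooled)   # close to the true slope 0.5
```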


 * (ec – I think the following may be the same basic idea in more detail, where all methods share the trend coefficient):
 * Assuming that there are $m$ survey methods, a "rich" model could be one with $m + 1$ linear coefficients (to be estimated), also $m + 1$ independent variables, $m$ weights, and one dependent variable. Writing $y$ for the latter and $x$ for the main independent variable, the model looks like
 * $y = c_{0}x + \sum_{k=1}^{m} c_{k}b_{k} + e$, taken with weight $w$,
 * where the $c_{k}$ stand for the coefficients, and the independent variables $b_{k}$ are all either $1$ or $0$. Furthermore, $e$ represents the random-error term and $w$ a weight. The coefficients are supposed to be estimated using the standard weighted least squares method (WLS), so the equation above stands schematically for as many equations as there are observations. Making this explicit, the $i$-th observation gives rise to an equation
 * $y_{i} = c_{0}x_{i} + \sum_{k=1}^{m} c_{k}b_{k,i} + e_{i}$, with weight $w_{i}$.
 * WLS should then minimize the value of $\sum_{i} w_{i}e_{i}^{2}$.
 * Each of these observations is associated with a survey method. The weight $w_{i}$ is a (predetermined) weight for the associated method. If these methods are numbered $1, \ldots, m$, and the method associated with the $i$-th observation is $q$, then $b_{k,i} = 1$ if $k = q$, and $b_{k,i} = 0$ otherwise. The interpretation of $c_{0}$ is that it is the trend of the regression line, while $c_{k}$ is the (constant) bias associated with method $k$. (Note that there is no separate coefficient for the y-axis intercept.)
 * I used the term "rich" because of the large number of unknowns. If there are too many of these compared to the number of observations, we get into overfitting territory. The number of unknowns may perhaps be reduced by pooling some methods together. Finally, if necessary, the weights can be determined with a bootstrap scheme. Solve the system with any not totally unreasonable guess for these weights. Isolate the set of residuals $e_{i}$ for each method and determine its variance. Use the reciprocals of these method variances for the weights in a following run; this should converge sufficiently in two to three runs. --Lambiam 22:56, 3 September 2020 (UTC)
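 * A numpy sketch of the weighted dummy-variable model and reweighting scheme described above, under the assumption that the weights attach to the equations and are updated from per-method residual variances; the data, dimensions, and true parameter values are all invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: m = 3 methods, shared trend c0, per-method bias c_k,
# and per-method error variance.
m, n = 3, 300
method = rng.integers(0, m, n)                 # method q for observation i
x = rng.uniform(0, 10, n)
c0_true, bias_true = 0.7, np.array([1.0, 3.0, -2.0])
noise_sd = np.array([0.5, 1.0, 2.0])[method]
y = c0_true * x + bias_true[method] + rng.normal(0, 1, n) * noise_sd

# Design matrix [x, b_1 .. b_m]: b_{k,i} = 1 iff observation i used method k.
# No separate intercept column -- the method dummies absorb it.
B = np.eye(m)[method]
X = np.column_stack([x, B])

def wls(X, y, w):
    """Weighted least squares: minimizes sum_i w_i * e_i**2."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Bootstrap scheme for the weights: start from equal weights, then
# reweight by the reciprocal of each method's residual variance.
w = np.ones(n)
for _ in range(3):
    beta = wls(X, y, w)
    resid = y - X @ beta
    method_var = np.array([resid[method == k].var() for k in range(m)])
    w = 1.0 / method_var[method]

print(beta[0])    # estimated trend c0
print(beta[1:])   # estimated per-method biases c_1 .. c_m
```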
 * OK, lemme digest that for a bit... -- Elmidae (talk · contribs) 02:36, 4 September 2020 (UTC)