Wikipedia:Reference desk/Archives/Mathematics/2012 October 28

= October 28 =

Zero ring equivalence
Given a ring $$R$$ and an element $$a \in R$$, show that $$R[x]/(ax-1)$$ is the zero ring if and only if $$a$$ is nilpotent.--AnalysisAlgebra (talk) 02:06, 28 October 2012 (UTC)

You're basically introducing a new element $$a^{-1}$$, aren't you? This means $$a^n a^{-n} = 1$$ for all $$n$$. However, if $$a$$ is nilpotent then there exists $$n$$ such that $$a^n = 0$$. Since anything multiplied by $$0$$ is $$0$$, this means $$1 = a^na^{-n} = 0 \cdot a^{-n} = 0$$. $$1 = 0$$ is a sufficient condition for the ring being the zero ring!--AnalysisAlgebra (talk) 04:11, 28 October 2012 (UTC)

It's a necessary condition too isn't it? Does that prove the "only if" part in the "if and only if"?--AnalysisAlgebra (talk) 04:13, 28 October 2012 (UTC)


 * I'm not strong in this area, but I can see a few problems with your approach. The polynomial ring $$R[x]$$ (and hence the quotient ring $$R[x]/(ax-1)$$) is very different from the ring $$R$$. And you are not introducing an element $$a^{-1}$$; it either already exists in $$R$$ or it doesn't. Another point: be clear whether by "zero ring" is meant a zero ring (a ring in which every product is zero) or the trivial ring (the ring with one element). — Quondum 06:58, 28 October 2012 (UTC)
 * AnalysisAlgebra means $$x$$ in $$R[x]/(ax-1)$$ will be "a new element $$a^{-1}$$", and if you want, you can replace all his $$a^{-1}$$s with $$x$$s; his proof is correct. But it doesn't show the "only if". For that, consider what it means for the quotient ring to be the trivial ring (which is what is meant). What would have to be in the ideal you are quotienting by? And what would that imply about $$a$$, after a little work? John Z (talk) 09:40, 28 October 2012 (UTC)
 * That would imply $$R[x] = (ax-1)$$ wouldn't it?--AnalysisAlgebra (talk) 10:49, 28 October 2012 (UTC)
 * I think for every polynomial $$P$$, $$P(x)(ax-1) \in (ax-1)$$. How do you now exploit $$R[x] = (ax-1)$$, which is the only other information you have?--AnalysisAlgebra (talk) 10:52, 28 October 2012 (UTC)
 * In particular, $$1 \in (ax-1)$$, so $$P(x)(ax-1) = 1$$ for some polynomial $$P$$. Now analyse the coefficients of $$P$$.--80.109.106.49 (talk) 11:03, 28 October 2012 (UTC)
 * I'VE GOT IT! In that case, you must have $$P(x) = -\sum_{k=0}^\infty a^kx^k$$ and this is a finite polynomial only if $$a$$ is nilpotent! M-M-M-M-MONSTER KILL!!!--AnalysisAlgebra (talk) 11:52, 28 October 2012 (UTC)
 * Oh and thanks for all your help by the way.--AnalysisAlgebra (talk) 11:54, 28 October 2012 (UTC)
 * Hey, wait a minute - why is $$1 \in (ax-1)$$?--AnalysisAlgebra (talk) 12:10, 28 October 2012 (UTC)
 * Because $$R[x] = (ax-1)$$, as you said above, and $$1 \in R[x]$$. — Preceding unsigned comment added by 80.109.106.49 (talk) 17:23, 28 October 2012 (UTC)
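To make the "if" direction concrete, here is a quick numerical sketch (my own illustration, not part of the thread): in $$R = \mathbb{Z}/8\mathbb{Z}$$ the element $$a = 2$$ is nilpotent ($$2^3 = 0$$), and the finite polynomial $$P(x) = -(1 + 2x + 4x^2) = -\sum_{k=0}^{2} a^k x^k$$ satisfies $$P(x)(2x-1) = 1$$, so $$(ax-1) = R[x]$$ and the quotient is trivial.

```python
# Sketch (my own check): verify P(x)(2x - 1) = 1 in (Z/8Z)[x], where a = 2
# is nilpotent (2^3 = 0 mod 8) and P(x) = -(1 + 2x + 4x^2).

def poly_mul_mod(p, q, m):
    """Multiply two coefficient lists (lowest degree first), reducing mod m."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] = (out[i + j] + a * b) % m
    return out

m = 8
P = [(-c) % m for c in (1, 2, 4)]  # coefficients of -(1 + 2x + 4x^2)
g = [(-1) % m, 2]                  # coefficients of 2x - 1

print(poly_mul_mod(P, g, m))       # [1, 0, 0, 0], i.e. the constant polynomial 1
```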

Bayesian estimation of variance vs statistical unbiasedness
I have recently been thinking about the estimation of an unknown variance from a Normal distribution.

As is well known, maximum likelihood gives an estimate
 * $$\hat{\sigma^2}_{ML} = \frac{S^2}{n}$$

where $$S^2 = \sum_i (x_i - \mu)^2$$ and $$n$$ is the number of observed data points.

This is unbiased in the statistical sense -- i.e. for any true value of $$\sigma^2$$, the expected value of the estimator is also $$\sigma^2$$ -- if the mean of the underlying distribution is known.

If the mean is not known, it is standard to apply Bessel's correction, to give
 * $$\hat{\sigma^2}_{pop} = \frac{S^2}{n-1}$$

which is an unbiased estimator, as derived as an example in the Bias of an estimator article.

If it is an unbiased estimate of the standard deviation we're interested in, given unknown mean, there is a slightly different correction that can be made,
 * $${\hat\sigma}_{pop} \simeq \sqrt{ \frac{1}{n-1.5} \sum_{i=1}^n(x_i - \bar{x})^2}$$

though in practice pretty much nobody bothers with this, and just uses the square root of the unbiased estimator of population variance instead.

So far, so standard. But then I turned to Bayesian estimation of $$\sigma^2$$ and $$\sigma$$.

A very standard prior to use is the Jeffreys prior for the problem, which places an (improper, but that may not be a problem) rescaling-invariant flat distribution on $$\ln \sigma$$, $$\scriptstyle{p(\ln \sigma|I) \; \propto \; 1}$$, which corresponds to a distribution $$\scriptstyle{p(\sigma^2|I)\; \propto \; 1/\sigma^2}$$, or $$\scriptstyle{p(\sigma|I) \; \propto \; 1/\sigma}$$.
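For completeness (a standard change-of-variables step, my addition), the equivalence of these three forms of the Jeffreys prior follows from the usual Jacobian rule $$p(u) = p(v)\,\left|dv/du\right|$$:

```latex
% From the flat prior on ln(sigma):
p(\sigma \mid I) = p(\ln\sigma \mid I)\,\left|\frac{d\ln\sigma}{d\sigma}\right|
                 \propto \frac{1}{\sigma},
\qquad
p(\sigma^2 \mid I) = p(\sigma \mid I)\,\left|\frac{d\sigma}{d(\sigma^2)}\right|
                   = \frac{1}{\sigma}\cdot\frac{1}{2\sigma}
                   \propto \frac{1}{\sigma^2}.
```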

If the mean $$\mu$$ is known, this leads to a posterior probability for $$\sigma^2$$ given by a scaled inverse chi-squared distribution, giving a posterior expectation value of
 * $$\langle{\sigma^2}_{Bayes}\rangle = \frac{S^2}{n-2}$$

If the mean is not known, marginalising over the joint distribution of $$\mu$$ and $$\sigma^2$$ gives another scaled inverse chi-squared distribution, with a posterior expectation value for $$\sigma^2$$ of
 * $$\langle{\sigma^2}_{pop-Bayes}\rangle = \frac{S^2}{n-3}$$

The posterior probability for $$\sigma$$ follows a scaled inverse chi distribution; comparing it with the more general generalized gamma distribution, we find that for $$\mu$$ known the posterior expectation value is
 * $$\langle{\sigma}_{Bayes}\rangle = \frac{S}{\sqrt{2}} \; \frac{\Gamma((n-1)/2)}{\Gamma(n/2)}$$

which, for all but the smallest n, converges rapidly towards
 * $$\langle{\sigma}_{Bayes}\rangle \simeq \sqrt{ \frac{S^2}{n-1.5}}$$

the same formula as previously found -- but for the unbiased frequentist population estimator, i.e. the frequentist estimator with unknown mean.
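That convergence is easy to check numerically (stdlib only; this snippet is my own illustration): the claim amounts to $$\sqrt{2}\,\Gamma(n/2)/\Gamma((n-1)/2) \to \sqrt{n-1.5}$$, which is what makes the two formulas agree.

```python
# Sketch: compare sqrt(2) * Gamma(n/2) / Gamma((n-1)/2) with sqrt(n - 1.5),
# i.e. check that <sigma>_Bayes = (S/sqrt(2)) Gamma((n-1)/2)/Gamma(n/2)
# rapidly approaches sqrt(S^2 / (n - 1.5)).
import math

for n in (5, 10, 30, 100):
    exact = math.sqrt(2) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
    approx = math.sqrt(n - 1.5)
    print(n, round(exact, 5), round(approx, 5))  # agreement improves quickly with n
```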

For $$\mu$$ not known in the Bayesian case, we get
 * $$\begin{align} \langle{\sigma}_{pop-Bayes}\rangle & = \frac{S}{\sqrt{2}} \; \frac{\Gamma((n-2)/2)}{\Gamma((n-1)/2)} \\ & \simeq \sqrt{ \frac{S^2}{n-2.5}} \end{align}$$

The Bayesian posterior expectation values for $$\sigma^2$$ and for $$\sigma$$, both for $$\mu$$ known and $$\mu$$ unknown, are therefore quite different from the Frequentist "unbiased" estimators.

Question: Is it cause for concern that the Bayesian estimators are not unbiased in the Frequentist sense (even given a supposedly uninformative prior)?

Is this just par for the course for a Bayesian inference, which in general cares little about such things? Or does it suggest the prior is not all it should be? (Or have I got my sums wrong?) Jheald (talk) 09:18, 28 October 2012 (UTC)
 * I can confirm your results for the case of known mean.
 * I don't think it's a cause for concern. The Jeffreys prior is just too wide and wild; the variance estimator doesn't even have a mean with two or fewer data points. Expecting its mean to equal the parameter with more data points is just too much. (The same could be said for the mean estimator, but for it the prior is at least uniform in linear scale.)
 * In practice we usually know something (however faint) a priori, so a non-informative prior is not necessarily the best choice. I think the frequentist approach implicitly assumes tame behavior of the parameter. -- Meni Rosenfeld (talk) 13:57, 28 October 2012 (UTC)


 * I don't think it's right to say that "the frequentist approach implicitly assumes tame behavior of the parameter" -- the frequentist approach is intended to work whatever the value of the parameter, from epsilon above zero right through to 1/epsilon.


 * On the other hand you're right that many practising Bayesians, particularly if they perhaps are going to be doing automated Bayes factor model comparison, will try to do all they can to make their priors as realistic as possible, throwing in everything (they think) they know about the problem. And of course, once you've moved to informative priors then the game is very different -- the Bayesian isn't interested in any "unbiasedness" that doesn't take any of that additional information into account; what the Bayesian is interested in is building the best book of odds that they can, so that if an unbiased frequentist does wander into their betting shop, they think they will take them to the cleaners.


 * But if we exclude that additional information, and stay with a prior that is at least supposed to be uninformative, I still find it very striking that the Bayesian can apparently choose an expectation value for $$\sigma^2$$ that will be higher than the Frequentist's "unbiased" estimator in every case -- and expect to be consistently nearer, in a least squares sense, to the truth. Is this correct? Is the Frequentist approach of trying to make $$\scriptstyle{E(\hat{\theta}|\theta) = \theta}$$ profoundly misconceived, compared to the Bayesian choosing $$\scriptstyle{\hat{\theta}}$$ such that $$\scriptstyle{E(\theta|\hat{\theta}) = \hat{\theta}}$$? And yet, as your response above indicates, we do tend to assume there is something right about the Frequentist objective, and $$\scriptstyle{\hat{\sigma}_{N-1}}$$; even when a principled Bayesian calculation leads to a higher estimate for every single dataset.


 * One method or the other, it seems to me, has to face off worse out of this. Jheald (talk) 15:56, 28 October 2012 (UTC)


 * There are also some Bayesian books I have had a quick look at, and there does seem to be a reluctance to "call out" the $$\scriptstyle{\hat{\sigma}_{N-1}}$$ estimator, or to shine too bright a light on it. (I'll come back and edit in page-refs later; and apologies if I've missed something in quite a cursory flick through.)


 * David MacKay (2003), Information Theory, Inference, and Learning Algorithms finesses the issue (p. 320) by considering the distribution for $$p(\ln \sigma)$$, which he shows has its modal value at $$\sigma = \sqrt{S^2/N}$$ for the case where $$\mu$$ is known, and at $$\sigma = \sqrt{S^2/(N-1)}$$ for the distribution obtained after marginalising out $$\mu$$ when $$\mu$$ is not known. This closely reflects the approach he took in his early work on neural nets, which grew out of similar work in the radioastronomy analysis group in Cambridge. But it's far from clear why the value which maximises $$p(\ln \sigma)$$ is the most appropriate point-estimate for $$\sigma^2$$ or $$\sigma$$.


 * E.T. Jaynes (fragmentary 1994 draft), Probability Theory: the Logic of Science included in passing $$s/\sqrt{N-3}$$ as the predicted standard deviation of the estimated mean (eqn 7-23, p. 7.11), "as we will see in chapter 20, Estimation with a Gaussian Distribution". This seems to have been dropped from the 2003 edited published version of the book. It would have been around page 210. At that time Chapter 20 hadn't been completed, so it seems this is another part of the book that never got finalised on paper -- the material doesn't seem to be in the 2003 text. Though he does have two really quite polemical chapters against "unbiasedness", describing the cult of it as a pathology.


 * Box and Tiao (1973), Bayesian Inference in Statistical Analysis consider estimation of the parameters of a normal distribution in detail, but their focus is on finding "highest probability density" (HPD) regions -- which they note will not be invariant under reparametrisation. Their recommendation is to consider the HPD region under a transformation that most nearly makes the prior flat; though I imagine others might be most concerned about loss probabilities.  As far as I can see, Box and Tiao do not mention or even allude to $$\scriptstyle{\hat{\sigma}_{N-1}}$$.


 * Gelman et al (1995, 2e 2004), Bayesian Data Analysis also consider the estimation in some detail, giving the form both of the probabilities for uninformative priors, and of update rules for conjugate informative priors. But again, there seems to be no discussion of or allusion to a frequentist estimator like $$\scriptstyle{\hat{\sigma}_{N-1}}$$.


 * Finally, Harold Jeffreys (1939, 3e 1961), Theory of Probability also considers Normal distribution parameter estimation, though again what is discussed seems to be more the form of the distribution, rather than any point estimator, or comparison with a popular estimator like$$\scriptstyle{\hat{\sigma}_{N-1}}$$.


 * All of the above take $$p(\sigma) \propto 1/\sigma$$ for their uninformative distribution.


 * So does anybody know of any literature that tackles this particular issue more directly? Jheald (talk) 16:52, 28 October 2012 (UTC)

Update. After a fair amount more thought and more reading, I have found an answer which I think feels satisfying. I have updated the article Bias of an estimator, here Jheald (talk) 16:28, 11 November 2012 (UTC)
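As a sanity check on the distinction discussed above (my own simulation sketch, not part of the original thread), a quick Monte Carlo with unknown mean shows the Bessel-corrected estimator averaging to $$\sigma^2$$, while the Jeffreys-posterior mean $$S^2/(n-3)$$ averages to $$\sigma^2\,(n-1)/(n-3)$$, as the algebra predicts.

```python
# Sketch: frequentist bias of S^2/(n-1) vs S^2/(n-3) when the mean is unknown.
# With n = 10 and true sigma^2 = 4, E[S^2/(n-1)] = 4 exactly, while
# E[S^2/(n-3)] = 4 * (n-1)/(n-3) = 36/7, about 5.14.
import random

random.seed(0)
n, sigma2, trials = 10, 4.0, 20000
sum_freq = sum_bayes = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    S2 = sum((x - xbar) ** 2 for x in xs)
    sum_freq += S2 / (n - 1)   # Bessel-corrected (frequentist-unbiased)
    sum_bayes += S2 / (n - 3)  # posterior mean under the Jeffreys prior
print(sum_freq / trials, sum_bayes / trials)
```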

Product of ideals is NOT an ideal ?
Given two ideals $$I$$ and $$J$$, why is their direct product $$\{ab \mid a\in I, b\in J\}$$ NOT necessarily an ideal? If you have an element in that set and an element $$r$$ in the ring, don't you have $$(ab)r = a(br)$$, which is again the product of an element $$a$$ in $$I$$ and an element $$br \in J$$ (since $$J$$ is an ideal)? Maybe you get problems if the multiplication is not commutative, but apparently this is still a problem when it is.--AnalysisAlgebra (talk) 12:33, 28 October 2012 (UTC)

OH. That's not the problem; the problem is that it is not closed under addition. WHOOPS.--AnalysisAlgebra (talk) 12:38, 28 October 2012 (UTC)
 * That's exactly it. I remember making the same mistake when I first encountered the idea (of products of ideals in general, not necessarily `direct' products). You do need to close under addition in the ring -- whenever I see the notation 'IJ', it usually means finite sums of something in I times something in J (apologies for stating that sloppily). Icthyos (talk) 13:28, 28 October 2012 (UTC)
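A standard concrete counterexample (my addition, matching Icthyos's point about closure under addition): in $$k[x,y,z]$$, take $$I = (x,z)$$ and $$J = (y,z)$$. Then

```latex
xy + z^2 \;\in\; IJ \quad\text{(since } xy = x \cdot y \text{ and } z^2 = z \cdot z\text{),}
\qquad\text{but}\quad
xy + z^2 \;\neq\; ab \ \text{ for any } a \in I,\ b \in J,
```

because $$xy + z^2$$ is an irreducible polynomial (a rank-3 quadratic form), so any factorisation $$ab$$ would force one of $$a, b$$ to be a unit, and units lie in neither proper ideal. So the set of products misses an element of the ideal $$IJ$$ generated by it.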

Riemann series theorem - generalized
RST states that if a series converges but not absolutely (i.e. conditionally), its terms can be rearranged to give a different value for the sum (and indeed any value, or even to make it diverge). Is it possible to construct a conditionally convergent series such that even merely "rearranging the brackets" will give a different value? This is what I'm talking about:

Consider a series $$\sum^{\infty}_{k=1}(-1)^{k+1}b_k=b_1-b_2+b_3-b_4+\cdots$$ (assume all $$b_k$$ are nonnegative also). This series is implicitly bracketed thus:

$$\sum^{\infty}_{k=1}(-1)^{k+1}b_k=(\ldots(((b_1-b_2)+b_3)-b_4)+\ldots$$

One possible rearrangement is:

$$\sum^{\infty}_{k=1}(-1)^{k+1}b_k=(b_1-b_2)+(b_3-b_4)+(b_5-b_6)+\cdots$$

What I'm talking about is possible for the Grandi series but Grandi's diverges; I'm looking for a convergent series. Thanks, 24.92.74.238 (talk) 16:15, 28 October 2012 (UTC)
 * No, it's not possible. Every partial sum for the bracketed series is also a partial sum for the original series.  If the sequence of partial sums for the original series converges, then a subsequence of it must also converge to the same value. Looie496 (talk) 16:28, 28 October 2012 (UTC)
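Looie496's argument can be illustrated numerically (my own sketch) with the alternating harmonic series $$1 - \tfrac12 + \tfrac13 - \cdots = \ln 2$$: summing term by term or in pairs $$(b_1-b_2)+(b_3-b_4)+\cdots$$ approaches the same limit, because the paired partial sums are a subsequence of the plain ones.

```python
# Sketch: bracketing a convergent alternating series does not change its sum.
# Partial sums of the alternating harmonic series approach ln 2 either way.
import math

N = 200000  # even, so the pairing below covers every term
terms = [(-1) ** (k + 1) / k for k in range(1, N + 1)]
plain = sum(terms)                                             # (...((b1 - b2) + b3)...)
paired = sum(terms[i] + terms[i + 1] for i in range(0, N, 2))  # (b1-b2)+(b3-b4)+...

print(plain, paired, math.log(2))  # all three agree closely
```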

Computable analog of P/poly
Our article on P/poly says that the class "can be used to model practical algorithms with a separate expensive preprocessing phase and a fast processing phase" and that adversaries in cryptography are sometimes considered to be P/poly machines since "this also admits the possibility that adversaries can do heavy precomputation for inputs up to a certain length". However, since the advice function may be uncomputable, the class of problems in P/poly is far larger than one could ever achieve with Turing machine pre-computation even theoretically. It would seem more reasonable for this purpose to use the problems solvable by a polynomial-time machine with polynomial-sized advice, subject to the restriction that the advice also be computable as a function of the input size. Is there a standard name for this class? « Aaron Rotenberg « Talk « 19:44, 28 October 2012 (UTC)

Continuum hypothesis
Does anyone know, if


 * $$2^{\aleph_\alpha} = \aleph_{\alpha+1}.$$

what this evaluates to?


 * $$2^{2^{2^{\cdots}}}$$

Edit: It won't let me format that how I wanted to - basically I mean what is 2 to the power of 2 to the power of 2 to the power of 2 to the power of 2 and so on. — Preceding unsigned comment added by 86.151.16.138 (talk) 20:41, 28 October 2012 (UTC)


 * The question is what do you mean by that notation? Express it as a limit of finitary expressions! For example, if you mean the limit of the sequence
 * $$\aleph_\alpha, 2^{\aleph_\alpha}, 2^{2^{\aleph_\alpha}}, 2^{2^{2^{\aleph_\alpha}}}, \ldots $$
 * then the generalized continuum hypothesis implies that this is $$\aleph_{\alpha+\omega} $$. JRSpriggs (talk) 06:46, 29 October 2012 (UTC)


 * Perhaps a more concrete statement might help. You need to remember that $$\alpha$$ is any ordinal number.  So we get the sequence by evaluating the expression for the first few ordinals:
 * $$\aleph_0, \aleph_1=2^{\aleph_0}, \aleph_2=2^{2^{\aleph_0}}, \aleph_3=2^{2^{2^{\aleph_0}}}, \dots$$
 * The ordinals are not countable (the subscripts become various infinities), so this list is not countable. — Quondum 08:55, 29 October 2012 (UTC)

Ah thank you for your help. I did have one other question. Is there a limit to the cardinal numbers or do they just go on forever;


 * $$\aleph_0, \aleph_1, \aleph_2, \aleph_3, \dots$$ — Preceding unsigned comment added by 109.153.191.44 (talk) 19:13, 29 October 2012 (UTC)


 * As noted above, they go on more than forever; there's a distinct aleph number for every ordinal number. For a direct proof of the infinitude of the infinite cardinals, Cantor's theorem says that (even without assuming the continuum hypothesis) the power set of every set has a larger cardinality than the set itself. Applying this repeatedly to the natural numbers gives an ever-increasing sequence of infinite cardinals. « Aaron Rotenberg « Talk « 00:43, 30 October 2012 (UTC)
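As a finite-set analogue of the repeated power-set construction (illustration only; Cantor's theorem is of course about infinite sets as well), iterating the power set strictly increases the size at every step, since $$|\mathcal{P}(S)| = 2^{|S|} > |S|$$:

```python
# Sketch: iterating the power set on a finite set; sizes grow as 1, 2, 4, 16, ...
# (the finite shadow of Cantor's theorem |P(S)| = 2^|S| > |S|).
from itertools import chain, combinations

def powerset(s):
    """All subsets of s, each as a tuple."""
    s = list(s)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

S = [0]
sizes = [len(S)]
for _ in range(3):
    S = powerset(S)
    sizes.append(len(S))
print(sizes)  # [1, 2, 4, 16]
```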


 * The Vatican needs to be told about this immediately. --   Jack of Oz   [Talk]  01:00, 30 October 2012 (UTC)