Wikipedia:Reference desk/Archives/Mathematics/2020 October 14

= October 14 =

Stumped on a simple thing
I can solve this iteratively, but because my application is a real-time analysis project I tried to find a closed-form solution. I keep running into a square root of a negative number, and I shouldn't be. I've been tearing my hair out most of the day trying to figure out where I went wrong.

The problem is simple enough: given a set of N real values of which N-1 are known, what must the unknown Nth value be so that it lies d standard deviations from the mean of all N values? Basically I need to solve for xN in this expression, in which the terms on the right also depend on xN:
 * $$x_N = \mu \pm d \sigma $$

Actually there are two possible values for xN. Simple enough. So I start with the variance identity, where $$x_i$$ is one of the values, $$\mu$$ is the mean, and $$\sigma$$ is the standard deviation:
 * $$ \sigma^2 = \left( \frac{1}{N} \sum_{i=1}^N x_i^2 \right) - \mu^2 $$

Because the value xN is unknown, the mean and variance can be rewritten as:
 * $$\mu = \frac{s_1+x_N}{N}, \quad \sigma^2 = \frac{s_2+x_N^2}{N} - \mu^2$$

where s1 and s2 are the sums of the known N-1 values of $$x_i$$ and $$x_i^2$$, respectively:
 * $$s_1 = \sum_{i=1}^{N-1} x_i, \quad s_2 = \sum_{i=1}^{N-1} x_i^2$$

Expanding the first equation above, I get
 * $$x_N = \frac{s_1+x_N}{N} \pm d \sqrt{\frac{s_2+x_N^2}{N}- \frac{(s_1+x_N)^2}{N^2}}$$

Rather than working this out myself, I plugged this into the Wolfram Alpha solver, ran the result through the simplify operation, and got
 * $$x_N = -\frac{\sqrt{-d^2 N (d^2 - N + 1) ((N - 1) s_2 - s_1^2)} - s_1 (d^2 - N + 1)}{(N - 1) (d^2 - N + 1)}$$

The problem is, no matter how I break up the problem, or do steps manually and then put them into Wolfram Alpha, the expression under the square root always comes out negative. And this is a set of real, positive values, using d=2 and N=20.

Is there anything wrong with my approach? As I said, this has been stumping me all day. There's something I'm just not seeing. ~Anachronist (talk) 04:06, 14 October 2020 (UTC)
 * Why would the inside of this square root be always negative? Note that such a real $$x_N$$ sometimes exists, but does not always. We know that $$(N - 1)s_2 \geq s_1^2$$ and $$d^2N \geq 0 $$ so it hinges on the sign of the expression $$d^2 - N + 1$$. If this expression is negative then there is a real solution. In particular, say in the extreme case that $$N = 2$$. Then for any value of $$x_1$$ I cannot reasonably ask for an $$x_2$$ that is, say, a hundred standard deviations away from their mean. Note Chebyshev's inequality which places stringent restrictions on how many points can be a certain number of standard deviations away from the mean. To apply the inequality, consider the uniform distribution on your $$N$$ points, so that if there are m points $$x_i$$ such that $$|x_i - \mu| \geq d\sigma$$ the probability of that event is given by $$m/N$$. With $$k = d$$ the inequality then asserts that $$(m/N) \leq 1/d^2$$ or $$m \leq N/d^2$$. If $$N/d^2 < 1$$ then this forces $$m = 0$$, i.e. there are no real solutions. This proves that solutions cannot always exist. Note that this is a weaker restriction than $$d^2 - N + 1 < 0$$; adding $$N - d^2$$ to both sides gives $$1 < N - d^2$$ or $$1/d^2 < N/d^2 - 1$$, or $$1 + 1/d^2 < N/d^2$$. --Jasper Deng (talk) 05:04, 14 October 2020 (UTC)
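Jasper Deng's Chebyshev bound above is easy to check numerically. A minimal sketch (the data and seed are made up, not from the thread):

```python
import math
import random

# 20 made-up values in the range the OP describes later in the thread
random.seed(1)
xs = [random.uniform(80, 90) for _ in range(20)]
n = len(xs)
mu = sum(xs) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)

# Chebyshev applied to the empirical (uniform-over-points) distribution:
# at most N/d^2 of the points can lie d or more standard deviations from mu.
d = 2
m = sum(1 for x in xs if abs(x - mu) >= d * sigma)
print(m, n / d**2)  # m never exceeds N/d^2 = 5 here
```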
 * The expression under the square root isn't negative. There's a negative sign at the front, but $$(d^2-N+1)=-15$$ is also negative.  The last term under the root, $$((N-1)s_2 - s_1^2)$$, is $$(N-1)^2$$ times the variance of the first N-1 values, so nonnegative.--101.98.109.114 (talk) 05:07, 14 October 2020 (UTC)
 * To Jasper Deng and anonymous: thank you. I'll point out that $$d^2 - N + 1$$ is always negative for my values of d and N (d=2 and N=20 in my tests). That means the second term under the radical must be the culprit; the expression under the square root is always negative in all of my tests with real positive numbers for 19 values of x, solving for the 20th. My computer program is doing this with running sums on a time series; I've checked s1, so maybe I'm not updating s2 properly. I'm not trying to solve for a bizarre distribution of data; it's just, say, 20 values randomly distributed between 80 and 90. When I solve it iteratively, I always get the correct solution. Let me go back and check my code.... ~Anachronist (talk) 05:20, 14 October 2020 (UTC)
 * Assuming you calculated $$s_1$$ correctly, clearly your implementation of $$s_2$$ is not correct. In Python:
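(The Python snippet did not survive archiving. A sketch of the kind of check it presumably performed, using the closed-form formula quoted above with made-up data, since the original values were not posted; the 564.02… value quoted next came from the original data, not from these numbers.)

```python
import math

# Made-up stand-ins for the 19 known values (the originals weren't posted)
xs = [3.1, 4.7, 5.2, 2.8, 6.0, 4.4, 3.9, 5.5, 4.1, 3.3,
      5.8, 4.9, 2.5, 6.2, 3.7, 4.0, 5.1, 4.6, 3.0]
d = 2
n = len(xs) + 1                 # N = 20, counting the unknown value

s1 = sum(xs)
s2 = sum(x**2 for x in xs)      # x**2, not x^2 (which is XOR in Python)

# Closed-form solution quoted above; k = d^2 - N + 1 is negative here,
# so the radicand is nonnegative and the square root is real.
k = d**2 - n + 1
x_n = -(math.sqrt(-d**2 * n * k * ((n - 1) * s2 - s1**2)) - s1 * k) / ((n - 1) * k)
```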


 * I get 564.0255249385683 as the solution. Remarkably, the existence of a real solution depends only on $$N$$ and $$d$$, not on the actual data points. I don't know which language you are using, but if it is Python, note that x^2 is not the square of x; it is the exclusive or of x with 2. You need to write x**2 to get the square.--Jasper Deng (talk) 05:30, 14 October 2020 (UTC)
 * I'm not using a power operator, I'm doing x*x for x squared. Also, for d<<N, whether a real solution exists depends only on that second term under the radical, $$(N - 1) s_2 - s_1^2$$, and in my tests it always came out negative. I put the numbers in Excel and verified that s2 and s1 were mathematically correct, but I discovered that I was using the wrong sample size: instead of summing the last N-1 data values to solve for the Nth value, I was summing the last N values. That consistently gave me negative numbers under the square root.
 * It helped to talk this out with you. Thanks. ~Anachronist (talk) 06:38, 14 October 2020 (UTC)

It is clear that for N=10 and d=3, there is no solution because $$d^2 - N + 1$$ goes to zero, and it's in the denominator. I'm wondering how to get around that, because I can easily find a solution iteratively. ~Anachronist (talk) 14:23, 14 October 2020 (UTC)


 * That means your quadratic is actually linear. --JBL (talk) 14:54, 14 October 2020 (UTC)
 * You mean continuous?
 * Unfortunately, the specific case of d=3 and N=10 is one of the situations this computer program must handle. Given the instability of the closed-form solution, I think I need to go back to the iterative approach. I was hoping to avoid the CPU overhead of iterating. ~Anachronist (talk) 16:37, 14 October 2020 (UTC)
 * Nay, he means linear. The quadratic formula is not valid when the coefficient of the quadratic term is zero, as appears to be the case here. But then your equation is linear, and its solution takes another form. (And if there truly were no solution, even iterative methods would be doomed to fail.)--Jasper Deng (talk) 16:47, 14 October 2020 (UTC)
 * Consider the quadratic equation $$ax^2 + bx + c = 0$$ with real coefficients. If $$ a \neq 0$$ and $$ b^2 - 4ac \geq 0$$ then it has solutions $$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$.  If $$ a \neq 0$$ and $$ b^2 - 4ac < 0$$ then it does not have real solutions.  If $$a = 0$$ then actually it is not a quadratic equation at all, it is the much simpler linear equation $$bx + c = 0$$, with unique real solution $$ \frac{-c}{b}$$ whenever $$b \neq 0$$. (Pretty sure your CPU should be able to handle that ;).)  Also this is probably not relevant to you but I think it is nice to notice that if $$ b > 0 $$ then $$ \lim_{a \to 0} \frac{-b + \sqrt{b^2 - 4ac}}{2a} = -\frac{c}{b}$$, so the quadratic knows about its degenerate linear case; if $$b < 0$$ then it's the other root that approaches the root of the linear equation. --JBL (talk) 17:19, 14 October 2020 (UTC)
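The case analysis above can be written directly as a small solver. A sketch, with names of my own choosing:

```python
import math

def real_roots(a, b, c, eps=1e-12):
    """Real solutions of a*x^2 + b*x + c = 0, falling back to the
    linear equation b*x + c = 0 when the quadratic coefficient vanishes."""
    if abs(a) < eps:                       # degenerate: not a quadratic at all
        return [-c / b] if abs(b) > eps else []
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                          # no real solutions
    r = math.sqrt(disc)
    return [(-b - r) / (2 * a), (-b + r) / (2 * a)]
```

Note that `real_roots(0, b, c)` agrees with the $$a \to 0$$ limit of one quadratic root, matching the limit computed above.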
 * It seems to be less than stable near that discontinuity in the solution though. I tried d=2.9 with N=10 and got some wild looking results. I had solved this some years ago with an iterative technique. In trying to port my software to another platform I figured I'd try for a closed-form solution. I'll go back to iterative, but not the crude cut-stepsize-in-half approach I used before. I think Newton's method may be more efficient since the derivative of the function is pretty straightforward. ~Anachronist (talk) 22:19, 14 October 2020 (UTC)
 * Why not just use the linear case's closed-form solution when there is exactly one root? Or better yet, use Muller's formula, which remains valid in the linear case (though you still need to take some care near the discontinuity). The downside of course is having a square root in the denominator, but you can also move it to the numerator by rationalizing the denominator. Indeed, when $$0 < b^2 - 4ac < 1$$, its square root is greater than the radicand itself, so you lose about half of your precision. You can partly recover this by expanding the square root as a binomial series: $$\sqrt{b^2 - 4ac} = b\sqrt{1 - 4ac/b^2} = b\sum_{i = 0}^\infty {1/2 \choose i}(-4ac/b^2)^i$$. For $$a$$ very near zero, truncating this series and substituting it for the square root will be more stable than computing the square root "in full" with a square root function.--Jasper Deng (talk) 22:54, 14 October 2020 (UTC)
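A sketch of that alternative formula (the "citardauq" form $$x = 2c/(-b \mp \sqrt{b^2-4ac})$$), which stays finite as $$a \to 0$$; the function name is mine:

```python
import math

def roots_citardauq(a, b, c):
    """Roots via q = -(b + sign(b)*sqrt(b^2 - 4ac))/2, giving x = q/a
    and x = c/q.  The root c/q stays valid at a == 0 (the linear root
    -c/b), and the sign choice avoids subtracting nearly equal values.
    Assumes b^2 - 4ac >= 0 and that the equation isn't fully degenerate
    (q != 0)."""
    disc = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(disc, b)) / 2
    if a == 0:
        return [c / q]                     # degenerate linear case
    return [q / a, c / q]
```

For tiny a, the naive root $$(-b + \sqrt{b^2-4ac})/(2a)$$ suffers catastrophic cancellation; `c/q` does not.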
 * Indeed, why not? Well, it's because all of this started as a fairly simple algebraic problem that I got stuck on because I had unknowingly used N instead of N-1 values to calculate the sums s1 and s2, consistently giving me a square root of a negative number. Once I figured out that the error was in my code and not my formulas, I found further complications at or near combinations of N and d that create a divide-by-zero error. So now to get around that and still avoid iteration, I need to account for special cases and calculate it as an approximation. The closed form turns out to be not so straightforward after all, in practice. I have concluded that if I'm going to be approximating it anyway, I may as well use Newton's method, which is straightforward, elegant, converges quickly, and better than the iterative solution I came up with years ago. I can make the iteration better, and the underlying calculations way more efficient. So I'll still have an improvement in this new version I'm porting.
 * This has been an interesting exercise and I learned more than I expected (and brought back memories of knowledge that had been long forgotten). Thanks for that. ~Anachronist (talk) 05:00, 15 October 2020 (UTC)
 * And watch out: Newton's method won't do at roots of the derivative! The thing about numerical analysis really is that there's no one-fits-all for anything. This thing about "expand using a Taylor series to stabilize" is common. For example, $$\frac{e^x - 1}{x}$$ is going to be less stable than $$\sum_{i = 1}^\infty \frac{x^{i - 1}}{i!}$$.--Jasper Deng (talk) 06:01, 15 October 2020 (UTC)
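For the record, a numeric illustration of that point. Python's `math.expm1` already provides the stable computation of $$e^x - 1$$, and the series version is a straightforward truncation:

```python
import math

def f_naive(x):
    """(e^x - 1)/x computed directly; loses precision near x = 0
    because exp(x) - 1 cancels catastrophically."""
    return (math.exp(x) - 1.0) / x

def f_series(x, terms=10):
    """Truncated series sum_{i>=1} x^(i-1)/i!, stable near x = 0."""
    total, term = 0.0, 1.0       # term holds x^(i-1)/i!
    for i in range(1, terms + 1):
        total += term
        term *= x / (i + 1)
    return total

x = 1e-12
stable = math.expm1(x) / x       # library equivalent of the series approach
```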

Well, the derivative of $$f(x_N) = \mu + d \sigma - x_N$$ does have a suspicious denominator:
 * $$f'(x_N) = \frac{1}{N}-1+\frac{d\left((N-1)x_N-s_1\right)}{N \sqrt{(N-1)x_N^2-2 s_1 x_N + N s_2 - s_1^2}}$$

The quantity under that square root is just $$N^2\sigma^2$$, so it can't go negative for real data, but it vanishes when all N values are equal, and a near-zero variance would still blow up the derivative. If that's a problem in practice (I haven't had time to test it), I'll go back to a binary search approximation. It'll still be better than what I had, because before I was inefficiently recalculating the whole mean and standard deviation in each iteration, without first separating out s1 and s2, which can be treated as constants. So at least I'll be able to eliminate any loops in each iteration. ~Anachronist (talk) 18:58, 16 October 2020 (UTC)
 * If I were you, I would get rid of square roots before applying Newton's method. Here that would mean the equation $$(d\sigma)^2 = (x_N - \mu)^2$$. This does not introduce any extraneous solutions since $$d\sigma$$ can be of either sign (corresponding to either solution). Square roots are bad for stability (again, they lose half of your precision almost automatically). By "binary search" I assume you mean the bisection method, but in that case you still need to ensure you have one point below the x-axis and another above in order to guarantee finding a root. Such points are guaranteed to exist, but for $$a$$ very near zero it will not do you any good, since you will have to go very far out in order to bracket the root, which diverges as $$a \to 0$$.--Jasper Deng (talk) 19:29, 16 October 2020 (UTC)
 * Yes, bisection method. My original algorithm never encountered an error over years of use by multiple people with millions of streaming real-time values, so I'm less worried about special situations than about division-by-zero or square-root-of-negative with other approaches. My old approach simply calculated, for a test value of xN, the mean and standard deviation of the entire set of N values, and adjusted xN depending on whether it was less or greater than the value calculated for d&sigma; for that iteration, changing the sign and increment of the adjustment until xN fell within an acceptable error around d&sigma;. All sequences of N values always had a variance, so it always converged. I just wasn't happy with the efficiency. No one ever complained that my algorithm was bogging things down. I just figured, since I'm porting this to another platform, that I would see what I can do to improve it. ~Anachronist (talk) 04:54, 17 October 2020 (UTC)
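That approach, with $$s_1$$ and $$s_2$$ hoisted out of the loop as described, can be sketched as follows (names are mine, and this uses standard bisection rather than the original step-halving):

```python
import math

def solve_bisect(known, d, lo, hi, eps=1e-9):
    """Bisection for g(x) = x - (mu + d*sigma) = 0, where mu and sigma
    are taken over the N-1 known values plus the trial x.  The sums s1
    and s2 are computed once, outside the loop, as the poster describes."""
    n = len(known) + 1
    s1 = sum(known)
    s2 = sum(v * v for v in known)

    def g(x):
        mu = (s1 + x) / n
        sigma = math.sqrt((s2 + x * x) / n - mu * mu)
        return x - (mu + d * sigma)

    if g(lo) * g(hi) > 0:
        raise ValueError("root not bracketed")
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```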
 * I manually obtained the following formula:
 * $$x_N=\frac{s_1\pm\sqrt{D}}{N-1}$$,
 * where
 * $$D=s_1^2+(N-1)\frac{Ns_2d^2-s_1^2(1+d^2)}{N-1-d^2}=\frac{Nd^2}{N-1-d^2}\left((N-1)s_2-s_1^2\right)$$,
 * which does not suffer from any problems. Ruslik_ Zero 13:28, 17 October 2020 (UTC)
 * The formula may be rewritten in a very simple form:
 * $$x_N=\langle x\rangle\pm d\sqrt{\frac{N}{N-1-d^2}}\sqrt{\langle x^2\rangle-\langle x\rangle^2}$$,

where
 * $$\langle x\rangle=\frac{s_1}{N-1}$$, $$\langle x^2\rangle=\frac{s_2}{N-1}$$.
 * The expression under the second square root is obviously nonnegative. Under the first root, the factor is positive if $$d^2<N-1$$. Ruslik_ Zero 13:48, 17 October 2020 (UTC)
 * That's not valid for $$N - 1 - d^2 = 0$$ and unstable in cases close to that, which is what the OP is having trouble with.--Jasper Deng (talk) 16:45, 17 October 2020 (UTC)
 * It's a much simpler and easier-to-understand expression for the solution, though. ~Anachronist (talk) 16:49, 17 October 2020 (UTC)
 * I found an incredibly fast convergence for this:
 * Start with a guess for $$x_N$$; a good guess is the previous value found
 * Evaluate $$x_b = \mu + d \sigma$$ (first equation at the start of this thread)
 * If $$|x_N-x_b|<\epsilon$$, stop; otherwise set $$x_N=x_b$$ and repeat from step 2
 * This converges after just a couple of iterations (often 1 iteration for my error $$\epsilon$$), and doesn't need bisection method or Newton's method. ~Anachronist (talk) 02:16, 18 October 2020 (UTC)
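A sketch of that loop (the function name is mine), again with $$s_1$$ and $$s_2$$ precomputed:

```python
import math

def solve_fixed_point(known, d, x0, eps=1e-9, max_iter=100):
    """Fixed-point iteration x <- mu(x) + d*sigma(x), where mu and sigma
    are taken over the N-1 known values plus the trial x."""
    n = len(known) + 1
    s1 = sum(known)
    s2 = sum(v * v for v in known)
    x = x0
    for _ in range(max_iter):
        mu = (s1 + x) / n
        xb = mu + d * math.sqrt((s2 + x * x) / n - mu * mu)
        if abs(xb - x) < eps:
            return xb
        x = xb
    raise RuntimeError("did not converge")
```

For what it's worth, at a root the iteration map's derivative works out to $$(1+d^2)/N$$, so convergence should require $$d^2 < N-1$$, the same condition as the closed form; that is consistent with the instability later reported for larger d.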
 * You're doing what we call fixed-point iteration. This can converge extraordinarily quickly, but be careful to ensure you keep enough digits to avoid the loss of precision due to the square root, and also to choose a good initial guess (it must be within the roots' basin of attraction).--Jasper Deng (talk) 05:38, 18 October 2020 (UTC)
 * It turned out to be unstable for values of d larger than 2 or 3, so I went back to bisection. I'm satisfied that I eliminated the loops for calculating mean and standard deviation, to make it more efficient than my previous version. ~Anachronist (talk) 06:06, 19 October 2020 (UTC)
 * Typically one only uses bisection method to obtain a good guess for a better/faster method like the fixed-point iteration or Newton's method. You probably also saw just how hard satisfying all the conditions for fixed-point iteration to work can be!--Jasper Deng (talk) 07:11, 19 October 2020 (UTC)


 * This is archived now, but I wanted to provide a small followup. It turns out I don't need an iterative solution after all. That pesky term $$N-1-d^2$$ that appears in the denominator and in the square root turns out not to be a hole or discontinuity in the solution. It's a limit! That is, there is no solution when $$N-1-d^2 \le 0$$. I proved this to myself empirically. For $$N=10, d \ge 3$$, I generated 9 random numbers and a 10th outlier, and calculated $$\mu + d\sigma$$. No matter how large I set the outlier, it never exceeds $$\mu+d\sigma$$. Same for any other combinations of N and d when $$N-1-d^2 \le 0$$. I wrote early on in this thread that for this specific case "I can easily find a solution iteratively" but that was incorrect; at the time I was having other implementation problems.
 * So I can use my closed-form solution, as long as $$N-1>d^2$$. I'll go with that. Thanks again for your help. ~Anachronist (talk) 18:51, 24 October 2020 (UTC)
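Putting the thread's conclusion together: a sketch of the guarded closed form, using Ruslik's simplified expression (the function name is mine):

```python
import math

def outlier_bound(known, d):
    """Closed-form x_N satisfying x_N = mu + d*sigma over all N values.
    Valid only when N - 1 > d**2; per the discussion above, no real
    solution exists otherwise."""
    n = len(known) + 1
    if n - 1 <= d * d:
        raise ValueError("no real solution: need N - 1 > d^2")
    s1 = sum(known)
    s2 = sum(v * v for v in known)
    m1 = s1 / (n - 1)               # <x> over the known values
    m2 = s2 / (n - 1)               # <x^2> over the known values
    return m1 + d * math.sqrt(n / (n - 1 - d * d)) * math.sqrt(m2 - m1 * m1)
```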

Primes (base 10) questions
Some questions on primes (all questions are base 10)
 * 1) Can it be proved that there is no maximum prime consisting only of the digits 1, 3, 7 & 9?
 * 2) Can it be proved that for every length there is a prime consisting only of the digits 1, 3, 7 & 9?
 * 3) (Had another question that built on top of it that turned out to be about right truncatable primes)

(Side question: do any of these answers change if 2 is allowed in the first position in the numbers?) Naraht (talk) 13:54, 14 October 2020 (UTC)


 * I believe that the answer to question 2 is that no proof is known (and the only way to say "Yes, it can be proved" is to have a proof), but that it is nevertheless almost certainly true, in the same way that Goldbach's conjecture is almost certainly true. The number of primes of length $$n$$ whose decimal representation contains only the digits 1, 3, 7 and 9 appears to grow proportionally to $$4^n/n$$, so there are rapidly more of them. Since an affirmative answer to 1 follows from 2, the same holds for 1. --Lambiam 00:15, 15 October 2020 (UTC)
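That count is easy to spot-check for small lengths by brute force; a sketch (helper names are mine):

```python
from itertools import product

def is_prime(n):
    """Trial-division primality test; fine for the small sizes here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def count_1379_primes(length):
    """Number of primes of the given decimal length whose digits
    are all drawn from {1, 3, 7, 9}."""
    return sum(1 for digits in product("1379", repeat=length)
               if is_prime(int("".join(digits))))

counts = [count_1379_primes(n) for n in range(1, 6)]
```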