Wikipedia:Reference desk/Archives/Mathematics/2007 November 17

= November 17 =

Integration
I need some help - what method would one use to evaluate the integral: $$\int \frac {1}{1+e^x}\, dx$$? Thanks. -- Sturgeonman (talk) 16:24, 17 November 2007 (UTC)
 * Try the substitution $$t=e^x$$. -- Meni Rosenfeld (talk) 17:08, 17 November 2007 (UTC)
 * By the way, there is a "Show preview" button, and Help:Formula has some information about writing formulae. -- Meni Rosenfeld (talk) 17:11, 17 November 2007 (UTC)


 * The substitution $$t=e^x$$ will work, and you will then have to decompose the result into partial fractions. You can also do this one by parts instead of substituting as Meni said, and then you might need to do the second integral by parts again. Anyway, it comes out to $$x-\ln(e^x+1)+C$$. A Real Kaiser (talk) 19:05, 17 November 2007 (UTC)
 * We prefer not to give final results for what may be homework questions. -- Meni Rosenfeld (talk) 21:50, 17 November 2007 (UTC)
 * Alternatively, expand $$1/(1+e^x)$$ as a geometric series, integrate term by term, and recognize the Taylor series. Fredrik Johansson 20:14, 17 November 2007 (UTC)


 * The easiest way seems to be to divide top and bottom by $$e^x$$, then the expression to be integrated is of the form -du/u, giving -ln(u). A bit of manipulation gives the answer in its simplest form, with not a "by parts" in sight. 81.159.14.111 (talk) 12:46, 18 November 2007 (UTC)
 * Not clear. You have $$\int \frac{e^x}{e^x+e^{2x}}\ dx$$, but the numerator is not the derivative of the denominator. -- Meni Rosenfeld (talk) 12:58, 18 November 2007 (UTC)
 * Dividing top and bottom by $$ e^x $$ gives $$ \int \frac{e^{-x}}{e^{-x} + 1} \ dx $$. SmaleDuffin (talk) 16:57, 18 November 2007 (UTC)
 * Right, I should work on my reading comprehension skills. -- Meni Rosenfeld (talk) 18:05, 18 November 2007 (UTC)
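
For reference, the divide-by-$$e^x$$ route discussed above can be written out as follows. This is only a sketch: the substitution $$u = e^{-x}+1$$ (so $$du = -e^{-x}\,dx$$) is an assumption matching the "-du/u" form mentioned in the thread, and $$C$$ is the constant of integration.

$$\begin{aligned}
\int \frac{dx}{1+e^x} &= \int \frac{e^{-x}}{e^{-x}+1}\,dx = -\int \frac{du}{u} = -\ln(e^{-x}+1) + C \\
&= -\ln\frac{1+e^x}{e^x} + C = x - \ln(1+e^x) + C
\end{aligned}$$

The last line agrees with the result quoted earlier in the thread.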

Optimal initial interval for my algorithm
This is going to be relatively complicated, and probably long, but I think it's an interesting problem and the solution will help me run things faster. I'm relatively well-educated mathematically, but probability/statistics and I often don't get along, so that's the part you may need to help me with.

I have a function, $$f(x)$$, defined over $$x \in [0,\infty)$$ that maps a real number to a boolean. It's monotonically decreasing, if that can be said of a boolean function.  That is,

$$f(x)=\begin{cases} 1 & 0 \leq x \leq q\\ 0 & x > q \end{cases}$$

The function $$f(x)$$ is expensive to evaluate, and I would like to evaluate it as few times as possible. Pretty much anything you could express here is orders of magnitude computationally cheaper than evaluating $$f(x)$$.

My algorithm finds $$q$$ within $$\epsilon$$. It does so in two stages:


 * 1) Find an interval over which $$f(x)$$ changes, that is, an $$a$$ and a $$b$$ such that $$f(a)=1$$ and $$f(b)=0$$
 * 2) Repeatedly bisect this interval, evaluating the middle point and halving the interval each time, until $$b-a < \epsilon$$.  The estimate of $$q$$ is then $$\frac{a+b}{2}$$.

Now, I'm pretty sure the second stage is as efficient as it can be. It will take

$$\left \lceil \frac{\log(b^* -a^*) - \log \epsilon}{\log 2} \right \rceil$$

calls of the function $$f(x)$$, where $$a^*$$ and $$b^*$$ are the ends of the interval when we start the second stage.
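
The count above can be checked with a small sketch of the second stage. The names `bisect_threshold` and `predicted_calls` are hypothetical, and the test function `x <= 0.3` merely stands in for the expensive $$f(x)$$.

```python
import math

def bisect_threshold(f, a, b, eps):
    """Stage 2: assumes f(a) == 1 and f(b) == 0; halves [a, b] until b - a < eps."""
    calls = 0
    while b - a >= eps:
        mid = (a + b) / 2
        calls += 1
        if f(mid):
            a = mid  # q lies at or to the right of mid
        else:
            b = mid  # q lies to the left of mid
    return (a + b) / 2, calls

def predicted_calls(a, b, eps):
    """The ceiling formula from the text, evaluated for a starting interval [a, b]."""
    return math.ceil((math.log(b - a) - math.log(eps)) / math.log(2))

estimate, calls = bisect_threshold(lambda x: x <= 0.3, 0.0, 1.0, 1e-3)
print(estimate, calls, predicted_calls(0.0, 1.0, 1e-3))
```

One caveat: when $$(b^*-a^*)/\epsilon$$ is an exact power of two, the strict stopping test $$b-a < \epsilon$$ costs one call more than the formula predicts.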

So let's focus on the first stage for a moment. Here's how I've implemented it:


 * The calling routine gives an estimate of $$a$$ and $$b$$. Let's call these $$a^0$$ and $$b^0$$


 * I evaluate $$f(a^0)$$ and $$f(b^0)$$. If they're different, move on to stage 2 right away, the initial interval has been found.


 * If they're the same, we have to search for an interval. I'll only give the algorithm for the case where they're both 1; the algorithm for the case where they're both 0 is analogous (though modified a little to ensure that we never evaluate a negative number). Set $$k=0$$.


 * Evaluate $$f(2b^k-a^k)$$. This was picked because it doubles the interval, $$2b-a=b+(b-a)$$. If $$f(2b^k-a^k)$$ is 0, then we've found an appropriate interval and can move on to the next stage.  Otherwise, set $$a^{k+1}=a^k$$, $$b^{k+1}=2b^k-a^k$$, increment $$k$$ by one, and repeat this step.

So I think this stage is also optimal, but I'll take suggestions for improving it if anyone has any.
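
The first stage, for the case where both initial points give 1, might be sketched as below. This is an illustration under stated assumptions, not the poster's actual code: the name `find_bracket` and the `max_steps` guard are hypothetical, and the update is assumed to keep $$a$$ fixed and move $$b$$ outward so that the interval width doubles on every step.

```python
def find_bracket(f, a, b, max_steps=60):
    """Stage 1 when f(a) == f(b) == 1 is already known: expand right until f gives 0.

    Returns (lo, hi, calls) with f(lo) == 1 and f(hi) == 0.
    """
    calls = 0
    for _ in range(max_steps):
        c = 2 * b - a  # b + (b - a): doubles the width of [a, b]
        calls += 1
        if not f(c):
            return b, c, calls  # b is the last point known to give 1
        b = c  # assumed update: keep a, move b outward
    raise RuntimeError("no change in f found within max_steps")

print(find_bracket(lambda x: x <= 10, 1.0, 2.0))
```

Because the width doubles, the returned bracket is about as wide as everything searched so far, which is the trade-off against the second stage described next.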

Now we get to my question. I want to choose $$a^0$$ and $$b^0$$ well. It's a question of balancing the average number of function evaluations required for the two stages: too narrow an interval, and the first stage will take too long; too wide an interval, and the second stage will take too long.

Obviously if I'm given no information whatsoever about $$q$$ there's no good way to choose this interval. But I happen to know something about $$q$$ before we even start this algorithm. It's a normally distributed random variable with mean $$\mu$$ and variance $$\sigma^2$$. (Actually, it's not, it's a completely deterministic number, but I happen to have a model that predicts $$q$$ and gives an estimate of mean square error analogous to the variance of a normal distribution).

So, how do I pick $$a^0$$ and $$b^0$$ in such a way to minimize the total number of evaluations of $$f(x)$$, over both stages?

Thanks, moink (talk) 20:47, 17 November 2007 (UTC)


 * The normal distribution gives some hints: the 68-95-99.7 rule states that for a normal distribution, almost all values lie within 3 standard deviations of the mean. So you have a 99.7% chance of q lying in $$\mu\pm 3\sigma$$, and $$a_0=\mu-3\sigma$$, $$b_0=\mu+3\sigma$$ will give a good estimate. You might also try $$\mu\pm 2\sigma$$, with a 95% chance of q lying in the range. --Salix alba (talk) 21:04, 17 November 2007 (UTC)
 * Right, but that means that most of the time the interval will be too big. So the majority of the time, stage 1 wouldn't happen at all, but stage 2 would take longer than it would have, given a smaller interval.  I'm looking for the initial interval that gives the optimum tradeoff between the two stages.  moink (talk) 21:12, 17 November 2007 (UTC)
 * Must q be nonnegative? If so, it is not clear how it can be (treated as) normally distributed. Do you mean, perhaps, that $$\mu$$ is large enough for the discrepancy to be insignificant? Or perhaps q is the absolute value of a normally distributed variable?
 * Yes, q is non-negative. So obviously it's not really normally distributed.  It's just the best approximation I have for it.  I guess I can think of it as a truncated normal distribution, with the low end of the tail cut off.
 * I'll comment that, since q is assumed to be normally distributed, neither of your two steps is optimal. If you know, say, that $$f(1)=1$$ and $$f(2)=0$$, then you are dealing with the conditional distribution of q given $$1 \le q \le 2$$. The optimal next evaluation point depends on all parameters, but I suspect that as $$\epsilon \to 0$$, taking the median is optimal (as opposed to the mid-range).
 * I also don't think that starting with two points, $$a_0$$ and $$b_0$$, is optimal. You should start with a single point, see what it tells you (either $$q < a$$ or $$q \ge a$$) and continue from there.
 * I have two versions of my algorithm: one that starts with an initial point, and another that starts with an initial interval. I wrote the second because I was frustrated watching the first run: it seemed like the better the guess for the initial point, the worse the algorithm performed.  But that was before I developed the alternative model for q.
 * Here's what I think is optimal as $$\epsilon \to 0$$: Start with $$0 \le q < \infty$$. At every step, take the median of the probability distribution conditional on what you currently know about q. Evaluate f at that point, and take the appropriate interval based on the result.
 * I haven't really thought about how to calculate the median, but it shouldn't be too hard. -- Meni Rosenfeld (talk) 22:08, 17 November 2007 (UTC)
 * Ha! That's the part I probably am bad at.  Ok, I understand your idea, and I think you're probably right.  Now I just have to calculate the median of a conditional normal distribution, i.e. a normal distribution truncated at one end or both.
 * Actually, that depends on what you want to minimize. This minimizes the expected number of evaluations. Minimizing worst case is not really possible, since the worst case will always be unboundedly many evaluations. If you want something in between, a simple bisection might have merits, though you will have to define what it is exactly that you want.
 * Yes, the expected number of evaluations is what I'm looking to minimize. I should have made this more clear.  It's more complicated than I've made it out to be, but the simple version is that I need a whole lot of evaluations of $$q(\mathbf{u})$$ over a range of $$\mathbf{u}$$, since f is really $$f(x,\mathbf{u})$$, where $$\mathbf{u}$$ is a vector of parameters I have some control over.  The plan is actually to vary $$\mathbf{u}$$ to maximize $$q(\mathbf{u})$$.
 * That actually makes a significant difference, since the results for different values of $$\mathbf{u}$$ can be used together to make the search faster. If nothing else, at the very least you'll know that if $$f(x_0,\mathbf{u}_0)=0$$ and $$f(x_1,\mathbf{u}_1)=1$$ for some $$x_0 \le x_1$$, then there's no more point in searching at $$\mathbf{u}_0$$ — it can't maximize $$q$$.  Also, if $$q(\mathbf{u})$$ is expected to be continuous or even just have any kind of autocorrelation, you should be able to use that information to further refine your search.  —Ilmari Karonen (talk) 09:40, 19 November 2007 (UTC)
 * It's actually not that simple. If we want to maximize $$q(\mathbf{u})$$, we need to use some algorithm for it, for example, gradient ascent (in particular if $$q(\mathbf{u})$$ is concave). Depending on the algorithm, it can depend in all sorts of ways on the values of q at specific points. It may need to know the value of q at some point to some precision, even if it is known that the maximum is not achieved at that point.
 * I agree with the main issue, though, that the nature of the problem suggests that we can do better than naively computing q for any input. For example, as we approach the maximum, we have a rough idea of what q is, so we should no longer treat it as normally distributed over the entire range of possibilities. Our required precision should also increase as we approach the maximum. Optimizing the maximization problem is harder, and depends on our assumptions regarding $$q(\mathbf{u})$$. -- Meni Rosenfeld (talk) 12:35, 19 November 2007 (UTC)
 * Needless to say, the minimal possible expected number of evaluations is directly related to the information theoretical entropy of the distribution. -- Meni Rosenfeld (talk) 22:15, 17 November 2007 (UTC)
 * Ummm, sure, needless to say. Information theory is another one of my weak points.  Thank you very much for your help, you've given me a direction to go.  moink (talk) 22:37, 17 November 2007 (UTC)
 * After reading information entropy, yes, that is indeed obvious. moink (talk) 22:53, 17 November 2007 (UTC)
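
The median-based search suggested in this thread could be sketched as follows. It uses `statistics.NormalDist` from the Python standard library for the normal CDF and its inverse; the function name `median_search`, the `max_calls` guard, and the fallback step used when the tail probabilities round to 0 or 1 are all assumptions made for illustration. The median of a normal distribution conditioned on $$lo \le q \le hi$$ is obtained by inverting the CDF at the average of the CDF values at the two ends.

```python
import math
from statistics import NormalDist

def median_search(f, mu, sigma, eps, max_calls=200):
    """Locate q within eps, evaluating f at the conditional median each time.

    Assumes q >= 0 and q ~ Normal(mu, sigma) truncated at 0, as in the thread.
    """
    prior = NormalDist(mu, sigma)
    lo, hi = 0.0, math.inf  # f(lo) is taken to be 1; hi is the smallest point known to give 0
    calls = 0
    while hi - lo >= eps and calls < max_calls:
        p_lo = prior.cdf(lo)
        p_hi = 1.0 if math.isinf(hi) else prior.cdf(hi)
        p_mid = 0.5 * (p_lo + p_hi)  # half of the remaining probability mass
        x = prior.inv_cdf(p_mid) if 0.0 < p_mid < 1.0 else math.nan
        if not (lo < x < hi):
            # Fallback (an arbitrary choice): plain midpoint, or a growing
            # step to the right while no upper bound is known yet.
            x = 0.5 * (lo + hi) if math.isfinite(hi) else 2.0 * lo + sigma
        calls += 1
        if f(x):
            lo = x
        else:
            hi = x
    return 0.5 * (lo + hi), calls

estimate, calls = median_search(lambda x: x <= 11.3, mu=10.0, sigma=2.0, eps=1e-3)
print(estimate, calls)
```

Each evaluation splits the remaining probability mass roughly in half, which is where the connection to the entropy of the distribution comes from. Whether this beats the two-stage interval method in practice would depend on how well $$\mu$$ and $$\sigma$$ describe the true $$q$$.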