Talk:AdaBoost

Untitled
I'm not sure about the way the weight α is computed. In the Viola and Jones paper, they define α as 0.5*log(1-ε/ε), not ln. If this is ln, α is negative, which is not possible.
 * First, log is not necessarily the logarithm with base 10; often it is the same as ln. Second, even if it is, log10(x) = ln(x)/ln(10), so log10(x) and ln(x) are either both positive or both negative. Since it is postulated that ε < 0.5, α is always positive. Finally, (1-ε)/ε ≠ 1-ε/ε. Victordk13 (talk) 10:19, 7 July 2008 (UTC)
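To make the parenthesization point above concrete, here is a quick check (a Python sketch; the function and variable names are mine, not from the article) that α = 0.5·log((1-ε)/ε) is positive for ε < 0.5 in any base, since changing the base only rescales by a constant:

```python
import math

def alpha(eps, base=math.e):
    """AdaBoost weight for a weak learner with weighted error eps.
    Note the parentheses: the ratio is (1 - eps) / eps, not 1 - eps / eps."""
    return 0.5 * math.log((1 - eps) / eps, base)

# For eps < 0.5 the ratio (1 - eps)/eps > 1, so the log is positive in
# any base; log base 10 is just the natural log divided by ln(10).
for eps in (0.1, 0.25, 0.4):
    a_ln = alpha(eps)             # natural log
    a_10 = alpha(eps, base=10)    # base-10 log
    assert a_ln > 0 and a_10 > 0
    assert abs(a_10 - a_ln / math.log(10)) < 1e-12
```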

What is a weak classifier? —Preceding unsigned comment added by 202.89.161.163 (talk) 06:38, 27 March 2009 (UTC)
 * Good question. Originally, "weak classifier" was placed in a link that took one to the boosting page. But the term used on that page was "weak learner," and it was not immediately obvious that the meaning was similar. I added a few sentences here to make things clearer. -AlanUS (talk) 18:24, 19 November 2011 (UTC)

When was Adaboost invented? How has it affected Machine Learning research? How does it relate to the support vector machine idea? I'd like to see some more history and big picture ideas in this article.--Singularitarian (talk) 19:36, 10 November 2009 (UTC)

I think the link http://opencv.willowgarage.com/documentation/boosting.html is broken and should be removed —Preceding unsigned comment added by 77.188.176.114 (talk) 23:32, 25 February 2010 (UTC)

I think there is an error in the algorithm. Since the algorithm is for a binary classification problem, the early termination criterion must be strictly eps = 0.5: if eps > 0.5 for some h(x), we could take h_1(x) = -h(x), and the error for h_1(x) will be _less_ than 0.5 —Preceding unsigned comment added by Kzn (talk • contribs) 07:34, 6 May 2010 (UTC)
 * You're right. I fixed the page to use |error rate - 0.5|, though the criterion is that it be smaller than some value beta rather than exactly 0. -AlanUS (talk) 19:00, 19 November 2011 (UTC)


 * Shouldn't it be the other way around? Since all D's sum up to 1, ε can be between 0 and 1. If it is 0, then h_t classified all samples correctly, and | 0.5 - ε | would become 0.5. If ε is 1, then h_t classified all samples wrong, which is good, because we can add the inverse of h_t to our final classifier. In this case, | 0.5 - ε | would also yield 0.5. So, the highest value for | 0.5 - ε | is 0.5 and the lowest is obviously 0. We want to stop the algorithm, if it finds a good classifier, thus | 0.5 - ε | close to 0.5. Therefore the termination criterion should be | 0.5 - ε | >= beta. — Preceding unsigned comment added by 137.193.76.44 (talk) 12:03, 11 February 2013 (UTC)


 * Shouldn't the update of D reflect this as well? Currently it will reduce the weight of wrongly classified samples if ε > 0.5. — Preceding unsigned comment added by 109.226.48.167 (talk) 22:21, 26 June 2013 (UTC)
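For what it's worth, the label-flipping argument in this thread is easy to verify numerically. A small Python sketch (all names are mine) with a classifier whose weighted error is far above 0.5:

```python
import random

random.seed(0)
y = [random.choice([-1, 1]) for _ in range(100)]        # true labels
h = [yi if i < 10 else -yi for i, yi in enumerate(y)]   # right on only 10 of 100
D = [1 / 100] * 100                                     # uniform sample weights

# Weighted error of h, and of the flipped classifier h_1(x) = -h(x):
eps = sum(d for d, hi, yi in zip(D, h, y) if hi != yi)
eps_flipped = sum(d for d, hi, yi in zip(D, h, y) if -hi != yi)

# A learner with eps > 0.5 is as informative as its flip with error 1 - eps,
# which is why only eps == 0.5 (random guessing) is truly useless.
assert eps > 0.5
assert abs(eps_flipped - (1 - eps)) < 1e-12
assert eps_flipped < 0.5
```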

Definition of H
What is the set H in ... $$\underset{h_{t} \in \mathcal{H}}{\operatorname{argmin}}$$ ...? I can't find its definition. And what is the initial definition of $$D_{t}(x)$$ for $$t \not = 1$$? They are read at the beginning of the loop without having been written beforehand! Anyway, it's a better description than any other one on the world wide web. — Preceding unsigned comment added by 217.254.8.134 (talk) 17:41, 11 June 2011 (UTC)

$$D_{t}(x)$$ for t > 1 is specified by the last line of pseudocode. -- X7q (talk)
 * ℋ refers to the family of weak classifiers, or "hypotheses". $$\textstyle \arg \min_{h \in \mathcal{H}}$$ just tells you to train a weak classifier that minimizes the specified weighted classification error. In some cases, such as with decision stumps, the set of possible classifiers is finite and small (for stumps: number of features × number of values a feature takes in the training set), so you can simply check them all. In other cases it might not be possible to find the exact minimum (IIRC, it's NP-hard even for linear classifiers), so you settle for an approximate minimum: run another learning method which learns functions from your family of weak classifiers.


 * Thanks, now it is clearer. I first thought you specified those weak classifiers before entering the loop, since $$T$$ weak classifiers are given. Then the algorithm would iterate through all $$t \in T$$ and find for every $$t$$ the error-minimizing $$h_{t}$$ out of those $$T$$ weak classifiers, but then you would need $$D_t$$ for all $$t$$. — Preceding unsigned comment added by 93.209.163.155 (talk) 08:05, 12 June 2011 (UTC)


 * That was definitely an error in the article. Weak classifiers are not fixed beforehand, but selected inside the algorithm's loop. The number of iterations T is usually fixed in advance (it could also be chosen adaptively, but that's not in the classic descriptions of the algorithm). Thanks for reporting this; I've fixed it in the article. -- X7q (talk) 19:36, 12 June 2011 (UTC)
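To illustrate the exhaustive search over decision stumps mentioned above, here is a hypothetical Python sketch (the function and variable names are mine, not from the article): it tries every feature, every threshold taken from the training values, and both sign orientations, and keeps the stump with the smallest weighted error.

```python
def best_stump(X, y, D):
    """Exhaustively search decision stumps h(x) = s if x[f] > theta else -s,
    returning the one minimizing the weighted error sum_i D[i]*[h(x_i) != y_i]."""
    best_err, best_params = float("inf"), None
    n_features = len(X[0])
    for f in range(n_features):
        for theta in sorted({row[f] for row in X}):   # candidate thresholds
            for s in (-1, 1):                          # both orientations
                err = sum(D[i] for i, row in enumerate(X)
                          if (s if row[f] > theta else -s) != y[i])
                if err < best_err:
                    best_err, best_params = err, (f, theta, s)
    return best_err, best_params

# Tiny 1-feature example: the data is separable at x > 1, so the search
# finds a stump with zero weighted error.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
D = [0.25] * 4
err, (f, theta, s) = best_stump(X, y, D)
assert err == 0.0 and (f, theta, s) == (0, 1.0, 1)
```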

Minimisation
$$h_{t} = \underset{h \in \mathcal{H}}{\operatorname{argmin}} \; \epsilon_{t}$$, where $$ \epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)[y_i \ne h(x_{i})]$$. Wouldn't this make it clearer? (see the French version of this article) --Madvermis (talk) 18:13, 8 August 2011 (UTC)

Possibly broken link under "Implementations"
http://www.inf.fu-berlin.de/inst/ag-ki/adaboost4.pdf (AdaBoost and the Super Bowl of Classifiers - A Tutorial on AdaBoost.) — Preceding unsigned comment added by 89.134.90.209 (talk) 08:49, 8 October 2012 (UTC)

Possible errors in Discrete AdaBoost algorithm
Acorrada (talk) 18:52, 11 April 2014 (UTC)
 * The $$\epsilon_t$$ function assigns negative weights to correct predictions and positive weights to errors. Thus a learner that makes an equal number of errors and correct predictions gets a value of zero, the same as a learner that makes no errors at all.
 * The $$\alpha_t$$ parameter as a function of $$\epsilon_t$$ must be wrong: it assigns a weight of zero to the learner that has zero error.


 * $$\epsilon_t = 0$$ does not mean zero errors; it means that the total weight on the errors equals the total weight on the correctly predicted samples. $$\epsilon_t = -1$$ means no errors.


 * A weak learner with $$\epsilon_t$$ close to zero is essentially no better than random guessing -- a learner with positive $$\epsilon_t$$ is still more useful than one with $$\epsilon_t = 0$$, because you can reverse its outputs to turn it into a learner with negative $$\epsilon_t$$. Bgorven (talk) 12:30, 12 April 2014 (UTC)

Thank you. Your explanation clarifies both examples I made. But does it not seem non-standard to use $$\epsilon = -1$$? After all, we speak of making "no errors" when a classifier makes zero errors; talking about "-1 errors" is counterintuitive. Would it not help the pedagogical value of this article, then, to redefine the error function so that zero errors corresponds to $$\epsilon = 0$$ rather than $$\epsilon = -1$$? Acorrada (talk) 10:54, 14 April 2014 (UTC)
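If it helps the discussion, here is one way to make the two conventions concrete: a Python sketch under the assumption that the signed error in this version of the article is $$\epsilon_t = -\sum_i D_t(i)\, y_i\, h(x_i)$$ and that the weights sum to 1 (the function names are mine).

```python
def eps_fraction(D, y, h):
    """Conventional weighted error: total weight on mistakes, in [0, 1]."""
    return sum(d for d, yi, hi in zip(D, y, h) if hi != yi)

def eps_signed(D, y, h):
    """Signed error -sum_i D[i]*y[i]*h[i], in [-1, 1] when the D[i] sum to 1:
    -1 means no errors, 0 means no better than random, +1 means every sample wrong."""
    return -sum(d * yi * hi for d, yi, hi in zip(D, y, h))

y = [-1, 1, 1, -1]
h = [-1, 1, -1, -1]      # one mistake (third sample)
D = [0.25] * 4

# The two conventions are related by eps_signed = 2*eps_fraction - 1,
# so "no errors" is eps_fraction = 0 but eps_signed = -1.
assert eps_fraction(D, y, h) == 0.25
assert abs(eps_signed(D, y, h) - (2 * eps_fraction(D, y, h) - 1)) < 1e-12
```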

Introduction
The introduction mentions that "as long as the performance of each one is slightly better than random guessing (i.e., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner." However, the model can converge even if the performance of a weak learner is worse than random guessing: in that case α becomes negative, and the weights for the subsequent learner are updated in the opposite direction. Hence, the intro should state that the final model can converge as long as the performance of each weak learner is not exactly random (i.e., their error rate differs from 0.5 for binary classification). — Preceding unsigned comment added by Markthat (talk • contribs) 15:20, 4 May 2015 (UTC)
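A quick numeric check of this point, assuming the usual learner weight α = ½·ln((1-ε)/ε) (a Python sketch, not taken from the article): a learner worse than random gets a negative α, so its contribution α·h(x) to the ensemble equals that of the flipped learner -h with the positive weight α(1-ε); only ε = 0.5 yields a useless zero weight.

```python
import math

def alpha(eps):
    """AdaBoost learner weight for weighted error rate eps in (0, 1)."""
    return 0.5 * math.log((1 - eps) / eps)

eps = 0.8                             # worse than random guessing
assert alpha(eps) < 0                 # negative vote: the learner is "flipped"
assert abs(alpha(eps) + alpha(1 - eps)) < 1e-12   # alpha(eps) = -alpha(1 - eps)
assert alpha(0.5) == 0.0              # exactly random: zero weight, no information
```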