Talk:Information gain (decision tree)

Untitled
Drawbacks: if we don't want the credit card number to show up in the decision tree, we would simply not include it in the input attributes. Thus I think this is not a good example. Nulli 08:33, 13 March 2006 (UTC)


 * I don't see why a credit card number would be used to "describe" a customer in the first place; it is useful for identifying customers but would be of no use in describing them. If we were building a decision tree from a long list of customers and their attributes, I think including credit card numbers would be analogous to including the list ID number, which would obviously be stupid. I think maybe this is what the author of that example is trying to say, i.e. care must be taken as to which attributes to include.


 * But I agree it is not a good example at all. There are also several other disadvantages of decision trees which are not included. I'll try to improve the article. Canderra 20:54, 21 May 2006 (UTC)
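
To make the point of this thread concrete, here is a minimal sketch (not taken from the article) of why an identifier-like attribute is a problem for plain information gain. The customer table, the attribute names `card`, `member` and `buys`, and the helper functions are all made up for illustration. Because every card number is unique, each branch contains a single customer, so the split looks perfectly informative even though it cannot generalize.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """H(target) minus the size-weighted entropy of target within each value of attr."""
    total = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return total - remainder


# Hypothetical toy data: "card" is unique per customer, "member" is a real attribute.
customers = [
    {"card": "c1", "member": "yes", "buys": "yes"},
    {"card": "c2", "member": "yes", "buys": "yes"},
    {"card": "c3", "member": "no", "buys": "no"},
    {"card": "c4", "member": "yes", "buys": "no"},
]

ig_card = information_gain(customers, "card", "buys")
ig_member = information_gain(customers, "member", "buys")
print(f"IG(card)   = {ig_card:.3f}")
print(f"IG(member) = {ig_member:.3f}")
```

In this toy table the unique identifier attains the largest possible gain (the full entropy of `buys`), which is the behaviour the credit card example seems to be warning about.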

Information gain, mutual information, KL divergence
I'm confused about this. Information gain and relative entropy/KL divergence are not the same thing, assuming the common version of information gain used in decision trees. Information gain is mutual information, which is a special case of KL divergence. Both this page as well as the KL divergence page appear to make this mistake -- is there a reason for this, or should I fix it? nparikh 21:57, 21 October 2006 (UTC)


 * Historically, the term Information Gain was introduced by Renyi, as a more intuitive synonym for KL divergence. Information gain can be used in connection with any conditioning step that causes you to move from a distribution Q to a better distribution P.  If the conditioning happens to be based on learning the value of a particular variable, then as you say the Information Gain is equal to the mutual information.  But the term Information Gain is not restricted to this case.  Jheald 10:14, 23 October 2006 (UTC)


 * Then this page (and perhaps the entire machine learning community) appear to be using the term incorrectly, and furthermore this page is internally inconsistent. The definition given in the section of this page labeled "Formal definition" defines the term to mean the specific case in which the conditioning happens to be based on learning the value of a particular variable. However, the definition given at the top of the page defines it to be synonymous with Kullback–Leibler divergence. The definition given in the section of this page labeled "Formal definition" is similar to mutual information, which makes information gain a function of random variables; the definition of it as a synonym of Kullback–Leibler divergence makes information gain a function of probability distributions. These cannot both be right. Therefore this page is internally inconsistent. Since the large machine learning community seems to be using the term differently from the way it was originally defined, I recommend keeping both definitions on the page, with citations to external sources, and a clear note to the effect that different communities use the term in different incompatible ways. Bayle Shanks (talk) 06:18, 21 August 2010 (UTC)


 * Certainty has a probability distribution too -- it's just a very sharp spike. The point I was making above is that the way IG is used on this page is compatible with the more general understanding of the term as a synonym for KL divergence.  Jheald (talk) 11:16, 21 August 2010 (UTC)


 * I'm interested to know more about the second paragraph that states that In particular, the information gain... is the Kullback-Leibler divergence under specific conditions. Are there any proofs or references that can be cited to obtain this? Cortisa (talk) 22:14, 16 November 2012 (CET)


 * See Blachman (1968), "The amount of information that y gives about X". He shows that KL divergence and entropy difference are not the same in general. The Wikipedia page is in error for suggesting that they are equivalent. I'm not sure how to proceed, since there seems to be some disagreement in the research community about what the term information gain really means. Canjo (talk) 02:41, 25 November 2019 (UTC)
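
A small numerical sketch may help with the disagreement in this thread. It is only an illustration under assumed numbers: the joint distribution below, the labels `pos`/`neg` and `a`/`b`, and the helper names are all hypothetical. For each single attribute value y it compares the entropy drop H(X) - H(X|Y=y) with the KL divergence from the prior p(x) to the posterior p(x|y), and then averages both over y.

```python
from math import log2

# Hypothetical joint distribution p(x, y): x is the class, y is the attribute value.
joint = {
    ("pos", "a"): 0.5, ("neg", "a"): 0.1,
    ("pos", "b"): 0.1, ("neg", "b"): 0.3,
}


def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def kl(p, q):
    """KL divergence D(p || q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * log2(p[k] / q[k]) for k in p if p[k] > 0)


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)


def posterior(y):
    """p(x | Y = y)."""
    return {x: joint[(x, y)] / p_y[y] for x in p_x}


# Per-value quantities: these need not agree with each other.
entropy_drop = {y: entropy(p_x) - entropy(posterior(y)) for y in p_y}
kl_gain = {y: kl(posterior(y), p_x) for y in p_y}

# Averages over y: both should equal the mutual information I(X; Y).
expected_entropy_drop = sum(p_y[y] * entropy_drop[y] for y in p_y)
expected_kl = sum(p_y[y] * kl_gain[y] for y in p_y)

for y in p_y:
    print(y, round(entropy_drop[y], 4), round(kl_gain[y], 4))
print(round(expected_entropy_drop, 4), round(expected_kl, 4))
```

With a non-uniform prior such as this one, the two per-value quantities differ, while their averages over y coincide. That is consistent both with the remark that KL divergence and entropy difference are not the same in general and with the remark that the decision-tree quantity (mutual information) is a special, averaged case of KL divergence.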

Definition
$$\{x\in Ex \wedge value(x,a)=v\} $$ and $$\{x\in Ex|value(x,a)=v\} $$ describe the same set, don't they? Then they should be written identically; otherwise it might confuse people. 84.57.82.107 08:49, 5 April 2007 (UTC)

Bullet
What is the definition of the bullet? Is it the standard multiplication symbol, like the asterisk? (EasyWalker) —Preceding unsigned comment added by 80.99.49.64 (talk) 21:11, 13 January 2009 (UTC)

Notation
Please change the notation for "examples" (Ex) to something different. The current notation is confusing because it looks like the expectation of x. —Preceding unsigned comment added by 213.155.151.233 (talk) 14:48, 22 January 2009 (UTC)

General and formal definitions disagree
First, the notation is a little unclear. The second equation for IG: $$ IG(T,a) = H(T) - \sum_{v \in vals(a)} \frac{| \lbrace \mathbf{x} \in T | x_a = v \rbrace |}{| T |} \ldots $$ overloads $$\mathbf{x}$$, as $$\mathbf{x}$$ is already defined above as the attributes which make up $$T$$. $$T$$ is composed of attributes and an output value, $$y$$.

Also, one only sums over all $$ v \in support(a) $$ when branching along all possible values for that attribute. It may turn out that $$ v \in \lbrace 1, 2, 3 \rbrace $$, but a branch occurs between $$a = 1$$ and $$ a \neq 1 $$. Thus I recommend making it clear that $$ v \in branches(a) $$. This produces the equation:

The general and formal definitions appear to disagree: $$ H(T) - H(T| x_a=v) = - \left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{t \in T} p(t|x_a=v) \log(p(t|x_a=v)) \right) $$ Approximating $$p(t|x_a=v) \approx \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |}$$ gives: $$ H(T) - H(T| x_a=v) = - \left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{t \in T} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \log\left( \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \right) \right) $$ Furthermore, summing over all branches produces: $$ H(T) - \left( \sum_{v \in branches(a)} H(T|x_a = v) \right) = - \left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{v \in branches(a)} \left( \sum_{t \in T} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \log\left( \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \right) \right) \right) $$ which still is not the equation given in the formal definition, but feels correct to me.

While I am confused about what the correct form actually is, I see how to recover what is currently in the formal definition if we normalize (I am not even sure if it is correct to normalize) by the number of instances (aka samples) in each branch: $$ H(T) - \left( \sum_{v \in branches(a)} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} H(T|x_a = v) \right) = -\left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{v \in branches(a)} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \left( \sum_{t \in T} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \log\left( \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \right) \right) \right) $$ $$ = - \left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{v \in branches(a)} \left( \sum_{t \in T} \left(\frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |}\right)^2 \log\left( \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \right) \right) \right) $$ However, I repeat that normalizing here feels artificial. The information gain ratio normalizes another way (actually by dividing by something similar to the second term).

So I wonder: does $$ IG(T,a) = H(T) - \left( \sum_{v \in branches(a)} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} H(T|x_a = v) \right)$$ or does $$ IG(T,a) = H(T) - \left( \sum_{v \in branches(a)} H(T|x_a = v) \right)$$?

The second one looks correct to me, so I suggest the formal equation should use $$ IG(T,a) = H(T) - \left( \sum_{v \in branches(a)} H(T|x_a = v) \right) $$ that is, $$ IG(T,a) = - \left(\sum_{t \in T} p(t) \log(p(t))\right) + \left( \sum_{v \in branches(a)} \left( \sum_{t \in T} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \log\left( \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} \right) \right) \right) $$

Mouse7mouse9 23:47, 20 November 2014 (UTC) Edit: forgot to sign. Edit 2: minor technical correction in grammar: T is not a set of attributes and an output, but rather is composed of (attribute, output) tuples.


 * Mouse7mouse9 00:12, 21 November 2014 (UTC), here. I think I was an idiot. The more I look at it, the more correct the additional normalization term appears to be. It means the information gain per branch is weighted by the number of instances in each branch, which actually makes a lot of sense. I still think the definition should be much more explicit and use $$ IG(T,a) = H(T) - \left( \sum_{v \in branches(a)} \frac{| \lbrace t \in T | x_a = v \rbrace |}{| T |} H(T|x_a = v) \right)$$ Would it be too much to add in a cleaned up version of the above to derive the form in the formal definition?
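
As a quick sanity check on the weighted versus unweighted forms discussed in this section, here is a minimal sketch. It is only an illustration: the data set `T`, the function names, and the optional `branch` argument (which maps an attribute value to the branch it falls into, so that a binary split such as a = 1 versus a ≠ 1 can be expressed) are all assumptions made for the example, not anything from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of output labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split(T, a, branch=lambda v: v):
    """Group the outputs y of the (attributes, output) tuples in T by branch of attribute a."""
    groups = {}
    for x, y in T:
        groups.setdefault(branch(x[a]), []).append(y)
    return groups


def ig_weighted(T, a, branch=lambda v: v):
    """H(T) minus the branch entropies, each weighted by its share of the instances."""
    groups = split(T, a, branch)
    labels = [y for _, y in T]
    return entropy(labels) - sum(len(g) / len(T) * entropy(g) for g in groups.values())


def ig_unweighted(T, a, branch=lambda v: v):
    """H(T) minus the plain sum of branch entropies (the form without normalization)."""
    groups = split(T, a, branch)
    labels = [y for _, y in T]
    return entropy(labels) - sum(entropy(g) for g in groups.values())


# Hypothetical data: attribute "a" carries no information about the output.
T = [
    ({"a": 1}, "yes"), ({"a": 1}, "no"),
    ({"a": 2}, "yes"), ({"a": 2}, "no"),
    ({"a": 3}, "yes"), ({"a": 3}, "no"),
]

weighted_multiway = ig_weighted(T, "a")
unweighted_multiway = ig_unweighted(T, "a")
weighted_binary = ig_weighted(T, "a", branch=lambda v: v == 1)
print(weighted_multiway, unweighted_multiway, weighted_binary)
```

On this uninformative attribute the weighted form gives zero gain for both the three-way and the binary split, while the unweighted sum goes negative and gets worse as the number of branches grows. That seems to support the weighted form as the sensible definition, as concluded in the reply above.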