Talk:TD-Gammon

Incremental updates of the evaluation function
Step 2 of TD-Gammon is wrong. It didn't have to wait until the end of each game. That's one of the major advantages over Montecarlo-estimation. See http://www.cse.iitm.ac.in/~cs670/book/node108.html for "Tesauro applied the nonlinear TD rule (11.1) fully incrementally, that is, after each individual move." —84.168.22.218, 20:51, June 12, 2010


 * Are you sure that contradicts the article? If the article is wrong, by all means fix it, but I understand the incremental update after each move to occur in relation to the known final board position of the game. Applying the TD rule "after each move" means looking at each board position of a completed game, and tweaking the evaluation function so it judges that board position to be closer in value to the following one. You can't do that until you have a board position with known value, and that's the final board position. To put that another way, what training value is there to back-propagate from if you don't know who won the game? —Ben Kovitz (talk) 07:19, 13 June 2010 (UTC)


 * As described in figure 8.1 from http://www.cse.iitm.ac.in/~cs670/book/node87.html TD(lambda) does not take a miracle look into the future and looks at what the final result is. http://en.wikipedia.org/wiki/Temporal-difference_learning already states for general TD: predict the future (preferable very exactly) and change your "past" model accordingly.


 * Tesauro himself explains: At the end of each game, a final reward signal z (containing four components as described previously) is given, based on the outcome of the game. Once again equation 1 is used to change the weights, except that the difference (z − Yf) is used instead of (Yt+1 − Yt). —Preceding unsigned comment added by 84.168.23.163 (talk) 17:22, 14 June 2010 (UTC)


 * Let's make sure we understand what's being said. First, the current article does not speak of a miracle look into the future. It says that the algorithm is given a whole game transcript and uses the final board position to prime back-propagation of the weights in the neural net (working backward, starting at the final board position).


 * Regarding correcting that, when I read the sources, the question in my mind is, where does the "reward signal" come from? In terms of figure 8.1, where does r come from? I've found the writing on the learning algorithm to be pretty unclear. The current Wikipedia article may well be wrong, but at least it's clear! Now, are you saying that the learning algorithm actually works forward, and on each turn, updates the weights in the neural net so its evaluations of previous board positions are closer to its evaluation of the current position? The only exception would be the final board position, which gets a special evaluation that is independent of the weights in the neural net (that is, "who actually won").


 * The latter interpretation agrees with my understanding of temporal-difference learning in general (though I am no expert). After re-reading those four paragraphs in the Tesauro article several times, I'm now favoring thetter interpretation for TD-Gammon, too (even though it's weird because the algorithm spends most of its time adjusting the weights without any input that's known to be correct). If you also think that's right, I'd say go ahead and edit the page! I can do it if you'd rather not. —Ben Kovitz (talk) 00:21, 17 June 2010 (UTC)


 * I finally rewrote the description of the learning algorithm. Hopefully it's now both clear and accurate. Actually, I know it's not entirely clear, because $$\nabla_w Y_k$$ isn't really explained. I basically took Tesauro's wording, which is not clear: the paper does not explain what the board evaluation at turn $$k$$ has to do with the gradient of network weights. —Ben Kovitz (talk) 17:39, 3 November 2013 (UTC)


 * The gradient of a multiple-variable function is a vector whose components are the partial derivatives of that function. I think both the wikipedia page and the article at are pretty clear: $$\nabla_w Y_k$$ is the gradient of the function calculated by the network (the output of the network) with respect to the weights in the network. That function is noted Yk, and represents the possible outcome of game k (assumedly by giving that outcome a probability, though this isn't stated explicitly in the article). Note that the article states the w in the formula is the vector whose components are all the weights in the network, whereas the wiki page wording seems to imply that w is a single weight in the network (but then the formula wouldn't make sense, because the left-hand side would be a number and the right-hand side would be a vector).


 * Lastly, the article is stating that there are four outputs from the network, not just one. So the notation $$Y_k$$ is slightly misleading, it probably should be something like $$Y_k^i$$ with i in range 1..4, and the formula should sum over i, according to the sentence "In cases where there are multiple output units, the total weight change is given by the sum of the weight changes due to each individual output unit." (the article is apparently explaining about both TD-Learning in general and TD-Gammon specifically, hence the confusion). The complete formula would then be :
 * $$w_{t+1} - w_t = \sum_{i=1}^{4} (\alpha(Y_{t+1}^{i} - Y_t^{i})\sum_{k=1}^{t}\lambda^{t-k} \nabla_w Y_k^{i})$$ —92.140.115.2 (talk) 17:00, 28 April 2016 (UTC)

Inaccuracies
The scholarpedia article by Tesauro suggests there are some inaccuracies in this article; specifically, the program was developed beyond its state in 1992, which appears to be what is described here, and incorporated features not described (a 3-ply search depth in some later versions). Also Tesauro appears to have concluded that the hand-crafted features to the input were not necessary; the 40-hidden-node network he had used before adding them appears to have been too small, and he did not train the system for long enough to see its potential before moving on to including them. 212.159.69.4 (talk) 17:20, 16 October 2011 (UTC)


 * Fantastic! I'd say that Scholarpedia meets our criteria for a reliable source, so please don't hesitate to add the new information to the page. Also, I see that I never got around to making the correction discussed above, if you'd like to do that, too. —Ben Kovitz (talk) 18:59, 16 October 2011 (UTC)


 * Uh oh, I just checked Scholarpedia and Tesauro's article appears to be gone. I couldn't find any article by Tesauro there. Can you suggest another place to look for the information you mentioned above, especially the point about hand-crafted features not being necessary? —Ben Kovitz (talk) 18:36, 3 November 2013 (UTC)


 * I just found this: http://www.scholarpedia.org/article/User:Gerald_Tesauro/Proposed/Td-gammon. The page history says that the article "expired and reverted to proposed status". It looks like excellent info, but I don't think we can cite it until Scholarpedia makes it official again. I just asked about getting it re-approved. —Ben Kovitz (talk) 00:24, 4 November 2013 (UTC)

Advances in backgammon theory section overstated?
I was playing tournament backgammon in the 90's, and I find the claims of the effect of TD-gammon on how experts played the opening rolls to be greatly overstated. The splitting play was the recommended play at least as far back as Magriel's book from 1976. Though fashions changed over time, my memory from the pre-TD-gammon 90s is that splitting play was generally already favored over slotting at most match scores. I'll search my backgammon library and see if I can find some citations for this. I also think the claim that everyone started playing the splitting move after TD-gammon is overstated. See for example http://www.bkgm.com/rgb/rgb.cgi?view+179, where world-class player Kit Woolsey says in 1995, after looking at rollouts from TD-gammon and other programs, that the slotting and splitting plays with 2-1 are equal.

Normally when I see claims I don't believe in WIkipedia, I add a "citation needed", but I'm not quite sure what to do when the incorrect claims in the article are supported by a citation from a reputable source that is nonetheless inaccurate. — Preceding unsigned comment added by Andylatto (talk • contribs) 15:55, 18 November 2011 (UTC)


 * Thanks for spotting this. Normally, each Wikipedia article should only summarize sources that are specifically about the article's topic. You've certainly persuaded me that the source specifically about TD-Gammon overstates TD-Gammon's influence on tournament play. The topic-specific source isn't reliable about this particular fact. In lieu of an authoritative book or article about computer backgammon's influence on tournament play and accepted theory, here are a couple options:
 * (1) Omit saying anything at all about TD-Gammon's influence on tournament play and backgammon theory.
 * (2) Cite the Usenet posting by Kit Woolsey.
 * My feeling right now is that while citing the Usenet posting violates WP:SPS, that source is a very nice find and is reliable enough. Option (2) gives a small risk of a Type I error and option (1) gives certainty of a Type II error. Your thoughts? If you agree, please feel free to just dive in and add the source to the article. —Ben Kovitz (talk) 15:23, 1 November 2013 (UTC)
 * I don't think citing the Woolsey Usenet Post violates WP:SPS; It satisfies all the criteria in the "Self-published or questionable sources as sources on themselves". I'm not citing it as evidence of what expert players in general thought in 1995; I'm citing it as evidence of what Kit Woolsey thought was the best play in 1995, and  this is not a self-serving or exceptional claim. That Kit Woolsey is a backgammon expert is established independently in the Wikipedia article about him. 96.237.64.191 (talk) 15:12, 6 November 2013 (UTC)