Talk:Forward–backward algorithm

---

Untitled
The O_1 matrix listed in the Forward Algorithm section is inconsistent with the one presented in the Example section, even though both the transition matrix (T) and the emission matrix (B) are the same between the two sections. I believe the O_1 matrix listed in the Example section is correct. This would be much clearer if there were more hidden states than emissions or vice versa, since the matrix dimensions would then indicate whether states are being conditioned on emissions or emissions on states. No edits made. --- — Preceding unsigned comment added by 129.67.44.219 (talk) 16:44, 26 March 2013 (UTC)

The description is very hard to follow. It is incomplete, being based upon and making reference to a book that is not available online. The one-dimensional vectors in the matrix formulas are written incorrectly (horizontal instead of vertical). IMO, this article needs a fundamental rewrite; it is extremely frustrating to read as it is.

—Preceding unsigned comment added by 82.181.190.197 (talk) 17:38, 22 October 2008 (UTC) --- —Preceding unsigned comment added by Mihai preda (talk • contribs) 16:22, 15 July 2008 (UTC)

--- I am not comfortable making the edits, but I believe there are major errors here.

With respect to the backward messages b, it is stated:

Finally note that the description in Russell & Norvig 2003 pp. 550 excludes the point product, though the procedure is required earlier.

I don't believe this is correct. It should be as in the book:

$$\mathbf{b}_{k+1:t} = \mathbf{T}\,\mathbf{O}_{k+1}\,\mathbf{b}_{k+2:t}$$

with no point product with f and no normalization. This is only the backward message; it implements the update shown on p. 545, eq. (15.7).

You multiply the b with the corresponding f in each time slice to obtain the smoothed vectors sv, as shown on p. 544, eq. (15.6). So as written, there are two multiplications by f taking place! Eq. (15.7) shows how the RHS of (15.6) breaks down, NOT the LHS. (BB)
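For concreteness, the distinction can be sketched in NumPy (all names here are illustrative, not from the article: `T` is the transition matrix and `O[k]` the diagonal observation matrix for the evidence at time k, following the book's conventions as quoted above):

```python
import numpy as np

def backward(T, O, t):
    """Pure backward messages, eq. (15.7): b_{k+1:t} = T @ O[k+1] @ b_{k+2:t}.
    No pointwise product with the forward message, no normalization."""
    N = T.shape[0]
    b = [None] * (t + 1)
    b[t] = np.ones(N)                 # base case: b_{t+1:t} is all ones
    for k in range(t - 1, 0, -1):
        b[k] = T @ O[k + 1] @ b[k + 1]
    return b

def smoothed(f, b):
    """Eq. (15.6): the single pointwise product with f happens here."""
    sv = f * b
    return sv / sv.sum()
```

The point being made above is that f enters only in `smoothed`, once per time slice, not inside the backward recursion itself.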

---

I made the modification (but I did not recompute the final result in the example). Indeed, it was not correct. Mayerwin (from Wikipedia FR).

---

Thanks for the sanity check Mayerwin.

If someone cares to do the formatting, this is what my implementation yields for this scenario:

i: 1 fv: {s:Rain=0.8181818181818181, s:No Rain=0.18181818181818182}

i: 2 fv: {s:Rain=0.883357041251778, s:No Rain=0.1166429587482219}

i: 3 fv: {s:Rain=0.19066793972352525, s:No Rain=0.8093320602764746}

i: 4 fv: {s:Rain=0.7307940045849821, s:No Rain=0.2692059954150179}

i: 5 fv: {s:Rain=0.8673388895754848, s:No Rain=0.13266111042451526}

i: 5 B: {s:Rain=1.0, s:No Rain=1.0}

i: 4 B: {s:Rain=0.69, s:No Rain=0.41000000000000003}

i: 3 B: {s:Rain=0.4593, s:No Rain=0.2437}

i: 2 B: {s:Rain=0.090639, s:No Rain=0.15025100000000002}

i: 1 B: {s:Rain=0.06611763, s:No Rain=0.04550767}

i: 5 sv: {s:Rain=0.8673388895754848, s:No Rain=0.13266111042451526}

i: 4 sv: {s:Rain=0.8204190536236754, s:No Rain=0.17958094637632463}

i: 3 sv: {s:Rain=0.3074835760066177, s:No Rain=0.6925164239933823}

i: 2 sv: {s:Rain=0.8204190536236753, s:No Rain=0.17958094637632466}

i: 1 sv: {s:Rain=0.8673388895754848, s:No Rain=0.13266111042451528}

The interpretation of these smoothed vectors depends on the use case, but for this example one conclusion we could draw is that the best explanation of {umbrella, umbrella, no umbrella, umbrella, umbrella} seems to be {Rain, Rain, No Rain, Rain, Rain} (the Viterbi algorithm would output this explicitly). (BB)
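For anyone who wants to reproduce the listing above, here is a small NumPy sketch of the computation. It assumes the standard umbrella model from Russell & Norvig (two states Rain/No Rain, P(Rain_t | Rain_{t-1}) = 0.7, P(umbrella | Rain) = 0.9, P(umbrella | No Rain) = 0.2, uniform prior); the function and variable names are mine:

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

def O(umbrella):
    """Diagonal observation matrix for one time slice."""
    return np.diag([0.9, 0.2]) if umbrella else np.diag([0.1, 0.8])

def forward_backward(evidence, prior):
    t = len(evidence)
    # forward (filtered) vectors fv: f proportional to O_k @ T^T @ f,
    # normalized at each step
    f = [prior]
    for e in evidence:
        v = O(e) @ T.T @ f[-1]
        f.append(v / v.sum())
    # backward messages B: b_{k+1:t} = T @ O_{k+1} @ b_{k+2:t}, unnormalized
    b = [np.ones(2)]
    for e in reversed(evidence[1:]):
        b.append(T @ O(e) @ b[-1])
    b.reverse()
    # smoothed vectors sv: sv_k proportional to f_{1:k} * b_{k+1:t}
    sv = []
    for k in range(t):
        s = f[k + 1] * b[k]
        sv.append(s / s.sum())
    return f[1:], b, sv

fv, B, sv = forward_backward([True, True, False, True, True],
                             np.array([0.5, 0.5]))
```

Running it reproduces the fv, B, and sv values listed above, up to floating-point rounding.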

---

The word 'umbrella' appears suddenly in the middle of the example. What is this t=1 'umbrella'? --- —Preceding unsigned comment added by 134.174.140.104 (talk) 14:58, 3 July 2008 (UTC) I am afraid this isn't complete? The forward-backward approach exists to avoid the time complexity of the brute-force one, but the method really isn't elaborated here. --huggie

Also, the time complexity formula should omit any constant factors (i.e., the 2). The forward-backward algorithm itself just exploits the Markov property to reduce the time complexity to something like $$ O(N \cdot T) $$ where $$N$$ is the number of symbols in the alphabet and $$T$$ is the length of the sequence. Some rough pseudo-code is below: The initial call can either be done by creating a loop and calling with all initial probabilities, or by creating a dummy start state with transition probabilities equal to the initial probabilities and calling using that. Mskinn99 (talk) 21:04, 13 January 2008 (UTC)
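Since the rough pseudo-code referred to in this comment is not shown here, one possible plain-Python rendering of the forward pass being described (the names `A`, `emit`, and `pi` are mine; the dummy-start-state variant would instead fold `pi` into an extra row of `A`) is:

```python
def forward(A, emit, pi, observations):
    """Plain forward pass: A is the N x N transition matrix, emit[s][o]
    the emission probability, pi the initial distribution."""
    N = len(pi)
    # the "initial call" done as a loop over all initial probabilities
    alpha = [pi[s] * emit[s][observations[0]] for s in range(N)]
    for obs in observations[1:]:
        # the Markov property: only the previous alpha vector is needed
        alpha = [emit[j][obs] * sum(alpha[i] * A[i][j] for i in range(N))
                 for j in range(N)]
    return alpha  # alpha[j] = P(o_1, ..., o_t, state_t = j)
```

Summing the returned vector gives the likelihood of the observation sequence.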

Viterbi algorithm has better pseudo-code, and a better description. The two algorithms are so similar (they can both be implemented in a single function) that it might be worth merging the articles. Mskinn99 (talk) 23:24, 13 January 2008 (UTC)

It seems that the pseudocode includes only the forward part of the algorithm. Can someone familiar with the algorithm provide suitable pseudocode? —Preceding unsigned comment added by 92.45.185.176 (talk) 20:55, 28 May 2010 (UTC)

Merge with Forward Algorithm
I am against the proposed merge of forward algorithm into this article, because the forward algorithm is an important algorithm in its own right; it is probably more frequently employed than the forward-backward algorithm. Since the forward algorithm is used as a subprocedure of the forward-backward algorithm (it is precisely step 1 in the description), I would propose the following: We make the description of the algorithm a template, which we then simply include in both articles. 131.159.0.7 (talk) 17:18, 21 July 2010 (UTC)

I am also against the proposed merge, for the same reasons. In addition, it's nice to see a short-n-sweet description of the Forward algorithm by itself. If the Forward-Backward article were appended, you would glance at how long the scroll bar is, flip through the long article, and feel daunted by the complexity. A user. —Preceding unsigned comment added by 15.203.233.76 (talk) 17:51, 19 November 2010 (UTC)


 * Removed the merge proposal since it's been around for a year, nobody defended it and I'm the third reader to disagree here. The forward algorithm is useful in its own right and deserves its own article. Qwertyus (talk) 09:46, 2 August 2011 (UTC)


 * Someone took a bold step and replaced "Forward algorithm" with a redirect (edit was 30 September 2011 at 19:34 by User:Max Libbrecht). I assume good faith, but having seen the discussion here, and no discussion anywhere else that supports the effective deletion, I'm going to revert the Forward algorithm page to being a separate item. I might also put a "not to be confused with..." --mcld (talk) 10:58, 2 May 2012 (UTC)

Page Cleanup
I extended the page with a brief overview and a semi-formal description of the algorithm, referenced some performance improvements, and included an example. I personally feel that the matrix-based description is the easiest to follow. Additionally, I felt that it is most helpful to understand the workings of this algorithm through a numerical example, so I included a (rather extensive) example based on one presented in the widely used AI textbook by Russell and Norvig. I also referenced a repository of source code where Java code implementing the algorithm may be found. I am generally new to editing at Wikipedia but tried to follow the guidelines I found to be relevant. If the content contains errors or is otherwise unsuitable, I certainly welcome feedback, critique and any edits, including deletions, deemed necessary :) BJJV (talk) 11:08, 26 May 2008 (UTC)

I just finished cleaning up the page a bit to try and make it easier to follow. Hopefully I didn't make things worse. I am new to the subject but described the problem as it is addressed in the first reference (Rabiner, 1989). I also modified the notation so that the forward probabilities use subscripts of the form: $$\mathbf{f_{0:t}}$$ (zero-based instead of one-based). This provides a consistent form that describes the problem and allows us to calculate the state vector at time t=0. I also used the "hat" to distinguish between the actual probabilities of an event sequence and the normalized values that describe the state vector. I hope the notation is OK, and I worry about my terminology (is "state vector" appropriate? despite my reading I can't find what this is actually called). Also, my opinion is that the matrix example is very helpful. I found it useful for checking my results when I developed an HMM program and think it should be left in. Bodhi.root (talk) —Preceding undated comment added 19:36, 15 September 2009 (UTC).


 * Sorry, had to modify things once more. The explanation I gave before was not technically correct.  The new version explains the underlying probabilities and how to interpret the forward, backward, and smoothed probabilities.  It is a little harder to follow, but it is now technically correct.  I also modified the example section to use the new notation.  The matrix examples should probably be reworked so that they use the formulas given earlier (i.e. scaling by the $$c_t$$'s), but I'm done for now and don't want to recalculate all that. Bodhi.root (talk)
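The scaled recursion mentioned here (normalizing by a coefficient $$c_t$$ at each step) can be sketched as follows. Names are mine, and conventions for $$c_t$$ vary between sources; here it is simply the normalizer at each step:

```python
import numpy as np

def scaled_forward(T, O_list, prior):
    """Forward pass that normalizes at every step; the probability of the
    full observation sequence is recovered as the product of the c_t's."""
    f_hat = np.asarray(prior, dtype=float)
    cs = []
    for O_t in O_list:            # O_t: diagonal observation matrix at time t
        v = O_t @ T.T @ f_hat     # unnormalized update
        c = v.sum()               # scaling coefficient c_t
        cs.append(c)
        f_hat = v / c             # normalized "hat" vector (the state vector)
    return f_hat, cs              # P(o_1..o_t) = np.prod(cs)
```

This keeps every intermediate vector a proper distribution, which avoids numerical underflow on long sequences.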

Explaining this algorithm with Matrices
It was a terrible idea to explain this algorithm in terms of matrices. Yes, it makes it so that you can elegantly describe the algorithm with a single formula, but it has many disadvantages, including:
 * 1) It makes the algorithm inaccessible to people who aren't comfortable with matrix math.
 * 2) It performs superfluous computations that would not be done in an efficient implementation.
 * 3) It hides context, and makes it easy to forget the meaning of columns and rows.
 * 4) The matrix math shown here is broken anyway. It shows row vectors where column vectors should be, and implies impossible calculations with matrices that aren't compatible for multiplication. When the smoothed values are computed, the symbol "x" is used, which implies cross product, but straight scalar multiplication is used.
If we *really must* use matrix math to explain this algorithm, it needs to be cleaned up a lot. Frankly, I think the traditional explanation that uses probabilities and intuition is much better suited to an encyclopedic article. I propose that we rip out the matrix math completely. Also, I agree very much with the earlier comment that this algorithm is so similar to the Viterbi algorithm that they should be explained together, and in a consistent manner. —Preceding unsigned comment added by 128.187.80.2 (talk) 20:33, 3 February 2009 (UTC)

Completely agreed (especially since the example is _still_ incorrect); unfortunately, this is how it is done in the parts of the Russell & Norvig book that the description follows. But yeah, a rewrite without matrices is needed. Fizban99 (talk) 13:42, 8 October 2009 (UTC)

Semi-agreed, semi-disagreed. I am taking a class on HMMs, and this matrix treatment of the algorithms was sorely needed. However, the switch from right multiplication in the description of the forward algorithm to left multiplication in the example is a serious mistake, especially since the necessary transposes are invisible in the example, the matrices being symmetric. What this page needs is an introductory section without the matrix notation and a cleanup of the matrix examples/explanation, but please don't jettison a useful resource. --Wjeck (talk) 13:54, 25 November 2009 (UTC)

I agree. The problem for me is one of clarity, not accessibility or computational efficiency. Matrix notation is meant to make some concepts clearer, and this is not really one of them. It should be fine to keep the matrix notation in another section, but the first description should be in more straightforward terms. But I tried looking for such an explanation on the web, and couldn't find it, so maybe the first step would be to link to one from this article. I didn't find Russell and Norvig's presentation very clear. A5 (talk) 23:07, 4 August 2010 (UTC)


 * Strongly agree. I work with matrix maths a lot, but it's obvious to me that the explanation given in this page is quite hard to understand. It seems to be oriented towards computational efficiency rather than clear exposition. The bit with the emission probabilities placed into a diagonal matrix in particular, really rams the point home that the layout is not so we can understand it! --mcld (talk) 11:05, 2 May 2012 (UTC)

I agree. There's an additional problem: the evidence sequence 00100 is a palindrome, which is extremely confusing for an algorithm that is supposed to have separate forward and backward steps. Furthermore, the transition matrix is symmetric, which also makes the backward step more unclear. Should you transpose it? Chrism2671 (talk) 12:45, 6 April 2022 (UTC)

Merge with BCJR algorithm
It has been proposed since 2009 to merge BCJR algorithm into this article. Any thoughts? --KarlB (talk) 22:18, 7 July 2012 (UTC)
 * The merge tag was only added and no discussion was initiated. I am therefore closing the merge request as improper and removing the tags. WTF? (talk) 18:18, 11 October 2012 (UTC)

ECC & BCJR
Unfortunate that there's no discussion, since the explanation of BCJR & Viterbi, when applied in the digital domain, is considerably easier to understand than the HMM version given here. Unfortunately, the BCJR article is currently a stub with no content in it :-( Anyway, just to make this article easier to understand, I think it's important to expand on these other, easier-to-explain, easier-to-comprehend algorithms. User:Linas (talk) 00:46, 9 November 2013 (UTC)

Consistent notation across HMM articles
The notation used on this page is inconsistent with the notation used on the HMM, forward algorithm, and Viterbi algorithm pages. Namely, in the latter cases $$x$$ represents the hidden states and $$y$$ the observations, whereas here $$o$$ represents the hidden states with $$x$$ being the observations. I know that generally there is no standard notational convention across the many HMM papers, but for this set of articles it would be much less confusing to stick with a consistent notation. Thoughts? — Preceding unsigned comment added by OhGodItsSoAmazing (talk • contribs) 19:32, 7 December 2013 (UTC)

Request for efficient higher order extension
I've been looking for a reference that describes the efficient extension when using an order greater than one. I know the basic trick is to lift the state space to pairs, (Q x Q) --> Q', and then work in this space, which maps from [q_k at t-1 and q_l at t] to [q_j at t and q_i at t+1]. Clearly q_l = q_j, which causes many zeros in the transition probability matrix.

Without modifying the algorithm, time is wasted on these zeros. A second-order HMM implemented naively has a run time of O(N^4 T), but if we squeeze the zeros out of the algorithm, this reduces to O(N^3 T). A third-order HMM implemented naively runs in O(N^6 T), but squeezing the zeros out so the algorithm never iterates over them reduces this to O(N^4 T). I've done some work myself, but without references it would count as original research.

I'll provide one possible extension of the forward pass below (using the non-matrix form of the forward algorithm listed at http://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf):

$$f_{0:1}(i) = \pi(i)\, b_i(O_1)$$

$$f_{0:2}(j,i) = \pi(j,i)\, b_{j,i}(O_2)$$, using $$\pi(j,i)$$ as the joint prior probability of state $$q_j$$ followed by state $$q_i$$. Alternatively,

$$f_{0:2}(j,i) = f_{0:1}(j)\, T_{j,i}\, b_{j,i}(O_2)$$, where $$T_{j,i}$$ is an element of the first-order transition matrix.

$$f_{0:t}(j,i) = \sum_{k,l=1}^{N} f_{0:t-1}(k,l)\, T_{(k,l),(j,i)}\, b_{j,i}(O_t)$$, but $$q_l = q_j$$. Squeezing the zeros out gives:

$$f_{0:t}(j,i) = \sum_{k=1}^{N} f_{0:t-1}(k,j)\, T_{(k,j),(j,i)}\, b_{j,i}(O_t)$$, for $$i, j \in \{1, 2, \ldots, N\}$$
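One possible rendering of this squeezed recursion in code (entirely my own sketch of the scheme described above, not from a reference): `f` is an N x N table over state pairs, `T2[k, j, i]` the pair transition probability, `B2[j, i, o]` the pair-conditioned emission probability, and `obs` the observations from $$O_2$$ onward.

```python
import numpy as np

def forward_2nd_order(pi2, T2, B2, obs):
    """Second-order forward pass with the structural zeros squeezed out:
    the inner sum runs only over k, so each step costs O(N^3), not O(N^4)."""
    N = T2.shape[0]
    f = pi2 * B2[:, :, obs[0]]            # f_{0:2}(j, i)
    for o in obs[1:]:
        g = np.zeros((N, N))
        for j in range(N):
            for i in range(N):
                # only predecessors ending in state j contribute
                g[j, i] = f[:, j] @ T2[:, j, i] * B2[j, i, o]
        f = g
    return f
```

A quick sanity check of the scheme: if T2 and B2 ignore the extra history (so the model is really first order), summing the pair table over its first index must reproduce the ordinary forward vector.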

Mouse7mouse9 20:20, 28 January 2014 (UTC) — Preceding unsigned comment added by Mouse7mouse9 (talk • contribs)

Forward Probabilities
There is already a page on the Forward algorithm. It would be better to reference that page instead of repeating it here. The section on backward probabilities starts:

A similar procedure can be constructed to find backward probabilities.

referring to the section on forward probabilities, and forcing a reader who has already read the Forward Algorithm article to go back and review the section above to make sure they understand the reference ...