Talk:Earley parser

FINISHED and PARTS-OF-SPEECH
The pseudocode references FINISHED(state) which is not defined. I am guessing that it means that the current position in the production is at the end.

PARTS-OF-SPEECH is undefined too.


 * I'd also like someone who has the book the pseudocode is based on to clarify what PARTS-OF-SPEECH is. BernardoSulzbach (talk) 20:27, 16 January 2021 (UTC)

[Untitled]
The article says that rules like a -> a b will cause both top-down and bottom-up parsers to go into an infinite loop, but I thought it was only top-down parsers that had a problem with left-recursion. Any comments? &mdash; Cadr
 * Since no-one's commented, I'll go ahead and change the page... &mdash; Cadr
 * It's not exactly true, top down parsers like LL(1) will enter in a infinite loop, but other top-down parsers, indeterministic LL, (and I think some kind of top-down parser that uses breadth-first search also) will eventually finish.

Works on all CFGs?
http://pages.cpsc.ucalgary.ca/~aycock/spark/doc.html seems to suggest that "[...] should work with any Backus-Naur form (BNF) grammar, including grammars with empty right-hand sides and ambiguous grammars. There are apparently a few rather obscure cases where Earley's parsing algorithm fails, but they're not likely to crop up in any reasonable application." — Preceding unsigned comment added by 18.111.116.225 (talk) 16:26, 24 April 2012 (UTC)


 * The Dragon book claims that Earley parsers can handle all CFGs. Two other (googled) references repeat that (http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt, http://tom-ridge.blogspot.com/2011/06/simple-functional-sound-and-complete_22.html). I can't find anywhere else that someone claims Earley parsers don't parse all CFGs, so I would tend to remove the discuss tag unless you can cite more substantial evidence. JustinBlank (talk) 16:42, 25 October 2012 (UTC)


 * Hopcroft+Ullman 1979 (p.145), claim that Earley's algorithm is the "most practical, general, context-free recognition algorithm". Aycock's restriction may refer to his particular implemetation only. I remove the ((discuss)) tag in the article. - Jochen Burghardt (talk) 14:18, 1 April 2016 (UTC)

NP-complete?
Parsing Unrestricted languages is not NP-complete, it is much harder.

It can be easily shown (simulation of Turing machines by grammars gives an immediate reduction of the halting problem to parsing of general languages) that the problem is not computable.

Unless anybody objects, I'll correct it.

--UrsusArctos 22:37, 19 Oct 2004 (UTC)


 * Be careful with the meaning of the word "parse", though. You could argue that you don't have to be able to decide a language in order to parse it, and consequently even unrestricted languages would be "parseable" in a sense, i.e. the sense that if there was a parse, you would get it eventually. I'm not sure there's any general agreement on whether this would count as "parsing" or not, so I think we should word it carefully. Otherwise yeah, the NP-complete thing is incorrect. Cadr 12:15, 20 Oct 2004 (UTC)

Turning the recognizer into a parser
Everyone describes Earley's algorithm as a parser, but no-one ever bothers to explain how you turn the recognizer into a parser. My understanding is that you modify the completer to record a backpointer.. but exactly what item gets the backpointer and what it is to point to is still kind of a mystery to me. I believe that the new item you are creating should receive the backpointer and it should point to the item you are copying (to advance the dot in). However this is not sufficient, you also need some mechanism to inherit backpointers, otherwise you end up with a disconnected tree. Can anyone clarify this process?


 * That would be very useful! Unfortunately, I haven't found any description about that yet... --ThSoft 21:39, 14 September 2006 (UTC)
 * My understanding is that you generate a certains number of Earley items, which may or may not be part of *a* complete parse. --92.137.198.224 (talk) 11:42, 1 August 2010 (UTC)
 * It's all laid out in Earley's dissertation (1968). I just added references to it. In the article that was already listed, the part that you're really wanting to see says "for more information, see my dissertation" (basically). I just got it implemented myself in Java. Good luck! Limited Atonement (talk) 19:12, 12 September 2012 (UTC)


 * Be forewarned that the method described in Earley's paper for reconstructing parse trees from the Earley sets may have problems. As stated by Elizabeth Scott, 2008, "adding multiple pointers to a given instance of a non-terminal can create errors"; that paper, citing an example of Tomita , says the grammar "S ::= SS | b" on the input "bbb" will correctly identify derivations of bbb, but will also include erroneous derivations of bbbb and bb.  (Scott's paper says it provides an algorithm for correct construction of the parse forest; I am trying to work through this as well as Aycock's various works on this subject now.) Pnkfelix (talk) 11:29, 2 March 2013 (UTC)


 * I just finished getting through Elizabeth Scott's paper and implementing a toy JavaScript parser (with an annotated version), so I added that to the JS implementations section and wrote a section for the main page covering the things that I understand about generating the parse forest. I think I've followed the rules, but it's the first edit I've made on Wikipedia that wasn't just a typo fix, so let me know if I've screwed anything up. JoshGrams (talk) 11:03, 16 September 2015 (UTC)


 * Actually, you don't need to modify the recognizer at all to reconstruct your parse tree. States are all you need.  You need to start from the completed start rule, then search for the rules that may have caused its completion.  I call that the "un-complete" step. There are analogous "un-predict" and "un-scan" steps. Which is applicable depends on the state rule you are examining.  Also don't forget to backtrack.  This should look like a straightforward tree construction recursive algorithm. If your parse is not ambiguous, you will always find one and only one possibility at each step.  I have done one such reconstruction manually. I have yet to automate it. 88.180.140.48 (talk) 16:34, 23 October 2013 (UTC)

Error in the algorithm description?
I believe that:


 * Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, j+1) to S(k) for every production with Y on the left-hand side.

Should be changed to:


 * Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production with Y on the left-hand side.

This is a change of "j+1" to "k". See the example. —Preceding unsigned comment added by 87.116.85.20 (talk) 17:14, 4 December 2007 (UTC)


 * You were correct and this has been fixed in the meantime. Commenting to help future talk page readers. BernardoSulzbach (talk) 20:24, 16 January 2021 (UTC)

Unused symbol
Z is defined to represent a single nonterminal. However, it never gets used anywhere for the rest of the page. I believe that its definition should be removed. If nobody objects, I'll correct it. Opaldraggy (talk) 04:09, 14 December 2008 (UTC)

Added Parse::Marpa
I've added Parse::Marpa under External Links. It's a new implementation of a parser based on Earley's algorithm, with the improvements suggested by Aycock and Horspool. I'm the author of this Perl module, but I believe this edit is in compliance with WP:COS. --Jeffreykegler (talk) 01:58, 11 January 2009 (UTC)

Propose tidy of eternal links
I'd like to remove the subsections from external links section, so it would look the example below, any objections? This would be to make it concise and to avoid repeating information.

Example doesn't agree with stated algorithm
In the algorithm pseudocode, the line for each state in chart[i] is perhaps misleading. In the example, S(0), say, Rule 4 is generated from rule 3. This either means that chart[i] is looped over multiple times (?) or somehow self-generates more items?. This should be made clear in the pseudocode. Limited Atonement (talk) 13:56, 10 September 2012 (UTC)

Ambiguity Concerning "Input Position"
The words "Input Position" show up on the page twice (as of now), and it means a different thing each time. The first one is explained, "input position (which represents a position between tokens)", but the second one is not explained: "input position k is called S(k)", but must refer to the input position between grammatical terminals or nonterminals. I may be incorrect, but I've been working with this for a while now, and believe it to be true. Limited Atonement (talk) 21:13, 10 September 2012 (UTC)


 * The second and first uses of the term are consistent and the second is correct, though it took me some thought to see that and at this point I've more than a little experience with these matters. However, the original definition of "input position" is, strictly speaking, wrong -- in an input of length n, positions 0 and n are not "between" tokens.  I hope my new language is better.


 * Position in an Earley parse turns out to be a very deep concept, and it plays a major role in the Earley-derived Marpa parser.--Jeffreykegler (talk) 00:41, 11 September 2012 (UTC)

Incomplete Example?
The algorithm states "Prediction: For every state in S(k) of the form (X -> a . Y b, j) ..., add [all productions with Y on the left]." However, in the example, the state in S(k) (M -> . M * T, 0) does not expand as expected (it is nowhere expanded!). Limited Atonement (talk) 15:11, 11 September 2012 (UTC)


 * I've just made some changes to the description of the algorithm in the way of cleaning up terminology and clarification. The pseudo-code does not mention two important points.  First, that duplicate states are never added to a state set.  Second, that the loop over the prediction/scanning/completion operations must be performed as long as new states are being added to the state set.


 * Regarding the loop. I think it might worth noting this. For instance, I thought that you could not do it if you had a left recursive. However, I realized that this would not happen if duplicate rules are ignored. This part is not obvious and is sometimes omitted in books.


 * The second point is often missing in both descriptions and implementations, in part because it is not in Earley's original paper. This is a well-known bug with Earley's original description of his algorithm.  Newer variants of Earley's, such as Marpa and the Aycock-Horspool algorithm, have found ways to make one pass through the queue sufficient, but these are complex and adding the loop is the standard way to restate the classic Earley's algorithm in bug-free form.--Jeffreykegler (talk) 18:26, 13 September 2012 (UTC)

Jurfafsky&Martin description does not fit the generic case
The scanner phase of the algorithm is tailored for the output of a POS tagger. Furthermore, it apparently describes the case of several (possible) tags per word. This is not the original Earley algorithm, I believe. It is an extension. I suggest fixing the scanner phase. — Preceding unsigned comment added by Srchvrs (talk • contribs) 20:03, 11 December 2012 (UTC)

Possible Errors
Hi, I have been reading this article while studying and tried to follow the example. Some things seem odd to me, but I'm not sure if it's me misunderstanding or an error:

In S(2) (1) it originates from a scan in S(1) and shouldn't it thus be: (1) S → S + • M     (1)

and S(3) (4) it originates from a complete in S(2) and shouldn't it thus be: (4) S → S + M •     (2)

--Mac C. Million (talk) 11:48, 8 February 2017 (UTC)

Algorithm incomplete?
I have implemented the Early parser myself, and I noticed a problem with indirect rules (like in the following grammar).

In that grammar, since the symbol `` can be completed, the algorithm continues to ``, but `` introduces ``, just that since `` was already in the state set, it is not added again and so `` can never be completed. This can be fixed by going through the whole state again and finishing all states that are already finished again (better suggestions welcome).

Cpiber (talk) 09:07, 4 August 2021 (UTC)

this https://aclanthology.org/pdf/P/P84/P84-1073.pdf doesn't exists could use https://aclanthology.org/P84-1073.pdf — Preceding unsigned comment added by 168.196.26.41 (talk) 02:50, 20 June 2023 (UTC)

Wouldn't it be better to name the origin position always j?
tuple (X → α • β, i) the position i in the input at which the matching of this production began: the origin position

Later on: (X → α • Y β, j) (where j is the origin position as above)

Why not name the variable for the origin position always 'j' for consistency? 2001:628:404:208:6043:1418:A51E:6232 (talk) 01:40, 5 July 2023 (UTC)

Possible typo
"The scanner creates an SPPF node representing the non-terminal it is scanning." The scanner is looking for a, where a is a terminal. Shouldn't 'non-terminal' be 'terminal'? 2001:628:404:208:FCF4:5ED3:4A4D:2DC2 (talk) 18:25, 9 July 2023 (UTC)