Talk:Chomsky normal form

Terminology / Meta-Variable Confusion
This wikipedia article is very bad. It gives the impression that grammar rules are built by upper case variables A, B, C. But these upper case variables are meta variables that range over the set of non-terminals N. Since a normal form talks about a class of grammars.

So in practice you would have grammar rules like:

expression :== term expression-rest

When for example N={expression, term, expression-rest, etc..}. The bad quality of this article concerning this meta level then leads to the confusion that in a CNF rule A <- B, C it is not possible that B and C are the same non-terminal.

So the phrase "A, B and C are nonterminal symbols" should be better changed into the phrase "A, B and C represent non-terminal symbols" or "A, B and C stand for non-terminal symbols".

The following pages have the same problem:

http://en.wikipedia.org/wiki/Kuroda_normal_form

http://en.wikipedia.org/wiki/Greibach_normal_form

Janburse (talk) 12:17, 20 June 2011 (UTC)
 * I think it's enough that we do what most WP:RS do, i.e. say that productions have this form. I don't see much confusion myself and given the number of sources who have written about this, one is bound to have noticed if it were indeed that confusing, but apparently it's not, except to you. If we use different non-terminals in every production template, then that also might be interpreted as them being necessarily different across the set of patterns (that make up the normal form) as a whole. JMP EAX (talk) 08:26, 17 August 2014 (UTC)

I have difficulties understanding the following, maybe because of the confusion discussed here: "If a rule has $$A$$ as a singleton on its RHS, add a new rule $$R = A \rightarrow \varepsilon$$ unless $$R$$ has already been removed through this process."

As I read it, it means that if there are rules

$$foo \rightarrow bar$$

$$bar \rightarrow \varepsilon$$

(that: is A is bar, as A is the epsilon-producing nonterminal under investigation), a new rule $$bar \rightarrow \varepsilon$$ should be added. This seems to make no sense at all to me. Should it be a rule $$foo \rightarrow \varepsilon$$? Lasse Hillerøe Petersen (talk) 23:38, 15 November 2014 (UTC)

Following Sipser's Lead
I see that the original article follows Sipser pretty closely, including allowing epsilon to be in the language. As far as I know, this is in fact a little EXTENSION to the original idea of CNF. Here is how Hopcroft, Motwani, and Ullman define CNF:

"all productions [in the grammar] are in one of two simple forms, either:

1. A --> B C, where A, B, and C are each variables, or 2. A --> a, where A is a variable and a is a terminal.

Further, [the grammar] has no useless symbols."

Note the differences from the Sipser presentation: no epsilon rule at all, no restrictions on B and C except that they are variables.

By this (H/M/U's) definition, you can create a CFG in CNF for any CFL that does not include epsilon. If the CFL includes epsilon, you are out of luck. I speculate this is the original definition. Sipser also omits the mention of useless symbols.

Question: Is my speculation correct?

Question: If so, does the 'original' definition of CNF deserve a mention?

Question: Who developed this extension that Dr. Sipser presents? Was it he himself?

Also I've edited the article to add that (in the extended version) the start variable is not allowed to be on the right side of any production (a necessary wrinkle since we allow the epsilon rule for this variable).

LandruBek 07:14, 22 March 2006 (UTC)

How is it different from the Backus Naur/Normal Form?
I am guessing it's the same. Can someone confirm/refute this? I think the relation to BNF should be mentioned in the main article. SDX2000 (talk) 15:43, 26 May 2009 (UTC)
 * There are a lot of differences: In BNF there is nowhere a restriction on the length of the right hand side of productions. Furthermore, right-hand sides can freely mix terminals and nonterminal symbols. In my viewpoint BNF is merely a syntactical convention for writing down context-free grammars (CFGs) in ASCII; CFGs in Chomsky normal form are special cases of CFGs Hermel (talk) 21:15, 17 August 2009 (UTC)

Exponential blow-up?
I think the sentence "the size blow-up in the worst case may range from $$|G|^2$$ to $$2^{2 |G|}$$" is not correct. The exponential blow-up in the worst-case cannot be avoided. Take for instance the following CFG:

1. S --> A_1A_2...A_n 2. A_i --> epsilon | a_i —Preceding unsigned comment added by 164.15.123.103 (talk) 12:53, 14 June 2010 (UTC)


 * As stated in Lange&Leiss09(see page 5), exponential blowup doesn't occur if you binarize the rules before removing epsilon-productions.
 * For your particular case you jump from 2n+1 rules to 3n rules after splitting the rules and removing epsilon-productions, and a quadratic augmentation occurs when removing unit rules (from trying it by hand, I believe that you end up with 1/2 n² + 5/2 n - 1 rules in the general case).
 * PS : I'm new to wikipedia editing so correct me if I answered the wrong way. 147.210.12.19 (talk) 12:34, 26 April 2011 (UTC)


 * What exactly is |G|, anyway? The number of production rules? The sum of the lengths of the right-hand sides of the production rules? Or what? -- UKoch (talk) 14:45, 15 December 2011 (UTC)
 * Incidentally, this is really easy to find out: I just followed the link to the pdf of the cited journal article. According to the authors, the size is measured (more or less) as the number of symbols in a standard mathematical description of the grammar: $$|G| := \sum_{A \in V}\sum_{A \to \alpha \in P} |A\alpha|$$ . This notion of size is of course linearly related to the size of the grammar in Backus-Naur-Form, as well as to the sum of the lengths of the right-hand-sides. By the way, the number of productions contradicts the intuition of a size measure for (non-CNF) grammars, as there are infinitely many grammars with n productions, for fixed n. Hermel (talk) 23:35, 15 December 2011 (UTC)
 * Thanks for pointing that out. I think that information should go into the article, and into a few others as well. I'll start right away. -- UKoch (talk) 16:17, 19 December 2011 (UTC)

Simple English version please
This is really dense stuff, and a lot of interested Chomsky fans would appreciate a layman's terms version, thanks. —Preceding unsigned comment added by 76.24.47.64 (talk) 04:44, 2 January 2011 (UTC)


 * I don't think that's really doable--to understand Chomsky normal form, you need to understand context-free grammars. If you follow the links, you end up at Formal grammar, which does an adequate job of explaining (derivations in) formal grammars in layman's terms, I think. -- UKoch (talk) 14:43, 15 December 2011 (UTC)


 * Perhaps the postal address example used in the Backus-Naur Form article can be used here? That would at least make it easier to read for someone with programing experience but no formal CS education. We can worry about plain English later.Sjors (talk) 02:42, 26 December 2011 (UTC)

Notation
I suggest changing \alpha (α) to a in the def. I did that at Kuroda, there per RS, and I also doubt the sources cited here have that different of a convention. JMP EAX (talk) 08:31, 17 August 2014 (UTC)

I just checked out Hopcroft & Ullman and he uses a not alpha on p. 92, so I made the change. JMP EAX (talk) 08:35, 17 August 2014 (UTC)

Suggested rewrite
Triggered by the above issue of, I examined the section Chomsky normal form, and found that it is currently in a rather poor state. I have rewritten it, now giving source references from Hopcroft.Ullman.1979 (mainly for the transformation algorithms) and from Lange.Leiß.2009 (mainly for their order). Moreover, I moved the blow-up discussion from the lead into that conversion section, and moved the section Chomsky normal form almost to the end (I couldn't verify the sources given in its subsection Chomsky normal form; I doubt that Hopcroft.Ullman.1979 or Hopcroft.Motwani.Ullman.2006 mentions the notion "Chomsky reduced form", it certainly does not appear in the index of any of these books).

My suggestion for the article can be found at User:Jochen Burghardt/sandbox2. Please comment it here or/and improve the sandbox text directly; eventually, I would like to replace the current article by the sandbox text. - Jochen Burghardt (talk) 13:19, 10 May 2015 (UTC)


 * I moved the sandbox page to the article today. - Jochen Burghardt (talk) 10:59, 31 May 2015 (UTC)

Change wording of BIN
The current wording of the BIN transformation says to only change rules composed of more than two non-terminals, suggesting it does not apply to rules composed of a mixture of terminals and non-terminals. If TERM is applied first, this doesn't cause any issue. However, if the transformation ordering START,BIN,DEL,UNIT,TERM is followed (as suggested in a later subsection), following this wording does not result in a CNF grammar. I think it should be changed so the transformation applies to any production rule not of the form A → BC or A → a. For instance, applying the transformation to the production rule A → aBbC would result in the new rules A → aD, D → BE, E → bC. Then, when TERM is applied later, these rules would follow the form A → BC as required by CNF. — Preceding unsigned comment added by 98.102.200.9 (talk) 18:58, 23 October 2021 (UTC)