
LECTURE NOTES ON

CMP 311 (COMPILER CONSTRUCTION I)

BY

MR. AGBER SELEMUN

Introduction

The use of computer languages is an essential link in the chain between humans and computers. All computers execute relatively simple commands (albeit very quickly). A program for a computer is a combination of these simple commands in machine language. Because machine language is tedious and error-prone, programs are written using high-level languages, which are very different from machine language. The types of translators in use include:

(i) Assemblers: these translate code written in assembly language to machine language. Assembly language uses mnemonics to represent instructions and is only a small advancement over machine language.
(ii) Interpreters: used to translate command languages. Command languages are those languages whose instructions are executed as soon as they are translated.
(iii) Compilers: these create some alternative representation of the high-level code.

Assignment
Compare and contrast the translators mentioned in (ii) and (iii).
Highlight the advantages of HLL over LLL and ML.

What is a Compiler

Simply put, a compiler is a translator which takes a source program written in a high-level language as its input and produces an object program in another language as output. Alternatively, a compiler is a program that reads a program written in one language – the source language – and translates it into an equivalent program in another language – the target language. As part of this process, it reports to its user the presence of errors in the source program. The translation of the source program (source code) to the object program (object code) occurs at compile time. The object program is executed at run time to obtain results from the program. There are thousands of source languages, ranging from traditional programming languages such as FORTRAN, Pascal and C to specialized languages that have arisen in virtually every area of computer application. Target languages are equally varied – the target may be another programming language, an assembly language, or the machine language of any computer from a microprocessor to a supercomputer. Compilers are classified as single-pass, multi-pass, load-and-go, debugging or optimizing, depending on how they have been constructed or on what function they are supposed to perform. Despite this apparent complexity, the basic tasks any compiler must perform are the same, and we can construct compilers for a variety of source languages and target machines using the same basic techniques.

Parts of Compilation

A compiler must perform an analysis of the source program and then a synthesis of the object program – it must first decompose the source program into its basic parts and then build equivalent object program parts from them. During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree.

Assignment
Differentiate between:
Single-pass and multi-pass compilers.
Load-and-go, debugging and optimizing compilers.

Analysis of the Source Program

Analysis of a source program during compilation consists of three phases:
Linear analysis: the stream of characters making up the source program is read from left to right and grouped into tokens (sequences of characters that have a collective meaning or logically belong together).
Hierarchical analysis: the tokens are grouped into nested collections with collective meaning.
Semantic analysis: checks are performed to ensure that the components of the program fit together meaningfully.

The Phases of a Compiler

The process of compiling a program occurs in a series of sub-processes called phases, each of which transforms the source program from one representation to another. A phase is a logically cohesive operation that takes as input one representation of the source program and produces as output another representation after performing the required operations on it. Each of these phases is carried out by one component of the compiler.

The Scanner (Lexical Analysis)

This is the simplest part of the compiler (sometimes called the lexical analyzer). It reads the characters of the source program and groups them into tokens. It also removes comments, enters identifiers into the symbol table and performs other simple tasks that can be done without analyzing the source program. The tokens are stored and passed to the next phase in some internal form.

The Syntax Analyzer (Syntax Analysis)

This component groups the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output (this is referred to as hierarchical analysis, parsing or syntax analysis). The phrases of the source program are represented by a parse tree (syntax tree). The information collected in this phase is put into the symbol table and other relevant tables.
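As a concrete illustration (added here, not part of the original notes), the sketch below shows a minimal scanner in Python for a tiny assumed expression language. The token names, the regular expressions and the example input are all assumptions of the sketch.

```python
import re

# Token classes for a tiny assumed expression language.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),           # integer constants
    ("IDENT",   r"[A-Za-z_]\w*"),  # identifiers (entered into the symbol table)
    ("OP",      r"[+\-*/()=]"),    # operators and punctuation
    ("COMMENT", r"\#[^\n]*"),      # comments are removed by the scanner
    ("SKIP",    r"\s+"),           # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Yield (token kind, lexeme) pairs - the internal form passed on."""
    symbol_table = {}
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind in ("COMMENT", "SKIP"):
            continue                     # never passed to the next phase
        if kind == "IDENT":
            symbol_table.setdefault(lexeme, {"name": lexeme})
        yield kind, lexeme

print(list(scan("rate = rate + 60  # update")))
# [('IDENT', 'rate'), ('OP', '='), ('IDENT', 'rate'), ('OP', '+'), ('NUMBER', '60')]
```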

Fig1: Components of a Compiler

The Semantic Analyzer (Semantic Analysis)

This component checks the program constructs received from the syntax analyzer for semantic correctness and gathers type information for the subsequent phases. It performs type checking – it checks that each operator has operands that are permitted by the source language specification. The information collected is also put into the necessary tables.

The Intermediate Code Generator (Intermediate Code Generation)

This creates an explicit intermediate representation of the source program – which can be viewed as a program for an abstract machine. This intermediate representation must have two important properties:
It should be easy to produce.
It should be easy to translate into the target program.
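One common choice of intermediate representation is three-address code, in which each instruction applies at most one operator. The sketch below (an illustration added here, not the notes' prescribed method) generates three-address code from a tuple-based syntax tree; the tree format, the temporary-name scheme and the example expression are assumptions.

```python
from itertools import count

def intermediate_code(tree):
    """Emit three-address instructions for a tuple-based syntax tree."""
    code, temps = [], count(1)

    def gen(node):
        if isinstance(node, str):        # leaf: identifier or constant
            return node
        op, left, right = node           # interior node: (operator, left, right)
        l, r = gen(left), gen(right)
        t = f"t{next(temps)}"            # fresh temporary name
        code.append(f"{t} := {l} {op} {r}")
        return t

    gen(tree)
    return code

# syntax tree for the assumed expression: rate + 60 * count
for line in intermediate_code(("+", "rate", ("*", "60", "count"))):
    print(line)
# t1 := 60 * count
# t2 := rate + t1
```

Note that this form is easy to produce (one instruction per tree node) and easy to translate into a target program (each instruction maps onto a few machine instructions), which is exactly the pair of properties listed above.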

The Code Optimizer (Code Optimization)

Here, the intermediate code that has been generated is improved in some way so that faster-running machine code will result. There is great variation in the amount of code optimization performed by different compilers. Optimizing compilers – those that do the most optimization – spend a significant fraction of compilation time in this phase. There are, however, simple optimizations that improve the running time of the target program without slowing down compilation.

The Code Generator (Code Generation)

This is the messiest and most detailed part of the compiler. This is where the actual translation of the internal form of the source program into object code is done. The compiler generates code for each construct in the internal form of the program, in the order in which the constructs are encountered. Information is obtained from the relevant tables in order to generate target program instructions that perform the same task as their source program equivalents.

Tables of Information (Dictionary)

As program analysis is carried out, information is obtained from declarations, procedure headings, for-loops, etc. and saved for use later on. Exactly what must be saved depends on the source language, the object language and the sophistication of the compiler (the amount of optimization that is to be performed). One table which must be present in every compiler is the symbol table (identifier list or name table). It is a table of each identifier, its address and any other information about it that is needed in order to generate code. Most likely, it will also be necessary to have a table of the constants used in the source program, and perhaps a table of for-loops or other loops. As mentioned earlier, the exact information to be collected is determined by design factors.
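The sketch below (added for illustration) shows one simple way a symbol table could be organized; the recorded fields (a type and a relative address) are an assumption of the sketch, and real compilers store considerably more.

```python
class SymbolTable:
    """A minimal symbol table: name -> {type, address}."""

    def __init__(self):
        self._entries = {}
        self._next_addr = 0

    def declare(self, name, typ, size=1):
        """Called when a declaration is analyzed."""
        if name in self._entries:
            raise KeyError(f"duplicate declaration of {name!r}")
        self._entries[name] = {"type": typ, "address": self._next_addr}
        self._next_addr += size

    def lookup(self, name):
        """Called by later phases (type checking, code generation)."""
        return self._entries[name]

table = SymbolTable()
table.declare("rate", "real")
table.declare("count", "integer")
print(table.lookup("count"))   # {'type': 'integer', 'address': 1}
```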

The Error Handler

The error handler is invoked when a flaw is detected in the source program. It warns the programmer by issuing a diagnostic and adjusts the information being passed from phase to phase so that each phase can proceed. The desire is that compilation be completed even on flawed programs, especially through the syntax analysis phase, so that as many errors as possible can be detected in one compilation.

Pass(es)

When implementing a compiler, one or more phases (or portions of phases) are combined into a module called a pass. A pass reads the source program, or the output of the previous pass, makes the transformations specified by its phase(s) and writes its output into an intermediate file that may be read by a subsequent pass. The number of passes and the structure of the passes (i.e. the grouping of the phases) are determined by a number of considerations relating to the language and the machine. These include:
The structure of the source language.
The environment in which the compiler will operate (e.g. the size of available memory).
The speed of compilation.

Grammars and Languages

In the English language, we know that the sentence "the large pig swallowed the bottle" is a valid sentence of the language, either intuitively or by applying ad hoc rules taught in grammar classes (Fig2 shows a diagrammatic representation of the sentence). In order to mechanically decompose sentences, formal, precise rules which indicate their structure must be defined. A metalanguage is used to do this.

Fig2: Structure of the English sentence (syntax tree with leaves "the large pig swallowed the bottle")

The point of interest is in describing the syntax of programming languages, not their semantics. Fig2 (above) is called a syntax tree and it shows the structure or syntax of the sentence by showing its constituent parts. The tree shows that <sentence> is composed of <subject> followed by <predicate>. If we abbreviate "is composed of" by the symbol "::=", we can represent the rules of this sentence as follows (the angle-bracketed names were lost in this copy and are reconstructed here):

<sentence> ::= <subject><predicate>
<subject> ::= <article><adjective><noun>
<predicate> ::= <verb><object>
<object> ::= <article><noun>
<article> ::= the
<adjective> ::= large
<verb> ::= swallowed
<noun> ::= pig
<noun> ::= bottle

This set of rules is called a grammar. The rules of the grammar can be used to derive or produce a sentence by the following algorithm:
Begin with the entity <sentence>; find a rule with <sentence> to the left of "::=" and rewrite it as the string to the right of "::=".
Repeat this process by taking one syntactic entity in the string at a time and replacing it with the right part of a corresponding rule, until all the entities in the string appear only on the right parts of rules (i.e. the string consists only of terminal words).

The symbol "=>" means that one symbol of the string to the left of "=>" is replaced, using a rule of the grammar, to yield the string to the right of "=>". The derivation of a sentence is abbreviated using =>+ thus:

<sentence> =>+ the large pig swallowed the bottle

Note: the rules in a grammar may be used to form different sentences by combining them in different ways. For example, from the rules (again reconstructed):

<sentence> ::= <subject><verb>
<subject> ::= they
<subject> ::= he
<subject> ::= she
<verb> ::= eat
<verb> ::= drink
<verb> ::= black

we can form nine (9) sentences.
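To make the derivation algorithm concrete, here is a small Python sketch (added here, not from the notes) that performs leftmost derivations with the reconstructed pig grammar above; the dictionary encoding of the rules is an assumption of the sketch.

```python
import random

# Reconstructed grammar of the pig sentence: each non-terminal maps to
# the list of its alternative right parts.
RULES = {
    "<sentence>":  [["<subject>", "<predicate>"]],
    "<subject>":   [["<article>", "<adjective>", "<noun>"]],
    "<predicate>": [["<verb>", "<object>"]],
    "<object>":    [["<article>", "<noun>"]],
    "<article>":   [["the"]],
    "<adjective>": [["large"]],
    "<verb>":      [["swallowed"]],
    "<noun>":      [["pig"], ["bottle"]],
}

def derive(start="<sentence>"):
    """Produce one sentence of the language by leftmost derivation."""
    string = [start]
    while any(s in RULES for s in string):
        i = next(k for k, s in enumerate(string) if s in RULES)
        # one direct derivation step: replace the leftmost non-terminal
        string[i:i + 1] = random.choice(RULES[string[i]])
    return " ".join(string)

print(derive())   # e.g. "the large pig swallowed the bottle"
```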
Let us formally define some of the terms we have been using and others we may use later.

Def: V produces W, or W reduces to V (written V =>+ W), if there exists a sequence of derivations V => u1 => u2 => … => un => W. The string W is a word for V, and we write V =>* W if V =>+ W or V = W.

Def: A production rule or rewriting rule is an ordered pair (U, x), usually written U ::= x, where U is a symbol and x is a non-empty finite string of symbols.

Def: A symbol is an atomic entity, represented by a character or sometimes by a reserved or key word.

Def: An alphabet is a non-empty set of elements. The elements of an alphabet are called symbols, e.g. A = {x, y, z}; the alphabet of Pascal includes {begin, end, for, etc.}.

Def: A string is any finite sequence of symbols from an alphabet. Example: possible strings over the alphabet A = {x, y, z} are x, y, z, zy, yz, xz, xxy, xyz, etc. The order of the symbols in a string is important – i.e. xy ≠ yx.

Def: The empty string ε is the string with no symbols in it.

Def: The length of a string x, written |x|, is the number of symbols in the string.

Def: Concatenation of strings is the joining of two or more strings to form one longer string. Example: if x = mn and y = op, then the concatenation xy = mnop.

Def: If x = mn is a string, then m is a head and n is a tail of x. m is a proper head if n is not empty, and n is a proper tail if m is not empty.

Def: If we denote sets of strings over an alphabet by capital letters, we can define the product of two sets of strings AB as AB = {xy | x in A and y in B}, i.e. the product is obtained by concatenating the strings in A with those in B.

Def: Powers of strings – if x is a string then x^0 = ε, x^1 = x, x^2 = xx and in general x^n = xx…x (n times).

Def: Powers of an alphabet – A^0 = {ε}, A^1 = A, A^n = AA^(n-1) for all n > 0.

Def: The closure and positive closure of a set are defined as follows: positive closure A+ = A^1 ∪ A^2 ∪ … ∪ A^n ∪ …; closure A* = A^0 ∪ A+.
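These set-of-strings operations are easy to experiment with. The sketch below (an added illustration) implements the product, the powers and a finite truncation of the positive closure, since the true closure is an infinite set.

```python
def product(A, B):
    """AB = {xy | x in A and y in B}: concatenate strings of A with B."""
    return {x + y for x in A for y in B}

def power(A, n):
    """A^0 = {ε}, A^n = A * A^(n-1); the empty string '' plays the role of ε."""
    result = {""}
    for _ in range(n):
        result = product(A, result)
    return result

def positive_closure(A, up_to):
    """A+ truncated at A^up_to (the real A+ is infinite)."""
    return set().union(*(power(A, n) for n in range(1, up_to + 1)))

A = {"x", "y"}
print(product(A, {"z"}))        # {'xz', 'yz'}
print(power(A, 2))              # {'xx', 'xy', 'yx', 'yy'}
print(positive_closure(A, 2))   # A^1 ∪ A^2
```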
Def: A grammar G[Z] is a finite non-empty set of rules. Z is a symbol (called the distinguished symbol) which must appear as the left part of at least one rule. All the symbols used in the rules form the vocabulary V. The symbols appearing as left parts are called non-terminals; they form the set of non-terminals, VN. The symbols not in VN are called terminals and form the set of terminals, VT.

Example: The grammar G1[number] can be represented as (reconstructed):

<number> ::= <no>
<no> ::= <no><digit>
<no> ::= <digit>
<digit> ::= 0
<digit> ::= 1
<digit> ::= 2
<digit> ::= 3
<digit> ::= 4
<digit> ::= 5
<digit> ::= 6
<digit> ::= 7
<digit> ::= 8
<digit> ::= 9

Using the BNF (Backus-Naur Form) notation, rules with identical left parts (such as u ::= x; u ::= y; u ::= z) are written as u ::= x|y|z, so that G1[number] could be written as:

<number> ::= <no>
<no> ::= <digit> | <no><digit>
<digit> ::= 0|1|2|3|4|5|6|7|8|9

Assignment
Discuss briefly any three notations used to describe formal languages.

Def: Let G be a grammar. The string v directly produces the string w, written v => w, if we can write v = xUy and w = xuy for some strings x and y, where U ::= u is a rule of G. We also say that w is a direct derivation of v, or that w directly reduces to v. The strings x and y may be empty, so that for any rule U ::= u of the grammar G, we have U => u.

Def: v produces w, or w reduces to v, written v =>+ w, if there exists a sequence of direct derivations v => u1 => u2 => … => un => w, where n > 0. The sequence is called a derivation of length n. The string w is said to be a word for v. Also, if v =>+ w then we can write v =>* w.

Def: Let G[Z] be a grammar. A string x is a sentential form if x is derivable from the distinguished symbol Z of the grammar, i.e. if Z =>* x.

Def: A sentential form x is a sentence if it contains only terminal symbols.

Def: The language L(G[Z]) of a grammar is the set of sentences of the grammar: L(G) = {x | Z =>* x and x ∈ VT+}.

Class Work
1. Identify the sentences of the following grammar (reconstructed):
<sentence> ::= <subject><verb><object>
<subject> ::= <pronoun> | <article><noun>
<object> ::= <article><adjective><noun>
<pronoun> ::= he|she
<article> ::= the|a
<adjective> ::= big
<verb> ::= ate
<noun> ::= pig|bottle
2. Write a grammar whose language is the set of even numbers.
3. Construct a grammar for the language {ab^n a | n = 0, 1, 2, …}.

Def: Let G[Z] be a grammar and let w = xuy be a sentential form. Then u is a phrase of the sentential form w for a non-terminal U if Z =>* xUy and U =>+ u. It is a simple phrase if U => u, i.e. if U ::= u is a rule of the grammar.

Example: From the grammar G1[number] (defined above), what are the phrases of the sentential form <no>1? We have <number> => <no> => <no><digit> => <no>1. Thus <number> =>* <no> and <no> =>+ <no>1, so <no>1 is a phrase for <no>; also <number> =>* <no><digit> and <digit> => 1, so 1 is a phrase (in fact a simple phrase) for <digit>.

Def: A handle of any sentential form is a leftmost simple phrase.

Def: Let G[Z] be a grammar. We say G is recursive if there exists a derivation U =>+ …U…; it is left recursive if U =>+ U… and right recursive if U =>+ …U. Recursive grammars are used to define infinite languages (languages with an infinite number of sentences).

Syntax Trees

Def: A syntax tree is a graphical representation of a derivation. Each node is labeled by some non-terminal, and the branches of this node are labeled from left to right by the symbols in the right part of the production by which that non-terminal was replaced in the derivation. The end nodes (also called leaves) are those nodes which have no branches emanating from them (nodes whose labels are terminal symbols). Reading the end nodes of the tree from left to right gives the sentential form derived by the derivation which the tree represents.

How to Draw/Derive a Syntax Tree
Start with the distinguished symbol and draw branches whose nodes form the replacing string.
From each of the resulting nodes, draw the branches again for the string replacing that node.
Continue in this manner until all the nodes are end nodes.

Example 1: Given the grammar G2[E] below, draw the syntax trees for the sentences (a) i+i (b) i*i (c) i+i*i (d) i*i+i.
E ::= T|E+T|E-T
T ::= F|T*F|T/F
F ::= (E)|i
2. Draw the syntax tree for the sentence 25 from the grammar G1[number].

Concerning syntax trees, we can draw the following conclusions:
For each syntax tree there exists at least one derivation, and there is a corresponding syntax tree for each derivation, though several derivations may have the same tree.
A branch of the tree indicates a direct derivation. Thus a rule exists in the grammar whose left part is the label of the branch node and whose right part is the string of branch nodes.
The end nodes of the tree form the derived sentential form.
Let U be the root of a sub-tree for a sentential form w = xuy, where u forms the string of end nodes of that sub-tree. Then u is a phrase for U of the sentential form w; it is a simple phrase if U ::= u is a rule of the grammar.

Ambiguity

As mentioned earlier, several derivations may have the same tree. The difference between these derivations is the order in which the rules are applied in each derivation. A syntax tree doesn't specify the order of derivation. Consider a section of the English grammar with the rules (reconstructed):

<sentence> ::= <subject><verb> | <verb><object>
<subject> ::= <noun>
<object> ::= <noun>
<noun> ::= time|flies
<verb> ::= time|flies

We can generate the sentence "time flies" in two different ways, both of which make sense: either as a command, "find out how fast flies fly" (time as the verb), or as a statement, "time goes by quickly" (flies as the verb). Once we remove the context, we become unsure which reading is meant and cannot derive the sentence in a unique way. The sentence is said to be ambiguous.

Def: A sentence of a grammar is ambiguous if there exist two or more syntax trees for it. A grammar is ambiguous if it contains an ambiguous sentence.

Note: it is the grammar that is said to be ambiguous, not the language. An ambiguous sentential form has more than one syntax tree and therefore more than one handle. It is possible to change an ambiguous grammar, without changing the sentences, to arrive at an unambiguous one, but this is not always possible; for some languages, no unambiguous grammar exists.
Example: Consider the grammar G2[E] for arithmetic expressions:

E ::= E+E|E-E|E*E|E/E|(E)|-E|i

The sentence i+i*i of this grammar is ambiguous. We can disambiguate the grammar by specifying the associativity and precedence of the arithmetic operators. Suppose we use the precedence rules of ordinary mathematics and take left associativity for operators of equal precedence; we would then disambiguate the grammar. A grammar which has all these rules built into its productions is G3[E], below:

E ::= T|E+T|E-T|-E
T ::= F|T*F|T/F
F ::= (E)|i

In G3[E] there is only one tree for the sentence i+i*i, which implies the sentence is unambiguous and the desired precedence of operations holds.

Basic Parsing Techniques

We will look at how to check whether or not an input string is a sentence of a given grammar, and how to construct a parse tree for the sentence if desired. As every compiler performs some kind of syntax analysis, it takes as input a sequence of tokens and produces as output some representation of the parse tree.

Def: A parse of a sentential form is the construction of a derivation, and possibly a syntax tree, for it.

Def: A parser of a grammar G is a program that takes as input a string w and produces as output either a parse tree for w, if w is a sentence of G, or an error indication if it is not. A parser is also called a recognizer, since it recognizes only sentences of the grammar in question.

Based on the mode of operation, two types of parsers are known: the bottom-up parser builds the tree from the leaves up to the root, while the top-down parser starts at the root and works down to the leaves.

Top-Down Parsing

As mentioned previously, a top-down parser builds the parse tree for an input string from the root, creating the nodes of the tree in pre-order. This method is relatively straightforward. Complications arise because of the record-keeping needed to perform backtracking in a manner that ensures that all possible trees are attempted. Generally, the main problem in top-down parsing concerns the choice to make when there are multiple alternatives: suppose the leftmost non-terminal to be replaced is V and there are n rules V -> S1|S2|…|Sn. How do we know which string to replace V by? Other difficulties with top-down parsing are:

Left recursion – A grammar is left recursive if there is a production A -> Aα for some α. This can cause a top-down parser to go into an infinite loop, i.e. when trying to expand A, we may find ourselves expanding A continuously without consuming any input. Direct left recursion is eliminated by re-writing the rules of the grammar using the iterative and optional notation, where rules such as E -> E+T|T are written as E -> T{+T}. It is worth noting that this eliminates direct left recursion but still allows for indirect left recursion.

Backtracking – If, after a sequence of expansions, we discover a mismatch, we may have to undo the semantic effects of these expansions. Also, when failure is reported, we have no idea where the error occurred. The only way to solve this problem is to use parsers that do no backtracking at all; examples of such parsers are recursive descent and predictive parsers (see the sketch below).

The order in which alternatives are tried can affect the language accepted. This problem is related to and directly affected by backtracking, and is solved in a similar manner.
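As an illustration of top-down parsing (added here, not from the notes), below is a sketch of a recursive descent parser for G3[E], with the left recursion removed using the iterative notation: E -> [-] T {(+|-) T}, T -> F {(*|/) F}, F -> (E) | i. The tuple representation of the tree is an assumption, and the sketch binds the unary minus of the -E rule tightly to the first term, which is one of several possible readings of that rule.

```python
class Parser:
    """Recursive descent parser for G3[E] over the symbols i + - * / ( )."""

    def __init__(self, text):
        self.toks = [c for c in text if not c.isspace()]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {self.pos}")
        self.pos += 1

    def expr(self):                        # E -> [-] T {(+|-) T}
        if self.peek() == "-":             # unary minus (from E ::= -E)
            self.eat("-")
            tree = ("neg", self.term())
        else:
            tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.eat(op)
            tree = (op, tree, self.term())     # left associative
        return tree

    def term(self):                        # T -> F {(*|/) F}
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.eat(op)
            tree = (op, tree, self.factor())
        return tree

    def factor(self):                      # F -> (E) | i
        if self.peek() == "(":
            self.eat("(")
            tree = self.expr()
            self.eat(")")
            return tree
        self.eat("i")
        return "i"

print(Parser("i+i*i").expr())   # ('+', 'i', ('*', 'i', 'i'))
```

The output tree for i+i*i groups i*i under the + node, showing that the precedence built into G3[E] is obtained without any backtracking.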
Bottom-Up Parsing

The bottom-up technique starts with the string itself and tries to reduce it to the distinguished symbol. At each step, a handle (leftmost simple phrase) of the current sentential form is reduced, and thus the string to the right of this handle always contains only terminals. The main problem in bottom-up parsing involves identifying the handle and deciding what to reduce it to. Other problems of bottom-up parsers depend on the particular implementation.

Relations

A (binary) relation on a set is any property that either holds or does not hold for any ordered pair of elements of the set. The symbols "=>" and "=>+" are examples of relations between strings. As a form of notation, we write cRd if the relation R holds between members c and d of a set. We can therefore say a relation is the set of ordered pairs for which the property holds, i.e. (c,d) is in R iff cRd. Things to note concerning relations:

A relation P contains another relation R if (c,d) in R implies (c,d) in P.

The transpose of a relation R is the relation obtained by reversing its pairs, i.e. c R^T d iff dRc. A relation is symmetric if cRd implies dRc; example: the "brother of" relation.

A relation is reflexive if cRc holds for all elements c of the set.

A relation is transitive if aRc follows from aRb and bRc.

Given two relations R and P defined over the same set, we define the product RP of R and P by c(RP)d iff there exists an e such that cRe and ePd.

Using the product, we can define powers of a relation R by R^1 = R, R^2 = RR and R^n = RR^(n-1) for n > 1. Also, we define R^0 to be the identity relation, i.e. aR^0b iff a = b.

The transitive closure R+ of a relation is defined by cR+d iff cR^n d for some n > 0. Obviously, if cRd then cR+d.

The reflexive transitive closure R* of a relation is defined by cR*d iff c = d or cR+d.

Relations Concerning Grammars

The relations of main interest with reference to grammars are the following. Given a grammar G and a non-terminal U, we define the set of head symbols of derivations of U: the relation U FIRST S holds iff there is a rule U ::= S…. By the definition of transitive closure, we have U FIRST+ S iff there is a chain of one or more rules U ::= S1…, S1 ::= S2…, …, Sn ::= S…. This implies that U FIRST+ S iff U =>+ S….

The set of symbols which end a derivation of some symbol U is defined by the relation LAST: U LAST S iff there is a rule U ::= …S. The transitive closure LAST+ satisfies U LAST+ S iff U =>+ …S.

Finally, the set of symbols which appear in a derivation of a symbol U is defined by the relation SYMB: U SYMB S iff there is a rule U ::= …S…. The transitive closure SYMB+ satisfies U SYMB+ S iff U =>+ …S….

Boolean Matrices and Relations

The use of relations and Boolean matrix theory provides a single algorithm for calculating the different sets that are useful for constructing parsing algorithms. The best representation, in a computer, of a relation over an alphabet is a Boolean matrix B. The elements of this matrix may take only the values 1 or 0 (true or false), i.e. B[i,j] = 1 iff the relation R holds between the elements Si and Sj of the alphabet. The "addition" of n by n Boolean matrices involves "ORing" the matrices elementwise, while "multiplication" is the same as ordinary matrix multiplication except that the operations are logical (* is replaced by 'and' while + is replaced by 'or'). Thus the element D[i,j] of the matrix D = B + C is defined as

D[i,j] = B[i,j] + C[i,j] := if B[i,j] = 1 then 1 else C[i,j]

and the element D[i,j] of the matrix D = BC is defined as

D[i,j] := B[i,1]*C[1,j] + B[i,2]*C[2,j] + … + B[i,n]*C[n,j]

where * is defined by a*b := if a = 0 then 0 else b. Obviously, Boolean matrix addition is associative, A+(B+C) = (A+B)+C, and commutative, A+B = B+A, while multiplication is only associative, A(BC) = (AB)C.
The two operations satisfy the distributive law A(B+C) = AB + AC.

Theorem 1: Let A be an alphabet of n symbols and R be any relation on A. If for two symbols S1 and b we have S1 R+ b, then there exists a positive integer k ≤ n such that S1 R^k b.

Proof: Since S1 R+ b, there exists an integer p > 0 such that S1 R^p b. This means there exist symbols S2, S3, …, Sp in A such that S1RS2, S2RS3, …, S[p-1]RSp and SpRb (by the definition of R^p). Suppose that the smallest such p is greater than n. Then for this smallest p there are two integers i and j with i < j ≤ p such that Si = Sj, since A has only n symbols. But the relations S1RS2, …, S[i-1]RSi, SjRS[j+1], …, S[p-1]RSp, SpRb show that S1 R^k b where k = p-(j-i), in contradiction to the fact that p was the smallest. Thus the hypothesis that p > n is false and the theorem is proved.

Theorem 2: The product of two relations on the same alphabet is given by the product of the Boolean matrices representing the relations.

Proof: Assume two matrices B and C represent two relations P and Q on the alphabet S, and let D = BC. If D[i,j] = 1 then, by the definition of D[i,j], for some k we must have B[i,k] = 1 and C[k,j] = 1. Thus SiPSk and SkQSj, which means (Si,Sj) is in the product PQ of the relations P and Q. Alternatively, if (Si,Sj) is in PQ then there exists a k such that SiPSk and SkQSj, and D[i,j] must be 1. This means the matrix D represents the relation PQ.

Theorem 3: Let B be any n by n Boolean matrix representing a relation R over an alphabet S of n symbols. Then the matrix B+ defined by B+ = B + B^2 + … + B^n represents the transitive closure R+ of R.

Proof: Since, in general, R^n is defined recursively by R^n = R(R^(n-1)) for n > 1, it follows by induction on n that the matrix B^n = BB…B (n times) represents the relation R^n. The theorem then follows from the definition of R+, Theorem 1 and Theorem 2.

Example: From the rules given below, construct the Boolean matrix B representing the relation FIRST. Use this matrix to show the pairs for which the relation FIRST+ holds.

A -> Af|B
B -> DdC|Dc
C -> e
D -> Bf

Assignment
The following algorithm was developed for calculating the matrix B+ = B + B^2 + … + B^n from the matrix B. Prove that this algorithm works.
1. Set a new matrix A = B.
2. Set i := 1.
3. For all j, if A[j,i] = 1 then for k = 1, …, n set A[j,k] := A[j,k] + A[i,k].
4. Add 1 to i.
5. If i ≤ n then go to step 3; otherwise stop.
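Theorem 3 can be checked mechanically. The sketch below (an added illustration, not part of the notes) builds the Boolean matrix of the FIRST relation for the example grammar above and forms B + B^2 + … + B^n to obtain FIRST+; the symbol ordering and the encoding are assumptions of the sketch.

```python
# Symbols of the example grammar A->Af|B, B->DdC|Dc, C->e, D->Bf.
SYMS = ["A", "B", "C", "D", "f", "d", "c", "e"]
IDX = {s: i for i, s in enumerate(SYMS)}
n = len(SYMS)

# B[i][j] = 1 iff SYMS[i] FIRST SYMS[j]: U FIRST S iff a rule U ::= S... exists.
B = [[0] * n for _ in range(n)]
for left, heads in [("A", "AB"), ("B", "DD"), ("C", "e"), ("D", "B")]:
    for s in heads:
        B[IDX[left]][IDX[s]] = 1

def bool_add(X, Y):
    """Elementwise OR of two n x n Boolean matrices."""
    return [[X[i][j] | Y[i][j] for j in range(n)] for i in range(n)]

def bool_mult(X, Y):
    """Matrix product with * read as 'and' and + read as 'or'."""
    return [[int(any(X[i][k] and Y[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

closure, power = B, B
for _ in range(n - 1):              # accumulate B + B^2 + ... + B^n
    power = bool_mult(power, B)
    closure = bool_add(closure, power)

print([(SYMS[i], SYMS[j]) for i in range(n) for j in range(n) if closure[i][j]])
# [('A', 'A'), ('A', 'B'), ('A', 'D'), ('B', 'B'), ('B', 'D'),
#  ('C', 'e'), ('D', 'B'), ('D', 'D')]
```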