-《lex & yacc 2nd》:下載地址參考 http://blog.csdn.net/a_flying_bird/article/details/52486815
本文即此書的學習筆記。
lex文件通常使用的后綴名: .l, .ll, .lex。——實際上,可以是任意的名稱。
文件內容分為三部分,各個部分之間以 %% 分隔:
%{ /* part 1: Definition Section. e.g.: Global declaration of C. */%}%%/* part 2: Rules section. Rule = Pattern + Action. */%%part 3: C codes.注意,%} 不要寫成 }% 了,否則 PRemature EOF。
%{ 和 %} 之間的內容會原封不動地拷貝到最后生成的c文件中,所以這里可以是任何合法的C代碼。通常而言,這里放lex文件后面C代碼要用到的一些東西。
使用lex命令,把lex文件轉換成c文件(lex.yy.c);在生成可執行文件的時候,要鏈接庫文件l。示例:
lex simplest.l gcc lex.yy.c -ll -o test對應En Page 2.
代碼(simplest.l):
%%.|/n ECHO;%%編譯(把lex文件轉換成c文件)&鏈接&運行:
$ lssimplest.l$ lex simplest.l $ lslex.yy.c simplest.l$ gcc lex.yy.c -ll -o test$ ./test The simplest lex program. ------ 鍵盤輸入內容The simplest lex program. ------ 程序回顯結果^C$這個例子可以識別指定的這些單詞,其他的不認識的直接回顯。- 對應原書 ch1-02.l
代碼:
%{/* * this sample demonstrates (very) simple recognition: * a verb, or not a verb. */%}%%[/t ]+ /* ignore whitespace */ ;is |am |are |were |was |be |being |been |do |does |did |will |would |should |can |could |has |have |had |go {printf("%s: is a verb/n", yytext);}[a-zA-Z]+ {printf("%s: is not verb/n", yytext);}.|/n {ECHO; /* normal default anyway */ }%%int main(){ yylex(); return 0;}運行:
$ lex recoginzing_word.l $ gcc lex.yy.c -ll -o test$ ./test I am a student. You are a teacher. ------ 鍵盤輸入內容I: is not verbam: is a verba: is not verbstudent: is not verb.You: is not verbare: is a verba: is not verbteacher: is not verb.^C$lex文件的三部分:definition section, rules section, user subroutines section.
definition section可以有一段”%{“和”%}”,這中間用來放C代碼,比如#include,函數原型,全局變量等等。在由lex生成lex.yy.c的時候,這部分原封不動拷貝到C文件中。
rules section: 每個規則由兩部分組成,即 pattern + action. 兩者由空格分開。其中pattern是正則表達式語法。lexer在識別到某個pattern后,就會執行其對應的action?!猘ction: { C codes. }
user subroutines section: 拷貝到.c文件的最后。
特殊的action:
“;”: 同C語言的空余句,即什么也不做?!苯雍雎赃@些輸入“ECHO;”: 缺省行為,將匹配的字符串打印到輸出文件中(stdout,回顯)?!皘”: 使用下一個pattern的action?!⒁?| action的語法,會在pattern后面有一個空格。而作為正則表達式的|則不會有空格。注意1: ;和ECHO;的區別:前者是忽略輸入,后者是打印到輸出??梢詫⑹纠械腅CHO;改成;后觀察輸出的變化情況。
注意2: | action不能像下面這種方法寫到同一行:
had | go {printf("%s: is a verb/n", yytext);}變量:
yytext: 存儲的是匹配到的字符串,其類型可以在生成的.c中看到,即 extern char *yytext;無歧義規則:每個輸入僅匹配一次 + 最長匹配。英文描述如下:
Lex patterns only match a given input characer or string once.Lex executes th action for the longest possible match for the current input.缺省main:
這里的例子中,定義的main()調用了yylex()。yylex()是lex定義的函數,缺省情況下,如果lex文件中沒有定義main()函數,lex系統有一個缺省的main,也會去調用yylex()。
程序退出:
缺省情況下,yylex()只有處理了所有的輸入的時候,才會退出。對于控制臺輸入,則要等到Ctrl+C。當然,用戶也可以主動return,即在action中增加return語句。為此,可以增加如下一個規則作驗證:
quit {printf("Program will exit normally./n"); return 0;}注意:這句話寫到a-zA-Z]+的前面,否則 warning, rule cannot be matched。
可以修改下面兩點,做對比分析:
[/t ]+ {printf("%s: white space/n", yytext);}.|/n {printf("%s: Invalid word/n", yytext);}示例:——注意觀察最后有一個換行符。
I am a student. You are a teacher. !@#$%^&*I: is not verb : white spaceam: is a verb : white spacea: is not verb : white spacestudent: is not verb.: Invalid word : white spaceYou: is not verb : white spaceare: is a verb : white spacea: is not verb : white spaceteacher: is not verb.: Invalid word : white space!: Invalid word@: Invalid word#: Invalid word$: Invalid word%: Invalid word^: Invalid word&: Invalid word*: Invalid word: Invalid word對應 ch1-03.l
可以識別出動詞、副詞、介詞等等?!恍枰黾訉膔ules即可。
代碼:
%{/* * this sample demonstrates (very) simple recognition: * a verb, or not a verb. */%}%%[/t ]+ {printf("%s: white space/n", yytext);}is |am |are |were |was |be |being |been |do |does |did |will |would |should |can |could |has |have |had |go {printf("%s: is a verb/n", yytext);}very | simple | gently | quietly | calmly | angrily {printf("%s: is an adverb/n", yytext);}to |from |behind |above |below | between {printf("%s: is a preposition/n", yytext);}if | then | and | but | or {printf("%s: is a conjunction/n", yytext);}their | my | your | his | her | its {printf("%s: is a adjective/n", yytext);}I | you | he | she | we | they {printf("%s: is a pronoun/n", yytext);}QUIT { printf("Program will exit normally./n"); return 0; }[a-zA-Z]+ {printf("%s: don't recognize/n", yytext);}.|/n {printf("%s: Invalid word/n", yytext);}%%int main(){ yylex(); return 0;}運行:
he is a student. and he is a teacher. QUIT (ENTER)he: is a pronoun : white spaceis: is a verb : white spacea: don't recognize : white spacestudent: don't recognize.: Invalid word : white spaceand: is a conjunction : white spacehe: is a pronoun : white spaceis: is a verb : white spacea: don't recognize : white spaceteacher: don't recognize.: Invalid word : white spaceProgram will exit normally.對應 ch1-03.l, 這個例子說明如何在lex中寫更復雜的C代碼。
前面的例子是把每個單詞都定義在lex文件中,接下來對其優化。
比如,可以在文件中按照特定語法來定義單詞的詞性:
noun dog cat horse cowverb chew eat lick即每行開頭一個單詞用來定義詞性,接下來的每個單詞都屬于該詞性。如此,可以在文件中作這種定義。當然,具體到這里的示例代碼,暫時不處理文件輸入,而仍然從控制臺輸入。這時,就有兩種輸入:
定義:即首字母表示詞性,接下來是一系列屬于該詞性的單詞;識別:同前一個例子,要求識別出每個單詞的詞性。代碼:
%{#include <stdbool.h>#include <string.h>#include <stdio.h>#include <stdlib.h>/* * Word recognizer with a symbol table. */enum { LOOKUP = 0, /* default - looking rather than defining. */ VERB, ADJ, ADV, NOUN, PREP, PRON, CONJ};int state; // global variable, default to 0(LOOKUP).bool add_word(int type, char *word);int lookup_word(char *word);%}%%[/t ]+ ; /* ignore whitespace *//n {state = LOOKUP;} // end of line, return to default state. /* Whenever a line starts with a reserved part of speech name */ /* start defining words of that type */^verb {state = VERB;}^adj {state = ADJ;}^adv {state = ADV;}^noun {state = NOUN;}^prep {state = PREP;}^pron {state = PRON;}^conj {state = CONJ;} /* a normal word, define it or look it up */[a-zA-Z]+ { if (state != LOOKUP) { /* define the current word */ add_word(state, yytext); } else { switch(lookup_word(yytext)) { case VERB: printf("%s: verb/n", yytext); break; case ADJ: printf("%s: adjective/n", yytext); break; case ADV: printf("%s: adverb/n", yytext); break; case NOUN: printf("%s: noun/n", yytext); break; case PREP: printf("%s: preposition/n", yytext); break; case PRON: printf("%s: pronoun/n", yytext); break; case CONJ: printf("%s: conjunction/n", yytext); break; default: printf("%s: don't recognize/n", yytext); break; } } }[,:.] {printf("%s: punctuation, ignored./n", yytext);}. {printf("%s: invalid char/n", yytext);}%% int main(){ yylex(); return 0;}/* define a linked list of words and types */struct word { char *word_name; int word_type; struct word *next;};struct word *word_list; /* first element in word list */bool add_word(int type, char *word){ struct word *wp; // wp: word pointer if (lookup_word(word) != LOOKUP) { printf("!!! warning: word %s already defined./n", word); return false; } /* word not there, allocate a new entry and link it on the list */ wp = (struct word*)malloc(sizeof(struct word)); wp->next = word_list; wp->word_name = (char*)malloc(strlen(word) + 1); strcpy(wp->word_name, word); wp->word_type = type; word_list = wp; return true;}int lookup_word(char *word){ struct word *wp = word_list; for (; wp; wp = wp->next) { if (strcmp(wp->word_name, word) == 0) { return wp->word_type; } } return LOOKUP;}這里的枚舉值有兩個含義:
狀態。缺省是LOOKUP狀態,即對當前輸入行的每個單詞,在詞庫/鏈表中查找其詞性(lookup_word),然后打印出來。但如果每一行的第一個單詞是noun/verb等保留字,則說明要進入defining狀態(細分為VERB等狀態),保留字后續的各個單詞將會添加到詞庫/鏈表中(add_word)。——在添加詞庫的時候,會先檢查該單詞是否已經入庫。類型:詞庫中,每個單詞每個單詞對應的詞性用VERB等表示。運行:
noun pet dog cat cats [ENTER]verb is are [ENTER]adj my his their [ENTER]my pet is dog. their pets are cats. that's ok. [ENTER]my: adjectivepet: nounis: verbdog: noun.: punctuation, ignored.their: adjectivepets: don't recognizeare: verbcats: noun.: punctuation, ignored.that: don't recognize': invalid chars: don't recognizeok: don't recognize.: punctuation, ignored.^C前面的例子把一串字符串識別成了一個個單詞,接下來就是識別句子。
詞法分析:從輸入字符流中識別出一個個單詞,就是所謂的詞法分析,輸出是token。其關鍵就是定義詞法規則(正則表達式);語法分析:在得到一個個單詞(包括詞性)之后,就是做更高級的分析,比如某些詞連在一起是否構成了一個正確的句子?!鱾€token如何組合或搭配在一起。對于不同的token 組合執行不同的action。現在分析,如何由前面得到的noun&pronoun&verb等構造出句子(示例):
主語:(假定只能是)名詞或代詞,即 subject -> noun | pronoun賓語:(假定只能是)名詞,即 object -> noun句子(主謂賓):謂語只支持動詞形式,即 sentence -> subject verb object.這里的subject&object&sentence就是基于詞法分析得到的noun&pronoun&verb等token而構造出來的新的symbol。
在yacc&lex系統中,詞法分析(lex/lexer)和語法分析(yacc/parser)是相對獨立的兩套子系統。詞法分析對應的(庫)函數是yylex(),這個函數對輸入的字符流做詞法分析,然后生成一個個token。語法分析對應的函數是parser(),其輸入是yylex()產生的token。所以,要把lex/lexer/yylex()的輸出作為yacc/parser/parser()的輸入。
yylex()的原型:
int yylex (void);這里的關鍵就在于yylex()的返回值,其表示了當前識別的token的類別。當parser()需要一個token的時候,就調用yylex(),根據其返回值,就知道這個token的類別,從而做進一步的處理。
需要注意的是,并非lexer要給parser返回所有的token。比如,注釋部分或空白符號就不需要傳給parser,或者說parser對此不感興趣。這種情況下,lexer直接丟棄即可。
既然yacc和lex基于token通信,自然就需達成一致的規定。這就是所謂的token codes,即每一類token規定一個token code。在yacc&lex系統中,是由yacc來定義token codes,然后lex的代碼include進來。具體地,
在yacc中用%token NOUN VERB語法定義token codesyacc -d test.y 會生成y.tab.c和y.tab.h兩個文件,其中后者就包括了token codes的宏定義在lex中include這個y.tab.h文件。注:取值為0的token code表示結束輸入(a logical end of input)。
此文件自動生成,如下:
/* A Bison parser, made by GNU Bison 2.3. *//* Skeleton interface for Bison's Yacc-like parsers in C ... This special exception was added by the Free Software Foundation in version 2.2 of Bison. *//* Tokens. */#ifndef YYTOKENTYPE# define YYTOKENTYPE /* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */ enum yytokentype { NOUN = 258, PRONOUN = 259, VERB = 260, ADVERB = 261, ADJECTIVE = 262, PREPOSITION = 263, CONJUNCTION = 264 };#endif/* Tokens. */#define NOUN 258#define PRONOUN 259#define VERB 260#define ADVERB 261#define ADJECTIVE 262#define PREPOSITION 263#define CONJUNCTION 264#if ! defined YYSTYPE && ! defined YYSTYPE_IS_DECLAREDtypedef int YYSTYPE;# define yystype YYSTYPE /* obsolescent; will be withdrawn */# define YYSTYPE_IS_DECLARED 1# define YYSTYPE_IS_TRIVIAL 1#endifextern YYSTYPE yylval;運行:
noun dogverb ispron itit is dogit: pronounsubject of a pronoun.is: verbverb.dog: nounobject of a noun.Sentence is valid.或者:
noun dog dogsverb is arepron it theyit is dog they are dogs.it: pronounsubject of a pronoun.is: verbverb.dog: nounobject of a noun.Sentence is valid.they: pronounsyntax error運行:
$ ./test noun dogverb ispron itit is dog.it: pronounis: verbdog: nounSentence is valid.it is dog.it: pronounis: verbdog: nounSentence is valid.noun dogsverb arepron theythey are dogs.they: pronounare: verbdogs: nounSentence is valid.改成如下的代碼會運行錯誤:
int main(){ //while(!feof(stdin/*yyin*/)) { for (;;) { yyparse(); }}運行:
$ ./test noun dogverb ispron itit is dogit: pronounis: verbdog: nounSentence is valid.it is dogit: pronounsyntax erroris: verbsyntax errordog: noun要從文件中讀取,需要使用全局變量yyin。如下這種方式無效:
//extern FILE *yyin;int main(){ FILE* f = NULL; f = fopen("test.txt", "rb"); if (NULL == f) { printf("Open file failed./n"); return 1; } printf("Open file successfully./n"); while(!feof(f)) { yyparse(); }}注:在yy.lex.c中,使用的是yyin全局變量。該變量初始化為0(NULL)。如果用戶沒有更改yyin,會程序跑起來之后會自動設置為stdin。
正確代碼:
extern FILE *yyin;int main(){ yyin = fopen("test.txt", "rb"); if (NULL == yyin) { printf("Open file failed./n"); return 1; } printf("Open file successfully./n"); while(!feof(yyin)) { yyparse(); }}測試文件test.txt的內容:
noun dog dogsverb is arepron it theyit is dog.they are dogs.運行:
$ ./test Open file successfully.it: pronounis: verbdog: nounSentence is valid.they: pronounsyntax error$對應 ch1-06.y
新聞熱點
疑難解答