******************************
5.5 Recognizing Common Lexical Structures
Tarring everyone with the same brush
******************************
Computer languages look remarkably similar lexically.
Lexically, then, functional, procedural, declarative,
and object-oriented languages look pretty much the same. Amazing!
[Image: 61.jpg]
If I told you the photos show the same person, would you believe me?
[Image: 61.gif]
That’s great because we have to learn only how to describe identifiers and
integers once and, with little variation, apply them to most programming
languages.
To demonstrate what lexical rules look like, let’s build simple versions of the
common tokens, starting with our friend the humble identifier.
Matching Identifiers
In grammar pseudocode, a basic identifier is a nonempty sequence of uppercase
and lowercase letters.
ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters
As a shorthand for character sets, ANTLR supports the more familiar regular
expression set notation.
ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters
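To see what the ID rule accepts, the pattern can be approximated outside ANTLR. The sketch below uses Python's re module as a stand-in for the lexer rule; ID_PATTERN and is_identifier are names made up for this illustration and are not part of ANTLR.

```python
import re

# Stand-in for the ANTLR rule  ID : [a-zA-Z]+ ;
ID_PATTERN = re.compile(r"[a-zA-Z]+")


def is_identifier(text: str) -> bool:
    """Return True if the whole string is one or more ASCII letters."""
    return ID_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["hello", "Hello", "x1", ""]:
        print(repr(candidate), is_identifier(candidate))
```

Like the simplified rule in the text, this rejects digits and underscores; a real language would usually widen the character set.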
Matching Numbers
Describing integer numbers such as 10 is easy because it’s just a sequence
of digits.
INT : '0'..'9'+ ; // match 1 or more digits
or
INT : [0-9]+ ; // match 1 or more digits
Floating-point numbers are much more complicated, unfortunately, but we
can make a simplified version easily if we ignore exponents.
FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment
DIGIT : [0-9] ; // match single digit
By prefixing the rule with fragment, we let ANTLR know that the
rule will be used only by other lexical rules. It is not a token in and of itself.
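The role of the DIGIT fragment can be imitated with an ordinary string that is spliced into larger patterns but never used as a token pattern on its own. This is only a rough analogy in Python's re module, not how ANTLR implements fragments; classify_number is a hypothetical helper.

```python
import re

# Plays the role of the fragment rule: a building block, not a token.
DIGIT = r"[0-9]"

# INT : [0-9]+ ;
INT_PATTERN = re.compile(rf"{DIGIT}+")
# FLOAT : DIGIT+ '.' DIGIT*  |  '.' DIGIT+ ;
FLOAT_PATTERN = re.compile(rf"{DIGIT}+\.{DIGIT}*|\.{DIGIT}+")


def classify_number(text: str) -> str:
    """Return 'FLOAT', 'INT', or 'NONE' for the whole input string."""
    if FLOAT_PATTERN.fullmatch(text):
        return "FLOAT"
    if INT_PATTERN.fullmatch(text):
        return "INT"
    return "NONE"


if __name__ == "__main__":
    for candidate in ["10", "1.", "39.", "3.14159", ".1", ".", "1e5"]:
        print(candidate, classify_number(candidate))
```

A lone "." and the exponent form "1e5" are rejected, which matches the simplification in the text of ignoring exponents.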
Matching String Literals
The next token that computer languages tend to have in common is the string
literal like "Hello". Most use double quotes, but some use single quotes or even
both (Python). Regardless of the choice of delimiters, we match them using a
rule that consumes everything between the delimiters. In grammar pseudocode,
a string is a sequence of any characters between double quotes.
STRING : '"' .*? '"' ; // match anything in "..."
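The dot wildcard operator matches any single character, so .* would be a loop
that matches any sequence of zero or more characters. Left greedy, that loop
would run past the closing quote, so the rule uses the nongreedy form .*?,
which stops as soon as the text that follows it in the rule (here, the
closing quote) can match. If .*? is confusing, don’t worry about it. Just
remember it as a pattern for matching stuff inside quotes or other delimiters.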
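The difference between a greedy and a nongreedy loop is easiest to see on input that contains two strings. The sketch below uses Python's re module, whose ? suffix has the same nongreedy meaning; note that Python's dot skips newlines by default, unlike ANTLR's wildcard. The sample input is invented.

```python
import re

source = 'print "Hello", "World"'

# Greedy: .* runs to the last quote on the line.
greedy = re.search(r'".*"', source).group()

# Nongreedy: .*? stops at the first quote that can close the string.
nongreedy = re.search(r'".*?"', source).group()

if __name__ == "__main__":
    print(greedy)     # "Hello", "World"
    print(nongreedy)  # "Hello"
```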
To support the common escape characters, we need something like
the following:
STRING: '"' (ESC|.)*? '"' ;
fragment
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
ANTLR itself needs to escape the escape character, so that’s why we need \\
to specify the backslash character.
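The effect of adding ESC can be checked on a string that contains an escaped quote. The following is a sketch in Python's re module, assuming the regex alternation is a fair stand-in for (ESC|.)*?; the pattern names and the sample input are made up for illustration.

```python
import re

# Stand-in for: STRING : '"' (ESC|.)*? '"' ;   ESC : '\\"' | '\\\\' ;
STRING_PATTERN = re.compile(r'"(?:\\"|\\\\|.)*?"', re.DOTALL)
# Stand-in for the simpler rule: STRING : '"' .*? '"' ;
SIMPLE_PATTERN = re.compile(r'".*?"', re.DOTALL)

# Raw string, so each backslash below is a real character in the input.
source = r'"a \"quoted\" word" rest'

with_escapes = STRING_PATTERN.match(source).group()
without_escapes = SIMPLE_PATTERN.match(source).group()

if __name__ == "__main__":
    print(with_escapes)     # "a \"quoted\" word"
    print(without_escapes)  # "a \"
```

Without the escape alternatives, the match ends at the first quote character even though it is preceded by a backslash.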
Matching Comments and Whitespace
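The lexer should match comments and whitespace but not hand them to the
parser; the -> skip command tells ANTLR to throw the matched text away. Here
is how to match both single-line and multiline comments for C-derived languages: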
LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ; // Match "//" stuff '\n'
COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"
In LINE_COMMENT, .*? consumes everything after // until it sees a newline
(optionally preceded by a carriage return, to match Windows-style line endings).
In COMMENT, .*? consumes everything after /* and before the terminating */.
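As a rough illustration of what the two comment rules consume, the same patterns can be written for Python's re module and used to delete comments from a snippet. This only mimics the effect of -> skip on the text; it ignores strings and other tokens, so it is not a substitute for a real lexer. COMMENT_PATTERN and strip_comments are hypothetical names.

```python
import re

# One alternative per rule:
#   LINE_COMMENT : '//' .*? '\r'? '\n'
#   COMMENT      : '/*' .*? '*/'
COMMENT_PATTERN = re.compile(r"//.*?\r?\n|/\*.*?\*/", re.DOTALL)


def strip_comments(source: str) -> str:
    """Remove both comment forms in a single left-to-right pass."""
    return COMMENT_PATTERN.sub("", source)


if __name__ == "__main__":
    code = "int x; // counter\nint y; /* multi\nline */ int z;\n"
    print(repr(strip_comments(code)))
```

Because the line-comment pattern includes the newline, removing the comment also joins the next line onto the current one, just as the ANTLR rule consumes the newline as part of the token.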
Here is how to tell ANTLR to throw out
whitespace:
WS : (' '|'\t'|'\r'|'\n')+ -> skip ; // match 1-or-more whitespace but discard
or
WS : [ \t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
Believe
it or not, that’s a great start on a lexer for even a big programming language.
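To get a feel for how these rules work together, here is a toy tokenizer in Python built from regex versions of the rules above. It is a sketch under stated assumptions, not ANTLR's algorithm: ANTLR prefers the longest match, while Python's alternation takes the first alternative that matches, so FLOAT is listed before INT to compensate. TOKEN_SPEC, MASTER, and tokenize are names invented for this example.

```python
import re

# (token name, pattern, keep?) ; keep=False mimics "-> skip".
TOKEN_SPEC = [
    ("LINE_COMMENT", r"//.*?\r?\n", False),
    ("COMMENT", r"/\*.*?\*/", False),
    ("WS", r"[ \t\r\n]+", False),
    ("FLOAT", r"[0-9]+\.[0-9]*|\.[0-9]+", True),
    ("INT", r"[0-9]+", True),
    ("ID", r"[a-zA-Z]+", True),
    ("STRING", r'"(?:\\"|\\\\|.)*?"', True),
]

# One named group per rule, tried in the order listed.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern, _ in TOKEN_SPEC),
    re.DOTALL,
)
KEEP = {name for name, _, keep in TOKEN_SPEC if keep}


def tokenize(source: str) -> list:
    """Return (token name, text) pairs, dropping comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup in KEEP:
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('count 42 3.14 "hi" // note\n/* skip */ name'):
        print(token)
```

Operators and punctuation are deliberately left out, so input such as x = 1 raises an error at the = sign.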
It almost feels as if the author understands Chinese culture.
[Image: 62.png]
Here’s a lexer starter kit we can use as a reference later:
[Images: 62.JPG, 63.JPG, 64.JPG]
Before
we move on, though, there are two important issues to consider. First, it’s
not always obvious where to draw the line between what we match in the
parser and what we match in the lexer. Second, ANTLR places a few constraints
on our grammar rules that we should know about.