|
Line 0
Link Here
|
|
|
1 |
.\" @(#)ss4 6.1 (Berkeley) 5/8/86 |
| 2 |
.\" |
| 3 |
.SH |
| 4 |
4: How the Parser Works |
| 5 |
.PP |
| 6 |
Yacc turns the specification file into a C program, which |
| 7 |
parses the input according to the specification given. |
| 8 |
The algorithm used to go from the |
| 9 |
specification to the parser is complex, and will not be discussed |
| 10 |
here (see |
| 11 |
the references for more information). |
| 12 |
The parser itself, however, is relatively simple, |
| 13 |
and understanding how it works, while |
| 14 |
not strictly necessary, will nevertheless make |
| 15 |
treatment of error recovery and ambiguities much more |
| 16 |
comprehensible. |
| 17 |
.PP |
| 18 |
The parser produced by Yacc consists |
| 19 |
of a finite state machine with a stack. |
| 20 |
The parser is also capable of reading and remembering the next |
| 21 |
input token (called the |
| 22 |
.I lookahead |
| 23 |
token). |
| 24 |
The |
| 25 |
.I "current state" |
| 26 |
is always the one on the top of the stack. |
| 27 |
The states of the finite state machine are given |
| 28 |
small integer labels; initially, the machine is in state 0, |
| 29 |
the stack contains only state 0, and no lookahead token has been read. |
| 30 |
.PP |
| 31 |
The machine has only four actions available to it, called |
| 32 |
.I shift , |
| 33 |
.I reduce , |
| 34 |
.I accept , |
| 35 |
and |
| 36 |
.I error . |
| 37 |
A move of the parser is done as follows: |
| 38 |
.IP 1. |
| 39 |
Based on its current state, the parser decides |
| 40 |
whether it needs a lookahead token to decide |
| 41 |
what action should be done; if it needs one, and does |
| 42 |
not have one, it calls |
| 43 |
.I yylex |
| 44 |
to obtain the next token. |
| 45 |
.IP 2. |
| 46 |
Using the current state, and the lookahead token |
| 47 |
if needed, the parser decides on its next action, and |
| 48 |
carries it out. |
| 49 |
This may result in states being pushed onto the stack, or popped off of |
| 50 |
the stack, and in the lookahead token being processed |
| 51 |
or left alone. |
| 52 |
.PP |
| 53 |
The |
| 54 |
.I shift |
| 55 |
action is the most common action the parser takes. |
| 56 |
Whenever a shift action is taken, there is always |
| 57 |
a lookahead token. |
| 58 |
For example, in state 56 there may be |
| 59 |
an action: |
| 60 |
.DS |
| 61 |
IF shift 34 |
| 62 |
.DE |
| 63 |
which says, in state 56, if the lookahead token is IF, |
| 64 |
the current state (56) is pushed down on the stack, |
| 65 |
and state 34 becomes the current state (on the |
| 66 |
top of the stack). |
| 67 |
The lookahead token is cleared. |
| 68 |
.PP |
| 69 |
The |
| 70 |
.I reduce |
| 71 |
action keeps the stack from growing without |
| 72 |
bounds. |
| 73 |
Reduce actions are appropriate when the parser has seen |
| 74 |
the right hand side of a grammar rule, |
| 75 |
and is prepared to announce that it has seen |
| 76 |
an instance of the rule, replacing the right hand side |
| 77 |
by the left hand side. |
| 78 |
It may be necessary to consult the lookahead token |
| 79 |
to decide whether to reduce, but |
| 80 |
usually it is not; in fact, the default |
| 81 |
action (represented by a ``.'') is often a reduce action. |
| 82 |
.PP |
| 83 |
Reduce actions are associated with individual grammar rules. |
| 84 |
Grammar rules are also given small integer |
| 85 |
numbers, leading to some confusion. |
| 86 |
The action |
| 87 |
.DS |
| 88 |
\fB.\fR reduce 18 |
| 89 |
.DE |
| 90 |
refers to |
| 91 |
.I "grammar rule" |
| 92 |
18, while the action |
| 93 |
.DS |
| 94 |
IF shift 34 |
| 95 |
.DE |
| 96 |
refers to |
| 97 |
.I state |
| 98 |
34. |
| 99 |
.PP |
| 100 |
Suppose the rule being reduced is |
| 101 |
.DS |
| 102 |
A \fB:\fR x y z ; |
| 103 |
.DE |
| 104 |
The reduce action depends on the |
| 105 |
left hand symbol (A in this case), and the number of |
| 106 |
symbols on the right hand side (three in this case). |
| 107 |
To reduce, first pop off the top three states |
| 108 |
from the stack |
| 109 |
(In general, the number of states popped equals the number of symbols on the |
| 110 |
right side of the rule). |
| 111 |
In effect, these states were the ones |
| 112 |
put on the stack while recognizing |
| 113 |
.I x , |
| 114 |
.I y , |
| 115 |
and |
| 116 |
.I z , |
| 117 |
and no longer serve any useful purpose. |
| 118 |
After popping these states, a state is uncovered |
| 119 |
which was the state the parser was in before beginning to |
| 120 |
process the rule. |
| 121 |
Using this uncovered state, and the symbol |
| 122 |
on the left side of the rule, perform what is in |
| 123 |
effect a shift of A. |
| 124 |
A new state is obtained, pushed onto the stack, and parsing continues. |
| 125 |
There are significant differences between the processing of |
| 126 |
the left hand symbol and an ordinary shift of a token, |
| 127 |
however, so this action is called a |
| 128 |
.I goto |
| 129 |
action. |
| 130 |
In particular, the lookahead token is cleared by a shift, and |
| 131 |
is not affected by a goto. |
| 132 |
In any case, the uncovered state contains an entry such as: |
| 133 |
.DS |
| 134 |
A goto 20 |
| 135 |
.DE |
| 136 |
causing state 20 to be pushed |
| 137 |
onto the stack, and become the current state. |
| 138 |
.PP |
| 139 |
In effect, the reduce action ``turns back the clock'' in the parse, |
| 140 |
popping the states off the stack to go back to the |
| 141 |
state where the right hand side of the rule was first seen. |
| 142 |
The parser then behaves as if it had seen the left side at that time. |
| 143 |
If the right hand side of the rule is empty, |
| 144 |
no states are popped off of the stack: the uncovered state |
| 145 |
is in fact the current state. |
| 146 |
.PP |
| 147 |
The reduce action is also important in the treatment of user-supplied |
| 148 |
actions and values. |
| 149 |
When a rule is reduced, the code supplied with the rule is executed |
| 150 |
before the stack is adjusted. |
| 151 |
In addition to the stack holding the states, another stack, |
| 152 |
running in parallel with it, holds the values returned |
| 153 |
from the lexical analyzer and the actions. |
| 154 |
When a shift takes place, the external variable |
| 155 |
.I yylval |
| 156 |
is copied onto the value stack. |
| 157 |
After the return from the user code, the reduction is carried out. |
| 158 |
When the |
| 159 |
.I goto |
| 160 |
action is done, the external variable |
| 161 |
.I yyval |
| 162 |
is copied onto the value stack. |
| 163 |
The pseudo-variables $1, $2, etc., refer to the value stack. |
| 164 |
.PP |
| 165 |
The other two parser actions are conceptually much simpler. |
| 166 |
The |
| 167 |
.I accept |
| 168 |
action indicates that the entire input has been seen and |
| 169 |
that it matches the specification. |
| 170 |
This action appears only when the lookahead token is |
| 171 |
the endmarker, and indicates that the parser has successfully |
| 172 |
done its job. |
| 173 |
The |
| 174 |
.I error |
| 175 |
action, on the other hand, represents a place where the parser |
| 176 |
can no longer continue parsing according to the specification. |
| 177 |
The input tokens it has seen, together with the lookahead token, |
| 178 |
cannot be followed by anything that would result |
| 179 |
in a legal input. |
| 180 |
The parser reports an error, and attempts to recover the situation and |
| 181 |
resume parsing: the error recovery (as opposed to the detection of error) |
| 182 |
will be covered in Section 7. |
| 183 |
.PP |
| 184 |
It is time for an example! |
| 185 |
Consider the specification |
| 186 |
.DS |
| 187 |
%token DING DONG DELL |
| 188 |
%% |
| 189 |
rhyme : sound place |
| 190 |
; |
| 191 |
sound : DING DONG |
| 192 |
; |
| 193 |
place : DELL |
| 194 |
; |
| 195 |
.DE |
| 196 |
.PP |
| 197 |
When Yacc is invoked with the |
| 198 |
.B \-v |
| 199 |
option, a file called |
| 200 |
.I y.output |
| 201 |
is produced, with a human-readable description of the parser. |
| 202 |
The |
| 203 |
.I y.output |
| 204 |
file corresponding to the above grammar (with some statistics |
| 205 |
stripped off the end) is: |
| 206 |
.DS |
| 207 |
state 0 |
| 208 |
$accept : \_rhyme $end |
| 209 |
|
| 210 |
DING shift 3 |
| 211 |
. error |
| 212 |
|
| 213 |
rhyme goto 1 |
| 214 |
sound goto 2 |
| 215 |
|
| 216 |
state 1 |
| 217 |
$accept : rhyme\_$end |
| 218 |
|
| 219 |
$end accept |
| 220 |
. error |
| 221 |
|
| 222 |
state 2 |
| 223 |
rhyme : sound\_place |
| 224 |
|
| 225 |
DELL shift 5 |
| 226 |
. error |
| 227 |
|
| 228 |
place goto 4 |
| 229 |
|
| 230 |
state 3 |
| 231 |
sound : DING\_DONG |
| 232 |
|
| 233 |
DONG shift 6 |
| 234 |
. error |
| 235 |
|
| 236 |
state 4 |
| 237 |
rhyme : sound place\_ (1) |
| 238 |
|
| 239 |
. reduce 1 |
| 240 |
|
| 241 |
state 5 |
| 242 |
place : DELL\_ (3) |
| 243 |
|
| 244 |
. reduce 3 |
| 245 |
|
| 246 |
state 6 |
| 247 |
sound : DING DONG\_ (2) |
| 248 |
|
| 249 |
. reduce 2 |
| 250 |
.DE |
| 251 |
Notice that, in addition to the actions for each state, there is a |
| 252 |
description of the parsing rules being processed in each |
| 253 |
state. The \_ character is used to indicate |
| 254 |
what has been seen, and what is yet to come, in each rule. |
| 255 |
Suppose the input is |
| 256 |
.DS |
| 257 |
DING DONG DELL |
| 258 |
.DE |
| 259 |
It is instructive to follow the steps of the parser while |
| 260 |
processing this input. |
| 261 |
.PP |
| 262 |
Initially, the current state is state 0. |
| 263 |
The parser needs to refer to the input in order to decide |
| 264 |
between the actions available in state 0, so |
| 265 |
the first token, |
| 266 |
.I DING , |
| 267 |
is read, becoming the lookahead token. |
| 268 |
The action in state 0 on |
| 269 |
.I DING |
| 270 |
is |
| 271 |
is ``shift 3'', so state 3 is pushed onto the stack, |
| 272 |
and the lookahead token is cleared. |
| 273 |
State 3 becomes the current state. |
| 274 |
The next token, |
| 275 |
.I DONG , |
| 276 |
is read, becoming the lookahead token. |
| 277 |
The action in state 3 on the token |
| 278 |
.I DONG |
| 279 |
is ``shift 6'', |
| 280 |
so state 6 is pushed onto the stack, and the lookahead is cleared. |
| 281 |
The stack now contains 0, 3, and 6. |
| 282 |
In state 6, without even consulting the lookahead, |
| 283 |
the parser reduces by rule 2. |
| 284 |
.DS |
| 285 |
sound : DING DONG |
| 286 |
.DE |
| 287 |
This rule has two symbols on the right hand side, so |
| 288 |
two states, 6 and 3, are popped off of the stack, uncovering state 0. |
| 289 |
Consulting the description of state 0, looking for a goto on |
| 290 |
.I sound , |
| 291 |
.DS |
| 292 |
sound goto 2 |
| 293 |
.DE |
| 294 |
is obtained; thus state 2 is pushed onto the stack, |
| 295 |
becoming the current state. |
| 296 |
.PP |
| 297 |
In state 2, the next token, |
| 298 |
.I DELL , |
| 299 |
must be read. |
| 300 |
The action is ``shift 5'', so state 5 is pushed onto the stack, which now has |
| 301 |
0, 2, and 5 on it, and the lookahead token is cleared. |
| 302 |
In state 5, the only action is to reduce by rule 3. |
| 303 |
This has one symbol on the right hand side, so one state, 5, |
| 304 |
is popped off, and state 2 is uncovered. |
| 305 |
The goto in state 2 on |
| 306 |
.I place , |
| 307 |
the left side of rule 3, |
| 308 |
is state 4. |
| 309 |
Now, the stack contains 0, 2, and 4. |
| 310 |
In state 4, the only action is to reduce by rule 1. |
| 311 |
There are two symbols on the right, so the top two states are popped off, |
| 312 |
uncovering state 0 again. |
| 313 |
In state 0, there is a goto on |
| 314 |
.I rhyme |
| 315 |
causing the parser to enter state 1. |
| 316 |
In state 1, the input is read; the endmarker is obtained, |
| 317 |
indicated by ``$end'' in the |
| 318 |
.I y.output |
| 319 |
file. |
| 320 |
The action in state 1 when the endmarker is seen is to accept, |
| 321 |
successfully ending the parse. |
| 322 |
.PP |
| 323 |
The reader is urged to consider how the parser works |
| 324 |
when confronted with such incorrect strings as |
| 325 |
.I "DING DONG DONG" , |
| 326 |
.I "DING DONG" , |
| 327 |
.I "DING DONG DELL DELL" , |
| 328 |
etc. |
| 329 |
A few minutes spend with this and other simple examples will |
| 330 |
probably be repaid when problems arise in more complicated contexts. |