LL(k) Parsing

LL(1) parsing table

Let's formalize Wikipedia's Constructing an LL(1) parsing table and An Easy Explanation Of First And Follow Sets.

{ } = set (i.e. not ordered, no duplicates)
[] = sequence (i.e. ordered, duplicates allowed)
<key, value> = hash table record
∅ = empty set
∴ = therefore
| = such that
∈ = element of
⊆ = subset or equal to
⊈ = neither subset nor equal to
∪ = union
→ = replaceable with
\ = set minus
≡ = identical to
≢ = not identical to

Given grammar G:

ε= { ε| ε ∴ ∅ }ε (epsilon) denotes the empty set ∅.
T= { t| t ∈ Terminals }t is a Terminal, and T is the set of Terminals.
N= { n| n ∈ Nonterminals }n is a Nonterminal, and N is the set of Nonterminals.
S= { s| s ∈ (T ∪ N ∪ ε) }s is an element of S (set of Terminals, Nonterminals, and denotation of the empty set).
W= { w| w ∈ { [s1s2, …] | si ∈ S, i ≥ 1 } }w is a sequence of one or more elements of S, and W is the set of sequences.
Note: a sequence containing epsilon must only contain epsilon (not shown in set logic).
G= ( <n, { w }>| n → w, n ∈ N, { w } ⊆ W )Associate each Nonterminal with a set of S sequence(s). Each sequence may replace the Nonterminal.

LL(1) First-set is:

The set of the first Terminal(s) encountered at each sequence (∈ W containing one or more S) in the hierarchy of the grammar.

Fi(w | w ∈ W) = First-set from a sequence (w ∈ W containing one or more S).
ε|_ ∈ ε, _ = head(w) }First-set is epsilon, if the first S is epsilon (i.e. sequence contains only epsilon).
t|t ∈ Tt = head(w) }First-set is the first S, if that S is a Terminal.
Fi(n)|ε ∉ Fi(n)n ∈ Nn = head(w) )First-set is the First-set of the first S, if that S is a Nonterminal, and that First-set does not contain epsilon.
ε|Recurse ⊆ ∅, ε ∈ Fi(n)n ∈ Nn = head(w) }First-set is epsilon, if the first S is a Nonterminal, the First-set of that S contains epsilon, and Recurse is the empty set.
( Recurse|Recurse ⊈ ∅, ε ∈ Fi(n)n ∈ Nn = head(w) )First-set is Recurse, if the first S is a Nonterminal, the First-set of that S contains epsilon, and Recurse is not the empty set.
where Recurse is (Fi(n) \ ε) ∪ Fi(tail(w))First-set, of the first S, epsilon removed, and unioned with the First-set of the remainder of the sequence.
Fi(n | n ∈ N) = fold(map(hash(Gn)Fi), ∪, ∅)First-set of a Nonterminal, is the union of the First-set(s) from each sequence (∈ W containing one or more S) in the grammar for the Nonterminal.

LL(1) Follow-set is:

The set of the first Terminal(s) that follows each Nonterminal encountered at each sequence (∈ W containing one or more S) in the hierarchy of the grammar.

Fo(n | n ∈ N) = fold(G, \g -> fold(map(hash(Gg)Recurse(n, g)), ∪, ∅)) whereFollow-set of a Nonterminal, is the union of Recurse applied to each sequence (∈ W containing one or more S) in the grammar, evaluated for the Nonterminal.
Recurse(n | n ∈ Ng | g ∈ Nw | w ∈ W) = 
|_ ∈ ε, _ = head(w) )Sequence containing only epsilon doesn't contain n.
Recurse(ntail(w))|_ ∈ T, _ = head(w) )First S is a Terminal, thus isn't n. Return the Recurse on remainder of sequence.
Recurse(ntail(w))|head(w) ∈ Nn ≢ head(w) )First S is a Nonterminal, but isn't n. Return the Recurse on remainder of sequence.
Fi(tail(w))|ε ∉ Fi(tail(w))head(w) ∈ Nn ≡ head(w) )First S is n. Return the First-set of the remainder of the sequence, because it doesn't contain epsilon.
Fo(g)|ε ∈ Fi(tail(w))head(w) ∈ Nn ≡ head(w) )First S is n. Return the Follow-set of g, because epsilon is contained in the First-set of the remainder of the sequence.

Context Free Grammar

LL(k) with no First-set/Follow-set conflicts, are Context Free Grammars (CFG). Context free means that a non-terminal will always use the same parse table no matter where in the grammar it appears. If we force non-terminal productions to be context sensitive by forcing a choice between conflicts, then our input language still contains the choices and thus the (aliasing) errors will propagate to the semantics.

For example, the semantic intention error with the dangling IF-ELSE problem is not fixed by forcing a choice between conflicting production rules in an ambiguous grammar, because the input language remains semantically ambiguous or opaque. Ditto the ambiguity between return and following expression. The correct solution is to restructure the grammar, so there are no such conflicting production rules.

However, for the ambiguity between prefix and postfix unary ++ and --, there is no such way to restructure the LL(k) grammar production rules. However, we can detect the ambiguous cases and generate an error in the semantic analysis stage of the compiler, which thus makes the grammar context free.

There can exist semantic ambiguities even in non-ambiguous CFG, when the same terminal has multiple semantic contexts. For example, the semantic ambiguity and errors that result from the use of parenthesis both for precedence and a function call, and when the same operator is used both for adding numbers and concatenating strings.

Imagine a method of parsing that would be similar to LL(1), which would not require Follow-sets, because the concept of epsilon would be eliminated and replaced with a change in the way optional tokens in the production rules are tagged and processed in the parser stack. The idea is to redefine the grammar semantics such that instead of epsilon, each token in the set S would include two enumerations (greedy, ungreedy) and (one, one or more, zero or more, zero or one) attribute, and the processing of the token in the parser stack would blindly (greedily or ungreedily) force the choice in such (possibly unknown) conflicts. The problem is that the above conflicts would not be enumerated by such a variant of the LL(1) grammar generator.