---
title: Parsers
---

This rambles a lot. Thoughts in progress: Summarizing my thoughts here at the top.

Linux scripts started off using lexing for parsing, resulting in complex and incomprehensible semantics, producing unexpected results. (Try naming a file `-r`, or a directory with spaces in the name.)

They are rapidly converging in actual usage to the operator precedence syntax and semantics\
`command1 subcommand arg1 … argn infixoperator command2 subcommand …`

Which is parsed as\
`(staticclass1.staticmethod(arg1 … argn)) infixoperator (staticclass2.staticmethod(…))`

With line feed acting as a `}{` operator, start of file acting as a `{` operator, and end of file acting as a `}` operator, suggesting that in a sane language, indent increase should act as a `{` operator, and indent decrease as a `}` operator.

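To make that reading concrete, here is a minimal sketch, not any real shell's parser: whitespace lexes out the words, `|` is the sole infix operator, and the grep/sort/uniq command line is just sample input.

```cpp
// Minimal sketch: lex a command line on spaces, treat `|` as a
// low-precedence infix operator, and print the parenthesization that
// the operator-precedence reading implies.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::string line = "grep -r pattern . | sort | uniq -c";
    std::istringstream in(line);
    std::vector<std::vector<std::string>> commands(1);
    for (std::string tok; in >> tok;) {
        if (tok == "|") commands.emplace_back();  // infix operator: new operand
        else commands.back().push_back(tok);      // word: argument of current command
    }
    // Print as a left-leaning tree: ((cmd1 | cmd2) | cmd3)
    std::string tree;
    for (size_t i = 0; i < commands.size(); ++i) {
        std::string group = "(";
        for (size_t j = 0; j < commands[i].size(); ++j)
            group += (j ? " " : "") + commands[i][j];
        group += ")";
        tree = i == 0 ? group : "(" + tree + " | " + group + ")";
    }
    std::cout << tree << "\n";  // (((grep -r pattern .) | (sort)) | (uniq -c))
}
```
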
Command line syntax sucks, because programs interpret their command lines using a simple lexer, which lexes on spaces. Universal resource identifier syntax sucks, because it was originally constructed so that it could be a command line argument, hence no spaces, and because it was designed to be parsed by a lexer.

But EBNF parsers also suck, because they do not parse the same way humans do. Most actual programs can be parsed by a simple parser, even though the language in principle requires a more powerful parser, because humans do not use the nightmarish full power of the grammar that an EBNF definition winds up defining.

Note that the [LLVM language creation tools](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/) tutorial does not use an EBNF parser. These tools also make creating a new language with JIT semantics very easy.

We are programming in languages that are not parsed the way the programmer is parsing them.

Programming languages ignore whitespace, because programmers tend to express their meaning with whitespace for the human reader, and whitespace grammar is not altogether parallel to the EBNF grammar. There is a mismatch in grammars.

Seems to me that human parsing is a combination of low level lexing, Pratt parsing on operator right and left binding power, and a higher level of grouping that works like lexing. Words are lexed out by spaces and punctuation and grouped by operator binding power, with operator recognition taking into account the types on the stack; groups of parsed words are bounded by statement separators, which can be lexed out; and groups of statements are grouped and bounded by indenting.

Some levels in the hierarchy are lexed out, others are parsed out by operator binding power. There are some “operators” that mean group separator for a given hierarchical level, which is a tell that reveals lex style parsing: for example the semicolon in C++, and the full stop and paragraph break in text.

The never ending problems from mixing tab and space indenting can be detected by making an increase or decrease of indent by a space a bracket operator, and an increase or decrease by a tab a non matching bracket operator.

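A minimal sketch of that idea, my own illustration rather than anyone's shipping lexer: space indent changes and tab indent changes emit different bracket tokens, so mixed indentation surfaces as a bracket mismatch.

```cpp
// Sketch: emit virtual bracket tokens from indentation changes.
// Space-indent changes emit { and }; tab-indent changes emit [ and ].
// Mixing tabs and spaces then shows up as interleaved, non-matching
// brackets that any bracket-matching pass will reject.
#include <iostream>
#include <string>
#include <vector>

struct Indent { int spaces = 0, tabs = 0; };

Indent measure(const std::string& line) {
    Indent d;
    for (char c : line) {
        if (c == ' ') ++d.spaces;
        else if (c == '\t') ++d.tabs;
        else break;
    }
    return d;
}

int main() {
    std::vector<std::string> lines = {"a", "    b", "\tc", "d"};
    Indent prev;
    for (const auto& line : lines) {
        Indent cur = measure(line);
        for (int i = prev.spaces; i < cur.spaces; ++i) std::cout << "{";
        for (int i = cur.spaces; i < prev.spaces; ++i) std::cout << "}";
        for (int i = prev.tabs; i < cur.tabs; ++i) std::cout << "[";
        for (int i = cur.tabs; i < prev.tabs; ++i) std::cout << "]";
        std::cout << line.substr(cur.spaces + cur.tabs) << "\n";
        prev = cur;
    }
    // Output: a / {{{{b / }}}}[c / ]d -- the brackets opened by spaces
    // and closed around a tab line make the mixing visible.
}
```
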
Pratt parsing parses operators by their left and right binding power – which is a superset of operator precedence parsing. EBNF does not directly express this concept, and programming this concept into EBNF is complicated, indirect, and imperfect – because EBNF is too powerful a superset, one that can express anything, including things that do not make sense to the human writing the stuff to be parsed.

Pratt parsing finalizes an expression by visiting the operators in reverse polish order, thus implicitly executing a stack of run time typed operands, which eventually get compiled and eventually executed as just-in-time typed or statically typed operands and operators.

For [identity](identity.html), we need Cryptographic Resource Identifiers, which cannot conform to the “Universal” Resource Identifier syntax and semantics.

Lexers are not powerful enough, and the fact that they are still used for uniform resource identifiers, relative resource identifiers, and command line arguments is a disgrace.

Advanced parsers, however, are too powerful, resulting in syntax that is counter intuitive. That ninety percent of the time a program file can be parsed by a simple parser incapable of recognizing the full set of syntactically correct expressions that the language allows indicates that the programmer’s mental model of the language has a simpler structure.

# Pratt Parsing

I really love the Pratt Parser, because it is short and simple, because if you add to the symbol table you can add new syntax during compilation, and because what it recognizes corresponds to human intuition and human reading.

But it is just not actually a parser. Given a source with invalid expressions such as unary multiplication and unbalanced parentheses, it will cheerfully generate a parse. It also lacks the concept out of which all the standard parsers are constructed: that expressions are of different kinds, different nonterminals.

To fix Pratt parsing, it would have to recognize operators as bracketing, prefix unary, postfix unary, or infix, and that some operators do not have an infix kind; it would have to recognize that operands have types, and that an operator produces a type from its inputs. It would have to attribute a nonterminal to a subtree. It would have to recognize ternary operators as operators.

And that is a major rewrite and reinvention.

LALR parsers appear to be closer to the programmer mental model, but looking at Pratt parsing, there is a striking resemblance between C and what falls out of Pratt’s model:

The kind of “lexing” the Pratt parser does seems to have a natural correspondence to the kind of parsing the programmer does as his eye rolls over the code. Pratt’s deviations from what would be correct behavior in simple arithmetic expressions composed of numerals and single character symbols seem to strikingly resemble expressions that engineers find comfortable.

When `expr` is called, it is provided the right binding power of the token that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than the right binding power of the operator that called it. It collects all tokens that bind together into a tree before returning to the operator that called it.

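Here is a minimal sketch of that loop, assuming a toy grammar of single-digit operands and four infix operators, with invented binding powers:

```cpp
// Pratt expression loop, minimal sketch: parse single-digit operands and
// the infix operators + - * /, printing the tree in fully bracketed form.
#include <iostream>
#include <map>
#include <string>

struct Power { int left, right; };            // left/right binding power
std::map<char, Power> ops = {                 // toy symbol table
    {'+', {10, 11}}, {'-', {10, 11}},         // left associative: right = left+1
    {'*', {20, 21}}, {'/', {20, 21}},
};

const char* src;                              // cursor into the source text

std::string expr(int rbp) {                   // rbp: right binding power of caller
    std::string left(1, *src++);              // nud: a digit is a leaf
    while (*src) {
        auto it = ops.find(*src);             // peek at the upcoming token
        if (it == ops.end() || it->second.left <= rbp)
            break;                            // binds no tighter than caller: stop
        char op = *src++;
        // led: parse the right operand with this operator's right power
        left = "(" + left + op + expr(it->second.right) + ")";
    }
    return left;
}

int main() {
    src = "1+2*3-4";
    std::cout << expr(0) << "\n";             // prints ((1+(2*3))-4)
}
```

The guard in the loop is the whole algorithm: keep consuming while the upcoming operator's left binding power exceeds the right binding power `expr` was called with; making the right power one higher than the left makes an operator left associative.
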
The Pratt `peek` peeks ahead to see if what is coming up is an operator, and therefore needs to check what is coming up against a symbol table, which existing implementations fail to explicitly implement.

The Pratt algorithm, as implemented by Pratt and followers, assumes that all operators can be unary prefix or infix (hence the nud/led distinction). It should get the nature of the upcoming operator from the symbol table (infix, unary, or both; and if unary, prefix or postfix).

Although implementers have not realized it, they are treating all “non operator” tokens as unary postfix operators. Instead of, or as well as, they need to treat all tokens (where items recognized from a symbol table are pre-aggregated) as operators, with ordinary characters as postfix unary, spaces as postfix unary with weaker binding power, and a token consisting of a utf8 iterator plus a byte count as equivalent to a left tree with right single character leaves and a terminal left leaf.

Pratt parsing is like lexing, breaking a stream of characters into groups, but the grouping is hierarchical. The algorithm annotates a linear text with hierarchy.

Operators are characterized by a global order of left precedence and a global order of right precedence (the difference giving us left associativity and right associativity).

If we extend the Pratt algorithm with the concept of unitary postfix operators, we see it is treating each ordinary unrecognized character as a unitary postfix operator, and each whitespace character as a unitary postfix operator of weaker binding power.

[Apodaca]: https://dev.to/jrop/pratt-parsing

Pratt and [Apodaca] are primarily interested in the case of unary minus, so they handle the case of a tree with a potentially null token by distinguishing between nud (no left context) and led (the right hand side of an operator with left context).

Pratt assumes that in correct source text, `nud` is only going to encounter an atomic token, a unary prefix operator, or an opening bracket. If it encounters an atomic token, it consumes the token, constructs a leaf vertex which points into the source, and returns. If it encounters an operator, it calls `expr` with the right binding power of that operator, and when `expr` has finished parsing, returns a corresponding vertex.

Not at all clear to me how it handles brackets. Pratt gets by without the concept of matching tokens, or hides it implicitly. Seems to me that correct parsing means a correct vertex has to contain all matching tokens, and the expressions contained therein, so a vertex corresponding to a bracketed expression has to point to the open and closing bracket terminals, and the contained expression. I would guess that his algorithm winds up with a tree that just happens to contain matching tokens in related positions in the tree.

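For what it is worth, the usual treatment, sketched here as an extension of the toy parser above rather than as Pratt's own formulation, is for `nud` to consume `(`, parse a full subexpression, and then demand the matching `)` explicitly:

```cpp
// Sketch of bracket handling in nud, reusing src and expr() from the
// earlier sketch: '(' parses a subexpression at binding power 0, then
// insists on the matching ')', so the match is explicit, not accidental.
#include <stdexcept>
#include <string>

extern const char* src;          // cursor from the earlier sketch
std::string expr(int rbp);       // the Pratt loop from the earlier sketch

std::string nud() {
    if (*src == '(') {
        ++src;                         // consume '('
        std::string inside = expr(0);  // parse the bracketed expression
        if (*src != ')')
            throw std::runtime_error("expected ')'");
        ++src;                         // consume the matching ')'
        return "(" + inside + ")";
    }
    return std::string(1, *src++);     // otherwise a leaf token
}
```
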
Suppose the typical case, a tree of binary operators inside a tree of binary operators. In that case, when `expr` is called, the source pointer is pointing to the start of an expression. `expr` calls `nud` to parse the expression, and if that is all she wrote (because `peek` reveals an operator with lower left binding power than the right binding power that `expr` was called with), returns the edge to the vertex constructed by `nud`. Otherwise, it parses out the operator, and calls `led` with the right binding power of the operator it has encountered, to get the right hand argument of the binary operator. It then constructs a vertex containing the operator, whose left edge points to the node constructed by `nud` and whose right hand edge points to the node constructed by `led`. If that is all she wrote, it returns; otherwise it iterates its while loop, constructing the ever higher root of a left leaning tree of all previous roots, whose ultimate left most leaf is the vertex constructed by `nud`, and whose right hand vertexes were constructed by `led`.

The nud/led distinction is not sufficiently general. They did not realize that they were treating ordinary characters as postfix unitary operators.

Trouble is, I want to use the parser as the lexer, which ensures that as the human eye slides over the text, the text reads the way it is in fact structured. But if we do Pratt parsing on single characters to group them into larger aggregates, `p*--q*s` is going to be misaggregated by the parser to `((p*)−)−(q*s)`, which is meaningless.

And, if we employ Pratt’s trick of the nud/led distinction, it will evaluate as `p*(-(-q*s))`, which gives us a meaningful but wrong result, `p*q*s`.

If we allow multicharacter operators then they have to be lexed out at the earliest stage of the process – the Pratt algorithm has to be augmented by aggregate tokens, found by attempting to match the following text against a symbol table. Existing Pratt algorithms tend to have an implicit symbol table of one character symbols, everything in the symbol table being assumed to be potentially either infix or unary prefix, and everything outside the implicit symbol table unary postfix.

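A sketch of that augmentation; the longest-match rule is my assumption, since Pratt's formulation never specifies a lexer:

```cpp
// Longest-match ("maximal munch") operator lexing against a symbol table,
// so `--` wins over `-`. Sketch only: a real table would carry binding
// powers and operator kinds alongside each symbol.
#include <iostream>
#include <set>
#include <string>

std::set<std::string> symbols = {"-", "--", "*", "*="};

// Return the longest symbol that prefixes text at pos, or "" if none does.
std::string match(const std::string& text, size_t pos) {
    std::string best;
    for (const auto& s : symbols)
        if (text.compare(pos, s.size(), s) == 0 && s.size() > best.size())
            best = s;
    return best;
}

int main() {
    std::string text = "p*--q";
    for (size_t pos = 0; pos < text.size();) {
        std::string op = match(text, pos);
        if (!op.empty()) { std::cout << "op(" << op << ") "; pos += op.size(); }
        else             { std::cout << text[pos] << " ";     ++pos; }
    }
    std::cout << "\n";   // prints: p op(*) op(--) q
}
```
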
Suppose a token consists of a utf8 iterator and a byte count.

So, all the entities we work with are trees, but recursion terminates because some nodes of the tree have been collapsed to variables that consist of a utf8 iterator and a byte count, *and some parts of the tree have been partially collapsed to vertexes that consist of a utf8 iterator, a byte count, and an array of trees*.

C++ forbids `foo bar()` to match `foobar()`, but allows `foobar ()` to match, which is arguably an error.

`foobar(` has to lex out as a prefix operator. But it is not really a unitary prefix operator. It is a set of matching operators, like brackets and the ternary operator `bool?value:value`. The commas and the closing bracket are also part of it. Which brings us to recognizing ternary operators. The naive single character Pratt algorithm handles ternary operators correctly (assuming that the input text is valid), which is surprising. So it should simply also match the commas and right bracket as a particular case of ternary and higher operators in the initial symbol search, albeit how to do that so that it is simple and correct and falls naturally out of the algorithm is not necessarily obvious.

Operator precedence gets you a long way, but it messed up because it did not recognize the distinction between right binding power and left binding power. Pratt gets you a long way further.

But Pratt messes up because it does not explicitly recognize the difference between unitary prefix and unitary postfix, nor does it explicitly recognize operator matching – that a group of operators are one big multi argument operator. It does not recognize that brackets are expressions of the form symbol-expression-match, let alone that ternary operators are expressions of the form expression-symbol-expression-match-expression.

Needs to be able to recognize that expressions of the form expression-symbol-expression-match-expression-match…expression are expressions, and to convert the tree into prefix form (polish notation with arguments bracketed) and into postfix form (reverse polish) with a count of the stack size.

Needs to have a stack of symbols that need left matches.

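A sketch of the postfix half of that requirement, with the tree representation invented for illustration: a post-order walk emits reverse polish and tracks the operand stack depth the emitted code needs.

```cpp
// Sketch: emit reverse polish from an expression tree and count the
// operand stack depth the emitted code will need.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string sym;                          // operator or operand text
    std::vector<std::unique_ptr<Node>> kids;  // empty for operands
};

// Post-order walk: children first, then the operator (reverse polish).
// depth tracks the current stack size, maxDepth the stack the code needs.
void rpn(const Node& n, int& depth, int& maxDepth, std::string& out) {
    for (const auto& k : n.kids) rpn(*k, depth, maxDepth, out);
    out += n.sym + " ";
    depth -= static_cast<int>(n.kids.size()) - 1;  // pop kids, push result
    if (depth > maxDepth) maxDepth = depth;
}

int main() {
    auto leaf = [](std::string s) {
        auto n = std::make_unique<Node>(); n->sym = std::move(s); return n;
    };
    auto node = [](std::string s, std::unique_ptr<Node> a, std::unique_ptr<Node> b) {
        auto n = std::make_unique<Node>(); n->sym = std::move(s);
        n->kids.push_back(std::move(a)); n->kids.push_back(std::move(b));
        return n;
    };
    auto tree = node("*", node("+", leaf("1"), leaf("2")), leaf("3"));
    int depth = 0, maxDepth = 0; std::string out;
    rpn(*tree, depth, maxDepth, out);
    std::cout << out << "(stack " << maxDepth << ")\n";  // 1 2 + 3 * (stack 2)
}
```
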
# LALR

Bison and yacc are [joined at the hip](https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/) to seven bit ascii and BNF (through flex and lex), whereas [ANTLR](https://tomassetti.me/ebnf/) recognizes unicode and the far more concise and intelligible EBNF. ANTLR generates ALL(\*) parsers, which allow syntax that allows statements that are ugly and humanly unintelligible, while Bison, when restricted to LALR parsers, allows only grammars that forbid certain excesses, but generates unintelligible error messages when you specify a grammar that allows such excesses.

You could hand write your own lexer, and use it with BisonC++. Which seemingly everyone does.

ANTLR allows expressions that take a long time to parse, but only polynomially long, fifth power, and prays that humans seldom use such expressions, which in practice they seldom do. But sometimes they do, resulting in hideously bad parser performance, where the parser runs out of memory or time. Because the parser allows non LALR syntax, it may find many potential meanings halfway through a straightforward lengthy expression that is entirely clear to humans, because the non LALR syntax would never occur to the human. In ninety percent of files, there is not a single expression that cannot be parsed by very short lookahead, because even if the language allows it, people just do not use it, finding it unintelligible. Thus, a language that allows non LALR syntax locks you in against subsequent syntax extension, because the extension you would like to make already has some strange and non obvious meaning in the existing syntax.

This makes it advisable to use a parser that can enforce a syntax definition that does not permit non LALR expressions.

On the other hand, LALR parsers walk the tree in Reverse Polish order, from the bottom up. This makes it hard to debug your grammar, and hard to report syntax errors intelligibly. And sometimes you just cannot express the grammar you want as LALR, and you wind up writing a superset of the grammar you want and then ad-hoc forbidding otherwise legitimate constructions, in which case you have abandoned the simplicity and directness of LALR, and the fact that it naturally tends to restrict you to humanly intelligible syntax.

Top down makes debugging your syntax easier, and issuing useful error messages a great deal easier. It is hard to provide any LALR handling of syntax errors other than just stopping at the first error. But top down makes it a lot harder to implement semantics, because Reverse Polish order directly expresses the actions you want to take in the order that you need to take them.

LALR allows left recursion, so that you can naturally make minus and divide associate in the correct and expected order, while with LL, you wind up doing something weird and complicated – you build the tree, then you have another pass to get it into the correct order.

Most top down parsers, such as ANTLR, have a workaround to allow left recursion. They internally turn it into right recursion by the standard transformation, and then optimize out the ensuing tail recursion. But that is a hack, which increases the distance between your expression tree and your abstract syntax tree, and still increases the distance between your grammar and your semantics during parser execution. You are walking the hack, instead of walking your own grammar’s syntax tree in Reverse Polish order. Implementing semantics becomes more complex. You still wind up with added complexity when doing left recursion, just moved around a bit.

LALR allows you to more directly express the grammar you want to express. With top down parsers, you can accomplish the same thing, but you have to take a more roundabout route to express the same grammar, and again you are likely to find you have allowed expressions that you do not want and which do not naturally have reasonable and expected semantics.

ANTLR performs top down generation of the expression tree. Your code, called by ANTLR, converts the expression tree into the Abstract Syntax Tree, and the abstract syntax tree into the High Level Intermediate Representation.

The ANTLR algorithm can be slow as a week of Sundays, or wind up eating polynomially large amounts of memory till it crashes. To protect against this problem, [he suggests using the fast SLL algorithm first, and should it fail, then using the full on, potentially slow and memory hungry, LL\* algorithm](https://github.com/antlr/antlr4/issues/374). Ninety percent of language files can be parsed by the fast algorithm, because people just do not use too clever by half constructions. But it appears to me that anything that cannot be parsed by SLL, but can be parsed by LL\*, is not good code – that what confuses an SLL parser also confuses a human, that the alternate readings permitted by the larger syntax are never code that people want to use.

ANTLR does not know or care if your grammar makes any sense until it tries to analyze particular texts. But you would like to know up front if your grammar is valid.

LALR parsers are bottom up, so they have terrible error messages when they analyze a particular example of the text, but they have the enormous advantage that they will analyze your grammar up front and guarantee that any grammatically correct statement is LALR. If a LALR parser can analyze it, chances are that a human can also. ANTLR permits grammars that permit unintelligible statements.

The [LRX parser](http://lrxpg.com/downloads.html) looks the most suitable for your purpose. It has a restrictive license and only runs in the Visual Studio environment, but you only need to distribute the source code it builds the compiler from as open source, not the compiler compiler. It halts at the first error message, since it is incapable of building intelligible multiple error messages. The compiler it generates builds a syntax tree and a symbol table.

The generically named [lalr](https://github.com/cwbaker/lalr) looks elegantly simple, and not joined at the hip to all sorts of strange environments. Unlike Bison C++, it should be able to handle unicode strings, with its regular expressions. It only handles BNF, not EBNF, but that is a relatively minor detail. Its regular expressions are under documented, but regular expression syntax is pretty standard. It does not build a symbol table.

And for full generality, you really need a symbol table where the symbols get syntax, which is a major extension to any existing parser. That starts to look like hard work. The lalr algorithm does not add syntax on the fly. The lrxpg parser does build a symbol table on the fly, but not syntax on the fly – but its website just went down. No one has attempted to write a language that can add syntax on the fly. They build a syntax capable of expressing an arbitrary graph with symbolic links, and then give the graph extensible semantics. The declaration/definition semantic is not full parsing on the definition, but rather operates on the tree.

In practice, LALR parsers need to be extended beyond LALR with operator precedence. Expressing operator precedence within strict LALR is apt to be messy. And, because LALR walks the tree in reverse polish order, you want the action that gets executed at parse time to return a value that the generated parser puts on a stack managed by the parser, which stack is available when the action of the operator that consumes it is called. In which case the definition/declaration semantic declares a symbol that has a directed graph associated with it, which graph is then walked to interpret what is on the parse stack. The data of the declaration defines metacode that is executed when the symbol is invoked, the directed graph associated with the symbol definition being metacode executed by the action that the parser performs when the symbol is used. The definition/declaration semantic allows arbitrary graphs containing cycles (full recursion) to be defined, by the declaration adding indirections to a previously constructed directed graph.

The operator-precedence parser can parse all LR(1) grammars where two consecutive nonterminals and epsilon never appear in the right-hand side of any rule. They are simple enough to write by hand, which is not generally the case with more sophisticated right shift-reduce parsers. Second, they can be written to consult an operator table at run time. Considering that “universal” resource locators and command lines are parsed with mere lexers, perhaps a hand written operator-precedence parser is good enough. After all, Forth and Lisp have less.

C++ variadic templates are a purely functional metalanguage operating on the parse stack. Purely functional languages suck, as demonstrated by the fact that we are now retroactively shoehorning procedural code (`if constexpr`) into the C++ template metalanguage. Really, you need the parse stack of previously encountered arguments to potentially contain arbitrary objects.

When a LALR parser parses an if-then-else statement, if the parser grammar defines “if” as the nonterminal, which may contain an “else” clause, it is going to execute the associated actions in the reverse order. But if you define “else” as the nonterminal, which must be preceded by an “if” clause, then the parser will execute the associated actions in the expected order. But suppose you have an else clause in curly brackets inside an if-then-else. Then the parse action order is necessarily going to be different from the procedural order. Further, the very definition of an if-then-else clause implies a parse time, in which all actions are performed, and a procedural time, in which only one action is performed.

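That last distinction can be made concrete with a sketch in which the AST shape is my invention: constructing the node performs a parse-time action for every branch, in source order, while running it executes only one branch.

```cpp
// Sketch: parse time visits every branch of an if-then-else once (to
// build the node); run time executes only one of them.
#include <functional>
#include <iostream>

struct IfNode {
    std::function<bool()> cond;
    std::function<void()> then_branch, else_branch;  // both built at parse time
    void run() const { (cond() ? then_branch : else_branch)(); }  // one runs
};

int main() {
    // "Parse time": all three parts are constructed, in source order.
    IfNode node{
        [] { return 2 + 2 == 4; },
        [] { std::cout << "then\n"; },
        [] { std::cout << "else\n"; },
    };
    node.run();   // "procedural time": prints only "then"
}
```
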
Definition metacode must operate on the parser stack, but declaration metacode may operate on a different stack, implying a coroutine relationship between declaration metacode and definition metacode. The parser, to be intelligible, has to perform actions in as close to left to right order as possible, hence my comment that the “else” nonterminal must contain the “if” nonterminal, not the other way around – but what if the else nonterminal contains an “if then else” inside curly braces? The parser actions can and will happen in a different order to the run time actions. Every term of the if-then-else structure is going to have its action performed in syntax order, but the syntax order has to be capable of implying a different procedural order, during which not all actions of an if-then-else structure will be performed. And similarly with loops, where every term of the loop causes a parse time action to be performed once, in parse time order, but procedural time actions happen in a different order, and are performed many times.

This implies that any fragment of source code, in a language that uses the declaration/definition syntax and semantic, gets to do stuff in three phases. (Hence in C, you can define a variable or a function without declaring it, resulting in link time errors, and in C++, define a class without declaring its methods and data, resulting in compilation errors at a stage of compilation that is ill defined and inexplicit.)

The parser action of the declaration statement constructs a declaration data structure, which is metacode, possibly invoking the metacode generated by previous declarations and definitions. When the term declared is then used, the metacode of the definition is executed. And the usage may well invoke the metacode generated by the action associated at parse time with the declaration statement, but attempting to do so causes an error in the parser action if the declaration action has not yet been encountered in parse action order.

So, we get parser actions which construct definition and declaration metacode, and subsequent parser actions, performed later during the parse of subsequent source code, that invoke that metacode by name to construct metacode. But, as we see in the case of the if-then-else and do-while constructions, there must be a third execution phase, in which the explicitly procedural code constructed, but not executed, by the metacode is actually executed procedurally. Which, of course, in C++ is performed after the link and load phase. But we want procedural metacode. And since procedural metacode must contain conditionals and loops, there has to be a third phase during parsing, executed as a result of parse time actions, that procedurally performs ifs and loops in metacode. So a declaration can invoke the metacode constructed by previous declarations – meaning that a parse time action executes metacode constructed by previous parse time actions. But, to invoke procedural metacode from a parse time action, a previous parse time action has to have invoked metacode constructed by an even earlier parse time action to construct procedural metacode.

Of course all three phases can be collapsed into one, as a definition can act as both a declaration and a definition, two phases in one, but there have to be three phases, which can be the result of parser actions widely separated in time, triggered by code widely separated in the source, and thinking of the common and normal case is going to result in mental confusion, collapsing things that are distinct, because the distinction is commonly unimportant and elided. Hence the thick syntactic soup with which I have been struggling when I write C++ templates defining classes that define operators, and then attempt to use the operators.

In the language of C we have parse time actions, link time actions, and execution time actions, and only at execution time is procedural code constructed as a result of earlier actions actually performed procedurally.

We want procedural metacode that can construct procedural metacode. So we want execution time actions performed during parsing. So let us call the actions definitional actions, linking actions, and execution actions. And if we are going to have procedural actions during parsing, we are going to have linking actions during parsing. (Of course, in actually existent C++, second stage compilation does a whole lot of linker actions, resulting in excessively tight coupling between linker and compiler, the inability of other languages to link to C++, and the syntax soup that ensues when I define a template class containing inline operators.)

# Forth the model

We assume the equivalent of Forth, where the interpreter directly interprets and executes human readable and writeable text, by looking up the symbols in the text and performing the actions they command, which commands may command the interpreter to generate compiled and linked code, including compiled code that generates compiled and linked code, command the interpreter to add names for what it has compiled to the name table, and then command the interpreter to execute those routines by name.

Except that Forth is absolutely typeless, or has only one type, fixed precision integers that are also pointers, while we want a language in which types are first class values, as manipulable as integers, except that they are immutable; a language where a pointer to a pointer to an integer cannot be added to a pointer, and subtraction of one pointer from another pointer of the same type pointing into the same object produces an integer, where you cannot point a pointer out of the range of the object it refers to, nor increment a reference, only the referenced value.

Lexing merely needs symbols to be listed. Parsing merely needs them to be, in C++ terminology, declared but not defined. Pratt parsing puts operators in forth order, but knows and cares nothing about types, so it is naturally adapted to a Forth like language which has only one type, or whose values have run time types, or to generating an intermediate language which undergoes a second stage compilation that produces statically typed code.

In Forth, symbols pointed to memory addresses, and it was up to the command whether it would load an integer from an address, store an integer at that address, execute a subroutine at that address, or go to that address: the ultimate in unsafe typelessness.

Pratt parsing is an outstandingly elegant solution to parsing, and allows compile time extension to the parser, though it needs a lexer driven by the symbol table if you have multi character operators, but I am still lost in the problem of type safety.

Metaprogramming in C++ is done in a lazily evaluated, purely functional language where a template is usually used to construct a type from type arguments. I want to construct types procedurally, and generate code procedurally, rather than by invoking pure functions.

In Pratt parsing, the language is parsed sequentially in parser order, but the parser maintains a tree of recursive calls, and builds a tree of pointers into the source, such that it enters each operator in polish order, and finishes up each operator in reverse polish order.

On entering in polish order, the parser may be entering an operator with a variable number of arguments (unary minus or infix minus), so it cannot know the number of operands coming up; but on exiting in reverse polish order, it knows the number and something about the types of the arguments, so it has to look for an interpretation of the operator that can handle that many arguments of those types. Which may not necessarily be a concrete type.

Operators that change the behavior of the lexer or the parser are typically acted upon in polish order. Compilation to byte code that does not yet have concrete types is done in reverse polish order, so operators that alter the compilation to byte code are executed at that point. Operators that manipulate that byte code during the linking to concrete types act at link time, when the typeless byte code is invoked with concrete types.

Naming puts a symbol in the lexer symbol table.

Declaring puts a symbol in the parser symbol table.

Defining compiles, and possibly links, the definition, and attaches that data to the symbol, where it may be used or executed in subsequent compilation and linking steps when that symbol is subsequently invoked. If the definition contains procedural code, it is not going to be executed procedurally until compiled and linked, which will likely occur when the symbol is invoked later.

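A sketch of the naming/declaring/defining progression as three tables; the table layout and the `max` example are invented for illustration:

```cpp
// Sketch: naming, declaring, and defining as successive entries in the
// lexer, parser, and linker symbol tables. Layout is illustrative only.
#include <iostream>
#include <map>
#include <string>

struct Tables {
    std::map<std::string, bool> lexer;          // named: lexable as one token
    std::map<std::string, std::string> parser;  // declared: syntax (kind) known
    std::map<std::string, std::string> linker;  // defined: compiled body attached

    void name(const std::string& s) { lexer[s] = true; }
    void declare(const std::string& s, std::string kind) { parser[s] = std::move(kind); }
    void define(const std::string& s, std::string body) { linker[s] = std::move(body); }
};

int main() {
    Tables t;
    t.name("max");                         // lexer can now aggregate "max"
    t.declare("max", "infix");             // parser knows how it parses
    t.define("max", "a b > if a else b");  // linker has code to attach
    std::cout << t.parser["max"] << "\n";  // prints: infix
}
```
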
An ordinary procedure definition without concrete types is the equivalent of an ordinary C++ template. When it is used with concrete types, the linker will interpret the operations it invokes in terms of those concrete types, and fail if they don’t support those operations.

A metacode procedure gets put into the lexer symbol table when it is named, and into the parser symbol table when it is declared. When it is defined, its definition may be used when its symbol is encountered in polish order by the parser, and may be executed at that time to modify the behavior of parser and linker. When a named, declared, and defined symbol is encountered by the parser in reverse polish order, its compiled code may be used to generate linked code, and its linked and compiled code may manipulate the compiled code preparatory to linking.

When a symbol is declared, it gets added to the parser and lexer symbol tables. When it is defined, it gets added to the linker symbol table. When defined with a concrete type, it also gets added to the linker symbol table with those concrete types, as an optimization.

If an operation could produce an output of variant type, then that output is an additive algebraic type, which then has to be handled by a switch statement.

There are five steps: lexing, parsing, compiling, linking, and running, and any fragment of source code may experience some or all of these steps, with the resulting entries in the symbol table then being available to the next code fragment, Forth style. Thus `77+9` gets lexed into `77`, `+`, `9`, parsed into `+(77, 9)`, compiled into `77 9 +`, linked into `77 9 +<int, int>`, and executed into `int(86)`, and the rest of the source code proceeds to parse, compile, link, and run as if you had written `86`.

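The same five steps, compressed into a sketch in which every intermediate representation is invented for illustration:

```cpp
// Sketch: the five steps for "77+9", each stage's output shown inline.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string source = "77+9";

    // 1. Lex: "77+9" -> tokens 77, +, 9
    std::vector<std::string> tokens = {"77", "+", "9"};

    // 2. Parse: tokens -> tree +(77, 9), shown here in polish order
    std::vector<std::string> tree = {"+", "77", "9"};

    // 3. Compile: tree -> untyped byte code in reverse polish: 77 9 +
    std::vector<std::string> byteCode = {"77", "9", "+"};

    // 4. Link: bind + to a concrete int x int -> int operation
    auto plusIntInt = [](int a, int b) { return a + b; };

    // 5. Run: execute the linked code; the rest of the source sees 86
    int result = plusIntInt(std::stoi(byteCode[0]), std::stoi(byteCode[1]));
    std::cout << source << " -> " << result << "\n";   // 77+9 -> 86
}
```
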
Further, the source code can create run time code, code that gets declared, defined, and linked during the compile, and that is executed during the compile, modifying the behavior of the lexer, the parser, the compiler, and the linker over the course of a single compile and link. This enables a Forth style bootstrapping, where the lexer, parser, compiler and linker lexes, compiles, and links most of its own potentially modifiable source code in every compile, much as every C++ compile includes the header files for the standard template library, so that much of your program is rewritten by template metacode that you included at the start of the program.

Compiled but not linked code could potentially operate on variables of any type, though if the variables did not have a type required by an operator, you would get a link time error, not a compile time error. This is OK because linking of a fragment of source is not a separate step, but usually happens before the lexer has gotten much further through the source code; it happens as soon as the code fragment is invoked with variables of defined type, though usually of as yet undefined value.

A console program is an operator whose values are of the type iostream; it gets linked as soon as the variable type is defined, and executed when you assign defined values to the iostream.

Because C++ metacode is purely functional, it gets lazily evaluated, so the syntax and compiler can cheerfully leave it undefined when, or even if, it gets executed. Purely functional languages only terminate by laziness. But if we want to do the same things with procedural metacode, we have no option but to explicitly define what gets executed when. In which case pure LALR syntax is going to impact the semantics, since LALR syntax defines the order of parse time actions, and order of execution impacts the semantics. I am not altogether certain as to whether the result is going to be intelligible and predictable. Pratt syntax, however, is going to result in a predictable execution order.

The declaration, obviously, defines code that can be executed by a subsequent parse action after the declaration parse action has been performed, and the definition, code that can be compiled after the definition parse action is performed.

The compiled code can be linked when invoked with variables of defined type and undefined value, and executed when invoked with variables of defined type and a value.

Consider what happens when the definition defines an overload for an infix operator. The definition of the infix operator can only be procedurally executed when the parser calls the infix action with the arguments on the parse stack, which happens long after the infix operator is overloaded.

The definition has to be parsed when the parser encounters it. But it is procedural code, which cannot be procedurally executed until later, much later. So the definition has to compile, not execute, procedural code, then cause the data structure created by the declaration to point to that compiled code. And then later, when the parser encounters an actual use of the infix operator, the compiled procedural code of the infix definition is actually executed to generate linked procedural code with explicit and defined types, which is part of the definition of the function or method in whose source code the infix operator was used.

One profoundly irritating feature of C++, probably caused by LL parsing, is that if the left hand side of an infix expression has an appropriate overloaded operator, it works, but if the right hand side does, it fails. Here we see parsing having an incomprehensible and arbitrary influence on semantics.

C++ is a strongly typed language. With types, any procedure has typed inputs and outputs, and should only do safe and sensible things for those types. C++ metacode manipulates types as first class objects, which implies that if we were to do the same thing procedurally, types need a representation, and procedural commands to make new types from old, and to garbage collect, or memory manage, operations on these data objects, as if they were strings, floating point numbers, or integers of known precision. So you could construct or destruct an object of type type, generate new types by doing type operations on old types, for example add two types or multiply two types to produce an algebraic type, or create a type that is a const type or pointer type to an existing type, which type actually lives in memory somewhere, in a variable like any other variable. And, after constructing an algebraic type by procedurally multiplying two types, and perhaps storing it in a variable of type type, or invoking a function (aka C++ template type) that returns a type dynamically, create an object of that type – or an array of objects of that type. For every actual object, the language interpreter knows the type, meaning the object of type X that you just constructed is somehow linked to the continuing existence of the object of type type that has the value type X that you used to construct it, and cannot be destroyed until all the objects created using it are destroyed. Since the interpreter knows the type of every object, including objects of type type, and since every command to do something with an object is type aware, this can prevent the interpreter from being commanded to do something stupid. Obviously type data has to be stored somewhere, and has to be immutable, at least until garbage collected because no longer referenced.

Can circular type references exist? Well, not if types are immutable, because if a type references a type, that type must already exist, and so cannot reference a type that does not yet exist. It could reference a function that generates types, but that reference is not circular. It could have values that are constexpr, and values that reference static variables. If no circular references are possible, garbage collection by reference counting works.

Types are algebraic types, sums and products of existing types, plus modifiers such as `const`, `*`, and `&`.

Type information is potentially massive, and if we are executing a routine that refers to a type by the function that generates it, we don’t want the equivalent of a C++ template invoked every time, generating a new immutable object every time that is an exact copy of what it produced the last time it went through the loop. Rather, the interpreter needs to short circuit the construction by looking up a hash of that type constructing template call, to check whether it has been called with those function inputs and has already produced an object of that type. And when a function that generates a type is executed, it needs to look for duplications of existing types. A great many template invocations simply choose the right type out of a small set of possible types. It is frequently the case that the same template may be invoked with an enormous variety of variables, and come up with very few different concrete results.

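That short circuit is ordinary hash-consing; a sketch, with the type representation invented for illustration:

```cpp
// Sketch: hash-consed type construction. Each distinct (kind, operands)
// request is built once; repeated requests return the same immutable
// object, so a type-returning "template call" inside a loop is cheap.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Type {
    std::string kind;                  // e.g. "int", "ptr", "product"
    std::vector<const Type*> operands; // the types this one was built from
};

const Type* makeType(const std::string& kind, std::vector<const Type*> ops) {
    // Key on the construction inputs; a real interpreter would hash them.
    static std::map<std::pair<std::string, std::vector<const Type*>>, Type> cache;
    auto key = std::make_pair(kind, ops);
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, Type{kind, std::move(ops)}).first;
    return &it->second;
}

int main() {
    const Type* i  = makeType("int", {});
    const Type* p1 = makeType("ptr", {i});
    const Type* p2 = makeType("ptr", {i});   // same inputs: same object
    std::cout << (p1 == p2) << "\n";         // prints 1
}
```

Since `p1 == p2`, type equality becomes pointer equality, and the loop case in the text costs one table lookup per iteration instead of a fresh construction.
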
When the interpreter compiles a loop or a recursive call, the type information is likely to be an invariant, which should get optimized out of the loops. But when it is directly executing source code commands which command it to compile source code, such optimization is impossible.

But, as in Forth, you can tell the interpreter to store the commands in a routine somewhere, and when they are stored, the types have already been resolved. Typically the interpreter is going to finish interpreting the source code, producing stored programs each containing a limited amount of type information.