---
title: Parsers
...

This rambles a lot. Thoughts in progress: Summarizing my thoughts here at the top.

Linux scripts started off using lexing for parsing, resulting in complex and
incomprehensible semantics, producing unexpected results. (Try naming a
file `-r`, or a directory with spaces in the name.)

They are rapidly converging in actual usage to the operator precedence
syntax and semantics\
`command1 subcommand arg1 … argn infixoperator command2 subcommand …`

Which is parsed as\
`(staticclass1.staticmethod(arg1 … argn)) infixoperator (staticclass2.staticmethod(…))`
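
A minimal sketch of this reading, with a hypothetical `Git` class standing in
for `command subcommand`: the shell conjunction `&&` becomes C++’s
short-circuiting infix `&&` over two static method calls.

```cpp
#include <iostream>

// Hypothetical stand-ins for `git commit` and `git push`; illustrative only.
struct Git {
    static bool commit(const char* msg) {
        std::cout << "commit: " << msg << "\n";
        return true;  // success, so the infix && proceeds to the next command
    }
    static bool push() {
        std::cout << "push\n";
        return true;
    }
};

int main() {
    // The shell line `git commit -m wip && git push` under the reading above:
    bool ok = Git::commit("wip") && Git::push();
    return ok ? 0 : 1;
}
```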

With line feed acting as a `}{` operator, start of file acting as a `{`
operator, and end of file acting as a `}` operator, suggesting that in a sane
language, an indent increase should act as a `{` operator and an indent
decrease as a `}` operator.

Command line syntax sucks, because programs interpret their command
lines using a simple lexer, which lexes on spaces. Uniform resource
identifier syntax sucks, because it was originally constructed so that it
could be a command line argument, hence no spaces, and because it was
designed to be parsed by a lexer.

But EBNF parsers also suck, because they do not parse the same way
humans do. Most actual programs can be parsed by a simple parser, even
though the language in principle requires a more powerful parser, because
humans do not use the nightmarish full power of the grammar that an EBNF
definition winds up defining.

Note that the [LLVM language creation tools](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/)
tutorial does not use an EBNF
parser. These tools also make creating a new language with JIT semantics
very easy.

We are programming in languages that are not parsed the way the
programmer is parsing them.

Programming languages ignore whitespace, because programmers tend to
express their meaning with whitespace for the human reader, and
whitespace grammar is not altogether parallel to the EBNF grammar.
There is a mismatch in grammars.

Seems to me that human parsing is a combination of low level lexing, Pratt
parsing on operator right and left binding power, and a higher level of
grouping that works like lexing. Words are lexed by spaces and
punctuation, and grouped by operator binding power, with operator
recognition taking into account the types on the stack; groups of parsed
words are bounded by statement separators, which can be lexed out; and
groups of statements are grouped and bounded by indenting.

Some levels in the hierarchy are lexed out, others are parsed out by
operator binding power. There are some “operators” that mean group separator
for a given hierarchical level, which is a tell that reveals lex style parsing:
for example the semicolon in C++, and the full stop and paragraph break in text.

The never ending problems from mixing tab and space indenting can be
detected by making an increase or decrease of indent by a space a bracket
operator, and an increase or decrease by a tab a non matching bracket
operator.
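
A sketch of that detection, assuming we tokenize leading whitespace line by
line (the names are mine): a change in space indent emits matching brace
tokens, while a change in tab indent emits a token that nothing can match, so
mixed indentation surfaces as a bracket mismatch.

```cpp
#include <string>
#include <vector>

struct IndentToken { char kind; };  // '{', '}', or '!' for the unmatchable token

// Emit bracket tokens for the indent change relative to the previous line.
// Illustrative only; a real implementation would keep a stack of indent
// levels rather than a single previous width.
std::vector<IndentToken> indentTokens(const std::string& line,
                                      int& prevSpaces, int& prevTabs) {
    int spaces = 0, tabs = 0;
    for (char c : line) {
        if (c == ' ') ++spaces;
        else if (c == '\t') ++tabs;
        else break;
    }
    std::vector<IndentToken> out;
    for (int n = prevSpaces; n < spaces; ++n) out.push_back({'{'});
    for (int n = spaces; n < prevSpaces; ++n) out.push_back({'}'});
    if (tabs != prevTabs) out.push_back({'!'});  // tab change: non matching bracket
    prevSpaces = spaces;
    prevTabs = tabs;
    return out;
}
```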

Pratt parsing parses operators by their left and right binding power –
which is a superset of operator precedence parsing. EBNF does not
directly express this concept, and programming this concept into EBNF is
complicated, indirect, and imperfect – because it is too powerful a
superset, one that can express anything, including things that do not make
sense to the human writing the stuff to be parsed.

Pratt parsing finalizes an expression by visiting the operators in reverse
polish order, thus implicitly executing a stack of run time typed operands,
which eventually get compiled and executed as just-in-time typed
or statically typed operands and operators.

For [identity](names/identity.html), we need Cryptographic Resource Identifiers,
which cannot conform to the “Universal” Resource Identifier syntax and semantics.

Lexers are not powerful enough, and the fact that they are still used
for uniform resource identifiers, relative resource identifiers, and command
line arguments is a disgrace.

Advanced parsers, however, are too powerful, resulting in syntax that is
counter intuitive. That ninety percent of the time a program file can be
parsed by a simple parser incapable of recognizing the full set of
syntactically correct expressions that the language allows indicates that the
programmer’s mental model of the language has a simpler structure.

# Pratt Parsing

I really love the Pratt Parser, because it is short and simple, because if you
add to the symbol table you can add new syntax during compilation, and
because what it recognizes corresponds to human intuition and human
reading.

But it is just not actually a parser. Given a source with invalid expressions
such as unary multiplication and unbalanced parentheses, it will cheerfully
generate a parse. It also lacks the concept out of which all the standard
parsers are constructed: that expressions are of different kinds, different
nonterminals.

To fix Pratt parsing, it would have to recognize operators as bracketing,
prefix unary, postfix unary, or infix, and that some operators do not have an
infix kind; it would have to recognize that operands have types, and that
an operator produces a type from its inputs. It would have to attribute a
nonterminal to a subtree. It would have to recognize ternary operators as
operators.

And that is a major rewrite and reinvention.

Lalr parsers appear to be closer to the programmer mental model, but looking at
Pratt Parsing, there is a striking resemblance between C and what falls out of
Pratt’s model:

The kind of “lexing” the Pratt parser does seems to have a natural
correspondence to the kind of parsing the programmer does as his eye rolls
over the code. Pratt’s deviations from what would be correct behavior in
simple arithmetic expressions composed of numerals and single character
symbols seem to strikingly resemble expressions that engineers find
comfortable.

When `expr` is called, it is provided the right binding power of the
token that called it. It consumes tokens until it meets a token whose left
binding power is equal or lower than the right binding power of the operator
that called it. It collects all tokens that bind together into a tree before
returning to the operator that called it.
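
A minimal sketch of that loop, assuming single character operands and
operators and a fixed binding power table; this is the shape of `expr`
described above, not a full Pratt parser (no `nud`/`led` dispatch, no
brackets, no error handling).

```cpp
#include <map>
#include <memory>
#include <string>

struct Node {
    std::string text;              // operator symbol or operand
    std::unique_ptr<Node> left, right;
};

struct Power { int lbp, rbp; };    // left and right binding power
// rbp just above lbp gives left associativity; rbp below lbp gives right.
const std::map<char, Power> ops = {
    {'+', {10, 11}}, {'-', {10, 11}},
    {'*', {20, 21}}, {'/', {20, 21}},
};

const char* src;                   // cursor into the source text

std::unique_ptr<Node> expr(int callerRbp) {
    // "nud": here always a single character operand.
    auto left = std::make_unique<Node>();
    left->text = std::string(1, *src++);
    // Consume operators until one binds no tighter than our caller.
    while (*src) {
        auto it = ops.find(*src);
        if (it == ops.end() || it->second.lbp <= callerRbp) break;
        ++src;                                     // consume the operator
        auto parent = std::make_unique<Node>();
        parent->text = std::string(1, it->first);
        parent->left = std::move(left);
        parent->right = expr(it->second.rbp);      // "led": right operand
        left = std::move(parent);
    }
    return left;
}
// src = "1+2*3"; expr(0) yields +(1, *(2, 3)); "1-2-3" yields -(-(1, 2), 3).
```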

The Pratt `peek` peeks ahead to see if what is coming up is an
operator, and therefore needs to check what is coming up against a symbol table,
which existing implementations fail to explicitly implement.

The Pratt algorithm, as implemented by Pratt and followers, assumes that all
operators can be unary prefix or infix (hence the nud/led distinction). It
should get the nature of the upcoming operator from the symbol table (infix,
unary, or both, and if unary, prefix or postfix).

Although implementers have not realized it, they are treating all “non
operator” tokens as unary postfix operators. Instead of, or as well as, they
need to treat all tokens (where items recognized from a symbol table are
pre-aggregated) as operators, with ordinary characters as postfix unary,
spaces as postfix unary with weaker binding power, and a token consisting of
a utf8 iterator plus a byte count as equivalent to a left tree with right
single character leaves and a terminal left leaf.

Pratt parsing is like lexing, breaking a stream of characters into groups,
but the grouping is hierarchical. The algorithm annotates a linear text with
hierarchy.

Operators are characterized by a global order of left precedence and a global
order of right precedence (the difference giving us left associativity and
right associativity).

If we extend the Pratt algorithm with the concept of unitary postfix
operators, we see it is treating each ordinary unrecognized character as a
unitary postfix operator, and each whitespace character as a unitary postfix
operator of weaker binding power.

[Apodaca]:https://dev.to/jrop/pratt-parsing

Pratt and [Apodaca] are primarily interested in the case of unary minus, so
they handle the case of a tree with a potentially null token by
distinguishing between nud (no left context) and led (the right hand side of
an operator with left context).

Pratt assumes that in correct source text, `nud` is only going to encounter an
atomic token, in which case it consumes the token, constructs a leaf vertex
which points into the source, and returns, or a unary prefix operator, or an
opening bracket. If it encounters an operator, it calls `expr` with the right
binding power of that operator, and when `expr` has finished parsing, returns
a corresponding vertex.

Not at all clear to me how it handles brackets. Pratt gets by without the
concept of matching tokens, or hides it implicitly. Seems to me that correct
parsing is that a correct vertex has to contain all matching tokens, and the
expressions contained therein, so a vertex corresponding to a bracketed
expression has to point to the open and closing bracket terminals, and the
contained expression. I would guess that his algorithm winds up with a
tree that just happens to contain matching tokens in related positions in the tree.

Suppose the typical case, a tree of binary operators inside a tree of binary
operators: In that case, when `expr` is called, the source pointer is pointing
to the start of an expression. `expr` calls `nud` to parse the expression, and if
that is all she wrote (because `peek` reveals an operator with lower left
binding power than the right binding power that `expr` was called with)
returns the edge to the vertex constructed by `nud`. Otherwise, it parses out
the operator, and calls `led` with the right binding power of the operator it
has encountered, to get the right hand argument of the binary operator. It
then constructs a vertex containing the operator, whose left edge points to
the node constructed by `nud` and whose right hand edge points to the node
constructed by `led`. If that is all she wrote, it returns; otherwise it
iterates its while loop, constructing the ever higher root of a left leaning
tree of all previous roots, whose ultimate left most leaf is the vertex
constructed by `nud`, and whose right hand vertexes were constructed by `led`.

The nud/led distinction is not sufficiently general. They did not realize
that they were treating ordinary characters as postfix unitary operators.

Trouble is, I want to use the parser as the lexer, which ensures that as the
human eye slides over the text, the text reads the way it is in fact
structured. But if we do Pratt parsing on single characters to group them
into larger aggregates, `p*--q*s` is going to be misaggregated by the parser
to `(((p*)−) − (q*s))`, which is meaningless.

And, if we employ Pratt’s trick of the nud/led distinction, it will evaluate as
`p*(-(-q*s))`, which gives us a meaningful but wrong result, `p*q*s`.

If we allow multicharacter operators then they have to be lexed out at the
earliest stage of the process – the Pratt algorithm has to be augmented by
aggregate tokens, found by attempting to match the following text against a
symbol table. Existing Pratt algorithms tend to have an implicit symbol table
of one character symbols, everything in the symbol table being assumed to be
potentially either infix or unary prefix, and everything else outside the
implicit symbol table unary postfix.

Suppose a token consists of a utf8 iterator and a byte count.

So, all the entities we work with are trees, but recursion terminates because
some nodes of the tree have been collapsed to variables that consist of a
utf8 iterator and a byte count, *and some parts of the tree have been
partially collapsed to vertexes that consist of a utf8 iterator, a byte count,
and an array of trees*.
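
A sketch of those two shapes in C++, assuming the UTF-8 source stays in
memory; the names are mine.

```cpp
#include <cstddef>
#include <vector>

// A fully collapsed node: a span of UTF-8 source text.
struct Token {
    const char* utf8;     // iterator into the source
    std::size_t bytes;    // byte count of the span
};

// A partially collapsed vertex: a span plus an array of subtrees.
struct Vertex {
    Token span;                // the head or operator text
    std::vector<Vertex> kids;  // parsed children
};
```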

C++ forbids `foo bar()` to match `foobar()`, but
allows `foobar ()` to match, which is arguably an error.

`foobar(` has to lex out as a prefix operator. But it is not
really a prefix unitary operator. It is a set of matching operators, like
brackets and the ternary operator `bool ? value : value`. The commas and the
closing bracket are also part of it. Which brings us to recognizing ternary
operators. The naive single character Pratt algorithm handles ternary
operators correctly (assuming that the input text is valid), which is
surprising. So it should simply also match the commas and right bracket as a
particular case of ternary and higher operators in the initial symbol search,
albeit doing that so that it is simple and correct and naturally falls out of
the algorithm is not necessarily obvious.

Operator precedence gets you a long way, but it messed up because it did not
recognize the distinction between right binding power and left binding
power. Pratt gets you a long way further.

But Pratt messes up because it does not explicitly recognize the difference
between unitary prefix and unitary postfix, nor does it explicitly recognize
operator matching – that a group of operators are one big multi argument
operator. It does not recognize that brackets are expressions of the form
symbol-expression-match, let alone that ternary operators are expressions of
the form expression-symbol-expression-match-expression.

Needs to be able to recognize that expressions of the form
expression-symbol-expression-match-expression-match…expression are
expressions, and convert the tree into prefix form (polish notation with
arguments bracketed) and into postfix form (reverse polish) with a count of
the stack size.

Needs to have a stack of symbols that need left matches.

# Lalr

Bison and yacc are
[joined at the hip](https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/) to seven bit ascii and BNF (through flex and lex),
whereas [ANTLR](https://tomassetti.me/ebnf/)
recognizes unicode and the far more concise and intelligible EBNF. ANTLR
generates ALL(\*) parsers, which allow syntax that allows statements that are ugly
and humanly unintelligible, while Bison when restricted to LALR parsers allows
only grammars that forbid certain excesses, but generates unintelligible error
messages when you specify a grammar that allows such excesses.

You could hand write your own lexer, and use it with BisonC++. Which seemingly
everyone does.

ANTLR allows expressions that take a long time to parse, but only polynomially
long, fifth power, and prays that humans seldom use such expressions, which in
practice they seldom do. But sometimes they do, resulting in hideously bad
parser performance, where the parser runs out of memory or time. Because
the parser allows non LALR syntax, it may find many potential meanings
halfway through a straightforward lengthy expression that is entirely clear
to humans, because the non LALR syntax would never occur to the human. In
ninety percent of files, there is not a single expression that cannot be
parsed by very short lookahead, because even if the language allows it,
people just do not use it, finding it unintelligible. Thus, a language that
allows non LALR syntax locks you in against subsequent syntax extension,
because the extension you would like to make already has some strange and non
obvious meaning in the existing syntax.

This makes it advisable to use a parser that can enforce a syntax definition
that does not permit non LALR expressions.

On the other hand, LALR parsers walk the tree in Reverse Polish
order, from the bottom up. This makes it hard to debug your grammar, and
hard to report syntax errors intelligibly. And sometimes you just cannot
express the grammar you want as LALR, and you wind up writing a superset of
the grammar you want, and then ad-hoc forbidding otherwise legitimate
constructions, in which case you have abandoned the simplicity and
directness of LALR, and the fact that it naturally tends to restrict you to
humanly intelligible syntax.

Top down makes debugging your syntax easier, and issuing useful error
messages a great deal easier. It is hard to provide any LALR handling of
syntax errors other than just stopping at the first error, but top down makes
it a lot harder to implement semantics, because Reverse Polish order directly
expresses the actions you want to take in the order that you need to take
them.

LALR allows left recursion, so that you can naturally make minus and divide
associate in the correct and expected order, while with LL, you wind up
doing something weird and complicated – you build the tree, then you have
another pass to get it into the correct order.
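
The usual hand-written workaround is a loop in place of left recursion. A
sketch, assuming a `term()` that parses one operand and an `eatMinus()` that
consumes a minus sign if one is next: `a-b-c` still comes out left
associated, with no reordering pass.

```cpp
#include <memory>

struct Ast {
    char op = 0;                 // '-' for subtraction, 0 for a leaf
    int value = 0;               // leaf payload
    std::unique_ptr<Ast> lhs, rhs;
};

std::unique_ptr<Ast> term();     // assumed: parses one operand
bool eatMinus();                 // assumed: consumes '-' if next, reports it

// expr := term ('-' term)*  -- iteration replaces left recursion, so
// a-b-c builds ((a-b)-c) directly.
std::unique_ptr<Ast> expr() {
    auto left = term();
    while (eatMinus()) {
        auto node = std::make_unique<Ast>();
        node->op = '-';
        node->lhs = std::move(left);
        node->rhs = term();
        left = std::move(node);
    }
    return left;
}
```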

Most top down parsers, such as ANTLR, have a workaround to allow left
recursion. They internally turn it into right recursion by the standard
transformation, and then optimize out the ensuing tail recursion. But that
is a hack, which increases the distance between your expression tree and
your abstract syntax tree, and still increases the distance between your
grammar and your semantics during parser execution. You are walking the hack,
instead of walking your own grammar’s syntax tree in Reverse Polish order.
Implementing semantics becomes more complex. You still wind up with added
complexity when doing left recursion, just moved around a bit.

LALR allows you to more directly express the grammar you want to express. With
top down parsers, you can accomplish the same thing, but you have to take a
more roundabout route to express the same grammar, and again you are likely
to find you have allowed expressions that you do not want and which do not
naturally have reasonable and expected semantics.

ANTLR performs top down generation of the expression tree. Your code called by
ANTLR converts the expression tree into the abstract syntax tree, and the
abstract syntax tree into the High Level Intermediate Representation.

The ANTLR algorithm can be slow as a week of Sundays, or wind up eating
polynomially large amounts of memory till it crashes. To protect against
this problem, [he
suggests using the fast SLL algorithm first, and should it fail, then using
the full on potentially slow and memory hungry LL\* algorithm.](https://github.com/antlr/antlr4/issues/374) Ninety
percent of language files can be parsed by the fast algorithm, because people
just do not use too clever by half constructions. But it appears to me that
anything that cannot be parsed by SLL, but can be parsed by LL\*, is not good
code – that what confuses an SLL parser also confuses a human, that the
alternate readings permitted by the larger syntax are never code that people
want to use.

Antlr does not know or care if your grammar makes any sense until it tries to
analyze particular texts. But you would like to know up front if your
grammar is valid.

LALR parsers are bottom up, so they have terrible error messages when they
analyze a particular example of the text, but they have the enormous advantage
that they will analyze your grammar up front and guarantee that any
grammatically correct statement is LALR. If a LALR parser can analyze it,
chances are that a human can also. ANTLR permits grammars that permit
unintelligible statements.

The [LRX parser](http://lrxpg.com/downloads.html) looks the most
suitable for your purpose. It has a restrictive license and only runs in the
Visual Studio environment, but you only need to distribute the source code it
builds the compiler from as open source, not the compiler compiler. It halts
at the first error message, since it is incapable of building intelligible
multiple error messages. The compiler it generates builds a syntax tree and a
symbol table.

The generically named [lalr](https://github.com/cwbaker/lalr)
looks elegantly simple, and not joined at the hip to all sorts of strange
environments. Unlike Bison C++, it should be able to handle unicode strings
with its regular expressions. It only handles BNF, not EBNF, but that
is a relatively minor detail. Its regular expressions are under documented,
but regular expression syntax is pretty standard. It does not build a symbol
table.

And for full generality, you really need a symbol table where the symbols get
syntax, which is a major extension to any existing parser. That starts to
look like hard work. The lalr algorithm does not add syntax on the fly. The
lrxpg parser does build a symbol table on the fly, but not syntax on the
fly – but its website just went down. No one has attempted to write a
language that can add syntax on the fly. They build a syntax capable of
expressing an arbitrary graph with symbolic links, and then give the graph
extensible semantics. The declaration/definition semantic is not full
parsing on the definition, but rather operates on the tree.

In practice, LALR parsers need to be extended beyond LALR with operator
precedence. Expressing operator precedence within strict LALR is apt to be
messy. And, because LALR walks the tree in reverse polish order, you want
the action that gets executed at parse time to return a value that the
generated parser puts on a stack managed by the parser, which stack is
available when the action of the operator that consumes it is called. In
which case the definition/declaration semantic declares a symbol that has a
directed graph associated with it, which graph is then walked to interpret
what is on the parse stack. The data of the declaration defines metacode
that is executed when the symbol is invoked, the directed graph associated
with the symbol definition being metacode executed by the action that the
parser performs when the symbol is used. The definition/declaration semantic
allows arbitrary graphs containing cycles (full recursion) to be defined, by
the declaration adding indirections to a previously constructed directed graph.

The operator-precedence parser can parse all LR(1) grammars where two
consecutive nonterminals and epsilon never appear in the right-hand side of any
rule. They are simple enough to write by hand, which is not generally the case
with more sophisticated right shift-reduce parsers. Second, they can be written
to consult an operator table at run time. Considering that “universal” resource
locators and command lines are parsed with mere lexers, perhaps a hand written
operator-precedence parser is good enough. After all, Forth and Lisp have less.
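
A sketch of such a run time operator table, assuming a precedence climbing
loop like the one sketched in the Pratt section: because precedences live in
a map rather than in a generated grammar, source text can register new
operators while the parse is in progress.

```cpp
#include <map>
#include <string>

struct Binding { int lbp, rbp; };   // left and right binding power

// Consulted on every operator token, and mutable at run time.
std::map<std::string, Binding> operatorTable = {
    {"+", {10, 11}}, {"*", {20, 21}},
};

// A parse action (or library source text) can extend the syntax on the fly:
void defineOperator(const std::string& symbol, int lbp, int rbp) {
    operatorTable[symbol] = Binding{lbp, rbp};
}
```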

C++ variadic templates are a purely functional metalanguage operating on the
parse stack. Purely functional languages suck, as demonstrated by the fact
that we are now retroactively shoehorning procedural code (if constexpr) into
the C++ template metalanguage. Really, you need the parse stack of previously
encountered arguments to potentially contain arbitrary objects.

When a lalr parser parses an if-then-else statement, then if the parser
grammar defines “if” as the nonterminal, which may contain an “else”
clause, it is going to execute the associated actions in the reverse order.
But if you define “else” as the nonterminal, which must be preceded by an
“if” clause, then the parser will execute the associated actions in the
expected order. But suppose you have an else clause in curly brackets
inside an if-then-else. Then the parse action order is necessarily going to
be different from the procedural. Further, the very definition of an
if-then-else clause implies a parse time in which all actions are performed,
and a procedural time in which only one action is performed.

Definition metacode must operate on the parser stack, but declaration
metacode may operate on a different stack, implying a coroutine relationship
between declaration metacode and definition metacode. The parser, to be
intelligible, has to perform actions in as close to left to right order as
possible, hence my comment that the “else” nonterminal must contain the “if”
nonterminal, not the other way around – but what if the else nonterminal
contains an “if then else” inside curly braces? The parser actions can and
will happen in a different order to the run time actions. Every term of the
if-then-else structure is going to have its action performed in syntax order,
but the syntax order has to be capable of implying a different procedural
order, during which not all actions of an if-then-else structure will be
performed. And similarly with loops, where every term of the loop causes a
parse time action to be performed once in parse time order, but procedural
time actions in a different order, and performed many times.

This implies that any fragment of source code in a language that uses the
declaration/definition syntax and semantic gets to do stuff in three phases.
(Hence in C, you can declare a variable or a function without defining it,
resulting in link time errors, and in C++ declare a class without defining
its methods and data, resulting in compilation errors at a stage of
compilation that is ill defined and inexplicit.)

The parser action of the declaration statement constructs a declaration data
structure, which is metacode, possibly invoking the metacode generated by
previous declarations and definitions. When the term declared is then used,
the metacode of the definition is executed. And the usage may well
invoke the metacode generated by the action associated at parse time with the
declaration statement, but attempting to do so causes an error in the parser
action if the declaration action has not yet been encountered in parse action
order.

So, we get parser actions which construct definition and declaration metacode,
and subsequent parser actions, performed later during the parse of subsequent
source code, that invoke that metacode by name to construct metacode. But, as
we see in the case of the if-then-else and do-while constructions, there must
be a third execution phase, in which the explicitly procedural code
constructed, but not executed, by the metacode, is actually executed
procedurally. Which, of course, in C++ is performed after the link and load
phase. But we want procedural metacode. And since procedural metacode must
contain conditionals and loops, there has to be a third phase during parsing,
executed as a result of parse time actions, that procedurally performs ifs
and loops in metacode. So a declaration can invoke the metacode constructed
by previous declarations – meaning that a parse time action executes metacode
constructed by previous parse time actions. But, to invoke procedural
metacode from a parse time action, a previous parse time action has to have
invoked metacode constructed by an even earlier parse time action to
construct procedural metacode.

Of course all three phases can be collapsed into one, as a definition can act
as both a declaration and a definition, two phases in one, but there have to
be three phases, which can be the result of parser actions widely separated in
time, triggered by code widely separated in the source, and thinking of the
common and normal case is going to result in mental confusion, collapsing
things that are distinct, because the distinction is commonly unimportant and
elided. Hence the thick syntactic soup with which I have been struggling when
I write C++ templates defining classes that define operators and then attempt
to use the operators.

In the language of C we have parse time actions, link time actions, and
execution time actions, and only at execution time is procedural code
constructed as a result of earlier actions actually performed procedurally.

We want procedural metacode that can construct procedural metacode. So we
want execution time actions performed during parsing. So let us call the
actions definitional actions, linking actions, and execution actions. And if
we are going to have procedural actions during parsing, we are going to have
linking actions during parsing. (Of course, in actually existent C++, second
stage compilation does a whole lot of linker actions, resulting in
excessively tight coupling between linker and compiler, the inability of
other languages to link to C++, and the syntax soup that ensues when I define
a template class containing inline operators.)

# Forth the model

We assume the equivalent of Forth, where the interpreter directly interprets
and executes human readable and writeable text, by looking up the symbols in
the text and performing the actions they command, which commands may command
the interpreter to generate compiled and linked code, including compiled code
that generates compiled and linked code, command the interpreter to add names
for what it has compiled to the name table, and then command the interpreter
to execute those routines by name.

Except that Forth is absolutely typeless, or has only one type, fixed
precision integers that are also pointers, while we want a language in which
types are first class values, as manipulable as integers, except that they
are immutable; a language where a pointer to a pointer to an integer cannot
be added to a pointer, and subtraction of one pointer from another pointer of
the same type pointing into the same object produces an integer, where you
cannot point a pointer out of the range of the object it refers to, nor
increment a reference, only the referenced value.

Lexing merely needs symbols to be listed. Parsing merely needs them to be, in
C++ terminology, declared but not defined. Pratt parsing puts operators in
Forth order, but knows and cares nothing about types, so it is naturally
adapted to a Forth like language which has only one type, or whose values have
run time types, or to generating an intermediate language which undergoes a
second stage compilation that produces statically typed code.

In Forth, symbols pointed to memory addresses, and it was up to the command
whether it would load an integer from an address, store an integer at that
address, execute a subroutine at that address, or go to that address: the
ultimate in unsafe typelessness.

Pratt parsing is an outstandingly elegant solution to parsing, and allows
compile time extension to the parser, though it needs a lexer driven by the
symbol table if you have multi character operators, but I am still lost in the
problem of type safety.

Metaprogramming in C++ is done in a lazily evaluated purely functional language
where a template is usually used to construct a type from type arguments. I
want to construct types procedurally, and generate code procedurally, rather
than by invoking pure functions.

In Pratt parsing, the language is parsed sequentially in source order, but
the parser maintains a tree of recursive calls, and builds a tree of pointers
into the source, such that it enters each operator in polish order, and
finishes up each operator in reverse polish order.

On entering in polish order, this may be an operator with a variable number of
arguments (unary minus or infix minus), so it cannot know the number of
operands coming up, but on exiting in reverse polish order, it knows the
number and something about the type of the arguments, so it has to look for an
interpretation of the operator that can handle that many arguments of those
types. Which may not necessarily be a concrete type.

Operators that change the behavior of the lexer or the parser are typically
acted upon in polish order. Compilation to byte code that does not yet have
concrete types is done in reverse polish order, so operators that alter the
compilation to byte code are executed at that point. Operators that manipulate
that byte code during the linking to concrete types act at link time, when the
typeless byte code is invoked with concrete types.

Naming puts a symbol in the lexer symbol table.

Declaring puts a symbol in the parser symbol table.

Defining compiles, and possibly links, the definition, and attaches that data
to the symbol where it may be used or executed in subsequent compilation and
linking steps when that symbol is subsequently invoked. If the definition
contains procedural code, it is not going to be executed procedurally until
compiled and linked, which will likely occur when the symbol is invoked later.

An ordinary procedure definition without concrete types is the equivalent of an
ordinary C++ template. When it is used with concrete types, the linker will
interpret the operations it invokes in terms of those concrete types, and fail
if they don’t support those operations.

A metacode procedure gets put into the lexer symbol table when it is named,
and into the parser symbol table when it is declared. When it is defined, its
definition may be used when its symbol is encountered in polish order by the
parser, and may be executed at that time to modify the behavior of parser and
linker. When a named, declared, and defined symbol is encountered by the
parser in reverse polish order, its compiled code may be used to generate
linked code, and its linked and compiled code may manipulate the compiled code
preparatory to linking.

When a symbol is declared, it gets added to the parser and lexer symbol table.
When it is defined, it gets added to the linker symbol table. When defined
with a concrete type, it also gets added to the linker symbol table with those
concrete types, as an optimization.

If an operation could produce an output of variant type, then it is an additive
algebraic type, which then has to be handled by a switch statement.
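
In present day C++ terms such an output is a sum type; a small sketch with
`std::variant`, where `std::visit` plays the role of the switch statement.

```cpp
#include <iostream>
#include <string>
#include <variant>

// An operation whose output type varies: int on success, string on failure.
using Result = std::variant<int, std::string>;

Result parseNumber(const std::string& s) {
    try { return std::stoi(s); }
    catch (...) { return std::string("not a number: ") + s; }
}

int main() {
    // The switch over the additive algebraic type:
    std::visit([](const auto& v) { std::cout << v << "\n"; },
               parseNumber("42"));
}
```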

There are five steps: lexing, parsing, compiling, linking, and running, and
any fragment of source code may experience some or all of these steps, with the
resulting entries in the symbol table then being available to the next code
fragment, Forth style. Thus `77+9` gets lexed into `77, +, 9`, parsed into
`+(77, 9)`, compiled into `77 9 +`, linked into `77 9 +<int, int>`, and
executed into `int(86)`, and the rest of the source code proceeds to parse,
compile, link, and run as if you had written `86`.
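
A toy version of the last two steps under those assumptions: the compiled
form `77 9 +` is a postfix program, and running it is a stack walk.

```cpp
#include <iostream>
#include <stack>
#include <string>
#include <vector>

// Run the postfix (reverse polish) program that `77+9` compiles into.
int runPostfix(const std::vector<std::string>& code) {
    std::stack<int> s;
    for (const auto& tok : code) {
        if (tok == "+") {                // linked here as +<int, int>
            int b = s.top(); s.pop();
            int a = s.top(); s.pop();
            s.push(a + b);
        } else {
            s.push(std::stoi(tok));      // operand
        }
    }
    return s.top();
}

int main() {
    std::cout << runPostfix({"77", "9", "+"}) << "\n";  // prints 86
}
```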

Further, the source code can create run time code, code that gets declared,
defined, and linked during the compile, that is executed during the compile,
modifying the behavior of the lexer, the parser, the compiler, and the linker
over the course of a single compile and link. This enables a Forth style
bootstrapping, where the lexer, parser, compiler and linker lexes, compiles,
and links most of its own potentially modifiable source code in every compile,
much as every C++ compile includes the header files for the standard template
library, so that much of your program is rewritten by template metacode that
you included at the start of the program.

Compiled but not linked code could potentially operate on variables of any
type, though if the variables did not have a type required by an operator, you
would get a link time error, not a compile time error. This is OK because
linking of a fragment of source is not a separate step, but usually happens
before the lexer has gotten much further through the source code; it happens as
soon as the code fragment is invoked with variables of defined type, though
usually of as yet undefined value.

A console program is an operator whose values are of the type iostream; it gets
linked as soon as the variable type is defined, and executed when you assign
defined values to the iostream.

Because C++ metacode is purely functional, it gets lazily evaluated, so the
syntax and compiler can cheerfully leave it undefined when, or even if, it
gets executed. Purely functional languages only terminate by laziness. But
if we want to do the same things with procedural metacode, we have no option
but to explicitly define what gets executed when. In which case pure lalr
syntax is going to impact the semantics, since lalr syntax defines the order
of parse time actions, and order of execution impacts the semantics. I am not
altogether certain as to whether the result is going to be intelligible and
predictable. Pratt syntax, however, is going to result in predictable
execution order.

The declaration, obviously, defines code that can be executed by a subsequent
parse action after the declaration parse action has been performed, and the
definition defines code that can be compiled after the definition parse action
has been performed.

The compiled code can be linked when invoked with variables of defined
type and undefined value, and executed when invoked with variables of defined
type and a defined value.

Consider what happens when the definition defines an overload for an infix
operator. The definition of the infix operator can only be procedurally
executed when the parser calls the infix action with the arguments on the parse
stack, which happens long after the infix operator is overloaded.

The definition has to be parsed when the parser encounters it. But it is
procedural code, which cannot be procedurally executed until later, much
later. So the definition has to compile, not execute, procedural code, then
cause the data structure created by the declaration to point to that compiled
code. And then later, when the parser encounters an actual use of the infix
operator, the compiled procedural code of the infix definition is actually
executed to generate linked procedural code with explicit and defined types,
which is part of the definition of the function or method in whose source code
the infix operator was used.

One profoundly irritating feature of C++ code, probably caused by LL parsing,
is that if the left hand side of an infix expression has an appropriate
overloaded operator, it works, but if the right hand side does, it fails. Here
we see parsing having an incomprehensible and arbitrary influence on semantics.

C++ is a strongly typed language. With types, any procedure has typed
inputs and outputs, and should only do safe and sensible things for that type.
C++ metacode manipulates types as first class objects, which implies that if we
were to do the same thing procedurally, types need a representation, and
procedural commands to make new types from old, and to garbage collect, or
memory manage, these data objects, as if they were strings,
floating point numbers, or integers of known precision. So you could construct
or destruct an object of type type, generate new types by doing type operations
on old types, for example add two types or multiply two types to produce an
algebraic type, or create a type that is a const type or pointer type to an
existing type, which type actually lives in memory somewhere, in a variable
like any other variable. And, after constructing an algebraic type by
procedurally multiplying two types, and perhaps storing it in a variable of
type type, or invoking a function (aka C++ template type) that returns a type
dynamically, create an object of that type – or an array of objects of that
type. For every actual object, the language interpreter knows the type,
meaning the object of type X that you just constructed is somehow linked to the
continuing existence of the object of type type that has the value type X that
you used to construct it, and cannot be destroyed until all the objects created
using it are destroyed. Since the interpreter knows the type of every object,
including objects of type type, and since every command to do something with an
object is type aware, this can prevent the interpreter from being commanded to
do something stupid. Obviously type data has to be stored somewhere, and has
to be immutable, at least until garbage collected because no longer referenced.

Can circular type references exist? Well, not if they are immutable, because
if a type references a type, that type must already exist, and so cannot
reference a type that does not yet exist. It could reference a function that
generates types, but that reference is not circular. It could have values that
are constexpr, and values that reference static variables. If no circular
references are possible, garbage collection by reference counting works.
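
A sketch of that invariant in C++ (the type representation is mine): if type
nodes are immutable and can only reference types that already exist, the
reference graph is acyclic, so `shared_ptr` reference counting reclaims them
without a cycle collector.

```cpp
#include <memory>
#include <string>
#include <vector>

// An immutable type node: its referents must exist before it does,
// so no reference cycle can ever be formed.
struct Type {
    const std::string name;
    const std::vector<std::shared_ptr<const Type>> parts;
};

// Deriving a new type from an old one only adds edges to older nodes.
std::shared_ptr<const Type> makePointerTo(const std::shared_ptr<const Type>& t) {
    return std::make_shared<const Type>(Type{"*" + t->name, {t}});
}
```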

Types are algebraic types, sums and products of existing types, plus modifiers
such as `const`, `*`, and `&`.

Type information is potentially massive, and if we are executing a routine that
refers to a type by the function that generates it, we don’t want the
equivalent of a C++ template invoked every time, generating a new immutable
object every time that is an exact copy of what it produced the last time it
went through the loop. Rather, the interpreter needs to short circuit the
construction by looking up a hash of that type constructing template call, to
check whether it has been called with those function inputs and has already
produced an object of that type. And when a function that generates a type is
executed, it needs to look for duplications of existing types. A great many
template invocations simply choose the right type out of a small set of
possible types. It is frequently the case that the same template may be
invoked with an enormous variety of variables, and come up with very few
different concrete results.
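
A sketch of that short circuit, assuming type constructors are pure: key each
constructor call by its inputs and return the already built type when there
is one (hash consing), so a “template” invoked repeatedly in a loop
constructs its result once.

```cpp
#include <map>
#include <memory>
#include <string>

struct Type { std::string name; };   // simplified type object

// Cache of already constructed types, keyed by constructor name plus inputs.
std::map<std::string, std::shared_ptr<const Type>> typeCache;

std::shared_ptr<const Type> internType(const std::string& key) {
    auto it = typeCache.find(key);
    if (it != typeCache.end()) return it->second;      // built before: reuse
    auto made = std::make_shared<const Type>(Type{key});
    typeCache.emplace(key, made);                      // build exactly once
    return made;
}
// internType("pointer to int") in a loop returns the same object every time.
```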

When the interpreter compiles a loop or a recursive call, the type information
is likely to be an invariant, which should get optimized out of the loops. But
when it is directly executing source code commands which command it to compile
source code, such optimization is impossible.

But, as in Forth, you can tell the interpreter to store the commands in a
routine somewhere, and when they are stored, the types have already been
resolved. Typically the interpreter is going to finish interpreting the source
code, producing stored programs each containing a limited amount of type
information.
|