---
title: Parsers
...
This rambles a lot. Thoughts in progress: Summarizing my thoughts here at the top.
Linux scripts started off using lexing for parsing, resulting in complex and
incomprehensible semantics, producing unexpected results. (Try naming a
file `-r`, or a directory with spaces in the name.)
They are rapidly converging in actual usage to the operator precedence
syntax and semantics\
`command1 subcommand arg1 … argn infixoperator command2 subcommand …`
Which is parsed as\
`((staticclass1.staticmethod(arg1 … argn)) infixoperator (staticclass2.staticmethod(…)))`
With line feed acting as `}{` operator, start of file acting as a `{` operator, end
of file acting as a `}` operator, suggesting that in a sane language, indent
increase should act as `{` operator, indent decrease should act as a `}`
operator.
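
A minimal sketch of the indentation idea, assuming space-only indentation and treating each non-blank line as one statement (it does not model line feed as a `}{` operator). The function name `indent_to_braces` is hypothetical.

```python
def indent_to_braces(source: str) -> list[str]:
    """Turn indentation changes into explicit '{' and '}' tokens."""
    tokens = ["{"]            # start of file acts as '{'
    stack = [0]               # indentation widths of the currently open blocks
    for line in source.splitlines():
        if not line.strip():
            continue          # blank lines carry no structure
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("{")        # indent increase opens a block
        while width < stack[-1]:
            stack.pop()
            tokens.append("}")        # indent decrease closes a block
        tokens.append(line.strip())
    # Close whatever is still open; the final '}' is the end of file.
    tokens.extend("}" for _ in stack)
    return tokens


print(indent_to_braces("if x\n    a\n    b\nc"))
```
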
Command line syntax sucks, because programs interpret their command
lines using a simple lexer, which lexes on spaces. Universal resource
identifier syntax sucks, because it was originally constructed so that it
could be a command line argument, hence no spaces, and because it was
designed to be parsed by a lexer.
But EBNF parsers also suck, because they do not parse the same way
humans do. Most actual programs can be parsed by a simple parser, even
though the language in principle requires a more powerful parser, because
humans do not use the nightmarish full power of a grammar that an EBNF
definition winds up defining.
Note that the [LLVM language creation tools](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/)
tutorial does not use an EBNF
parser. These tools also make creating a new language with JIT semantics
very easy.
We are programming in languages that are not parsed the way the
programmer is parsing them.
Programming languages ignore whitespace, because programmers tend to
express their meaning with whitespace for the human reader, and
whitespace grammar is not altogether parallel to the EBNF grammar.
There is a mismatch in grammars.
Seems to me that human parsing is combination of low level lexing, Pratt
parsing on operator right and left binding power, and a higher level of
grouping that works like lexing. Words are lexed by spaces and
punctuation, grouped by operator binding power, with operator
recognition taking into account the types on the stack, groups of parsed
words are bounded by statement separators, which can be lexed out,
groups of statements are grouped and bounded by indenting.
Some levels in the hierarchy are lexed out, others are operator binding
power parsed out. There are some “operators” that mean group separator
for a given hierarchical level, which is a tell that reveals lex style parsing,
for example semi colon in C++, full stop and paragraph break in text.
The never ending problems from mixing tab and spaces indenting can be
detected by making an increase or decrease of indent by a space a bracket
operator, and an increase or decrease by a tab a non matching bracket
operator.
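
One way to read the bracket idea, sketched under the assumption that each open block remembers its literal whitespace prefix, so that a block opened with spaces can only be continued or closed by a line whose prefix matches character for character. The name `mismatched_indents` is hypothetical.

```python
def mismatched_indents(source: str) -> list[int]:
    """Return 1-based numbers of lines whose indentation matches no open block."""
    bad = []
    stack = [""]                          # literal whitespace prefixes of open blocks
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue
        prefix = line[: len(line) - len(line.lstrip(" \t"))]
        if prefix == stack[-1]:
            continue                      # same block
        if prefix.startswith(stack[-1]):
            stack.append(prefix)          # opening bracket: extends the current prefix
        elif prefix in stack:
            del stack[stack.index(prefix) + 1:]   # closing brackets that match
        else:
            bad.append(number)            # a closer that matches no opener
    return bad


print(mismatched_indents("if x\n    a\n\tb\nc"))
```

A line indented with a tab inside a block that was opened with spaces is reported, as is a dedent to a depth that was never opened.
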
Pratt parsing parses operators by their left and right binding power,
which is a superset of operator precedence parsing. EBNF does not
directly express this concept, and programming this concept into EBNF is
complicated, indirect, and imperfect because it is too powerful a
superset, that can express anything, including things that do not make
sense to the human writing the stuff to be parsed.
Pratt parsing finalizes an expression by visiting the operators in reverse
polish order, thus implicitly executing a stack of run time typed operands,
which eventually get compiled and eventually executed as just-in-time typed
or statically typed operands and operators.
For [identity](names/identity.html), we need Cryptographic Resource Identifiers,
which cannot conform to the “Universal” Resource Identifier syntax and semantics.
Lexers are not powerful enough, and the fact that they are still used
for uniform resource identifiers, relative resource identifiers, and command
line arguments is a disgrace.
Advanced parsers, however, are too powerful, resulting in syntax that is
counter intuitive. That ninety percent of the time a program file can be
parsed by a simple parser incapable of recognizing the full set of
syntactically correct expressions that the language allows indicates that the
programmer's mental model of the language has a simpler structure.
# Pratt Parsing
I really love the Pratt Parser, because it is short and simple, because if you
add to the symbol table you can add new syntax during compilation,
because what it recognizes corresponds to human intuition and human
reading.
But it is just not actually a parser. Given a source with invalid expressions
such as unary multiplication and unbalanced parentheses, it will cheerfully
generate a parse. It also lacks the concept out of which all the standard
parsers are constructed, that expressions are of different kinds, different
nonterminals.
To fix Pratt parsing, it would have to recognize operators as bracketing, as
prefix unary, postfix unary, or infix, and that some operators do not have an
infix kind, and it would have to recognize that operands have types, and that
an operator produces a type from its inputs. It would have to attribute a
nonterminal to a subtree. It would have to recognize ternary operators as
operators.
And that is a major rewrite and reinvention.
LALR parsers appear to be closer to the programmer mental model, but looking at
Pratt Parsing, there is a striking resemblance between C and what falls out of
Pratt's model:
The kind of “lexing” the Pratt parser does seems to have a natural
correspondence to the kind of parsing the programmer does as his eye rolls
over the code. Pratt's deviations from what would be correct behavior in
simple arithmetic expressions composed of numerals and single character
symbols seem to strikingly resemble expressions that engineers find
comfortable.
When `expr` is called, it is provided the right binding power of the
token that called it. It consumes tokens until it meets a token whose left
binding power is equal or lower than the right binding power of the operator
that called it. It collects all tokens that bind together into a tree before
returning to the operator that called it.
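
A minimal sketch of that loop, assuming tuples for tree vertexes and made-up binding power numbers in the hypothetical tables `INFIX` and `PREFIX`. The `led` step is folded into the `while` loop of `expr`, and any token that is not a bracket or prefix operator is accepted as an atom without further checks.

```python
import re

# Hypothetical binding powers: operator -> (left, right).
# left < right gives left associativity, left > right gives right associativity.
INFIX = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21), "^": (31, 30)}
PREFIX = {"-": 25}                      # right binding power of unary minus


def parse(text):
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tok

    def nud():
        tok = advance()
        if tok == "(":
            inner = expr(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok in PREFIX:
            return (tok, expr(PREFIX[tok]))
        return tok                      # anything else is treated as an atom

    def expr(rbp):
        left = nud()
        # Keep consuming while the next operator binds to the left more
        # tightly than the operator that called us binds to the right.
        while peek() in INFIX and INFIX[peek()][0] > rbp:
            op = advance()
            left = (op, left, expr(INFIX[op][1]))
        return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("1 - 2 - 3"))
print(parse("2 ^ 3 ^ 2"))
```
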
The Pratt `peek` peeks ahead to see if what is coming up is an
operator, therefore needs to check what is coming up against a symbol table,
which existing implementations fail to explicitly implement.
The Pratt algorithm, as implemented by Pratt and followers, assumes that all
operators can be unary prefix or infix (hence the nud/led distinction). It
should get the nature of the upcoming operator from the symbol table (infix,
unary, or both, and if unary, prefix or postfix).
Although implementers have not realized it, they are treating all “non
operator” tokens as unary postfix operators. Instead, or in addition, they
need to treat all tokens (where items recognized from a symbol table are
pre-aggregated) as operators, with ordinary characters as postfix unary,
spaces as postfix unary with weaker binding power, and a token consisting of
a utf8 iterator plus a byte count as equivalent to a left tree with right
single character leaves and a terminal left leaf.
Pratt parsing is like lexing, breaking a stream of characters into groups,
but the grouping is hierarchical. The algorithm annotates a linear text with
hierarchy.
Operators are characterized by a global order of left precedence, a global
order of right precedence (the difference giving us left associativity and
right associativity).
If we extend the Pratt algorithm with the concept of unary postfix
operators, we see it is treating each ordinary unrecognized character as a
unary postfix operator, and each whitespace character as a unary postfix
operator of weaker binding power.
[Apodaca]:https://dev.to/jrop/pratt-parsing
Pratt and [Apodaca] are primarily interested in the case of unary minus, so
they handle the case of a tree with a potentially null token by
distinguishing between nud (no left context) and led (the right hand side of
an operator with left context).
Pratt assumes that in correct source text, `nud` is only going to encounter an
atomic token, in which case it consumes the token, constructs a leaf vertex
which points into the source, and returns, or a unary prefix operator, or an
opening bracket. If it encounters an operator, it calls `expr` with the right
binding power of that operator, and when `expr` has finished parsing, returns
a corresponding vertex.
Not at all clear to me how it handles brackets. Pratt gets by without the
concept of matching tokens, or hides it implicitly. Seems to me that correct
parsing is that a correct vertex has to contain all matching tokens, and the
expressions contained therein, so a vertex corresponding to a bracketed
expression has to point to the open and closing bracket terminals, and the
contained expression. I would guess that his algorithm winds up with a
tree that just happens to contain matching tokens in related positions in the tree.
Suppose the typical case, a tree of binary operators inside a tree of binary
operators: In that case, when `expr` is called, the source pointer is pointing
to the start of an expression. `expr` calls `nud` to parse the expression, and if
that is all she wrote (because `peek` reveals an operator with lower left
binding power than the right binding power that `expr` was called with)
returns the edge to the vertex constructed by `nud`. Otherwise, it parses out
the operator, and calls `led` with the right binding power of the operator it
has encountered, to get the right hand argument of the binary operator. It
then constructs a vertex containing the operator, whose left edge points to
the node constructed by `nud` and whose right hand edge points to the node
constructed by `led`. If that is all she wrote, returns, otherwise iterates
its while loop, constructing the ever higher root of a right leaning tree of
all previous roots, whose ultimate left most leaf is the vertex constructed by
`nud`, and whose right hand vertexes were constructed by `led`.
The nud/led distinction is not sufficiently general. They did not realize
that they were treating ordinary characters as postfix unary operators.
Trouble is, I want to use the parser as the lexer, which ensures that as the
human eye slides over the text, the text reads the way it is in fact
structured. But if we do Pratt parsing on single characters to group them
into larger aggregates, `p*--q*s` is going to be misaggregated by the parser
to `(( (p*)) (q*s)`, which is meaningless.
And, if we employ Pratt's trick of the nud/led distinction, it will evaluate as
`p*(-(-q*s))`, which gives us a meaningful but wrong result, `p*q*s`.
If we allow multicharacter operators then they have to be lexed out at the
earliest stage of the process: the Pratt algorithm has to be augmented by
aggregate tokens, found by attempting to match the following text against a
symbol table. Existing Pratt algorithms tend to have an implicit symbol table of
one character symbols, everything in the symbol table being assumed to be
potentially either infix or unary prefix, and everything else outside the
implicit symbol table unary postfix.
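
A minimal sketch of lexing against an explicit symbol table, assuming longest match wins and that any run of characters not in the table is aggregated into one token. The function `lex` and the example symbol set are hypothetical.

```python
def lex(text: str, symbols: set[str]) -> list[str]:
    """Split text into tokens, preferring the longest symbol-table match."""
    longest = max((len(s) for s in symbols), default=0)
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first so that '--' beats '-'.
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in symbols:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # Not a symbol: aggregate ordinary characters up to the next
            # whitespace or the next position where a symbol starts.
            j = i
            while (j < len(text) and not text[j].isspace()
                   and not any(text.startswith(s, j) for s in symbols)):
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


print(lex("p*--q<<=s", {"*", "-", "--", "<", "<<", "<<="}))
```

Adding an entry to `symbols` between calls changes how later text is lexed, which is the hook that compile time syntax extension needs.
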
Suppose a token consists of a utf8 iterator and a byte count.
So, all the entities we work with are trees, but recursion terminates because
some nodes of the tree have been collapsed to variables that consist of a
utf8 iterator and a byte count, *and some parts of the tree have been
partially collapsed to vertexes that consist of a utf8 iterator, a byte count,
and an array of trees*.
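
A minimal sketch of that representation, assuming a byte offset into the source stands in for the utf8 iterator. The class name `Vertex` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    start: int                      # byte offset into the utf8 source
    size: int                       # byte count
    children: list["Vertex"] = field(default_factory=list)

    def text(self, source: bytes) -> str:
        """Recover the source text this vertex points at."""
        return source[self.start:self.start + self.size].decode("utf-8")


source = "π*r".encode("utf-8")
pi = Vertex(0, 2)                   # 'π' occupies two bytes in utf8
r = Vertex(3, 1)
product = Vertex(0, 4, [pi, r])     # partially collapsed: a span plus its subtrees

print(product.text(source), [child.text(source) for child in product.children])
```

A leaf is a vertex with no children, which is where the recursion terminates.
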
C++ forbids `“foo bar()”` to match `“foobar()”`, but
allows `“foobar ()”` to match, which is arguably an error.
`“foobar(”` has to lex out as a prefix operator. But it is not
really a prefix unary operator. It is a set of matching operators, like
brackets and the ternary operator `bool?value:value`. The commas and the closing
bracket are also part of it. Which brings us to recognizing ternary
operators. The naive single character Pratt algorithm handles ternary
operators correctly (assuming that the input text is valid) which is
surprising. So it should simply also match the commas and right bracket as a
particular case of ternary and higher operators in the initial symbol search,
albeit doing that so that it is simple and correct and naturally falls out of
the algorithm is not necessarily obvious.
Operator precedence gets you a long way, but it messed up because it did not
recognize the distinction between right binding power and left binding
power. Pratt gets you a long way further.
But Pratt messes up because it does not explicitly recognize the difference
between unary prefix and unary postfix, nor does it explicitly recognize
operator matching, that a group of operators is one big multi argument
operator. It does not recognize that brackets are expressions of the form
symbol-expression-match, let alone that ternary operators are expressions of
the form expression-symbol-match-expression.
Needs to be able to recognize that expressions of the form
expression-symbol-expression-match-expression-match\...expression are
expressions, and convert the tree into prefix form (polish notation with
arguments bracketed) and into postfix form (reverse polish) with a count of
the stack size.
Needs to have a stack of symbols that need left matches.
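
A minimal sketch of such a stack, assuming a hypothetical table `MATCH` that maps each opening symbol to the symbol that must eventually follow it. It only checks matching and builds no tree.

```python
# Hypothetical table: opening symbol -> the symbol that must match it.
MATCH = {"(": ")", "[": "]", "?": ":"}
CLOSERS = set(MATCH.values())


def check_matches(tokens) -> bool:
    """Return True when every opener is closed by its match, innermost first."""
    pending = []                        # closers we are still waiting for
    for tok in tokens:
        if tok in MATCH:
            pending.append(MATCH[tok])
        elif tok in CLOSERS:
            if not pending or pending.pop() != tok:
                return False            # closer with no opener, or the wrong one
    return not pending                  # leftover openers are also an error


print(check_matches(list("(a?b:c)")), check_matches(list("(a?b)c:")))
```

Treating `?` and `:` the same way as brackets is what lets the ternary operator be recognized as one multi argument operator.
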
# Lalr
Bison and yacc are
[joined at the hip](https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/) to seven bit ascii and BNF, (through flex and lex)
whereas [ANTLR](https://tomassetti.me/ebnf/)
recognizes unicode and the far more concise and intelligible EBNF. ANTLR
generates ALL(\*) parsers, which allow syntax that allows statements that are ugly
and humanly unintelligible, while Bison when restricted to LALR parsers allows
only grammars that forbid certain excesses, but generates unintelligible error
messages when you specify a grammar that allows such excesses.
You could hand write your own lexer, and use it with BisonC++. Which seemingly
everyone does.
ANTLR allows expressions that take a long time to parse, but only polynomially
long, fourth power, and prays that humans seldom use such expressions, which in
practice they seldom do. But sometimes they do, resulting in hideously bad
parser performance, where the parser runs out of memory or time. Because
the parser allows non LALR syntax, it may find many potential meanings
halfway through a straightforward lengthy expression that is entirely clear
to humans because the non LALR syntax would never occur to the human. In
ninety percent of files, there is not a single expression that cannot be
parsed by very short lookahead, because even if the language allows it,
people just do not use it, finding it unintelligible. Thus, a language that
allows non LALR syntax locks you in against subsequent syntax extension,
because the extension you would like to make already has some strange and non
obvious meaning in the existing syntax.
This makes it advisable to use a parser that can enforce a syntax definition
that does not permit non LALR expressions.
On the other hand, LALR parsers walk the tree in Reverse Polish
order, from the bottom up. This makes it hard to debug your grammar, and
hard to report syntax errors intelligibly. And sometimes you just cannot
express the grammar you want as LALR, and you wind up writing a superset of
the grammar you want, and then ad-hoc forbidding otherwise legitimate
constructions, in which case you have abandoned the simplicity and
directness of LALR, and the fact that it naturally tends to restrict you to
humanly intelligible syntax.
Top down makes debugging your syntax easier, and issuing useful error
messages a great deal easier. It is hard to provide any LALR handling of
syntax errors other than just stop at the first error, but top down makes it
a lot harder to implement semantics, because Reverse Polish order directly
expresses the actions you want to take in the order that you need to take
them.
LALR allows left recursion, so that you can naturally make minus and divide
associate in the correct and expected order, while with LL, you wind up
doing something weird and complicated: you build the tree, then you have
another pass to get it into the correct order.
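
A small illustration of the associativity problem, assuming a token list of numbers separated by `-`. The first function is the natural right recursive rule. The second is one common workaround (a loop that folds to the left), not the two pass approach described above. All names are hypothetical.

```python
def parse_right_recursive(tokens):
    """expr -> NUM '-' expr | NUM : easy for LL, but associates to the right."""
    head, rest = tokens[0], tokens[1:]
    if rest:
        return ("-", head, parse_right_recursive(rest[1:]))
    return head


def parse_left_loop(tokens):
    """expr -> NUM ('-' NUM)* : folds each operand onto the tree built so far."""
    tree = tokens[0]
    for operand in tokens[2::2]:
        tree = ("-", tree, operand)
    return tree


def evaluate(tree):
    if isinstance(tree, tuple):
        _, left, right = tree
        return evaluate(left) - evaluate(right)
    return tree


tokens = [10, "-", 4, "-", 3]
print(evaluate(parse_right_recursive(tokens)))   # 10 - (4 - 3)
print(evaluate(parse_left_loop(tokens)))         # (10 - 4) - 3
```
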
Most top down parsers, such as ANTLR, have a workaround to allow left
recursion. They internally turn it into right recursion by the standard
transformation, and then optimize out the ensuing tail recursion. But that
is a hack, which increases the distance between your expression tree and
your abstract syntax tree, and still increases the distance between your grammar
and your semantics during parser execution. You are walking the hack,
instead of walking your own grammar's syntax tree in Reverse Polish order.
Implementing semantics becomes more complex. You still wind up with added
complexity when doing left recursion, just moved around a bit.
LALR allows you to more directly express the grammar you want to express. With
top down parsers, you can accomplish the same thing, but you have to take a
more roundabout route to express the same grammar, and again you are likely
to find you have allowed expressions that you do not want and which do not
naturally have reasonable and expected semantics.
ANTLR performs top down generation of the expression tree. Your code called by
ANTLR converts the expression tree into the Abstract Syntax tree, and the
abstract syntax tree into the High Level Intermediate Representation.
The ANTLR algorithm can be slow as a week of sundays, or wind up eating
polynomially large amounts of memory till it crashes. To protect against
this problem, [the suggested approach
is to use the fast SLL algorithm first, and should it fail, then use
the full on potentially slow and memory hungry LL\* algorithm.](https://github.com/antlr/antlr4/issues/374) Ninety
percent of language files can be parsed by the fast algorithm, because people
just do not use too clever by half constructions. But it appears to me that
anything that cannot be parsed by SLL, but can be parsed by LL\*, is not good
code: that what confuses an SLL parser also confuses a human, and that the
alternate readings permitted by the larger syntax are never code that people
want to use.
ANTLR does not know or care if your grammar makes any sense until it tries to
analyze particular texts. But you would like to know up front if your
grammar is valid.
LALR parsers are bottom up, so have terrible error messages when they analyze
a particular example of the text, but they have the enormous advantage that
they will analyze your grammar up front and guarantee that any grammatically
correct statement is LALR. If a LALR parser can analyze it, chances are that
a human can also. ANTLR permits grammars that permit unintelligible statements.
The [LRX parser](http://lrxpg.com/downloads.html) looks the most
suitable for your purpose. It has a restrictive license and only runs in the
visual studio environment, but you only need to distribute the source code it
builds the compiler from as open source, not the compiler compiler. It halts
at the first error message, since incapable of building intelligible multiple
error messages. The compiler it generates builds a syntax tree and a symbol
table.
The generically named [lalr](https://github.com/cwbaker/lalr)
looks elegantly simple, and not joined at the hip to all sorts of strange
environment. Unlike Bison C++, should be able to handle unicode strings,
with its regular expressions. It only handles BNF, not EBNF, but that
is a relatively minor detail. Its regular expressions are under documented,
but regular expression syntax is pretty standard. It does not build a symbol
table.
And for full generality, you really need a symbol table where the symbols get
syntax, which is a major extension to any existing parser. That starts to
look like hard work. The lalr algorithm does not add syntax on the fly. The
lrxpg parser does build a symbol table on the fly, but not syntax on the
fly, and its website just went down. No one has attempted to write a
language that can add syntax on the fly. They build a syntax capable of
expressing an arbitrary graph with symbolic links, and then give the graph
extensible semantics. The declaration/definition semantic is not full
parsing on the definition, but rather operates on the tree.
In practice, LALR parsers need to be extended beyond LALR with operator
precedence. Expressing operator precedence within strict LALR is apt to be
messy. And, because LALR walks the tree in reverse polish order, you want
the action that gets executed at parse time to return a value that the
generated parser puts on a stack managed by the parser, which stack is
available when the action of the operator that consumes it is called. In
which case the definition/declaration semantic declares a symbol that has a
directed graph associated with it, which graph is then walked to interpret
what is on the parse stack. The data of the declaration defines metacode
that is executed when the symbol is invoked, the directed graph associated
with the symbol definition being metacode executed by the action that parser
performs when the symbol is used. The definition/declaration semantic allows
arbitrary graphs containing cycles (full recursion) to be defined, by the
declaration adding indirections to a previously constructed directed graph.
The operator-precedence parser can parse all LR(1) grammars where two
consecutive nonterminals and epsilon never appear in the right-hand side of any
rule. Such parsers are simple enough to write by hand, which is not generally
the case with more sophisticated shift-reduce parsers. They can also be written
to consult an operator table at run time. Considering that “universal” resource
locators and command lines are parsed with mere lexers, perhaps a hand written
operator-precedence parser is good enough. After all, Forth and Lisp have less.
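
A minimal sketch of a hand written operator-precedence parser that consults its operator table at run time, here the shunting-yard form that emits reverse polish order. It handles binary operators only, with no brackets. The function `to_rpn` and the table layout are hypothetical.

```python
def to_rpn(tokens, table):
    """table maps operator -> (precedence, 'L' or 'R'); returns tokens in RPN."""
    output, ops = [], []
    for tok in tokens:
        if tok in table:
            prec, assoc = table[tok]
            # Flush operators that bind at least as tightly as the new one.
            while ops:
                top_prec = table[ops[-1]][0]
                if top_prec > prec or (top_prec == prec and assoc == "L"):
                    output.append(ops.pop())
                else:
                    break
            ops.append(tok)
        else:
            output.append(tok)          # operand
    while ops:
        output.append(ops.pop())
    return output


table = {"+": (1, "L"), "*": (2, "L")}
print(to_rpn("a + b * c".split(), table))

table["**"] = (3, "R")                  # extend the operator table at run time
print(to_rpn("a ** b ** c * d".split(), table))
```
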
C++ variadic templates are a purely functional metalanguage operating on
that stack. Purely functional languages suck, as demonstrated by the fact
that we are now retroactively shoehorning procedural code (if constexpr) into
C++ template meta language. Really, you need the parse stack of previously
encountered arguments to potentially contain arbitrary objects.
When a lalr parser parses an if-then-else statement, then if the parser
grammar defines “if” as the nonterminal, which may contain an “else”
clause, it is going to execute the associated actions in the reverse order.
But if you define “else” as the nonterminal, which must be preceded by an
“if” clause, then the parser will execute the associated actions in the
expected order. But suppose you have an else clause in curly brackets
inside an if-then-else. Then the parse action order is necessarily going to
be different from the procedural. Further, the very definition of an
if-then-else clause implies a parse time in which all actions are performed,
and a procedural time in which only one action is performed.
Definition code metacode must operate on the parser stack, but declaration
metacode may operate on a different stack, implying a coroutine relationship
between declaration metacode and definition metacode. The parser, to be
intelligible, has to perform actions in as close to left to right order as
possible, hence my comment that the “else” nonterminal must contain the “if”
nonterminal, not the other way around. But what if the else nonterminal
contains an “if then else” inside curly braces? The parser actions can and
will happen in different order to the run time actions. Every term of the
if-then-else structure is going to have its action performed in syntax order,
but the syntax order has to be capable of implying a different procedural
order, during which not all actions of an if-then-else structure will be
performed. And similarly with loops, where every term of the loop causes a
parse time action to be performed once in parse time order, but procedural
time actions in a different order, and performed many times.
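
A small illustration of the two times, assuming parse time actions that each return a closure to be run later. The names `action`, `if_then_else`, `parse_log`, and `run_log` are hypothetical.

```python
parse_log = []      # what happened at parse time, in order
run_log = []        # what happened at run time, in order


def action(name, code):
    """Parse time action for a clause: performed once, returns run time code."""
    parse_log.append(name)
    return code


def if_then_else(cond, then_branch, else_branch):
    """Parse time action for the whole statement, reached after its clauses."""
    parse_log.append("if-then-else")
    return lambda env: then_branch(env) if cond(env) else else_branch(env)


program = if_then_else(
    action("cond", lambda env: env["x"] > 0),
    action("then", lambda env: run_log.append("then")),
    action("else", lambda env: run_log.append("else")),
)
program({"x": 1})

print(parse_log)    # every clause had its parse time action performed
print(run_log)      # only one branch had its run time action performed
```

The parse log lists the clauses before the enclosing statement, which is the bottom up order in which an LALR parser reduces them.
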
This implies that any fragment of source code in a language that uses the
declaration/definition syntax and semantic gets to do stuff in three phases.
(Hence in C, you can declare a variable or a function without defining it,
resulting in link time errors, and in C++ declare a class without defining
its methods and data, resulting in compilation errors at a stage of
compilation that is ill defined and inexplicit.)
The parser action of the declaration statement constructs a declaration data
structure, which is metacode, possibly invoking the metacode generated by
previous declarations and definitions. When the term declared is then used,
then the metacode of the definition is executed. And the usage may well
invoke the metacode generated by the action associated at parse time with the
declaration statement, but attempting to do so causes an error in the parser
action if the declaration action has not yet been encountered in parse action
order.
So, we get parser actions which construct definition and declaration metacode
and subsequent parser actions, performed later during the parse of subsequent
source code that invoke that metacode by name to construct metacode. But, as
we see in the case of the if-then-else and do-while constructions, there must
be a third execution phase, in which the explicitly procedural code
constructed, but not executed, by the metacode, is actually executed
procedurally. Which, of course, in C++ is performed after the link and load
phase. But we want procedural metacode. And since procedural metacode must
contain conditional and loops, there has to be a third phase during parsing,
executed as a result of parse time actions, that procedurally performs ifs
and loops in metacode. So a declaration can invoke the metacode constructed
by previous declarations meaning that a parse time action executes metacode
constructed by previous parse time actions. But, to invoke procedural
metacode from a parse time action, a previous parse time action has to have
invoked metacode constructed by an even earlier parse time action to
construct procedural metacode.
Of course all three phases can be collapsed into one, as a definition can act
as both a declaration and a definition, two phases in one, but there have to
be three phases, that can be the result of parser actions widely separated in
time, triggered by code widely separated in the source, and thinking of the
common and normal case is going to result in mental confusion, collapsing
things that are distinct, because the distinction is commonly unimportant and
elided. Hence the thick syntactic soup with which I have been struggling when I
write C++ templates defining classes that define operators and then attempt
to use the operators.
In the language of C we have parse time actions, link time actions, and
execution time actions, and only at execution time is procedural code
constructed as a result of earlier actions actually performed procedurally.
We want procedural metacode that can construct procedural metacode. So we
want execution time actions performed during parsing. So let us call the
actions definitional actions, linking actions, and execution actions. And if
we are going to have procedural actions during parsing, we are going to have
linking actions during parsing. (Of course, in actually existent C++, second
stage compilation does a whole lot of linker actions, resulting in
excessively tight coupling between linker and compiler, and the inability of
other languages to link to C++, and the syntax soup that ensues when I define
a template class containing inline operators.)
# Forth the model
We assume the equivalent of Forth, where the interpreter directly interprets
and executes human readable and writeable text, by looking the symbols in the
text and performing the actions they command, which commands may command the
interpreter to generate compiled and linked code, including compiled code that
generates compiled and linked code, commands the interpreter to add names for
what it has compiled to the name table, and then commands the interpreter to
execute those routines by name.
Except that Forth is absolutely typeless, or has only one type, fixed
precision integers that are also pointers, while we want a language in which
types are first class values, as manipulable as integers, except that they
are immutable, a language where a pointer to a pointer to an integer cannot
be added to a pointer, and subtraction of one pointer from another pointer of
the same type pointing into the same object produces an integer, where you
cannot point a pointer out of the range of the object it refers to, nor
increment a reference, only the referenced value.
Lexing merely needs symbols to be listed. Parsing merely needs them to be, in
C++ terminology, declared but not defined. Pratt parsing puts operators in
Forth order, but knows and cares nothing about types, so is naturally adapted
to a Forth like language which has only one type, or where values have run time
types, or to generating an intermediate language which undergoes a second stage
compilation that produces statically typed code.

In Forth, symbols pointed to memory addresses, and it was up to the command
whether it would load an integer from an address, store an integer at that
address, execute a subroutine at that address, or go to that address, the
ultimate in unsafe typelessness.

Pratt parsing is an outstandingly elegant solution to parsing, and allows
compile time extension to the parser, though it needs a lexer driven by the
symbol table if you have multi character operators, but I am still lost in the
problem of type safety.

Metaprogramming in C++ is done in a lazily evaluated purely functional language
where a template is usually used to construct a type from type arguments. I
want to construct types procedurally, and generate code procedurally, rather
than by invoking pure functions.
In Pratt parsing, the language is parsed sequentially in parser order, but
the parser maintains a tree of recursive calls, and builds a tree of pointers
into the source, such that it enters each operator in polish order, and
finishes up each operator in reverse polish order.
On entering in polish order, this may be an operator with a variable number of
arguments (unary minus or infix minus) so it cannot know the number of operands
coming up, but on exiting in reverse polish order, it knows the number and
something about the type of the arguments, so it has to look for an
interpretation of the operator that can handle that many arguments of those
types. Which may not necessarily be a concrete type.
Operators that change the behavior of the lexer or the parser are typically
acted upon in polish order. Compilation to byte code that does not yet have
concrete types is done in reverse polish order, so operators that alter the
compilation to byte code are executed at that point. Operators that manipulate
that byte code during the linking to concrete types act at link time, when the
typeless byte code is invoked with concrete types.
Naming puts a symbol in the lexer symbol table.
Declaring puts a symbol in the parser symbol table.
Defining compiles, and possibly links, the definition, and attaches that data
to the symbol where it may be used or executed in subsequent compilation and
linking steps when that symbol is subsequently invoked. If the definition
contains procedural code, it is not going to be executed procedurally until
compiled and linked, which will likely occur when the symbol is invoked later.
An ordinary procedure definition without concrete types is the equivalent of an
ordinary C++ template. When it is used with concrete types, the linker will
interpret the operations it invokes in terms of those concrete types, and fail
if they don't support those operations.
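
A rough Python sketch of that link step, under the assumption that typeless compiled code is a reverse polish list of argument references and operator names. `IMPLEMENTATIONS`, `link`, `run`, `LinkError` and `SQUARE_PLUS_SELF` are all hypothetical names, not part of any existing tool.

```python
class LinkError(Exception):
    """Raised when no implementation exists for the concrete operand types."""


# Hypothetical table of concrete implementations:
# (operator, left type, right type) -> (result type, function).
IMPLEMENTATIONS = {
    ("+", int, int): (int, lambda a, b: a + b),
    ("+", str, str): (str, lambda a, b: a + b),
    ("*", int, int): (int, lambda a, b: a * b),
}

# Compiled but not linked: x * x + x in reverse polish form, with no types.
SQUARE_PLUS_SELF = [("arg", 0), ("arg", 0), ("op", "*"), ("arg", 0), ("op", "+")]


def link(code, arg_types):
    """Resolve every operator against the concrete argument types."""
    type_stack = []
    linked = []
    for kind, value in code:
        if kind == "arg":
            type_stack.append(arg_types[value])
            linked.append((kind, value))
        else:
            right = type_stack.pop()
            left = type_stack.pop()
            key = (value, left, right)
            if key not in IMPLEMENTATIONS:
                raise LinkError(
                    f"no {value!r} for ({left.__name__}, {right.__name__})"
                )
            result_type, function = IMPLEMENTATIONS[key]
            type_stack.append(result_type)
            linked.append(("call", function))
    return linked


def run(linked, args):
    """Execute linked code on a value stack."""
    stack = []
    for kind, value in linked:
        if kind == "arg":
            stack.append(args[value])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(value(left, right))
    return stack.pop()


print(run(link(SQUARE_PLUS_SELF, [int]), [7]))
try:
    link(SQUARE_PLUS_SELF, [str])
except LinkError as error:
    print("link error:", error)
```

The same typeless code links for `int` and fails for `str`, because the table in this sketch has no `*` for two strings. The failure shows up at link time, before any value is supplied.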
A metacode procedure gets put into the lexer symbol table when it is named,
into the parser symbol table when it is declared. When it is defined, its
definition may be used when its symbol is encountered in polish order by the
parser, and may be executed at that time to modify the behavior of parser and
linker. When a named, declared, and defined symbol is encountered by the
parser in reverse polish order, its compiled code may be used to generate
linked code, and its linked and compiled code may manipulate the compiled code
preparatory to linking.
When a symbol is declared, it gets added to the parser and lexer symbol tables.
When it is defined, it gets added to the linker symbol table. When defined with
a concrete type, it also gets added to the linker symbol table with those
concrete types, as an optimization.
If an operation could produce an output of variant type, then it is an additive
algebraic type, which then has to be handled by a switch statement.
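
As a small illustration in Python, assuming a division operation whose result is either a quotient or a failure. The names `Quotient`, `DivisionByZero`, `divide` and `describe` are made up for this sketch, and the chain of `isinstance` tests stands in for the switch statement.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Quotient:
    value: float


@dataclass(frozen=True)
class DivisionByZero:
    numerator: float


# The additive (sum) type: a result is one alternative or the other.
DivideResult = Union[Quotient, DivisionByZero]


def divide(a: float, b: float) -> DivideResult:
    if b == 0:
        return DivisionByZero(a)
    return Quotient(a / b)


def describe(result: DivideResult) -> str:
    # The "switch": one branch per alternative of the sum type.
    if isinstance(result, Quotient):
        return f"quotient {result.value}"
    if isinstance(result, DivisionByZero):
        return f"cannot divide {result.numerator} by zero"
    raise TypeError(f"unexpected alternative: {result!r}")


print(describe(divide(6, 3)))
print(describe(divide(1, 0)))
```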
There are five steps: lexing, parsing, compiling, linking, and running, and
any fragment of source code may experience some or all of these steps, with the
resulting entries in the symbol table then being available to the next code
fragment, Forth style. Thus `77+9` gets lexed into `77, +, 9`, parsed into
`+(77, 9)`, compiled into `77 9 +`, linked into `77 9 +<int, int>`, and executed
into `int(86)`, and the rest of the source code proceeds to parse, compile,
link, and run as if you had written `86`.
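
The five steps for `77+9` can be sketched in Python as follows. This is a toy, assuming a single left associative infix level and integer literals only. The function names and `LINK_TABLE` are hypothetical.

```python
import re

# Hypothetical link table: (operator, left type, right type) -> (result type, function).
LINK_TABLE = {("+", int, int): (int, lambda a, b: a + b)}


def lex(src):
    """'77+9' -> ['77', '+', '9']"""
    return re.findall(r"\d+|\+", src)


def parse(tokens):
    """['77', '+', '9'] -> ('+', 77, 9), left associative."""
    tree = int(tokens[0])
    for i in range(1, len(tokens), 2):
        tree = (tokens[i], tree, int(tokens[i + 1]))
    return tree


def compile_rpn(tree):
    """('+', 77, 9) -> [77, 9, '+'], typeless reverse polish code."""
    if isinstance(tree, int):
        return [tree]
    op, left, right = tree
    return compile_rpn(left) + compile_rpn(right) + [op]


def link(code):
    """Replace each operator name by the function for the concrete types."""
    types, linked = [], []
    for item in code:
        if isinstance(item, int):
            types.append(int)
            linked.append(item)
        else:
            right, left = types.pop(), types.pop()
            result_type, function = LINK_TABLE[(item, left, right)]
            types.append(result_type)
            linked.append(function)
    return linked


def run(linked):
    """Execute the linked code on a value stack."""
    stack = []
    for item in linked:
        if callable(item):
            right, left = stack.pop(), stack.pop()
            stack.append(item(left, right))
        else:
            stack.append(item)
    return stack.pop()


tokens = lex("77+9")
tree = parse(tokens)
code = compile_rpn(tree)
print(tokens, tree, code, run(link(code)))
```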
Further, the source code can create run time code: code that gets declared,
defined, and linked during the compile, and that is executed during the compile,
modifying the behavior of the lexer, the parser, the compiler, and the linker
over the course of a single compile and link. This enables a Forth-style
bootstrapping, where the lexer, parser, compiler, and linker lex, compile,
and link most of their own potentially modifiable source code in every compile,
much as every C++ compile includes the header files for the standard template
library, so that much of your program is rewritten by template metacode that
you included at the start of the program.
Compiled but not linked code could potentially operate on variables of any
type, though if the variables did not have a type required by an operator, you
would get a link time error, not a compile time error. This is OK because
linking of a fragment of source is not a separate step. It usually happens
before the lexer has gotten much further through the source code: it happens as
soon as the code fragment is invoked with variables of defined type, though
usually of as yet undefined value.
A console program is an operator whose values are of the type iostream. It gets
linked as soon as the variable type is defined, and executed when you assign
defined values to iostream.
Because C++ metacode is purely functional, it gets lazily evaluated, so the
syntax and compiler can cheerfully leave it undefined when, or even if, it
gets executed. Purely functional languages only terminate by laziness. But
if we want to do the same things with procedural metacode, we have no option
but to explicitly define what gets executed when. In which case pure LALR
syntax is going to impact the semantics, since LALR syntax defines the order of
parse time actions, and order of execution impacts the semantics. I am not
altogether certain as to whether the result is going to be intelligible and
predictable. Pratt syntax, however, is going to result in predictable
execution order.
The declaration, obviously, defines code that can be executed by a subsequent
parse action after the declaration parse action has been performed, and the
definition defines code that can be compiled after the definition parse action
has been performed.
The compiled code can be linked when invoked with variables of defined
type and undefined value, and executed when invoked with variables of defined
type and a value.
Consider what happens when the definition defines an overload for an infix
operator. The definition of the infix operator can only be procedurally
executed when the parser calls the infix action with the arguments on the parse
stack, which happens long after the infix operator is overloaded.
The definition has to be parsed when the parser encounters it. But it is
procedural code, which cannot be procedurally executed until later, much
later. So the definition has to compile, not execute, procedural code, then
cause the data structure created by the declaration to point to that compiled
code. And then later when the parser encounters an actual use of the infix
operator, the compiled procedural code of the infix definition is actually
executed to generate linked procedural code with explicit and defined types,
which is part of the definition of the function or method in whose source code
the infix operator was used.
One profoundly irritating feature of C++ code, probably caused by LL parsing,
is that if the left hand side of an infix expression has an appropriate
overloaded operator, it works, but if only the right hand side has one, it
fails. Here we see parsing having an incomprehensible and arbitrary influence
on semantics.
C++ is a strongly typed language. With types, any procedure has typed
inputs and outputs, and should only do safe and sensible things for that type.
C++ metacode manipulates types as first class objects, which implies that if we
were to do the same thing procedurally, types need a representation, and
procedural commands to make new types from old, and to garbage collect, or
memory manage, operations on these data objects, as if they were strings,
floating point numbers, or integers of known precision. So you could construct
or destruct an object of type type, generate new types by doing type operations
on old types, for example add two types or multiply two types to produce an
algebraic type, or create a type that is a const type or pointer type to an
existing type, which type actually lives in memory somewhere, in a variable
like any other variable. And, after constructing an algebraic type by
procedurally multiplying two types, and perhaps storing it in a variable of type
type, or invoking a function (aka C++ template type) that returns a type
dynamically, you could create an object of that type or an array of objects of
that type. For every actual object, the language interpreter knows the type,
meaning the object of type X that you just constructed is somehow linked to the
continuing existence of the object of type type that has the value type X that
you used to construct it, and cannot be destroyed until all the objects created
using it are destroyed. Since the interpreter knows the type of every object,
including objects of type type, and since every command to do something with an
object is type aware, this can prevent the interpreter from being commanded to
do something stupid. Obviously type data has to be stored somewhere, and has
to be immutable, at least until garbage collected because no longer referenced.
Can circular type references exist? Well, not if they are immutable, because
if a type references a type, that type must already exist, and so cannot
reference a type that does not yet exist. It could reference a function that
generates types, but that reference is not circular. It could have values that
are constexpr, and values that reference static variables. If no circular
references are possible, garbage collection by reference counting works.
Types are algebraic types, sums and products of existing types, plus modifiers
such as `const`, `*`, and `&`.
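
A minimal Python sketch of types as ordinary immutable values, assuming adding two types gives a sum type and multiplying gives a product type. The class names `Type`, `Prim`, `Sum`, `Product`, `Const` and `Pointer` are hypothetical, and Python's own memory management stands in for the reference counting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Type:
    """Base class: a type is an immutable value like any other."""

    def __add__(self, other):
        return Sum((self, other))  # additive algebraic type

    def __mul__(self, other):
        return Product((self, other))  # multiplicative algebraic type


@dataclass(frozen=True)
class Prim(Type):
    name: str


@dataclass(frozen=True)
class Sum(Type):
    alternatives: tuple


@dataclass(frozen=True)
class Product(Type):
    fields: tuple


@dataclass(frozen=True)
class Const(Type):
    target: Type


@dataclass(frozen=True)
class Pointer(Type):
    target: Type


INT = Prim("int")
STRING = Prim("string")

# A new type can only be built from types that already exist, and it cannot
# be altered afterwards, so no type can end up referring back to itself.
pair = INT * STRING
either = INT + STRING
pointer_to_const_either = Pointer(Const(either))

print(pair)
print(pointer_to_const_either)
```

Because each constructor takes already built types as arguments and the result is frozen, the reference graph in this sketch is acyclic, which is the property that makes reference counting sufficient.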
Type information is potentially massive, and if we are executing a routine that
refers to a type by the function that generates it, we don't want that
equivalent of a C++ template invoked every time, generating a new immutable
object every time that is an exact copy of what it produced the last time it
went through the loop. Rather, the interpreter needs to short circuit the
construction by looking up a hash of that type constructing template call, to
check whether it has already been called with those function inputs and has
already produced an object of that type. And when a function that generates a
type is executed, it needs to look for duplications of existing types. A great
many template invocations simply choose the right type out of a small set of
possible types.
It is frequently the case that the same template may be invoked with an
enormous variety of variables, and come up with very few different concrete
results.
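
Both short circuits can be sketched in Python, assuming types are represented as hashable tuples. `intern_type`, `smallest_unsigned` and `CALLS` are hypothetical names. `functools.lru_cache` memoizes the type generating call on its inputs, and the interning table collapses equal results from different inputs into one object.

```python
from functools import lru_cache

# Table of canonical type objects, keyed by their structural description.
_INTERNED = {}

# Counts how often the body of the type generating function really runs.
CALLS = {"count": 0}


def intern_type(description):
    """Return the one canonical object equal to this description."""
    return _INTERNED.setdefault(description, description)


@lru_cache(maxsize=None)  # looks up a hash of the call's arguments first
def smallest_unsigned(max_value):
    """Stand-in for a template that picks one of a few possible types."""
    CALLS["count"] += 1
    for bits in (8, 16, 32, 64):
        if max_value < 2 ** bits:
            return intern_type(("uint", bits))
    raise ValueError("value does not fit in 64 bits")


a = smallest_unsigned(10)   # body runs, creates and interns ("uint", 8)
b = smallest_unsigned(200)  # body runs, but yields the same interned object
c = smallest_unsigned(10)   # cache hit, body does not run at all
print(a is b, a is c, CALLS["count"])
```

Many different inputs collapse onto the four possible results here, which is the situation described above: an enormous variety of invocations, very few distinct concrete types.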
When the interpreter compiles a loop or a recursive call, the type information
is likely to be an invariant, which should get optimized out of the loops. But
when it is directly executing source code commands which command it to compile
source code, such optimization is impossible.
But, as in Forth, you can tell the interpreter to store the commands in a
routine somewhere, and when they are stored, the types have already been
resolved. Typically the interpreter is going to finish interpreting the source
code, producing stored programs each containing a limited amount of type
information.