---
title: Parsers
---
This rambles a lot. Thoughts in progress:

Summarizing my thoughts here at the top.

Linux scripts started off using lexing for parsing, resulting in complex and incomprehensible semantics, producing unexpected results. (Try naming a file `-r`, or a directory with spaces in the name.)

They are rapidly converging in actual usage to the operator precedence syntax and semantics\
`command1 subcommand arg1 … argn infixoperator command2 subcommand …`

Which is parsed as\
`((staticclass1.staticmethod(arg1 … argn)) infixoperator (staticclass2.staticmethod(…)))`

With line feed acting as a `}{` operator, start of file acting as a `{` operator, end of file acting as a `}` operator, suggesting that in a sane language, indent increase should act as a `{` operator, indent decrease should act as a `}` operator.

Command line syntax sucks, because programs interpret their command lines using a simple lexer, which lexes on spaces. Universal resource identifier syntax sucks, because it was originally constructed so that it could be a command line argument, hence no spaces, and because it was designed to be parsed by a lexer.

But EBNF parsers also suck, because they do not parse the same way humans do. Most actual programs can be parsed by a simple parser, even though the language in principle requires a more powerful parser, because humans do not use the nightmarish full power of a grammar that an EBNF definition winds up defining.

Note that the [LLVM language creation tools](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/) tutorial does not use an EBNF parser. These tools also make creating a new language with JIT semantics very easy.

We are programming in languages that are not parsed the way the programmer is parsing them.

Programming languages ignore whitespace, because programmers tend to express their meaning with whitespace for the human reader, and whitespace grammar is not altogether parallel to the EBNF grammar. There is a mismatch in grammars.
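As a rough illustration of the whitespace grammar idea, here is a minimal sketch that turns line feeds and indentation into explicit brace operators. It is one possible reading, not a settled design: the function name `brace_tokens`, the choice to count spaces only, and the exact order in which `}`, `}{` and `{` are emitted around an indent change are all assumptions made for the example.

```python
def brace_tokens(source):
    """Turn line feeds and indentation into explicit brace operators."""
    out = ["{"]        # start of file acts as '{'
    indents = [0]      # widths of the currently open indent levels
    first = True
    for line in source.splitlines():
        text = line.strip()
        if not text:
            continue   # blank lines carry no structure in this sketch
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading:
            # Assumption: this sketch only understands space indentation.
            raise IndentationError("tab indentation is not handled here")
        width = len(leading)
        closes = 0
        while width < indents[-1]:      # indent decrease: one '}' per level
            indents.pop()
            closes += 1
        if closes and width != indents[-1]:
            raise IndentationError("dedent does not match any open level")
        out.extend(["}"] * closes)
        if not first:
            out.append("}{")            # line feed between statements
        if width > indents[-1]:         # indent increase acts as '{'
            indents.append(width)
            out.append("{")
        out.append(text)
        first = False
    out.extend(["}"] * (len(indents) - 1))  # blocks still open at end of file
    out.append("}")                         # end of file acts as '}'
    return out


print(" ".join(brace_tokens("a\n  b\n  c\nd")))
```

The output is balanced by construction: every `{` pushed for an indent is closed either by a later dedent or at end of file.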
Seems to me that human parsing is a combination of low level lexing, Pratt parsing on operator right and left binding power, and a higher level of grouping that works like lexing. Words are lexed by spaces and punctuation, and grouped by operator binding power, with operator recognition taking into account the types on the stack. Groups of parsed words are bounded by statement separators, which can be lexed out. Groups of statements are grouped and bounded by indenting.

Some levels in the hierarchy are lexed out, others are operator binding power parsed out. There are some “operators” that mean group separator for a given hierarchical level, which is a tell that reveals lex style parsing, for example semicolon in C++, full stop and paragraph break in text.

The never ending problems from mixing tab and space indenting can be detected by making an increase or decrease of indent by a space a bracket operator, and an increase or decrease by a tab a non matching bracket operator.

Pratt parsing parses operators by their left and right binding power – which is a superset of operator precedence parsing. EBNF does not directly express this concept, and programming this concept into EBNF is complicated, indirect, and imperfect – because it is too powerful a superset, that can express anything, including things that do not make sense to the human writing the stuff to be parsed.

Pratt parsing finalizes an expression by visiting the operators in reverse polish order, thus implicitly executing a stack of run time typed operands, which eventually get compiled and eventually executed as just-in-time typed or statically typed operands and operators.

For [identity](identity.html), we need Cryptographic Resource Identifiers, which cannot conform to the “Universal” Resource Identifier syntax and semantics.

Lexers are not powerful enough, and the fact that they are still used for uniform resource identifiers, relative resource identifiers, and command line arguments is a disgrace.
Advanced parsers, however, are too powerful, resulting in syntax that is counter intuitive. That ninety percent of the time a program file can be parsed by a simple parser incapable of recognizing the full set of syntactically correct expressions that the language allows indicates that the programmer’s mental model of the language has a simpler structure.

# Pratt Parsing

I really love the Pratt Parser, because it is short and simple, because if you add to the symbol table you can add new syntax during compilation, because what it recognizes corresponds to human intuition and human reading.

But it is just not actually a parser. Given a source with invalid expressions such as unary multiplication and unbalanced parentheses, it will cheerfully generate a parse. It also lacks the concept out of which all the standard parsers are constructed, that expressions are of different kinds, different nonterminals.

To fix Pratt parsing, it would have to recognize operators as bracketing, as prefix unary, postfix unary, or infix, and that some operators do not have an infix kind, and it would have to recognize that operands have types, and that an operator produces a type from its inputs. It would have to attribute a nonterminal to a subtree. It would have to recognize ternary operators as operators.

And that is a major rewrite and reinvention.

LALR parsers appear to be closer to the programmer mental model, but looking at Pratt Parsing, there is a striking resemblance between C and what falls out of Pratt’s model:

The kind of “lexing” the Pratt parser does seems to have a natural correspondence to the kind of parsing the programmer does as his eye rolls over the code. Pratt’s deviations from what would be correct behavior in simple arithmetic expressions composed of numerals and single character symbols seem to strikingly resemble expressions that engineers find comfortable.

When `expr` is called, it is provided the right binding power of the token that called it.
It consumes tokens until it meets a token whose left binding power is equal to or lower than the right binding power of the operator that called it. It collects all tokens that bind together into a tree before returning to the operator that called it.

The Pratt `peek` peeks ahead to see if what is coming up is an operator, and therefore needs to check what is coming up against a symbol table, which existing implementations fail to explicitly implement.

The Pratt algorithm, as implemented by Pratt and followers, assumes that all operators can be unary prefix or infix (hence the nud/led distinction). It should get the nature of the upcoming operator from the symbol table (infix, unary, or both, and if unary, prefix or postfix).

Although implementers have not realized it, they are treating all “non operator” tokens as unary postfix operators. Instead of, or as well as, they need to treat all tokens (where items recognized from a symbol table are pre-aggregated) as operators, with ordinary characters as postfix unary, spaces as postfix unary with weaker binding power, and a token consisting of a utf8 iterator plus a byte count as equivalent to a left tree with right single character leaves and a terminal left leaf.

Pratt parsing is like lexing, breaking a stream of characters into groups, but the grouping is hierarchical. The algorithm annotates a linear text with hierarchy.

Operators are characterized by a global order of left precedence and a global order of right precedence (the difference giving us left associativity and right associativity).

If we extend the Pratt algorithm with the concept of unary postfix operators, we see it is treating each ordinary unrecognized character as a unary postfix operator, and each whitespace character as a unary postfix operator of weaker binding power.
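The `expr` loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Pratt's own code: the tables `INFIX` and `PREFIX`, the particular binding power numbers, and the helper `tokenize` are all hypothetical, and the symbol table lookup that `peek` needs is made explicit as a dictionary test.

```python
import re

# Hypothetical symbol table: operator -> (left binding power, right binding power).
# Right power above left power gives left associativity; below gives right.
INFIX = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21), "^": (30, 29)}
PREFIX = {"-": 25}  # right binding power of unary minus


def tokenize(text):
    # Numbers, identifiers, or any single non-space character.
    return re.findall(r"\d+|[A-Za-z_]\w*|\S", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def expr(rbp):
        tok = advance()
        if tok in PREFIX:                      # no left context: prefix operator
            left = (tok, expr(PREFIX[tok]))
        elif tok == "(":
            left = expr(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok in INFIX or tok == ")":
            raise SyntaxError(f"unexpected {tok!r}")
        else:
            left = tok                         # atomic token becomes a leaf
        # Keep consuming while the next operator binds tighter on its left
        # than the operator that called us binds on its right.
        while peek() in INFIX and INFIX[peek()][0] > rbp:
            op = advance()
            left = (op, left, expr(INFIX[op][1]))
        return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("1+2*3"))   # ('+', '1', ('*', '2', '3'))
print(parse("8-3-2"))   # ('-', ('-', '8', '3'), '2')
print(parse("2^3^2"))   # ('^', '2', ('^', '3', '2'))
```

Unlike the permissive behaviour complained of above, this sketch rejects an infix-only operator in prefix position, which is only possible because the symbol table says which operators may be prefix.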
[Apodaca]:https://dev.to/jrop/pratt-parsing

Pratt and [Apodaca] are primarily interested in the case of unary minus, so they handle the case of a tree with a potentially null token by distinguishing between nud (no left context) and led (the right hand side of an operator with left context).

Pratt assumes that in correct source text, `nud` is only going to encounter an atomic token, in which case it consumes the token, constructs a leaf vertex which points into the source, and returns, or a unary prefix operator, or an opening bracket. If it encounters an operator, it calls `expr` with the right binding power of that operator, and when `expr` has finished parsing, returns a corresponding vertex.

Not at all clear to me how it handles brackets. Pratt gets by without the concept of matching tokens, or hides it implicitly. Seems to me that correct parsing is that a correct vertex has to contain all matching tokens, and the expressions contained therein, so a vertex corresponding to a bracketed expression has to point to the open and closing bracket terminals, and the contained expression. I would guess that his algorithm winds up with a tree that just happens to contain matching tokens in related positions in the tree.

Suppose the typical case, a tree of binary operators inside a tree of binary operators: In that case, when `expr` is called, the source pointer is pointing to the start of an expression. `expr` calls `nud` to parse the expression, and if that is all she wrote (because `peek` reveals an operator with lower left binding power than the right binding power that `expr` was called with) returns the edge to the vertex constructed by `nud`. Otherwise, it parses out the operator, and calls `led` with the right binding power of the operator it has encountered, to get the right hand argument of the binary operator.
It then constructs a vertex containing the operator, whose left edge points to the node constructed by `nud` and whose right hand edge points to the node constructed by `led`. If that is all she wrote, returns, otherwise iterates its while loop, constructing the ever higher root of a right leaning tree of all previous roots, whose ultimate left most leaf is the vertex constructed by `nud`, and whose right hand vertexes were constructed by `led`.

The nud/led distinction is not sufficiently general. They did not realize that they were treating ordinary characters as postfix unary operators.

Trouble is, I want to use the parser as the lexer, which ensures that as the human eye slides over the text, the text reads the way it is in fact structured.

But if we do Pratt parsing on single characters to group them into larger aggregates, `p*--q*s` is going to be misaggregated by the parser to `((p*)−) − (q*s)`, which is meaningless.

And, if we employ Pratt’s trick of nud/led distinction, it will evaluate as `p*(-(-q*s))`, which gives us a meaningful but wrong result `p*q*s`.

If we allow multicharacter operators then they have to be lexed out at the earliest stage of the process – the Pratt algorithm has to be augmented by aggregate tokens, found by attempting to match the following text against a symbol table. Existing Pratt algorithms tend to have an implicit symbol table of one character symbols, everything in the symbol table being assumed to be potentially either infix or unary prefix, and everything else outside the implicit symbol table unary postfix.

Suppose a token consists of a utf8 iterator and a byte count.
So, all the entities we work with are trees, but recursion terminates because some nodes of the tree have been collapsed to variables that consist of a utf8 iterator and a byte count, *and some parts of the tree have been partially collapsed to vertexes that consist of a utf8 iterator, a byte count, and an array of trees*.

C++ forbids `“foo bar()”` to match `“foobar()”`, but allows `“foobar ()”` to match, which is arguably an error.

`“foobar(”` has to lex out as a prefix operator. But it is not really a prefix unary operator. It is a set of matching operators, like brackets and the ternary operator `bool?value:value`. The commas and the closing bracket are also part of it.

Which brings us to recognizing ternary operators. The naive single character Pratt algorithm handles ternary operators correctly (assuming that the input text is valid), which is surprising. So it should simply also match the commas and right bracket as a particular case of ternary and higher operators in the initial symbol search, albeit doing that so that it is simple and correct and naturally falls out of the algorithm is not necessarily obvious.

Operator precedence gets you a long way, but it messes up because it does not recognize the distinction between right binding power and left binding power. Pratt gets you a long way further.

But Pratt messes up because it does not explicitly recognize the difference between unary prefix and unary postfix, nor does it explicitly recognize operator matching – that a group of operators are one big multi argument operator. It does not recognize that brackets are expressions of the form symbol-expression-match, let alone that ternary operators are expressions of the form expression-symbol-expression-match-expression.
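One way to make operator matching explicit is to record, in the symbol table, which token each opening symbol must be matched by, and to store both symbols in the resulting vertex. The following is only a sketch of that idea: the tables `BRACKETS`, `TERNARY` and `INFIX`, the binding power numbers, and the tuple shape of the tree nodes are all assumptions made for illustration.

```python
import re

# Hypothetical symbol table. Each opening symbol names the token that must match it.
BRACKETS = {"(": ")"}              # symbol-expression-match
TERNARY = {"?": (":", 5, 4)}       # match token, left power, right power
INFIX = {"+": (10, 11), "*": (20, 21)}
CLOSERS = set(BRACKETS.values()) | {m for m, _, _ in TERNARY.values()}


def parse(text):
    tokens = re.findall(r"\w+|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, found {tok!r}")
        pos += 1
        return tok

    def expr(rbp):
        tok = take()
        if tok in BRACKETS:
            inner = expr(0)
            close = take(BRACKETS[tok])        # the match is part of the vertex
            left = (tok + close, inner)
        elif tok in INFIX or tok in TERNARY or tok in CLOSERS:
            raise SyntaxError(f"unexpected {tok!r}")
        else:
            left = tok
        while True:
            nxt = peek()
            if nxt in INFIX and INFIX[nxt][0] > rbp:
                take()
                left = (nxt, left, expr(INFIX[nxt][1]))
            elif nxt in TERNARY and TERNARY[nxt][1] > rbp:
                match, _, right_power = TERNARY[nxt]
                take()
                middle = expr(0)               # everything up to the match
                take(match)
                left = (nxt + match, left, middle, expr(right_power))
            else:
                return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("a ? b : c ? d : e"))
print(parse("(a+b)*c"))
```

Here the recursion itself serves as the stack of symbols awaiting their matches: each pending `take(match)` is one entry on it, and an unbalanced bracket is reported rather than silently parsed.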
Needs to be able to recognize that expressions of the form expression-symbol-expression-match-expression-match\...expression are expressions, and convert the tree into prefix form (polish notation with arguments bracketed) and into postfix form (reverse polish) with a count of the stack size. Needs to have a stack of symbols that need left matches.

# LALR

Bison and yacc are [joined at the hip](https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/) to seven bit ascii and BNF (through flex and lex), whereas [ANTLR](https://tomassetti.me/ebnf/) recognizes unicode and the far more concise and intelligible EBNF. ANTLR generates ALL parsers, which allow syntax that allows statements that are ugly and humanly unintelligible, while Bison when restricted to LALR parsers allows only grammars that forbid certain excesses, but generates unintelligible error messages when you specify a grammar that allows such excesses.

You could hand write your own lexer, and use it with BisonC++. Which seemingly everyone does.

ANTLR allows expressions that take a long time to parse, but only polynomially long, fourth power, and prays that humans seldom use such expressions, which in practice they seldom do. But sometimes they do, resulting in hideously bad parser performance, where the parser runs out of memory or time. Because the parser allows non LALR syntax, it may find many potential meanings halfway through a straightforward lengthy expression that is entirely clear to humans because the non LALR syntax would never occur to the human. In ninety percent of files, there is not a single expression that cannot be parsed by very short lookahead, because even if the language allows it, people just do not use it, finding it unintelligible. Thus, a language that allows non LALR syntax locks you in against subsequent syntax extension, because the extension you would like to make already has some strange and non obvious meaning in the existing syntax.
This makes it advisable to use a parser that can enforce a syntax definition that does not permit non LALR expressions.

On the other hand, LALR parsers walk the tree in Reverse Polish order, from the bottom up. This makes it hard to debug your grammar, and hard to report syntax errors intelligibly. And sometimes you just cannot express the grammar you want as LALR, and you wind up writing a superset of the grammar you want, and then ad-hoc forbidding otherwise legitimate constructions, in which case you have abandoned the simplicity and directness of LALR, and the fact that it naturally tends to restrict you to humanly intelligible syntax.

Top down makes debugging your syntax easier, and issuing useful error messages a great deal easier. It is hard to provide any LALR handling of syntax errors other than just stop at the first error, but top down makes it a lot harder to implement semantics, because Reverse Polish order directly expresses the actions you want to take in the order that you need to take them.

LALR allows left recursion, so that you can naturally make minus and divide associate in the correct and expected order, while with LL, you wind up doing something weird and complicated – you build the tree, then you have another pass to get it into the correct order.

Most top down parsers, such as ANTLR, have a workaround to allow left recursion. They internally turn it into right recursion by the standard transformation, and then optimize out the ensuing tail recursion. But that is a hack, which increases the distance between your expression tree and your abstract syntax tree, and still increases the distance between your grammar and your semantics during parser execution. You are walking the hack, instead of walking your own grammar’s syntax tree in Reverse Polish order. Implementing semantics becomes more complex. You still wind up with added complexity when doing left recursion, just moved around a bit.
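The associativity problem is easy to see on `8 - 3 - 2`. The sketch below contrasts a naive right recursive rule with the loop that a top down parser ends up with once left recursion has been transformed away. Both function names are hypothetical, and the input is assumed to be a pre-split list of integer and `-` tokens.

```python
def eval_right_recursive(tokens):
    """E -> T '-' E | T : what naive LL recursion gives, grouping to the right."""
    head, rest = int(tokens[0]), tokens[1:]
    if not rest:
        return head
    if rest[0] != "-" or len(rest) < 2:
        raise SyntaxError("expected '-' followed by an operand")
    return head - eval_right_recursive(rest[1:])


def eval_left_fold(tokens):
    """E -> T ('-' T)* : the loop that stands in for E -> E '-' T | T."""
    acc = int(tokens[0])
    i = 1
    while i < len(tokens):
        if tokens[i] != "-" or i + 1 >= len(tokens):
            raise SyntaxError("expected '-' followed by an operand")
        acc = acc - int(tokens[i + 1])   # fold each new operand onto the left
        i += 2
    return acc


tokens = "8 - 3 - 2".split()
print(eval_right_recursive(tokens))  # 7, that is 8 - (3 - 2)
print(eval_left_fold(tokens))        # 3, that is (8 - 3) - 2
```

The loop gives the expected answer, but its shape no longer mirrors the left recursive rule the grammar author had in mind, which is the distance between grammar and semantics complained of above.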
LALR allows you to more directly express the grammar you want to express. With top down parsers, you can accomplish the same thing, but you have to take a more roundabout route to express the same grammar, and again you are likely to find you have allowed expressions that you do not want and which do not naturally have reasonable and expected semantics.

ANTLR performs top down generation of the expression tree. Your code called by ANTLR converts the expression tree into the Abstract Syntax tree, and the abstract syntax tree into the High Level Intermediate Representation.

The ANTLR algorithm can be slow as a week of sundays, or wind up eating polynomially large amounts of memory till it crashes. To protect against this problem, [the suggestion is to use the fast SLL algorithm first, and should it fail, then use the full on potentially slow and memory hungry LL\* algorithm.](https://github.com/antlr/antlr4/issues/374) Ninety percent of language files can be parsed by the fast algorithm, because people just do not use too clever by half constructions. But it appears to me that anything that cannot be parsed by SLL, but can be parsed by LL\*, is not good code – that what confuses an SLL parser also confuses a human, that the alternate readings permitted by the larger syntax are never code that people want to use.

ANTLR does not know or care if your grammar makes any sense until it tries to analyze particular texts. But you would like to know up front if your grammar is valid.

LALR parsers are bottom up, so have terrible error messages when they analyze a particular example of the text, but they have the enormous advantage that they will analyze your grammar up front and guarantee that any grammatically correct statement is LALR. If a LALR parser can analyze it, chances are that a human can also. ANTLR permits grammars that permit unintelligible statements.

The [LRX parser](http://lrxpg.com/downloads.html) looks the most suitable for your purpose.
It has a restrictive license and only runs in the visual studio environment, but you only need to distribute the source code it builds the compiler from as open source, not the compiler compiler. It halts at the first error message, since it is incapable of building intelligible multiple error messages. The compiler it generates builds a syntax tree and a symbol table.

The generically named [lalr](https://github.com/cwbaker/lalr) looks elegantly simple, and not joined at the hip to all sorts of strange environments. Unlike Bison C++, it should be able to handle unicode strings with its regular expressions. It only handles BNF, not EBNF, but that is a relatively minor detail. Its regular expressions are under documented, but regular expression syntax is pretty standard. It does not build a symbol table.

And for full generality, you really need a symbol table where the symbols get syntax, which is a major extension to any existing parser. That starts to look like hard work. The lalr algorithm does not add syntax on the fly. The lrxpg parser does build a symbol tree on the fly, but not syntax on the fly – but its website just went down.

No one has attempted to write a language that can add syntax on the fly. They build a syntax capable of expressing an arbitrary graph with symbolic links, and then give the graph extensible semantics. The declaration/definition semantic is not full parsing on the definition, but rather operates on the tree.

In practice, LALR parsers need to be extended beyond LALR with operator precedence. Expressing operator precedence within strict LALR is apt to be messy. And, because LALR walks the tree in reverse polish order, you want the action that gets executed at parse time to return a value that the generated parser puts on a stack managed by the parser, which stack is available when the action of the operator that consumes it is called.
In which case the definition/declaration semantic declares a symbol that has a directed graph associated with it, which graph is then walked to interpret what is on the parse stack. The data of the declaration defines metacode that is executed when the symbol is invoked, the directed graph associated with the symbol definition being metacode executed by the action that the parser performs when the symbol is used.

The definition/declaration semantic allows arbitrary graphs containing cycles (full recursion) to be defined, by the declaration adding indirections to a previously constructed directed graph.

The operator-precedence parser can parse all LR(1) grammars where two consecutive nonterminals and epsilon never appear in the right-hand side of any rule.

First, they are simple enough to write by hand, which is not generally the case with more sophisticated right shift-reduce parsers. Second, they can be written to consult an operator table at run time.

Considering that “universal” resource locators and command lines are parsed with mere lexers, perhaps a hand written operator-precedence parser is good enough. After all, Forth and Lisp have less.

C++ variadic templates are a purely functional metalanguage operating on that stack. Purely functional languages suck, as demonstrated by the fact that we are now retroactively shoehorning procedural code (if constexpr) into the C++ template meta language.

Really, you need the parse stack of previously encountered arguments to potentially contain arbitrary objects.

When a LALR parser parses an if-then-else statement, then if the parser grammar defines “if” as the nonterminal, which may contain an “else” clause, it is going to execute the associated actions in the reverse order. But if you define “else” as the nonterminal, which must be preceded by an “if” clause, then the parser will execute the associated actions in the expected order. But suppose you have an else clause in curly brackets inside an if-then-else.
Then the parse action order is necessarily going to be different from the procedural. Further, the very definition of an if-then-else clause implies a parse time in which all actions are performed, and a procedural time in which only one action is performed.

Definition metacode must operate on the parser stack, but declaration metacode may operate on a different stack, implying a coroutine relationship between declaration metacode and definition metacode. The parser, to be intelligible, has to perform actions in as close to left to right order as possible, hence my comment that the “else” nonterminal must contain the “if” nonterminal, not the other way around – but what if the else nonterminal contains an “if then else” inside curly braces? The parser actions can and will happen in different order to the run time actions. Every term of the if-then-else structure is going to have its action performed in syntax order, but the syntax order has to be capable of implying a different procedural order, during which not all actions of an if-then-else structure will be performed. And similarly with loops, where every term of the loop causes a parse time action to be performed once in parse time order, but procedural time actions in a different order, and performed many times.

This implies that any fragment of source code in a language that uses the declaration/definition syntax and semantic gets to do stuff in three phases. (Hence in C, you can declare a variable or a function without defining it, resulting in link time errors, and in C++ declare a class without defining its methods and data, resulting in compilation errors at a stage of compilation that is ill defined and inexplicit.)

The parser action of the declaration statement constructs a declaration data structure, which is metacode, possibly invoking the metacode generated by previous declarations and definitions. When the term declared is then used, then the metacode of the definition is executed.
And the usage may well invoke the metacode generated by the action associated at parse time with the declaration statement, but attempting to do so causes an error in the parser action if the declaration action has not yet been encountered in parse action order.

So, we get parser actions which construct definition and declaration metacode, and subsequent parser actions, performed later during the parse of subsequent source code, that invoke that metacode by name to construct metacode.

But, as we see in the case of the if-then-else and do-while constructions, there must be a third execution phase, in which the explicitly procedural code constructed, but not executed, by the metacode, is actually executed procedurally. Which, of course, in C++ is performed after the link and load phase.

But we want procedural metacode. And since procedural metacode must contain conditionals and loops, there has to be a third phase during parsing, executed as a result of parse time actions, that procedurally performs ifs and loops in metacode. So a declaration can invoke the metacode constructed by previous declarations – meaning that a parse time action executes metacode constructed by previous parse time actions. But, to invoke procedural metacode from a parse time action, a previous parse time action has to have invoked metacode constructed by an even earlier parse time action to construct procedural metacode.

Of course all three phases can be collapsed into one, as a definition can act as both a declaration and a definition, two phases in one, but there have to be three phases, which can be the result of parser actions widely separated in time, triggered by code widely separated in the source, and thinking of the common and normal case is going to result in mental confusion, collapsing things that are distinct, because the distinction is commonly unimportant and elided.
Hence the thick syntactic soup with which I have been struggling when I write C++ templates defining classes that define operators and then attempt to use the operators.

In the language of C we have parse time actions, link time actions, and execution time actions, and only at execution time is procedural code constructed as a result of earlier actions actually performed procedurally.

We want procedural metacode that can construct procedural metacode. So we want execution time actions performed during parsing. So let us call the actions definitional actions, linking actions, and execution actions. And if we are going to have procedural actions during parsing, we are going to have linking actions during parsing. (Of course, in actually existent C++, second stage compilation does a whole lot of linker actions, resulting in excessively tight coupling between linker and compiler, and the inability of other languages to link to C++, and the syntax soup that ensues when I define a template class containing inline operators.)

# Forth the model

We assume the equivalent of Forth, where the interpreter directly interprets and executes human readable and writeable text, by looking up the symbols in the text and performing the actions they command, which commands may command the interpreter to generate compiled and linked code, including compiled code that generates compiled and linked code, command the interpreter to add names for what it has compiled to the name table, and then command the interpreter to execute those routines by name.
Except that Forth is absolutely typeless, or has only one type, fixed precision integers that are also pointers, while we want a language in which types are first class values, as manipulable as integers, except that they are immutable, a language where a pointer to a pointer to an integer cannot be added to a pointer, and subtraction of one pointer from another pointer of the same type pointing into the same object produces an integer, where you cannot point a pointer out of the range of the object it refers to, nor increment a reference, only the referenced value.

Lexing merely needs symbols to be listed. Parsing merely needs them to be, in C++ terminology, declared but not defined. Pratt parsing puts operators in forth order, but knows and cares nothing about types, so is naturally adapted to a Forth like language which has only one type, or where values have run time types, or to generating an intermediate language which undergoes a second stage compilation that produces statically typed code.

In Forth, symbols pointed to memory addresses, and it was up to the command whether it would load an integer from an address, store an integer at that address, execute a subroutine at that address, or go to that address, the ultimate in unsafe typelessness.

Pratt parsing is an outstandingly elegant solution to parsing, and allows compile time extension to the parser, though it needs a lexer driven by the symbol table if you have multi character operators, but I am still lost in the problem of type safety.

Metaprogramming in C++ is done in a lazily evaluated purely functional language where a template is usually used to construct a type from type arguments. I want to construct types procedurally, and generate code procedurally, rather than by invoking pure functions.
In Pratt parsing, the language is parsed sequentially in parser order, but the parser maintains a tree of recursive calls, and builds a tree of pointers into the source, such that it enters each operator in polish order, and finishes up each operator in reverse polish order.

On entering in polish order, this may be an operator with a variable number of arguments (unary minus or infix minus), so it cannot know the number of operands coming up, but on exiting in reverse polish order, it knows the number and something about the type of the arguments, so it has to look for an interpretation of the operator that can handle that many arguments of those types. Which may not necessarily be a concrete type.

Operators that change the behavior of the lexer or the parser are typically acted upon in polish order. Compilation to byte code that does not yet have concrete types is done in reverse polish order, so operators that alter the compilation to byte code are executed at that point. Operators that manipulate that byte code during the linking to concrete types act at link time, when the typeless byte code is invoked with concrete types.

Naming puts a symbol in the lexer symbol table.

Declaring puts a symbol in the parser symbol table.

Defining compiles, and possibly links, the definition, and attaches that data to the symbol where it may be used or executed in subsequent compilation and linking steps when that symbol is subsequently invoked. If the definition contains procedural code, it is not going to be executed procedurally until compiled and linked, which will likely occur when the symbol is invoked later.

An ordinary procedure definition without concrete types is the equivalent of an ordinary C++ template. When it is used with concrete types, the linker will interpret the operations it invokes in terms of those concrete types, and fail if they don’t support those operations.
A metacode procedure gets put into the lexer symbol table when it is named, and into the parser symbol table when it is declared. When it is defined, its definition may be used when its symbol is encountered in Polish order by the parser, and may be executed at that time to modify the behavior of parser and linker. When a named, declared, and defined symbol is encountered by the parser in reverse Polish order, its compiled code may be used to generate linked code, and its linked and compiled code may manipulate the compiled code preparatory to linking.

When a symbol is declared, it gets added to the parser and lexer symbol tables. When it is defined, it gets added to the linker symbol table. When defined with a concrete type, it also gets added to the linker symbol table with those concrete types, as an optimization.

If an operation could produce an output of variant type, then it is an additive algebraic type, which then has to be handled by a switch statement.

There are five steps: lexing, parsing, compiling, linking, and running, and any fragment of source code may experience some or all of these steps, with the resulting entries in the symbol table then being available to the next code fragment, Forth style.

Thus `77+9` gets lexed into `77, +, 9`, parsed into `+(77, 9)`, compiled into `77 9 +`, linked into `77, 9 +` and executed into `int(86)`, and the rest of the source code proceeds to parse, compile, link, and run as if you had written `86`.

Further, the source code can create run time code, code that gets declared, defined, and linked during the compile and that is executed during the compile, modifying the behavior of the lexer, the parser, the compiler, and the linker over the course of a single compile and link.
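The five steps applied to `77+9` can be sketched in a few lines of Python. Everything named here (`BINDING`, `LINK_TABLE`, `compile_rpn`, and so on) is hypothetical, and the linker is reduced to a lookup keyed on the operator and the operand types, which is only one possible reading of "linking to concrete types".

```python
import operator
import re

BINDING = {"+": 10, "*": 20}                    # hypothetical binding powers
LINK_TABLE = {("+", int, int): operator.add,    # hypothetical linker symbol table
              ("*", int, int): operator.mul}


def lex(src):
    return re.findall(r"\d+|[+*]", src)


def parse(tokens, min_bp=0):
    left = int(tokens.pop(0))
    while tokens and BINDING[tokens[0]] > min_bp:
        op = tokens.pop(0)
        left = (op, left, parse(tokens, BINDING[op]))
    return left


def compile_rpn(tree):
    """Flatten the tree into untyped reverse Polish code."""
    if isinstance(tree, tuple):
        op, a, b = tree
        return compile_rpn(a) + compile_rpn(b) + [op]
    return [tree]


def link(code):
    """Resolve each operator against the types that will be on the stack."""
    types, linked = [], []
    for item in code:
        if isinstance(item, str):
            b, a = types.pop(), types.pop()
            linked.append(LINK_TABLE[(item, a, b)])  # a KeyError is a link time error
            types.append(int)  # assumption: every table entry returns int
        else:
            linked.append(item)
            types.append(type(item))
    return linked


def run(linked):
    stack = []
    for item in linked:
        if callable(item):
            b, a = stack.pop(), stack.pop()
            stack.append(item(a, b))
        else:
            stack.append(item)
    return stack.pop()


tokens = lex("77+9")              # ['77', '+', '9']
tree = parse(tokens)              # ('+', 77, 9)
code = compile_rpn(tree)          # [77, 9, '+']
print(run(link(code)))            # 86
```

The untyped `code` list is the stage that could be reused with other operand types; only `link` commits to `int`.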
This enables a Forth style bootstrapping, where the lexer, parser, compiler and linker lexes, compiles, and links most of its own potentially modifiable source code in every compile, much as every C++ compile includes the header files for the standard template library, so that much of your program is rewritten by template metacode that you included at the start of the program.

Compiled but not linked code could potentially operate on variables of any type, though if the variables did not have a type required by an operator, you would get a link time error, not a compile time error. This is OK because linking of a fragment of source is not a separate step, but usually happens before the lexer has gotten much further through the source code. It happens as soon as the code fragment is invoked with variables of defined type, though usually of as yet undefined value.

A console program is an operator whose values are of the type iostream. It gets linked as soon as the variable type is defined, and executed when you assign defined values to iostream.

Because C++ metacode is purely functional, it gets lazily evaluated, so the syntax and compiler can cheerfully leave it undefined when, or even if, it gets executed. Purely functional languages only terminate by laziness. But if we want to do the same things with procedural metacode, we have no option but to explicitly define what gets executed when. In which case pure LALR syntax is going to impact the semantics, since LALR syntax defines the order of parse time actions, and order of execution impacts the semantics. I am not altogether certain as to whether the result is going to be intelligible and predictable. Pratt syntax, however, is going to result in predictable execution order.

The declaration, obviously, defines code that can be executed by a subsequent parse action after the declaration parse action has been performed, and the definition defines code that can be compiled after the definition parse action has been performed.
The compiled code can be linked when invoked with variables of defined type and undefined value, and executed when invoked with variables of defined type and a defined value.

Consider what happens when the definition defines an overload for an infix operator. The definition of the infix operator can only be procedurally executed when the parser calls the infix action with the arguments on the parse stack, which happens long after the infix operator is overloaded.

The definition has to be parsed when the parser encounters it. But it is procedural code, which cannot be procedurally executed until later, much later. So the definition has to compile, not execute, procedural code, then cause the data structure created by the declaration to point to that compiled code. And then later, when the parser encounters an actual use of the infix operator, the compiled procedural code of the infix definition is actually executed to generate linked procedural code with explicit and defined types, which is part of the definition of the function or method in whose source code the infix operator was used.

One profoundly irritating feature of C++ code, probably caused by LL parsing, is that if the left hand side of an infix expression has an appropriate overloaded member operator, it works, but if only the right hand side does, it fails. Here we see parsing having an incomprehensible and arbitrary influence on semantics.

C++ is a strongly typed language. With types, any procedure has typed inputs and outputs, and should only do safe and sensible things for that type. C++ metacode manipulates types as first class objects, which implies that if we were to do the same thing procedurally, types need a representation, and procedural commands to make new types from old, and to garbage collect, or memory manage, operations on these data objects, as if they were strings, floating point numbers, or integers of known precision.
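The sequence above (declare the infix operator, attach a compiled definition without running it, link on first use with known operand types, run once values arrive) might look something like this in Python. `InfixOperator`, `Metres`, and the method names are all invented for illustration. Keying the overload table on both operand types means a right hand side overload is found just as readily as a left hand side one.

```python
class InfixOperator:
    """Record created by the declaration; definitions attach to it later."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.overloads = {}  # (left type, right type) -> compiled body

    def define(self, left, right, body):
        # Store the compiled body; nothing is executed at definition time.
        self.overloads[(left, right)] = body

    def link(self, left, right):
        # Called when the parser meets an actual use with known operand types.
        try:
            return self.overloads[(left, right)]
        except KeyError:
            raise TypeError(
                f"no {self.symbol} for {left.__name__}, {right.__name__}"
            ) from None


class Metres:
    def __init__(self, value):
        self.value = value


times = InfixOperator("*")                                   # declaration
times.define(Metres, int, lambda a, b: Metres(a.value * b))  # definition
times.define(int, Metres, lambda a, b: Metres(a * b.value))  # right hand overload

scale = times.link(int, Metres)     # link: types known, values not yet
print(scale(3, Metres(2)).value)    # run: prints 6
```

Here Python lambdas stand in for "compiled procedural code"; in the scheme described, the stored body would itself generate typed code when linked rather than simply being returned.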
So you could construct or destruct an object of type type, generate new types by doing type operations on old types, for example add two types or multiply two types to produce an algebraic type, or create a type that is a const type or pointer type to an existing type, which type actually lives in memory somewhere, in a variable like any other variable.

And, after constructing an algebraic type by procedurally multiplying two types, and perhaps storing it in a variable of type type, or invoking a function (aka C++ template type) that returns a type dynamically, create an object of that type, or an array of objects of that type.

For every actual object, the language interpreter knows the type, meaning the object of type X that you just constructed is somehow linked to the continuing existence of the object of type type that has the value type X that you used to construct it, and that type object cannot be destroyed until all the objects created using it are destroyed. Since the interpreter knows the type of every object, including objects of type type, and since every command to do something with an object is type aware, this can prevent the interpreter from being commanded to do something stupid.

Obviously type data has to be stored somewhere, and has to be immutable, at least until garbage collected because no longer referenced.

Can circular type references exist? Well, not if they are immutable, because if a type references a type, that type must already exist, and so cannot reference a type that does not yet exist. It could reference a function that generates types, but that reference is not circular. It could have values that are constexpr, and values that reference static variables. If no circular references are possible, garbage collection by reference counting works.

Types are algebraic types, sums and products of existing types, plus modifiers such as `const`, `*`, and `&`.
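As a rough sketch of types as immutable first class values, here is one way to model sums, products, and modifiers in Python using frozen dataclasses. The class names (`Prim`, `Product`, `Sum`, `Pointer`, `Const`) are assumptions for the example, not a proposal for the actual representation. Because each type value is frozen and can only be built from type values that already exist, a circular reference cannot be constructed.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Type:
    # Multiplying two types gives a product type, adding gives a sum type.
    def __mul__(self, other):
        return Product((self, other))

    def __add__(self, other):
        return Sum((self, other))


@dataclass(frozen=True)
class Prim(Type):
    name: str


@dataclass(frozen=True)
class Product(Type):
    fields: tuple


@dataclass(frozen=True)
class Sum(Type):
    alternatives: tuple


@dataclass(frozen=True)
class Pointer(Type):
    target: Type


@dataclass(frozen=True)
class Const(Type):
    target: Type


INT, FLOAT = Prim("int"), Prim("float")
pair = INT * FLOAT                # product type, stored in an ordinary variable
either = INT + FLOAT              # sum type
ptr = Pointer(Const(pair))        # modifiers applied to an existing type

print(pair == Product((INT, FLOAT)))   # True: equality is structural
try:
    pair.fields = ()                   # type values are immutable
except FrozenInstanceError:
    print("immutable")
```

Python's own garbage collector keeps `pair` alive for as long as `ptr` refers to it, which mirrors the lifetime rule described above without showing how reference counting would be implemented.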
Type information is potentially massive, and if we are executing a routine that refers to a type by the function that generates it, we don’t want that equivalent of a C++ template invoked every time, generating a new immutable object every time that is an exact copy of what it produced the last time it went through the loop. Rather, the interpreter needs to short circuit the construction by looking up a hash of that type constructing template call, to check whether it has already been called with those function inputs and has already produced an object of that type. And when a function that generates a type is executed, the interpreter needs to look for duplications of existing types. A great many template invocations simply choose the right type out of a small set of possible types. It is frequently the case that the same template may be invoked with an enormous variety of variables, and come up with very few different concrete results.

When the interpreter compiles a loop or a recursive call, the type information is likely to be an invariant, which should get optimized out of the loops. But when it is directly executing source code commands which command it to compile source code, such optimization is impossible.

But, as in Forth, you can tell the interpreter to store the commands in a routine somewhere, and when they are stored, the types have already been resolved. Typically the interpreter is going to finish interpreting the source code, producing stored programs each containing a limited amount of type information.