Go Go Go🎯

Unit 10.1 Compiler I: Roadmap

roadmap

Unit 10.2 Lexical Analysis

Tokenizing (first approximation)

Tokenizing = grouping characters into tokens.
A token is a string of characters that has a meaning.
A programming language specification must document (among other things) its allowable tokens.

Jack tokens

keywords
symbols
integers
strings
identifiers

Jack_tokens

Tokenizer:

Handles the compiler’s input
Allows advancing the input
Supplies the current token’s value and type
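To make the five token categories concrete, here is a minimal sketch of a classifier. It is not the project tokenizer: the names `Kind`, `Token` and `tokenize` are made up for illustration, and it assumes the input is a short Jack snippet that contains no comments.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical names, chosen for this sketch only.
enum class Kind { Keyword, Symbol, IntConst, StringConst, Identifier };
using Token = std::pair<Kind, std::string>;

// Groups the characters of a comment-free Jack snippet into tokens.
std::vector<Token> tokenize(const std::string& src) {
    static const std::unordered_set<std::string> keywords = {
        "class", "constructor", "function", "method", "field", "static",
        "var", "int", "char", "boolean", "void", "true", "false", "null",
        "this", "let", "do", "if", "else", "while", "return",
    };
    static const std::string symbols = "{}()[].,;+-*/&|<>=~";

    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // white space only separates tokens
        } else if (symbols.find(src[i]) != std::string::npos) {
            out.emplace_back(Kind::Symbol, std::string(1, src[i]));
            ++i;
        } else if (c == '"') {
            // String constant: everything up to the next double quote.
            const std::size_t end = src.find('"', i + 1);
            if (end == std::string::npos)
                throw std::invalid_argument("unterminated string constant");
            out.emplace_back(Kind::StringConst, src.substr(i + 1, end - i - 1));
            i = end + 1;
        } else if (std::isdigit(c)) {
            std::size_t j = i;
            while (j < src.size() && std::isdigit(static_cast<unsigned char>(src[j]))) ++j;
            out.emplace_back(Kind::IntConst, src.substr(i, j - i));
            i = j;
        } else if (std::isalpha(c) || c == '_') {
            // A word is a keyword if it is in the keyword set, else an identifier.
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_')) ++j;
            const std::string word = src.substr(i, j - i);
            out.emplace_back(keywords.count(word) ? Kind::Keyword : Kind::Identifier, word);
            i = j;
        } else {
            throw std::invalid_argument("unexpected character in input");
        }
    }
    return out;
}
```

For example, `tokenize("let x = \"hi\";")` yields five tokens: a keyword, an identifier, a symbol, a string constant (stored without its quotes) and a symbol.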

Unit 10.3 Grammars

A grammar is a set of rules, describing how tokens can be combined to create valid language constructs
Each rule consists of a left-hand side, listing a template’s name, and a right-hand side, describing how the template is composed.

Jack_grammar_subset

Parsing:

Determining if a given input conforms to a grammar
In the process, uncovering the grammatical structure of the given input.

Unit 10.4 Parse Trees

A parse tree is a data structure. If you have taken a data structures course, parse trees will be easy to understand.
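As a rough illustration, a parse tree can be modeled as a node with a label and a list of children. The sketch below is only one possible shape: `ParseNode`, `leaf`, `inner` and `toXml` are hypothetical names, and the XML layout (with two-space indentation) is an assumption made for readability.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One node of a parse tree (hypothetical type for this sketch).
struct ParseNode {
    std::string label;  // rule name (non-terminal) or token type (terminal)
    std::string text;   // token text; left empty for non-terminals
    std::vector<std::unique_ptr<ParseNode>> children;
};

// Creates a terminal node, e.g. leaf("keyword", "return").
std::unique_ptr<ParseNode> leaf(std::string label, std::string text) {
    auto n = std::make_unique<ParseNode>();
    n->label = std::move(label);
    n->text = std::move(text);
    return n;
}

// Creates a non-terminal node with no children yet.
std::unique_ptr<ParseNode> inner(std::string label) {
    return leaf(std::move(label), "");
}

// Writes the subtree rooted at n as indented XML.
void writeXml(const ParseNode& n, std::ostream& os, int depth = 0) {
    const std::string indent(depth * 2, ' ');
    if (!n.text.empty()) {
        // Terminal: tag and token text on one line.
        os << indent << '<' << n.label << "> " << n.text
           << " </" << n.label << ">\n";
        return;
    }
    // Non-terminal: open tag, children in order, close tag.
    os << indent << '<' << n.label << ">\n";
    for (const auto& child : n.children) writeXml(*child, os, depth + 1);
    os << indent << "</" << n.label << ">\n";
}

std::string toXml(const ParseNode& root) {
    std::ostringstream os;
    writeXml(root, os);
    return os.str();
}
```

Terminals are the leaves and non-terminals are the inner nodes, so printing the tree depth-first reproduces the tokens in their original order.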

Unit 10.5 Parser Logic

A. Parsing logic

Follow the rule, and parse the input accordingly
When you reach a non-terminal rule xxx, call compilexxx
Do this recursively

B. Init

Advance the tokenizer
Call compileWhileStatement (see the video for more details🔍)

C. Parser design

A set of compilexxx methods, one for each (non-trivial) rule xxx
The handling of some rules is embedded in other rules
The code of each compilexxx method follows the rule xxx
Each compilexxx method is responsible for advancing and handling its own part of the input

D. Some observations about grammars and parsing

LL grammar: can be parsed by a recursive descent parser without backtracking
LL(k) parser: a parser that needs to look ahead at most k tokens in order to determine which rule is applicable
The grammar that we saw so far is LL(1) (which makes it easier to implement.🎉)
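The ideas in A to D can be sketched on a toy grammar. The code below is not the Jack parser: `MiniParser` is a hypothetical class, the tokens are assumed to be pre-split strings, and the grammar is cut down to `statement: letStatement | whileStatement`, where a term is any single token. Its only purpose is to show that in an LL(1) grammar the current token alone decides which compilexxx routine to call.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy recursive descent parser (hypothetical, for illustration only).
class MiniParser {
public:
    explicit MiniParser(std::vector<std::string> tokens)
        : m_tokens(std::move(tokens)) {}

    // statements: statement*
    // Returns the names of the rules that were applied, in order.
    std::vector<std::string> parse() {
        while (m_pos < m_tokens.size()) compileStatement();
        return m_trace;
    }

private:
    // Looks at the current token without consuming it.
    const std::string& peek() const {
        if (m_pos >= m_tokens.size())
            throw std::runtime_error("unexpected end of input");
        return m_tokens[m_pos];
    }
    // Consumes the current token, whatever it is.
    void next() { peek(); ++m_pos; }
    // Consumes the current token, which must equal t.
    void expect(const std::string& t) {
        if (peek() != t)
            throw std::runtime_error("expected '" + t + "', got '" + peek() + "'");
        ++m_pos;
    }

    // LL(1) dispatch: one token of lookahead selects the rule.
    void compileStatement() {
        if (peek() == "let") compileLet();
        else if (peek() == "while") compileWhile();
        else throw std::runtime_error("expected let or while");
    }
    // letStatement: 'let' varName '=' term ';'
    void compileLet() {
        m_trace.push_back("letStatement");
        expect("let");
        next();  // varName
        expect("=");
        next();  // term
        expect(";");
    }
    // whileStatement: 'while' '(' term ')' '{' statement* '}'
    void compileWhile() {
        m_trace.push_back("whileStatement");
        expect("while");
        expect("(");
        next();  // term
        expect(")");
        expect("{");
        while (peek() != "}") compileStatement();  // recursion happens here
        expect("}");
    }

    std::vector<std::string> m_tokens;
    std::size_t m_pos = 0;
    std::vector<std::string> m_trace;
};
```

Parsing the tokens of `while (x) { let y = 1; }` gives the trace `whileStatement`, `letStatement`. Each routine consumes exactly its own tokens and leaves the position just past them, which is the contract described in C.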

Unit 10.6 The Jack Grammar

grammar_notation

lexical elements

lexical_elements

program structure

program_structure

statements

expressions

expressions_1

expressions_2

Unit 10.7 The Jack Analyzer

input ---(parser)---> XML output

terminal_rules

non-terminal_rules_1

non-terminal_rules_2

Unit 10.8 The Jack Analyzer: Proposed Implementations

Implementation plan: JackTokenizer -> CompilationEngine -> JackAnalyzer(top-most module)

The JackAnalyzer:

Uses the services of a JackTokenizer
Written according to the Jack grammar
The top-most module
Input: a single fileName.jack, or a directory containing 0 or more such files
For each file, goes through the following logic
- Creates a JackTokenizer from fileName.jack
- Creates an output file named fileName.xml and prepares it for writing
- Creates and uses a CompilationEngine to compile the input JackTokenizer into the output file.

The JackTokenizer:

Handles the compiler’s input
Ignores white space
Advances the input, one token at a time
Gets the value and type of the current token

The CompilationEngine design:

Gets its input from a JackTokenizer, and emits its output to an output file
The output is generated by a series of compilexxx routines, one for (almost) every non-terminal rule xxx in the grammar
Each compilexxx routine is responsible for handling all the tokens that make up xxx, advancing the tokenizer exactly beyond these tokens, and outputting the parsing of xxx

Unit 10.9 Project: Building a Syntax Analyzer

A syntax analyzer has three modules: JackAnalyzer, JackTokenizer and CompilationEngine. These modules are designed to handle the language’s syntax.

Here is the code (top-down):

main.cpp

#include <iostream>
#include <string>

#include "jackanalyzer.h"

using namespace std;

int main(int argc, char** argv) {
    if(argc != 2) {
        cout << "Input error!\nUsage: .\\SyntaxAnalyzer.exe [filename or filepath]" << endl;
        return 1;  // non-zero exit code: usage error
    }
    try {
        string path(argv[1]);
        JackAnalyzer jack(path);
        jack.doAnalyzing();
    } catch(const exception& e) {
        cerr << e.what() << endl;
        return 1;  // non-zero exit code: analysis failed
    }
    return 0;
}

✅ main.cpp: Start the analysis and catch exceptions for debugging.

jackanalyzer.h

#ifndef __JACK_ANALYZER_H__
#define __JACK_ANALYZER_H__

#include <vector>
#include <string>

using namespace std;

class JackAnalyzer {
public:
    JackAnalyzer(const string& filepath);    
    void doAnalyzing();
private:
    void parseFilePath(string path);
private:
    vector<string> m_filenames;
};

#endif

jackanalyzer.cpp

#include "jackanalyzer.h"
#include "compilationengine.h"

#include <fstream>
#include <iostream>
#include <stdexcept>
#include <windows.h>


JackAnalyzer::JackAnalyzer(const string& filepath) {
    parseFilePath(filepath);
}

void JackAnalyzer::doAnalyzing() {
    if(m_filenames.empty()) {
        throw runtime_error("Error! No .jack file found at the given file or directory path");
    }
    for(auto& s: m_filenames) {
        // generate XxxT_.xml(tokenized) for test
        JackTokenizer jacktokenizer(s);
        string t1 = s.substr(0, s.find(".jack")) + "T_.xml";

        ofstream t_ofs(t1);
        t_ofs << "<tokens>\n";
        while(jacktokenizer.hasMoreTokens()) {
            jacktokenizer.advance();
            t_ofs << jacktokenizer.curTokenToXMLString() << '\n';
        }
        t_ofs << "</tokens>\n";
        t_ofs.close();
        
        // generate Xxx_.xml(compiled) for test
        string t2 = s.substr(0, s.find(".jack")) + "_.xml";
        CompilationEngine cengine(s, t2);
    }
}

void JackAnalyzer::parseFilePath(string path) {
    bool isDir = true;
    for(int i = path.length() - 1; i >= 0; --i) {
        if(path[i] == '\\') {
            break;
        } else if(path[i] == '.') {
            isDir = false;
        }
    }
    if(isDir) {
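        WIN32_FIND_DATAA data;
        string tmp = path + "\\*";
        HANDLE hFind = FindFirstFileA(tmp.c_str(), &data);
        // FindFirstFileA fails for a missing or unreadable directory;
        // in that case data is not filled in and must not be read.
        if(hFind == INVALID_HANDLE_VALUE) {
            return;
        }
        do {
            string t(data.cFileName);
            auto it = t.find(".jack");
            if(it != string::npos) {
                m_filenames.push_back(path + "\\" + t);
            }
        } while(FindNextFileA(hFind, &data));
        FindClose(hFind);  // release the search handle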
    } else {
        m_filenames.push_back(path);
    }
}

✅ JackAnalyzer: Resolve the input path into a list of .jack files, then tokenize and compile each one.

jacktokenizer.h

#ifndef __JACK_TOKENIZER_H__
#define __JACK_TOKENIZER_H__

#include <string>
#include <sstream>

using namespace std;

class JackTokenizer {
public:
    enum TokenType {
        T_UNKNOWN,
        T_KEYWORD,
        T_SYMBOL,
        T_IDENTIFIER,
        T_INT_CONST,
        T_STRING_CONST,
    };
public:
    JackTokenizer(const string& filepath);
    bool hasMoreTokens();
    void advance();
    TokenType tokenType();
    string keyword();
    char symbol();
    string identifier();
    int intVal();
    string stringVal();

    string curTokenToXMLString();
    
private:
    stringstream m_iss;
    string m_curToken;
};

#endif

jacktokenizer.cpp

#include "jacktokenizer.h"

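#include <algorithm>      // all_of
#include <cctype>         // isalnum, isdigit
#include <fstream>
#include <iostream>
#include <iterator>       // istreambuf_iterator
#include <regex>
#include <stdexcept>      // runtime_error
#include <unordered_set>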

static unordered_set<char> s_symbols = {
    '{', '}', '(', ')', '[', ']',
    '.', ',', ';', 
    '+', '-', '*', '/',
    '&', '|', '<', '>', '=', '~',
};

static unordered_set<string> s_keywords = {
    "class", "method", "int", "function",
    "boolean", "constructor", "char",
    "void", "var", "static", "field",
    "let", "do", "if", "else", "while",
    "return", "true", "false", "null", 
    "this",
};

static string removeComments(const string& source) {
    enum State {
        NORMAL,
        IN_STRING,
        IN_SINGLE_LINE_COMMENT,
        IN_MULT_LINE_COMMENT,
    };
    State state = NORMAL;
    string result;
    size_t i = 0;
    size_t len = source.length();

    while(i < len) {
        char current = source[i];
        char next = (i + 1 < len) ? source[i + 1] : '\0';
        switch(state) {
            case NORMAL:
                if(current == '"') {
                    result += current;
                    state = IN_STRING;
                } else if(current == '/' && next == '/') {
                    state = IN_SINGLE_LINE_COMMENT;
                    ++i;
                } else if(current == '/' && next == '*') {
                    state = IN_MULT_LINE_COMMENT;
                    ++i;
                } else {
                    result += current;
                }
                break;
            case IN_STRING:
                result += current;
                // Jack string constants have no escape sequences, so the
                // next double quote always ends the string.
                if(current == '"') {
                    state = NORMAL;
                }
                break;
            case IN_SINGLE_LINE_COMMENT:
                if(current == '\n') {
                    result += '\n';
                    state = NORMAL;
                }
                break;
            case IN_MULT_LINE_COMMENT:
                if(current == '*' && next == '/') {
                    state = NORMAL;
                    ++i;
                }
                break;
        }
        ++i;
    }

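    // Collapse every run of whitespace (including line breaks) into a single
    // space, then trim the end. Line breaks must become a space rather than
    // vanish, or tokens on adjacent lines (e.g. "int\nx") would fuse together.
    // Note: this also collapses whitespace inside string constants.
    regex space_re("\\s+");
    regex tailSpace_re("\\s+$");
    result = regex_replace(result, space_re, " ");
    result = regex_replace(result, tailSpace_re, "");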

    return result;
}

JackTokenizer::JackTokenizer(const string& filepath) {
    ifstream ifs(filepath);
    if(!ifs) {
        throw runtime_error("JackTokenizer fails to open file: " + filepath);
    }
    string src_code((istreambuf_iterator<char>(ifs)), istreambuf_iterator<char>());
    ifs.close();
    string dst_code = removeComments(src_code);
    m_iss << dst_code;
}

bool JackTokenizer::hasMoreTokens() {
    return !m_iss.eof() && m_iss.peek() != EOF;
}

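void JackTokenizer::advance() {
    if(!hasMoreTokens()) {
        throw runtime_error("JackTokenizer: expected a token");
    }

    m_curToken.clear();
    while(hasMoreTokens()) {
        char next = m_iss.peek();
        if(next == '"') {
            // A string constant is a token of its own: a token already in
            // progress ends here, otherwise read up to the closing quote.
            if(!m_curToken.empty())
                break;
            m_curToken.push_back(m_iss.get());
            while(hasMoreTokens()) {
                char c = m_iss.get();
                m_curToken.push_back(c);
                if(c == '"')
                    break;
            }
            break;
        } else if(isalnum(static_cast<unsigned char>(next)) || next == '_') {
            // Jack identifiers may contain letters, digits and underscores.
            m_curToken.push_back(m_iss.get());
        } else if(next == ' ') {
            m_iss.get();
            if(!m_curToken.empty())
                break;
        } else if(s_symbols.count(next)) {
            if(m_curToken.empty())
                m_curToken.push_back(m_iss.get());
            break;
        } else {
            // Any other character is not part of the Jack lexicon; throwing
            // avoids looping forever on a character that is never consumed.
            throw runtime_error(string("JackTokenizer: unexpected character '") + next + "'");
        }
    }
}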

JackTokenizer::TokenType JackTokenizer::tokenType() {
    TokenType ret = T_UNKNOWN;
    if(m_curToken.empty())
        return ret;
    else if(s_symbols.count(m_curToken[0]))
        ret = T_SYMBOL;
    else if(s_keywords.count(m_curToken))
        ret = T_KEYWORD;
    else if(all_of(m_curToken.begin(), m_curToken.end(), [](char c) { return isdigit(c); }))
        ret = T_INT_CONST;
    else if(m_curToken[0] == '"') 
        ret = T_STRING_CONST;
    else
        ret = T_IDENTIFIER;
    return ret;
}

string JackTokenizer::keyword() {
    return m_curToken;
}

char JackTokenizer::symbol() {
    return m_curToken[0];
}

string JackTokenizer::identifier() {
    return m_curToken;
}

int JackTokenizer::intVal() {
    return stoi(m_curToken);
}

string JackTokenizer::stringVal() {
    return m_curToken.substr(1, m_curToken.length() - 2);
}

string JackTokenizer::curTokenToXMLString() {
    stringstream ss;
    TokenType tt = tokenType();
    if(tt == JackTokenizer::TokenType::T_SYMBOL) {
        ss << "<symbol> ";
        if(m_curToken == "<")
            ss << "&lt;";
        else if(m_curToken == ">")
            ss << "&gt;";
        else if(m_curToken == "&")
            ss << "&amp;";
        else 
            ss << symbol();
        ss << " </symbol>";
    } else if(tt == JackTokenizer::TokenType::T_KEYWORD) {
        ss << "<keyword> " << keyword() << " </keyword>";
    } else if(tt == JackTokenizer::TokenType::T_INT_CONST) {
        ss << "<integerConstant> " << intVal() << " </integerConstant>";
    } else if(tt == JackTokenizer::TokenType::T_STRING_CONST) {
        ss << "<stringConstant> " << stringVal() << " </stringConstant>";
    } else if(tt == JackTokenizer::TokenType::T_IDENTIFIER) {
        ss << "<identifier> " << identifier() << " </identifier>";
    }
    return ss.str();
}

✅ JackTokenizer: Remove the comments from the source file and tokenize the source code.
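One gap worth noting: curTokenToXMLString escapes <, > and & only when they appear as symbols, so a string constant such as "a < b" would be written to the XML unescaped. A small helper along these lines could be applied to the string value as well. `escapeXml` is a hypothetical name and is not part of the code above.

```cpp
#include <string>

// Replaces the characters that are special in XML text with entities.
std::string escapeXml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            default:  out += c; break;
        }
    }
    return out;
}
```

With it, the string branch could emit `escapeXml(stringVal())` instead of `stringVal()`, and the symbol branch could reuse the same helper.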

compilationengine.h

#ifndef __COMPILATION_ENGINE_H__
#define __COMPILATION_ENGINE_H__


#include <fstream>
#include <initializer_list>

#include "jacktokenizer.h"

using namespace std;

class CompilationEngine {
public:
    CompilationEngine(const string& input, const string& output);
    ~CompilationEngine();

private:    
    inline void outputAndAdvance();
    inline bool isType();
    inline bool isSymbol(char c);
    inline bool isSubroutine();
    
    void compileClass();
    void compileClassVarDec();
    void compileSubroutine();
    void compileParameterList();
    void compileVarDec();
    void compileStatements();
    void compileDo();
    void compileLet();
    void compileWhile();
    void compileReturn();
    void compileIf();
    void compileExpression();
    void compileTerm();
    void compileExpressionList();

private:
    ofstream m_ofs;
    JackTokenizer m_tokenizer;
};

#endif

compilationengine.cpp

#include "compilationengine.h"

#include <stdexcept>
#include <unordered_set>

CompilationEngine::CompilationEngine(const string& input, const string& output)
    : m_ofs(output), m_tokenizer(input) {
    if(!m_ofs) {
        throw(runtime_error("CompilationEngine fails to open file: " + output));
    }
    if(!m_tokenizer.hasMoreTokens())
        throw(runtime_error("The input file is empty"));

    m_tokenizer.advance();

    if(m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD 
        && m_tokenizer.keyword() == "class") {
        compileClass();
    } else {
        throw(runtime_error("The first keyword is not class"));
    }
}

CompilationEngine::~CompilationEngine() {
    m_ofs.close();
}

inline void CompilationEngine::outputAndAdvance() {
    m_ofs << m_tokenizer.curTokenToXMLString() << '\n';
    m_tokenizer.advance();
}

inline bool CompilationEngine::isType() {
    return m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER ||
          (m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD && 
          (m_tokenizer.keyword() == "int" || 
           m_tokenizer.keyword() == "char" ||
           m_tokenizer.keyword() == "boolean"));
}

inline bool CompilationEngine::isSymbol(char c) {
    return m_tokenizer.tokenType() == JackTokenizer::T_SYMBOL &&
            m_tokenizer.symbol() == c;
}

inline bool CompilationEngine::isSubroutine() {
    return m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD &&
          (m_tokenizer.keyword() == "constructor" ||
           m_tokenizer.keyword() == "function" ||
           m_tokenizer.keyword() == "method");
}

void CompilationEngine::compileClass() {
    m_ofs << "<class>\n";

    // class className { classVarDec* subroutineDec* }
    // class
    outputAndAdvance();
    // className
    if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
        outputAndAdvance();
    } else {
        throw runtime_error("(class)expected className");
    }
    // {
    if(isSymbol('{')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(class)expected '{'");
    }
    // classVarDec*
    compileClassVarDec();
    // subroutineDec*
    compileSubroutine();
    // }
    if(isSymbol('}')) {
        m_ofs << m_tokenizer.curTokenToXMLString() << '\n';
    } else {
        throw runtime_error("(class)expected '}'");
    }

    m_ofs << "</class>\n";
}

void CompilationEngine::compileClassVarDec() {
    while(m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD &&
         (m_tokenizer.keyword() == "static" || 
          m_tokenizer.keyword() == "field")) {
        m_ofs << "<classVarDec>\n";
        // (static | field) type varName (, varName)*;
        // static | field
        outputAndAdvance();
        // type
        if(isType()) {
            outputAndAdvance();
        } else {
            throw runtime_error("(classVarDec)expected type");
        }
        // varName
        if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
            outputAndAdvance();
            while(isSymbol(',')) {
                // ,
                outputAndAdvance();
                if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(classVarDec-,)expected varName");
                }
            }
        } else {
            throw runtime_error("(classVarDec)expected varName");
        }
        // ;
        if(isSymbol(';')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(classVarDec)expected ';'");
        }

        m_ofs << "</classVarDec>\n";
    }
}



void CompilationEngine::compileSubroutine() {
    while(isSubroutine()) {
        m_ofs << "<subroutineDec>\n";
        // (constructor | function | method)
        // (void | type) subroutineName ( parameterList )
        // subroutineBody
        // (constructor | function | method)
        outputAndAdvance();
        // void | type    
        if((m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD && m_tokenizer.keyword() == "void") ||
            isType()) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineDec)expected type");
        }
        // subroutineName
        if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineDec)expected subroutineName");
        }
        // (
        if(isSymbol('(')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineDec)expected '('");
        }
        // parameterList
        compileParameterList();
        // )
        if(isSymbol(')')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineDec)expected ')'");
        }
        // subroutineBody
        // { varDec* statements }
        m_ofs << "<subroutineBody>" << '\n';
        // {
        if(isSymbol('{')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineBody)expected '{'");
        }
        // varDec
        compileVarDec();
        // statements
        compileStatements();
        // }
        if(isSymbol('}')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineBody)expected '}'");
        }
        m_ofs << "</subroutineBody>" << '\n';

        m_ofs << "</subroutineDec>\n";
    }
}

void CompilationEngine::compileParameterList() {
    m_ofs << "<parameterList>\n";
    // ((type varName)(, type varName)*)?
    while(isType()) {
        // type
        outputAndAdvance();
        // varName
        if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
            outputAndAdvance();
            // ,
            if(isSymbol(',')) {
                outputAndAdvance();
            } else {
                break;
            }
        } else {
            throw runtime_error("(parameterlist)expected varName");
        }
    }

    m_ofs << "</parameterList>\n";
}

void CompilationEngine::compileVarDec() {
    // var type varName (, varName)* ;
    while(m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD &&
            m_tokenizer.keyword() == "var") {
        m_ofs << "<varDec>\n";
        // var
        outputAndAdvance();
        if(isType()) {
            // type
            outputAndAdvance();
            while(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
                // varName
                outputAndAdvance();
                // ,
                if(isSymbol(',')) {
                    outputAndAdvance();
                } else {
                    break;
                }
            }
            // ;
            if(isSymbol(';')) {
                outputAndAdvance();
            } else {
                throw runtime_error("(VarDec)expected ';'");
            }
        } else {
            throw runtime_error("(VarDec)expected type");
        }
        m_ofs << "</varDec>\n";
    }
}

void CompilationEngine::compileStatements() {
    m_ofs << "<statements>\n"; 

    while(m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD) {
        if(m_tokenizer.keyword() == "let") {
            compileLet();
        } else if(m_tokenizer.keyword() == "if") {
            compileIf();
        } else if(m_tokenizer.keyword() == "while") {
            compileWhile();
        } else if(m_tokenizer.keyword() == "do") {
            compileDo();
        } else if(m_tokenizer.keyword() == "return") {
            compileReturn();
        } else {
            throw runtime_error("(statements)expected let or if or while or do or return");
        }
    }

    m_ofs << "</statements>\n";
}

void CompilationEngine::compileDo() {
    m_ofs << "<doStatement>\n";
    // do subroutineCall
    // do
    outputAndAdvance();
    // subroutineName | className | varName
    if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
        outputAndAdvance();
    } else {
        throw runtime_error("(Do)expected subroutineName");
    }
    // ( | .
    if(isSymbol('(')) {
        // (
        outputAndAdvance();
        // expressionList
        compileExpressionList();
        // )
        if(isSymbol(')')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(SubroutineCall)expected ')'");
        }
    } else if(isSymbol('.')) {
        // .
        outputAndAdvance();
        // subroutineName
        if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineCall)expected 'subroutineName'");
        }
        // (
        if(isSymbol('(')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineCall)expected '('");
        }
        // expressionList
        compileExpressionList();
        // )
        if(isSymbol(')')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(subroutineCall)expected ')'");
        }
    } else {
        throw runtime_error("(subroutineCall)expected '(' or '.'");
    }
    // ;
    if(isSymbol(';')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(Do)expected ';'");
    }

    m_ofs << "</doStatement>\n";
}

void CompilationEngine::compileLet() {
    m_ofs << "<letStatement>\n";

    // let
    outputAndAdvance();
    // varName
    if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
        outputAndAdvance();
    } else {
        throw runtime_error("(Let)expected varName");
    }
    // ([expression])?
    if(isSymbol('[')) {
        outputAndAdvance();
        compileExpression();
        if(isSymbol(']')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(Let)expected ']'");
        }
    }
    // =
    if(isSymbol('=')) {
       outputAndAdvance();
    } else {
        throw runtime_error("(Let)expected '='");
    }
    // expression
    compileExpression();
    // ;
    if(isSymbol(';')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(Let)expected ';'");
    }

    m_ofs << "</letStatement>\n";
}

void CompilationEngine::compileWhile() {
    m_ofs << "<whileStatement>\n";

    // while
    outputAndAdvance();
    // (
    if(isSymbol('(')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(While)expected '('");
    }
    // expression
    compileExpression();
    // )
    if(isSymbol(')')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(While)expected ')'");
    }
    // {
    if(isSymbol('{')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(While)expected '{'");
    }
    // statements
    compileStatements();
    // }
    if(isSymbol('}')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(While)expected '}'");
    }

    m_ofs << "</whileStatement>\n";
}

void CompilationEngine::compileReturn() {
    m_ofs << "<returnStatement>\n";

    // return expression?;
    // return
    outputAndAdvance();
    // expression?
    if(!isSymbol(';')) {
        compileExpression();
    }
    // ;
    if(isSymbol(';')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(Return)expected ';'");
    }
    m_ofs << "</returnStatement>\n";
}

void CompilationEngine::compileIf() {
    m_ofs << "<ifStatement>\n";

    // if
    outputAndAdvance();
    // (
    if(isSymbol('(')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(If)expected '('");
    }
    // expression
    compileExpression();
    // )
    if(isSymbol(')')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(If)expected ')'");
    }
    // {
    if(isSymbol('{')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(If)expected '{'");
    }
    // statements
    compileStatements();
    // }
    if(isSymbol('}')) {
        outputAndAdvance();
    } else {
        throw runtime_error("(If)expected '}'");
    }
    // (else { statements })?
    // else
    if(m_tokenizer.tokenType() == JackTokenizer::T_KEYWORD &&
        m_tokenizer.keyword() == "else") {
        outputAndAdvance();
        // {
        if(isSymbol('{')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(If-else)expected '{'");
        }
        // statements
        compileStatements();
        // }
        if(isSymbol('}')) {
            outputAndAdvance();
        } else {
            throw runtime_error("(If-else)expected '}'");
        }
    }

    m_ofs << "</ifStatement>\n";
}

void CompilationEngine::compileExpression() {
    m_ofs << "<expression>\n";

    // term (op term)*
    // term
    compileTerm();
    static const unordered_set<char> s_op = {
        '+', '-', '*', '/', '&', '|', '<', '>', '=',
    };
    // op term
    while(m_tokenizer.tokenType() == JackTokenizer::T_SYMBOL &&
            s_op.count(m_tokenizer.symbol())) {
        outputAndAdvance();
        compileTerm();
    }

    m_ofs << "</expression>\n";
}

void CompilationEngine::compileTerm() {
    m_ofs << "<term>\n";
    // integerConstant | stringConstant | keywordConstant |
    // varName | varName [expression] | subroutineCall |
    // (expression) | unaryOp term
    JackTokenizer::TokenType tt = m_tokenizer.tokenType();
    if(tt == JackTokenizer::T_INT_CONST) {
        // integerConstant
        outputAndAdvance();
    } else if(tt == JackTokenizer::T_STRING_CONST) {
        // stringConstant
        outputAndAdvance();
    } else if(tt == JackTokenizer::T_KEYWORD) {
        if(m_tokenizer.keyword() == "true" || m_tokenizer.keyword() == "false" ||
            m_tokenizer.keyword() == "null" || m_tokenizer.keyword() == "this") {
            // keywordConstant
            outputAndAdvance();
        } else {
            throw runtime_error("(term)expected keyword: true, false, null or this");
        }
    } else if(tt == JackTokenizer::T_IDENTIFIER) {
        // varName | subroutineName
        outputAndAdvance();
        if(m_tokenizer.tokenType() == JackTokenizer::T_SYMBOL) {
            // [expression]
            if(m_tokenizer.symbol() == '[') {
                outputAndAdvance();
                compileExpression();
                if(isSymbol(']')) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(term)expected ']'");
                }
            } else if(m_tokenizer.symbol() == '(') {
                // (expressionList)
                outputAndAdvance();
                compileExpressionList();
                if(isSymbol(')')) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(term)expected ')'");
                }
            } else if(m_tokenizer.symbol() == '.') {
                // . subroutineName ( expressionList )
                // .
                outputAndAdvance();
                if(m_tokenizer.tokenType() == JackTokenizer::T_IDENTIFIER) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(term)expected subroutineName");
                }
                // (
                if(isSymbol('(')) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(term)expected '('");
                }
                // expressionList
                compileExpressionList();
                // )
                if(isSymbol(')')) {
                    outputAndAdvance();
                } else {
                    throw runtime_error("(term)expected ')'");
                }
            }
        }
    } else if(tt == JackTokenizer::T_SYMBOL) {
        // ( expression )
        if(m_tokenizer.symbol() == '(') {
            outputAndAdvance();
            // expression
            compileExpression();
            // )
            if(isSymbol(')')) {
                outputAndAdvance();
            } else {
                throw runtime_error("(term)expected ')'");
            }
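        } else if(m_tokenizer.symbol() == '-' ||
            m_tokenizer.symbol() == '~') {
            // unaryOp
            outputAndAdvance();
            // term
            compileTerm();
        } else {
            // no other symbol can start a term; fail instead of emitting an empty <term>
            throw runtime_error("(term)expected '(', '-' or '~'");
        }
    } else {
        throw runtime_error("(term)expected a term");
    }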

    m_ofs << "</term>\n";
}

void CompilationEngine::compileExpressionList() {
    m_ofs << "<expressionList>\n";

    if(!isSymbol(')')) {
        compileExpression();
        while(isSymbol(',')) {
            // ,
            outputAndAdvance();
            // expression
            compileExpression();
        }
    }

    m_ofs << "</expressionList>\n";
}

✅ CompilationEngine: Parse the token stream according to the Jack grammar and write the result as XML.

Unit 10.10 Perspective

Compilers don’t only translate programs, they also find and report errors. Are we going to handle errors in our Jack compiler?

We decided to sidestep error handling completely in this module. It’s an optional feature that you can add yourself.

Can we use the techniques that we learned in this module to develop parsers for other programming languages?

Parsers are useful not only for parsing programs but also for parsing any syntax-based text, which comes up a lot in the careers of many application programmers. The text does not have to be written in a language like Java or C++.

Why didn’t we use lex and yacc?

lex and yacc are two software tools that come from the world of Unix. Lex is short for lexical analyzer: a tool that generates tokenizing code automatically. Yacc stands for the whimsical "Yet Another Compiler-Compiler": a tool that generates parsing code automatically. Together they generate syntax analysis code that can be customized and developed into a full-scale syntax analysis tool. But this course is about building everything from scratch, from the ground up, with your bare hands, in order to explore and understand. Using black-box tools like lex and yacc goes against the Nand2tetris spirit.✨