Tuesday, December 21, 2010

A C++ Implementation of a Mathematical Expression Parser

The topic of a parser for mathematical expressions is one of the most basic in Computer Science and has been implemented probably in thousands of computer programs. Nevertheless, after some googling, I wasn’t able to find a simple code in object oriented C++. What I found was either code that was written in an old C-style procedural language (called C++ because of some new C++ constructs), kind of “write once, read never”, or the code was hidden, requiring users to pay hundreds of dollars to read the source.

The purpose of this article is to present a simple code for parsing mathematical expressions (in total it has just over 300 lines of code, including both implementation classes and a command line client) in an easy to read C++. The presented implementation has a slight modification of the general parsing algorithm. The evaluation stack contains structures, each of which contains a double value and a corresponding sign, found right after the value. The computation is done via creating a number of stacks and combining the above mentioned structures. The parsing string is scanned only once during processing. Another advantage of this implementation is the decoupling of the parser and function definitions, which can be easily added on the fly. The function definition data structure consists of a map from a string token to the pointer of the function implementing this token functionality.

Algorithm description

The parser has just two classes – EZParser, which contains all the parsing logic, and EZParserFunctions which contains function definitions (e.g. exp, sin, cos, pi, etc.). If there is a need to include a new function or functionality (e.g. atan, or differentiation), either EZParserFunctions class has to be augmented or a new class has to be derived from EZParserFunctions and passed to EZParser during initialization (the corresponding virtual methods of EZParserFunctions class should be overridden then).

The mathematical expression string can have the following items, ordered by priority:

  1. Parenthesis.
  2. Functions from EZParserFunctions class (exp, sin, cos, etc.).
  3. Power sign ‘^’.
  4. Multiplication ‘*’ and division ‘/’ signs.
  5. Addition ‘+’ and subtraction ‘-‘signs.
 The algorithm is O(n), where n is the string length, and the string is scanned only once.
Note that the parser (EZParser::load() function) expects the string without any white characters and with the same number of opening and closing parenthesis, so if the string may contain whites/uneven parenthesis, another scann through the string is needed (implemented in EZParser::preprocess function).

The string is scanned only once during processing because of the “size_t from” in and out function argument. It points to the current character in the parsing stringand it is incremented after each character is read from the parsing string.

This is the main parsing class definition. Note that only standard C++ header files are needed in addition to a function definition class (we’ll talk about it later):


#include <vector>
#include <map>
#include <cmath>
#include <cstdio>
#include <iostream>

using namespace std;

class EZParserFunctions;

class EZParser
{
  public:
    class  CalculationException : public exception
    {
          public:
            CalculationException(string err) : exception(), m_what(err) {}
            ~CalculationException() throw() {};
            const char* what() const throw() { return m_what.c_str(); }
          private:
            string m_what;
    };

    EZParser() { stack.reserve(16); } // an "intelligent" guess
    double calculate(const string& data);
    double load(const string& data, size_t& from, char c = '\n');

    static void preprocess(const string& from, string& to);
    static EZParserFunctions* theFunctions;

  private:
    struct Cell
    {
          Cell(double v, char ch) { value = v; sign = ch; }
          double value;
          char sign;
    };

    double calculate();
    void calculateNext(Cell& base, Cell& current, size_t& index);

    static char updateSign(const string& item, size_t& from, char ch);
    static bool validSign(char ch);

    static bool stillCollecting(const string& item, char ch);
    static void mergeCells(Cell& base, Cell& next);

    vector<Cell> stack;
};

The main entry point is the following function:

double EZParser::calculate(const string& data)
{
        string cleanData;
        EZParser::preprocess(data, cleanData);
        size_t iter = 0;
        return load(cleanData, iter);
}
which in turn calls the following main processing function:

double EZParser::load(const string& data, size_t& from, char end)
{
        if (from >= data.size() || data[from] == end)
        {
                throw CalculationException("Loaded invalid data " + data);
        }

        stack.clear();
        string item;

        do {  // main processing cycle
                char ch = data[from++];
                if (stillCollecting(item, ch))
                { // the char still  belongs to the previous operand
                        item += ch;
                        if (from < data.size() && data[from] != end)
                                continue;
                }
                double value = theFunctions->getValue(data, from, item, ch);
                char sign = (end != '\n' && validSign(ch)) ? ch : updateSign(data, from, ch);
                stack.push_back(Cell(value, sign));
                item.clear();
        }       while(from < data.size() && data[from] != end);

        if (from < data.size() && data[from] == ')')
        { // end of parenthesis: move one char forward
            from++;
        }

        return calculate();
}
Let’s see how it works via an example.
Suppose we have the following mathematical expression for evaluation:

exp(1+(15-2*2^3))

i.e. the load() function is called with the following arguments:

load(“exp(1+(15-2*2^3))”, 0, ‘\n’);

Inside of the load function we start parsing this string character till we find the end of token condition, which is the following (“item” is the newly created token, “ch” is the current character, i.e. we start with item = “”, ch = ‘e’, then, on the next loop, item = “e”, ch = ‘x’, then item = “ex”, ch = ‘p’, and so on):

bool EZParser::stillCollecting(const string& item, char ch)
{
        return (item.size() == 0 && (ch == '-' || ch == ')')) ||
          (ch != '*' && ch != '/' && ch != '+' && ch != '-' && ch != '^' && ch != '(' && ch != ')');
}

In other words we stop creating a new token as soon as we find parenthesis or an arithmetical operation. EZParser::stillCollecting will return false when item = “exp” and ch = ‘(‘.

Next we check if the new token is one of the predefined functions.
All the functions are defined in a different class, EZParserFunctions:

class EZParserFunctions
{
        public:
                typedef double (EZParserFunctions::*ptrFunc)(const string& arg, size_t& from);
                EZParserFunctions();

                virtual double getValue(const string& data, size_t& from, const string& item, char ch);

                double exp(const string& arg, size_t& from);
                double abs(const string& arg, size_t& from);
                double log(const string& arg, size_t& from);
                double avg(const string& arg, size_t& from);
                double sin(const string& arg, size_t& from);
                double cos(const string& arg, size_t& from);
                double tan(const string& arg, size_t& from);
                double sqrt(const string& arg, size_t& from);
                double round(const string& arg, size_t& from);
                double identity(const string& arg, size_t& from);
                double pi(const string& arg, size_t& from);

            static double strtod(const string& str);

        protected:
                map<string, ptrFunc> functions;
};

Here are some of the implementations:

EZParserFunctions::EZParserFunctions()
{
        functions["exp"]   = EZParserFunctions::exp;
        functions["log"]   = EZParserFunctions::log;
        functions["sin"]   = EZParserFunctions::sin;
        functions["cos"]   = EZParserFunctions::cos;
        functions["sqrt"]  = EZParserFunctions::sqrt;
        functions["abs"]   = EZParserFunctions::abs;
        functions["round"] = EZParserFunctions::round;
        functions["pi"]    = EZParserFunctions::pi;
}

double EZParserFunctions::identity(const string& arg, size_t& from)
{
        EZParser sub;
        return sub.load(arg, from, ')');
}
double EZParserFunctions::pi(const string& arg, size_t& from)
{
        return 3.141592653589793;
}
double EZParserFunctions::exp(const string& arg, size_t& from)
{
        EZParser sub;
        return ::exp(sub.load(arg, from, ')'));
}
The main advantage of this class is that it has a map of string tokens to a pointer to an actual function implementation. Note that a map from a string to a function pointer is quite efficient when the number of elements is small (besides of another great advantage of being a standard C++ data structure).

Returning back our example, we are in EZParser::load function, in the call on line 15:

    double value = theFunctions->getValue(“exp(1+(15-2*2^3))”, 4, “exp”, ‘(‘);
The implementation of the called virtual function is the following:

double EZParserFunctions::getValue(const string& data, size_t& from, const string& item, char ch)
{
        if (item.empty() && ch == '(')
        {
                return identity(data, from);
        }
        map<string, ptrFunc>::const_iterator it = functions.find(item);
        return it == functions.end() ? strtod(item) : (this->*(it->second))(data, from);
}
This function will find EZParserFunctions::exp function in the functions map, i.e. we will return the following call from the above function:

    return this->exp((“exp(1+(15-2*2^3))”, 4);  // last line of EZParserFunctions::getValue

Note that if we did not have a definition of the exp function in EZParserFunctions file, the call to EZParserFunctions::strtod(“exp”) would have been made, which would’ve thrown an exception:

double EZParserFunctions::strtod(const string& str)
{
    char* x = new char[str.size() + 1];
    double num = ::strtod(str.c_str(), &x);
    if (::strlen(x) > 0)
    {
          throw EZParser::CalculationException("Could not parse token [" + str + "]");
    }
    return num;
}
But in our case, inside of EZParserFunctions::exp() function, we will first need to load the argument for the exponent. This will be done via defining a new EZParser object and calling its load method with the following arguments:

    sub.load(“exp(1+(15-2*2^3))”, 4, ')');  // line 2 of EZParserFunctions::exp() function

i.e. load() function will start parsing the string from the 4th element (starting counting from 0), i.e. the working substring will be “1+(15-2*2^3))”.

Inside of the load() function we will get the first token as “1”, and the sign as ‘+’. So the following Cell object will be created and pushed to the stack:

    double value = strtod(“1”); // = 1; line 15 of EZParser::load() function
    stack.push_back(Cell(1, ‘+’)); // line 17 of EZParser::load() function
Let’s take a brief look at line 16 of the load function. What it does is the following: if the current character after the token is a valid sign (as it is in our case, since the sign ‘+’ follows immediately number 1) it just takes it. But if it were not a valid sign (e.g. consider expression “(1+2)+3”: after parsing the second token “2” the sign is ‘)’ which is an invalid sign, so the updateSign function would have been called which would’ve examined the string till it found a valid sign (‘+’ in our case)). Note that the last token in an expression always has an invalid sign, but as we will see later this is OK. Here are the definitions of the functions for finding a token’s sign:

char EZParser::updateSign(const string& item, size_t& from, char ch)
{
        if (from >= item.size() || item[from] == ')')
        {
                return ')';
        }
        size_t tmp = from;
        char res = ch;
        while (!validSign(res) && tmp < item.size())
        {
                res = item[tmp++];
        }

        from = validSign(res) ? tmp : tmp > from ? tmp - 1 : from;
        return res;
}

bool EZParser::validSign(char ch)
{
        return ch == '*' || ch == '/' || ch == '+' || ch == '-' || ch == '^';
}
Now let’s continue parsing our string in the EZParser::load() function. Since the condition on line 19 is true, we will continue the cycle and after line 8 we will have ch = ‘(‘ and from =  7. So stillCollecting(“”, ‘(‘) function on line 9 will return false.
Then the call to

    theFunctions->getValue(“exp(1+(15-2*2^3))”, 7, “”, ‘(‘) // line 17 of EZParser::load():
will return a special case, an “identity” function (line 3 of EZParserFunctions::getValue function). This function is used to process and change in place everything that is between parentheses (since they have the highest precedence).
Inside of EZParserFunctions::identity function we create a new EZParser for our substring and load it with the following call:

    sub.load(“exp(1+(15-2*2^3))””, 7, ')');
Basically that means that we are parsing “15-2*2^3))” substring. This substring now does not have any subfunctions so we’ll get a stack consisting of four Cell objects constructed via subsequent calls to the lines 20 and 21of the EZParser::load() function. The stack will contain these four elements:
Cell (15, ‘-‘), Cell(2, ‘*’), Cell(2, ‘^’), Cell(3, ‘)’).
Then the EZParser::calculate() function will be called from the line 26 of EZParser::load():

double EZParser::calculate()
{
        if (stack.size() == 0)
        {
                throw EZParser::CalculationException("Empty stack for calculation");
        }
        Cell base = stack.at(0);
        size_t index = 1;

        while(index < stack.size())
        {
                Cell next = stack.at(index++);
                if (!(base.sign == '^' || ((base.sign == '*' || base.sign == '/') && next.sign != '^') ||
                        next.sign == '+' || next.sign == '-' || next.sign == ')'))
                {
                        calculateNext(base, next, index);
                }

                mergeCells(base, next);
        }
        return base.value;
}
The main logic of this function is on lines 10-11. If we cannot perform a mathematical operation right away, i.e. if the second sign has a higher precedence than the current one, (as it is in our case when working with the first two cells, Cell (15, ‘-‘), Cell(2, ‘*’), where ‘*’ has a higher precedence as ‘-‘), we need to move on and try to merge the next two cells using the following function:

void EZParser::calculateNext(Cell& base, Cell& current, size_t& index)
{
        if (index >= stack.size())
        {
                return;
        }

        size_t tmp = index;
        Cell next = stack.at(index++);

        if (!(base.sign == '^' || ((base.sign == '*' || base.sign == '/') && next.sign != '^') ||
                next.sign == '+' || next.sign == '-' || next.sign == ')'))
        {
                calculateNext(current, next, index);
        }
        mergeCells(current, next);
}
So we won’t be able to merge cells when working with cells Cell (15, ‘-‘) and Cell(2, ‘*’) and with Cell(2, ‘*’) and  Cell(2, ‘^’). First time the function mergeCells will be called is when merging the last two cells: Cell(2, ‘^’), Cell(3, ‘)’). The mergeCells function is doing actual computation:

void EZParser::mergeCells(Cell& base, Cell& next)
{
        switch(base.sign)
        {
                case '^' : base.value = ::pow(base.value, next.value);
                        break;
                case '*' : base.value *= next.value;
                        break;
                case '/' :
                        if (next.value == 0)
                        {
                                throw CalculationException("Division by zero");
                        }
                        base.value /= next.value;
                        break;
                case '+' : base.value += next.value;
                        break;
                case '-' : base.value -= next.value;
                        break;
        }
        base.sign = next.sign;
}
The result of the calculation will be saved in the “base” cell – note that the “sign” of the base cell is updated after merging cells.
Basically that’s it. Now we will be “unwinding” the stack: calling mergeCells() with Cell(2, ‘^’) and Cell(3, ‘)’) will return the base Cell(8, ‘)’), then calling mergeCells() with Cell(2, ‘*’) and Cell(8, ‘)’) will return the base Cell(16, ‘)’), and finally calling mergeCells() with Cell (15, ‘-‘) and Cell(16, ‘)’) will return the base Cell(-1, ‘)’).

So our call to sub.load(“exp(1+(15-2*2^3))””, 7, ')'); from the EZParserFunctions::identity function will return -1.
Now we have just two cells in our stack: remember than we already have Cell(1, ‘+’) on the stack which combined with our resulting Cell(-1, ‘)’) will give us 1 -1 = 0.
That means that we found our argument for the exp function (line 2 of EZParserFunctions::exp function). So the final result will be ::exp(0) = 1.

Finally here is a simple main function to test the parser (the parsing string is the last argument to this function when calling it from the command line).

int main(int argc, char* argv[])
{
    string str = argv[argc - 1];
    try
    {
            EZParser ts;
                double res = ts.calculate(str);
                cout << "Got result: " << str << " -> " << res << endl;
        }
        catch(EZParser::CalculationException& e)
        {
                cout << "Calculation Exception for " << str << ": " << e.what() << endl;
        }
        catch(std::exception& e)
        {
                cout << "Standard Exception for " << str << ": " << e.what() << endl;
        }
        return 0;
}

The whole file is located here and is tested using a GNU compiler on Linux.

 

Conclusions


There are several things that should be considered in the above presented algorithm.
  • After getting a token first we checked that the token corresponds to a function name and if not we check that the token can be parsed to a number. Maybe the order should be switched if it is known in advance that the string contains mostly numbers.
  • There is a separate calculation of opened and closed brackets and removal of blanks from the passed string (function EZParser::preprocess). These operations can be combined inside of the EZParser::load() procedure for more efficient processing. But the disadvantage of this is that there will be a slight performance loss for strings that are known to be “OK” and don’t require pre-processing.
  • Currently adding a new functionality (e.g. new functions, constants, differentiation/integration, etc.) requires either supplying a new class derived from EZParserFunctions and overriding getValue() function or modifying EZParserFunctions: adding a new function definition and a new function mapping. This part could be made easier by supplying a new convenience function or idiom.

1 comment:

  1. Nice article!
    When implementing C++ and Java versions of a mathematical expression parser, I have found the following Wiki pages to be excellent and sufficiently thorough:

    http://en.wikipedia.org/wiki/Shunting-yard_algorithm
    http://en.wikipedia.org/wiki/Reverse_Polish_notation

    So long as you are reasonably familiar with the use of stacks and queues, the algorithms described on these Wiki pages are fairly straightforward to implement. For a C++ / Java implementation of the said algorithms for use in mathematical expression parsers, take a look at this link:

    http://www.technical-recipes.com/2011/a-mathematical-expression-parser-in-java/

    ReplyDelete