Accessing GCC’s Abstract Syntax Tree with a GCC Plugin

GCC provides a plugin system to extend GCC, in our use case we can access the abstract syntax tree. To do this, we need to create a plugin and compile it as shared library. This shared library can then be passed to gcc for a compilation unit. and pass it to GCC when compiling a file.

You can find the full example here on GitHub. I'll use always Docker for working with GCC plugins, so if you want to play around with this example, at best, use this Docker image from this repo too.

In GCC's plugin directory we can find the header files for its API. To Print GCC's plugin directory run:

$ gcc -print-file-name=plugin
/usr/local/lib/gcc/x86_64-linux-gnu/13.3.0/plugin

In this entire example we'll use GCC 13.3.0 from GCC's Docker image. In my case there was GMP (a library which the plugin needs from GCC) was missing so I added the package repositories and the additional package to the Dockerfile.

1. First Plugin

Let us now create the first plugin. To do so we define plugin_init which is the entry point for the plugin:

// we have to specify plugin_is_GPL_compatible for every plugin
// if it is not defined a compilation error will follow later.
int plugin_is_GPL_compatible;

int plugin_init (struct plugin_name_args *plugin_info,
        struct plugin_gcc_version *version)
{
  std::cout << "******** hello plugin ********* \n";
  return 0;
}

Compile this to a shared library:

$ g++ ./src/1_hello_plugin/plugin.cpp -I`gcc -print-file-name=plugin`/include -fPIC -shared -o plugin.so

And when compiling a source, apply the just created plugin with -fplugin=..

$ g++ -fplugin=./plugin.so -c ./src/1_hello_plugin/src.cpp
******** hello plugin *********

Was not too difficult. So let us continue to access something from the compilation unit.

2. Declarations

This brings us to the first callback event. There are several stages in the GCC compilation process. Now we register a callback after each declaration:

int plugin_is_GPL_compatible;

static void declaration_callback(void* gcc_data, void* user_data);

int plugin_init(struct plugin_name_args* plugin_info,
                struct plugin_gcc_version* version)
{
  // GCC offers this function to register a callback
  // the second argument is the EVENT when to invoke the callback
  // in this case we use PLUGIN_FINISH_DECL
  // declaration_callback is the function which gets called. 
  register_callback(plugin_info->base_name, PLUGIN_FINISH_DECL,
                    declaration_callback, NULL);
  return 0;

And now we can access the data in our callback. We'll only need the first argument gcc_data. In case we need some specific data we can also pass user_data to the callback. This is the fourth argument in register_callback, this will always be NULL in our examples.

Let's see the first example to access a variable name:

static void declaration_callback(void* gcc_data, void* user_data)
{
  std::cout << "Found a declaration:\n";
  // this c-style cast is apparently the way to go (as the internet suggests)
  tree node = (tree)gcc_data;
  // every tree node is part of a specific class in this case we're only interested in declarations
  // TREE_CODE_CLASS expects the tree code from our current node which we can access with TREE_CODE
  // if unclear, this will get better later. 
  if (TREE_CODE_CLASS(TREE_CODE(node)) != tcc_declaration)
  {
    std::cerr << "Tree node (" << TREE_CODE(node) << ") '" << get_tree_code_name(TREE_CODE(node)) << "' is no declaration.\n";
    return;
  }
  // now we can print the name, see the print_name below
  print_name(node);
  std::cout << '\n';
}


static void print_name(tree node)
{
  // we access the declaration name and check if it is not null
  // and working with the given defines in GCC's is how it works almost all the time
  tree name = DECL_NAME(node);
  if (!name)
  {
    return;
  }
  // and then we use the following define to access its name
  std::cout << IDENTIFIER_POINTER(name);
}

And get used to all the macros, this is the way as I found out. But anyway, if we now compile this into a shared library and apply the plugin to a compilation unit, we will be able to see the names of the variables from the compilation unit.

First we compile the plugin:

$ g++ ./src/2_declaration/plugin.cpp -I`gcc -print-file-name=plugin`/include -fPIC -shared -o plugin.so

Then we have this simple cpp file with a single global variable:

int some_variable = 1;

And when we compile this file, we see during compilation this output:

$ g++ -fplugin=./plugin.so -c ./src/2_declaration/src.cpp 
Found a declaration:
some_variable

Type, Scopes, Qualifiers and Initial Values

As the subchapter suggests, it would be quite interesting to get the type, scope, qualifiers, and initial values as well. So let us start with printing the type with its qualifiers:

static void print_qualifiers(tree node)
{
  // for the qualifiers we use a bitwise and to check if there 
  // qualifiers, for our example we'll check for const here 
  if (TYPE_QUALS(type) & TYPE_QUAL_CONST)  
  {
    std::cout << "const ";
  }
  // ... 
}
static void print_type(tree node)
{
  print_qualifiers(type);

  switch (TREE_CODE(type))
  {
    case INTEGER_TYPE:
    {
      if (TYPE_UNSIGNED(node))
      {
        std::cout << "unsigned ";
      }
      print_integer(type);
    }
    break;
  // and all the others, floating points, references, 
  // pointers, structs, arrays, function pointers, ... 
  }
}

// and to be precise, we need to know which integer type it is: 
static void print_integer(tree integer)
{
  // so depending on its precision 
  // we know what integer type it is
  switch (TYPE_PRECISION(integer))
  {
    case 8:
      std::cout << "char";
      break;
    case 16:
      std::cout << "short";
      break;
    case 32:
      std::cout << "int";
      break;
    case 64:
      std::cout << "long long";
      break;
    default:
      std::cout << "unknown integer type with precision: "
                << TYPE_PRECISION(integer);
  }
}

Let us now also print the scope in which we are:

// this is probably pretty much straight forward
static void print_name_with_scope(tree node)
{
  std::string str;
  // we iterate over the node and it's declaration name and append the declaration name to the string
  while (node && DECL_NAME(node) && TREE_CODE(node) != TRANSLATION_UNIT_DECL)
  {
    str = "::" + std::string(IDENTIFIER_POINTER(DECL_NAME(node))) + str;
    node = DECL_CONTEXT(node);
  }
  // finally we print the string
  // this will print the variable name too, so I remove the former print_name and
  // we'll use this function to print the name
  std::cout << str;
}

And finally, we want to print the initial value, if there is one. This is the hardest part for now. Because every type is treated differently and with initial values, like a function call, binary operations, references, pointers, arrays, and so on and so forth, there are a lot of cases. However, let us take a look at a simple integer value:

void print_value(tree var)
{
  // if there is a nullpointer, then the value 
  // is not initialized
  if (!var)
  {
    std::cout << " uninitialized";
    return;
  }
  std::cout << ' ';
  // and now we check the tree code
  switch (TREE_CODE(var))
  {
    // integers are straight forward and we only need to use 
    // the given define 
    case INTEGER_CST:
      std::cout << TREE_INT_CST_LOW(var);
      break;
    // for floating point values we have to create a buffer and 
    // call real_to_decimal (this is a function from gcc and not from us)
    case REAL_CST:
      char buffer[64];
      real_to_decimal(buffer, &TREE_REAL_CST(var), sizeof(buffer),
                      0, 1);
      std::cout << buffer;
      break;
    // ... 
    // and a lot of other cases
    // check out the full file in the github repo
  }
}

Now we can put everything together and the callback function looks like this:

static void declaration_callback(void* gcc_data, void* user_data)
{
  tree node = (tree)gcc_data;
  // ...
  // since we call print_type and print_value
  // inside those functions too, we use TREE_TYPE
  // to access the type and DECL_INITIAL to access 
  // the initial value 
  print_type(TREE_TYPE(node));
  print_name_with_scope(node);
  print_value(DECL_INITIAL(node));
  std::cout << '\n';
}

We can compile this to our shared library and now we can compile this source:

const char* p_c = "hello world";
const char* const p_cp = "hello world";
const char* c_arr[] = {"first", "second", "third"};

int i = 99;
int& ref = i;
int const ci = 123;
const unsigned int cui = 123;

long l;
long long ll;

float fl = 1.1f;
double d = 1.12;
unsigned int ui = 123;
const double cd = 1;

int* p_i;
int i_arr[] = {1, 2, 3, 4};
const int ci_arr[] = {1, 2, 3, 4};

namespace cwt
{
  int value = 1;
  namespace nested
  {
    int other_value = 2; 
  } // namespace nested 
}

which gives then this output:

$ g++ ./src/2_declaration/plugin.cpp -I`gcc -print-file-name=plugin`/include -fPIC -shared -o plugin.so
$ g++ -fplugin=./plugin.so -c ./src/2_declaration/src.cpp 
const char * ::p_c hello world
const const char * ::p_cp hello world
const char * [] ::c_arr { [0] first, [1] second, [2] third }
int ::i 99
int & ::ref  reference to: unknown type: var_decl  i 99
const int ::ci 123
const unsigned int ::cui 123
long long ::l uninitialized
long long ::ll uninitialized
float ::fl 1.10000002384185791015625e+0
double ::d 1.12000000000000010658141036401502788066864013671875e+0
unsigned int ::ui 123
const double ::cd 1.0e+0
int * ::p_i uninitialized
int [] ::i_arr { [0] 1, [1] 2, [2] 3, [3] 4 }
const int [] ::ci_arr { [0] 1, [1] 2, [2] 3, [3] 4 }
int ::cwt::value 1
int ::cwt::nested::other_value 2

3. Initialization & Call Expressions

Now we're able to print variables, but what happens for instance in this case:

int v1 = 1;
int v2 = v1;
int func();
int v3 = func();

When we compile and apply the latest plugin we get:

$ g++ -fplugin=./plugin.so -c ./src/3_expressions/src.cpp 
int ::v1 1
int ::v2 uninitialized
int  args: (void ) ::func uninitialized
int ::v3 uninitialized

Lets go through this output line by line:

  1. v1 is initialized with 1
  2. v2 is uninitialized because v1 is not const
  3. Declaration of func() (in this case uninitialized is probably a bit missleading, because as we know we can not have an initial value ...)
  4. v3 is uninitialized because we don't know what func() returns

This means we have now to deal with expressions. For v2 and v3 we have initialization expressions and the call to func() is a call expression. So lets create an example to access expressions.

Initialize expression

We'll create a new plugin in a new directory. In src/3_expressions we'll create a new plugin now. In order to parse the initialize expression we have to register a new callback with another event:

// our new call back 
static void pre_genericize_callback(void* gcc_data, void* user_data)
{
  tree node = (tree)gcc_data;
  std::cout << "-- pre genericize callback: current node = '"
            << get_tree_code_name(TREE_CODE(node)) << "'\n";
  // and now we traverse through a given tree 
  // implementation follows below
  traverse_tree(node);
}

int plugin_init(struct plugin_name_args* plugin_info,
                struct plugin_gcc_version* version)
{
  register_callback(plugin_info->base_name, PLUGIN_FINISH_DECL,
                    declaration_callback, NULL);
  // and now we use the event PLUGIN_PRE_GENERICIZE to access an abstract syntax tree 
  register_callback(plugin_info->base_name, PLUGIN_PRE_GENERICIZE,
                    pre_genericize_callback, NULL);
  return 0;
}

And now we come to the point where GCC creates an abstract syntax tree and we need to go through it. PLUGIN_PRE_GENERICIZE is the corresponding event we need. In general, GCC has tree codes to represent nodes in the AST and tree code classes. We use them to navigate through the AST and perform our operations accordingly. There are two macros for this, TREE_CODE and TREE_CODE_CLASS, which we'll use. But don't worry, we'll start with a simple example, and to be honest here, I didn't fully understand GCC's internals, for now it's working for me. If you are more familiar with it, or if I am writing something wrong, feel free to comment below :).

So let us start to implement our first traverse through a tree. Remember, we just want to parse an initialization of v2 (because v1 is non-const):

int v1 = 1;
int v2 = v1;

Now we expect an initialization expression at this point. As our entry point for traversing the tree nodes, we implement traverse_tree:

static void traverse_tree(tree node)
{
  // we check the class of each node 
  // and continue our navigation 
  // note: in this example are not all classes implemented
  switch (TREE_CODE_CLASS(TREE_CODE(node)))
  {
    case tcc_expression:
      expression(node);
      break;
    case tcc_declaration:
      declaration(node);
      break;
    case tcc_exceptional:
      exceptional(node);
      break;
    case tcc_unary:
      unary(node);
      break;
    // we have to deal with VOID_CST
    // find an explanation for this later
    case tcc_constant:
      constant(node);
      break;
    default:
      std::cout << "Unhandled tree code class: ("
                << TREE_CODE_CLASS(TREE_CODE(node)) << ") '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
      break;
  }
}

Most tree code classes should be self-explanatory, like expressions or declarations, but some (at least to me) were not. For example, the exceptional tree class was rather confusing to me, but according to GCC's documentation, this is the tree code class when the tree code does not fit into any other category.

In our case, we're going to have an exceptional tree code. At one point we get a statement list that belongs to tcc_exceptional. But before we get to that point, we check the expression. So far I have always used the TREE_OPERAND(..) macro to access the nodes within an expression. We can refactor this into a loop later, because there is also TREE_OPERAND_LENGTH. For now, we will leave this as it is:

static void expression(tree node)
{
  switch (TREE_CODE(node))
  {
    case EXPR_STMT:
      std::cout << "-- accessing node in expression statement\n" ;
      traverse_tree(TREE_OPERAND(node, 0));
      break;
    case CLEANUP_POINT_EXPR:
      std::cout << "-- accessing node in clean up point\n"; 
      traverse_tree(TREE_OPERAND(node, 0));
      break;
    case INIT_EXPR:
    {
      tree lhs = TREE_OPERAND(node, 0);
      tree rhs = TREE_OPERAND(node, 1);

      std::cout << "-- intializing variable, lhs = " << get_tree_code_name(TREE_CODE(lhs)) << ", rhs = "
                << get_tree_code_name(TREE_CODE(rhs)) << '\n';

      traverse_tree(lhs);
      traverse_tree(rhs);
    }
    break;
    default:
      std::cout << "Unhandled expression: '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
      break;
  }
}

Some explanation to the three nodes we have here:

  • EXPR_STMT: Represents an expression statement
  • CLEANUP_POINT_EXPR: Represents a clean up point, for instance when we leave a scope an we need to destroy all local objects
  • INIT_EXPR: Initialization expression to initialize a variable, where operand 0 is the variable to initialize and operand 1 the initializer

Next is unary, here we get a convert expression for all implicit and explicit conversions (I was a bit confused because this node is created in our case where I did not see any conversion at all). But if we don't check CONVERT_EXPR we will never get to our initialization.

In this case, we use again TREE_OPERAND to access the underlying node:

static void unary(tree node)
{
  switch (TREE_CODE(node))
  {
    case CONVERT_EXPR:
      std::cout << "-- accessing tree in convert expression\n";
      traverse_tree(TREE_OPERAND(node, 0));
      break;
    default:
      std::cout << "Unhandled unary: '" << get_tree_code_name(TREE_CODE(node)) 
                << "'\n";
      break;
  }
}

Now there is exceptional. As mentioned before, here we get a statement list. GCC provides a tree statement iterator that we can use to iterate through this list:

static void exceptional(tree node)
{
  switch (TREE_CODE(node))
  {
    case STATEMENT_LIST:
    {
      std::cout << "-- iterating over statement list\n";
      for (tree_stmt_iterator tsi = tsi_start(node); !tsi_end_p(tsi);
           tsi_next(&tsi))
      {
        traverse_tree(tsi_stmt(tsi));
      }
    }
    break;
    default:
      break;
  }
}

And, we have declaration, where we now just print the variable, using the print function from before:

static void declaration(tree node)
{
  switch (TREE_CODE(node))
  {
    case FUNCTION_DECL:
    {
      std::cout << "-- traversing through function tree\n";
      traverse_tree(DECL_SAVED_TREE(node));
    }
    break;
    case VAR_DECL:
      print_declaration(node);
      std::cout << '\n';
      break;
    default:
      std::cout << "Unhandled declaration: '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
      break;
  }
}

And finally, we have one last tree code class tcc_constant. For now we have to deal with VOID_CST, a void constant. This is rather unusual, but it represents operations or expressions that do not return a value. For completeness and demonstration we add a simple print here:

static void constant(tree node)
{
  switch (TREE_CODE(node))
  {
    case VOID_CST: 
      std::cout << "-- void constant, we're done now\n";
      // nothing to do now
      break; 
    default:
      std::cout << "Unhandled declaration: '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
      break;
  }
}

Now we can compile and apply this plugin to the next compilation unit. To demonstrate this we use the source from before:

int v1 = 1;
int v2 = v2;

And this gives the following output:

int ::v1 1
int ::v2 uninitialized
-- pre genericize callback: current node = 'function_decl'
-- traversing through function tree
-- iterating over statement list
-- accessing node in clean up point
-- accessing node in expression statement
-- accessing tree in convert expression
-- intializing variable, lhs = var_decl, rhs = var_decl
::v2
::v1
-- accessing node in expression statement
-- void constant, we're done now

Note: The final void constant is behind an expression statement.

Call Expression

For the call expression we now have to do two things, first we want to note that we're on a tree node of type CALL_EXPR and then we want to print it. To print it we basically need three different things, the return type, the function name and the arguments.

Lets extend expression:

static void expression(tree node)
{
  switch (TREE_CODE(node))
  {
    // as before  
    case CALL_EXPR:
      print_call_expression(node); 
    default:
      std::cout << "Unhandled expression: '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
      break;
  }
}

During implementation, I noticed that CALL_EXPR belongs to a different tree code class than other expressions. GCC uses tcc_vl_exp for call expressions. Which is defined according to their documentation as this: "A function call or other expression with a variable-length operand vector"

For now I'll just extend traverse_tree to call expression:

static void traverse_tree(tree node)
{
  switch (TREE_CODE_CLASS(TREE_CODE(node)))
  {
    // as before 
    case tcc_vl_exp:
      expression(node);
      break;
    // ...
  }
}

And finally we need to print our call expression. We'll refactor this entire function later, right now this is good enough:

static void print_call_expression(tree node)
{
  // access the function and its name here 
  tree call_expr_fn = CALL_EXPR_FN(node);
  tree func_decl = TREE_OPERAND(call_expr_fn, 0);
  std::cout << "-- call: '" << IDENTIFIER_POINTER(DECL_NAME(func_decl))
            << "'\n";

  // with gcc's call_expr_nargs we can get the argument count of our fucntion          
  const std::size_t argc = call_expr_nargs(node);
  std::cout << "-- arg count: " << argc << '\n';
  for (int i = 0; i < call_expr_nargs(node); ++i)
  {
    // since we get a tree for each argument 
    // we just traverse through it
    traverse_tree(CALL_EXPR_ARG(node, i));
  }

  // and the returntype of our function 
  // the nested TREE_TYPE is confusing, but its intentionally 
  tree return_type = TREE_TYPE(TREE_TYPE(func_decl));
  std::cout << "-- returns: " << get_tree_code_name(TREE_CODE(return_type))
            << '\n';
}

Lets compile our plugin and apply it to following source:

int func();
int v3 = func();

Where we get this output:

-- pre genericize callback: current node = 'function_decl'
-- traversing through function tree
-- iterating over statement list
-- accessing node in clean up point
-- accessing node in expression statement
-- accessing tree in convert expression
-- intializing variable lhs = var_decl, rhs = call_expr
::v3
-- call: 'func'
-- arg count: 0
-- returns: integer_type
-- accessing node in expression statement
-- void constant, we're done now

Lets extend constant to print integer constants, so we can demonstrate some function arguments:

static void constant(tree node)
{
  switch (TREE_CODE(node))
  {
    // as before 
    case INTEGER_CST:
      std::cout << "-- integer constant: " << TREE_INT_CST_LOW(node) << '\n';
      break;
    // as before 
  }
}

And now we can apply this plugin to this source:

int func(int arg1, int arg2);
int v4 = func(99, v2);

Where we get for v4:

-- accessing node in expression statement
-- accessing tree in convert expression
-- intializing variable lhs = var_decl, rhs = call_expr
::v4
-- call: 'func'
-- arg count: 2
-- integer constant: 99
::v2
-- returns: integer_type
-- accessing node in expression statement
-- void constant, we're done now

Conclusion

So, here we are. There are several places to go if you want to access the abstract syntax tree of GCC.

Functions and Control Flow

First, you could continue to implement as we did, then a possible next step would be to access function bodies. To do this you will find BIND_EXPR and with extending expression you can traverse its tree:

static void expression(tree node)
{
  switch (TREE_CODE(node))
    // as before 
    case BIND_EXPR:
    {
      std::cout << "-- bind expression variables:\n";
      for (tree var = BIND_EXPR_VARS(node); var; var = TREE_CHAIN(var))
      {
        traverse_tree(var);
      }
      std::cout << "-- bind expression body:\n";
      traverse_tree(BIND_EXPR_BODY(node));
    }
    break;
    // ... 
    }
}

There you have all the local variables and the function body separately available. And when we get to statements and control flow, you could implement statement:

static void statement(tree node)
{
  switch (TREE_CODE(node))
  {
    case DECL_EXPR:
      traverse_tree(DECL_EXPR_DECL(node));
      break;
    case IF_STMT:
    {
      std::cout << "-- if stmt condition: \n";
      traverse_tree(COND_EXPR_COND(node));
      std::cout << "-- if stmt then: \n";
      traverse_tree(COND_EXPR_THEN(node));
    }
    break;
    case RETURN_EXPR:
      traverse_tree(TREE_OPERAND(node, 0));
      break;
    default:
      std::cout << "Unhandled statement: '"
                << get_tree_code_name(TREE_CODE(node)) << "'\n";
  }
}

walk_tree

Second, I found walk_tree, which appears to be a visitor pattern for traversing the tree, but I found little to no documentation about it. This may be another way to traverse the AST.

Gimple Representation

Third, there is Gimple, which is the next representation of a source after the AST. It is more simplified than the AST, and I just played around with a kind of "hello world", which you can find in the corresponding directory ./src/4_gimple in the repo.

So a lot to do, for me it was quite a success, because in the beginning it didn't look as promising as it does now.

I hope it helped you too, until the next one.

Cheers Thomas

Next
Next

How to Write a C++ Test Framework