CS 23001 Computer Science II: Data Structures & Abstraction
Project #4

Fall 2024

Objectives:
Problem:
Build a program profiler. Construct a program to instrument C++ source code to support program profiling.

It is often important to determine how many times a function or statement is executed. This is useful not only for debugging but for determining what parts of a program may need to be optimized. This process is called profiling. That is, a execution profile presents how many times each part of a program is executed using a given set of input data (or for some run time scenario). To compute a profile, specialized statements need to be inserted in to the code that keep track of how many times a function or statement is executed. The process of inserting these statements is called instrumenting the code.

To implement a profiler one must first parse the source code and generate an Abstract Syntax Tree (AST) of the code. Each node of the AST describes the syntactic category of the code stored within it (function, statement, while-statement, etc.). So at the top level is a syntactic category corresponding to a program, class, or function (such as in the case of a main). Under that are sub-trees that further detail the syntactic categories of each part of the code. Such things as declarations, parameter lists, while-statement, and expression statements will describe the various parts of the program.

After the AST is generated it can then be traversed and the appropriate syntactic structures can be found that need to be instrumented. Once a construct is found, say a function, new code can be inserted that keeps track of how many times that function is executed.

The most difficult part of constructing a profiler is correctly parsing the source code. Unfortunately, C++ is notoriously difficult to parse. So here we will use a parsing tool called srcML. This tool reads in C++ code and marks up the code with XML tags (e.g., block, if, while, condition, name, etc). That is, the output is an AST in XML. The XML representation is called srcML (source code markup language).

Given the code example:
    void foo(int x) {
        ++x;
    }
The following is srcML 1.0 with tags. It is indented for readability and to see the tree structurue. The function has 4 children, the function type, the name of the function, the parameter_list, and the block. Each <tag> has an end-tag </tag>, as in HTML.
<function>
    <type><name>void</name></type>
    <name>foo</name>
    <parameter_list>
        (
        <parameter><
            decl><type>
                <name>int</name>
                </type> <name>x</name>
            </decl>
        </parameter>
        )
    </parameter_list>
    <block>
        {<block_content>
        <expr_stmt>
            <expr>
                <operator>++</operator>
                <name>x</name>
            </expr>
            ;
        </expr_stmt>
        </block_content>}
    </block>
</function>
A number of srcML data files are provided for the project. However, you can use your own program as input. srcML is installed on wasp and hornet. To generate the srcML file for your own code use the following:
srcml main.cpp -o main.cpp.xml

You can also generate srcML for small snippets of code:
srcml --text "if (a == 10) { i=i+1; }" --language C++

Use the following for a list of all options:
srcml --help

More information about srcML can be found at www.srcML.org including a list of all the tag names (see Documentation). srcML can also be downloaded/installed locally on your machine.

The main srcML tag names needed for the project: The program needs to read in srcML files (in XML) and build an internal representation of the AST as a tree data structure. When using XML this internal representation is typically called a Document Object Model (DOM). A set of classes defining the data structure (DOM) is in shared (ASTree.hpp and ASTree.cpp). This part of the project is partially implemented (i.e., the read and write work). The given code compiles and runs. The program currently is able to read in a srcML file and construct the internal tree representation. It can also write the code back out (minus the XML) from the internal tree representation. The task is to complete the implementation and build the instrumenter for the profiler.

The specification for the profiler is as follows: In short, the profiler program will read in source code and write out source code that has been instrumented. Then compile and run the instrumented version of the code to get the profile information. Below is a full example of the tool chain. First convert main.cpp into srcML. Then run the profiler on that XML file. This will produce a instrumented version of main.cpp called p-main.cpp. Finally, compile and run this program.

srcml simple.cpp -o simple.cpp.xml
./profiler simple.cpp.xml
clang++ -Wall p-simple.cpp -c
clang++ -Wall profile.cpp -c
clang++ -Wall p-simple.o profile.o -o p-simple
./p-simple
 
2
Done
simple_cpp:
Line  Executed
14 search 1
19 1
28 main 1
32 1
33 1
35 1
36 1  


Requirements:

URL: https://data-structures.cs.kent.edu/projects/proj4.html
Last update: EST