GSoC @ LLVM 2018

Logo

View the Project on GitHub rnkovacs/gsoc2018

A checker for dangling string pointers in C++

Google Summer of Code 2018
@ The LLVM Compiler Infrastructure / Clang Static Analyzer

Réka Kovács
Eötvös Loránd University, Hungary
commits on GitHub / Phabricator revisions
rekanikolett at gmail dot com

Motivation

The C++ std::string class provides methods that return a raw pointer to a string’s inner character buffer. When a std::string is destroyed, the character buffer is deallocated. A common bug is to use such a raw pointer after string deallocation, which may lead to crashes or other unexpected behavior. A real-world example:

return std::to_string(size).c_str();

std::to_string() returns a string by value, a temporary object that will be destroyed at the end of the full-expression, before the return statement is executed. The caller will therefore receive a pointer to an already deallocated character buffer.

Apart from being destroyed, many other operations on the string can cause the inner buffer to be reallocated (most non-const member functions, and STL functions that take a string argument by a non-const reference), leading to similar problems. The goal of this project was to add a new checker to the Clang Static Analyzer to find when a dangling inner string pointer is used.

Even though the bug shown above might have been found by path-insensitive analysis (“c_str() is called on a temporary object expression”), the checker itself is path-sensitive for a number of reasons. One reason is that avoiding false positive warnings with path-insensitive analysis rapidly becomes more and more complicated.

For example,

strcpy(dest, string(...).c_str());

is valid because the destructor of string(...) is invoked after strcpy(), but avoiding a report here would require additional reasoning like “… and the result isn’t consumed before the end of the full-expression”. Another argument is that path-sensitive analysis enables us to catch errors like:

const char *foo(bool cond) {
    std::string s;
    if (cond)
        return s.c_str();
    else
        //...
}

Here, the local variable s is destroyed by an automatic destructor at the end of the function, but we are still able to find the bug.

There is another family of use-after-free errors involving strings that the summer project unfortunately haven’t covered, the incorrect usage of std::string_views. A string_view is an object that refers to the character array inside of an existing string but does not own it - it is the programmer’s responsibility to ensure that it does not outlive the character sequence it points to. More information about how the string_view-checking functionality could be implemented in the analyzer can be found in the “Future work” section.

Work outline

The basic idea of the checker is to keep record of raw pointers referring to the inner buffer of a string container (obtained by a c_str() or data() call) in the program state. The task is to recognize if any of the tracked pointers is used after a potentially invalidating operation.

Because this is a kind of a use-after-free problem, much of the functionality we wish to have has already been implemented in MallocChecker. MallocChecker finds general cases of use-after-free, double-free, and similar issues by holding information about symbols referring to memory returned by allocation functions (such as malloc(), alloca(), the C++ new expression, etc.) in its own data structure in the program state. Under symbols we mean special, fixed “values” in the analyzer that are unknown at compile time, but can be tracked to see how they are used, no matter where they are stored in the program. E.g. the value of a variable that would be read in from the standard input during normal execution is represented by a symbol during the analysis. (A great source of further information on how the analyzer represents values is the Checker Developer Manual.)

But the internals of containers like std::string are hard for MallocChecker to understand. An API-specific checker can however figure out exactly what to track in case of a particular container object, and it can then make good use of MallocChecker’s information about whether the memory associated with a symbol has been released. MallocChecker also provides a bug reporter visitor that attaches useful notes to the bug path leading to the use-after-free warning.

Based on this, the task can be broken down into two components:

  1. Implement a new checker (eventually named InnerPointerChecker) that tracks raw inner pointers of strings, and recognizes if any operation is made on the string that may re- or deallocate the inner buffer, invalidating the pointer. In such cases, “hand over” the corrupted pointer symbol to MallocChecker in a “released” state.

  2. Extend MallocChecker to recognize these special, inherently “released” symbols originating from InnerPointerChecker, so that it can issue an appropriate warning message and appropriate diagnostic notes along the bug path.

A more detailed progress description follows below, with links pointing to differential revisions on Phabricator. For those who prefer to browse the commits themselves, here is a commit list on GitHub.

Implementation steps

To sum up, in its present state, the checker is functional, enabled by default, and ready to be used on real-world projects. Any feedback is much appreciated!

Future work

There are a handful of possible improvements that could enhance the checker:

Evaluation

The checker has been evaluated on a number of open-source software projects (together with the libraries they depend on), most of which appear to use strings extensively: Bitcoin, Ceph, Harfbuzz, ICU, LibreOffice, LLVM, qBittorrent.

In its present form, the checker produces a very low number of reports. Among the analysis results of the above listed projects, no false positive findings have been observed, meaning that all reports seemed to be real bugs and have been reported to the respective developer communities:

Project maintainers responded extremely swiftly, and all of these bugs were fixed within a day!

How to use

Because all the development was made on the master branch, all you need to use the checker is a fresh copy of clang. The command to run it on a single file (together with the default set of checks):

$ clang --analyze file.cpp

Alternatively, it can be enabled explicitly:

$ clang -cc1 -analyze -analyzer-checker=cplusplus.InnerPointer file.cpp

It can also be used with a fresh copy of clang-tidy:

$ clang-tidy -checks=clang-analyzer-cplusplus.InnerPointer file.cpp --

When analyzing a project with scan-build, no further effort is needed, as the checker is turned on by default.

Acknowledgement

I would like to express my special thanks to Artem Dergachev and Gábor Horváth for being the awesome mentors they were. With their guidance, patient explanations and quick responses even during the weekends, I feel like I’ve learnt an immense amount during the summer. I am also super grateful to George Karpenkov and Devin Coughlin for their tips and comments on the patches.