Improving the Usability of Static Code Analysis Tools with Machine Learning

Ugur Koc

Static code analysis (SCA) tools often generate a large number of false warnings. These spurious warnings reduce the practical value of SCA tools in industry. Our goal is to solve this problem by using machine learning techniques to substantially reduce the number of false warnings developers encounter.

Project status: Under Development

Artificial Intelligence

Intel Technologies
Intel Opt ML/DL Framework


Overview / Usage

Static code analysis tools promise to improve software development by quickly identifying flaws that might jeopardize application security, performance, and integrity. Unfortunately, SCA tools are known to generate very large numbers of false positives (FPs), which must be resolved manually. For example, running Facebook Infer on a simple C program (Toybox), we received 48 error reports, only 6 of which pointed to real problems (an 87.5% false positive rate). Simplifying greatly, this happens because of the inevitable approximations and assumptions made in implementing and scaling these tools, and it ultimately leads some developers to shy away from SCA tools.
Our research addresses this problem by using machine learning (ML) to predict and filter out FPs so that developers can focus on true error reports. Existing ML-based approaches generally learn from features of the SCA reports and black-box observations about the programs being analyzed. Despite some success, these approaches suffer from fundamental limitations. First, the feature identification process is manual and time-consuming, and it requires significant expertise in program analysis and other related domains (e.g., security). Second, the extracted black-box features are often inadequate for capturing the root causes of FPs, which are often embedded in the code and its execution logic. Third, the extracted features do not generalize well; features defined for one kind of SCA analysis will likely be inapplicable to other kinds of SCA analyses.
To address these limitations, we developed a neural network-based learning approach that learns features from the analyzed code itself. First, we convert the code into a sequence of tokens by applying a set of transformations that summarize, abstract, and tokenize the code (the preprocessing step). Then, using an LSTM, we train SCA classification models on the resulting sequences (the learning step). This approach is promising because it focuses on the code's structure and requires neither pre-identified features nor analysis-domain expertise. Our initial experimental results bear this out. In a case study of the Java FindSecBugs checker, using the OWASP Benchmark as the dataset, we achieved 90% classification accuracy and filtered out 81% of FPs while misclassifying only 2.7% of true positives. Still, several improvements are needed to apply this approach to arbitrary real-world code. For example, converting programs into token sequences loses information about code semantics, and some FPs can be misclassified because of that loss.
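As a rough illustration of the summarize/abstract/tokenize pipeline described above, the sketch below tokenizes a Java-like statement, abstracts program-specific identifiers and literals while preserving API terms, and maps the tokens to integer ids suitable as LSTM input. The keyword and API vocabularies and the VAR/NUM placeholders are hypothetical simplifications, not our actual transformation set.

```python
import re

# Hypothetical, heavily simplified vocabularies for illustration only.
JAVA_KEYWORDS = {"int", "if", "return", "String", "new", "void"}
API_TERMS = {"getParameter", "executeQuery", "createStatement"}

def tokenize(code):
    """Split code into a flat token sequence (a toy lexer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def abstract(tokens):
    """Replace program-specific identifiers and literals with generic
    placeholders while keeping keywords and API-related terms."""
    mapping, out = {}, []
    for tok in tokens:
        if tok in JAVA_KEYWORDS or tok in API_TERMS:
            out.append(tok)                      # keep as-is
        elif tok.isidentifier():                 # program-specific name
            mapping.setdefault(tok, f"VAR{len(mapping)}")
            out.append(mapping[tok])
        elif tok.isdigit():                      # constant literal
            out.append("NUM")
        else:                                    # punctuation, operators
            out.append(tok)
    return out

def encode(tokens, vocab):
    """Map tokens to integer ids; unseen tokens become 0 (UNK)."""
    return [vocab.get(tok, 0) for tok in tokens]

tokens = abstract(tokenize("int userCount = rs.executeQuery ( 42 ) ;"))
vocab = {tok: i + 1 for i, tok in enumerate(sorted(set(tokens)))}
ids = encode(tokens, vocab)
```

The resulting integer sequences would then be padded and fed to an embedding layer followed by an LSTM classifier.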
We will improve the preprocessing step and learning model and will compose these steps into a semi-supervised online system which can be used as a post-validator for the SCA findings in the development workflow of companies.

Methodology / Approach

We will improve the preprocessing step and learning model and will compose these steps into a semi-supervised online system which can be used as a post-validator for the SCA findings in the Facebook ecosystem.
The Preprocessing Step

Our current approach uses the WALA program analysis framework to compute a lightweight backward slice of the program, starting from the error location called out in a given FP report. After computing the slice, we perform summarization, abstraction, and tokenization to remove information from the slice that is likely irrelevant to the report or likely to cause overfitting. Our goal is to develop a customizable code transformation tool consisting of well-defined transformations for preprocessing programs for a given type of SCA task (e.g., a kind of program analysis). Given a program as input, the tool will apply the enabled transformations to compute a summary of the program that can be used to train neural networks for the SCA task at hand.
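The core idea of backward slicing can be sketched independently of WALA: a backward slice is essentially the set of statements that the reported error location transitively depends on, found by traversing dependence edges in reverse. The toy dependence graph and statement ids below are hypothetical, not a real WALA API.

```python
from collections import deque

# Toy program dependence graph: each statement maps to the statements
# it depends on (data or control). Node ids are made up for illustration.
deps = {
    "s4_report_site": ["s2", "s3"],
    "s3": ["s1"],
    "s2": ["s1"],
    "s1": [],
    "s5_unrelated": [],
}

def backward_slice(graph, criterion):
    """Collect every statement the slicing criterion transitively
    depends on (a simple BFS over dependence edges)."""
    seen = {criterion}
    work = deque([criterion])
    while work:
        node = work.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return seen

slice_ = backward_slice(deps, "s4_report_site")
# "s5_unrelated" is dropped: it cannot influence the reported error site.
```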
Summarizing the code for the learning step increases performance and generalizability because most program statements are irrelevant to a given finding. Summarization also helps with scalability: 1) large program size degrades learning performance, and 2) irrelevant statements act as noise that jeopardizes both the performance and generalizability of learning. In addition to the program slicing technique we already use, we will design alternative transformations that further reduce input programs by removing additional irrelevant statements. Abstracting certain components of code is important for the generalizability of the approach. For instance, developers often choose arbitrary characters as identifiers. We will develop transformations that abstract away arbitrary program-specific identifiers and constant literals while keeping API-related terms, which often provide more generalizable information. Tokenizing the code allows it to be used as input to the neural networks. We will explore various levels of granularity in the tokenization process. We expect this will have a significant impact on the performance of learning.
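To illustrate the granularity choices mentioned above, the sketch below tokenizes the same identifier at word, subword (camel-case split), and character level; coarser tokens preserve more context while finer tokens shrink the vocabulary. The splitting regex is a common heuristic used here for illustration, not our final design.

```python
import re

def word_token(identifier):
    """Coarsest granularity: the whole identifier is one token."""
    return [identifier]

def subword_tokens(identifier):
    """Split camelCase, ALL-CAPS runs, and digit runs into subtokens."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)

def char_tokens(identifier):
    """Finest granularity: one token per character."""
    return list(identifier)

name = "getParameterNames"
```

For example, `subword_tokens` would map `getParameterNames` to `get`, `Parameter`, `Names`, letting the model relate identifiers that share subwords.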
The Learning Step

The learning step builds classifiers (or rankers) using the data produced by the preprocessing step.
Pre-training. Acquiring a large ground-truth dataset on which to train is important but difficult. To overcome this problem, NLP applications use pre-training on unlabeled data to learn vector representations of words called “embeddings”. One important benefit is that these embeddings capture more generic aspects of words, so classifiers that use them can generalize better as well. We will perform similar pre-training, learning embeddings for code components from large, unlabeled datasets of programs gathered from open-source repositories.

Learning on Graphs. We currently use sequential representations of programs, which have fundamental limitations in preserving a program’s structural and semantic information. To better capture and learn from this information, we will use graph representations of programs (e.g., the program dependence graph) as input to graph neural networks (GNNs). In particular, we will experiment with gated and convolutional GNNs. To use these architectures for our problem, we will create latent vector representations for the compound nodes of the program graphs using the word embeddings learned during pre-training.
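As a sketch of the pre-training data preparation, the snippet below generates skip-gram (target, context) pairs from an unlabeled token stream; in practice such pairs would feed a word2vec-style model that learns the embeddings. The toy corpus and window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with its neighbors at most `window` positions
    away, producing (target, context) training examples."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Toy unlabeled "program", already preprocessed into abstract tokens.
corpus = ["String", "VAR0", "=", "VAR1", ".", "getParameter", "(", "NUM", ")"]
pairs = skipgram_pairs(corpus, window=2)
```

Each pair is one training example for the embedding model; the learned vectors would later initialize the token representations used by the LSTM or the GNN node encoder.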

Technologies Used

Long short-term memory (LSTM) networks implemented in TensorFlow and Theano
Gated Graph Neural Networks (GGNNs) implemented in TensorFlow
WALA/Joana for preprocessing Java programs in the corpus

Repository

https://bitbucket.org/ugur_koc/mangrove/src/master/
