[Master’s Thesis] Multi-Channel Convolutional Neural Network for Software Defect Prediction

Our lab member Chen Lang presented his master's thesis.

Title: Multi-Channel Convolutional Neural Network for Software Defect Prediction

In software development, bugs are inevitably introduced throughout the lifecycle of a software system, and their number grows with the increasing complexity of modern software. To reduce the cost caused by software defects, Software Defect Prediction (SDP) has been studied for several decades as an effective way of detecting hidden bugs.

A typical SDP approach consists of two steps: extracting defect-related features from the source code, and designing an effective classifier to predict defects. By applying SDP techniques, software developers can narrow the part of the project that needs inspection and spot bugs more efficiently, saving time and labor costs.

In the past, SDP relied on traditional hand-crafted software metrics, which show only limited performance in predicting bugs.

To improve prediction accuracy, recent research focuses on extracting richer characteristics from the source code and proposes deep learning-based SDP approaches. However, existing methods discard parts of the source code structure, ignoring the abundant information hidden there that could help improve prediction accuracy.

In this master's thesis, we propose a novel approach named Multi-Channel Convolutional Neural Network for Software Defect Prediction (MC-CNN for SDP) that leverages more of the information hidden in the source code to improve prediction performance without being affected by the accompanying data noise. Specifically, we first extract different sub-trees from the Abstract Syntax Tree (AST) based on a set of pre-defined root nodes. These sub-trees are then serialized into sequences of tokens by layer-by-layer (level-order) traversal, and each token sequence is encoded as a series of numerical values. Next, we define each number sequence as a channel and feed all channels into the CNN model as multi-channel input. Finally, the MC-CNN model automatically learns the structural and semantic features hidden in the source code. The features generated by the MC-CNN model can then be used to perform file-level defect prediction.
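The preprocessing steps described above (sub-tree extraction, traversal-based serialization, numeric encoding, channel stacking) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions that are not taken from the thesis: it uses Python's built-in `ast` module as a stand-in parser, the root node types in `ROOT_TYPES` are a hypothetical choice, one channel is built per root type, and sequences are truncated or zero-padded to a fixed `length`. The function names are likewise made up for illustration.

```python
import ast
from collections import deque

# Hypothetical set of pre-defined root node types (an assumption for this sketch).
ROOT_TYPES = (ast.FunctionDef, ast.If, ast.For)


def extract_subtrees(tree, root_types=ROOT_TYPES):
    """Return every AST node whose type is one of the given root types."""
    return [node for node in ast.walk(tree) if isinstance(node, root_types)]


def serialize_level_order(root):
    """Serialize a sub-tree into node-type tokens, layer by layer (breadth-first)."""
    tokens, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        tokens.append(type(node).__name__)
        queue.extend(ast.iter_child_nodes(node))
    return tokens


def file_to_channels(source, vocab, length=16):
    """Encode one source file as a list of fixed-length integer channels.

    Assumption: one channel per root type. `vocab` maps tokens to integer
    ids starting at 1 and is extended on the fly; 0 is reserved for padding.
    """
    tree = ast.parse(source)
    channels = []
    for root_type in ROOT_TYPES:
        tokens = []
        for subtree in extract_subtrees(tree, (root_type,)):
            tokens.extend(serialize_level_order(subtree))
        ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
        ids = ids[:length]  # truncate long sequences
        channels.append(ids + [0] * (length - len(ids)))  # pad short ones
    return channels


if __name__ == "__main__":
    vocab = {}
    sample = "def f(x):\n    if x:\n        return 1\n    return 0\n"
    for channel in file_to_channels(sample, vocab):
        print(channel)
```

The resulting list of equal-length integer sequences has the shape (channels, length) and could be passed through an embedding layer and a one-dimensional CNN. How the thesis actually groups sequences into channels may differ from this sketch.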

We evaluate the performance of our proposed method using seven projects from the PROMISE dataset. Measured by F1 score, our approach improves prediction performance compared with state-of-the-art methods. Furthermore, we demonstrate the effectiveness of multi-channel CNN compared with single-channel CNN through a pair of contrast experiments.
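For reference, the F1 score used in the evaluation is the harmonic mean of precision and recall. A minimal sketch of the computation for file-level binary labels, assuming 1 marks a defective file and 0 a clean one (the function name and label encoding are illustrative assumptions, not taken from the thesis):

```python
def f1_score(y_true, y_pred):
    """Compute F1 for binary labels, where 1 = defective and 0 = clean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct defect predictions; avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Defective files are usually the minority class in defect datasets, so F1 is often preferred over plain accuracy, which a classifier could inflate by predicting every file as clean.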