[Master Thesis] Untangling Composite Commits Using Tree-based Convolution Neural Network

The master’s thesis defense of our lab member Cong Li will be held on February 1st.

Title: Untangling Composite Commits Using Tree-based Convolution Neural Network

Developers often bundle unrelated changes into a single commit, creating a so-called composite commit. Composite commits are problematic because they make code review, reversion, and integration harder. Moreover, composite commits hinder research on code history mining, because most mining software repositories approaches are designed under the assumption that every commit contains only related changes for a single task.

Various commit untangling approaches have been proposed over the last decade, using different features to untangle a composite commit into multiple single-task commits. However, they all ignore the rich structural information contained in the code.

In this master's thesis, we propose an approach that makes full use of the structural information hidden in source code to untangle a composite commit into multiple single-task commits. Specifically, we first extract the Abstract Syntax Tree (AST) corresponding to each changed code fragment in the commit. Then we use four embedding methods to embed each node of the AST. Next, we use a Tree-based Convolution Neural Network (Tree-based CNN) to predict the relationship between code fragments by capturing both the structural information of a code fragment from its AST and the lexical information from code identifiers. Finally, we cluster the code fragments in the commit according to the output of the Tree-based CNN.
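
As a rough illustration of the final clustering step, the sketch below groups code fragments given pairwise relationship scores. It is only a minimal sketch under stated assumptions: the announcement does not say which clustering algorithm the thesis uses, so connected-component grouping with union-find is swapped in here, and the names `cluster_fragments`, `related_score`, the fragment IDs, the scores, and the 0.5 threshold are all hypothetical. In the thesis, the scores would come from the Tree-based CNN.

```python
from itertools import combinations


def cluster_fragments(fragments, related_score, threshold=0.5):
    """Group fragments connected by pairwise scores >= threshold.

    Assumption: connected components via union-find stand in for the
    clustering method of the thesis, which is not specified here.
    """
    parent = {f: f for f in fragments}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair that is predicted to belong to the same task.
    for a, b in combinations(fragments, 2):
        if related_score(a, b) >= threshold:
            parent[find(a)] = find(b)

    # Collect fragments by their root representative.
    groups = {}
    for f in fragments:
        groups.setdefault(find(f), []).append(f)
    return list(groups.values())


# Hypothetical pairwise scores standing in for the Tree-based CNN output.
scores = {
    frozenset(("f1", "f2")): 0.9,
    frozenset(("f2", "f3")): 0.8,
    frozenset(("f4", "f5")): 0.7,
}


def lookup(a, b):
    # Pairs without a recorded score are treated as unrelated.
    return scores.get(frozenset((a, b)), 0.0)


clusters = cluster_fragments(["f1", "f2", "f3", "f4", "f5"], lookup)
```

With these made-up scores, fragments f1, f2, and f3 end up in one group and f4 and f5 in another, so the composite commit would be split into two single-task commits.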

We conducted experiments on four Java projects to evaluate the relationship prediction and clustering performance of our approach, and compared it with the state-of-the-art approach. The results show that our approach improves both prediction and clustering performance over the state-of-the-art method. We also show that the embedding method we propose achieves the best results among the four methods, and that structural information is useful for commit untangling.