2022 Theses Doctoral
Learning to Edit Code : Towards Building General Purpose Models for Source Code Editing
The way software developers edit code day-to-day tends to be repetitive, often using existing code elements. Many researchers have tried to automate the repetitive code editing process by mining specific change templates. However, such templates are often manually implemented for automated applications. Consequently, such template-based automated code editing is very tedious to implement. In addition, template-based code editing is often narrowly-scoped and low noise tolerant. Machine Learning, specially deep learning-based techniques, could help us solve these problems because of their generalization and noise tolerance capacities.
The advancement of deep neural networks and the availability of vast open-source evolutionary data opens up the possibility of automatically learning those templates from the wild and applying those in the appropriate context. However, deep neural network-based modeling for code changes, and code, in general, introduces some specific problems that need specific attention from the research community. For instance, source code exhibit strictly defined syntax and semantics inherited from the properties of Programming Language (PL). In addition, source code vocabulary (possible number of tokens) can be arbitrarily large.
This dissertation formulates the problem of automated code editing as a multi-modal translation problem, where, given a piece of code, the context, and some guidance, the objective is to generate edited code. In particular, we divide the problem into two sub-problems — source code understanding and generation. We empirically show that the deep neural networks (models in general) for these problems should be aware of the PL-properties (i.e., syntax, semantics). This dissertation investigates two primary directions of endowing the models with knowledge about PL-properties — (i) explicit encoding: where we design models catering to a specific property, and (ii) implicit encoding: where we train a very-large model to learn these properties from very large corpus of source code in unsupervised ways.
With implicit encoding, we custom design the model to cater to the need for that property. As an example of such models, we developed CODIT — a tree-based neural model for syntactic correctness. We design CODIT based on the Context Free Grammar of the programming language. Instead of generating source code, CODIT first generates the tree structure by sampling the production rule from the CFG. Such a mechanism prohibits infeasible production rule selection. In the later stage, CODIT generates the edited code conditioned on the tree generated earlier. Suchconditioning makes the edited code syntactically correct. CODIT showed promise in learning code edit patterns in the wild and effectiveness in automatic program repair. In another empirical study, we showed that a graph-based model is better suitable for source code understanding tasks such as vulnerability detection.
On the other hand, with implicit encoding, we use a very large (with several hundred million parameters) yet generic model. However, we pre-train these models on a super-large (usually hundreds of gigabytes) collection of source code and code metadata. We empirically show that if sufficiently pre-trained, such models are capable enough to learn PL properties such as syntax and semantics. In this dissertation, we developed two such pre-trained models, with two different learning objectives. First, we developed PLBART— the first-ever pre-trained encoder-decoder-based model for source code and show that such pre-train enables the model to generate syntactically and semantically correct code. Further, we show an in-depth empirical study on using PLBART in automated code editing. Finally, we develop another pre-trained model — NatGen to encode the natural coding convention followed by developers into the model. To design NatGen, we first deliberately modify the code from the developers’ written version preserving the original semantics. We call such transformations ‘de-naturalizing’ transformations. Following the previous studies on induced unnaturalness in code, we defined several such ‘de-naturalizing’ transformations and applied those to developer-written code. We pre-train NatGen to reverse the effect of these transformations. That way, NatGen learns to generate code similar to the developers’ written by undoing any unnaturalness induced by our forceful ‘de-naturalizing‘ transformations. NatGen has performed well in code editing and other source code generation tasks.
The models and empirical studies we performed while writing this dissertation go beyond the scope of automated code editing and are applicable to other software engineering automation problems such as Code translation, Code summarization, Code generation, Vulnerability detection,Clone detection, etc. Thus, we believe this dissertation will influence and contribute to the advancement of AI4SE and PLP.
- Chakraborty_columbia_0054D_17445.pdf application/pdf 1.66 MB Download File
More About This Work
- Academic Units
- Computer Science
- Thesis Advisors
- Ray, Baishakhi
- Ph.D., Columbia University
- Published Here
- September 7, 2022