An exciting application area of machine learning and deep learning methods is the completion, repair, synthesis, and automatic explanation of program code. This field has received considerable attention over the last decade (see e.g. Oda et al. (2015); Balog et al. (2017); Allamanis et al. (2018)), yet the recent application of large-scale language modelling techniques to code arguably holds tremendous promise to revolutionize the area (Chen et al., 2021; Austin et al., 2021). The new large pretrained models excel at completing code and synthesizing code from natural language descriptions, and they work across a wide range of domains, tasks, and programming languages. The excitement about new possibilities is spurring tremendous interest in both industry and academia. Yet we are just beginning to explore the potential of large-scale deep learning for code, and state-of-the-art models still struggle with correctness and generalization. This calls for platforms to exchange ideas and discuss the challenges in this line of work. The second Deep Learning for Code (DL4C) workshop will provide such a platform at ICLR 2023.

Fri 5:30 a.m. - 5:45 a.m.
Opening Remarks (Remarks)

Fri 5:45 a.m. - 6:30 a.m.
Neurosymbolic reasoning for better program synthesis (Invited Talk)
Armando Solar-Lezama
In this talk, Prof. Solar-Lezama will describe how the combination of deep learning and symbolic reasoning can help improve on the capabilities of purely neural systems. The talk will also describe some open problems around how to make this combination even more capable.

Fri 6:30 a.m. - 7:15 a.m.
Fantastic LLMs for (Data Science) Code and How To Adapt Them (Invited Talk)
Alex Polozov
Large language models are rapidly emerging as the foundational technology for addressing numerous software engineering pain points. From code generation to bug fixing to migration and maintenance, they hold the potential to aid every part of the application development lifecycle. However, with great opportunities come great product responsibilities. LLMs need to be adapted to maximize the quality of every application, grounded in the user's context, and made to generate code in a way that respects and uplifts the open-source developments upon which they build. I will discuss some of the practical challenges and approaches to real-life LLM adaptation and Code AI product development, using data science as a motivating application.

Fri 7:15 a.m. - 7:30 a.m.
Break

Fri 7:30 a.m. - 8:15 a.m.
BigCode: open and responsible development of LLMs for code (Invited Talk)
Harm de Vries · Leandro von Werra
In this presentation, we will share several accomplishments of the BigCode project, a community effort working on the responsible development of LLMs for code generation through open science and open governance. These include:

Fri 8:15 a.m. - 9:15 a.m.
Panel Discussion
Armando Solar-Lezama · Alex Polozov · Danny Tarlow · Harm de Vries · Leandro von Werra

Fri 9:15 a.m. - 9:45 a.m.
Lunch Break

Fri 9:45 a.m. - 10:00 a.m.
R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents (Paper Presentation)
Large language models show impressive results at predicting structured text such as code, but they also commonly introduce errors and hallucinations in their output. When used to assist software developers, these models may make mistakes that users must go back and fix or, worse, introduce subtle bugs that users may miss entirely. We propose Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE), an approach for building uncertainty-aware suggestions based on a decision-theoretic model of goal-conditioned utility, using random samples from a generative model as a proxy for the unobserved possible intents of the end user. Our technique combines minimum-Bayes-risk decoding, dual decomposition, and decision diagrams to efficiently produce structured uncertainty summaries, given only sample access to an arbitrary generative model of code and an optional syntax tree parser. We demonstrate R-U-SURE on three developer-assistance tasks and show that it leads to more useful uncertainty estimates than per-token probability baselines, without requiring model retraining or fine-tuning.

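As a concrete, heavily simplified illustration of the core idea of treating model samples as a proxy for possible user intents, the Python sketch below flags tokens of a suggested completion on which other samples from the same model disagree. It is not the R-U-SURE algorithm itself, which optimizes a goal-conditioned utility via dual decomposition and decision diagrams; sample_completions is a hypothetical hook into whatever code model is available.

# Illustrative sketch only, not the R-U-SURE method: mark low-agreement
# tokens in a suggestion by comparing it against other sampled completions.
from collections import Counter
from typing import Callable, List, Tuple

def mark_uncertain_regions(
    prompt: str,
    sample_completions: Callable[[str, int], List[List[str]]],  # hypothetical model hook
    num_samples: int = 8,
    agreement_threshold: float = 0.7,
) -> List[Tuple[str, bool]]:
    # Draw several completions; each completion is a list of tokens.
    samples = sample_completions(prompt, num_samples)
    primary = samples[0]  # use the first sample as the suggestion shown to the user
    annotated = []
    for i, token in enumerate(primary):
        # Count how many samples agree with the suggested token at this position.
        votes = Counter(s[i] for s in samples if i < len(s))
        agreement = votes[token] / len(samples)
        # Low agreement marks the token as part of an "uncertain region".
        annotated.append((token, agreement < agreement_threshold))
    return annotated
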
Fri 10:00 a.m. - 10:15 a.m.
SantaCoder: don't reach for the stars! (Paper Presentation)
The BigCode project is an open scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B-parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being substantially smaller. All models are released under an OpenRAIL license at https://hf.co/bigcode.

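For readers unfamiliar with near-duplicate filtering, the sketch below shows one generic way to do it: compare token shingles of training files by Jaccard similarity and drop files that are too similar to ones already kept. This is an assumed illustration only; the abstract does not specify the BigCode pipeline, which at dataset scale would more likely rely on an approximate scheme such as MinHash.

# Generic near-duplicate filter for illustration; not the SantaCoder pipeline.
def shingles(code: str, n: int = 5) -> set:
    # Break a file into overlapping n-token "shingles".
    tokens = code.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def filter_near_duplicates(files: list, threshold: float = 0.85) -> list:
    # Keep a file only if its shingle set has low Jaccard similarity to every
    # file kept so far (quadratic scan; fine for a small illustrative corpus).
    kept, kept_shingles = [], []
    for code in files:
        s = shingles(code)
        is_duplicate = any(
            len(s & t) / len(s | t) >= threshold for t in kept_shingles
        )
        if not is_duplicate:
            kept.append(code)
            kept_shingles.append(s)
    return kept
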
Fri 10:15 a.m. - 10:25 a.m.
Conversational Automated Program Repair (Paper Presentation)
Automated Program Repair (APR) can help developers automatically generate patches for bugs. Due to the impressive performance of Large Pre-Trained Language Models (LLMs) on many code-related tasks, researchers have started to use LLMs directly for APR. However, prior approaches simply sample the LLM repeatedly given the same input/prompt constructed from the original buggy code, which not only leads to generating the same incorrect patches over and over but also misses the critical information in test cases. To address these limitations, we propose conversational APR, a new paradigm for program repair that alternates between patch generation and validation in a conversational manner. In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. As such, we leverage the long context window of LLMs to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test. We evaluate 10 different LLMs, including the newly developed ChatGPT model, to demonstrate the improvement of conversational APR over prior LLM-based APR approaches.

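A minimal sketch of the generate-validate-feedback loop described above, assuming a chat-style model interface: query_llm and run_tests are hypothetical stand-ins for the model call and the test harness, and the paper's exact prompt construction and models are not reproduced.

# Illustrative conversational repair loop; hypothetical query_llm/run_tests hooks.
def conversational_repair(buggy_code: str, query_llm, run_tests, max_turns: int = 5):
    history = [
        {"role": "user",
         "content": f"Fix the bug in the following function:\n{buggy_code}"}
    ]
    for _ in range(max_turns):
        patch = query_llm(history)                # candidate patch as a string
        passed, failure_report = run_tests(patch) # validate against the test suite
        if passed:
            return patch                          # validated patch found
        # Feed the failed patch and its test feedback back into the conversation,
        # so the next attempt avoids repeating the same incorrect patch.
        history.append({"role": "assistant", "content": patch})
        history.append({
            "role": "user",
            "content": "The patch still fails these tests:\n"
                       f"{failure_report}\nPlease try a different fix."
        })
    return None                                   # no validated patch within budget
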
Fri 10:25 a.m. - 10:35 a.m.
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code (Paper Presentation)
Since the rise of neural models of code that can generate long expressions and statements rather than a single next token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation that builds on BERTScore. Instead of measuring exact token matching as BLEU does, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score from CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code.

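The sketch below illustrates the BERTScore-style soft token matching the abstract describes: candidate and reference tokens are embedded with a pretrained code encoder, and precision and recall are taken over best-match cosine similarities. It is an illustrative approximation, not the authors' released implementation; it omits the surrounding-context encoding they highlight, and the choice of microsoft/codebert-base as the encoder is an assumption made for the example.

# Illustrative soft-matching score in the style of BERTScore; not CodeBERTScore itself.
import torch
from transformers import AutoModel, AutoTokenizer

def soft_similarity_f1(candidate: str, reference: str,
                       model_name: str = "microsoft/codebert-base") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    def encode(code: str) -> torch.Tensor:
        # Contextual token embeddings, L2-normalized so dot products are cosines.
        inputs = tokenizer(code, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, dim)
        return torch.nn.functional.normalize(hidden, dim=-1)

    cand, ref = encode(candidate), encode(reference)
    sim = cand @ ref.T                        # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # each candidate token's best match
    recall = sim.max(dim=0).values.mean()     # each reference token's best match
    return float(2 * precision * recall / (precision + recall))
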
Fri 10:35 a.m. - 10:45 a.m.
Break

Fri 10:45 a.m. - 11:30 a.m.
Information needs in assisting software engineers with deep learning (Invited Talk)
Danny Tarlow

Fri 11:30 a.m. - 12:15 p.m.
How Programmers Interact with AI Assistants (Invited Talk)
Nadia Polikarpova
Powered by recent advances in code-generating models, AI assistants like GitHub Copilot promise to change the face of programming forever. But what is this new face of programming? And how can we help programmers use these assistants more effectively? In the first part of the talk, I will present the first grounded-theory study of how programmers interact with Copilot, based on observing 20 participants with varying levels of experience. Our main finding is that interactions with programming assistants are bimodal: programmers use Copilot either in acceleration mode or in exploration mode. Based on the observations from this first study, we designed a new interaction model, dubbed Live Exploration of AI-generated Programs (LEAP), with the goal of better supporting programmers in exploration mode. The main idea of LEAP is to use Live Programming, a continuous display of a program's runtime values, to help the user understand and validate AI code suggestions. In the second part of the talk, I will discuss LEAP and our user study, which shows that Live Programming lowers the cost of validating AI suggestions, thereby reducing both under- and over-reliance on the AI assistant.

Fri 12:15 p.m. - 12:30 p.m.
Break

Fri 12:30 p.m. - 1:45 p.m.
Lightning Talks (Paper Presentation)

Fri 1:45 p.m. - 2:00 p.m.
Closing Remarks (Remarks)