July 7, 2021
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
Similar papers 1
March 1, 2024
We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt ...
February 26, 2022
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodePar...
September 6, 2024
In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode...
May 20, 2021
While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark mea...
August 16, 2021
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Progr...
September 6, 2022
Source code documentation is an important artifact for efficient software development. Code documentation could greatly benefit from automation since manual documentation is often labouring, resource and time-intensive. In this paper, we employed Codex for automatic code documentation creation. Codex is a GPT-3 based model pre-trained on both natural and programming languages. We find that Codex outperforms existing techniques even with basic settings like one-shot learning (...
November 5, 2023
This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model, compared to human programmers. A novel dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis. Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples. A meticulous manual assessment methodology prioritized evaluating correctness, comprehensibility, and securit...
October 9, 2024
The capabilities of Large Language Models (LLMs) have significantly evolved, extending from natural language processing to complex tasks like code understanding and generation. We expand the scope of LLMs' capabilities to a broader context, using LLMs to execute code snippets to obtain the output. This paper pioneers the exploration of LLMs as code executors, where code snippets are directly fed to the models for execution, and outputs are returned. We are the first to compre...
June 26, 2023
Using large language models (LLMs) for source code has recently gained attention. LLMs, such as Transformer-based models like Codex and ChatGPT, have been shown to be highly capable of solving a wide range of programming problems. However, the extent to which LLMs understand problem descriptions and generate programs accordingly or just retrieve source code from the most relevant problem in training data based on superficial cues has not been discovered yet. To explore this r...
March 25, 2022
Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural langua...