Evaluating Large Language Models Trained on Code

July 7, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba

Computer Science

Machine Learning

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

A systematic evaluation of large language models for generating programming code

March 1, 2024

94% Match

Wenpin Hou, Zhicheng Ji

Software Engineering

Artificial Intelligence

Computation and Language

Programming Languages

We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt ...

Find SimilarView on arXiv

A Systematic Evaluation of Large Language Models of Code

February 26, 2022

93% Match

Frank F. Xu, Uri Alon, ... , Hellendoorn Vincent J.

Programming Languages

Computation and Language

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodePar...

Find SimilarView on arXiv

Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

September 6, 2024

93% Match

Luis Mayer, Christian Heumann, Matthias Aßenmacher

Computation and Language

Machine Learning

Software Engineering

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode...

Find SimilarView on arXiv

Measuring Coding Challenge Competence With APPS

May 20, 2021

93% Match

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, ... , Steinhardt Jacob

Software Engineering

Computation and Language

Machine Learning

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark mea...

Find SimilarView on arXiv

Program Synthesis with Large Language Models

August 16, 2021

93% Match

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, ... , Sutton Charles

Programming Languages

Machine Learning

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Progr...

Find SimilarView on arXiv

Automatic Code Documentation Generation Using GPT-3

September 6, 2022

93% Match

Junaed Younus Khan, Gias Uddin

Software Engineering

Source code documentation is an important artifact for efficient software development. Code documentation could greatly benefit from automation since manual documentation is often labouring, resource and time-intensive. In this paper, we employed Codex for automatic code documentation creation. Codex is a GPT-3 based model pre-trained on both natural and programming languages. We find that Codex outperforms existing techniques even with basic settings like one-shot learning (...

Find SimilarView on arXiv

Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation

November 5, 2023

93% Match

Muhammad Fawad Akbar Khan, Max Ramsdell, ... , Karimi Hamid

Software Engineering

Artificial Intelligence

This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model, compared to human programmers. A novel dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis. Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples. A meticulous manual assessment methodology prioritized evaluating correctness, comprehensibility, and securit...

Find SimilarView on arXiv

Large Language Models as Code Executors: An Exploratory Study

October 9, 2024

93% Match

Chenyang Lyu, Lecheng Yan, Rui Xing, Wenxi Li, Younes Samih, ... , Wang Longyue

Computation and Language

Artificial Intelligence

The capabilities of Large Language Models (LLMs) have significantly evolved, extending from natural language processing to complex tasks like code understanding and generation. We expand the scope of LLMs' capabilities to a broader context, using LLMs to execute code snippets to obtain the output. This paper pioneers the exploration of LLMs as code executors, where code snippets are directly fed to the models for execution, and outputs are returned. We are the first to compre...

Find SimilarView on arXiv

Exploring the Robustness of Large Language Models for Solving Programming Problems

June 26, 2023

93% Match

Atsushi Shirafuji, Yutaka Watanobe, Takumi Ito, Makoto Morishita, Yuki Nakamura, ... , Suzuki Jun

Computation and Language

Artificial Intelligence

Software Engineering

Using large language models (LLMs) for source code has recently gained attention. LLMs, such as Transformer-based models like Codex and ChatGPT, have been shown to be highly capable of solving a wide range of programming problems. However, the extent to which LLMs understand problem descriptions and generate programs accordingly or just retrieve source code from the most relevant problem in training data based on superficial cues has not been discovered yet. To explore this r...

Find SimilarView on arXiv

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

March 25, 2022

93% Match

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, ... , Xiong Caiming

Machine Learning

Computation and Language

Programming Languages

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural langua...

Find SimilarView on arXiv