Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs

Aligning Large Language Models with Counterfactual DPO

January 17, 2024

94% Match

Bradley Butcher

Computation and Language

Artificial Intelligence

Advancements in large language models (LLMs) have demonstrated remarkable capabilities across a diverse range of applications. These models excel in generating text completions that are contextually coherent and cover an extensive array of subjects. However, the vast datasets required for their training make aligning response styles during the pretraining and instruction tuning phases challenging. Consequently, an additional alignment phase is typically employed, wherein the ...

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

April 16, 2024

93% Match

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, ... , Yi Wu

Computation and Language

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art re...

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

March 28, 2024

93% Match

Qi Gou, Cam-Tu Nguyen

Computation and Language

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values. This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO). By analyzing the...

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

May 29, 2023

93% Match

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, ... , Chelsea Finn

Machine Learning

Artificial Intelligence

Computation and Language

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLH...

Enhancing LLM Safety via Constrained Direct Preference Optimization

March 4, 2024

93% Match

Zixuan Liu, Xiaolin Sun, Zizhan Zheng

Machine Learning

Computation and Language

The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computa...

$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

October 14, 2024

93% Match

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, ... , Xiangnan He

Machine Learning

Artificial Intelligence

Computation and Language

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameteri...

Robust Preference Optimization through Reward Model Distillation

May 29, 2024

92% Match

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, ... , Jonathan Berant

Machine Learning

Computation and Language

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign ...

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

February 14, 2024

92% Match

Feifan Song, Yuxuan Fan, Xin Zhang, ... , Houfeng Wang

Computation and Language

Artificial Intelligence

Large Language Models (LLMs) rely on Human Preference Alignment (HPA) to ensure the generation of safe content. Due to the heavy cost associated with fine-tuning, fine-tuning-free methods have emerged, typically modifying LLM decoding with external auxiliary methods. However, these methods do not essentially enhance the LLM itself. In this paper, we rethink the derivation procedures of DPO, based on which we conversely build an instant scorer using the states of the LLM befor...

AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation

March 4, 2025

92% Match

Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, ... , Jinan Xu

Computation and Language

Artificial Intelligence

Machine Learning

In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-qual...

Robust Multi-Objective Preference Alignment with Online DPO

March 1, 2025

92% Match

Raghav Gupta, Ryan Sullivan, Yunxuan Li, ... , Abhinav Rastogi

Computation and Language

Machine Learning

Multi-objective preference alignment of large language models (LLMs) is critical for developing AI systems that are more configurable, personalizable, helpful, and safe. However, optimizing model outputs to satisfy diverse objectives with variable weights at inference time for truly personalized models presents a significant challenge. Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors. This paper introduces the Mult...
