Draft:Direct Preference Optimization

Direct Preference Optimization (DPO) is an algorithm for aligning large language models (LLMs) with human preferences without training a separate reward model or using reinforcement learning. It addresses the challenge of precisely controlling the behavior of language models, which, despite the broad knowledge and reasoning skills they acquire through unsupervised pre-training, are difficult to steer. Traditional approaches collect human labels on model generations and fine-tune the model to align with these preferences using Reinforcement Learning from Human Feedback (RLHF). However, RLHF pipelines can be complex and unstable.

DPO removes one stage of the RLHF pipeline: the separate reward-modeling step. Instead of first learning a reward model from human preferences and then training a policy to maximize the inferred rewards, DPO optimizes the policy directly on the collected preference data. This is achieved by reparameterizing the RLHF objective so that the reward is expressed in terms of the policy itself, allowing the policy to be trained to match human preferences without an intermediate reward model.
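Concretely, the reparameterization in the original DPO derivation expresses the reward implied by a policy $$\pi$$ in terms of that policy and the reference model:

$$r(x, y) = \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x),$$

where $$Z(x)$$ is a partition function that depends only on the prompt $$x$$. When this reward is substituted into a pairwise preference model such as Bradley–Terry, the $$Z(x)$$ terms cancel, leaving an objective that depends only on the two policies and the preference data.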

However, while DPO sidesteps the challenges associated with the generalization of reward models, it still relies on another approximation: the translation of pairwise preferences into a form that can be optimized directly. DPO therefore simplifies the training process and can improve the stability and performance of the model, but it does not overcome all the theoretical challenges of learning from human preferences. The method still assumes that pairwise preferences can meaningfully guide the optimization of the policy, an assumption that carries its own limitations in accurately capturing the nuances of human judgments.

Compared with Proximal Policy Optimization (PPO)-based RLHF, DPO achieves comparable alignment quality while requiring less data and less computing power.

One shortcoming of DPO is that it tends to overfit quickly on the preference dataset.

The Loss Function
A central part of DPO is its loss function.

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right].$$

On the left side of the equals sign is the DPO loss of the policy model $$\pi_\theta$$, computed with respect to the frozen reference model $$\pi_{\text{ref}}$$.

On the right side, we take the expected value over triples sampled from the preference dataset $$\mathcal{D}$$: a prompt $$x$$, the output that was chosen ($$y_w$$), and the output that was rejected ($$y_l$$). For each triple, we take the negative logarithm of the sigmoid function $$\sigma$$ applied to the term inside the parentheses:

Inside the parentheses, we subtract two log probability ratios: on the left, the ratio of the policy's probability to the reference model's probability for the chosen output; on the right, the same ratio for the rejected output. Both log ratios are scaled by $$\beta$$, a hyperparameter that controls how far the policy may deviate from the reference model.
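The loss described above can be sketched in plain Python for a single preference pair. The function below is an illustrative sketch, not a production implementation: it assumes the sequence log-probabilities under the policy and reference models have already been computed, and the hypothetical names are chosen for readability.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given log-probabilities
    of the chosen and rejected outputs under both models."""
    # log [pi_theta(y_w|x) / pi_ref(y_w|x)]
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    # log [pi_theta(y_l|x) / pi_ref(y_l|x)]
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    # beta-scaled margin between the two log ratios
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written with log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

If the policy has not yet moved away from the reference model, both log ratios are zero and the loss equals $$\log 2 \approx 0.693$$; as the policy assigns relatively more probability to the chosen output, the margin grows and the loss falls toward zero. In practice this computation is done batched on tensors (e.g. with PyTorch) so it can be backpropagated through the policy.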