In the paper "A General Language Assistant as a Laboratory for Alignment", Askell et al. [1] reported that:
<aside> 💡
Ranked preference models tend to improve greatly on imitation learning, but binary discrimination typically provides little benefit.
</aside>
Specifically, preference modeling requires distinguishing between ‘good’ and ‘bad’ behavior, and several training objectives can be used to accomplish this; the two relevant here are binary discrimination and ranked preference modeling, sketched below.
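To make the distinction concrete, the following is a minimal PyTorch sketch (ours, not code from the paper) of the two objectives, assuming a preference model that produces a scalar score for each candidate response; `score_good` and `score_bad` are hypothetical score tensors for the preferred and rejected samples of a batch.

```python
import torch
import torch.nn.functional as F

def binary_discrimination_loss(score_good, score_bad):
    # Treat every sample independently as a binary classification problem:
    # preferred samples get label 1, rejected samples get label 0.
    scores = torch.cat([score_good, score_bad])
    labels = torch.cat([torch.ones_like(score_good), torch.zeros_like(score_bad)])
    return F.binary_cross_entropy_with_logits(scores, labels)

def ranked_preference_loss(score_good, score_bad):
    # Only require the preferred sample to score higher than the rejected
    # sample for the same prompt (pairwise, Bradley-Terry style loss).
    return -F.logsigmoid(score_good - score_bad).mean()

g, b = torch.randn(8), torch.randn(8)
print(binary_discrimination_loss(g, b).item(), ranked_preference_loss(g, b).item())
```

The key difference is that binary discrimination asks for an absolute good/bad decision on every sample, whereas ranked preference modeling only asks for a relative ordering within each pair.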
We consider the paper's result that “ranked preference models tend to improve greatly on imitation learning” to be easy to understand.
<aside> 💡
However, the result that “binary discrimination typically provides little benefit over imitation learning” is interesting and counterintuitive.
</aside>
Accordingly, we designed an experiment on the LeetCode dataset to compare preference modeling via binary discrimination against imitation learning. Our findings are as follows:
Based on these findings, we conclude that
<aside> 💡
The model can learn binary discrimination through imitation learning when each prompt contributes a single preference pair. However, with the same method it cannot learn binary discrimination when each prompt contributes multiple preference pairs.
</aside>
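To illustrate what we mean by single-pair versus multiple-pair binary discrimination per prompt, here is a hedged sketch of the pair construction; the helper and the field names (`prompt`, `chosen`, `rejected`) are hypothetical and only indicate how many (correct, incorrect) pairs a prompt contributes in each setting, not our exact data format.

```python
from itertools import product

def build_pairs(prompt, good_solutions, bad_solutions, single_pair):
    # Every (correct, incorrect) combination for the same prompt is a candidate pair.
    pairs = list(product(good_solutions, bad_solutions))
    if single_pair:
        pairs = pairs[:1]  # single-pair setting: keep exactly one pair per prompt
    return [{"prompt": prompt, "chosen": g, "rejected": b} for g, b in pairs]

# One prompt with two correct and two incorrect candidate solutions:
print(len(build_pairs("two-sum", ["sol_a", "sol_b"], ["bug_a", "bug_b"], single_pair=False)))  # 4 pairs
print(len(build_pairs("two-sum", ["sol_a", "sol_b"], ["bug_a", "bug_b"], single_pair=True)))   # 1 pair
```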
Furthermore, we find that the discrepancy between our experiments and those described in "A General Language Assistant as a Laboratory for Alignment" likely arises from the greater difficulty of the code generation samples in our experiments compared to those used in the paper.
LeetCode SFT Dataset [2][3]: We collected a total of 1,890 problems from LeetCode as our training set and 200 problems as our test set. The training set consists of problems published on the LeetCode website before July 2023, while the test set comprises problems from LeetCode weekly contests held between July 2023 and July 2024. Solutions for these problems were generated with Qwen2.5-Coder-32B-Instruct [4]. In our experiments, we split the training set into two parts: one for training the preference model and one for performance validation. Accordingly, the validation set lies entirely within the distribution of the training set, while the test set is out of distribution relative to it.
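Below is a minimal sketch of this split, assuming each problem record carries a `release_date` field (a hypothetical name); it separates the test set by date and then holds out an in-distribution validation subset from the training pool.

```python
import random
from datetime import date

def split_dataset(problems, val_fraction=0.1, seed=0):
    cutoff = date(2023, 7, 1)
    # Problems released before July 2023 form the training pool;
    # weekly-contest problems from July 2023 to July 2024 form the test set.
    train_pool = [p for p in problems if p["release_date"] < cutoff]
    test = [p for p in problems if p["release_date"] >= cutoff]
    random.Random(seed).shuffle(train_pool)
    n_val = int(len(train_pool) * val_fraction)
    # Validation is drawn from the training pool, so it stays in-distribution;
    # the later contest problems in `test` are out of distribution by construction.
    return train_pool[n_val:], train_pool[:n_val], test
```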