In the paper "A General Language Assistant as a Laboratory for Alignment", Askell et al. [1] reported that:
<aside> 💡
Ranked preference models tend to improve greatly on imitation learning, but binary discrimination typically provides little benefit.
</aside>
Specifically, preference modeling requires distinguishing between ‘good’ and ‘bad’ behavior, and several training objectives can be used to accomplish this; the two relevant here are binary discrimination and ranked preference modeling, sketched below.
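To make the distinction concrete, the following is a minimal PyTorch sketch (ours, not code from the paper) of the two objectives, assuming a preference model that produces a scalar score for each candidate response; `score_good` and `score_bad` are hypothetical score tensors for the preferred and rejected samples of a batch.

```python
import torch
import torch.nn.functional as F

def binary_discrimination_loss(score_good, score_bad):
    # Treat every sample independently as a binary classification problem:
    # preferred samples get label 1, rejected samples get label 0.
    scores = torch.cat([score_good, score_bad])
    labels = torch.cat([torch.ones_like(score_good), torch.zeros_like(score_bad)])
    return F.binary_cross_entropy_with_logits(scores, labels)

def ranked_preference_loss(score_good, score_bad):
    # Only require the preferred sample to score higher than the rejected
    # sample for the same prompt (pairwise, Bradley-Terry style loss).
    return -F.logsigmoid(score_good - score_bad).mean()

g, b = torch.randn(8), torch.randn(8)
print(binary_discrimination_loss(g, b).item(), ranked_preference_loss(g, b).item())
```

The key difference is that binary discrimination asks for an absolute good/bad decision on every sample, whereas ranked preference modeling only asks for a relative ordering within each pair.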
We consider the paper's result that “ranked preference models tend to improve greatly on imitation learning” to be easy to understand.
<aside> 💡
However, the result that “binary discrimination typically provides little benefit over imitation learning” is interesting and counterintuitive.
</aside>
Accordingly, we designed an experiment on the LeetCode dataset to compare preference modeling via binary discrimination against imitation learning. Our findings are as follows:
Based on these findings, we conclude that
<aside> 💡
The model can learn binary discrimination through imitation learning when each prompt contributes a single preference pair. However, with the same method it cannot learn binary discrimination when each prompt contributes multiple preference pairs.
</aside>
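To illustrate what we mean by single-pair versus multiple-pair binary discrimination per prompt, here is a hedged sketch of the pair construction; the helper and the field names (`prompt`, `chosen`, `rejected`) are hypothetical and only indicate how many (correct, incorrect) pairs a prompt contributes in each setting, not our exact data format.

```python
from itertools import product

def build_pairs(prompt, good_solutions, bad_solutions, single_pair):
    # Every (correct, incorrect) combination for the same prompt is a candidate pair.
    pairs = list(product(good_solutions, bad_solutions))
    if single_pair:
        pairs = pairs[:1]  # single-pair setting: keep exactly one pair per prompt
    return [{"prompt": prompt, "chosen": g, "rejected": b} for g, b in pairs]

# One prompt with two correct and two incorrect candidate solutions:
print(len(build_pairs("two-sum", ["sol_a", "sol_b"], ["bug_a", "bug_b"], single_pair=False)))  # 4 pairs
print(len(build_pairs("two-sum", ["sol_a", "sol_b"], ["bug_a", "bug_b"], single_pair=True)))   # 1 pair
```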
Furthermore, we find that the discrepancy between our experiments and those described in "A General Language Assistant as a Laboratory for Alignment" likely arises from the greater difficulty of the code generation samples in our experiments compared to those used in the paper.
LeetCode SFT Dataset [2][3]: We collected a total of 1,890 problems from LeetCode as our training set and 200 problems as our test set. The training set consists of problems published on the LeetCode website before July 2023, while the test set comprises problems from LeetCode weekly contests held between July 2023 and July 2024. Solutions for these problems were generated with Qwen2.5-Coder-32B-Instruct [4]. In our experiments, we split the training set into two parts: one for training the preference model and one for performance validation. Accordingly, the validation set lies entirely within the distribution of the training set, while the test set is out of distribution relative to it.
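Below is a minimal sketch of this split, assuming each problem record carries a `release_date` field (a hypothetical name); it separates the test set by date and then holds out an in-distribution validation subset from the training pool.

```python
import random
from datetime import date

def split_dataset(problems, val_fraction=0.1, seed=0):
    cutoff = date(2023, 7, 1)
    # Problems released before July 2023 form the training pool;
    # weekly-contest problems from July 2023 to July 2024 form the test set.
    train_pool = [p for p in problems if p["release_date"] < cutoff]
    test = [p for p in problems if p["release_date"] >= cutoff]
    random.Random(seed).shuffle(train_pool)
    n_val = int(len(train_pool) * val_fraction)
    # Validation is drawn from the training pool, so it stays in-distribution;
    # the later contest problems in `test` are out of distribution by construction.
    return train_pool[n_val:], train_pool[:n_val], test
```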