Language models can generate harmful and biased outputs and exhibit undesirable behavior according to a given cultural context.
We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process that changes model behavior in a statistically significant way by crafting, and fine-tuning on, a dataset that reflects a predetermined set of target values.
We evaluate our process using three metrics: quantitative human evaluations that score output adherence to a target value; toxicity scoring of outputs; and qualitative analysis of the most common words associated with a given social category. At each iteration, we add training examples to the dataset based on shortcomings observed in the evaluations.
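The qualitative metric above inspects which words co-occur most often with a given social category in model outputs. A minimal sketch of that kind of analysis is below; the tokenization, stopword list, and example completions are all invented for illustration and are much simpler than what a real evaluation would use.

```python
from collections import Counter
import re

# Tiny stopword list for illustration only; a real analysis would use a
# fuller list and proper tokenization.
STOPWORDS = {"the", "a", "an", "is", "was", "are", "and", "of", "to", "in"}

def top_cooccurring_words(completions, n=5):
    """Return the n most common non-stopword tokens across completions
    generated from prompts about a single social category."""
    counts = Counter()
    for text in completions:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical completions for one social category:
completions = [
    "She was described as kind and capable.",
    "People said she was capable and hardworking.",
]
print(top_cooccurring_words(completions, n=2))
```

Comparing these top word lists before and after fine-tuning surfaces shifts in the associations a model makes with a category.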
PALMS performs statistically significantly better on all metrics than baseline and control models across a broad range of GPT-3 language model sizes, without compromising capability integrity. We find that the effectiveness of PALMS increases with model size… The mean Human Evaluation score and effect size are consistently higher for our values-targeted models in Figure 3. All categories under the values-targeted model show statistically significantly better ratings, implying that the generated completions more closely match the intended sentiment. The rating improves as model size increases, signaling that PALMS has a larger positive impact on larger models.
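The comparison above rests on effect sizes between rating distributions. One standard effect-size measure, Cohen's d with a pooled standard deviation, can be sketched as follows; the ratings below are invented for illustration, and this excerpt does not specify the paper's exact statistical procedure.

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: standardized mean difference between two groups,
    using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Hypothetical human-evaluation ratings on a 1-5 scale:
base = [2.1, 2.4, 1.9, 2.6, 2.2]        # baseline model
targeted = [3.8, 4.1, 3.6, 4.3, 3.9]    # values-targeted model
print(round(cohens_d(targeted, base), 2))
```

A positive d indicates the values-targeted model is rated higher on average; larger magnitudes indicate a stronger separation between the two rating distributions.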
Figure 3: Mean Human Evaluation Scores.
We show that statistically significant adjustment of language model behavior is feasible with a small, hand-curated dataset.