Played around with Claude 2 and uploaded some papers I was looking at today (and an old blog post of mine). A context of 100k tokens is pretty useful (regardless of whether an LLM truly understands or is just a parrot in a Chinese room, this is impressive):
Here is a summary comparing the key ideas and contributions of the two papers "Diversify and Disambiguate: Learning From Underspecified Data" by Lee et al. and "Quantification of Uncertainty with Adversarial Models" by Schweighofer et al., as well as the related blog post by Kirsch:
Overview of Papers:
"Diversify and Disambiguate" (Lee et al.):
- Proposes a two-stage framework called DivDis for learning from underspecified data, where multiple viable explanations exist.
- Stage 1 is "Diversify": Trains multiple classifier heads to make different predictions on unlabeled target data, while minimizing training error. Achieved by maximizing mutual information between head predictions.
- Stage 2 is "Disambiguate": Selects the best head using a small amount of labeled target data.
- Shows improved robustness in subpopulation shift settings compared to standard ERM.
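A minimal sketch of what the diversify objective could look like, assuming softmax classifier heads and a pairwise MI term estimated from the empirical joint over a target batch (the names and weighting are illustrative, and DivDis's additional regularizer is omitted):

```python
import torch
import torch.nn.functional as F

def pairwise_mutual_information(probs_a, probs_b, eps=1e-8):
    """Empirical MI between two heads' predictions on the same target batch.

    probs_a, probs_b: [batch, classes] softmax outputs. The joint is estimated
    by averaging the outer product of the two heads' distributions."""
    joint = torch.einsum("bi,bj->ij", probs_a, probs_b) / probs_a.shape[0]
    marg_a, marg_b = probs_a.mean(dim=0), probs_b.mean(dim=0)
    indep = marg_a[:, None] * marg_b[None, :]
    return (joint * (torch.log(joint + eps) - torch.log(indep + eps))).sum()

def diversify_loss(src_logits_per_head, src_labels, tgt_probs_per_head, mi_weight=1.0):
    """Cross-entropy on labeled source data for every head, plus pairwise MI
    between heads on the unlabeled target batch (minimized so that heads make
    independent, i.e. diverse, predictions)."""
    ce = sum(F.cross_entropy(logits, src_labels) for logits in src_logits_per_head)
    mi = sum(pairwise_mutual_information(tgt_probs_per_head[i], tgt_probs_per_head[j])
             for i in range(len(tgt_probs_per_head))
             for j in range(i + 1, len(tgt_probs_per_head)))
    return ce + mi_weight * mi
```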
"Quantification of Uncertainty with Adversarial Models" (Schweighofer et al.):
- Proposes the QUAM method to estimate epistemic uncertainty by searching for "adversarial models" that have high posterior probability but differ from a reference model (sketched after this list).
- Adversarial models identify important modes of the posterior missed by methods like deep ensembles.
- Uses adversarial models to construct a mixture distribution for variance reduction in MC estimation.
- Outperforms methods like deep ensembles in estimating epistemic uncertainty on vision tasks.
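An illustrative sketch of the adversarial-model idea, not QUAM's exact procedure (the KL disagreement measure, penalty weight, and optimizer here are placeholder choices): start from a copy of the reference model and push its prediction at a test point away from the reference's, while penalizing training loss so the model stays plausible under the posterior.

```python
import copy
import torch
import torch.nn.functional as F

def find_adversarial_model(reference, train_loader, x_star, penalty=1.0,
                           steps=100, lr=1e-3):
    """Search for a model that still fits the training data (high posterior)
    yet disagrees with the reference model at the test input x_star."""
    model = copy.deepcopy(reference)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ref_probs = F.softmax(reference(x_star), dim=-1).detach()
    data_iter = iter(train_loader)
    for _ in range(steps):
        try:
            xb, yb = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            xb, yb = next(data_iter)
        log_probs = F.log_softmax(model(x_star), dim=-1)
        disagreement = F.kl_div(log_probs, ref_probs, reduction="batchmean")
        train_loss = F.cross_entropy(model(xb), yb)
        loss = -disagreement + penalty * train_loss  # maximize disagreement, keep the fit
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Several such models, found from different starting points or test inputs, would then be mixed when estimating epistemic uncertainty.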
Connections between Papers:
- Both aim to learn diverse sets of models that make different predictions, in order to handle underspecification.
- Lee et al. do this by training multiple classifier heads, while Schweighofer et al. find adversarial models.
- Lee et al. focus on improving predictive performance, while Schweighofer et al. focus on quantifying uncertainty.
- However, Lee et al.'s "disambiguate" stage is similar to using adversarial models for uncertainty quantification.
Information-Theoretic Justifications:
- Lee et al. justify mutual information minimization between heads as promoting independence of predictions.
- Schweighofer et al. justify adversarial models via importance sampling bounds using KL divergence.
- Both connect diversity of models to information-theoretic quantities like mutual information and KL divergence.
- These arguments justify explicitly encouraging diversity, rather than relying on the implicit diversity of separately trained models (the identity below makes the shared link explicit).
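The shared identity behind both justifications (standard information theory rather than anything specific to either paper): mutual information is the KL divergence between the joint and the product of the marginals, so driving it to zero is exactly enforcing independence.

```latex
I(Y_1; Y_2) = D_{\mathrm{KL}}\big(p(y_1, y_2)\,\|\,p(y_1)\,p(y_2)\big) \ge 0,
\qquad I(Y_1; Y_2) = 0 \iff Y_1 \perp\!\!\!\perp Y_2 .
```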
Summary of Blog Post by Kirsch:
- Suggests encouraging model disagreement on unlabeled pool/target data when training ensembles.
- Motivated by making BALD scores more "informative" for active learning.
- Discusses connections to Lee et al.'s method of minimizing mutual information between heads.
- Proposes directly maximizing approximate BatchBALD scores as a novel objective.
- Raises open questions around constraints needed to avoid degenerate solutions.
The blog post relates model disagreement to BALD, an information-theoretic active learning metric. It proposes extending both Lee et al.'s and Schweighofer et al.'s approaches to directly optimize information-theoretic objectives related to model disagreement.
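For reference, the BALD score is the mutual information between a prediction and the model parameters; from an ensemble it is estimated as the entropy of the mean prediction minus the mean of the per-member entropies. A minimal sketch, assuming a stacked tensor of per-member class probabilities:

```python
import torch

def bald_score(member_probs, eps=1e-8):
    """member_probs: [members, batch, classes] predicted class probabilities.
    Returns the per-example BALD score: H[mean prediction] - mean H[prediction]."""
    mean_probs = member_probs.mean(dim=0)
    entropy_of_mean = -(mean_probs * torch.log(mean_probs + eps)).sum(dim=-1)
    mean_of_entropies = -(member_probs * torch.log(member_probs + eps)).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_of_entropies  # shape: [batch]
```

Maximizing such a score on pool data while training an ensemble is the kind of disagreement objective the post discusses (BatchBALD extends it to batches of points scored jointly).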
Jul 11, 2023 · 9:37 PM UTC