Dynamic Evaluation
Old neural net technique for runtime (online) learning which boosts performance.
Dynamic evaluation, or test-time finetuning, is a performance-enhancing online machine learning technique where the ML model is trained further at runtime on ‘new’ data: eg. an RNN/Transformer is benchmarked on predicting text, but in addition to making its prediction at each timestep, it takes an additional gradient-descent step on the newly-observed text. (It is analogous to short-term memory or neural plasticity.) Dynamic evaluation was introduced for RNNs by Mikolov et al 2010, where the continual learning reduced perplexity in predicting English, and it was used in many RNNs afterwards for the best performance (cf. the neural cache).
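To make the procedure concrete, here is a minimal sketch in PyTorch; everything in it is illustrative rather than canonical: `model` is assumed to be any autoregressive LM mapping a batch of token IDs to next-token logits, and the function name `dynamic_eval`, the SGD optimizer, and the learning rate are arbitrary choices, not part of the original technique.

```python
import torch
import torch.nn.functional as F

def dynamic_eval(model, tokens, lr=1e-4):
    """Score `tokens` autoregressively, taking one gradient step
    on each newly-observed token after predicting it."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll = 0.0
    for t in range(len(tokens) - 1):
        ctx = tokens[: t + 1].unsqueeze(0)       # (1, t+1): context so far
        logits = model(ctx)                      # (1, t+1, vocab_size)
        loss = F.cross_entropy(logits[0, -1:], tokens[t + 1 : t + 2])
        total_nll += loss.item()                 # score *before* updating,
        opt.zero_grad()                          # so the benchmark is fair
        loss.backward()
        opt.step()                               # learn from the new token
    return total_nll / (len(tokens) - 1)         # mean NLL; perplexity = exp(this)
```

In practice, updates are usually taken over chunks of tokens rather than after every single token, purely for speed, and the learning rate must be tuned to trade off plasticity against forgetting the pretrained weights.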

Gradient descent works well. Have you considered using it more?
Dynamic evaluation is attractive because it requires no modifications to the architecture or training: it simply does more ‘training’, rather than leaving the weights frozen and relying on the hidden state (or self-attention) to do all the learning, leading to greater consistency. It is especially useful when dealing with rare entities, domain shift, or personalization, and for serial tasks where the best possible performance is needed. It can also be augmented with retrieval methods, or by adding in similar datapoints, which can teach the NN more about the current input.
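One detail worth making explicit, since dynamic evaluation mutates the weights (a usage pattern, not part of the technique itself): when benchmarking many independent documents or users, the pretrained weights should be restored between sequences so updates do not leak across them; conversely, for personalization, one would deliberately keep the accumulated updates. A sketch reusing the hypothetical `dynamic_eval` above:

```python
import copy

def dynamic_eval_corpus(model, documents, lr=1e-4):
    """Dynamically evaluate each document independently by restoring
    the original pretrained weights before every document."""
    frozen = copy.deepcopy(model.state_dict())   # snapshot pretrained weights
    scores = []
    for doc in documents:
        model.load_state_dict(frozen)            # reset: no cross-document leakage
        scores.append(dynamic_eval(model, doc, lr=lr))
    return scores
```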
Dynamic evaluation has fallen out of fashion due to the emphasis on simple-to-deploy models, proprietary cloud services, and throughput over quality; but it may be revived by local NN models, or by tasks requiring cognitive flexibility not handled by pure self-attention (ARC?).