To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments [Metaculus, Good Judgment Open, & CSET Foretell], ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future).
Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g., global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration.
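To make the calibration metric concrete, one standard formulation of RMS calibration error measures, across a set of nominal confidence levels, how far the empirical coverage of the predicted intervals deviates from each level. The following sketch assumes this standard formulation; the function name and input layout are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def rms_calibration_error(targets, intervals, levels):
    """RMS calibration error over a set of confidence levels (sketch).

    targets:   array of shape (n,) with ground-truth values
    intervals: dict mapping each level to an (n, 2) array of
               [lower, upper] interval bounds predicted at that level
    levels:    iterable of nominal confidence levels, e.g. (0.5, 0.8, 0.9)
    """
    errs = []
    for level in levels:
        lo, hi = intervals[level][:, 0], intervals[level][:, 1]
        # fraction of targets actually covered by the predicted intervals
        coverage = np.mean((targets >= lo) & (targets <= hi))
        errs.append((coverage - level) ** 2)
    return float(np.sqrt(np.mean(errs)))
```

A perfectly calibrated model has coverage equal to each nominal level, giving an error of zero; over- or under-confident intervals inflate the score.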
We test language models [GPT-2, FiD T5, DeBERTa-v3] on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus.
In sum, Autocast poses a novel challenge for large language models, and improved performance could bring large practical benefits.
Because it relies on scarce human expertise, forecasting is only used for a small number of questions. This motivates using ML to automate forecasting, e.g., by automating human information retrieval (finding news sources), reasoning (deciding whether some evidence bears on a forecast), and quantitative modeling. ML models may also have some advantages over human forecasters. Models can read through text or data much faster than humans and can discern patterns in noisy high-dimensional data that elude humans. When it comes to learning, humans cannot be trained on past data in a manner simulating actual forecasting (e.g., how likely was the Soviet Union's collapse from the viewpoint of 1980?) because they already know the outcomes, but past data can be used to train ML models.
Figure 1: Example from the Autocast dataset, including the question, the resolution of the question, and the time series of aggregate human expert forecasts (Crowd) from the start date to the time the question resolves. We train a language model to generate forecasts at each timestep, using only news articles available at that timestep (i.e., without allowing any leakage of information from the future).
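The leakage-free setup described in the caption amounts to a date filter on the news corpus: a forecast made at a given timestep may only condition on articles published on or before that date. A minimal illustration, using a hypothetical `articles_available` helper rather than the paper's actual retrieval code:

```python
from datetime import date

def articles_available(corpus, forecast_date):
    """Return the articles a forecaster could have read on `forecast_date`.

    corpus: iterable of (publish_date, article_text) pairs, in any order.
    Only articles published on or before `forecast_date` are returned,
    preventing any leakage of information from the future.
    """
    return [text for pub_date, text in corpus if pub_date <= forecast_date]
```

The retriever then ranks only the surviving articles, so simulated past forecasts see exactly the information that was publicly available at the time.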
Related Work: Forecasting: A recent experiment (Bonde, 2022) tested GPT-3 in the few-shot setting on true/false questions collected from Metaculus (one of the sources for Autocast). However, since questions were not filtered by date, some answers would have appeared in GPT-3's training data. Similar to our work, ForecastQA (Jin et al., 2021) is a dataset of forecasting questions that covers a range of topics. However, ForecastQA's questions were written by crowdworkers without forecasting experience. Consequently, the questions are often nonsensical or ambiguous given the lack of additional context, e.g., "To how many people will the Representative of an internet speak to by September 2019?", or "In July 2019, will an article say there were no volunteers in 2016?". We found that a high percentage of ForecastQA questions suffer from these issues. By contrast, our questions were written by experienced forecasters and are always unambiguous given the full question description. Finally, ForecastQA's human baseline was collected retrospectively (making it unrealistic), whereas our dataset contains expert human forecasts from real forecasting questions.
Table 2: Model accuracy on the Autocast dataset for each question type: true/false (T/F), multiple-choice (MCQ), and numerical (Numerical). For Numerical, lower is better; for the other metrics, higher is better. The FiD Static model (based on T5) retrieves the top 10 news articles over the question's active period, while FiD Temporal (based on GPT-2 with a T5 encoder) retrieves the top 1 article each day. Averaging over all model sizes, we find that FiD Temporal achieves the best average performance.
Calibration: Experiments: We fine-tune DeBERTa-v3 models (He et al., 2020) to predict a point estimate and a set of confidence intervals corresponding to the confidence levels in the RMS calibration error metric. At a high level, we use a loss with 3 components: (1) MSE loss between the predicted point estimate and the ground-truth target, (2) MSE loss between the boundaries of the predicted confidence intervals and the ground-truth target, applied only to boundaries that are on the wrong side of the target, (3) a penalty on the length of the predicted intervals to encourage tighter predictions. The models are trained for 5 epochs with a batch size of 100. A detailed description is in the Supplementary Material. We show results in Table 3: all 3 metrics decrease with model size.
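A minimal sketch of this 3-component loss for a single confidence interval, in NumPy for readability. The component weights, the squared hinge form of the boundary term, and the mean-length penalty are assumptions; the paper's exact formulation is in its Supplementary Material:

```python
import numpy as np

def interval_loss(point_pred, lower, upper, target,
                  coverage_weight=1.0, length_weight=0.1):
    """Sketch of the 3-part loss. All array arguments have shape (n,);
    the weights are illustrative, not the paper's values.

    (1) MSE between the point estimate and the target.
    (2) Squared penalty on interval boundaries that are on the wrong
        side of the target (i.e., the target falls outside the interval).
    (3) Penalty on interval length, encouraging tighter intervals.
    """
    point_term = np.mean((point_pred - target) ** 2)
    # boundary violations: nonzero only when the target is outside the interval
    lower_viol = np.maximum(lower - target, 0.0)   # target below the lower bound
    upper_viol = np.maximum(target - upper, 0.0)   # target above the upper bound
    coverage_term = np.mean(lower_viol ** 2 + upper_viol ** 2)
    length_term = np.mean(upper - lower)
    return point_term + coverage_weight * coverage_term + length_weight * length_term
```

The length penalty opposes the boundary penalty: without it, the model could trivially satisfy (2) by predicting arbitrarily wide intervals.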
Table 3: Results for DeBERTa-v3 models trained to output confidence intervals on our dataset of numerical predictions. The high dynamic range of the targets leads to large confidence intervals, but median interval size decreases with larger models, as does RMS calibration error.