ā€œGeoLLM: Extracting Geospatial Knowledge from Large Language Modelsā€, Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, Stefano Ermon (2023-10-10):

The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power. Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks.

We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density.

We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
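To make the idea concrete, here is a minimal sketch of how a coordinate can be combined with auxiliary map context into a text prompt. The function name, the wording of the prompt, and the example place names are illustrative assumptions, not the paper's actual template:

```python
# Hypothetical sketch of a GeoLLM-style prompt: coordinates plus nearby
# map features (e.g. from OpenStreetMap) rendered as plain text for an LLM.
def build_prompt(lat, lon, nearby_places):
    """Combine coordinates with auxiliary map context into a text prompt.

    nearby_places: list of (name, distance_km, direction) tuples.
    """
    address = f"Coordinates: ({lat:.4f}, {lon:.4f})"
    context = "\n".join(
        f"{dist_km:.1f} km {direction}: {name}"
        for name, dist_km, direction in nearby_places
    )
    return (
        f"{address}\n"
        f"Nearby places (from map data):\n{context}\n"
        "On a scale of 0.0 to 9.9, the population density of this area is:"
    )

# Illustrative usage with made-up places:
places = [("Springfield (town)", 1.2, "north"), ("Riverside Mall", 3.5, "east")]
print(build_prompt(40.7128, -74.0060, places))
```

The key point is that the map context, not the raw coordinates, is what lets the LLM ground its prediction; the abstract notes that coordinates alone are ineffective.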

Across these tasks, our method achieves a 70% improvement in performance (measured with Pearson's r²) over baselines that use nearest neighbors or information directly from the prompt, and matches or exceeds satellite-based benchmarks in the literature.
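The evaluation metric here, Pearson's r², is the squared Pearson correlation between predictions and ground truth. A self-contained reference implementation (assuming equal-length, non-constant inputs):

```python
# Pearson's r^2: squared linear correlation between predictions and targets.
def pearson_r2(y_true, y_pred):
    """Return the squared Pearson correlation coefficient."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mt) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov ** 2 / (var_t * var_p)

print(pearson_r2([1, 2, 3], [2, 4, 6]))  # perfectly correlated -> 1.0
```

Because r² measures linear association rather than absolute error, a model can score well even if its predictions are offset or scaled relative to the ground truth.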

With GeoLLM, we observe that GPT-3.5 outperforms LLaMA-2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset.

Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe. Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well.

Figure 3: Mean Pearson's r² for each model across all tasks at 1,000 training samples.
Figure 4: Learning curves for population density task from WorldPop.

…Not only does GPT-3.5 outperform all other models on every test, but its performance is also relatively consistent across tasks and sample sizes. This shows that GPT-3.5 is resilient both to the size of the prediction areas (e.g., a square kilometer vs. a ZIP code area) and to added noise (e.g., ā€œjitteredā€ coordinates). LLaMA-2 also beats all baselines on 18 of the 19 total tests and consistently outperforms RoBERTa. RoBERTa consistently beats all baselines with 10,000 training samples but struggles at lower sample sizes. All models suffer a large drop in performance when the sample size is reduced to 100; however, GPT-3.5 and LLaMA-2 retain a much more acceptable level of performance than the others, underscoring their sample efficiency. Specifically, with 100 samples, GPT-3.5 performs 3.1Ɨ better than the best baseline and LLaMA-2 1.8Ɨ better.
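For context, the nearest-neighbor baseline mentioned above can be sketched as follows: predict the label of a query location as the average over the k geographically closest training points. The haversine distance and k=3 averaging here are illustrative choices, not necessarily the paper's exact setup:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn_predict(train, query, k=3):
    """Average the labels of the k geographically nearest training points.

    train: list of (lat, lon, label) tuples; query: (lat, lon) tuple.
    """
    nearest = sorted(
        train, key=lambda t: haversine_km(query[0], query[1], t[0], t[1])
    )
    return sum(label for _, _, label in nearest[:k]) / k
```

A baseline like this relies entirely on spatial proximity, which explains why the LLM-based methods pull far ahead at small sample sizes: with only 100 training points, the nearest neighbors of a query may be very far away.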

GPT-3.5 and LLaMA-2 do especially well on the population tasks from WorldPop and the USCB relative to the baselines. GPT-3.5 is also especially impressive on the home-value task from Zillow, with a Pearson's r² of up to 0.87. However, the difference in performance between the models is less pronounced for the tasks from the DHS. This might be due to the noise added when the coordinates for these tasks are ā€œjitteredā€ by up to 5 kilometers; with that added noise, a high Pearson's r² is potentially harder to achieve.

As shown in Figure 3, GPT-3.5, LLaMA-2, and RoBERTa perform 70%, 43%, and 13% better on average than the best baseline (XGBoost), respectively, with 1,000 samples, indicating that the method scales well with model capability. Figure 4 again shows that the sample efficiency of LLaMA-2 and GPT-3.5 is exceptional, especially when making predictions on a global scale. Note that with larger sample sizes the performance gaps will shrink, as the physical distances between training coordinates and test coordinates become smaller.