The widespread use of experimental benchmarks in AI research has created competition and collaboration dynamics that are still poorly understood. Here we provide a novel methodology to explore these dynamics and to analyse how the different entrants in these challenges, from academic groups to tech giants, behave and react in response to their own or others’ achievements.
We perform an analysis of 25 popular benchmarks in AI from Papers With Code [HMDB-51 · UCF101 · Montezuma’s Revenge · Space Invaders · CIFAR-100 · ImageNet · CIFAR-10 · Set5 · enwik8 · Penn Treebank · WN18RR · WMT2014 English-French · WMT2014 English-German · CoNLL 2003 · Ontonotes v5 · COCO Minival · COCO test-dev · MPII Human Pose · SQuAD 1.1 · WikiQA · Cityscapes test · PASCAL VOC 2012 test · IMDb · SST-2 Binary classification · LibriSpeech], with around 2,000 result entries overall, connected with their underlying research papers. We identify links between researchers and institutions (that is, communities) beyond the standard co-authorship relations, and we explore a series of hypotheses about their behaviour, as well as some aggregated results in terms of activity, performance jumps and efficiency. We characterize the dynamics of research communities at different levels of abstraction, including organization, affiliation, trajectories, results and activity.
We find that hybrid, multi-institution and persevering communities are more likely to improve state-of-the-art performance, which becomes a watershed for many community members.
Although the results cannot be extrapolated beyond our selection of popular machine learning benchmarks, the methodology can be extended to other areas of artificial intelligence or robotics, and combined with bibliometric studies.
Table 1: List of AI benchmarks used in the analysis, by task
Figure 1: Progress in accuracy over time for ImageNet. Coloured shapes show the different communities (with one or more institutions in the legend). Dashed lines show the global SOTA front (in grey) for all the entries (results) and the local SOTA front per community (in colour). The blue dotted line shows the smoothed means (all results) with 95% confidence level intervals. Different shapes indicate the types of institution (companies, universities or hybrid).
…Figure 1 represents the results for the ImageNet dataset, which consists of 1.2 million images in 1,000 classes. The results of the different communities show that several long-term collaborative ‘hybrid’ groups, formed mostly by American universities (Johns Hopkins, University of California, Los Angeles, Cornell, Stanford, Toronto and so on) in collaboration with tech giants (Microsoft and Google), are those that have dominated the SOTA front from early on (communities numbered #1 and #2). Although hybrid communities dominate the SOTA front, there are also some isolated company players, possibly representing different divisions, departments and research groups from companies such as Google, Xiaomi, Facebook and Microsoft. However, only a single non-hybrid community, Google, reaches the SOTA front.
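The global SOTA front shown in the figures is simply the running maximum of benchmark results over time: an entry belongs to the front (a ‘SOTA jump’) if it improves on every earlier result. A minimal sketch of this computation, using hypothetical result entries rather than the actual Papers With Code data:

```python
from datetime import date

# Hypothetical (entry_date, accuracy) results for one benchmark;
# real entries come from Papers With Code result tables.
entries = [
    (date(2013, 6, 1), 61.0),
    (date(2014, 3, 1), 66.7),
    (date(2014, 9, 1), 64.2),   # below the running best: not a SOTA jump
    (date(2015, 2, 1), 73.8),
]

def sota_front(results):
    """Return the entries that improve on every earlier result
    (the global SOTA front), in chronological order."""
    front, best = [], float("-inf")
    for when, score in sorted(results):
        if score > best:          # this entry is a 'SOTA jump'
            front.append((when, score))
            best = score
    return front

print(sota_front(entries))
```

Applying the same function to the subset of entries belonging to one community yields that community’s local (per-community) front shown in colour in the figures.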
…While all benchmark plots can be found in the Supplementary Information (Supplementary Figures 4–6), we include another example here, in Figure 2. This is the Stanford Question Answering Dataset (SQuAD 1.1), a reading comprehension benchmark with more than 100,000 question-answer pairs from more than 500 articles. Questions derive from Wikipedia articles where the answer may be a segment of text from the corresponding reading passage, or may be unanswerable (for example, written adversarially to look similar to answerable ones). Like ImageNet, the SOTA front is dominated by hybrid long-term collaboration groups (communities numbered #2 and #3) formed by American universities (Stanford, Carnegie Mellon, Washington and so on) in collaboration with tech giants (Facebook and Google), but also by large hybrid communities formed by Asian universities (Beihang, Fudan, Peking and so on) jointly with Microsoft (community #1). We also observe that the participation of European university initiatives is very low. Unlike ImageNet, most entries correspond to the period 2016–2018, with a clear decline in activity in 2018–2020. This is probably due to the introduction of the new (and much more difficult) version of the benchmark (SQuAD 2.0), with attention moving to the new challenge. However, SQuAD 1.1 is still being addressed by communities #2 and #3, which have participated since 2016 and have led the SOTA in 2018–2020. Again, we see that long-term collaborative groups obtain better results than isolated communities.
…From the above results, we reach a number of conclusions about the dynamics of communities engaging with AI benchmarks. We find that (1) SOTA jumps are mainly obtained by multi-institution communities rather than single-institution communities; (2) multi-attempt communities are more likely to achieve SOTA jumps than one-shot efforts; (3) jumps are mainly obtained by hybrid communities involving both universities and companies, meaning that heterogeneous communities achieve more success through collaborative efforts than ‘pure’ communities (only universities or only companies); and, finally, (4) the presence of companies in a community, such as Google, Microsoft and Facebook, increases the odds of achieving a jump in an AI benchmark. All of the above reinforces the usefulness of the increasing tendency towards collaboration between universities and industry in AI research.
Figure 3: Most prolific institutions (at least 10 entries) in terms of total SOTA jump entries and activity. Both axes are logarithmic. Point size represents the efficiency (ratio between number of SOTA jumps and attempts). Note that ‘Academic’ represents both higher education and independent research institutions. We refer the reader to Supplementary Table 1 for further detail about the abbreviations of the institutions.
…While institutions from the United States represent about 56.7% of all jumps, China only represents about 18%. However, the gap becomes smaller if we consider only recent years. For instance, Table 3 (orange) shows the same results for the year 2019 only. Here the institutions from Asia come to the forefront in terms of activity compared with those from America. At the country level, activities from the United States and China are much more similar (41% versus 37%) and, although the United States keeps leading the chart with respect to the number of SOTA jumps compared with China (54% versus 26%), the difference has narrowed. This country-level concentration is also reflected when we compute the (country-wise) HHI to analyse concentration and competitiveness. In this case, the HHI is 0.33, showing a much higher concentration level per country compared with the analysis per institution.
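The Herfindahl-Hirschman Index (HHI) used here is the sum of the squared shares of each player: it equals 1/n when n players have perfectly even shares and 1.0 under a monopoly, so higher values mean more concentration. A minimal sketch with illustrative country-level shares of SOTA jumps (hypothetical round numbers, not the paper’s exact data):

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared shares.
    Shares are normalized so they sum to 1 before squaring."""
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares)

# Illustrative shares of SOTA jumps by country (assumed values for the sketch)
shares = {"United States": 0.54, "China": 0.26, "others": 0.20}
print(round(hhi(shares.values()), 2))
```

Computed over institutions instead of countries, the shares are spread across many more players, which is why the per-institution HHI reported in the analysis is much lower than the country-wise 0.33.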
These results are loosely consistent with analyses framing AI research progress as a ‘race’ being led by the United States and China…The data we analyse here represent only a small snapshot of all AI progress, but they still suggest that the United States has had a relevant lead over the whole period, although the gap is being reduced by other countries such as China (Table 3). Over the whole period, as Figure 3 shows, 6 out of the top 10 institutions are from the United States (the top 3 being tech giants).