Visualizing Active Learning Sample-Efficiency

Visualizing the sample efficiency of adaptiveness in active learning using R and the pedagogical example of binary search on an interval.

In statistical estimation theory (like design of experiments), decision theory, and reinforcement learning (especially active learning), much emphasis is put on the careful selection of data: the right data point can be worth more than millions of the wrong ones.

This is true in areas like causal inference, where a single small randomized experiment on a few people can outweigh confounded correlational data from an entire country; it is true for dealing with systemic bias; but it is also true at the level of simple sample efficiency in estimating anything, even something as simple as a mean.

An analogy might be searching a database: if you can only sample random rows, this may work reasonably well for estimating “how much does the average customer spend?”, but you will have to sample for a long time to get a good estimate of “how much did Japanese customers spend on books yesterday during the book sale?”, whereas if you could specify which rows to pick, you’d finish instantly.
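To make the database analogy concrete, here is a small R sketch (a toy illustration: the `customers` table, its column names, and its rates are all invented for this example) of how random row sampling mostly misses a rare subgroup that a targeted query hits directly:

```r
set.seed(2022)
# Toy customer table: only ~1% of rows belong to the subgroup of interest.
n <- 100000
customers <- data.frame(country=sample(c("JP","other"), n, replace=TRUE, prob=c(0.01,0.99)),
                        spend=rexp(n, rate=1/50))

# Random sampling: of 1,000 random rows, only ~10 land in the subgroup,
# so the subgroup-mean estimate rests on a handful of datapoints:
randomRows <- customers[sample(n, 1000),]
sum(randomRows$country == "JP")

# Targeted selection: query the subgroup directly, so every row is relevant:
targetedRows <- customers[customers$country == "JP",]
mean(targetedRows$spend)
```

With ~1,000 subgroup rows available at once, the targeted estimate's standard error is roughly 10× smaller than a random-sampling estimate built on ~10 relevant rows.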


Here is another toy problem to help visualize the advantage of agency/choice in active learning: some parameter P is uniformly distributed on 0–1; we would like to estimate it, but can only query whether a chosen real number is greater or less than P.

Random sampling 0–1 will constrain the range, but extremely slowly, because after the first few samples, it is ever more unlikely that the next random sample will fall within the remaining interval of possible values for P: it must fall closer to P than all n samples before it in order to contain any information. An active learning approach, however, which chooses to sample a random point inside that interval, becomes essentially a binary search, and homes in so fast that it causes floating-point issues in my toy implementation.

A typical result is that after 100 samples, the random search will have an interval width of 0.0129678266 (1.2 × 10^−2) vs the active search's floating-point minimum value of 5.55111512 × 10^−17, or ~14 orders of magnitude narrower. It would take a very long time for the random search to match that!
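That slowdown can be quantified with order statistics (this check is my own addition, separate from the animation code below): after n uniform random samples, the interval bracketing P is a size-biased spacing with expected width 2/(n+2), so it shrinks only linearly in n, whereas active sampling shrinks the interval geometrically. A quick R simulation of the random-sampling case:

```r
set.seed(2022)
# Simulate the width of the interval bracketing a uniform target
# after n uniform random samples:
n <- 100
widths <- replicate(10000, {
    target  <- runif(1)
    samples <- runif(n)
    lower <- max(c(0, samples[samples < target]))  # closest sample below the target
    upper <- min(c(1, samples[samples > target]))  # closest sample above the target
    upper - lower
})
mean(widths)  # simulated expected width
2/(n+2)       # order-statistics prediction: ~0.0196 for n = 100
```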

(You can see this as like the difference between actively searching a sorted list with binary search, which requires only 𝒪(log n) checks to locate a desired item, and picking random items out and hoping you eventually randomly pick the two which bracket the desired item, which is similar to the coupon collector's problem, and inefficiently requires ~𝒪(n · log n) picks.)
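The sorted-list version of the contrast can be sketched in a few lines of R (standalone toy code, not part of the animation code below): binary search needs ~log₂ n comparisons, while randomly picking indices until the best picks bracket the item requires eventually hitting two specific indices, which takes on the order of n draws or more:

```r
set.seed(2022)
xs <- sort(runif(1000)); targetIndex <- 617; target <- xs[targetIndex]

# Binary search: halve the candidate range each comparison, O(log n) total.
lo <- 1; hi <- length(xs); checks <- 0
while (lo < hi) {
    mid <- floor((lo + hi) / 2); checks <- checks + 1
    if (xs[mid] < target) { lo <- mid + 1 } else { hi <- mid }
}
checks  # ~10 comparisons for n = 1,000

# Random picking: draw random indices until the best picks below & above
# exactly bracket the item, ie. until both neighboring indices are hit.
picks <- 0; below <- 0; above <- length(xs) + 1
while (above - below > 1) {
    i <- sample(length(xs), 1); picks <- picks + 1
    if (i <= targetIndex) { below <- max(below, i) } else { above <- min(above, i) }
}
picks   # typically ~1,500 draws (expected 3n/2 to hit 2 specific indices)
```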

Below we generate simulations of sequentially sampling n = 1–100 points either randomly (left) or ‘actively’ (right), plotting the interval as it shrinks & points either fall in or out. We can see in each scenario that the active-learning sampling homes in almost instantly on the target point, with every selected point being informative (solid black dot), while the simple random sampling wastes an increasing fraction of its samples on uninformative ones (hollow dot):

Searching an interval for a point, random sampling efficiency vs active-learning sampling efficiency:
actively sampling a random point within the remaining possible interval in effect binary-searches, while random sampling becomes arbitrarily inefficient, as each new sample must fall closer to the target than all n prior samples.

R source code for above animation:

library(animation)
library(ggplot2)
library(gridExtra)

guessStrategy <- function(d, random=TRUE) {
    # either sample uniformly over the whole 0--1 range (random strategy),
    # or uniformly inside the narrowest interval found so far (active strategy):
    if (random) { runif(1); } else { runif(1, min=max(d$LowerBound), max=min(d$UpperBound)); }
}
simulateSearch <- function(useRandomStrategy=TRUE, maxSamples=100, target=runif(1)) {
    df <- data.frame(N=seq(1, maxSamples), Guess=rep(0, maxSamples),
        LowerThan=logical(maxSamples),
        LowerBound=rep(0, maxSamples), UpperBound=rep(1, maxSamples))

    for (i in 1:maxSamples) {
      currentSample <- guessStrategy(df[df$N <= i,], random=useRandomStrategy)
      lower <- currentSample < target
      df[i,] <- list(N=i, Guess=currentSample, LowerThan=lower,
        # tighten a bound only when the new sample improves on the best so far:
        LowerBound={ if (lower && currentSample > max(df[df$N <= i,]$LowerBound)) { currentSample; }
                      else { max(df[df$N <= i,]$LowerBound); } },
        UpperBound={ if (!lower && currentSample < min(df[df$N <= i,]$UpperBound)) { currentSample; }
                      else { min(df[df$N <= i,]$UpperBound); } }
        )
    }
    df$IntervalWidth <- df$UpperBound - df$LowerBound
    # an iteration is 'informative' if it shrank the interval (or the interval
    # has already collapsed to floating-point precision):
    df$Decreased <- head(c(1, df$IntervalWidth)<=1e-14 |
                         (c(df$IntervalWidth,0) != c(0, df$IntervalWidth)), -1)

    return(df)
}

plotSearchResults <- function(df, typeXLabel="Sampled datapoint") {
    return(qplot(df$Guess, 1:nrow(df)) +
        # show whole 0--1 range, to avoid misleading scale-zoom effects & keep animation 'fixed in place'
        coord_cartesian(xlim=c(0,1)) +
        # the true parameter we're trying to estimate:
        geom_vline(xintercept=currentTarget, color="black") +
        # the narrowest interval at each iteration:
        geom_segment(aes(x=df$UpperBound, xend=df$LowerBound, y=1:nrow(df), yend=1:nrow(df), size=I(1.8))) +
        # whether our measurement at each iteration was useful to decrease the interval:
        geom_point(size=I(7), aes(shape=if(all(df$Decreased)){df$Decreased;}else{!(df$Decreased);})) +
        scale_shape_manual(values=c(19,1)) +
        # overall GUI settings for clean monochrome theme:
        ylab("Iteration") + xlab(typeXLabel) +
        theme_bw(base_size=46) + theme(legend.position = "none")
        )
}

saveGIF(for (i in 1:200){
         currentTarget <- runif(1)

         d1 <- simulateSearch(TRUE, target=currentTarget)
         p1 <- plotSearchResults(d1, "Random sampling")
         d2 <- simulateSearch(FALSE, target=currentTarget)
         p2 <- plotSearchResults(d2, "Active-learning")

         print(grid.arrange(p1, p2, ncol=2))
    },
    ani.width = 1200, ani.height=1200,
    movie.name = "2022-07-25-orderstatistics-activelearningvsrandomsearch-200simulationruns.gif")
