2015-04-06

Spurious Correlations in (Not So) Big Data

Synopsis

I performed simulations to determine the proportion of spurious correlations in random matrices. Analyzing matrices with the number of variables ranging from 100 to 10,000 and 10 observations per variable, I found that the proportion of spurious correlations is roughly constant and does not depend on the number of variables.

Introduction

Spurious correlations can be a source of humor, but recently John P. A. Ioannidis, as well as Campbell Harvey and Yan Liu, presented evidence that many conclusions in science and finance are the product of spurious correlations rather than true causal relationships.

Data Science Central formulated a question based on these observations:

Spurious correlations in big data, how to detect and fix it. You have n = 5,000 variables uniformly distributed on [0,1]. What is the expected number m of correlations that are above r = 0.95. Perform simulations or find theoretical solution. Try with various values of n (from 5,000 to 100,000) and r (from 0.80 to 0.99) and obtain confidence intervals for m (m is a function of n and r). Identify better indicators than correlation to measure whether two time series are really related. The purpose here is twofold: (1) to show that with big data, your strongest correlations are likely to be spurious, and (2) to identify better metrics than correlation in this context.

Taking this post as a starting point, I attempted to create a model relating the proportion of spurious correlations above a threshold Pearson coefficient (rho) to the number of variables in a random data set.

Hardware and Software

I performed this analysis using a MacBook 2,1 running OS X 10.6.8 with an Intel Core 2 Duo processor and 2 GB of RAM. The software was R 3.1.2 run within RStudio version 0.98.1087.

Preparing Simulation Data

The first packages loaded are dplyr for data manipulation and subsetting, WGCNA for its corFast correlation matrix function (Note: WGCNA has several non-CRAN dependencies), and randtoolbox for its random number generators.
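
Because of those non-CRAN dependencies, installing WGCNA requires its Bioconductor prerequisites first. The sketch below is one way to do this; the exact dependency list (impute, GO.db, preprocessCore, AnnotationDbi) is an assumption and may vary with the WGCNA version.

# One-time installation sketch; the Bioconductor dependency list is assumed
source("http://bioconductor.org/biocLite.R")
biocLite(c("impute", "GO.db", "preprocessCore", "AnnotationDbi"))
install.packages(c("dplyr", "WGCNA", "randtoolbox"))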

library(dplyr)
library(WGCNA)
library(randtoolbox)

The Spurious.cor function is used to simulate and summarize the data. It works by:

  1. Generating a matrix of random numbers uniformly distributed within [0,1]
  2. Creating a correlation matrix
  3. Calculating the proportion of correlations above a certain threshold
  4. Repeating this process for a specified number of iterations
  5. Summarizing the simulation by returning the mean and standard deviation for each correlation threshold.

Spurious.cor <- function(variables, observations,
                         num.iter = 100, rand.gen = congruRand) {
# Calculates the average proportion of correlations >= 0.8, 0.9, 0.95, 0.98, & 0.99
#
# Args:
#       variables: number of variables (columns)
#       observations: number of observations (rows)
#       num.iter: number of iterations (default 100)
#       rand.gen: random number generator (default congruRand)
# Returns:
#       A dataframe with 1 row
#       c(variables, observations, rho.8_mean, rho.9_mean, rho.95_mean, rho.98_mean,
#        rho.99_mean, rho.8_sd, rho.9_sd, rho.95_sd, rho.98_sd, rho.99_sd)

        # Divisor is the number of off-diagonal (non-self) entries in cor.matrix.
        # Both the numerator and the divisor count each pair twice because the
        # matrix is symmetric, so the division by 2 cancels and is omitted.
        divisor <- (variables^2 - variables)

        result.df <- data.frame()
        for (index in 1:num.iter) {
                random.matrix <- rand.gen(observations, dim = variables)
                cor.matrix <- corFast(random.matrix)

                rho.8 <- (sum(cor.matrix >= 0.8) - variables)/divisor
                rho.9 <- (sum(cor.matrix >= 0.9) - variables)/divisor
                rho.95 <- (sum(cor.matrix >= 0.95) - variables)/divisor
                rho.98 <- (sum(cor.matrix >= 0.98) - variables)/divisor
                rho.99 <- (sum(cor.matrix >= 0.99) - variables)/divisor
                rm(random.matrix)
                result <- c(variables, observations,
                            rho.8, rho.9, rho.95, rho.98, rho.99)
                result.df <- rbind(result.df, result)
        }
        names(result.df) <- c("variables", "observations",
                              "rho.8", "rho.9", "rho.95", "rho.98", "rho.99")
        summary.row <- summarise_each(result.df, funs(mean, sd))  %>%
                select(variables = variables_mean,
                       observations = observations_mean,
                       3:7, 10:14)
        return(summary.row)
}

I vary the number of variables in the range [100, 10000], increasing on an approximately logarithmic scale. In the Data Science Central blog post that first presented the spurious correlation problem, Vincent Granville discussed using either 10 or 20 observations, arguing that many Big Data problems are limited to sample sizes in this range. For this study, I used 10 observations for each variable.

set.seed(1260)
spurious.df <- data.frame()
obs <- 10
iters <- 100
var.list <- c(100, 500, 1000, 5000, 10000)
for (num.var in var.list) {
        new.row <- Spurious.cor(num.var, obs, iters)
        spurious.df <- rbind(spurious.df, new.row)
}

I next tidy the data by creating three new variables: rho, mean, and sd. This information, along with the number of iterations, is used to calculate the margin of error, me.

library(tidyr)

spurious.means <- spurious.df %>%
        gather(key = rho, value = mean, rho.8_mean:rho.99_mean) %>%
        select(variables, observations, rho, mean) %>%
        mutate(rho = extract_numeric(rho))

spurious.sd <- spurious.df %>%
        gather(key = rho, value = sd, rho.8_sd:rho.99_sd) %>%
        select(variables, observations, rho, sd) %>%
        mutate(rho = extract_numeric(rho))

spurious.tidy <- inner_join(spurious.means, spurious.sd,
                            by = c('variables' = 'variables',
                                'rho' = 'rho', 'observations' = 'observations')) %>%
        mutate(me = qt(p = 0.99, df = iters - 1)*sd/sqrt(iters))

Results

I graph the mean proportion of spurious correlations against the log of the number of variables.

library(ggplot2)
limits <- aes(ymax = mean + me, ymin = mean - me)
ggplot(spurious.tidy, aes(x = log(variables), y = mean)) +
        geom_line() +
        geom_pointrange(limits) +
        facet_grid(rho ~ ., scales = 'free') +
        labs(y = 'Proportion Spurious Correlations')

Figure 1. Mean & 99% CI for Spurious Correlations

The graph indicates that the proportion of spurious correlations approaches a constant as the number of variables increases. In the smaller data sets, there is considerable variability in the proportion of spurious correlations, but the margin of error shrinks as the number of variables increases.

I create a generalized linear model for the proportion of spurious correlations as a function of rho and the number of variables.

spurious.model <- glm(mean ~ log(variables) + rho, data = spurious.tidy)

summary(spurious.model)
##
## Call:
## glm(formula = mean ~ log(variables) + rho, data = spurious.tidy)
##
## Deviance Residuals:
##        Min          1Q      Median          3Q         Max
## -0.0008254  -0.0002436   0.0002353   0.0004000   0.0004734
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.594e-02  1.421e-03  11.222 1.42e-10 ***
## log(variables)  1.800e-06  6.165e-05   0.029    0.977
## rho            -1.652e-02  1.459e-03 -11.324 1.20e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.567075e-07)
##
##     Null deviance: 3.8568e-05  on 24  degrees of freedom
## Residual deviance: 5.6476e-06  on 22  degrees of freedom
## AIC: -303.63
##
## Number of Fisher Scoring iterations: 2

The coefficient of log(variables) is very small and is not statistically significant.
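
As a quick check, the coefficient and an approximate confidence interval can be pulled directly from the fitted model. This is a minimal sketch using the spurious.model object above; the Wald interval from confint.default is an approximation.

coef(summary(spurious.model))["log(variables)", ]   # estimate, SE, t value, p value
confint.default(spurious.model, "log(variables)")   # approximate 95% CI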

Discussion

The results of this study indicate that the proportion of spurious correlations in a random data set is essentially constant with respect to the number of variables.

filter(spurious.tidy, variables == 10000) %>%
        mutate(est.corr.pairs = mean*10000) %>%
        select(rho, "Proportion Spurious" = mean, "Margin of Error" = me,
               "Estimated Correlated Pairs per 10K Variables" = est.corr.pairs)
##    rho Proportion Spurious Margin of Error
## 1 0.80        3.174526e-03    2.517228e-06
## 2 0.90        2.657848e-04    6.208080e-07
## 3 0.95        2.038044e-05    1.472894e-07
## 4 0.98        6.004600e-07    2.688135e-08
## 5 0.99        3.880388e-08    5.968261e-09
##   Estimated Correlated Pairs per 10K Variables
## 1                                 3.174526e+01
## 2                                 2.657848e+00
## 3                                 2.038044e-01
## 4                                 6.004600e-03
## 5                                 3.880388e-04

For a large data set with 10,000 variables, the results suggest that roughly 30 pairs of variables will have a Pearson correlation coefficient >= 0.8 and roughly 3 pairs will have a coefficient >= 0.9. For the thresholds of 0.95, 0.98, and 0.99, the estimate falls below one correlated pair per 10,000 variables.

These results seem to confirm that “with big data, your strongest correlations are likely to be spurious”. When using a data set with approximately 10,000 variables, an analyst can be fairly confident that a correlation >= 0.95 between two particular variables isn’t a coincidence, but this confidence decreases as she examines more and more variables. Further, this suggests a real danger in building a model from a small number of observations and small to medium correlations.

An area for further research is to determine whether the proportion of spurious correlations is inversely related to the number of observations. Intuitively, this should be true: if an analyst increases the sample size, there is a smaller probability that the observed correlations are coincidental.
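
A rough sketch of how this could be explored with the existing Spurious.cor function is shown below; the observation counts and the fixed 1,000-variable setting are illustrative choices, not part of this study.

obs.study.df <- data.frame()
for (num.obs in c(10, 20, 50, 100)) {
        new.row <- Spurious.cor(1000, num.obs, num.iter = 100)
        obs.study.df <- rbind(obs.study.df, new.row)
}
obs.study.df

If the intuition holds, each rho column of obs.study.df should shrink as the number of observations grows.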

For applications where the number of observations must be small, is there a viable alternative to correlation to find relationships? Granville presents the idea of “strong correlation”; however, strong correlation is computationally inefficient and spurious correlations are still present.

Finally, why is the proportion of spurious correlations constant? Is this a product of using a pseudo-random number generator rather than true random numbers, or is it statistical noise inherent to the problem?
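
One way to probe the generator question with the existing code is to rerun a single configuration while swapping the rand.gen argument. The sketch below assumes randtoolbox's SFMT generator as an alternative to the default congruRand; similar proportions across generators would suggest the result is not an artifact of a particular pseudo-random number generator.

congru.row <- Spurious.cor(1000, 10, num.iter = 100, rand.gen = congruRand)
sfmt.row   <- Spurious.cor(1000, 10, num.iter = 100, rand.gen = SFMT)
rbind(congru.row, sfmt.row)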
