Multilevel modeling of effect of small group’s meditation on math errors
A small group of Quantified Selfers tested themselves daily on arithmetic and engaged in a month of meditation. I analyze their scores with a multilevel model with persubject grouping, and find the expect result: a small decrease in arithmetic errors which is not statisticallysignificant, with practice & timeofday effects (but not dayofweek or weekend effects). This suggests a longer experiment by twice as many experimenters in order to detect this effect.
Quantified Self hobbyist Peter Lewis in April 2013_{11ya} began a selfexperiment with 11 other hobbyists (with the results posted on Seth Robert’s blog): they used a cellphone app “Math Workout” to do timed simple arithmetic problems (eg. “3.7 + 7.3, 93 + 18, 14 * 7, and 12^{2} + √9”) in an ABA experimental design, where B was practicing mindfulness meditation. This was motivated in part by the observation that, while there is a ton of scientific research on meditation, the studies are often of doubtful quality (he cited a remarkable 2007 collection of meditation metaanalyses).
Lewis thought the results were positive inasmuch as his timings & error rate fell during the meditation period and the falls seemed visually clear when graphed, but some of the participants dropped out and others had graphs that were not so positive.
The interpretation of the data from a statistical point of view was not terribly clear. Lewis & Roberts did not do the analysis. Reading the post, I thought that it was a perfect dataset to try out multilevel models on, as this sort of psychology setup (multiple groups of data per subject, and multiple subjects) is a standard usecase.
Data
The original data was provided by Lewis in an Excel spreadsheet, but formatted in a way I couldn’t make use of. I extracted the data, turned it into a ‘long’ rowbyrow format, added IDs for each subject, transformed the dates into R format, and added a variable for each score whether it was on a meditation day. The result was a CSV:
mdtt < read.csv("https://gwern.net/doc/nootropic/quantifiedself/2013peterlewismeditation.csv",
colClasses=c("factor","Date","character","integer","numeric","logical"))
# Subject Date TimeOfDay ErrorCount TestDuration
# 00 :173 Min. :20130411 Length:886 Min. : 0.00 Min. : 61.2
# 03 :126 1st Qu.:20130416 Class :character 1st Qu.: 1.00 1st Qu.: 95.3
# 01 :107 Median :20130421 Mode :character Median : 2.00 Median : 119.1
# 04 :106 Mean :20130421 Mean : 3.04 Mean : 126.1
# 06 : 99 3rd Qu.:20130427 3rd Qu.: 4.00 3rd Qu.: 143.0
# 05 : 95 Max. :20130505 Max. :25.00 Max. :2185.7
# (Other):180 NA's :68 NA's :69
# Meditated
# Mode :logical
# FALSE:502
# TRUE :316
# NA's :68
mdtt < mdtt[!is.na(mdtt$Subject),]
mdtt < mdtt[mdtt$TestDuration<400,]
mdtt$ErrorCount < ifelse(mdtt$ErrorCount<16, mdtt$ErrorCount, c(16))
The TestDuration
variable jumps out as bizarre: it goes from 143 seconds to 2185.7 seconds? A closer look shows that the last two entries are 368.42, and then 2185.66. This is a bizarre outlier, so I deleted it. The errors also jump 4 → 25, and a closer look shows a similar jump from 16 to the maximum value 25, which will distort plots later on, so I chose to make 16 the ceiling. We also don’t want the dummy rows which indicate a meditation session, they’re redundant once the Meditated
variable has been created. Now that the data is cleaned up and coded, we can look at it. For example, all error scores colored by subject:
Analysis
Descriptive
We have two response variables, number of errors & how long it took to finish the set of 50 arithmetic problems:
We can clearly see a practice effect in both test duration & number of errors, where subjects are gradually getting better over time. We will need to remember to control for that.
Splitting by subject instead and looking at small multiples for any ABA patterns (we delete 3 of the attrited subjects to free up vertical space):
mdtt2 < mdtt[as.integer(mdtt$Subject) < 9,]
qplot(Date, ErrorCount, color=Meditated, facets=Subject~., data=mdtt2)
qplot(Date, TestDuration, color=Meditated, facets=Subject~., data=mdtt2)
Overall, these are looking equivocal. There might be something going on, there might not be  ErrorCount
looks better than TestDuration
, but in both cases, the clear practice effect makes comparison hard.
Goals
Roberts comments:
Another way to improve this experiment would be to do statistical tests that generate p values; this would give a better indication of the strength of the evidence. Because this experiment didn’t reach steady states, the best tests are complicated (eg. comparison of slopes of fitted lines). With steadystate data, these tests are simple (eg. comparison of means).
If you are sophisticated at statistics, you could look for a timeofday effect (are tests later in the day faster?), a dayofweek effect, and so on. If these effects exist, their removal would make the experiment more sensitive. In my brainfunction experiments, I use a small number of problems so that I can adjust for problem difficulty. That isn’t possible here.
pvalues are standard in psychology, of course, despite their conceptual & practical problems which help contribute to systematic problems in science, so we’ll calculate one.
But we know in advance what the result will be: greater than 0.05, or nonstatisticallysignificant. A pvalue is mostly a function of how much data you have and how large the effect size is  we know a priori that this is a small experiment, and we also know that the effect of meditation on arithmetic scores cannot be huge because otherwise we would have heard of the groundbreaking research & society would make kids meditate before standardized exams & the subjects didn’t do much meditation & they were inexperienced etc. A small experiment on a small effect is guaranteed to not turn in a statisticallysignificant result, even when there genuinely is an effect, because the experiment doesn’t have much statistical power. In fact, a decrease in errors large enough to reach a statisticalsignificance of p < 0.05 would be so bizarre & unexpected that it would be grounds for throwing out the data entirely, as someone must be lying or have screwed something up or something have gone terribly wrong somewhere.
So the best we can hope for is to show that there was a benefit at all, rather than a penalty  it’s not clear, visually, that we’ll see better scores once we take into account a practice effect of time. Having shown this, then we can ask additional questions like, “how large an experiment would we need to reach statisticalsignificance, anyway?”
Multilevel Models
So, we’ll set up a few models in lme4
and compare them. The first model is the simplest: each subject has a different level of base accuracy (intercept), but are influenced the same way by meditation (meditation is a fixed effect):
library(lme4)
mlm1 < lmer(ErrorCount ~ Meditated + (1Subject), data=mdtt); mlm1
# ...AIC BIC logLik deviance REMLdev
# 3793 3812 1893 3784 3785
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 1.37 1.17
# Residual 5.77 2.40
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 2.892 0.368 7.87
# MeditatedTRUE 0.213 0.179 1.19
#
# Correlation of Fixed Effects:
# (Intr)
# MedittdTRUE 0.139
(This is with a normal distribution family; using a Poisson distribution with family=poisson
would give a more accurate model but I am not yet comfortable using or interpreting Poissons.) We get a fixed effect of 0.2 or 1/5 of an error less on meditation days. Not a big effect. The next step is to imagine that besides having differing natural levels of accuracy, meditation may affect some subjects more or less:
mlm2 < lmer(ErrorCount ~ Meditated + (1+MeditatedSubject), data=mdtt); mlm2
# ...AIC BIC logLik deviance REMLdev
# 3794 3822 1891 3780 3782
# Random effects:
# Groups Name Variance Std.Dev. Corr
# Subject (Intercept) 1.6378 1.280
# MeditatedTRUE 0.0913 0.302 1.000
# Residual 5.7472 2.397
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 2.881 0.398 7.23
# MeditatedTRUE 0.144 0.196 0.74
#
# Correlation of Fixed Effects:
# (Intr)
# MedittdTRUE 0.568
In this view of things, there’s a weaker effect of meditation: 1/7th of an error less (rather than 1/5th).
But of course, what about the practice effect? The date should matter! Let’s rerun those two models with the date:
mlm1.2 < lmer(ErrorCount ~ Meditated + Date + (1Subject), data=mdtt); mlm1.2
# ... AIC BIC logLik deviance REMLdev
# 3791 3814 1890 3772 3781
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 1.58 1.26
# Residual 5.69 2.39
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 698.5757 203.2720 3.44
# MeditatedTRUE 0.1978 0.1777 1.11
# Date 0.0440 0.0129 3.42
# ...
mlm2.2 < lmer(ErrorCount ~ Meditated + Date + (1+MeditatedSubject), data=mdtt); mlm2.2
# ...
# AIC BIC logLik deviance REMLdev
# 3791 3824 1889 3769 3777
# Random effects:
# Groups Name Variance Std.Dev. Corr
# Subject (Intercept) 1.867 1.37
# MeditatedTRUE 0.109 0.33 1.000
# Residual 5.662 2.38
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 706.2475 203.3645 3.47
# MeditatedTRUE 0.0928 0.1987 0.47
# Date 0.0445 0.0129 3.46
# ...
## what do the persubject meditation estimates look like?
## this is similar to what we would see if we did separate regressions on each subject:
ranef(mlm2.2)
# $Subject
# (Intercept) MeditatedTRUE
# 00 0.50140 0.11975
# 01 1.32272 0.31590
# 02 0.12651 0.03021
# 03 2.99856 0.71614
# 04 0.96203 0.22976
# 05 1.33934 0.31987
# 06 0.09078 0.02168
# 07 0.04431 0.01058
# 08 0.69278 0.16546
# 09 0.64673 0.15446
# 10 0.12256 0.02927
# 11 1.84076 0.43963
## 4/11 have negative meditation estimates
Both agree that there is definitely a practice effect, it’s probably statisticallysignificant with such large tvalues, and the effect size is cumulatively large  at 0.045 errors per day, we’d expect our error rate to decrease by 1.4 errors a month  which is a lot when we remember that the total average errors per session was 3.1!
Time of Day
So, that’s the practice effect verified. What else? Roberts suggested “timeofday effect (are tests later in the day faster?)” and “dayofweek effect”. First we’ll convert the times to hours and regress on that:
rm(x)
for (i in 1:nrow(mdtt)) { y < strsplit(mdtt$TimeOfDay[i], ":")[[1]];
x[i] < as.integer(y[[1]])*60 + as.integer(y[[2]]); }
mdtt$Hours < (x/60)
mlm3 < lmer(ErrorCount ~ Meditated + Date + Hours + (1Subject), data=mdtt); mlm3
# ...AIC BIC logLik deviance REMLdev
# 3789 3817 1888 3762 3777
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 1.52 1.23
# Residual 5.63 2.37
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 690.4037 202.0927 3.42
# MeditatedTRUE 0.1875 0.1767 1.06
# Date 0.0434 0.0128 3.40
# Hours 0.0526 0.0162 3.25
To my surprise, there really is a timeofday effect. (I expected that there wouldn’t be enough data to see one, or that it would be a complex squiggle corresponding to circadian rhythms.) And it’s a simple linear one, too, which jumps out when one plots errors against hour:
The lesson here is that doing arithmetic at 7AM is not a good idea. (Would that American high schools could realize this…) But moving on: I really don’t think there’s enough data to get a dayofweek effect as Roberts suggests, so I’ll also do a simpler split on weekends.
mdtt$Day < weekdays.Date(mdtt$Date)
mdtt$WeekEnd < mdtt$Day=="Sunday"  mdtt$Day=="Saturday"
mlm4 < lmer(ErrorCount ~ Meditated + Date + Hours + Day + (1Subject), data=mdtt); mlm4
# ...AIC BIC logLik deviance REMLdev
# 3801 3858 1889 3757 3777
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 1.47 1.21
# Residual 5.64 2.37
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 715.0345 204.1072 3.50
# MeditatedTRUE 0.1610 0.1803 0.89
# Date 0.0450 0.0129 3.48
# Hours 0.0518 0.0163 3.17
# DayMonday 0.4196 0.3138 1.34
# DaySaturday 0.3912 0.2959 1.32
# DaySunday 0.1379 0.3032 0.45
# DayThursday 0.4688 0.2894 1.62
# DayTuesday 0.3116 0.3164 0.99
# DayWednesday 0.1052 0.3267 0.32
# ...
# mlm5 < lmer(ErrorCount ~ Meditated + Date + Hours + WeekEnd + (1Subject), data=mdtt); mlm5
# ...
# AIC BIC logLik deviance REMLdev
# 3792 3825 1889 3762 3778
# Random effects:
# Groups Name Variance Std.Dev.
# Subject (Intercept) 1.52 1.23
# Residual 5.63 2.37
# Number of obs: 818, groups: Subject, 12
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 690.57787 202.44963 3.41
# MeditatedTRUE 0.18743 0.17684 1.06
# Date 0.04345 0.01280 3.39
# Hours 0.05263 0.01623 3.24
# WeekEndTRUE 0.00412 0.18231 0.02
# ...
Neither one is impressive. The model using weekends is completely useless, while the model using days looks like it’s overfitting the data.
Which of the 5 models is better, overall and keeping in mind that simplicity is good (we don’t want to overfit the data by throwing in a huge number of random variables which might “explain” variation in errorscores)?
anova(mlm1, mlm2, mlm1.2, mlm2.2, mlm3, mlm4, mlm5)
# ...
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# mlm1 4 3792 3810 1892
# mlm1.2 5 3782 3806 1886 11.44 1 0.00072
# mlm2 6 3792 3821 1890 0.00 1 1.00000
# mlm3 6 3774 3802 1881 18.75 0 < 2e16
# mlm2.2 7 3774 3807 1880 1.68 1 0.19554
# mlm5 7 3776 3809 1881 0.00 0 1.00000
# mlm4 12 3781 3838 1879 4.26 5 0.51227
As we might have guessed, model 3 (Meditated + Date + Hours + (1Subject)
) wins because the practice effect & timeofday effect matter but apparently not dayofweek in the small dataset we have. (In a larger dataset, then the fancier models might do better.)
Significance
So, in mlm3
the meditation coefficient was 0.1875_{149ya}. How reliable is this meditation coefficient? We’d like a pvalue, and the tvalues aren’t clear: 1.106?
library(languageR)
mcmc < pvals.fnc(mlm3)
mcmc$fixed
# Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr(>t)
# (Intercept) 690.4037 679.5600 268.8057 1069.9631 0.0018 0.0007
# MeditatedTRUE 0.1875 0.1837 0.5400 0.1511 0.2948 0.2889
# Date 0.0434 0.0427 0.0679 0.0172 0.0018 0.0007
# Hours 0.0526 0.0529 0.0839 0.0205 0.0012 0.0012
The practice & timeofday effects are pretty statisticallysignificant, but we get a pvalue of 0.29 for meditation at 0.15 (0.540.15). To put this in perspective: the dataset just isn’t a whole lot of data and the data is heavily loaded on nonmeditation data  not such an issue for the date or timeofday effects which get spread around and estimated evenly, but bad for the meditation data. Another check is by bootstrapping our model from the data and seeing how the meditation estimate varies  does it vary like it did above?
library(boot)
meditationEstimate < function(dt, indices) {
d < dt[indices,] # allows boot to select subsample
mlm3 < lmer(ErrorCount ~ Meditated + Date + Hours + (1Subject), data=d)
return(fixef(mlm3)[2])
}
bs < boot(data=mdtt, statistic=meditationEstimate, R=200000, parallel="multicore", ncpus=4); bs
# ...
# Bootstrap Statistics :
# original bias std. error
# t1* 0.1656 0.002162 0.1741
boot.ci(bs)
# ...
# Intervals :
# Level Percentile
# 95% (0.5108, 0.1735 )
# Calculations and Intervals on Original Scale
Very similar: 0.16 (0.510.17).
Power for a Followup Study?
Lewis viewed the collected data as a protostudy, and wants to do a followup experiment which will be better. Besides the obvious methodological improvements one could make like explicitly randomizing blocks of weeks & using better controls (some other mental activity or alternate form of meditation like yoga), how big should the experiment be so that a meditation effect could reach a p < 0.05?
Taking the final effect estimate from my analysis of Lewis’s protostudy at facevalue and assume it’s the true effect; how many comparisons of meditation/nomeditation would we have to make to have an 80% chance of detecting the true effect which could pass the 0.05 threshold if we ignored the possibility meditation worsened scores? Roughly estimating it, quite a few. A simple power calculation for the estimated effect of 0.187:
power.t.test(delta = 0.187, sd = 2.682, alternative="one.sided", power=0.8)
# ...
# n = 2544
# delta = 0.187
# sd = 2.682
# sig.level = 0.05
# power = 0.8
# alternative = one.sided
#
# NOTE: n is number in *each* group
(A ttest calculation because I am not yet sure how to do power analysis for a MLM, and this at least will get us in the neighborhood.)
Conclusion
If I’m interpreting the power right, this is not as bad as it looks since each of those n is one mathgame, but still. If you have 10 participants and each does 4 games a day, then you’ll need each of your 10 participants to spend 64 days playing the game without meditation (2544 / (10*4) = 64
) and another 50 days playing with the meditation. If you could get 20 participants then it’s a more reasonable 2 months total each.
But getting that many fullycompliant volunteers would be quite difficult, and one has to ask: why does one care? It’s not clear that this effect is worth caring about for several reasons:

the value of meditation is overdetermined by all the existing research, which in aggregate covers probably hundreds of thousands of subjects. 10 or 30 QSers’ data should not, for a rational person, override or noticeably affect one’s beliefs about the value or meditation. If the existing research says meditation is worthwhile, that is far more trustworthy than this; if it says meditation is not worthwhile, likewise.

the methodology here means that the small effect is not a causal effect of meditation alone, but an unknown melange of countless factors.
What the small decrease means or is caused by, I am agnostic about. I know a lot of people get annoyed when someone brings up methodological issues like expectancy or placebo effects, but with my own blinded selfexperiments with nootropics I have felt firsthand placebo effects, and I have been deeply impressed by the results of my dual nback metaanalysis where simply switching from a passive to an active control group changes the IQ gain from 9.3 IQ points to 2.5.
For me, things like that are the reality behind most selfexperiments, and certainly ones with selfselected participants in an unrandomized unblinded experiment with attrition leading to a small effect easily explicable by simply trying a little bit harder.

the metric of arithmetic scores has no clear relation to anything of importance
No one actually cares about simple arithmetic scores. Many results in brain training studies follow a pattern where the training successfully increases scores on some psychological task like digit span, but when researchers look for “far transfer” to things like school grades, no effect can be found. Even if all the other issues could be dealt with and mindfulness meditation did improve arithmetic, what is it a valid proxy for?