A/B Testing Indentation & Justification

Gwern

1-year-long website A/B test of controversial typographic choices of indenting & justifying paragraphs: a fairly precise null result. No design change was made.

2022-09-27–2023-08-23; status: finished; certainty: possible; importance: 2

Indent Or Justify?
- A Convention?
- Worth Testing
Design
Analysis Plan
Implementation
- Randomization
- HTML/CSS/JS
Experiment
- Experiment Analysis
Decision

A core typographic decision for web pages is whether to separate paragraphs using newlines, or indentation; and whether to use ragged-right or fully-justified text. Gwern.net defaults to book-like indentation + justified text; however, most websites do the opposite, and some readers have criticized the Gwern.net style.

To test whether the style has any important effect on readers, I ran an A/B test on Gwern.net, randomly showing each day one of the 4 possible combinations, for 318 days (2022-09-27–2023-08-10) over 719,549 page-views.

Analysis finds ~0 effect on page-views, with the effects of indentation/justification being far from statistically-significant or a posteriori probable.

So, I decided to not change the typography.

Inevitably, we return to some fundamental typographic questions—“time is a flat circle”. In this case, criticism of the Gwern.net design c. 2022 regularly focused on the use of indentation for paragraph separation rather than non-indented + blank line, and the use of justified text. Previously, I had tested how much I had indented, but not whether to indent or justify. Should we switch? Or should we test first?

Indent Or Justify?

We chose them mostly because they seemed like best-practices from books (on my account), and because we liked & believed them correct (on Said Achmiz’s account).

Some people have surprisingly strong opinions on these, and for others, these two design choices seem to be at the heart of a dislike for the site design—the layout rubs them the wrong way before they have even looked at any details. The rationale for hating them is not always clear, or is dubious.

For example, is it really the case that ‘justification is bad because it was just a hack by book printers to save a page or two of paper’ (if any, given the nature of printing in quire-units), and you are engaged in shameful cargo-culting skeuomorphism if you use full justification on a web page?

Given the generous margins of medieval manuscripts (when each page cost serious money & labor) and the gorgeous justification of the Gutenberg Bible (see Knuth & Plass 1981 for a historical discussion of justification), it’s clear that justification’s purpose was not eking out some sort of economy, but was esthetic, in producing an ‘even’ text. And I see no reason evening out text like that would be less esthetic online.

Or, some claim that ragged-right margins are easier in some way to read by providing a wiggly shape to make it easier to shift your gaze at the end of the line. Maybe, but they generally don’t cite relevant research; and even if they did, most research in typography is un-credible, bearing all the hallmarks of Replication-Crisis-bait (like absurdly small sample sizes while measuring subtle effects unlikely to generalize out of their exact context), and shouldn’t be taken too seriously.

And why would indenting be bad and an entire blank line superior for separating paragraphs? People just seem to have a gut feeling about the page looking “busy” or “too dense”.

I suspect that the real reason is simple familiarity: web browsers default to no indentation with 1em paragraph margins, and ragged-right rather than justification. Not for any good reason, but mostly because that’s simpler and fancy typesetting was not a web browser developer priority (look at how hyphenation really only became feasible recently and is still inferior to true LaTeX hyphenation & linebreaking). That is now just “how the Internet looks” to everyone, for better or worse.¹ Similarly, sans serif fonts are associated with science & technology by long tradition, further reinforced by designer trends like Bauhaus & International Typographic Style (not that there is anything intrinsically ‘scientific’ about them), and low-resolution screens requiring bitmap fonts where it’s difficult for a legible font to even have serifs.

A Convention?

Complaining about it is like complaining that green/red traffic lights could be better designed and more colorblind-friendly; perhaps, but you had better go when the light turns green if you don’t want problems. At some point you have to decide: do you want it enough to alienate readers? Or should you just go along with the de facto standard and spend your ‘weirdness points’ on something more important (eg. unusual citation formats)?

Of course, you can’t make everyone happy, and you shouldn’t try. It’s unclear that the average reader even dislikes it in the first place—no one is going to leave a comment about how much they like the indenting; they might leave a comment about how much they hate it. (A common problem online: hatred is a more powerful motivator than liking.) Maybe it makes zero difference to reader behavior! It’s not like I found many large effects in my other A/B tests, after all, aside from the harms of advertising.

Worth Testing

So the questions of indentation & justification seem like good things to A/B test: they are discrete changes which don’t come with many options (one either justifies or not, and the level of indenting has already been shown to matter little), where all options seem equally probable and to have equal risk/reward. It is also a topic where there is so little hard information that any A/B test results, however flawed, are useful to publish.

Decision-wise, I have no strong feelings about it (while I do prefer serif fonts), so if an A/B test showed high posterior probability of newline and/or ragged being even a little bit better than indent + justified, that readers really did mind, I’d be fine switching. (The main barrier would be the one-time switching costs in needing to refine the CSS & design for the new version, and Said Achmiz’s unhappiness.)

Design

To A/B test both questions in a statistically-powerful, client-side-only, cacheable way which doesn’t break NoScript clients or add meaningful delay to rendering, I borrow my advertising A/B test approach. Based on the previous experiments, a few months will be woefully under-powered to detect relevant effects on the order of a few percent, and at least a year will be required; I don’t want to run multiple years, because the A/B test will interfere with site design & add some complexity, so I pick 1 year as a round number.

We randomize the whole site on a daily basis, as Google Analytics prefers that granularity, and because (as with the advertising A/B test) site design may well have spillover effects through resharing or popularity, and not necessarily solely on each reader in isolation.

I pre-generate randomness in blocks for a 2×2 factorial experiment running for 1 year, which uses the date as an index into the randomness to set <body> classes where the default classes are the status quo & the toggled classes are the A/B test alternatives. The existing CSS is changed to have the classes as selectors.

So a NoScript browser opens the HTML with the status quo body & CSS, sees the normal appearance. A JS browser opens it, checks the date to see which body class to use, sets the class, and the CSS changes based on it. A year later, Google Analytics reports the time-spent-on-page for that day as an experimental unit.

Analysis Plan

We analyze it in brms with a spline time-series and 2×2 variable (due to possible interaction: non-indentation + non-justified is the ‘Internet default’, so may be preferred by users non-additively to cases of breaking either one, since then it ‘looks wrong somehow’ as opposed to just ‘looking right’) etc.

For autocorrelation and date trends (powerful in Internet traffic time-series), brms supports Bayesian splines.

Historical Data

For additional power, I will include a limited historical baseline of indented+justified; the last major change to the indented or justification settings was 2021-11-18, where justification was dropped on mobile in favor of ragged-right (due to low-quality mobile OS/browser justification looking especially bad at narrow widths), and it has been stable since. So I will include 2021-11-18–2022-09-27 as indented=1/justified=1 datapoints.

Mobile Covariate

Mobile readers still typically have a much shorter session length (0m:33s vs 1m:36s, overall 1m:17s checking 2022-04-27–2022-05-03) & pages-per-session (1.16 vs 1.61), so it’s worth having mobile covariate data available. Mobile browsers in this experiment will get the indentation change, but not the justification change, because they have already been set to ragged-right. So including mobile users overall will add some measurement error to the justification change, as changes in mobile time can’t have anything to do with justification; but it may not be worth the hassle of trying to split them out in the data export & analysis as a multi-level analysis fitting random-effects for Indented & Justified by mobile vs desktop. We’ll see.

Distribution

Outcome-distribution-wise, we know from all the past A/B experiments that traffic is not normal but something much spikier like a log-normal, and time-on-page or session length is somewhat better behaved but also still clearly not normal (too many spikes both up & down), so Student’s t is a good choice on top of the log transform.

Informative Prior

Prior-wise, we are neutral on the mean, so mean = 0, and we also know that the effect of any design change which does matter is tiny, and closer to ±1% than 10%, much less 100%. If we regress on log total traffic, then +1% = a coefficient of +0.01 and −1% = −0.01, so we would use an informative prior of 𝒩(0,0.01). If we regress on time-on-page, then if the grand mean is 1m:17s or 77s, ±1% is about ±1s (rounding up). So we expect our coefficients to have ~0s effect, but perhaps as high or low as a few seconds. Many more seconds would be quite surprising: I would wonder if something had gone wrong if tweaking hyphenation could somehow make people read an additional 30s on average! (People are hard sells, not necessarily that interested in what they’ve been linked to, and there’s a lot of other stuff to read out there.) So our coefficient prior for time-on-page effects will be 𝒩(0,1).
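
As a quick check on these prior scales, the arithmetic can be sketched in JavaScript. The helper names are illustrative and not part of the analysis code; the point is only that a ±1% multiplicative change is a coefficient of about ±0.01 on the log scale, and that 1% of a 77s mean is under 1s:

```javascript
// Illustrative helper names; not part of the original analysis code.

// A multiplicative change of `pct` percent corresponds to this
// additive coefficient when the outcome is log-transformed.
function logCoefficientForPercentChange(pct) {
    return Math.log(1 + pct / 100);
}

// The same percent change expressed in seconds of a mean session length.
function secondsForPercentChange(pct, meanSeconds) {
    return meanSeconds * pct / 100;
}

const up   = logCoefficientForPercentChange(+1);  // ~ +0.00995
const down = logCoefficientForPercentChange(-1);  // ~ -0.01005
const secs = secondsForPercentChange(1, 77);      // 0.77s, rounded up to ~1s in the text

console.log(up.toFixed(5), down.toFixed(5), secs.toFixed(2));
```

The log approximation is only this tight for small effects, which is the regime the prior assumes.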

Summing up, the analysis should look something like this:²

library(brms)
b <- brm((Session.Length|Mobile) ~ s(Date) + (1|Indented) + (1|Justified),
      prior=c(prior(normal(0,1), "b")),
      family = student,
      data=df, chains=30, iter=10000)

Implementation

Randomization

In R, I create shuffled blocks of 1–4 corresponding to the 2×2 = 4 possible conditions (status quo of indent+justified, neither, or one), covering 1 calendar year, 0-366; blocking reduces variance:

as.vector(replicate(round(366/4), sample(4)))
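
For readers more comfortable in JavaScript than R, the same blocked randomization can be sketched as follows. This is an illustrative re-implementation, not the code used to generate the schedule: `shuffledBlock` and `blockedSchedule` are hypothetical names, `Math.random` stands in for R's `sample`, and the block count uses `Math.ceil` (which gives the same 92 blocks here as R's `round(366/4)`).

```javascript
// Illustrative sketch; function names are hypothetical, not from the original.

// Return the conditions 1..k in random order (Fisher-Yates shuffle).
function shuffledBlock(k, rng = Math.random) {
    const block = Array.from({ length: k }, (_, i) => i + 1);
    for (let i = block.length - 1; i > 0; i--) {
        const j = Math.floor(rng() * (i + 1));
        [block[i], block[j]] = [block[j], block[i]];
    }
    return block;
}

// Concatenate enough shuffled blocks to cover `days` days.
function blockedSchedule(days, k = 4, rng = Math.random) {
    const nBlocks = Math.ceil(days / k);
    const schedule = [];
    for (let b = 0; b < nBlocks; b++) {
        schedule.push(...shuffledBlock(k, rng));
    }
    return schedule;
}

const schedule = blockedSchedule(366);
console.log(schedule.length);  // 368 = 92 blocks of 4
```

Because every block contains each condition exactly once, the 4 conditions stay balanced over any run of whole blocks, which is where the variance reduction comes from.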

HTML/CSS/JS

In the default.html footer, the reader’s browser looks up the nth day of the year to pick that day’s condition, and sets the body classes for the 2 variables, which the CSS declarations condition on:

<!-- 2×2 A/B test of indentation vs newlining, and justified vs not. -->
<script id="justificationindentation-abtest-js">
randomness = [2,3,1,4,2,4,3,1,4,2,3,1,1,3,2,4,4,2,3,1,3,2,1,4,3,4,1,2,1,4,
    3,2,2,3,1,4,4,2,1,3,2,3,1,4,1,3,4,2,2,3,4,1,1,3,4,2,4,3,1,2,1,2,3,4,2,
    4,1,3,2,3,1,4,4,2,1,3,1,2,3,4,1,4,3,2,3,1,4,2,4,2,3,1,1,3,2,4,4,2,1,3,
    4,1,3,2,4,1,2,3,1,2,4,3,4,2,1,3,1,3,4,2,3,2,1,4,4,3,1,2,1,2,3,4,3,2,4,
    1,2,3,1,4,4,3,2,1,1,4,2,3,3,1,2,4,4,2,3,1,3,4,2,1,1,2,4,3,1,3,4,2,3,2,
    1,4,4,1,3,2,3,4,2,1,2,4,1,3,2,3,4,1,1,4,3,2,3,2,4,1,2,1,3,4,3,4,2,1,4,
    1,2,3,3,2,4,1,4,3,2,1,3,4,2,1,3,2,1,4,2,4,3,1,1,3,4,2,4,1,3,2,1,3,4,2,
    2,4,1,3,1,4,2,3,2,4,3,1,1,4,3,2,2,3,4,1,4,2,3,1,2,4,1,3,3,2,1,4,2,1,3,
    4,4,3,1,2,3,4,2,1,1,4,3,2,1,3,4,2,4,2,1,3,2,1,4,3,1,3,2,4,4,3,1,2,3,4,
    2,1,1,2,3,4,4,3,2,1,2,1,4,3,4,2,3,1,4,2,3,1,4,1,2,3,4,1,3,2,2,3,1,4,3,
    1,2,4,3,2,4,1,2,3,1,4,1,3,4,2,3,1,4,2,2,1,3,4];

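// 0-based day of the year: days elapsed since 1 January, measured from 23:00 today
// so that rounding always lands on the current day.
var loadDate = (Math.round((new Date().setHours(23) - new Date(new Date().getFullYear(),
                 0, 1, 0, 0, 0))/1000/60/60/24)) - 1;
var choice = randomness[loadDate];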

var indented  = true;
var justified = true;
switch (choice) {
 case 1: { indented = true;  justified = true;  break; }
 case 2: { indented = false; justified = true;  break; }
 case 3: { indented = true;  justified = false; break; }
 case 4: { indented = false; justified = false; break; }
 }

if (!indented)  { document.body.classList.remove("indented");
                  document.body.classList.add("indented-not");  }
if (!justified) { document.body.classList.remove("justified");
                  document.body.classList.add("justified-not"); }
</script>
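
The date lookup and the `switch` in this script can also be written as pure functions, which makes the mapping easy to check outside a browser. This is a sketch, not the deployed code: `dayOfYearIndex` and `conditionFor` are hypothetical names, and the day index is computed from calendar dates via `Date.UTC` (so daylight-saving shifts cannot move it) rather than by rounding elapsed hours.

```javascript
// Sketch only: hypothetical helper names, not the deployed script.

// 0-based day of the year for a Date, using its local calendar date.
// Jan 1 -> 0, Dec 31 -> 364 (365 in a leap year).
function dayOfYearIndex(date) {
    const msPerDay = 24 * 60 * 60 * 1000;
    const today = Date.UTC(date.getFullYear(), date.getMonth(), date.getDate());
    const jan1  = Date.UTC(date.getFullYear(), 0, 1);
    return (today - jan1) / msPerDay;
}

// Map a condition code 1-4 to the two experimental factors,
// matching the switch statement in the script.
function conditionFor(choice) {
    return {
        indented:  choice === 1 || choice === 3,
        justified: choice === 1 || choice === 2,
    };
}

console.log(dayOfYearIndex(new Date(2022, 8, 27)));  // 269 (27 September 2022)
console.log(conditionFor(4));
```

With these, the day's condition would be `conditionFor(randomness[dayOfYearIndex(new Date())])`, assuming the `randomness` array from the script is in scope.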

The classes indented/justified were set on the <body> by default in default.html:

$if(index)$<body class="indented justified $safe-url$">$else$
    <body class="indented justified $safe-url$ $css-extension$">$endif$

Original paragraph indenting in initial.css:

p + p,
p + figure[class^='float-'] + p,
div[class^='dropcap-'] + p,
.abstract + p {
    text-indent: 2.5em;
}
@media only screen and (max-width: 649px) {
    p + p,
    p + figure[class^='float-'] + p,
    div[class^='dropcap-'] + p,
    .abstract + p {
        text-indent: 1.75em;
    }
}

changes to:

body.indented p + p,
body.indented p + figure[class^='float-'] + p,
body.indented div[class^='dropcap-'] + p,
body.indented .abstract + p {
    text-indent: 2.5em;
}
@media only screen and (max-width: 649px) {
    body.indented p + p,
    body.indented p + figure[class^='float-'] + p,
    body.indented div[class^='dropcap-'] + p,
    body.indented .abstract + p {
        text-indent: 1.75em;
    }
}

body.indented-not p + p,
body.indented-not p + figure[class^='float-'] + p,
body.indented-not div[class^='dropcap-'] + p,
body.indented-not .abstract + p {
    margin-top: 1em;
}
@media only screen and (max-width: 649px) {
    body.indented-not p + p,
    body.indented-not p + figure[class^='float-'] + p,
    body.indented-not div[class^='dropcap-'] + p,
    body.indented-not .abstract + p {
        margin-top: 1em;
    }
}

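So indenting & removal of top-margin spacing only happen if indented is still set. Similarly, the original hyphenation & justification CSS (first block) is made conditional on justified (second block):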

.markdownBody p,
.markdownBody li {
    -webkit-hyphens: auto;
    -ms-hyphens: auto;
    hyphens: auto;
}
@media only screen and (min-width: 900px) {
    .markdownBody p,
    .markdownBody li {
        text-align: justify;
    }
    .markdownBody .TOC li {
        text-align: left;
    }
}

/* ... */

body.justified .markdownBody p,
body.justified .markdownBody li {
    -webkit-hyphens: auto;
    -ms-hyphens: auto;
    hyphens: auto;
}
@media only screen and (min-width: 900px) {
    body.justified .markdownBody p,
    body.justified .markdownBody li {
        text-align: justify;
    }
    body.justified .markdownBody .TOC li {
        text-align: left;
    }
}

The results looked like this:

Screenshot of the 4 typographic Gwern.net variants A/B tested (indent+justify, indent+ragged-right, newline+justify, & newline+ragged-right); demonstrated on LARPing.

Experiment

The experiment ran 2022-09-27–2023-08-10 for 318 days (629 total including historical data 2021-11-18–2023-08-10), covering 499,257 (980,957) users with 719,549 (1,422,775 total) pageviews or 2,261/day. (This is a fairly ordinary period of traffic; for more historical traffic data, see the Gwern.net traffic page.)

There were few or no comments by readers about the A/B test, and we did not notice any bugs.

The experiment was terminated earlier than planned when Google Analytics shut down Google Analytics version 3 in favor of a completely different “GA4” service. (While officially GA3 would stop collecting data by 1 July 2023, Google didn’t seem to actually forcibly halt data collection until 10 August 2023.) I struggled with the new GA4, not having been all that great with GA3 to begin with, and didn’t want to deal with figuring out how to extract the same data or how to harmonize GA3/GA4, or deal with any disruptions or loss of data, so I include data only up to the GA3 termination. I also struggled to pull out the daily mobile pageview data and time-on-page data from GA3, and eventually gave up and settled for just users/pageviews/mobile data.

Experiment Analysis

Data:

## Raw: <https://gwern.net/doc/traffic/2023-08-21-gwern-abtesting-indentjustification-original.csv>
data <- read.csv("https://gwern.net/doc/traffic/2023-08-21-gwern-abtesting-indentjustification.csv",
    colClasses=c("Date", "integer", "logical", "integer", "numeric",
                 "integer", "logical", "logical", "logical"))
summary(data)
#      Date               Date.int         Mobile            Users
# Min.   :2021-11-18   Min.   :18949.0   Mode :logical   Min.   : 365.000
# 1st Qu.:2022-04-24   1st Qu.:19106.2   FALSE:631       1st Qu.: 543.000
# Median :2022-09-29   Median :19264.0   TRUE :631       Median : 652.000
# Mean   :2022-09-29   Mean   :19264.0                   Mean   : 777.303
# 3rd Qu.:2023-03-05   3rd Qu.:19421.8                   3rd Qu.: 790.750
# Max.   :2023-08-10   Max.   :19579.0                   Max.   :11063.000
#
# PagesPerSession     PageViews        Randomized       Indented
# Min.   :1.08000   Min.   : 510.00   Mode :logical   Mode :logical
# 1st Qu.:1.37500   1st Qu.: 770.00   FALSE:626       FALSE:318
# Median :1.51000   Median : 988.00   TRUE :636       TRUE :944
# Mean   :1.50545   Mean   : 1127.40
# 3rd Qu.:1.62875   3rd Qu.: 1229.75
# Max.   :3.12000   Max.   :12164.00
#
# Justified
# Mode :logical
# FALSE:320
# TRUE :942

## "Mobile browsers in this experiment will get the indentation change,
## but not the justification change, because they have already been set to ragged-right."
# data[data$Mobile,]$Justified <- FALSE

Visualization:

library(dplyr)
data <- data %>%
  mutate(Format = case_when(
    Indented & Justified ~ "I + J",
    Indented & !Justified ~ "Indented",
    !Indented & Justified ~ "Justified",
    TRUE ~ "Neither"
  ))
data_agg <- data %>%
  group_by(Date, Format) %>%
  summarise(TotalPageViews = sum(PageViews), .groups = "drop")
data_agg$Format <- factor(data_agg$Format,
                    levels = c("Neither", "Indented", "Justified", "I + J"))

library(ggplot2)
ggplot(data_agg, aes(x = Date, y = log(TotalPageViews), color = Format)) +
  geom_point(size = 2.5) +
  stat_smooth(size = 4) +
  labs(x = "Date", y = "Log of Total Page Views", color = "Format") +
  theme_bw() +
  scale_color_manual(values = c("Neither" = "red", "Indented" = "green",
                                 "Justified" = "skyblue", "I + J" = "black")) +
  theme(text = element_text(size = 30),
        axis.title = element_text(size = 35)) +
  coord_cartesian(ylim = c(6.3, 8.5))

Gwern.net total-pageview daily website traffic (2021-11-18–2023-08-10), with LOESS lines for the 4 experimental conditions.

Modeling: simple linear model:

summary(lm(log(PageViews) ~ Mobile + Indented * Justified, data = data))
# ...Residuals:
#        Min         1Q     Median         3Q        Max
# −0.7528328 −0.1979312 −0.0653795  0.0992045  2.6426583
#
# Coefficients:
#                              Estimate Std. Error   t value Pr(>|t|)
# (Intercept)                 7.1036417  0.0310513 228.77149  < 2e-16
# MobileTRUE                 −0.3278491  0.0318330 −10.29904  < 2e-16
# IndentedTRUE               −0.0122149  0.0278473  −0.43864  0.66100
# JustifiedTRUE               0.0113190  0.0502425   0.22529  0.82179
# IndentedTRUE:JustifiedTRUE −0.0135376  0.0514783  −0.26298  0.79261
#
# Residual standard error: 0.35107 on 1257 degrees of freedom
# Multiple R-squared: 0.179917,    Adjusted R-squared: 0.177307
# F-statistic: 68.9428 on 4 and 1257 DF,  p-value: < 2.22e-16

The Bayesian model outlined before, adjusted for the switch from time-on-page to total pageviews:

library(brms)

n_chains <- 30
## uninformative, then informative:
priors <- c(set_prior("normal(0, 1)",    class = "b", coef = "MobileTRUE"),
            set_prior("normal(0, 1)",    class = "b", coef = "sDate.int_1"),
            set_prior("normal(0, 0.01)", class = "b", coef = "IndentedTRUE"),
            set_prior("normal(0, 0.01)", class = "b", coef = "JustifiedTRUE"),
            set_prior("normal(0, 0.01)", class = "b", coef = "IndentedTRUE:JustifiedTRUE")
            )

## MCMC convergence optimization: seed at the final posterior mean values:
init_single <- list(`b_MobileTRUE`    = -0.40, `b_IndentedTRUE` = 0.00,
                    `b_JustifiedTRUE` = -0.01, `b_sDate.int_1`  = 0.99)
init_values <- replicate(n_chains, init_single, simplify = FALSE)

b <- brm(log(PageViews) ~ Mobile + s(Date.int) + Indented * Justified,
      prior = priors,
      family = student,
      inits = init_values,
      data = data, chains = n_chains, iter = 30000)
b
#  Family: student
#   Links: mu = identity; sigma = identity; nu = identity
# Formula: log(PageViews) ~ Mobile + s(Date.int) + Indented * Justified
#    Data: data (Number of observations: 1262)
# Samples: 30 chains, each with iter = 30000; warmup = 15000; thin = 1;
#          total post-warmup samples = 450000
#
# Smooth Terms:
#                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sds(sDate.int_1)     2.17      0.72     1.16     3.99 1.00    78956   133655
#
# Population-Level Effects:
#                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
# Intercept                      7.06      0.01     7.03     7.08 1.00   432731
# MobileTRUE                    −0.40      0.01    −0.43    −0.38 1.00   397046
# IndentedTRUE                  −0.00      0.01    −0.02     0.01 1.00   511310
# JustifiedTRUE                 −0.01      0.01    −0.02     0.01 1.00   432847
# IndentedTRUE:JustifiedTRUE    −0.00      0.01    −0.02     0.02 1.00   458018
# sDate.int_1                    0.99      0.61    −0.21     2.18 1.00   226469
#                            Tail_ESS
# Intercept                    318959
# MobileTRUE                   326060
# IndentedTRUE                 319803
# JustifiedTRUE                326818
# IndentedTRUE:JustifiedTRUE   330339
# sDate.int_1                  260497
#
# Family Specific Parameters:
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     0.15      0.01     0.14     0.16 1.00   380275   332992
# nu        2.04      0.15     1.77     2.34 1.00   381426   325363

The analyses are in good agreement and unsurprising: the main & interaction effects have small causal effects on total pageviews over the course of the experiment, indistinguishable from zero. The posterior point-estimate for the status quo of indented+justified (including the interaction) is perhaps a 1% decrease in pageviews, which would be ~22 pageviews/day.
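
To make that back-of-the-envelope conversion explicit: summing the posterior means for the two main effects and the interaction gives the log-scale difference for indented+justified, which converts to a percentage and then to daily pageviews. A minimal sketch, using only the rounded figures from the brms summary and the reported 2,261 pageviews/day (the variable names are illustrative, and the rounding means the result is only approximate):

```javascript
// Rounded posterior means from the brms summary (log-pageview scale).
const indented    = -0.00;
const justified   = -0.01;
const interaction = -0.00;

// Average daily pageviews reported for the experiment period.
const pageviewsPerDay = 2261;

// Total log-scale effect of indented+justified relative to neither.
const logEffect = indented + justified + interaction;

// Convert a log-scale coefficient to a proportional change.
const proportionalChange = Math.exp(logEffect) - 1;  // ~ -0.00995, i.e. about -1%

// Express the proportional change in pageviews per day.
const dailyDifference = proportionalChange * pageviewsPerDay;  // ~ -22.5

console.log((100 * proportionalChange).toFixed(2) + "%", dailyDifference.toFixed(1));
```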

Decision

Given how tenuous the estimate of harm is, and how strongly we can rule out there being any large harmful effect, my decision is to leave the styling as it was: indented+justified.
