
Hacker News submission analysis

Describing HN submissions; estimating manipulability



Page view counts:

  • 11.8k
  • 15k
  • 30k, 12k
  • 8.5k
  • 15k
  • 15.2k
  • 7k unique visitors
  • 10k unique visitors
  • 40k unique visitors “in one 24 hour period, 35,000 visits and 32,000 uniques”

  • 100 views, +4!/post/the-story-of-our-hackernews-submission
  • 5k pageviews, 3.6k, +42 uniques
  • 41.3k
  • 15,223 sessions
  • +315, 17,000 sessions


Question: is submitting to HN worthwhile?

Simple experiment: each day, submit one link to my own site plus 2 links to other domains. The other-domain links serve both as a rough control for that day’s difficulty of reaching the front page and for any benefits or penalties applied to my account, and as repayment to HN for the potential spamming.

Per Nathanael’s post, I tried to post consistently around 10AM EST (I don’t get up early enough for 7-8AM EST). Apparently he’s wrong? Oh well. Extract stats from the Algolia search page (!/story/sort_by_date/prefix/0/author%3Agwern) or its API:

Sep 2013, started with: ‘“Equoid” (Charles Stross meets My Little Pony)’ (5 points by gwern 6 months ago; 0 comments)

Multi-level Poisson model? Group by domain, cross-grouped by day. Mixture model? Seems appropriate for the two different populations (those which make the front page and those which don’t).

Can’t filter GA by HN referrer: HTTPS breaks referrers.


wget ',author_gwern&hitsPerPage=1000'

How to get all my comments? This doesn’t work:

wget ',(comment)&hitsPerPage=999&page=2'
wget ',(comment)&hitsPerPage=1000'

{"hits":[],"page":2,"nbHits":0,"nbPages":0,"hitsPerPage":999,"processingTimeMS":1,"message":"you can only fetch the 1000 hits for this query, contact us to increase the limit","query":"","params":"advancedSyntax=true&analytics=false&hitsPerPage=999&page=2&tags=author_gwern%2C%28comment%29"}

The Algolia API is apparently heavily limited: any single query is capped at 1,000 hits.
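One documented way around the 1,000-hit cap is to page by time rather than by page number: the HN Search API supports numericFilters on created_at_i, so repeatedly requesting everything older than the oldest item seen so far walks the full history. A sketch in R (RCurl/RJSONIO as elsewhere on this page; the helper names are mine):

```r
## Build a search_by_date query restricted to items older than a given
## Unix timestamp; author_gwern tag as in the wget examples above.
algoliaURL <- function(before)
    paste0("https://hn.algolia.com/api/v1/search_by_date?tags=author_gwern",
           "&hitsPerPage=1000&numericFilters=created_at_i<", before)

## Walk backwards through time until a query returns no hits.
fetchAll <- function() {
    before <- as.integer(Sys.time())
    hits <- list()
    repeat {
        batch <- RJSONIO::fromJSON(RCurl::getURL(algoliaURL(before)))$hits
        if (length(batch) == 0) break
        hits   <- c(hits, batch)
        before <- min(sapply(batch, function(h) h$created_at_i))
    }
    hits }
```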

New API using Firebase

library(RCurl)    # getURL
library(RJSONIO)  # fromJSON
## official Firebase endpoints: /v0/user/<id>.json & /v0/item/<id>.json
user <- "gwern"
user <- fromJSON(getURL(paste0("https://hacker-news.firebaseio.com/v0/user/", user, ".json")))
userAll <- sapply(user$submitted, function(id) { Sys.sleep(1); return(fromJSON(getURL(paste0("https://hacker-news.firebaseio.com/v0/item/", id, ".json")))); } )
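Once userAll is fetched, the item records can be reduced to a score-per-story table for analysis. A sketch in base R, assuming each element is a list carrying the Firebase id/type/score fields (itemScores is my name, not part of the API):

```r
## Keep only stories that have a score, then tabulate id & score.
itemScores <- function(items) {
    stories <- Filter(function(x) identical(x$type, "story") && !is.null(x$score), items)
    data.frame(id    = sapply(stories, function(x) x$id),
               score = sapply(stories, function(x) x$score)) }
## e.g.: scores <- itemScores(userAll)
```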


The social news service Hacker News has a two-layered organization: newly submitted links are displayed on a ‘/newest’ page seen by few users, and the best (as determined by users voting on each submission) are automatically selected for display on the high-traffic main front page which most HN users read. I hypothesized that the lack of traffic on /newest implies that even one vote can substantially affect the chance a particular submission reaches the front page, its ultimate score, and the page-views of the submitted link. A randomized experiment in upvoting small batches of links confirms that the effect is real & large: TODO.

While using HN, as a sort of ‘public service’, I occasionally made sure to visit the newest submissions page rather than just the main front page most people read. After a while, I noticed that the links I upvoted there seemed to be turning up a lot on the front page, more than I would expect from my usual pattern of upvoting perhaps 5 links out of the 30 available. A horrible suspicion struck me: could the apparent arbitrariness of what links made the front page be caused by the /newest page being so sparsely voted upon that a single upvote made a meaningful difference?


I decided to do a randomized parallel-groups experiment to test this: on /newest, take the first 5 links (#1-5, to maximize impact), do a simple 50-50 randomization on each to decide whether to upvote or ignore it (I have a shell function for randomization: echo "$((RANDOM % 2 < 1))"), and make no votes on any other /newest items (I allowed myself my usual browsing & upvoting on the main page while I was there). As I have ‘noprocast’ turned on for 180 minutes, each group of 5 links should have been separated by a minimum of 3 hours (more than enough time for all links to fall off /newest). This was typically done during HN’s busiest hours: 11AM-midnight EST. I was not blinded during the experiment, but the writeup & list were never public while the experiment was running. I used my existing high-karma (>7k) account since, judging from “Inside the news.yc ranking formula”, “How Hacker News ranking algorithm works” & discussion, there doesn’t seem to be any weighting of upvotes.
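The per-link coin-flips could equally have been drawn in R rather than shell; a trivial sketch (upvoteBlock is a hypothetical helper, not what I actually ran):

```r
## TRUE = upvote, FALSE = ignore; one independent 50-50 draw per link
## in a block of 5 /newest links.
upvoteBlock <- function(n=5) sample(c(TRUE, FALSE), n, replace=TRUE)
```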

A power calculation for sample size is hard to do: I don’t have a Poisson power function handy, and I don’t expect the data to fit a Poisson well anyway, due to the stark contrast between the front page and /newest. (And I do want a well-powered experiment - the only thing more wasteful than an overpowered experiment is an underpowered one.) However, I do expect the necessary sample to be quite large by the standards of continuous normally-distributed data. Most submissions to /newest fail to gain more than an upvote or two, which means that most of the sample contains little information about whether the extra upvote helped them reach the front page. Further, I don’t just want to estimate whether the net effect of upvoting is >1; I would like a reasonably precise estimate of what the effect is: knowing that it’s, say, somewhere in +2-10 is not very satisfactory. (“The oncoming car is somewhere between 10 meters and a kilometer away.”) All in all, I don’t expect the necessary n to be <200; I decided to stop at n = 300.
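Lacking an analytic formula, power could have been brute-forced by simulation: draw two groups from assumed score distributions, run the planned Wilcoxon test, and count rejections. A sketch, where the Poisson means (1 upvote for ignored, 2 for upvoted) are pure assumptions for illustration:

```r
## Estimate power of a two-sample Wilcoxon test on simulated Poisson scores:
## the fraction of simulated experiments rejecting at the given alpha.
simPower <- function(n, mu0=1, mu1=2, alpha=0.05, iters=1000) {
    rejections <- replicate(iters, {
        ignored <- rpois(n, lambda=mu0)
        upvoted <- rpois(n, lambda=mu1)
        wilcox.test(upvoted, ignored)$p.value < alpha })
    mean(rejections) }
## e.g.: simPower(150) for 150 links per group
```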

The experiment ran from 2014-03-16–2014-03-31. On 2014-04-22, after doing my planned analysis, the results turned out to be as expected but weaker than I’d prefer (much of the sample turned out to be wasted, stuck at +1 or +2), so I resumed randomization to gather another 100 links or so. I stopped the additional randomization on 2014-05-06.

Analysis Plan

The analysis strategy is:

  1. a non-parametric test of difference in mean scores
  2. dichotomize the scores at >10 as a proxy for having made it to the front page for a meaningful amount of time, and run a logistic regression to estimate the increase in front-page odds
  3. attempt a Poisson regression on scores, to extract an estimate of the difference in means
  4. something fancier, suggested by the data (such as a mixture model of Poissons, perhaps, to split between front-page and non-front-page)

TODO: extract page views & time on page from my old Analytics, letting me calculate ‘how much time am I steering with, say, 10 upvotes on /newest?’




The 2 groups are not exactly balanced due to the simple randomization. One URL suffered a typo and I could not figure out what the original was, so for that block I randomized an extra link (a #6). The n should be divisible by 5 but is off by 1 (TODO); I think I must have failed to copy-paste one link at some point.

for URL in $(xclip -o); do elinks -dump "$URL" | grep -E ' points* by '; sleep 2s; done


basic analysis:

upvoted <- c(3,27,2,19,60,2,69,2,14,6,72,9,2,2,2,57,2,3,31,2,2,32,2,3,2,2,33,2,8,2,2,2,8,4,2,2,8,4,55,2,10,7,327,35,58,70,6,14,3,2,2,79,3,2,100,2,5,4,7,72,2,2,158,2,3,2,73,59,54,2,76,7,141,2,424,11,2,6,2,3,3,3,2,2,2,2,3,126,3,2,2,3,2,3,2,7,4,2,6,2,2,42,2,2,2,3,2,5,77,2,5,5,13,85,19,2,2,2,6,6,2,2,30,72,2,2,6,2,2,4,17,2) - 1
ignored <- c(1,1,1,2,1,2,2,1,1,2,1,2,5,154,1,1,8,2,1,2,1,2,1,1,7,1,1,1,2,329,5,1,1,94,2,1,1,2,1,3,1,1,3,132,1,2,1,2,1,2,2,1,3,2,1,3,1,3,2,5,3,14,62,3,1,1,2,12,112,1,1,2,6,3,2,1,1,1,1,14,1,1,1,2,1,39,4,3,1,1,1,1,9,3,1,1,84,1,5,4,5,1,1,4,270,1,1,1,1,1,5,1,1,1,1,1,3,1,30,5,244,1,1,45,7,4,1,5,1,1,2,1,2,4,1,1,11,61,1,1,1,82,1,2,8,68,7,2,4,89,5,62,25,2,2,2,1,124,179,1,2)
wilcox.test(upvoted, ignored)
# ...W = 11758, p-value = 0.09914
hn <- data.frame(Scores = c(upvoted, ignored), FrontPage = c(upvoted>10, ignored>10), Upvoted = c(rep(TRUE, length(upvoted)), rep(FALSE, length(ignored))))
library(ggplot2)  # qplot
qplot(1:length(Scores), sort(Scores), color= hn[order(hn$Scores, decreasing=FALSE),]$Upvoted, data=hn)
library(plotrix)  # histStack
histStack(hn$Scores, hn$Upvoted, breaks=25)

g1 <- glm(FrontPage ~ Upvoted, family="binomial", data = hn); summary(g1)
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -1.742      0.221   -7.87  3.5e-15
# UpvotedTRUE    0.723      0.296    2.44    0.015
# (Intercept) UpvotedTRUE
#      0.1752      2.0597
#              2.5 % 97.5 %
# (Intercept) 0.1108 0.2649
# UpvotedTRUE 1.1579 3.7167
Reduce(`*`, exp(coef(g1)), 1)
# [1] 0.3608

## a single upvote on /newest is currently estimated as increasing the odds of making the front page by ~2.06x (odds from 0.18 to 0.36, i.e. probability from ~15% to ~26%)
## or to put it another way, from a mean score of +16 to a mean score of +22
g2 <- glm(Scores ~ Upvoted, family="poisson", data = hn); summary(g2)
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   2.7907     0.0195   142.9   <2e-16
# UpvotedTRUE   0.2914     0.0270    10.8   <2e-16
# (Intercept) UpvotedTRUE
#      16.292       1.338
exp(Reduce(`+`, coef(g2), 0))
# [1] 21.28
fit4 <- (stepFlexmix(Scores ~ Upvoted, data=hn, model = FLXMRglmfix(family = "poisson"), k=4, nrep=20))
# summary(fit4)
#         prior size post>0 ratio
# Comp.1 0.0643   19     28 0.679
# Comp.2 0.4506  133    140 0.950
# Comp.3 0.1126   32     90 0.356
# Comp.4 0.3725  109    114 0.956
# 'log Lik.' -1154 (df=11)
# AIC: 2330   BIC: 2371
# $Comp.1
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)    2.485      0.120    20.7   <2e-16
# UpvotedTRUE    1.892      0.123    15.4   <2e-16
# $Comp.2
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   0.6002     0.0734    8.18  2.9e-16
# UpvotedTRUE   5.3254     0.0820   64.95  < 2e-16
# $Comp.3
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   4.3955     0.0309   142.4   <2e-16
# UpvotedTRUE  -3.5918     0.0794   -45.3   <2e-16
# $Comp.4
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   5.4605     0.0292   187.1   <2e-16
# UpvotedTRUE  -2.3164     0.0712   -32.5   <2e-16

exp(0.6002); exp(2.485); exp(4.3955); exp(5.4605)
# [1] 1.822
# [1] 12
# [1] 81.09
# [1] 235.2
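The exp() calls above recover each component’s implied mean score for the ignored group (the intercepts); for symmetry, the implied mean for the upvoted group within each component is exp(intercept + slope), using the coefficients printed above:

```r
## Per-component Poisson means for the upvoted group, from the flexmix fits.
intercepts <- c(2.485, 0.6002, 4.3955, 5.4605)
slopes     <- c(1.892, 5.3254, -3.5918, -2.3164)
round(exp(intercepts + slopes), 1)
# [1]  79.6 374.5   2.2  23.2
```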

Note that the submissions were chosen to maximize the impact of an upvote, by upvoting the first 5 links on /newest and hence the most recently submitted: a submission at the very bottom, like #30, will benefit least from an upvote since its time is almost up and it has nearly vanished. If the effect on the first 5 links is TODO, and the effect on the last link (#30) is 0, then a reasonable guess at the mean effect over all links is simply TODO/2.

Reddit Comparison

One might wonder whether HN is uniquely aberrant in this lottery; the most natural comparison is the social news site Reddit. I picked /r/prog (a very large subreddit, comparable to HN in size) and manipulated its /new queue the same way.




No logistic regression; I’m not sure what the equivalent of ‘front page’ is for Reddit.