Hacker News
How Should We Critique Research? (gwern.net)
72 points by SubiculumCode 9 months ago | 51 comments



I have a particularly funny recent anecdote: in the past year, a professor in my department had two of his papers attacked by aggressive letters to the editor pointing out flaws in both methodology and interpretation. The end result is that we published two additional papers titled "a reanalysis of [insert title of 1st paper]". So a critique of bad research ended up producing twice the amount of bad research!

It's so absurd that it has a certain Monty Python flavour of comedy.


It does not automatically follow that the reanalysis is also bad...and by saying "we" you presumably were an author on this paper. If you thought it was bad, why did you put your name on it?


I didn't ask for it. I just found out about it after the fact, and not being a professor I have no say in the matter. Also, I'd prefer to keep my job since I have a family to feed.

Yes, the reanalysis is terrible and it follows from the fact that the base methodology is unsound.


Maybe it's a field specific thing, or maybe one of the other authors pretended to be you, but I am surprised something could be published with your name on the author list without you knowing about it. Every paper I've ever been listed on, sometimes 10+ authors deep, has required me to confirm that I want my name on the paper/attest I contributed.


You absolutely have a say in whether your name is included on any publication. If a paper was submitted with your name and you don't support it, let the PI know, and send a letter to the editor asking for the paper to be withdrawn or retracted.


You absolutely can ask to have your name removed. To be diplomatic, you can always claim that you didn't do enough to deserve credit. Even now, you could contact the journal to request the change.

In your particular case, I will not contest that the reanalysis was terrible. My point was that reanalysis of a paper after an issue is raised is not automatically also crap. It may indeed be correcting the issue.


Yes, I could ask to lose my job. But again, I prefer not to. Research in certain (most?) fields is a dictatorship and even suggesting something to be fishy puts a target on your back.

In this particular case, there were multiple outcomes tested (read: tens of them) and alpha was .05.
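
For concreteness, here is a minimal Python sketch (my own illustration, not from the thread; the count of 30 outcomes is a hypothetical stand-in for "tens of them") of how fast the chance of at least one spurious "significant" result grows at that alpha, and what a Bonferroni correction would do:

    # Hypothetical illustration: family-wise error rate (FWER) when testing
    # many independent outcomes at alpha = 0.05 with no correction.
    alpha = 0.05
    n_outcomes = 30  # assumed stand-in for "tens of them"

    # If every null hypothesis is true and the tests are independent, the chance
    # of at least one false positive is 1 - (1 - alpha)^n.
    fwer = 1 - (1 - alpha) ** n_outcomes
    print(f"Uncorrected: P(>=1 spurious finding) = {fwer:.2f}")  # ~0.79

    # Bonferroni correction: test each outcome at alpha / n to cap the FWER.
    per_test = alpha / n_outcomes
    fwer_corrected = 1 - (1 - per_test) ** n_outcomes
    print(f"Bonferroni threshold {per_test:.4f}: FWER = {fwer_corrected:.3f}")  # ~0.049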


Do they produce mere noise, or is there additional fraud involved?


Noise. But don't judge too harshly; it's actually like that everywhere I've worked.


That's why I judge it so harshly.


Man, all I can say is you are working for a bad prof then. I can't imagine any of my former advisors/colleagues/etc punishing me for declining to be a co-author.


I am guessing that you don't do clinical research, then ;-)


I'm on the edge of clinical work. I work with some clinical people, but am not clinical myself. I would say that they are a bit more willing to be instrumental and throw a bunch of comparisons out and see what sticks. It's largely because of a lack of awareness about statistical issues though, and they respond well to my objections.


Well, most professors around here don't respond well at all to objections. All (MD) professors I've met are dangerous Excel spreadsheet warriors. Note that we have a statistician in our unit whom no one listens to, because "we always did it that way, right?".

However, in a wider view the issue is far bigger than that: for a clinician nowadays it's become basically impossible to be a really honest researcher while acquiring sufficient publication velocity to rise through the ranks. In short, our system selects for bad actors.


I occasionally help people like this process and visualize data (they ignore their own internal statisticians, of course), but they're friends, so I can shut them down when needed. I have heard the term "retrospective control group" being used.


I'm reminded of a bit of dialogue from Fight Club:

"Which car company do you work for?"

"...a major one."


Out of curiosity: why were these issues being flagged by readers after publication, and not during peer review?


Peer reviews seem to be very hit and miss. I guess it's like asking why a software bug wasn't found during code review, except code reviews are somewhat stronger in the sense that everyone involved wants the final result to actually function.

A clear problem I've seen when meta-reviewing peer reviews is that the goal seems to be demonstrating collegial constructivity, in the sense that things that are obviously totally wrong and yield a useless paper will be highlighted as sub-optimal or skipped over, in favour of smaller things that are fixable. That's understandable from the perspective of keeping friends and building influence with peers, all of which are critical to academic success, but it's pretty fatal for wider trust in science.

Another problem is that peer reviews are extremely slow: several months on average. If someone takes 3 months to review a paper, and addressing the comments takes another month, there is a lot of pressure in the system to go straight to publication at that point. In a strict code review environment the reviewer might observe that some of the problems still weren't fixed and send bad work back four or five times, often more. But that's because code review has a fast turnaround time; you can't do that at the speed of scientific peer review.

On the plus side the reviews do tend to be pretty comprehensive and seem to be taken seriously within their own standards. Part of why they take so long is they're often pretty long. The issue isn't scientists half-assing it, the problems are deeper and more psychological.


> if someone takes 3 months to review a paper

I have the exact opposite experience/problem. Whenever I'm asked to review, there's an artificially short deadline involved (typically a couple of weeks).

Which, assuming you actually work for a living, effectively equates to having a couple of days at the weekend to review a paper.


Hmm, interesting. My number comes from a meta-analysis of peer reviews that compared usual latency to COVID-paper latency. However they might have been looking at the delay between the paper being finalised and the review being committed, or something like that. There's presumably a lot of latency added by the journals finding and routing papers to reviewers.


I guess the reviewers did not see a problem. Or deemed the contribution to the field sufficiently important to let it pass through. I don't know, really.


Aren't we forced to require replication for a measure of truth in empirical studies? If that's universally accepted, shouldn't we demand that researchers come up with predictions of the form "do X and see result Y"? And if we accept that, then should we not discuss who is actually going to "do X" before research gets published?

From the outside perspective, it appears that a lot of people do "open-ended" experiments, write up some findings, and pretty much no one tries to confirm the results, with the notable exception of actual industrial applications (e.g., pharmaceuticals or aircraft design), where the validation happens out of necessity.


Preregistering studies [1] is common in some fields and a nice idea. Unfortunately, my personal experience is that reproducibility, or even providing the sources for your experiments, is not something many reviewers care about in comp-sci fields. At least no conference review I received ever mentioned reproducibility, whether it was available or not (and I am not sure anyone ever mentioned source code availability either).

[1] https://en.wikipedia.org/wiki/Preregistration_(science)


Pre-registration is very common in medicine. I don't see it change anything, because there are no real consequences for not doing what you registered for.


Unless registration is an employment contract, it will be as binding and useful as project plans in business.


Substantially solid research takes years to complete. Each new finding opens up ten new research directions that one "should" investigate. The timelines necessary for these follow-up studies far exceed the scope of a PhD or postdoc position, let alone a single scientific paper.

It is not true that no one tries to confirm the results. Labs do further studies internally to follow up. Also, those working closely in the field in other labs monitor these "findings" and try to align them with their own research, whether by directly comparing them or by indirectly using those ideas to support or reject their own observations.

It is never one study but years of past research that lays the foundation for a scientific discovery that finally creates a paradigm shift in our understanding of the field.


It's important to notice that the funding, which determines what research gets done, is conditional on "impact factor" and publication rates. The classic case of letting a single metric drive a business function without regard for what it actually means. This has been incredibly bad for research, which now requires very large amounts of gaming the system to get actual science done.

Reproducibility and other forms of validity are not currently factored in very well.


A recipe for dysfunction not just in science but in any field.


> Aren't we forced to require replication for a measure of truth in empirical studies?

Almost no scientific discipline requires it, sadly. There are usually no career prospects if you do replication studies.


This is a really great article, and one thing in particular that struck me was something super obvious in retrospect but that I didn’t think of before:

Replicability doesn’t mean the study was right. If a study doesn’t replicate then it’s almost certainly nonsense. If it does replicate that just means that it’s self consistent—but if it was garbage in, then it will invariably be garbage out.


> If a study doesn’t replicate then it’s almost certainly nonsense.

For some fields, especially natural sciences, this statement makes sense. However, for sciences of the artificial (to use Herb Simon's term) I would respectfully disagree.

For example, sometimes the underlying context has changed. I remember a luminary in my field (HCI) once argued that we should consider re-doing many studies every decade, because each cohort uses a different set of tools and has a different set of experiences, and because the underlying demographics of those cohorts change.


This is what we should be paying people like rscho upthread to do. Graduate students should cut their teeth on half a dozen replication studies. Stop chasing novelty at all costs. Instead improve process, methodologies, statistical analysis, etc etc. Our system doesn't pay for that though, so instead we get to live in the world where researchers actively and knowingly publish junk so they can feed their family.


Replicability shows that your method leads to consistent results, not that your hypothesis correctly explains the cause of those results. Yes, your intervention did provoke the causal chain into action, but it may not necessarily have correctly identified or thoroughly characterized the component you identified as the trigger. Your method may work great but only under the perverse set of conditions you happened to explore.

Conversely, if your method fails to drive the desired outcome, your hypothesis could still be correct, just incomplete. Maybe your perturbation of the chemical reaction didn't quite reach the activation energy. Or maybe other essential components in the mechanism were overlooked, given the set of conditions you happened to explore.

Complex black box systems like the brain are notoriously perverse in reproducibly giving up their secrets, even when your hypothesis is correct and your method robust.


“doesn’t replicate then it’s almost certainly nonsense”

Disagree. A significant finding is expected to occur due to chance in direct proportion to the surface area for such possibilities. More studies, more forking paths within those studies, more models, all increase the frequency of spurious findings. So one can do everything perfectly and still get “garbage out”. It is certain.


I’m not super sure I understand your point, but I think you’re saying that it’s possible to run a good replication attempt on a good study and still have it not replicate. I agree with that. I’m not super sure how to correctly estimate the chance of that happening, but one dumb way I can think of is just using p value, so if it was .05 then you have a 1/20 chance of failing to replicate a study even if everything was done correctly.

However, when I said “doesn’t replicate” I didn’t have a single attempt with a 5% chance of failure in mind. I had a field’s aggregate attempt to confirm the result in mind, which would include multiple attempts and checking for statistical bullshit and all that.

Under those conditions the chances are vanishingly small of a whole field getting massively unlucky when trying to replicate a well-done study that theoretically should replicate.

That’s what I had in mind, and I still think it’s right.

Rereading what you wrote, a different interpretation of what you said is that the original investigators might have done everything perfectly, and nevertheless found a significant result that was spurious just because that stuff can happen by chance. If that’s what you meant, I don’t understand the disagreement, except maybe semantically. I would call a perfectly done study that shows a spurious result “nonsense,” and I would expect replication attempts to show the result is nonsense, even if the process that generated the nonsense was perfect. Maybe you’re just saying you wouldn’t call a perfectly done study “nonsense,” regardless of the outcome?


There's a bunch of disciplines (live animal experimentation, microorganisms and organic chemistry come to mind, though they aren't "my" fields) where it's genuinely difficult to perform experiments, where it's reasonably common for the experimenter to screw up the experiment in multiple ways, and a failure to replicate may just as well indicate not a flaw with the original experimenter but a weakness in the skills of the team trying to replicate.

One could argue that it's a failure of insufficiently detailed descriptions of the experiment, but it is how it is in different disciplines. A relevant example is a bunch of earlier machine learning research (current methods subjectively seem more robust in this regard), some of which was very difficult to replicate from scratch because it was very finicky and relied on many tiny details that simply can't all fit in the couple of pages of a standard paper. But it was definitely not nonsense, because it could be replicated if you knew all the best practices from previous experience. I mean, providing code for an ML paper doesn't change whether the research results are nonsense or not, but many people might have a very hard time replicating that research without the code.


Yes, a semantic disagreement: I don't think of spurious findings as nonsense. "Nonsense" has negative connotations that I feel contribute to the real problem, which is the competitive nature of research and misconceptions about “error.” Too often I hear people elevate the researcher in a sort of “great man” theory of scientific progress. Too rarely do I hear praise for those doing boring replication studies and the like. I worry that younger people hear someone published “nonsense” and take it as an indication of the quality of the researcher. All of this leads to more p-hacking, more avoidance of replication, and more inefficiency in the scientific process. The economic incentives are such that universities compete for big-name researchers, those who don't publish nonsense. That is the problem IMO.


I interpreted the "difference which makes a difference" as the sensitivity of our understanding of the status quo relative to the research instance. In that sense, the critique would be about estimating impact on some dimension of understanding conditioned on a finding of credible difference. That would say that big bets are better research.

In other terms of critique, Gelman [0] talks about two components (which may appear to be in tension), one referencing the craft of research, and the other the apparent novelty of effect. My interpretation is that a critique of research should factor in both the character of the bet and an assessment of the evaluation. I don't think big is better, I think incisive is better, but broader impact (newsworthiness) is conditioned on credible surprise.

[0] https://statmodeling.stat.columbia.edu/2014/08/01/scientific...


Waiting for the article : "How should we critique research without being called anti-science?"


The problem with something being wrong is that there's nearly an endless number of ways for something to be wrong.

However, with something correct, there's very little to say about it.


The point of the article, I think, is that while many criticisms of a piece of research are valid, they are not necessarily meaningful. Because of their validity (despite not being meaningful), such criticisms can be weaponized to undermine any piece of research that does not fit your worldview...indeed, I see this regularly on HN...which is why I posted the article.


I don't necessarily see a problem with this on an n of 1 scale. Strongly held priors are not going to be updated by a single piece of research, and for the most part I don't feel they should be. Single studies in biology these days are pretty much never good enough to justify that. If over time many studies build up that's a different story. Perhaps there should be more review articles on HN rather than the focus on original research reports.


Throw grant money at it until those with weak evidence buckle under the load?


I think it should simply be two parts: 1) a list of what steps were taken, and 2) what conclusions can be drawn from it.

Instead we get something that combines both, so it is hard to tell where the researchers' beliefs come into play.


Well, that looks at first like a pretty comprehensive list of possible problems in scientific studies, but the correct reaction on reading it should be "wait, what about the really serious problems?". I mean, (ab)using Google Surveys or experimenter demand effects is the kind of problem that indeed probably deserves a lot of nuance about criticizing it. But it's also not really where the biggest problems are or where the attention should be focused.

We're in an environment where people routinely discover flat out academic fraud, as in, scientists just made whole tables of data up out of thin air. Someone notices, informs the journal, and with 95% likelihood nothing happens. Or maybe the researcher is allowed to 'correct' their made up data. We're in an environment where you literally cannot take any number in a COVID-related paper at face value, even when there are multiple citations for it, because, on checking, citations routinely turn out to be fraudulent: e.g. the cited paper doesn't actually contain the claimed data anywhere in it, or explicitly states the opposite of what's being claimed, or the number turns out to have been a hypothetical scenario but is being cited as a "widely believed" empirical fact, etc. We're in an environment where literature reviews argue that whilst very few public health models can ever be validated against reality, that's not a big deal and they should be used anyway.

Gwern criticises people who learn about logical fallacies and then go around over-criticising people for engaging in them. Yeah, sure, if you claim someone is taking bribes and they actually are then it's technically an ad hominem but still correct to say. Granted. But we are not suffering from an over-abundance of nitpicky fallacy-criticizers. Where are these people when you need them? COVID related research frequently contains or is completely built on circular logic! That's a pretty basic fallacy yet papers that engage in it manage to be written by teams of 20, sail through multiple peer reviews and appear in Nature or the BMJ. As far as I can tell the scientific institutions cannot reliably detect logical fallacies even when papers are dealing with things that should be entirely logical like data, maths, code, study design.

The notion that scientific criticism should be refined to focus on what really matters sounds completely reasonable in the abstract, and is probably an important discussion to have in some very specific fields (maybe genetics is one). But I'd worry that if people are arguing about p-hacking or lack of negative results, that means they're not arguing about the apparently legion researchers making "mistakes" that cannot plausibly be actual mistakes. Stuff that should be actually criminal needs to be fixed first, before worrying about mere low standards.


In practice, there's a massive bikeshedding problem in scientific review. It is very frustrating to see proposals or preprints criticized for missing a dot in paragraph 37 when the bigger problem is that the experiment is overall worthless because it can never change a decision.... This is bad enough when it's an honest mistake, but then there are all the dishonest or plainly incompetent people to worry about.

So, no, in fact I think we are suffering from an abundance of nitpickers. Science desperately needs more reviewers who can see the bigger picture.


That's fair enough. I was meaning nitpickers about logical fallacies specifically, not stuff like grammar or minor details of an experiment.

That said, is peer review really the place to dunk on a study because the whole goal is pointless? The work is done by that point, it's too late. It's the granting bodies that should be getting peer reviewed in that regard, but one of the root causes of the malaise in research is that the granting bodies appear to be entirely blind buyers. They care about dispersing money, not what they get out of the spending. If they didn't spend the money they'd be fired, so that's understandable. The core problem IMHO is the huge level of state funding of research. The buck stops at politicians but they are in no position to evaluate the quality of academic studies.


I was actually thinking of grant applications when I wrote that.

And it's not even politicians, it's grant committees too. For the usual reasons, they'd rather fund study #13 from a researcher than anything that is the slightest bit uncomfortable to them.


I have to say, it's a nice article, but I am a bit skeptical of this analysis' casual endorsement of treating Likert scales as continuous data.
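
For what it's worth, here is a small hypothetical Python sketch (mine, not from the article) of why that can bite: two Likert samples with identical means (and medians) can describe opposite response patterns.

    # Hypothetical illustration: two 1-5 Likert samples whose means and medians
    # coincide while the underlying response distributions are opposites.
    from collections import Counter
    from statistics import mean, median

    moderate  = [3] * 10               # everyone answers "neutral"
    polarized = [1] * 5 + [5] * 5      # half "strongly disagree", half "strongly agree"

    for name, sample in [("moderate", moderate), ("polarized", polarized)]:
        print(name, "mean:", mean(sample), "median:", median(sample),
              "counts:", dict(sorted(Counter(sample).items())))
    # Both summaries come out to 3, which is why ordinal models (or at least
    # reporting the full distribution) are often preferred over means of raw scores.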


I didn't realize how incredible gwern's website is. Wow. Very information dense and well thought out.


Have you heard about the "grounded theory" methodology? It is astonishing: some academic circles have formally standardized the practice of fitting a model to data and then immediately drawing conclusions from it.





