“Cores That Don’t Count”, 2021-05-31:
We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent”—the only symptom is an erroneous computation.
We refer to a core that develops such behavior as “mercurial.” Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem—one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call to action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
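The abstract mentions methods for tolerating silent data corruption. One classic software-level approach is dual (redundant) execution with result comparison. A minimal sketch of that idea follows; the names `verified` and `SilentCorruptionError` are hypothetical, not from the paper, and a real deployment would pin the redundant runs to different physical cores so a single mercurial core cannot corrupt both results:

```python
class SilentCorruptionError(Exception):
    """Raised when redundant executions of the same computation disagree."""

def verified(fn, *args, runs=2):
    """Run a pure function `runs` times and require all results to agree.

    In practice each run would be scheduled on a different physical core;
    here the runs simply execute sequentially in one process.
    """
    results = [fn(*args) for _ in range(runs)]
    if any(r != results[0] for r in results[1:]):
        raise SilentCorruptionError(f"divergent results: {results!r}")
    return results[0]
```

The comparison only catches corruption that differs between runs; a deterministic miscomputation repeated on the same core would pass, which is why cross-core placement matters.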
…Because CEEs may be correlated with specific execution units within a core, they expose us to large risks appearing suddenly and unpredictably for several reasons, including seemingly-minor software changes. Hyperscalers have a responsibility to customers to protect them against such risks. For business reasons, we are unable to reveal exact CEE rates, but we observe on the order of a few mercurial cores per several thousand machines—similar to the rate reported by Facebook [8]. The problem is serious enough for us to have applied many engineer-decades to it.
…We have observed defects scattered across many functions, though there are some general patterns, along with many examples that (so far) seem to be outliers. Failures mostly appear non-deterministically at variable rate. Faulty cores typically fail repeatedly and intermittently, and often get worse with time; we have some evidence that aging is a factor. In a multi-core processor, typically just one core fails, often consistently. CEEs appear to be an industry-wide problem, not specific to any vendor, but the rate is not uniform across CPU products.
Corruption rates vary by many orders of magnitude (given a particular workload or test) across defective cores, and for any given core can be highly dependent on workload and on frequency, voltage, and temperature. In just a few cases we can reproduce the errors deterministically; usually the implementation-level and environmental details have to line up. Data patterns can affect corruption rates, but it’s often hard for us to tell. Some specific examples where we have seen CEE:
- Violations of lock semantics leading to application data corruption and crashes.
- Data corruptions exhibited by various load, store, vector, and coherence operations.
- A deterministic AES miscomputation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
- Corruption affecting garbage collection in a storage system, causing live data to be lost.
- Database index corruption leading to queries being non-deterministically corrupted, depending on which replica (core) served them.
- Repeated bit-flips in strings at a particular bit position (which stuck out as unlikely to be coding bugs).
- Corruption of kernel state resulting in process and kernel crashes and application malfunctions.
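The string bit-flip example has a signature one can check for mechanically: software bugs rarely flip the same bit position across many independent corruptions. A hypothetical sketch (none of these names come from the paper) that tests whether a set of (expected, observed) byte-string pairs all flip exactly one, identical bit position within a byte:

```python
def flipped_bit_positions(expected, observed):
    """Return the bit positions (0-7, within each byte) that differ."""
    positions = set()
    for e, o in zip(expected, observed):
        diff = e ^ o
        positions.update(bit for bit in range(8) if diff & (1 << bit))
    return positions

def single_bit_signature(samples):
    """If every corrupted (expected, observed) pair flips exactly one,
    identical bit position, return that position; otherwise return None."""
    common = None
    for expected, observed in samples:
        positions = flipped_bit_positions(expected, observed)
        if not positions:
            continue  # uncorrupted sample, ignore
        if len(positions) != 1:
            return None  # multi-bit damage: no single-bit signature
        common = positions if common is None else common & positions
    return common.pop() if common and len(common) == 1 else None
```

A consistent return value (e.g. bit 4 across unrelated strings) would point toward a hardware fault in a specific datapath rather than a coding bug.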
CEEs are harder to root-cause than software bugs, which we usually assume we can debug by reproducing on a different machine.
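Because the fault travels with a specific core rather than with the software, one way to chase it is to reproduce on the same machine, core by core. A hypothetical screening sketch (Linux-specific, using `os.sched_setaffinity`; not taken from the paper) that runs a deterministic hashing workload pinned to each core and flags any core whose result diverges from the majority:

```python
import hashlib
import os
from collections import Counter

def core_fingerprint(core, reps=1000):
    """Run a deterministic hashing workload pinned to `core`; return the digest."""
    os.sched_setaffinity(0, {core})  # Linux-only: restrict this process to one core
    data = b"mercurial"
    for _ in range(reps):
        data = hashlib.sha256(data).digest()
    return data.hex()

def screen_cores():
    """Return the cores whose fingerprint diverges from the majority result."""
    if not hasattr(os, "sched_setaffinity"):
        return []  # no affinity control on this platform (e.g. macOS)
    original = os.sched_getaffinity(0)
    try:
        results = {c: core_fingerprint(c) for c in sorted(original)}
    finally:
        os.sched_setaffinity(0, original)  # restore the original CPU mask
    majority, _ = Counter(results.values()).most_common(1)[0]
    return [c for c, digest in results.items() if digest != majority]
```

On a healthy machine `screen_cores()` returns an empty list; a non-empty result names suspect cores, though the paper notes that many faults only appear under particular workloads, voltages, and temperatures, so a clean screen is not a clean bill of health.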