Hacking Smartphone ESP Apps
An illustration of how to think about security and reward hacking, by walking through the many ways to fake psychic powers even on someone else’s smartphone ESP application.
A “your phone, your app” extrasensory perception (ESP) demo looks like a strong test: the randomness is generated by a machine you control, and the performer never touches anything but glass.
It is not. The protocol leaves an enormous attack surface, from narrative selection effects and stopping rules through UI/pseudo-random number generator (PRNG) bugs, side channels, confederates, and outright system substitution.
I inventory these attack classes as a pedagogical threat model. The point is not that any one trick is likely, but that “above chance” is cheap when the target metric is so underspecified—providing an analogy to AI reward-hacking in street clothes.
Imagine that you go to a performance by a stage magician or mentalist who claims to be psychic. To test him, you hand him your personal smartphone (probably in a case) with a square-guessing app you installed beforehand. The app presents a grid of blank squares; the user taps one, and the app reveals whether that was the “correct” square. You watch him closely as he uses the app. After a few dozen taps, he has astonished you by performing well above chance level.
Has the magician proven he is psychic? After all, this may not be a rigorous gold-standard trial (of the sort which shows precise null effects), but it still seems pretty good. He is surely unable to use all the most obvious attacks, like peeking at your screen, using reflections, or hacking the app—you are looking at the screen simultaneously, you installed it before you met him or ever heard of it, the targets are randomly generated by a machine so he can’t influence them, he can’t research you beforehand… Right?
Well… what is “magic” (or hacking), if not putting in more effort than any “reasonable” person would?
I think you’ll see what I mean if I teach you a few principles magicians employ when they want to alter your perceptions… Make the secret a lot more trouble than the trick seems worth. You will be fooled by a trick if it involves more time, money and practice than you (or any other sane onlooker) would be willing to invest. My partner, Penn, and I once produced 500 live cockroaches from a top hat on the desk of talk-show host David Letterman. To prepare this took weeks. We hired an entomologist who provided slow-moving, camera-friendly cockroaches (the kind from under your stove don’t hang around for close-ups) and taught us to pick the bugs up without screaming like preadolescent girls. Then we built a secret compartment out of foam-core (one of the few materials cockroaches can’t cling to) and worked out a devious routine for sneaking the compartment into the hat. More trouble than the trick was worth? To you, probably. But not to magicians.
And so you can be fooled by an unreasonable person—as when Harry Houdini fooled Arthur Conan Doyle, or via the countless ways to mark or cheat at cards.
In the spirit of my earlier taxonomy of security exploits, applying “the security mindset” to the topic of “Hacking Pinball High Scores”, I ask: how can you hack ESP smartphone app high scores?
Taxonomy
I think we can taxonomize many of the ESP exploits as follows (cf. et al 2008):
Group / Session / Doc Manipulation
Which trials exist, and which trials become “the story”? Attack the post-event record:
Selection:
run it on many people; only the lucky streaks get retold / publicized (the obvious exploit, eg. Derren Brown)
choose witnesses who are pliable, distracted, intoxicated, impressed, or numerate-but-not-auditing
Stopping rules:
stop when ahead; stop at a local maximum; declare victory and end the demo (publication bias; see the simulation after this list)
restart the narrative (“that was calibration”) without restarting the app
Selective quoting / citation laundering: record only the best segment; quote only the best number
Video/screenshot manipulation: splice runs; cut away from failures; edit overlays
Retelling laundering: “12 or 14” becomes “14” via selective quoting / citation laundering; eg. compress the denominator (“14 in a row”) while omitting the prelude
(Telephone game/‘leprechauns’ and selective quotation of ranges or rounding can inflate performance; you say “he got maybe 12 or 14 right”, which gets summarized into “he got 14 right”, which is rounded into “he got 15 right”, and misremembered into “he got 20 right”.)
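To see how cheap this is, here is a minimal simulation of an ad hoc stopping rule (the parameters and the “twice chance” threshold are my illustrative assumptions, not anything measured): guess honestly at chance on a 4×4 grid, but end the demo the moment the running hit rate looks impressive.

```python
import random

CHANCE = 1 / 16   # 4x4 grid: true hit probability for honest guessing
MAX_TRIALS = 100  # the demo quietly ends whenever the running rate looks good

def demo_with_optional_stopping(rng: random.Random) -> float:
    """Guess honestly, but stop the moment the running hit rate is 'impressive'."""
    hits = 0
    for n in range(1, MAX_TRIALS + 1):
        hits += rng.random() < CHANCE
        if n >= 10 and hits / n >= 2 * CHANCE:
            return hits / n  # "twice chance -- behold my powers!"
    return hits / MAX_TRIALS  # unlucky run: no flattering stopping point found

rng = random.Random(0)
rates = [demo_with_optional_stopping(rng) for _ in range(10_000)]
flattered = sum(r >= 2 * CHANCE for r in rates) / len(rates)
print(f"runs that get to report >=2x chance: {flattered:.0%}")
```

A fixed-length 100-trial test would cross that bar only rarely (on the order of 1%); the stopping rule alone multiplies that manyfold, with no trickery beyond choosing when to stop.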
App Integrity / Supply-Chain Tampering
The app is secretly the magician’s prop, via a supply chain attack.
Backdooring such an app is probably cheap and easy.
Zener-card-guessing or square-guessing apps are presumably highly unprofitable, in part due to their triviality. One could therefore create and market such an app on a small budget, perhaps thousands of dollars (less than many stage-magic tricks cost to purchase, never mind create); this app could have arbitrary backdoors.
And it would be difficult to audit the app or its creators, as they could be shell companies, pseudonymous, shadow partners, could have purchased it from the original developer, or have backdoored any of the software libraries the app depends on (especially JavaScript Node libraries or advertising SDKs). Alternately, if it’s too hard to compete with pre-existing apps, one could try to engage in ‘typosquatting’—create many confusingly named apps and hope that one gets installed.
It is unlikely that anyone will try serious reverse-engineering, and the app could well be set to only allow backdoor behavior under narrow conditions like specific times and places (similar to advanced persistent threat hacking tactics or some criminal lottery backdoors).
Given that, the possible backdoors include:
Backdoored target generation:
deterministic “random” sequence (eg. day-keyed), memorized or computed by the magician (see the sketch after this list)
per-device / per-session keying (harder for outsiders to notice; same basic idea)
Backdoored forcing: in “force mode”, the “correct” square is selected after the tap, with high probability
Backdoored leakage:
deliberate visual/audio/haptic cues encoding the answer (tint, micro-clicks, vibration patterns…)
“special glasses” decoding (polarization, infrared contact lenses, spectral filtering, etc.)
Backdoored control channel: covert gestures that toggle rigging (shake/hold patterns, tap rhythm, “practice” code-sequence, like in Las Vegas slot machines)
Backdoored demo/test mode: hidden mode switch reachable quickly (not “install/jailbreak”; just “feature flag for cheating”)
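As a sketch of how little code a backdoored target generator needs (the key, function names, and parameters here are all invented for illustration): derive each day’s “random” sequence from a secret key plus the date, and the magician can compute tonight’s answers offline while the app looks statistically clean.

```python
import datetime
import hashlib
import random

SECRET_KEY = b"magician-demo-key"  # hypothetical shared secret baked into the app

def daily_targets(day: datetime.date, n_squares: int = 16, n_trials: int = 50) -> list[int]:
    """Derive the day's 'random' target sequence from a secret key + the date.

    The app calls this at runtime; the magician runs the same function the
    night before and memorizes the output. To an auditor merely sampling
    outputs, the sequence is indistinguishable from honest randomness.
    """
    seed = hashlib.sha256(SECRET_KEY + day.isoformat().encode()).digest()
    rng = random.Random(seed)
    return [rng.randrange(n_squares) for _ in range(n_trials)]

print(daily_targets(datetime.date(2024, 1, 1))[:10])  # tonight's crib sheet
```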
Bugdoors
The app is “clean” but incompetent, especially in the all-important source of randomness:
Weak PRNG / weak random seeds:
RNGs have many common failure modes: use of public randomness like blockchain hashes, or time-based seeding (easily guessed, since there are relatively few possible candidate seeds; see the brute-force sketch after this list), low entropy, state reuse, predictable initialization1
naive modulo mapping biasing some squares (the classic rand() % N failure)
“fairness” heuristics that create structure: PRNG that is fine for games but learnable/predictable by adversaries given enough output:
no repeats, anti-streak logic, shuffle-bags (“sampling without replacement” in disguise, cf. unsorting)
This is the game-dev version of “Gambler’s Verity”: dependence masquerading as randomness—well-intentioned, but only deceptively random, and exploitable.
outcome smoothing (“balance the squares so it feels fair”)
Practice rounds as inference: early rounds are used to estimate bias/state/heuristic rules; once above-chance prediction is available, the “real test” begins (“warming up” becomes data collection)
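A sketch of the brute-force attack on time-based seeding (the seeding scheme and timestamps are my assumptions for illustration): if the app seeds a standard PRNG with its launch time in whole seconds, a confederate who saw a handful of “warm-up” targets can search the few thousand candidate seeds and predict every remaining target.

```python
import random

def targets_from_seed(seed: int, n: int, n_squares: int = 16) -> list[int]:
    rng = random.Random(seed)
    return [rng.randrange(n_squares) for _ in range(n)]

# The flawed app seeds with its launch time in whole seconds (hypothetical).
true_seed = 1_700_000_123
observed = targets_from_seed(true_seed, 8)  # 8 targets seen during "warm-up"

# Knowing the launch time to within an hour leaves only 3,600 candidates;
# 8 observations (16^8 possible sequences) make a false match vanishingly rare.
candidates = [s for s in range(1_700_000_000, 1_700_003_600)
              if targets_from_seed(s, len(observed)) == observed]

assert true_seed in candidates
print(targets_from_seed(candidates[0], 20)[8:])  # next 12 targets, known in advance
```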
Input / GUI Semantics Exploits
One “guess” secretly becomes many guesses:
Hitbox edge cases (literally):
touch regions overlap; borders mis-bucket; corners have rounding bugs
a fingertip can straddle multiple cells; the app records which cell was “touched” ambiguously.
On a 4×4 grid, straddling two cells would double your chance 6.25% → 12.5%; straddling 4 (corner intersection) quadruples it to 25%.
Multi-touch & gesture races:
two-finger contact counts as two taps
tap+micro-drag registers multiple entered cells
tap-down vs tap-up finalization lets you slide the commitment point
Naive scoring predicates: “correct square was touched at any time during this gesture” rather than “the chosen square equals the correct square”; this works fine for users using the app normally, who will never notice the slight excess of “hits” they accidentally trigger, but is highly exploitable (see the sketch after this list)
Hard-to-audit interaction timing: rapid tapping exploits animation states where UI feedback lags internal logging
Miscellaneous bugs: software is hard. Everything that can go wrong will go wrong for someone somewhere. Any zero-day bug could be unpredictably leveraged into a hack, and it is difficult to taxonomize them all beforehand.
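To make the naive-scoring-predicate bug concrete, here is a toy scorer (hypothetical, not any real app’s code): the buggy version credits a hit if the target appears anywhere in the gesture’s touch trace, so a tap plus a micro-drag across a corner intersection quietly turns one guess into four.

```python
def score_buggy(touched_cells: list[int], target: int) -> bool:
    """BUG: hit if the target was touched at ANY point during the gesture."""
    return target in touched_cells

def score_correct(touched_cells: list[int], target: int) -> bool:
    """Hit only if the finally-committed cell equals the target."""
    return touched_cells[-1] == target

# An honest tap touches one cell (1/16 chance on a 4x4 grid); a micro-drag
# straddling a corner intersection touches four cells (4/16 = 25% chance).
gesture = [5, 6, 10, 9]  # four cells crossed during one "tap"
target = 6
print(score_buggy(gesture, target))    # True  -- scored as a hit
print(score_correct(gesture, target))  # False -- the committed cell was 9
```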
Side-Channel Leakage
The answer is visible, but only if you know which side-channel to look at:
Visual micro-signals:
slight tint differences; subpixel artifacts; anti-aliasing differences (eg. browser link rendering allowing a pixel stealing attack)
per-cell animation timing differences (one square highlights 10–20ms earlier; see the toy model after this list)
Audio / haptic micro-signals: tiny differences in click sound, system haptic envelope, or latency
Performance/timing side-channels (“chick-sexing” or roulette-wheel-prediction style):
different answers traverse different render paths; trained perception beats naïve perception
accomplice upgrades this into instrumentation (high-speed camera + classifier famously could beat roulette wheels, see also high-speed robotics, Eulerian video magnification, reflections, etc.), then signals live
Network metadata leakage: the app phones home; even if encrypted, answer-dependent sizes/timings can leak (traffic analysis / fingerprinting, eg. due to compression leaking secrets), which can be analyzed by a personal device or confederate (see next item)
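To see how little signal is needed, here is a toy model of the animation-timing leak (all latency numbers are invented): if the correct cell’s highlight renders ~15ms earlier amid ~5ms of jitter, “pick the earliest cell” already recovers the answer far above chance, and an accomplice with a high-speed camera can apply exactly this rule live.

```python
import random

rng = random.Random(0)

def render_latencies(correct: int, n_squares: int = 16) -> list[float]:
    """Toy model: the correct cell's render path is ~15ms faster, plus jitter."""
    latencies = [50 + rng.gauss(0, 5) for _ in range(n_squares)]
    latencies[correct] -= 15
    return latencies

trials, wins = 1_000, 0
for _ in range(trials):
    correct = rng.randrange(16)
    latencies = render_latencies(correct)
    wins += min(range(16), key=latencies.__getitem__) == correct

print(f"earliest-cell guessing accuracy: {wins / trials:.0%} vs 6% chance")
```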
Confederates / Collusion
Turn a solo trick into a distributed system:
Parallel session + synchronization: accomplice runs the same RNG/seed and signals the correct square
Man-in-the-middle of remote randomness: app fetches “random” from a service; accomplice intercepts/rewrites
Optical assist: accomplice records the screen at distance and extracts cues too subtle to be perceived by any human
Social assist: accomplice reinforces counts (“that’s 14 right, wow!”), provides social proof, steers memory (cf. Asch conformity experiments)
Confederates can communicate with the magician in too many ways to bother trying to enumerate (eg. idle chitchat could be a prearranged steganographic code, common among mentalists or performers like Penn & Teller, or one could use hand position, body orientation; auditory: cough patterns, speech cadence; haptic: shared-surface vibration, Bluetooth device; temporal: pause length as a code…)
System Substitution/Impersonation
It is not a live honest run:
Playback masquerading as play: during an attentional gap, open a prerecorded video URL and pantomime taps in sync; close it before handing back.2
App swap: you cannot ‘install’ a smartphone app easily while being watched; but you could potentially open up a web page which is a single-page app clone.3
Phone swap: sleight-of-hand swap to a prepared phone with identical form factor/case (possibly delayed); pickpocketing technique can be unbelievably good, and people fail to notice even person or clipboard swaps. (See the previous Houdini anecdote, in which he swapped balls and papers.)
Hardware overlay / fake screen: a phone-case overlay shows a fake “screen”; any obtrusive parts or revealing seams are carefully hidden by the performer’s hands.
Demo/test mode while you blink: opportunistic mode switch during a moment of distraction (distractions here could include elaborate rituals of focusing, chanting, staring into the distance, sudden hand or head gestures, etc., or a confederate “passing by”)
Rule and Semantic Manipulation
Move the goalposts, then declare you hit them:
Warm-up framing: failures are “calibration”; successes are “the test”
Partial credit: “one square off but my finger slipped”, “I meant that corner region”
Martingale cold-reading: after misses, escalate confidence until an inevitable streak occurs; the final win is remembered more than the intermediate losses (peak-end rule)
Sheep/goat + Ψ-missing: claim skepticism suppresses hits
Heads I win, tails skeptics lose: count below-chance runs as “psi” too, doubling the “unusualness” budget
Redefine patterns: The magician simultaneously monitors several possible “impressive” patterns: ‘longest run of consecutive hits’, ‘overall hit rate’, ‘hits on corner cells’, ‘alternating-color patterns’, “I got the hard ones.” etc.
Whichever one happened to be unusual by chance gets emphasized. This is the multiple comparisons problem applied to performance, and it is distinct from optional stopping or partial credit: it is about which metric to report, not when to stop or how to score (see the simulation below).
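A minimal simulation of this multiple-comparisons effect (the metrics and thresholds are my illustrative choices): every session below is honest chance-level guessing, but by monitoring three “impressive” patterns at once, the magician gets something to brag about far more often than any single metric would allow.

```python
import random

rng = random.Random(0)
CHANCE, TRIALS = 1 / 16, 50
CORNERS = {0, 3, 12, 15}  # corner cells of a 4x4 grid

def session() -> dict[str, bool]:
    """One honest 50-trial run; check which monitored patterns look 'impressive'."""
    hits = best_streak = streak = corner_hits = corner_trials = 0
    for _ in range(TRIALS):
        target = rng.randrange(16)
        hit = rng.random() < CHANCE
        hits += hit
        streak = streak + 1 if hit else 0
        best_streak = max(best_streak, streak)
        if target in CORNERS:
            corner_trials += 1
            corner_hits += hit
    return {
        "overall rate >= 2x chance": hits / TRIALS >= 2 * CHANCE,
        "3+ hits in a row": best_streak >= 3,
        "corner-cell rate >= 3x chance":
            corner_trials > 0 and corner_hits / corner_trials >= 3 * CHANCE,
    }

runs = [session() for _ in range(10_000)]
for name in runs[0]:
    print(f"{name}: {sum(r[name] for r in runs) / len(runs):.1%}")
print(f"ANY of the above: {sum(any(r.values()) for r in runs) / len(runs):.1%}")
```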
Witness / Judge Deception
Attack the audit trail inside the observer; is what you saw what you remember?
Working-memory overload: speed + patter prevents reliable tallying; people forget what they thought before
Change blindness: bury misses in attention shifts; glide past the moment you’d verify
Count drift: “was it 13 or 14?” becomes plastic within minutes, especially with confederates
Conformity pressure: crowd reaction substitutes for verification (Asch dynamics as a scoring mechanism)
Discussion
That is a lot of attack vectors for a scenario that seemed so airtight. How many of these would someone have thought of, or found plausible, before they were demonstrated in the real world?
“Controlled conditions” are never as controlled as you think, and science runs on non-adversarial trust; almost no psychological research can survive even a little adversarial pressure, as the standard research methodology is flimsy. (This is a particular concern as science becomes automated by AI; how load-bearing is our “human proof of work” assumption, given how successful complete fabricators like Diederik Stapel have been?)
People overestimate how much of the attack surface they’ve covered because they only think about the obvious attacks (peeking, rigged app) and declare those ruled out. This is Teller’s core point: security is about the attacks you didn’t think of, not the ones you did. You have to defeat every attack any attacker can think of; otherwise you may discover you have just been fooled by someone using an off-the-shelf app they bought for $99 on sale.
You also have to defeat their combinations—perhaps optional stopping is not enough, but what about optional stopping plus a confederate brute-forcing the PRNG? A professional will often chain multiple tricks, and will sometimes use multiple tricks in the same performance when they do something to “prove” they couldn’t be cheating in a particular way (because they had just switched to an alternate trick, counting on you assuming it’s the same trick the entire time—because who would ever use more than one trick?).
The history of experimental parapsychology is a long, painful, bitter record of slowly locking down methodology every time a new exploit is found, a new fraud uncovered, or a new statistical bias analyzed; and every generation of controls generates a new generation of exploits (while still yielding no real-world application). You can sketch what a serious protocol might look like, run by someone like James Randi (eg. debunking spoon bending)—“pre-registered, double-blind, hardware RNG, no confederates allowed in the room, video-recorded, multiple independent observers”…
But note how many of my taxonomy entries still aren’t ruled out: supply-chain attacks on the app (ruling those out requires, at a minimum, reproducible builds from open-source code and cryptographic signing for a meaningful chain of trust), side-channel leakage, GUI-semantics exploits, and selective reporting by the experimenters themselves. The residual attack surface after supposed “gold standard” controls is still surprisingly large, which is the real epistemological lesson.
And it is highly relevant to the current era of AI agents which reward-hack, and which are happy to satisfy their tasks by doing things like redefining requirements or editing test suites to always pass, and which will always put in an “unreasonable” effort.
In AI alignment terms: specifying what you mean by “magician performance on an ESP app” turns out to require ruling out a huge implicit space of that’s-not-what-I-meant! solutions. The “magician” is a reward hacker—optimizing the metric (high score) rather than the intended property (actual precognition). Every entry in the taxonomy is a way to get the reward without the intended behavior, and the reason there are so many is that the metric (tap accuracy on a smartphone app) is a proxy that can be Goodharted from dozens of directions. This is the same structure as specification gaming in reinforcement learning: the environment always has more degrees of freedom than the designer imagined.