Is Writing Style Sufficient to Deanonymize Material Posted Online?
I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:
Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.
Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will