Show HN: Using stylometry to find HN users with alternate accounts (stylometry.net)
676 points by costco on Nov 26, 2022 | 512 comments
Author here. This site lets you put in a username and get the users with the most similar writing style to that user. It confirmed several users who I suspected were alts and, after I informally asked around, identified abandoned accounts of people I know from many years ago. I made this site mostly to show how easy this is and how it can erode online privacy. If some guy with a little bit of Python and $8 to rent a decent dedicated server for a day can make this, imagine what a company with millions of dollars and a couple dozen PhD linguists could do.

Here's Paul Graham:

https://stylometry.net/user?username=pg

Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)




Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.


Woof.

I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes I destroy it for another.

I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.

Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.


This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.


Exact same thing happened to me. Wild.


On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment) so seeing the list of users that are most similar to me was interesting. I didn't see any stark similarities based on a quick peek at their comments, but it was interesting.


Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!


FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.


What does the bolding indicate?


The explanation is here: https://news.ycombinator.com/item?id=33755466

As far as I’m concerned, it’s the killer feature of the app. The top 20 results may be noisy, but the bolded results have a signal to noise ratio close to infinity.


The precision of the bolded results looks like maybe 30% to me. Significantly better than the non-bolded, but nowhere near perfect precision.


False positives become an increasingly difficult problem the more potential authors you introduce. If I had written a fancier model it probably wouldn't be as much of a problem, but what can you do.


Yes, this wasn't a criticism of the tool. It is crazy good.

But I don't think people should be making the assumption that bolded results are definite alts, which sillysaurus' comment reads like.


Hmm, that wasn’t my intent. I see this tool as a recommendation engine more than a doxxer. By “signal to noise ratio close to infinity,” I meant that if you visit one of the bolded accounts, they’ll probably sound a lot like you.

It’s one of those ideas that makes the tool substantially more effective, yet never would’ve occurred to me. It’s like the simplicity of pg’s “a plan for spam” algorithm: deceptively simple, but (like scrubbing dishes with fingers) works really well.


> I see this tool as a recommendation engine more than a doxxer.

Doxxing is absolutely all this will be used for. This is a dangerous tool that serves no real world purpose.


Of my top 20, 19 are bold, all are above 0.6, and I have no alts.


Vast majority of my top 20 were bold, except you funnily enough!

None of them are me (and you were the only one I recognised and thought "yeah, I can see where it gets it from"...)


I have 7 bolded names (0.53-0.62) in the top 20 list, and none are alts of mine.


I'm one of them and I can confirm. But then again that's what I'd say if I was.


Hi style-adjacent friend :-). Just briefly looking at your recent comment history, we seem to find different kinds of articles interesting, but maybe have a similar writing style.


Pretty much the exact same. (I do have a throwaway account but I rarely use it and it probably hasn't been used enough to qualify.)


The funny thing is that I thought of it while eating dinner last night :)


My results have 5 bolded users in my top 20, and I have 0 alt accounts.


Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)

https://news.ycombinator.com/item?id=17944293

The approach I took was a bit different, but also no ML required.

The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.

It’s a very small space to try to compare so simple methods will work fine.


Exactly. HN emphasizes long-form posts much more than other forums which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid tier gaming laptop so it's even easier. I was trying to think of applications of this kind of data and the only thing I could think of was moderation tools/detecting ban evaders but what you've done seems much more profitable lol.


It works like a charm for me too.

I put in my username and found my pre-echelon alt, possibilistic.

(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)


I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.


It is. The description on the about page is a little simplified, but basically I look at the most common word and character ngrams of size 1,2,3 (200 each), put all the frequencies in an array and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
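
A rough sketch of that kind of pipeline, not the site's actual code. It assumes cosine similarity as the comparison (the scikit-learn link above is truncated, so that is a guess based on other comments in this thread), uses only the standard library, and the helper names (`features`, `build_vocab`, `profile`, `cosine`) and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text):
    """Count word and character n-grams of size 1, 2 and 3 for one text."""
    counts = Counter()
    words = text.lower().split()
    chars = list(text.lower())
    for n in (1, 2, 3):
        counts.update(("w", n, g) for g in ngrams(words, n))
        counts.update(("c", n, g) for g in ngrams(chars, n))
    return counts


def build_vocab(texts_by_user, top_k=200):
    """Keep the top_k most common n-grams per (kind, size) across all users."""
    total = Counter()
    for text in texts_by_user.values():
        total.update(features(text))
    vocab = []
    for kind in ("w", "c"):
        for n in (1, 2, 3):
            group = Counter({f: c for f, c in total.items() if f[0] == kind and f[1] == n})
            vocab.extend(f for f, _ in group.most_common(top_k))
    return vocab


def profile(text, vocab):
    """Relative frequency of each vocabulary n-gram in one user's text."""
    counts = features(text)
    total = sum(counts.values()) or 1
    return [counts[f] / total for f in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


corpus = {  # toy data: one concatenated comment history per (hypothetical) user
    "alice": "I think the tool is neat. I think it is also a bit scary.",
    "alice_alt": "I think the site is neat. I think it is a bit creepy too.",
    "bob": "Rust compile times got worse again after the last release, sadly.",
}
vocab = build_vocab(corpus)
vectors = {user: profile(text, vocab) for user, text in corpus.items()}
print(cosine(vectors["alice"], vectors["alice_alt"]))
print(cosine(vectors["alice"], vectors["bob"]))
```

With real data each user's text would be their whole comment history, and the ranking for one user is just the other users sorted by that score.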


Cool. I only skimmed the description; maybe I needed to read it more carefully.

Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.


sillysaurus3 was in mine. :) Clearly we're not the same.


> sillysaurus3

> sillysaurus2

Tbf a human could have found a bunch of them relatively easily


The method used, i.e. to calculate the cosine of the two authors' word vectors, is poorly suited for stylometric analysis because it is based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.

Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).
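
One common way to reduce the topic effect, sketched here purely as an illustration and not as anything the site does, is to build the vector from function words only, since they carry little topical content. The word list below is a tiny hand-picked sample and `function_word_profile` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

# Tiny illustrative sample; a real study would use a much longer function-word list.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it",
                  "but", "is", "i", "not", "with", "for"]


def function_word_profile(text):
    """Relative frequency of each function word, ignoring all content words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# The same toy "author" writing about two unrelated topics.
on_compilers = "I think that the parser is fine, but it is not the bottleneck in the compiler."
on_cooking = "I think that the sauce is fine, but it is not the highlight of the dinner."

a = function_word_profile(on_compilers)
b = function_word_profile(on_cooking)
print(round(cosine(a, b), 3))
```

The two sentences share no topical nouns, yet their function-word profiles stay close, which is the kind of topic robustness described above.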

Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).

This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...


> based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.

Interesting. I was expecting to be grouped with other Russian speakers and I am (based on some nicknames). But I thought the most telling feature will be exactly word order - it’s absolutely relaxed in Russian. Word frequencies? Well, probably the absence of articles, lol (but I swear to God that I often spend some extra time trying to insert as many articles in my texts as I could).


There’s https://en.wikipedia.org/wiki/Idiolect :

“Language consists of sentence constructs, choice of words, and expression of style. Accordingly, an idiolect is an individual's personal use of these facets. Every person has a unique idiolect influenced by their language, socioeconomic status, and geographical location.”


In practice a more complex approach will tend to require a greater amount of data per user, so in this specific case this simple approach is not too bad. Moreover, fake accounts are likely to talk about the same topics, so while this leads to false positives, it also makes it more likely that we find actual duplicates in the list.


Ha, gruseom, which is dang’s old account, shows up for pg. A worthy successor.

This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”

Sure, the privacy concerns are valid, but the cat’s out of the boot. Might as well enjoy the benefits.

montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

Nicely done. One of the best hacks I’ve seen in a long time.


> montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

I had this hunch too. It's either pg or someone trying really hard to be pg.


I mean, this is HN -

> someone trying really hard to be pg

describes half the site.


> Someone who talks about ancient history, Occam’s razor, VCs and startups,

I think these are all common topics among HN readers and commenters.


Why would montrose be pg? The correlation is not that high. Looks like a few people have picked up pg's mannerisms.


Yeah, that score is only slightly higher than the highest one it shows for my account (which is also bold) - and unless my alter ego has been disguised so well it even managed to hide from myself, I'm pretty sure that isn't me :)


The score for montrose vs pg is lower than the score for someone most similar to me, who is definitely not me.

I think the similarity has to be in the high .80s to suspect that it's the same individual.


There are factors that make me think it is more likely than not (just scrolled through the comment history, don't feel like linking everything) that he is pg.

- Is bolded on pg's page

- Mentions yoga

- Talks about Lisp often

- Talks about YC often

- Talks about kids

- Links to Paul Graham's website

- Says he uses vi

- Writes exactly like you would expect pg to write


I agree that this person is trying very, very hard to sound like pg! You could be right actually. Could still be a "wannabe" though.


I'm sophisticately sure they are not. They recommend a founder to ask users directly what they will pay for.

Is that what PG would say?


Of course. Why wouldn’t he? That’s sound advice.


YC startup videos recommend not asking users directly what they will pay for.

Users frequently say they will pay for something but back down against other things.



Wow, what an odd thing to get so worked up about.


> but the cat’s out of the boot

It's my first time hearing that variant. Usually it's "the cat's out of the bag" where I'm from.

Do you mean boot in the UK sense, what Americans would call the trunk of a car? Or do you mean a sturdy piece of footwear?

Obligatory xkcd https://xkcd.com/2390/


It’s a little writing trick I learned from (I think) Orwell. Any time you’re about to use a common metaphor, try to tweak it. You’ll catch readers off guard, which piques their curiosity.

It’s a fun game, too. I wish I’d used “the cat’s out of the hat,” but I didn’t think of it till later.


What you are describing is also known as an eggcorn.

https://en.wikipedia.org/wiki/Eggcorn


This is my all time favourite one of these:

https://thehabit.co/knowledge-is-power-france-is-bacon/

> When I was young my father said to me: “Knowledge is power, Francis Bacon.” I understood it as “Knowledge is power, France is bacon.”

> For more than a decade I wondered over the meaning of the second part and what was the surreal linkage between the two. If I said the quote to someone, “Knowledge is power, France is Bacon,” they nodded knowingly. Or someone might say, “Knowledge is power” and I’d finish the quote “France is bacon,” and they wouldn’t look at me like I’d said something very odd, but thoughtfully agree. I did ask a teacher what did “Knowledge is power, France is bacon” mean and got a full 10-minute explanation of the “knowledge is power” bit but nothing on “France is bacon.” When I prompted further explanation by saying “France is bacon?” in a questioning tone, I just got a “yes.” At 12 I didn’t have the confidence to press it further. I just accepted it as something I’d never understand.

> It wasn’t until years later I saw it written down that the penny dropped.


You left out the funniest thing - the guy/gal's nickname was "Lard_Baron"


Thank you! I was trying to find the original essay I learned it from. I’m now pretty sure it was by Poe, but all I can remember is the main advice: avoid common metaphors.

I vaguely remember one of the metaphors in the essay was about a chicken coop melting, or something like that. It was vivid enough to leave a big impression.


I remember this being from Politics and the English Language (https://www.orwellfoundation.com/the-orwell-foundation/orwel...):

“Dying metaphors. A newly invented metaphor assists thought by evoking a visual image, while on the other hand a metaphor which is technically ‘dead’ (e.g. iron resolution) has in effect reverted to being an ordinary word and can generally be used without loss of vividness. But in between these two classes there is a huge dump of worn-out metaphors which have lost all evocative power and are merely used because they save people the trouble of inventing phrases for themselves.”


Thank you so much! That’s the one.

(It’s remarkable how often a vague description can yield an HN comment with an answer from a clever sleuth like yourself. Much appreciated.)


That's neeto!

The 2nd example also loosely falls under the classification of malaphor.

https://en.m.wiktionary.org/wiki/malaphor


An eggcorn is a soundalike though, isn't it? Deliberately altering idioms to catch people's attention isn't an eggcorn IMO.


> An eggcorn is a soundalike though, isn't it?

Not necessarily, you might be thinking of malapropisms [1], but yes, probably a closer word would be the general term: protologism.

Another commenter added some useful info on the evocative alteration of metaphors [2]

1: https://en.wikipedia.org/wiki/Malapropism

2: https://news.ycombinator.com/item?id=33757097


Yeah, it’s like shooting ducks in a barrel it works so well.

Easy to overuse then people just get annoyed though…kind of like commas, I suppose.


That reminds me of a PETA campaign on social media trying to get people to replace violent idioms with alternatives like "feeding a fed horse" and "there's more than one way to pet a cat."


I like mixing metaphors, in this case "the cat's out of the tube". ("the toothpaste's out of the bag" doesn't work as well though)


I love doing this too, it's fun to write.


There's a popular movie called "Puss in Boots". That's what I had to think of first.


It's a bit older than the movie or movies in general.

https://en.wikipedia.org/wiki/Puss_in_Boots


This is somewhat similar to how they ended up catching the Unabomber. The FBI were literally at a dead end. They ended up publishing one of his letters/manifestos in the paper; somebody recognised an unusual turn of phrase the Unabomber used and reported it as possibly being their brother; the FBI investigated the lead and it led them straight to him.

Excerpts from wiki:

> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]

> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]

> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]

https://en.m.wikipedia.org/wiki/Ted_Kaczynski


As I recall, one of the clinchers was his use of the phrase, "you can’t eat your cake and have it too" as opposed to the now-predominant variant "you can’t have your cake and eat it too."

I often wonder if stylometry can be used to positively identify a person based not on general word frequency, but by a single phrase or two which are rare in general but commonly used by the individual. In theory this could be relatively easy to find given a large corpus. You'd pick out the top few n-grams for short phrases by an individual and identify the ones which are most overly-represented compared to the rest of the population.
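
A rough sketch of that idea, assuming word trigrams and a smoothed frequency ratio as the over-representation score. `distinctive_phrases` and the toy texts are made up for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def distinctive_phrases(author_text, population_texts, n=3, top=3, min_count=2):
    """Rank the author's n-grams by how over-represented they are vs. everyone else."""
    author = Counter(word_ngrams(author_text, n))
    others = Counter()
    for text in population_texts:
        others.update(word_ngrams(text, n))
    author_total = sum(author.values()) or 1
    others_total = sum(others.values()) or 1
    scored = []
    for gram, count in author.items():
        if count < min_count:
            continue  # a phrase used only once says little about habit
        author_rate = count / author_total
        # Add-one smoothing so phrases unseen in the population do not divide by zero.
        others_rate = (others[gram] + 1) / (others_total + 1)
        scored.append((author_rate / others_rate, gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:top]]


author_text = ("you can't eat your cake and have it too but "
               "you can't eat your cake and keep it either")
population_texts = [
    "you can't have your cake and eat it too",
    "he wants to have your cake and eat it as well",
]
result = distinctive_phrases(author_text, population_texts)
print(result)
```

Phrases the author repeats but nobody else uses float to the top, while shared phrases such as "your cake and" are pushed down by the population count.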


It was actually his brother.


So is the lesson you should have GPT rewrite your manifesto so as to obscure your personal idioms?


Or something purpose-built like Anonymouth (https://github.com/psal/anonymouth), although it seems to be both unique and dead.

Also interesting:

> Ross Ulbricht aka Dread Pirate Roberts, the mastermind behind the infamous Silk Road site which served as a black market for drugs, weapons and fake documents was also well aware of the potential danger of stylometry being used against him. At the time of his arrest in a San Francisco public library, the FBI captured images of his laptop screen as evidence. Guess what he had bookmarked: “Science of Stylometry.”

https://medium.com/svilenk/the-case-for-anonymity-12db114f0c...


I mean he used a forum account with an email that had his name in it.


That's the problem - it only takes a single slip and it is recorded forever. Perfect opsec is an impossibly high bar if you are maintaining an active online presence.


Only if you have a history of sending crazed writings/manifestos to newspapers and family.


The show “Manhunt: Unabomber” (Netflix) shows this whole story very well.


This is a super interesting tool for self reflection. Looking at the top 10 similar accounts to mine, it gives me an arms-length view of how other people probably interpret my tone.

I appear to be a well-educated, over-confident know-it-all.


My #3 match is cstross, and now I’m convinced that my life-long secret dream of being a successful sci-fi novelist is basically a matter of typing. (Ideas? Character development? Ruthless editing? Developing an audience? Having a publisher? What do I need of those when the Computer told me I’m practically a genius…)


I'd suggest giving the back story to Agent to the Stars by John Scalzi a glance.

http://www.scalzi.com/agent/

> In the summer of 1997, I was 28 years old, and I decided that after years of thinking about writing a novel, I was simply going to go ahead and write one. There were two motivations for doing so. First, I was simply curious if I could; I'd had up to that time a reasonably successful life as a writer, but I'd never written anything longer than ten pages in my life outside of a classroom setting. Two, my ten-year high school reunion was coming up, and I wanted to be able to say I'd finished a novel just in case anyone asked (they didn't, the bastards).

> In sitting down to write the novel, I decided to make it easy on myself. I decided first that I wasn't going to try to write something near and dear to my heart, just a fun story. That way, if I screwed it up (which was a real possibility), it wasn't like I was screwing up the One Story That Mattered To Me. I decided also that the goal of writing the novel was the actual writing of it -- not the selling of it, which is usually the goal of a novelist. I didn't want to worry about whether it was good enough to sell; I just wanted to have the experience of writing a story over the length of a novel, and see what I thought about it. Not every writer is a novelist; I wanted to see if I was.


Same. Looking through some of the handles on my list tells me that I come across like a not-particularly-well-educated McSmug that needs to take a good long look at myself. Wouldn’t be so bad if I wasn’t reading the posts thinking I definitely could see myself writing this.

This was certainly eye-opening.

Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.


I also enjoyed reading one of my style-partner’s posts.

The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.

The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.


> like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.

Commonly called just “hedging” like hedging your bets.


That’s a kinder description than I gave it in my next paragraph, so thanks I suppose.

I do think it is an under-emphasized aspect of honesty, though, that we should be clear about our level of experience/understanding. Especially online — people like to discuss things, even (especially?) when we are just getting started. So if we’ve picked up opinions through osmosis and we start repeating them without testing them, we’re really just amplifying some possibly-incorrect viewpoint (and if we’ve picked it up, there’s a good chance it is already widespread in the community, which is bad if it is wrong).

And I mean, more concretely a measurement is not complete without the error bars!

Often this doesn’t really matter, because it is just chit-chat anyway. But it is nice to keep in mind.


> we should be clear about our level of experience/understanding

there are many languages that encode this info as mandatory grammatical affixes, it's called evidentiality.


I hadn’t heard of that. Neat!

I find it interesting that the first example they use in the Wikipedia article is Turkish. I’ve only met a couple Turks, but they were all quite good engineers. I wonder to what extent embedding this kind of information in the language helps organize your thoughts.


> I appear to be a well-educated, over-confident know-it-all.

Don't we all?


I hate us insufferable nerds!


> over-confident know-it-all.

I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.


That's what we all come to HN for...


we must be a good match


I'd love a version of this where you enter two usernames and get a match score.


After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.

And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).

Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.


I keep no alternate accounts, but this tool reports best matches for me that appear to be Slavic or just Russian - and I am Russian. Best match score in my list is just above 0.5. There are some clearly alternate accounts on the list, their match scores with this tool are well above 0.7.

It is probable that persons of same cultural origin will have similar writing style and vocabulary. It is also probable that persons of same cultural origin would have same relationships with the world as a whole, they would like same things and dislike other same things.

So, in my opinion, it is possible that you have found not only alternate accounts (score above 0.7), but accounts of people with same cultural origin (ones that are around 0.6).


My highest was 0.41 and the person writes nothing like me. I guess I'm a unique snowflake after all.


I was curious about this, my highest match was 0.47 and I have no alts, maybe I'm also a unique snowflake, or haven't said anything noteworthy enough to have been deepfaked yet ;).


my second highest hit (ie, third in the list) is gwern at 0.45 who i'm fairly sure is not me.


I was actually just looking at near hits for gwern and found what's almost definitely a defunct alt for him.


Well, it is certainly NOT me, that's for sure.

On an unrelated topic, I'm starting a service to write comments in the style of others to provide plausible deniability for other alt accounts. Rates negotiable.


I have a few in the low 0.5's and, honestly, they seem cool and I want to meet them.


I don't have any alternate accounts here either and my writing style is apparently nearly the same as a high profile account that I recognize and has many points. I wouldn't say this is a highly accurate thing.


This tool finds 19 other accounts similar to mine. Those are not my accounts. The scores range from 0.46 to 0.56.


I think people are sort of confused at what this tool is supposed to be which I will concede is partially my fault. The results of this tool are by themselves not indicative of having an alternative account. It generates the 20 most similar users for every single user on the site, regardless of whether they have an alt or not (there's obviously no way for me to know that for every single user). In your case further investigation would reveal that none of those accounts are yours.


It is a fun tool, I can assure you. It is just people have found use case you haven't foreseen yourself.

I think your tool should have internal embeddings for each of the users. Also, most probably your tool uses cosine similarity for the search.

Thus, I would like to suggest a feature: recognize simple arithmetic operations over user's embeddings, such as "thesz - 2 * patio11". It will make things even more fun, this way we can find users who are like me and much not like patio11. Even simple additions and subtractions would suffice.

(an idea is taken from properties of word2vec embeddings)
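
A small sketch of what such a query could look like, assuming each user already has a feature vector and that cosine similarity ranks the results. The 3-dimensional vectors, the extra usernames and the helpers `combine` and `nearest` are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combine(terms, vectors):
    """Weighted sum of user vectors, e.g. [(1, "thesz"), (-2, "patio11")]."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for weight, user in terms:
        for i, value in enumerate(vectors[user]):
            out[i] += weight * value
    return out


def nearest(query, vectors, exclude=(), k=3):
    """Users whose vectors are most similar to the query vector."""
    scored = [(cosine(query, vec), user)
              for user, vec in vectors.items() if user not in exclude]
    scored.sort(reverse=True)
    return [user for _, user in scored[:k]]


# Made-up style vectors; real ones would have hundreds of dimensions.
vectors = {
    "thesz": [0.6, 0.3, 0.1],
    "patio11": [0.1, 0.2, 0.7],
    "user_a": [0.7, 0.3, 0.0],
    "user_b": [0.2, 0.2, 0.6],
}
query = combine([(1, "thesz"), (-2, "patio11")], vectors)
ranking = nearest(query, vectors, exclude={"thesz", "patio11"}, k=2)
print(ranking)
```

Here `user_a` ranks first because it points the same way as thesz and away from patio11, which is the "like me and not like patio11" behaviour being asked for.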

Your tool is thought provoking. What I discovered with it made me think about my use of language and what other languages (body, imagery, etc) I use differently because of who I am. Which made me think about my favorite underrated superhero Cypher [1] - would his innate ability to understand languages make him best detective ever?

[1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics)

Thank you!


Really cool idea. I'd need to upgrade the VPS though so all the vectors would fit in memory but it probably wouldn't be too hard (right now I'm just storing a map of username string -> array of 20 username strings because my VPS only has 512mb RAM). I'll think about if I can do this in a way that is more resource conservative.
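
For context, the precomputed layout described above can be sketched roughly like this: compute the pairwise similarities once offline, keep only the top-k usernames per user, and serve lookups from that small map. `build_top_k` and the toy vectors are hypothetical, and cosine similarity is assumed.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_top_k(vectors, k=20):
    """Map each username to the k most similar other usernames."""
    top = {}
    for user, vec in vectors.items():
        scored = ((cosine(vec, other_vec), other)
                  for other, other_vec in vectors.items() if other != user)
        # nlargest keeps only k candidates in memory at a time.
        top[user] = [name for _, name in heapq.nlargest(k, scored)]
    return top


vectors = {  # toy vectors for four hypothetical users
    "a": [1.0, 0.0, 0.0],
    "a_alt": [0.9, 0.1, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
top = build_top_k(vectors, k=2)
print(top["a"])
```

The served map only stores strings, which is why it fits in very little RAM; supporting arithmetic on vectors would mean keeping the full vectors around at query time.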


Fwiw, and as gp mentioned, > 0.7 seems more likely to be alt territory.


You are fools, one and all! This tool's only purpose, is to tag people who use it!

Now they know just who cares about which alternate accounts. They know!

They freaking know, man!

You have all fallen for their ploy. Fools!


I have no alternate accounts and visited the site out of curiosity, because I used to work in a domain like this.

What I found was worth visiting the site. Somehow, notably many accounts with (relatively) high similarity to mine share at least one of my personal traits.

Which is fascinating, to me.

And I think it is worth others noticing - what and how you write can disclose who you are.


It knows my IP now.

(Or does it?)


It offers no privacy policy, so can't tell.


.6 is high confidence? I did my own username, wondering what it would return, since I know I don’t have any alt accounts. The top results are in the .6-.7 range. If they aren’t alt accounts, is it just coincidence that we have similar writing styles?


I think so.

A funny thought — my “matches” cap out at around .56. Having false positives* in a tool like this might feel like a “bad result” but actually I think it just means that if someone were running this sort of tool across the whole internet, I’d be relatively easy to correlate, while your identity would be intermingled with your .6-.7 partners.

*actually they aren’t really even false positives because the tool doesn’t promise to detect alts in the first place, just find similar styles.


> but this functionality turned out way creepier than I thought the moment I tried it

Hopefully this raised awareness means that people who actually need anonymity will be more likely to know to take precautions.


Genuinely asking, what way is there to combat this? Is there a tool that takes out stylistic elements of your comment?


The site mentions a service called Quillbot which apparently does just that. https://stylometry.net/avoid


This is the million dollar question. I think the goal of "anonymity for most intents and purposes" is worthy, it's been how I've enjoyed HN and Reddit, but I also knew that it was just a matter of time before stylometry and other meta-analysis of post history became 10-second tools for everyone. Now the cat is out of the bag.

I've been thinking about this a bit, and I've landed on the view that having a stable identifier across ALL comments & posts is a poor default. We still probably want some coherence, at minimum within a thread, e.g. to follow a back-and-forth. The site itself may also use a stable identifier for abuse prevention. But there's no reason one should have the same username externally traceable for posts about completely different topics.

In practice, this could be done with low-friction pseudonym creation, with all pseudonyms tied to the same account privately.


One way would be to run such a tool before posting and then, based on the results, tweak the post and repeat until the similarities are no longer statistically significant. Or, instead of tweaking, start posting under a new throwaway account. But this won't save you when some new way to analyze style appears in the future. Moreover, there are other types of metadata, such as timestamps, that can be taken into account to narrow down the search space. And obviously, the more you write, the harder it is to control these things.


I wonder if GPT-3 has a use case here?


[flagged]

0.6 isn't much. I have 3 matches above 0.6, and they're not me. 20 or so over 0.5.


I get one 0.68 match, which... fair enough. It is an account I've abandoned some years ago, no secrets there.

No other hits above 0.5, so I guess that either makes me pretty unique as a commentator or my English is broken in a unique way.


That's why you manually evaluate the matches. And like I wrote in that comment, I did that manual eval, and these clearly are alts of that main account, not spurious. Narrowing down the pool of accounts that need this kind of manual eval by a factor of 100000 is a pretty significant change in capabilities.


Could you elaborate on why it's obvious why you won't name the account?


Maybe to avoid attracting any extra attention to this user? Also, as someone who’s read HN for a few years, it only took me 2 guesses to find an account that the above comment describes (and not necessarily the same person).


It was a classy move by jsnell, too. Thank you.

(I don’t know who the comment is talking about, which is how it should be. There’s no need to blow someone’s cover in a highly visible way. Even if they were satan, they’d still be welcome on HN as long as they’re writing substantive, interesting comments that follow the guidelines.)


Such quality comments would track with most thorough Satan representations.


They obviously don't want it to be known, seeing as they've got alts to post under and avoid going into too much detail. Being able to go out and do your own research is different than posting the information open for everyone to see at a glance.

I would say it's obvious why one might respect that wish (do unto others...), but I'm also aware that my and my culture's sense of privacy goes further than many others'.


MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The top one seems an odd one out in that case?


Usernames aren't random enough to be safe behind a simple MD5. Perhaps with a strong bcrypt, but similar to PIN codes, it might be better to give partial information like "is the second character an ...", assuming nobody else made similar statements. Or give the first two or so hex characters of the hash, so that it would match about 1/16² (1 in 256) of the usernames. I'm sure there's also a clever way to do a zero-knowledge proof here, probably something with Diffie-Hellman using the name as your random integer or something, but I'm too sick to think about this stuff right now. Privately sharing data publicly is hard.


Another problem is that it's a small set. If you had a list of all HN users, you could compute md5 for all of them in seconds.
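To make the point above concrete, here is a minimal sketch of such a dictionary check. It assumes you already hold a list of candidate usernames; the three names below are made-up placeholders, not real HN data, and the function names are illustrative only.

```python
import hashlib


def md5_hex(name: str) -> str:
    """Return the hex MD5 digest of a username."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def find_username(target_hash: str, candidates):
    """Return the first candidate whose MD5 equals target_hash, else None."""
    target = target_hash.lower()
    for name in candidates:
        if md5_hex(name) == target:
            return name
    return None


def prefix_matches(hex_prefix: str, candidates):
    """Return all candidates whose MD5 starts with hex_prefix.

    A 2-character prefix narrows the pool to roughly 1 in 256 names.
    """
    prefix = hex_prefix.lower()
    return [name for name in candidates if md5_hex(name).startswith(prefix)]


# Hypothetical candidate list; a real attempt would load a full username dump.
usernames = ["alice", "bob", "carol"]
target = md5_hex("bob")
print(find_username(target, usernames))  # bob
```

Each candidate costs a single hash, so even a list with hundreds of thousands of names is checked almost instantly. That is why an unsalted hash of a low-entropy value hides very little.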


I think the intention of the post not mentioning the handle was just to prevent old discussions from flaring up or so? The post doesn't really contain any new information on the person that would be worth obscuring. So I just thought I'd hash it to prevent that. But it seems I actually screwed up the hashing so I will leave it at that.


Good point - I've been running john on that md5 for a couple minutes :)


Why use John? Just run down the list of Hacker News usernames; it'll take less time. (Or, better still, don't; just because the privacy's theoretically compromised doesn't mean we have to exploit that.)


I don't think there's a public list of all HN usernames, is there?

Found this, it includes 250k usernames, but it's not there. https://www.kaggle.com/datasets/hacker-news/hacker-news-corp...


The username in question isn't in this dataset but maybe it was created in the past 10 days, as the max(timestamp) is Nov 16th, 2022.

https://console.cloud.google.com/marketplace/details/y-combi...


It isn't there, and given the "story" it happened years ago so it should be there, so I guess we've been played.


Unintentionally played I might add... But I will leave it at that.


> quick browse of the comments of the recently active ones, they look really likely to be alts.

Hmm isn't a spot check of comments somewhat tautological, since that is how the tool identifies alts (rather than something like IP address or time of day)? If this had been promoted as "find accounts with similar writing style to yours" would people immediately assume alts?


I would presume that OP is referring to the actual content of the comments. This just does stylometric analysis, which looks at word choice, but not at what the arrangement of the words means.

If some accounts are found to be stylometrically similar, and then a visual inspection also shows them all stating similar opinions, that latter piece of data is a strong signal.


It would be nice to make the names clickable.

I don't think the list of pg alternate accounts is accurate. I checked a few. They have many one-liners, which is typical of pg, but the topics and style don't look similar.

I searched a few more and got better results. :)

I searched for myself (I know that I have no alternate accounts). I recognize a few users who are interested in similar topics, and I discuss with/upvote them often. But I didn't recognize most of the users on the list.


> I searched for myself (I know that I have no alternate accounts). I recognize a few users who are interested in similar topics, and I discuss with/upvote them often. But I didn't recognize most of the users on the list.

It's based purely off frequency of the 200 most common English 1 word phrases, 2 word phrases, 3 word phrases, 1 character sequences, 2 character sequences, and 3 character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
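For readers who want to see the shape of that, here is a rough sketch, not the site's actual code. It assumes the "most common" n-grams are picked from whatever corpus you pass in (the site may use a fixed English list instead), and it uses cosine similarity, which is mentioned elsewhere in this thread. All function names are made up for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Lowercased word n-grams, joined with spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Lowercased character n-grams."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


EXTRACTORS = {"w": word_ngrams, "c": char_ngrams}


def top_features(corpus, k=200):
    """Pick the k most common n-grams for each of the six feature types."""
    features = []
    for kind, extract in EXTRACTORS.items():
        for n in (1, 2, 3):
            counts = Counter()
            for text in corpus:
                counts.update(extract(text, n))
            features += [(kind, n, gram) for gram, _ in counts.most_common(k)]
    return features


def vectorize(text, features):
    """Relative frequency of each selected feature within one user's text."""
    cache = {}
    vec = []
    for kind, n, gram in features:
        if (kind, n) not in cache:
            grams = EXTRACTORS[kind](text, n)
            cache[(kind, n)] = (Counter(grams), max(len(grams), 1))
        counts, total = cache[(kind, n)]
        vec.append(counts[gram] / total)
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Usage would be something like building `features = top_features(all_user_texts)` once, then comparing `cosine(vectorize(text_a, features), vectorize(text_b, features))` for each pair of users.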


One is also a mathematician. It's trivial that we overuse some technical words even if it's unnecessary.

Another is from Argentina, so I guess the native language leaks, for example through words derived from Latin that are not idiomatic.

And there are a few more that it is an honor to be "confused" with, but I have no clue why.


Cool stuff, thank you for sharing your findings!

I don't do throwaways. I either post or STFU. I also STFU on the darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that with a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts, the people behind The Pirate Bay, or the people behind Wikileaks (Assange's sockpuppet accounts). It could also have been used to verify afterwards instead of beforehand. Though with TPB, since it was on the clearweb, an advanced adversary could have used a correlation/timing attack to figure out who wrote what.

I'm having fun recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag that the person is a native Dutch speaker. I wonder if there are stylometry tools available for figuring out whether someone used a physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).

And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.


Most people greatly underestimate the power of linkage attacks on anonymity. And it doesn't even take fancy ML. In the context of healthcare records, I like to trot out this 25 year old example of an MIT grad student and the then-governor of MA.

https://ischoolonline.berkeley.edu/blog/anonymous-data/


The top hit on my list looked familiar. I looked at their recent comments and saw a discussion between that user and me. We were quoting each other directly throughout.

I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prefixed with ">"?
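A minimal sketch of that preprocessing step, assuming quotes follow the usual HN convention of a line starting with ">" (the function name is made up for illustration):

```python
def strip_quoted_lines(comment: str) -> str:
    """Drop lines that start with '>' so quoted text is not counted as the author's own."""
    kept = [
        line
        for line in comment.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept).strip()


print(strip_quoted_lines("> quoted text\n\nmy own reply"))  # my own reply
```

This would miss quotes written inline with quotation marks or italics, so it only removes the most obvious overlap.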


The scary thing is that once you have this data, finding HN matches for individual targeted users on other sites becomes trivial, even if those sites are harder to scrape. I bet most people here have an anonymous Reddit account, for example. If you wanted to know who was behind a particular Reddit account, you could feed it into something like this and compare the results with HN, where accounts are less likely to be anonymous. Or build a database based on blogs, Github comments, etc.

Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.


I could have used a part of speech tagger, looked at time of day a user posts, capitalization, spelling errors, etc. From what I understand the state of the art is lightyears ahead of this, there are even companies with actual linguists who will act as expert witnesses in court to say stuff like "we can say with 95% certainty that xyz authored this email." Honestly it's kind of scary. There are papers that talk about cross platform authorship attribution, one I think did it with Twitter, Blogspot, G+ and had pretty good results.


Thus proving the only actually anonymous community in practice is 4chan, and that’s why it’s so toxic.


If you define “toxic” as “people disagreeing with you”, sure. That was what the entire internet was like until maybe 2005.


I'm old enough to remember when 4chan was self identifying as the Internet's hate machine, before xkcd referenced it as such: https://xkcd.com/591/

Sometimes people insist that's all role-play and irony; others insist that if it ever was, it certainly isn't now.

But regardless, I remember pre-2005, and it wasn't all like what I saw the two times I looked at 4chan. Bits were. Bits were much worse. But mostly, mostly, people were kinder… at least, unless political tribalism came up.


“People disagreeing with you” describes almost none of the conversation on 4chan


Forget the alternate accounts — if two users are close in style, there’s a decent chance they should be friends. This is an HN friendship machine.


It would be convenient if the usernames linked to the comment pages on Hacker News (to avoid having to copy/paste and URL hack, which is made even slightly more annoying because for some reason when I tap and hold the usernames to copy them your markup--I haven't looked at why yet--is causing an extra space character to get copied on the left).


This is interesting.

I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:

"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."

which is theirs, not mine, from about a year ago. I like that.

On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.


> On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.

This is due to the Firebase API not updating when users ask the admins to move their comments to another account.


Yeah, I got a good match with my previous nick here. Which to me proves the tool works well.


I had a similar experience finding my most likely alt (.50, suggesting I am a unique snowflake, as I have always thought :-). My most likely alt certainly writes in a style I appreciate and on subjects I often mention.


How about this for countermeasure:

As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.


Someone linked it in the thread: https://github.com/psal/anonymouth


Forget countermeasures, go covert. Write a comment, have the comment be rewritten before submission in order to resemble a targeted account.


Sounds great, except there are many different similarity measures. Which one does the algorithm use?


Why not all of them? Which metrics are closer would tell you which aspects of your writing you need to focus on.


This found an alt that I created specifically to see if I could write artificially to defeat this kind of analysis. I have seen other tools like it posted to HN, but none before had found that account. I guess I need to up my game.


If you don't mind sharing, are you "writing artificially" purely in your head, or are you using techniques like intermediate translations?


No mechanical means, but I have referred to a thesaurus occasionally. Mostly I tried to change my sentence structure, not just words. It requires actually thinking differently, in a way. Which makes it difficult to know how well I'm communicating.


I imagine this would be quite difficult in practise, due to all the subliminal factors behind a person's writing choices.

For example, as somewhat illustrated here, your personal vocabulary is a kind of fingerprint. As you mention, using a thesaurus can somewhat alleviate that, but if a thesaurus is only changing a small % of your words, then it will only have a suitably small % effect upon analysis.

To go yet further might (I suspect!) entail methods such as directly lifting and using other people's sentences to convey your own thoughts. But even then, "your own thought patterns" are still informing the manner of the post, to some extent, so over time increasingly robust analysis may still find patterns to hook into.


I wonder if someone will come up with a Grammarly-like tool which you can feed with sample writings to help you increase/lower the similarity score of a new text you are writing.



That post was actually what motivated me to make this. I'm on your email list :)


WOW! It's such a pleasure for me


Ahhh, does anyone remember the hacking crew that leaked BLUEETERNAL and other NSA tools and exploits? Shadowbrokers.

They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]

I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).

Here's an excerpt:

"Attention government sponsors of cyber warfare and those who profit from it !!!!

How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."

[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...


*EternalBlue


Have you tried including parts of speech (for example, as bigrams and trigrams) as part of the features considered in your model? I’ve had great success with stylometry that goes beyond TF-IDF with bags of words; including grammar patterns was shockingly good.

(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)

Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
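A toy sketch of the POS-gram idea, to show the mechanics only. The tag lookup below is a hand-made stand-in with a handful of words; a real pipeline would use a proper part-of-speech tagger, and all names here are hypothetical.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for a real part-of-speech tagger.
TOY_TAGS = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "on": "ADP",
    "quickly": "ADV",
}


def pos_tags(text):
    """Map each word to its tag; unknown words get the placeholder tag 'X'."""
    return [TOY_TAGS.get(word, "X") for word in text.lower().split()]


def pos_ngrams(text, n):
    """Count n-grams over the tag sequence instead of over the words."""
    tags = pos_tags(text)
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Different words, same grammatical pattern, identical POS-gram counts.
print(pos_ngrams("the cat sat on the mat", 2) == pos_ngrams("a dog ran on a mat", 2))  # True
```

The counts can then be fed into the same frequency or TF-IDF weighting used for ordinary word n-grams. The point is that swapping vocabulary does not change these features, only changing sentence construction does.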


> Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.

That is a very good idea and when I update the site that will almost certainly be included :) Any other tips? Been reading papers for ideas and I think I may have to ditch the cosine similarity and go for something fancier soon. Thank you


How long until this becomes the algorithm for a dating site?

“Find hot single women who write just like you”


This seems like a great way to hire freelance copywriters/ghost writers too. I would absolutely hire someone I knew could match my tone well for writing generic unattributed copy.


Wouldn't be surprised if dating sites already used similar algorithms.


Do dating sites really use clever algorithms to match up people together? I was under the impression that, the less likely you are to meet your perfect match, the more you're going to use the app.

In my experience I don't see a relevant list of potential matches aside from gender and age preference, it's all completely random, even frequently I see people outside the settings I've specified (i.e. men or older women).


Wouldn't be surprised if most of the women on a specific dating site had very high similarity scores.


This is one reason why I like legal doctrines such as "beyond a reasonable doubt." Even a 0.9 match in a tool like this could be a coincidence, if there are millions of users. But that won't stop people from casually believing "aha it must be an alt account", based on some anecdata.

It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.


But a 0.8 or 0.9 match and something like Tor usage could be enough to justify a warrant. That's why I'm not sure I want to open source the code because I don't want to normalize this.


Keep in mind the potential to create false accusations by fabricating similar looking accounts.


Hmmm, doesn't seem to work. But you have convinced me (and many others?) to search for our alts consecutively, so now you do know who has alts?


I wonder what's a reasonable threshold for "probably the same person". I've never had an alt on HN, and when I searched myself, it found 3 other users above 0.6, none of whom I've ever heard of before.


If it's >0.9 you can almost guarantee it's an alt, but I've seen certain matches at 0.6. The problem is that writing styles change over time. Another idea I had was converting the scores, which are just cosine similarity scores, into percentiles (so 0.99 would be the 99th percentile of certainty) to make them more human-interpretable.
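One way the percentile idea could be sketched, assuming you have a reference sample of similarity scores between pairs of users (the sample values below are invented purely for illustration):

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Percentage of reference scores strictly below the given score."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)
    return 100.0 * below / len(ordered)


# Hypothetical reference sample of pairwise cosine similarities.
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(percentile_rank(0.6, sample))  # 50.0
```

A percentile says how unusual a score is relative to everyone else, which is arguably easier to read than a raw cosine value. It still is not a probability that two accounts belong to the same person.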


I make new accounts every so often and the accounts of mine that it found have a score of around 0.3. I'm not actively trying to defeat stylometry but it's possible I just have a particularly unremarkable writing style.


Well I must be stereotypical myself because it found me at 0.8 !


The people at 0.4-0.6 with me do share some interests. That's cool on its own.


>The problem is writing styles change over time.

It would be interesting if we could plot the writing style divergence over time.


I got matched with my old account with a score of only 0.45


I have no alts. The highest match for me is about 0.66.


Interesting. The highest non-me account is under 0.4 on my page. I do not believe that I have such a unique writing style - especially since half my posting is on mobile and therefore possibly slightly different than my desktop posts.


My closest is 0.4879. I know I tend to be wordy but I thought I had a pretty generic style as well. This is definitely a fascinating demonstration.


Feeling better about my high of 0.49 now


0.6 is not high enough to indicate an alt


Oh wow, it's really sure that I'm stavrosk, which I am:

https://stylometry.net/user?username=stavros

The next person is 30% less certain, that's huge! This would basically identify any alt I might have with near certainty.


Funny thing is, it thinks I'm you, but it doesn't think you're me!

https://stylometry.net/user?username=rogual

I'd have thought this stylometry thing would be commutative.


I guess it's a multidimensional space, so you can have someone closer to you than me, but they aren't also closer to me than you. Basically, they're close to you, but on the "other side" of me, I guess?


Don't need multiple dimensions for that.

0.1, 0.2, 0.3, 1.0, 2.0

To 2.0, 1.0 is closest.

To 1.0, 0.3, 0.2 and 0.1 are closer.
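The same numbers as a quick sketch, in case anyone wants to poke at it (the helper name is made up):

```python
points = [0.1, 0.2, 0.3, 1.0, 2.0]


def nearest(p, pts):
    """Return the point in pts closest to p, excluding p itself."""
    return min((q for q in pts if q != p), key=lambda q: abs(q - p))


print(nearest(2.0, points))  # 1.0
print(nearest(1.0, points))  # 0.3
```

The distance between 1.0 and 2.0 is the same in both directions. What differs is the ranking, because each point has its own set of competitors.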


Thanks, seems obvious when you put it like that.


The word you are looking for is "symmetric".


stavrosk doesn't have any posts/comments? What's it using to match?


It's my old username.


Huh... seems there are some inconsistencies between what's presented on news.ycombinator.com and the Firebase API. Glad it matches for you though :)


I guess they just didn't go back and reparse, not a big problem. I don't think people change their username frequently :P


This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?


HN has an Algolia-based API. It’s also very easy to crawl.

I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.


I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.


Please don't shoot the messenger. costco shared this voluntarily and I can see no bad intention.

We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.

We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.


And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.


> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?

I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.



Why didn't you use Google BigQuery?

https://news.ycombinator.com/item?id=10440502


I was aware there was a HN dataset on BigQuery but I had never used a library to work with it before and when I played around on the website the posts I got were all from 2015 at the latest. It probably would've made my work easier but there's not really anything I can do about it now.


I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.


It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.


Imagine using this across different platforms :/, and let alone using different techniques in addition...

edit: maybe you'd catch some criminals if you tried to match reddit against dark web for example


Interesting that the OP doesn't come up in the search: https://stylometry.net/user?username=costco


Their first comment and submission were 4 hours ago. Text on the page is accurate it seems.


Not surprising considering the account had no activity before today.


My nearest match is only at 0.406. It'd be interesting to see who the most unique commenters are, but it's also quite possible it wouldn't be flattering.


0.35 is my nearest. In hopes of lowering it even further, here are some nonsensical opinions never expressed on HN before: 1) Programming peaked with COBOL 2) Paul Graham is responsible for 90% of SIDS cases 3) There's no reason to use car when cdr exists.


0.2506 is my nearest match


That's the lowest I've seen yet. You must write uniquely :)


I have no alternative accounts besides a single throwaway account I made to post one "Ask HN" five years ago, but I have a decent number of matches above 0.5. I think this is due to the relatively uniform style of "Who is hiring" posts, since my matches posted in a similar way for other companies. I made many of those for about two years when I was at a start-up.


On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?

Also think it's probably poor form to list users as examples without their permission.


> On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?

Yes.

> This may be out of line but isn't pg on here with a different username, Levenshtein distance of one that's not included? Or is that just a very motivated 13yo account who writes a lot of admin-esque comments.

What other pg account are you referring to? I want to see it so I can see what my algorithm missed.

> Also think it's probably poor form to list users as examples without their permission.

You're right. I'll remove that - I just wanted some examples especially for people on phones who don't feel like typing. Thanks for the feedback.


> However, using automated methods like machine translation services do not appear to be a viable method of circumvention.

https://www.whonix.org/wiki/Stylometry


It found my old account (ara4n; i lost the password) at 0.63. More amusingly it found my cofounder too, who hardly ever posts here (at 0.48)


> ... This site works primarily by analyzing for each user the frequencies of the most common words and phrases in the English language. Accordingly, the easiest way to avoid being identified is to simply use different words than you ordinarily would when writing. More sophisticated models than the one I made can use punctuation, comma usage, and capitalization to identify you so try alternating those as well. Services like Quillbot can help you with this but depending on your circumstances you may not want to send your writings to a third party service.

HN offers many other threads which could be tied together, including:

- time of posting

- ratio of replies to top-level comments

- comments being mainly upvoted or downvoted

- sentiment (mostly angry, dismissive, questioning, etc.)

- most common topics (keyword analysis of post being replied to)

- ratio of new posting to post replies

- first-to-comment on a post

- lone comment on a post

- etc...

It seems very likely that sooner or later every pseudonym for posting content will get discovered and linked. The lesson here is don't post anything that would cause you undue shame or harm if linked directly to your legal name.
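As a rough illustration of just the first item on that list (time of posting), here is a minimal sketch. It assumes you have Unix timestamps for each account's comments and buckets them by UTC hour; the function name is made up.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_histogram(timestamps):
    """Share of an account's comments posted in each UTC hour (24 values)."""
    counts = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in timestamps
    )
    total = sum(counts.values()) or 1  # avoid division by zero for empty input
    return [counts[h] / total for h in range(24)]


# Three hypothetical comments: one at 00:00 UTC, two at 01:00 UTC.
profile = hour_histogram([0, 3600, 3600])
print(profile[0], profile[1])
```

Two accounts with near-identical 24-value profiles are not necessarily the same person, but combined with writing style this kind of signal narrows the candidate pool further.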


Well now I'm self conscious about my closest match being an 0.34 when so many other people are reporting much closer matches with accounts that aren't alts. Do I write weirdly?


Same for me, the closest match is 0.36. But I expected that because I don't speak English very well so the pool of candidates is small.


.31 here! I'm a non-native speaker tho, so it wouldn't surprise me if I had weird speaking habits


My closest is 0.40, so I’m right there with you.

Native English speaker as well.


0.36 here! Out of curiosity, are you a native speaker?


I am, yes.


0.39 for myself, I’m a non-native speaker.


What does the bold signify? For example when I search for dang (https://stylometry.net/user?username=dang) the 4th most likely user is not bold whereas the 16th is?


Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always).
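A small sketch of that "mutual top 20" check, assuming the scores are available as a nested dictionary of similarities (the data layout and names are guesses for illustration, not the site's actual code):

```python
def top_k(user, scores, k=20):
    """The k users most similar to `user`, best match first."""
    others = scores[user]
    return sorted(others, key=others.get, reverse=True)[:k]


def mutual_matches(user, scores, k=20):
    """Users in `user`'s top k who also have `user` in their own top k."""
    return [u for u in top_k(user, scores, k) if user in top_k(u, scores, k)]


# Hypothetical similarity table for three accounts.
scores = {
    "a": {"b": 0.9, "c": 0.5},
    "b": {"a": 0.9, "c": 0.6},
    "c": {"b": 0.6, "a": 0.5},
}
print(mutual_matches("a", scores, k=1))  # ['b']
print(mutual_matches("c", scores, k=1))  # []
```

Here "c" ranks "b" first, but "b" ranks "a" first, so "b" would not be shown in bold on the page for "c".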


Huh, that's a somewhat non intuitive property.


It is a bit, but if stylometric equality were a thing you'd expect it to be symmetric, so if stylometric similarity is a thing....


And this is why I’m a reader and not a poster on HN :)

The second that I found out that requesting deletion of an account and its posts needed a MANUAL request to a single user (dang) I noped out so fast

But happy that the rest of you are still happy to contribute :)


I really liked the informative and straight-to-the-point about page - describing how the algorithm works in a way that is easy to understand. All the important details are summarised there. Well done!

Edit: From the "How to avoid .." page, there is the following sentence:

> Also, most authorship identification algorithms have poor accuracy when working with small amounts of words. This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.

Can you clarify what this means and why it would result in a ban?


> Can you clarify what this means

Imagine that for every new comment you want to post you would create a brand new account which you would use precisely once and never again. Then the stylometry would have just a few words and wouldn’t have a large enough corpus to get a reliable signature. If a lot of people do this, it would be hard to figure out which account belongs to which human. (Of course, if you alone do this, your messages will stick out like a sore thumb. See xkcd 1105.)

> why it would result in a ban?

Because this practice is especially discouraged in the guidelines: “please don't create accounts routinely. HN is a community—users should have an identity that others can relate to.”


At the same time, HN doesn't let you delete comments.

Maybe with some GDPR magic.


Not sure what your point is, or how it connects with my comment. Care to elaborate?


Your comment quotes an HN guideline, and my point relates to it. Some users may feel the need to create throwaway accounts in order to post comments that in an alternative reality they could post under their primary account and later delete if desired. It may not stop a scrupulous collector of data, but such a scenario may not be the object of their worry.

Taking this to its logical conclusion, a user may opt to always post under a throwaway account, to avoid any possible tainting associated with a primary account.


> Can you clarify what this means and why it would result in a ban?

I have seen dang respond to users multiple times asking them to stop making new accounts especially but not always if it's to avoid rate limiting. I don't know if there's an official policy but it's definitely something I recall.


Just a heads up for everyone who doesn't want their alt accounts linked: maybe don't use this tool to see if it works.

Unless the author would run this against all HN user accounts, no need to flag the ones "of interest".


Have you done any data analysis on distributions of similarity? How similar you'd expect any 2 people to be given English focused around tech? Or any other interesting stats you'd like to share?

Very nice clean site, great work.


What match level would you expect to see between two randomly chosen individuals?


It's accurate enough that I had to create a new account now :)

I guess it's difficult to evade, as the word frequency certainly catches everything about the countries I frequently refer to, programming languages, interests, etc.


Similar to how they make adversarial fashion[0][1] in order to not be tracked by face id AI, I wonder if we can make adversarial stylometry tools to run your comments through in order to anonymize it

[0] https://hackaday.com/2022/10/20/render-yourself-invisible-to...

[1] https://adversarialfashion.com/


OP links to a paraphrasing tool on their website.

