Last Friday, one of the top articles on Hacker News was called “Breaking the Silk Road’s Captcha”.
This sounded pretty cool to me, though not necessarily applicable because the current Silk Road 2.0 (I’ll just be calling it SR from now on) isn’t using anything nearly as sophisticated.
I thought it would be really interesting to scrape SR for, let’s say, a month or two. I could do cool stuff like make a stock ticker and display the values like
The following information is for educational purposes only. I have no affiliation with the Silk Road 2.0, nor have I ever purchased anything off the site. As far as I know, visiting the site and writing about it with no intention to buy (commit a crime) is perfectly legal.
Some implementation quirks
Before we begin: I only wanted to spend an hour or two doing this. I was late for a dinner and wanted it to run overnight while I was sleeping. If you are looking to build a robust system, you should consider a different solution.
Simply download the captcha, run it through some OpenCV transforms, then feed it to Tesseract. If it doesn’t work, just keep on trying until we get a relatively easy one. I think my success rate was >90% with some very simple transforms using OpenCV.
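Here is a rough sketch of that download, clean up, OCR, retry loop in Python. It is not the actual crawler code. The particular transforms (median blur plus Otsu thresholding), the pytesseract wrapper, and the fetch/submit callables are all assumptions made for illustration.

```python
from typing import Callable, Optional


def preprocess(image_bytes: bytes):
    """Grayscale, denoise, and binarize a captcha image with OpenCV."""
    import cv2  # imported lazily so the retry loop below runs without OpenCV
    import numpy as np

    gray = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)
    gray = cv2.medianBlur(gray, 3)  # knock out speckle noise
    # Otsu's method picks the black/white cutoff automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr(image) -> str:
    """Read the cleaned image with Tesseract, treating it as a single word."""
    import pytesseract

    return pytesseract.image_to_string(image, config="--psm 8").strip()


def solve_captcha(
    fetch: Callable[[], bytes],
    submit: Callable[[str], bool],
    read: Callable[[bytes], str],
    max_attempts: int = 20,
) -> Optional[str]:
    """Request fresh captchas until one guess is accepted or attempts run out."""
    for _ in range(max_attempts):
        guess = read(fetch())
        if guess and submit(guess):
            return guess
    return None
```

With OpenCV and pytesseract installed, the pieces would be wired together as solve_captcha(fetch, submit, lambda b: ocr(preprocess(b))), where fetch and submit are whatever functions download a captcha and post the answer.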
Connecting through Tor
The SR site is an anonymous hidden service reachable only through the Tor network. You run the Tor client daemon on your machine, then use it as a SOCKS5 proxy.
This has some complications, because DNS requests also have to go through Tor.
The quick and dirty solution is to just spawn the scraper through torsocks, which wraps all the network requests from my scraper.
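For comparison, here is a minimal sketch of the non-torsocks route: pointing an HTTP client at the local Tor daemon directly. It assumes Tor is listening on its default SOCKS port (9050) and that the client understands the socks5h scheme, as the requests library does when installed with SOCKS support. The helper name is made up.

```python
def tor_proxies(host: str = "127.0.0.1", port: int = 9050) -> dict:
    """Build a proxy mapping that sends both traffic and DNS lookups through Tor.

    The "socks5h" scheme (unlike plain "socks5") asks the proxy to resolve
    hostnames, which is what a .onion address needs.
    """
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```

The mapping can then be passed as the proxies argument of a requests call. The torsocks approach needs none of this: running something like torsocks python scraper.py (the script name is hypothetical) wraps the whole process instead.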
The SR site seems to be very eager to automatically log out users. When logged out, I simply create a new user. When I am back on the site, I make sure to walk from the root node of the crawl tree back down to the last known point instead of jumping straight there. This is to avoid detection.
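One way to picture the "traverse from the root" trick is to remember the chain of links followed so far and replay it after logging back in. The sketch below is an assumption about how that could be structured, not the crawler's real code, and the paths in the usage note are invented.

```python
from typing import Callable, List


class CrawlPosition:
    """Remembers the chain of links followed from the root to the current page."""

    def __init__(self, root: str) -> None:
        self.path: List[str] = [root]

    def follow(self, url: str) -> None:
        """Descend one level in the crawl tree."""
        self.path.append(url)

    def back(self) -> None:
        """Go up one level, never above the root."""
        if len(self.path) > 1:
            self.path.pop()

    def replay(self, visit: Callable[[str], None]) -> None:
        """After a fresh login, re-visit every page from the root down."""
        for url in self.path:
            visit(url)
```

After a forced logout and a new account, calling replay with the session's page-fetching function visits the root first and then each intermediate page, so the new session arrives at the last known point the way a person clicking through the site would.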
The nature of web crawling through Tor:
Crawling through Tor already obfuscates your identity to a certain degree, so we don’t really have to do anything other than cycling User-Agent strings to look different from any other client.
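Cycling User-Agent strings takes only a few lines. This is a minimal sketch, and the three strings in the pool are placeholders picked for illustration, not the ones the crawler actually used.

```python
import itertools

# Hypothetical pool; any set of realistic browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Firefox/24.0",
]

_agents = itertools.cycle(USER_AGENTS)


def next_headers() -> dict:
    """Return request headers carrying the next User-Agent in the rotation."""
    return {"User-Agent": next(_agents)}
```

Each outgoing request would call next_headers() and send the result along. Swapping itertools.cycle for random.choice gives a less predictable rotation.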
I’ve made a one-day snapshot available at github.com/dlau/sr-data.
If anyone is interested, I will release the source code for the crawler when I am done, with the SR-specific portions removed. This will all go to the same repo.
Alright, enough technical details. Let’s see what useful information we can get out of this.
Knowing very little about recreational drug use, I visited the National Institute on Drug Abuse’s website, which conveniently provided the names of what the US considers to be the most widely used drugs.
I thought, if I know them, they must be a big deal right?! I guess so. Here are the drugs I picked out:
Total number of listings
Sorted by number of listings
----------------------------
MDMA          1321
Weed           761
LSD            523
Cocaine        475
Amphetamine    215
Heroin         150
Ketamine        67
Opium           53
Mescaline       20
Total         3585
Weed here means marijuana that is smoked, not any other derivative such as hash.
To put things in perspective, at the time of writing SR has approximately 13,000 listings for drugs. Just a guess, but it looks like prescription drugs account for a large portion of SR drug listings.
Nothing much to say here, other than the fact that MDMA seems to have the most listings.
Highest number of ratings
Just like on Amazon, users can review individual products. SR shows a rating from 1 to 5 stars and the total number of reviews for each product listing.
The average number of ratings per product, as shown here, seems to be rather uniform: there are on average 29 reviews per product.
Drug         Total ratings   Average per listing
------------------------------------------------
MDMA                 33822                    25
Weed                 28213                    37
LSD                  12122                    23
Cocaine              16591                    34
Amphetamine           6251                    29
Heroin                3132                    20
Ketamine              1504                    22
Opium                 1256                    23
Mescaline               62                     3
Total               102953
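The averages can be re-derived from the listing and rating counts above. The snippet below just redoes that arithmetic. Note that the per-drug column appears to be truncated rather than rounded (Cocaine works out to about 34.9 but is shown as 34), while the overall figure of 29 is the rounded value of roughly 28.7.

```python
# drug -> (number of listings, total number of ratings), copied from the figures above
DATA = {
    "MDMA": (1321, 33822),
    "Weed": (761, 28213),
    "LSD": (523, 12122),
    "Cocaine": (475, 16591),
    "Amphetamine": (215, 6251),
    "Heroin": (150, 3132),
    "Ketamine": (67, 1504),
    "Opium": (53, 1256),
    "Mescaline": (20, 62),
}

# Average ratings per listing for each drug.
per_drug = {drug: ratings / listings for drug, (listings, ratings) in DATA.items()}

total_listings = sum(listings for listings, _ in DATA.values())
total_ratings = sum(ratings for _, ratings in DATA.values())
overall = total_ratings / total_listings  # average across all nine drugs

for drug, avg in per_drug.items():
    print(f"{drug:<12}{avg:6.1f}")
print(f"{'Overall':<12}{overall:6.1f}")
```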
Top 100 Most Reviewed Items
MDMA          48
Weed          22
LSD           10
Cocaine        9
Amphetamine    7
Ketamine       1
Opium          1
Mescaline      1
Heroin         1
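A breakdown like this one comes from sorting every listing by its review count, keeping the top N, and tallying by drug. Here is a small sketch of that. The Listing record and its field names are assumptions, since the snapshot's actual schema isn't described here.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Listing(NamedTuple):
    """Hypothetical record for one scraped product listing."""
    drug: str
    title: str
    num_ratings: int


def top_reviewed_by_drug(listings: Iterable[Listing], n: int = 100) -> Counter:
    """Count how many of the n most-reviewed listings belong to each drug."""
    top = sorted(listings, key=lambda item: item.num_ratings, reverse=True)[:n]
    return Counter(item.drug for item in top)
```

Calling top_reviewed_by_drug(all_listings).most_common() would then give the drugs in descending order of how many top-100 slots they hold.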
In case you are wondering, there were some outliers:
- One had 100g of MDMA for $1510.77. It had 392 ratings.
- Another was selling 100g of MDMA for $1186 and 50g for $659. They had 293 ratings and 279 ratings respectively.
- The other was for 1/4lb of bulk medical marijuana for $619.10. It had 378 ratings.
I somehow doubt this guy has sold roughly half a million dollars’ worth of MDMA at $1.5k a pop in such a huge quantity (392 ratings × $1,510.77 ≈ $592,000, if every rating is one sale), but the price seems to be in line with other sellers for an equivalent amount. I’m not entirely sure what the rules are regarding who can give feedback, but if a user must buy a product to be able to review it, then there seem to be people buying huge quantities. I have never purchased anything from the site, and I wasn’t presented with any option to review an item.
If only people who purchase the item can review it, then I am a bit less skeptical. I saw one Canadian seller listing 1 kilo of MDMA for USD $8k with 1 review!
The average price of the top 100 items is
The average price of the top 500 items is
The average price of the top 1000 items is
Prices are converted to USD at the time of the crawl using exchange rates from the Coinbase API.
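The conversion and averaging step might look something like the sketch below. It assumes listing prices are stored in BTC and that the USD-per-BTC rate has already been fetched. The function names and the rate in the usage note are made up, and no actual API call is shown.

```python
from decimal import Decimal
from typing import Sequence


def to_usd(price_btc: Decimal, usd_per_btc: Decimal) -> Decimal:
    """Convert one BTC price to USD, rounded to cents."""
    return (price_btc * usd_per_btc).quantize(Decimal("0.01"))


def average_usd(prices_btc: Sequence[Decimal], usd_per_btc: Decimal) -> Decimal:
    """Average USD price of a group of listings, e.g. the top 100 by reviews."""
    if not prices_btc:
        raise ValueError("need at least one price")
    total = sum((to_usd(p, usd_per_btc) for p in prices_btc), Decimal("0"))
    return (total / len(prices_btc)).quantize(Decimal("0.01"))
```

For example, at an illustrative rate of Decimal("600") USD per BTC, listings priced at 0.5 and 1.5 BTC convert to $300.00 and $900.00 and average out to $600.00. Using Decimal instead of float avoids rounding drift when summing many prices.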
Sellers on SR can specify where they ship from and where they ship to.