Bitcoin donations on The Pirate Bay
Downloading and parsing TPB metadata to estimate Bitcoin usage and revenue for uploaders
Background
SDr> gwern, here's a shiny new angle for your cryptocurrencies knowledge file: do crypto donations to warez distributors, like work?
gwern> SDr: how would that work? they put addresses in the READMEs of their torrents or something?
SDr> gwern, specifically eg. scraping Pirate Bay's NFO files for wallet addresses & cross-referencing it with blockchain, is there volume for it, such that distributors are incentivized to provide clean cracks / keygens, as opposed to bundling blackmail-ware with it?
TODO: compare against Paypal, Flattr, Gratipay?
Watashi, kininarimasu! ("I'm curious!")
Data
Download
https://
It is more efficient to skip downloading the comments, so download.py is patched to comment them out:
diff --git a/download.py b/download.py
index 82837e2..9fed0aa 100644
--- a/download.py
+++ b/download.py
@@ -6,12 +6,12 @@
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
-#
+#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
-#
+#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
@@ -21,7 +21,7 @@ import HTMLParser
import torrent_page
import filelist
-import comments
+# import comments
import requests
import datetime
@@ -39,10 +39,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
elif (tp_status_code == 404):
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
else:
print "ERROR: HTTP " + str(tp_status_code)
error_file.write(datetime.datetime.utcnow().strftime("[%FT%H:%M:%SZ]") + ' ' + str(torrent_id) + ": ERROR: HTTP " + str(tp_status_code) + '\n')
@@ -58,10 +58,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
else:
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
time_log.flush()
break # Success! Break out of the while loop
@@ -72,10 +72,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
else:
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
time_log.flush()
break # Success! Break out of the while loop
@@ -102,7 +102,7 @@ if (len(sys.argv) == 2+offset):
torrent_id = sys.argv[1+offset]
print torrent_id
main()
-
+
elif (len(sys.argv) == 3+offset):
if (int(sys.argv[1+offset]) > (int(sys.argv[2+offset])+1)):
for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])-1, -1):
@@ -112,6 +112,6 @@ elif (len(sys.argv) == 3+offset):
for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])+1):
print torrent_id
main()
-
+
elif (len(sys.argv) > 3 and not https):
print "ERROR: Too many arguments"
started 2014-02-25, 8:51PM EST
sudo apt-get install python-requests python-beautifulsoup
git clone git@github.com:andronikov/tpb2csv.git
cd tpb2csv
sed -i -e 's/thepiratebay\.sx/thepiratebay.se/' *.py
# End:
# http://thepiratebay.se/recent
# most recent: http://thepiratebay.se/torrent/9666564/Blonde_Avenger_008_%28BlitzWeasel_-_1995%29_%28Talon-Novus-HD%29_%5BNVS-D%5D
# ID: 9666564
python download.py 1 9666564
8,xxx,xxx
21:02:44 <@gwern> https://
09:06 PM 0Mb$ python download.py 9000000 9666564
git clone https://
http://
$ cd home/
cd ~/
problems with tpb2csv:
- hardwired to .sx, needs .se
- fragile, easily bombs out after a few connection failures
- doesn’t check for already-downloaded?
desired features:
- default final target from http://
but could also just figure out all missing ids
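The fragility and re-download problems listed above could be worked around with a thin wrapper around the per-torrent download. This is a minimal sketch, not part of tpb2csv: `download_with_retry`, the `fetch` callable (standing in for whatever fetches one torrent page), and the `done_ids` set are all hypothetical names introduced for illustration.

```python
import time


def download_with_retry(torrent_id, fetch, done_ids,
                        max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch one torrent unless it is already done, retrying on I/O errors.

    `fetch` is a hypothetical callable returning an HTTP status code.
    Delays double after each failure: base_delay, 2*base_delay, ...
    """
    if torrent_id in done_ids:
        return "skipped"
    for attempt in range(max_tries):
        try:
            status = fetch(torrent_id)
        except OSError:
            # OSError covers the built-in ConnectionError/TimeoutError and,
            # as far as I know, the exceptions raised by `requests`.
            if attempt == max_tries - 1:
                return "failed"
            sleep(base_delay * 2 ** attempt)
            continue
        done_ids.add(torrent_id)
        return status
    return "failed"
```

With something like this, a few dropped connections would cost a short pause rather than killing the whole run, and restarting over the same ID list would not refetch anything already recorded in `done_ids`.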
countmissing.hs:
import Data.Set (difference, fromList, toList)
import System.Environment (getArgs)

-- Write to missing.txt every ID in [9000000..latest] that is not listed in ids.txt.
main :: IO ()
main = do
    args <- getArgs
    -- first argument: the most recent torrent ID on the site
    let latest = read (head args) :: Int
    -- ids.txt: one already-downloaded torrent ID per line
    ids <- readFile "ids.txt"
    let numbers = fromList (map read (lines ids) :: [Int])
    let allIDs = fromList [9000000..latest]
    let missing = difference allIDs numbers
    writeFile "missing.txt" $ unlines $ map show $ toList missing
and finally, GNU parallel to download each torrent ID separately without mucking around with ranges:
cd ~/tpb/tpb2csv/ && rm ids.txt missing.txt randomized.txt
find ./data/9xxxxxx/ -type d | cut --delimiter='/' --fields=6 | sort --unique | tail --lines=+2 > ids.txt
LATEST_TORRENT="$(elinks -dump 'http://thepiratebay.se/recent' | grep -F 'http://thepiratebay.se/torrent/1' | cut -d '/' -f 5 | head -1)"
runghc countmissing.hs $LATEST_TORRENT
# sort --random-sort missing.txt > randomized.txt
# cat randomized.txt | parallel --max-chars=40 --ungroup --jobs 7 -- python -OO download.py
cat missing.txt | tac | parallel --max-chars=40 --ungroup --jobs 1 -- nice python -OO download.py
gwern> there are an amazing number of 404s on tpb. I wonder why
gwern> why would they ever delete a page?
gwern> short of CP
X> They delete spam, viruses, fake uploads, uploads with wrong names. There is tons of that on TPB actually (and usually uploaded in bulk, that's why the 404s usually form "blocks"). Not that much CP.
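The claim that 404s form "blocks" could be checked against the list of missing IDs by collapsing it into contiguous runs. The following is a rough sketch; `contiguous_runs` is a hypothetical helper, and it assumes a complete download pass has been made, so that the remaining missing IDs are mostly 404s rather than IDs not yet attempted.

```python
from itertools import groupby


def contiguous_runs(ids):
    """Collapse integers into (start, end) pairs of consecutive values."""
    ordered = sorted(set(ids))
    runs = []
    # Within a run of consecutive integers, value minus position is constant,
    # so grouping on that difference splits the list at every gap.
    for _, group in groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0]):
        members = [value for _, value in group]
        runs.append((members[0], members[-1]))
    return runs


def load_ids(path):
    """Read one integer ID per line, as in missing.txt."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]
```

Sorting the result of `contiguous_runs(load_ids("missing.txt"))` by `end - start` would surface the longest deleted stretches, which is where bulk spam uploads would be expected if the explanation above is right.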
Processing
$ find . -type f -name "description.txt" -exec grep --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# UniqueID/String : 198031419884096486800071953706812228345 (0x94FB76D26E78C1B7A2EA96FDDB10C2F9)
# http://easyimghost.com/ImageHosting/8102_7346420e085511e2ba4022000a1e89327.jpg.html
# http://easyimghost.com/ImageHosting/8104_83901b420fe611e29797123138133f0a7.jpg.html
# http://sharepic.biz/show-image.php?id=b264f152debff549632be27bfe965f86
# http://image.bayimg.com/08f6b9b52398b07c3c86e1dc1f3b3d36594e67b8.jpg
# http://image.bayimg.com/0b397e88fdefe06fa99acddf86190f4cd2ef3922.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://image.bayimg.com/34eae2b9725eb15e7a58fd6bf6e2fedb2c5af0a7.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# d17a78495f673990bb6d3ea096e5830bcc7dc4dd
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# B0A544F55125A26C21622D4118739305B9088448
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://image.bayimg.com/6cd5e44b139c1506dfa8f8d59003187454700627.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://image.bayimg.com/0070020142c02ecbadd55e73f0d60a7169367885.jpg
# http://photosex.biz/v.php?id=a7f013597a02d5ec0fffe529da47e7eb
# http://image.bayimg.com/b6a1d70ca01e50d91f30ef683756442d4d52a357.jpg
$ find . -type f -name "description.txt" -exec grep --only-matching --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 1980314198840964868000719537068122
# 346420e085511e2ba4022000a1e89327
# 3901b420fe611e29797123138133f0a7
# 152debff549632be27bfe965f86
# 398b07c3c86e1dc1f3b3d36594e67b8
# 397e88fdefe06fa99acddf86190f4cd2ef
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 34eae2b9725eb15e7a58fd6bf6e2fedb2c
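As the output shows, the regex also matches hex hashes, image IDs, and other long alphanumeric strings. One way to filter these out (an assumption on my part, not something that was run here) is to check each candidate against Bitcoin's Base58Check encoding: a legacy address decodes to 25 bytes, starts with version byte 0x00 or 0x05, and ends with the first 4 bytes of a double SHA-256 of the rest. `is_valid_address` and `extract_addresses` are hypothetical names, and the sketch takes the common shortcut of decoding into a fixed 25 bytes rather than counting leading `1` characters.

```python
import hashlib
import re

BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
# Tighter than the grep pattern: Base58 excludes 0, O, I and l.
CANDIDATE = re.compile(r"\b[13][1-9A-HJ-NP-Za-km-z]{25,34}\b")


def is_valid_address(candidate):
    """True if candidate passes the Base58Check test for a legacy address."""
    number = 0
    for char in candidate:
        index = BASE58.find(char)
        if index < 0:
            return False  # character outside the Base58 alphabet
        number = number * 58 + index
    try:
        raw = number.to_bytes(25, "big")
    except OverflowError:
        return False  # decodes to more than 25 bytes
    payload, checksum = raw[:-4], raw[-4:]
    if payload[0] not in (0x00, 0x05):
        return False  # neither a P2PKH nor a P2SH version byte
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def extract_addresses(text):
    """Return the regex candidates in text that also pass the checksum."""
    return [match for match in CANDIDATE.findall(text) if is_valid_address(match)]
```

Running `extract_addresses` over each description.txt would keep the checksum-valid addresses and drop the hash fragments, since a random 30-odd-character string has only about a 1 in 2^32 chance of carrying a valid checksum.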
maybe use R? http://
nice find tpb/ -type f -print0 | sort --zero-terminated --key=6,5 --field-separator="/" | tar --no-recursion --null --files-from - -c | nice xz -9e --check=sha256 --stdout > ~/tpb.tar.xz; alert
I still miss anything newer than ID 8599995 (06-2013). There is this torrent …9xxx, 10xxx and 11xxx are missing (the last torrents uploaded to TPB were 11xxx) …There were the github CSV repos (that I linked now on the page) by some other guy, but github took them down (at least the 4xxxxxx one)… I am not sure why, but they did, and they did it today, more exactly about 10 minutes ago while I was cloning it to my disk …Do you have them? I found your website when I googled it, it seems you did some experiments on it. …If you do, could you upload it somewhere? (Ideally some torrent) If you have newer than 8xxxxxx (as you seem you had), that would be even perfecter
I’m not sure I have as much as you think I have: the TPB downloader broke 2014-09-18, and I stopped scraping. (I was getting tired of having to babysit it, it was using up a lot of bandwidth and disk space, which made my backups much slower, and I wanted to focus on scraping the blackmarkets.) Also, note that I hacked the downloader to only download the metadata I wanted for my Bitcoin analysis; I did not intend or want to mirror all of TPB, since I assumed the original archivers were doing that and that TPB itself had better procedures than my scraping. So while I don’t remember editing any of the archives I pulled off Github, the more recent files (which are probably what you really want) will not be complete.
Here’s a summary of what I have:
$ ls
03xxxxxx/ 04xxxxxx/ 05xxxxxx/ 06xxxxxx/ 07xxxxxx/ 08xxxxxx/ 09xxxxxx/ 10xxxxxx/ 11xxxxxx/ tpb2csv/
$ duh */.git/
481M 03xxxxxx/.git/
684M 04xxxxxx/.git/
718M 05xxxxxx/.git/
944M 06xxxxxx/.git/
806M 07xxxxxx/.git/
495M 08xxxxxx/.git/
156K tpb2csv/.git/
4.1G total
$ du -ch *
5.6G 03xxxxxx
7.7G 04xxxxxx
8.2G 05xxxxxx
11G 06xxxxxx
9.9G 07xxxxxx
6.4G 08xxxxxx
7.3G 09xxxxxx
3.8G 10xxxxxx
1.8G 11xxxxxx
208K tpb2csv
62G total
$ find ~/tpb/ -type f | sort | xz -9e --check=sha256 --stdout > ~/tpb.txt.xz
# https://www.dropbox.com/s/te1zimevmmzi1qg/tpb.txt.xz