
[–]MrSmilingWolf[S] 2 points3 points  (0 children)

The main post is for the most mature one, but I've got three other smaller models using different architectures trained to varying degrees of maturity that I want to release, plus ONNX and quantized TFLite versions. I'll upload these at a later date.

[–]KichangKim 2 points3 points  (1 child)

am I the only one using focal loss for this? I know DeepDanbooru implements it, but was it ever used? It did miracles for my recall, esp. for rarer classes. Ping /u/KichangKim

Hi. And you are right. DeepDanbooru has a focal loss implementation, but it's not used yet. I'll try it for the next training.

[–]MrSmilingWolf[S] 1 point2 points  (0 children)

I highly recommend it. I started noticing diminishing returns when using BCE to train networks scaling up from top-500 to top-1000 tags with ResNet50, and severely so when scaling up again from top-1000 to top-2000.

Focal loss did a LOT of good for the bottom half of the classes.
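
For reference, here's a minimal TF/Keras sketch of the kind of binary focal loss being discussed (Lin et al., 2017), applied per tag on sigmoid outputs; the gamma/alpha values are the paper's defaults, not necessarily the settings used here:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Multi-label focal loss: gamma down-weights well-classified tags so the
    rarer, harder ones dominate the gradient; alpha rebalances positives vs negatives."""
    def loss_fn(y_true, y_pred):
        # y_pred: sigmoid probabilities, shape (batch, num_tags)
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        bce = -(y_true * tf.math.log(p) + (1.0 - y_true) * tf.math.log(1.0 - p))
        p_t = y_true * p + (1.0 - y_true) * (1.0 - p)              # probability of the true label
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * bce)
    return loss_fn

# model.compile(optimizer="adam", loss=binary_focal_loss(), metrics=["binary_accuracy"])
```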

[–]gwern 1 point2 points  (7 children)

Are you clustering based on an embedding or just on clustering the predicted tags?

How many GPU-hours is 100 epochs?

Apply for Google TPUs perhaps?

TRC hands out TPUs like candy. Since you're using Tensorflow already, interfacing with TPUs shouldn't be too hard. Just be careful with egress/cross-region bandwidth.

Point is, I'd like to find out how well new archs, bigger archs, mobile archs work outside of Imagenet.

Yeah, tagging isn't remotely as well studied as categorizing. Do RegNets break on tagging or do they just need some tweaks? Who knows? Certainly not Facebook.

DL researchers generally tend to jump from 'categories' to 'semantic segmentation' or 'text captions', with not much in between. I assume it's the general absence of popular, heavily tagged datasets like, well, Danbooru20xx. (This is why I keep telling people that they can train their CLIP-like or DALL-E-like on Danbooru20xx without a problem: just turn the tags into a string by concatenating them with spaces & commas or something. It's not like CLIP isn't already mostly learning at the bag-of-words level anyway!)
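
As a toy illustration of the tag-concatenation idea (the function name, separator, and shuffling are arbitrary choices, not a prescribed format):

```python
import random

def tags_to_caption(tags, shuffle=True, sep=", "):
    """Turn a Danbooru tag list into a pseudo-caption for CLIP-like / DALL-E-like training.
    Shuffling keeps the model from latching onto a fixed tag order."""
    tags = list(tags)
    if shuffle:
        random.shuffle(tags)
    return sep.join(tag.replace("_", " ") for tag in tags)

print(tags_to_caption(["1girl", "long_hair", "school_uniform", "cherry_blossoms"]))
# e.g. "school uniform, cherry blossoms, 1girl, long hair"
```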

[–]MrSmilingWolf[S] 2 points3 points  (0 children)

Are you clustering based on an embedding or just on clustering the predicted tags?

Embeddings from the GlobalAveragePooling2D layer, right after the final projection layer to 3072 channels and the ReLU activation.

To keep the indexes' size somewhat sane I also throw some PCA in there. I choose the number of components to keep following the procedure from this post: https://towardsdatascience.com/how-to-tune-hyperparameters-of-tsne-7c0596a18868
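
Roughly what that pipeline looks like, as a sketch; the file name, layer name, image size, and component count below are placeholders rather than the exact values used:

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# Load the trained tagger and cut it at the pooling layer
# (layer name is a guess; check model.summary() for the real one).
tagger = tf.keras.models.load_model("tagger.h5")
embed_model = tf.keras.Model(
    inputs=tagger.input,
    outputs=tagger.get_layer("global_average_pooling2d").output,  # 3072-d embedding
)

# Embed one preprocessed image (random stand-in array here).
image = np.random.rand(1, 320, 320, 3).astype("float32")
embedding = embed_model.predict(image)                       # shape (1, 3072)

# After embedding the whole dataset, fit PCA once to shrink the index.
all_embeddings = np.random.rand(10000, 3072).astype("float32")  # stand-in for the full set
pca = PCA(n_components=256)  # 256 is illustrative; pick it via explained variance as in the linked post
reduced = pca.fit_transform(all_embeddings)                  # shape (10000, 256)
```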

How many GPU-hours is 100 epochs?

On a single p3.2xlarge instance, using TF 2.3 (ancient, but stable at least), it takes 6.75-7 hours to complete a single epoch over 1,553,694 images. That makes for around 700 hours, or 29 days of training, split more or less evenly into about one week per month over the last four months for budgeting reasons.

Last month I finally chased down the legacy bits in the model implementation and AGC code, so now I can train with TF 2.7 and expect a bump in performance using mixed precision on the V100.

[–]MrSmilingWolf[S] 2 points3 points  (5 children)

TRC hands out TPUs like candy

Long story short, I applied for a single-GPU quota on GCP 3 times about 18 months ago and was shot down every single time, which is why I started using AWS.

Fast forward to NYE: I applied to TRC thinking "as if", purely because of the quote above. They sent me a code for 5 TPUv3 and 5 TPUv2 devices, all on-demand, plus an obscene number of preemptible TPUv2 devices, no questions asked, in a matter of hours.

I cannot overstate how excited I am to scale up my experiments. There are a lot of things I've always wanted to try out, including longer schedules and different model sizes or archs, plus increasing the number of labels once I get a grasp of what works and what doesn't.

I'd really like to use Danbooru2021 for this, so out of curiosity, is there an ETA for its release?

[–]gwern 1 point2 points  (4 children)

Fast forward to NYE: I applied to TRC thinking "as if", purely because of the quote above. They sent me a code for 5 TPUv3 and 5 TPUv2 devices, all on-demand, plus an obscene number of preemptible TPUv2 devices, no questions asked, in a matter of hours.

Ah, that was your mistake. A logical person would assume that GPUs and TPUs are pretty much the same thing; a logical person would assume that if they couldn't get any GPUs at all, they definitely couldn't get TPUs. A logical person would be baffled by the reality of TRC: TRC is in a really weird position where they are siloed from GCP and they are their own little world. GPUs are GCP's thing, as are GCP credits. So, TRC can hand you literally a million bucks of TPU time (list-price)*, it ain't no thang - but they can't hand you so much as $100 of GCP credits to cover bucket storage costs or GPUs without them moving heaven & earth, because they have to go beg GCP for it. (EDIT: and if they do, they can only do so once because you can only apply 1 credit to an account?!) It's probably the single worst drawback of TRC and I've told them and anyone who will listen that this is completely crazy and self-sabotaging, but, well, big corporations...

I'd really like to use Danbooru2021 for this, so out of curiosity, is there an ETA for its release?

I was hoping I'd get it out by now, but I've run into 2 issues. (I had to delete a lot of my monthly crawls when my server ran out of space, and I forgot to redo them after getting a new server, so that is taking a while. And the BigQuery mirror is currently refusing to let me do anything at all with it due to permissions problems.) EDIT: resolved the perms, but it turns out the BQ stopped updating in November when Evazion moved the Danbooru DB to Kubernetes (?!) and now there's a completely different BQ mirror with a different schema... On the plus side, the new mirror seems to be way more comprehensive, and posts isn't too different in schema...

* I assume they gave you the usual 1-month quota. But show them you can use TPUs at all and ask politely, and they may be able to hand you some TPUv3-128s or something. Using TPU pods is really nice.

[–]MrSmilingWolf[S] 0 points1 point  (3 children)

I've checked the new danbooru1 dump, and from a quick look it seems two tables will have to be dumped to get the same data as in previous years: posts and tags. The former only has tags as a list of strings; the latter holds tag ids and categories.

Now, I don't think tags is strictly necessary for a classifier's purposes - in posts the tag strings are already separated into general, artist, character, copyright and meta, and the post counts needed to select the top-k popular tags can be calculated - but do you think you could dump and host it anyway? The tag creation date in particular could be useful for other kinds of analyses.
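
For what it's worth, a sketch of how the top-k selection could work straight from the posts dump, assuming a newline-delimited JSON file with a tag_string_general field (the field and file names are guesses based on the Danbooru API, not the actual schema of the new dump):

```python
import json
from collections import Counter

counts = Counter()
with open("posts.json") as f:
    for line in f:
        post = json.loads(line)
        # count every general tag on every post
        counts.update(post.get("tag_string_general", "").split())

# keep the 2000 most frequent general tags as the label set
top_tags = [tag for tag, _ in counts.most_common(2000)]
```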

Meanwhile I'll check what it will take to make fire-eggs' tools compatible with the new schema.

[–]gwern 0 points1 point  (0 children)

My current thinking after consulting with #iqdb is that Youstur's BQ mirror is never going to be restored or updated, so what I'm going to do is bite the bullet: I'll provide the Youstur JSON up to November 2021 and also dump each of the tables in danbooru1, and announce that future releases will contain only the danbooru1 tables and users should migrate their stuff permanently with the Youstur as a bridge.

The other tables should open up some interesting possibilities, like doing contrastive learning on caption text as well as tags, which would induce more OCR+translation capabilities (particularly with any text encoder pretrained or jointly trained on English+Japanese text corpuses).

[–]gwern 0 points1 point  (1 child)

Danbooru2021 is live.

[–]MrSmilingWolf[S] 0 points1 point  (0 children)

Best news of the week! Thank you for your work!

I've been thinking a lot about pretrained weights over the past week, and I'm not going to use them in the immediate future. In the short term I really want to prod at the different architectures and learn from their failure modes. And there's this other loss (ASL, arXiv:2009.14119) that I wanted to try out but couldn't risk GPU time on previously.
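
For context, a rough TF sketch of what ASL changes compared to plain BCE - asymmetric focusing plus probability shifting on the negatives; the hyperparameters below are the paper's defaults, not settings that were actually tried here:

```python
import tensorflow as tf

def asymmetric_loss(gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    """ASL (arXiv:2009.14119): focus positives and negatives with different gammas,
    and shift negative probabilities so very easy negatives contribute almost nothing."""
    def loss_fn(y_true, y_pred):
        # y_pred: sigmoid probabilities, shape (batch, num_tags)
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_shift = tf.clip_by_value(p - clip, eps, 1.0 - eps)   # max(p - m, 0), clipped for the log
        loss_pos = y_true * tf.pow(1.0 - p, gamma_pos) * tf.math.log(p)
        loss_neg = (1.0 - y_true) * tf.pow(p_shift, gamma_neg) * tf.math.log(1.0 - p_shift)
        return -tf.reduce_mean(loss_pos + loss_neg)
    return loss_fn
```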

I plan to implement and train ConvNeXt shortly though, so I will be able to load good pretrained weights (87.8% on PWC, i22k weights have been released) when the time comes.

[–]MrSmilingWolf[S] 1 point2 points  (0 children)

To whom it may concern: I just uploaded a few new model weights, together with data about model performance across a few different variant/mixup/activation settings.

[–]gwern 0 points1 point  (2 children)

BTW, have you ever considered using transfer learning instead of training an NFNet from scratch? If you look at Paperswithcode, I see released checkpoints as high as 87% ImageNet top-1 (SwinL) or 88% (BEiT-Large). The performance usually transfers downstream to tasks like semantic segmentation, which seems close in spirit to tagging.

[–]MrSmilingWolf[S] 0 points1 point  (1 child)

I did initially, but I remember thinking that 71M parameters was waaay too much for whatever that network was doing, and that was the smallest variant. So I scaled the parameters down and went my own merry way, later adopting timm's Lx variants in a tentative attempt to follow some standard, with some further simplification to keep them even simpler and more portable - ReLU instead of SiLU, and no ECA/SE, to make them a tad more mobile-friendly.

However, I noticed just now that NFNet-F4 and BEiT-Large have a comparable number of parameters (in the low 300M range), have been tested at similar image sizes (512px), but have different pretraining - ImageNet-1k for NFNet, ImageNet-22k for BEiT - and I wonder: given a budget in either epochs or TPU hours, which one could be finetuned better? Would they converge to similar scores at some point? How fast? How well does pretraining on real-life images transfer to drawings anyway?

Now those are some questions I could try to answer, for science.

[–]gwern 0 points1 point  (0 children)

I ignored NFNet there because PWC notes that it's pretrained on JFT-*, and Google pretty much never (ever?) releases checkpoints trained on JFT datasets. So you only get the lesser I1k-pretrained model for NFNet-F4, which drops you down to 85% - noticeably worse than an I21k BEiT at 88%. But since you're already using NFNet so heavily, sacrificing 3% may be worth the familiarity & tooling compared to learning BEiT. I expect it'd still work better than training from scratch, at least in terms of saving compute. At least in theory, even if D2020 is big enough that the pretraining prior doesn't help in the limit, the pretrained model ought to converge faster, so you can spend that compute elsewhere. Really, you're sabotaging yourself by not using the best pretrained model you can find.

Another good trick is precomputing CLIP embeddings (or a better CLIP-like model; surely someone's released checkpoints for one of the improved ones) and using them to inject knowledge, either as conditioning or as an auxiliary loss. You can do that with GANs etc. and they learn way faster when they start from an embedding, e.g. Projected GAN.
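
A minimal sketch of the precompute step, assuming the openai/CLIP package and its ViT-B/32 checkpoint (any stronger CLIP-like model would be used the same way; the image path is hypothetical):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    image = preprocess(Image.open("some_post.jpg")).unsqueeze(0).to(device)
    embedding = model.encode_image(image)                         # shape (1, 512)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)  # unit-normalize before caching

# Cache `embedding` to disk, then feed it to the tagger as conditioning or as an auxiliary target.
```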

[–]insufficient_qualia 0 points1 point  (0 children)

How does it fare on this benchmark?

https://danbooru.donmai.us/posts/3343112