Pretty much as the title says, because of course one more of those things had to drop before the year's end
Repo: https://github.com/SmilingWolf/SW-CV-ModelZoo
Release (more to follow): https://github.com/SmilingWolf/SW-CV-ModelZoo/releases/tag/NFNetL1V1-100-0.57141
Short summary:
Pros:
- probably trained longer than any other similar project
- am I the only one using focal loss for this? I know DeepDanbooru implements it, but was it ever actually used? It did wonders for my recall, especially on rarer classes (a rough sketch of what I mean is below, after this list). Ping /u/KichangKim
- uses mixup, although whether this is an advantage or a mere technical note is up for debate (second sketch after this list)
- still, resuming training for 25 epochs with mixup (alpha = 0.2) resulted in a 0.22 improvement overall, and noticeably more on rarer classes, about 1% if I remember right
- did the extra epochs help? Was it really mixup? We may never know
- using images with IDs from 4970000 to 5000000 as validation, and the intersection of tags common to both my training data and DeepDanbooru's (2199 classes, evaluated on 28995 images for a total of 811739 tags), my network obtains an F1 score about 3.2% better at the intersection of the P-R curves
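For the curious, here is roughly what focal loss looks like in a multi-label tagging setup like this one. gamma and alpha below are the usual RetinaNet defaults, not necessarily what went into the released weights, and the function name is just illustrative:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss for multi-label tagging.

    Down-weights easy examples so the gradient is dominated by hard and
    rare tags. Expects sigmoid probabilities as y_pred.
    """
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)

        # p_t: predicted probability of the ground-truth value for each tag
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)

        ce = -tf.math.log(p_t)
        return tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * ce)

    return loss_fn

# model.compile(optimizer=..., loss=binary_focal_loss(gamma=2.0, alpha=0.25))
```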
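And the gist of the mixup step, along the lines of the keras.io recipe the repo borrows from; alpha = 0.2 matches what I used, everything else (names, per-sample lambda) is just how it's sketched here, not the exact code in the repo:

```python
import tensorflow as tf

def mixup(images, labels, alpha=0.2):
    """Mix each image/label pair with a shuffled partner from the same batch.
    Images are assumed to already be floats."""
    # Per-sample lambda ~ Beta(alpha, alpha), built from two Gamma draws
    shape = tf.shape(images)[:1]
    g1 = tf.random.gamma(shape, alpha)
    g2 = tf.random.gamma(shape, alpha)
    lam = g1 / (g1 + g2)

    lam_img = tf.reshape(lam, [-1, 1, 1, 1])
    lam_lab = tf.reshape(lam, [-1, 1])

    idx = tf.random.shuffle(tf.range(shape[0]))
    labels = tf.cast(labels, tf.float32)
    mixed_images = lam_img * images + (1.0 - lam_img) * tf.gather(images, idx)
    mixed_labels = lam_lab * labels + (1.0 - lam_lab) * tf.gather(labels, idx)
    return mixed_images, mixed_labels

# dataset = dataset.map(lambda x, y: mixup(x, y, alpha=0.2))
```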
Cons:
- the code sucks. The project structure sucks. It's a bunch of scripts, interwoven in a lot of places with snippets from the internet, held together by the Great Blue Sky's good will
- for one, the AGC code is from Sayak Paul's notebook, and most of analyze_metrics.py is from the internet too, though I've forgotten exactly where from; the mixup code is straight from keras.io
- fewer tags than DeepDanbooru (2380 vs 7811, roughly 1/3)
- only trained on a subset of the 512px SFW set
- ID modulos between 0000 and 0599
- images with fewer than 15 tags were pruned
- tags that had an abnormally low number of samples within this set were pruned. No use feeding the model with zeros on NSFW classes
- messy training. The first 60 epochs were done with mean/std normalization, then I switched to simple 0-1 scaling for epochs 61-75, then added mixup for epochs 76-100
- in all cases the learning rate was dropped once at 60% and once at 90% of the scheduled training epochs
- still, precision and recall on the validation set improved at the end of each of the three training cycles
- did not sweep clipping values for AGC. 0.02 was chosen based on the performance of NFResNets with the smallest batch sizes reported in the original paper (a simplified sketch of AGC is below, after this list)
- caveats apply: DeepDanbooru v3 works on 512x512 images, mine on 320x320 ones
- it is also smaller than DeepDanbooru v3, but 180 MB of f32 weights is still pretty hefty
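For reference, this is the core idea of AGC. The actual code in the repo comes from Sayak Paul's notebook and clips unit-wise like the paper; this sketch clips whole tensors to keep it short. clip_factor = 0.02 is the value mentioned above:

```python
import tensorflow as tf

def adaptive_grad_clip(grads, params, clip_factor=0.02, eps=1e-3):
    """Simplified, per-tensor Adaptive Gradient Clipping.

    Rescales a gradient whenever its norm exceeds clip_factor times the
    norm of the corresponding parameter tensor.
    """
    clipped = []
    for g, w in zip(grads, params):
        if g is None:
            clipped.append(g)
            continue
        w_norm = tf.maximum(tf.norm(w), eps)
        g_norm = tf.norm(g)
        max_norm = clip_factor * w_norm
        # Only rescale when the gradient/weight ratio is too large
        scale = tf.where(g_norm > max_norm, max_norm / (g_norm + 1e-6), 1.0)
        clipped.append(g * scale)
    return clipped
```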
Things I've noticed:
- DeepDream doesn't give images as nice as BatchNorm networks do. I speculate this is due to the AveragePooling2D in the residual paths sponging up variance
- experimentally determined that adding an x * 2 between the pooling layer and the 1x1 convolution preserves variance better and in fact gives nicer DeepDreams (see the sketch after this list)
- note: the paper authors DID notice this happened, but also noted it didn't influence training in any meaningful way, so they didn't implement any corrective action
- with this particular dataset, these labels, these particular networks, this batch size (32) and focal loss, a much higher learning rate than I expected was necessary
- needed a warm-up for an epoch or two at 0.0125, then switched into high gear and went with 0.3. Worked great for me (the whole schedule is sketched below)
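To make the x * 2 trick concrete, this is the kind of shortcut branch I'm talking about; the layer layout is illustrative, not lifted from the repo. The intuition: a 2x2 average pool over roughly uncorrelated activations divides the variance by ~4, i.e. the std by ~2, so multiplying by 2 puts the scale back before the 1x1 projection:

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample_shortcut(x, out_channels):
    """Illustrative shortcut branch for a downsampling NFNet-style block."""
    x = layers.AveragePooling2D(pool_size=2)(x)
    x = layers.Lambda(lambda t: t * 2.0)(x)   # variance-preserving rescale
    x = layers.Conv2D(out_channels, kernel_size=1)(x)
    return x
```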
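And the schedule itself looks roughly like this (the 10x drop factor is just for illustration; the warm-up value, the 0.3 peak and the 60%/90% drop points are the ones described above):

```python
def lr_schedule(epoch, total_epochs=100, warmup_epochs=2,
                warmup_lr=0.0125, base_lr=0.3):
    """Epoch-wise LR: short warm-up, then a high plateau,
    with drops at 60% and 90% of the scheduled epochs."""
    if epoch < warmup_epochs:
        return warmup_lr
    lr = base_lr
    if epoch >= int(0.6 * total_epochs):
        lr *= 0.1
    if epoch >= int(0.9 * total_epochs):
        lr *= 0.1
    return lr

# callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
```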
Things I'm using this for:
- autotagger for cosplay sets in my Hydrus archive (it works decently well for real life cosplay photos)
- clustering Twitter galleries before manual processing, e.g. for franchise/character tagging. Grouping stuff together with feature extraction, UMAP and HDBSCAN makes things somewhat easier (first sketch after this list)
- similar-image search engine. "Is this photo a missing piece from a set I already have?" "Can I find in my archives some more images with, e.g., open grassy land, no humans and ravines, similar to this one I already have here?", stuff like that (a minimal lookup is sketched after this list)
- for (drawn) people, I noticed that if it can't find other images of the same character, it tries to match the pose, the hair color and more generic attributes. Useful for moodboards?
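The clustering pipeline boils down to something like this; the parameters and the file name are just placeholders, features come from the tagger's penultimate layer:

```python
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

# (n_images, d) feature matrix extracted beforehand; hypothetical file name
features = np.load("twitter_gallery_features.npy")

# Reduce to a low-dimensional space where density-based clustering behaves well
embedding = umap.UMAP(n_components=5, metric="cosine").fit_transform(features)

# Cluster; label -1 marks images HDBSCAN considers noise
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embedding)
```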
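And the similar-images lookup is, more or less, just cosine similarity over the same feature vectors:

```python
import numpy as np

def most_similar(query_vec, gallery, top_k=10):
    """Cosine-similarity lookup.
    gallery: (n_images, d) feature matrix; query_vec: (d,) feature vector."""
    gallery_n = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    scores = gallery_n @ query_n
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]
```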
Things I'd like to try:
- BIGGER networks! Money doesn't grow on trees though, and bloody AWS costs a fair amount. Apply for Google TPUs, perhaps?
- More networks! Model zoo! The point is, I'd like to find out how well new archs, bigger archs and mobile archs work outside of ImageNet. RegNets, for example, were an utter disappointment
Misc:
- it takes a bit of fiddling to squeeze every bit of performance out of NFNets when converting them to ONNX with tf2onnx. NFNetL1 is ready, the other ones will follow (conversion sketched below)
- it takes more fiddling to correctly convert NFNets to TFLite and int8-quantize them, since grouped convs aren't supported yet. Again, NFNetL1 is ready, the other ones will follow (quantization sketched below)
- note: the quantized TFLite model hasn't gone through rigorous testing; I only checked that the results were close enough to the original ones on a couple of images
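For anyone wanting to redo the ONNX conversion themselves, the plain tf2onnx route looks like this; the opset and the paths are placeholders, not necessarily what went into the released file:

```python
import tensorflow as tf
import tf2onnx

# Hypothetical path; custom layers may need custom_objects when loading,
# omitted here to keep the sketch short
model = tf.keras.models.load_model("path/to/NFNetL1V1_savedmodel")

# 320x320 inputs, matching what the networks were trained on
spec = (tf.TensorSpec((None, 320, 320, 3), tf.float32, name="input"),)
onnx_model, _ = tf2onnx.convert.from_keras(
    model, input_signature=spec, opset=13, output_path="NFNetL1V1.onnx")
```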
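And the full-integer TFLite quantization follows the standard converter recipe; in practice you'd feed real preprocessed images as the representative dataset instead of random noise, and the grouped-conv workaround mentioned above still has to happen on the model side first:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    """Yield a handful of inputs so the converter can calibrate int8 ranges.
    Random data here only to keep the sketch self-contained."""
    for _ in range(100):
        yield [np.random.rand(1, 320, 320, 3).astype(np.float32)]

# Hypothetical path to the (already grouped-conv-free) SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("NFNetL1V1_int8.tflite", "wb") as f:
    f.write(tflite_model)
```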