all 48 comments

[–]Goldkoron[S] 19 points20 points  (28 children)

Training using https://github.com/victorchall/EveryDream-trainer which is sort of a deviation from dreambooth and somewhat of a hybrid between traditional training and dreambooth. I am using a large dataset consisting of frames from the anime and manually captioned hundreds of images.

For generating this image I generated a txt2img at 640x640, SD upscaled to 1280x1280, then inpainted both characters at full resolution to improve their overall accuracy and quality.

This test model is still only 60% trained and my dataset only covers episodes 1-8 so far, still working on testing different caption methods and expanding dataset to rest of the anime. The goal is to be able to generate any named character from the anime, and most of the locations.

[–]cyber-meow 5 points6 points  (23 children)

Great work. I think it would be nice if someone could build a general workflow to train from any anime. Of course it would not be as good as the ones with a lot of human intervention such as proper captioning and selection of the best frames, but many things can still be automated, such as character classification, tagging, auto-cropping, etc. It is good to know that someone is getting promising results when training from anime frames.

[–]Goldkoron[S] 6 points7 points  (22 children)

My friend uses something on his end to automatically extract frames of an anime for me, skip duplicates, and auto-crop and resize to 512x512. As for automating selection of the best frames, I don't know how to do that. I manually select about 60-80 frames per episode and then have to manually caption them all with character names, locations, and certain features like outfits, etc.
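For the duplicate-skipping step, a perceptual hash works well. Here's a minimal pure-Python sketch of the idea, assuming frames have already been downscaled to 8x8 grayscale grids (a real pipeline would more likely use a library like `imagehash` on the full frames, but the logic is the same):

```python
def average_hash(grid):
    """64-bit perceptual hash of an 8x8 grayscale grid (8 rows of 8 ints)."""
    flat = [p for row in grid for p in row]
    avg = sum(flat) / len(flat)
    return tuple(1 if p > avg else 0 for p in flat)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def skip_duplicates(frames, threshold=5):
    """Keep a frame only if its hash is far enough from every kept frame's."""
    kept, hashes = [], []
    for frame in frames:
        h = average_hash(frame)
        if all(hamming(h, kh) > threshold for kh in hashes):
            kept.append(frame)
            hashes.append(h)
    return kept
```

Consecutive near-identical frames (common in anime, where characters hold still for many frames) hash to nearly the same bits and get dropped.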

[–]cyber-meow 1 point2 points  (14 children)

I think the latter part is what can be partially solved in a fairly easy way by training a proper tagger beforehand. Yet selecting good images is a subjective question. Selecting diverse images that satisfy a certain criterion should be simpler but still difficult.

[–]MuskelMagier 0 points1 point  (13 children)

For anime there is already a good tagger with deepbooru

[–]gwern 1 point2 points  (8 children)

One could also generate descriptions with a good image captioner like BLIP. That's how the Pokemon & Naruto DB models were done.

EDIT: Also remember, you don't need text+image paired data to finetune SD. SD runs in unconditional mode just fine. You can just train on a large dump of unlabeled images. Or if the text captions are low quality in a large dataset, delete them. Or train with a large dump + small labeled dataset, if the built-in text understanding is inadequate. Or...
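A rough sketch of that mixed setup (all names here are illustrative, not from any particular trainer): give unlabeled images an empty caption, and optionally drop a fraction of the labeled captions too, so the unconditional pathway keeps getting trained:

```python
import random

def build_training_list(labeled, unlabeled, drop_prob=0.1, seed=0):
    """Combine (image, caption) pairs with caption-less images.

    Unlabeled images get an empty caption (pure unconditional training);
    labeled captions are also dropped with probability drop_prob so the
    model never forgets how to generate without text guidance.
    """
    rng = random.Random(seed)
    examples = [(img, "" if rng.random() < drop_prob else cap)
                for img, cap in labeled]
    examples += [(img, "") for img in unlabeled]
    rng.shuffle(examples)
    return examples
```

The same empty-caption trick is what makes classifier-free guidance possible at sampling time, so it pulls double duty.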

[–]MuskelMagier 1 point2 points  (4 children)

I ran the pic above through both, and deepbooru is far more detailed and accurate, especially because it is already trained on anime.

[–]gwern 1 point2 points  (3 children)

I'm sure, but that doesn't mean it's better. Waifu-Diffusion (which I assume OP is using although he doesn't mention it) is already quite knowledgeable in general, the point is to shovel as much Made in Abyss imagery through it as possible to capture all the possible scenes & styles & objects. The tags/descriptions aren't really that important. In fact, OP should probably be spending more compute on just training unconditional WD on random frames, rather than spending all his time curating and selecting and writing captions.

[–]Goldkoron[S] 0 points1 point  (2 children)

I am relying on the fact that the AI already recognizes elements in images just fine without captioning; that's why the only features I am captioning are things it can't possibly infer on its own, like specific character names, locations from the anime, and other features it has trouble with, like the different-color whistles.

[–]gwern 0 points1 point  (1 child)

Yes, that makes sense, but if you are training only on supervised data, you are using a small n, because you are labeling everything yourself and are but one man; learning those associations may be difficult if the underlying visual representation is weak (maybe all those whistles just blur together as random jewelry), because the labels are not very informative and it will take a lot of them to give good visual quality. Learning lots of things is hard with only a few hundred images. If you used unsupervised learning here, with as large an m as you can afford (just run your GPU 24/7 on literally every frame*), then the visual representation would be much better learned, including all the fine details like the different-color whistles, and your painfully hand-annotated small n would only need to do a little bit of work in associating the text "white whistle" with the already-learned visual concept. It is not hard to learn what 'white whistle' means when the model already knows 'white' and 'whistle' in general; it is harder to learn how to draw a white whistle which looks exactly, at every scale and angle, like a Made in Abyss white whistle. The former is what text captions teach the model, but for the latter, all you need is raw images.

How important is this? Dunno. But certainly seems worth a shot if your labeling pipeline is unable to saturate your GPU resources (which seems likely to me unless you've been abusing meth).

* an episode is, what, 3000 drawings? So with 2 seasons, and ballparking the non-clipshow movie at 5 episodes' worth of drawings (probably more, given the big budget), you should easily get m = 90,000. And you say you have 60 frames per episode labeled and 8 episodes done, so n = 480, or 187x smaller. You can see the advantage of adding in unconditional training instead of confining yourself solely to text-conditional training: there are not many things where adding 187x the data won't help.
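Spelling out the footnote's arithmetic (the 25-episode count across the two seasons is a ballpark assumption):

```python
episodes = 25 + 5    # ~25 episodes over two seasons, plus the movie at ~5 episodes' worth
m = episodes * 3000  # unlabeled frames: every drawing in the show
n = 60 * 8           # labeled frames: ~60 per episode, 8 episodes captioned so far
ratio = m // n       # how many times more unlabeled data there is
```

That gives m = 90,000 against n = 480, a 187x gap between what's available and what's hand-labeled.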

[–]Freonr2 0 points1 point  (2 children)

There's an autocaption with BLIP in the everydream tools repo here:

https://github.com/victorchall/EveryDream

It uses the trainer's naming convention of simply "your caption here.ext". There's also a script to replace generic pronouns like "a man" or "a person" that you can chain to automate more of that; it just gets tricky when you have multiple subjects in frame.
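The pronoun-replacement idea could be sketched like this (the function and the generics list are my own illustration, not the actual EveryDream script):

```python
import re

GENERIC_SUBJECTS = ["a person", "a man", "a woman", "a boy", "a girl"]

def personalize(caption, name):
    """Replace the first generic subject phrase with a specific name.

    Only one substitution is made, since with multiple subjects in frame
    there's no way to know which generic phrase refers to whom.
    """
    for generic in GENERIC_SUBJECTS:
        pattern = r"\b" + re.escape(generic) + r"\b"
        if re.search(pattern, caption):
            return re.sub(pattern, name, caption, count=1)
    return caption
```

A caption with no generic subject passes through unchanged, which is the safe default for the multi-subject case.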

Deepbooru seems to be doing weird things with underscores and just tags a bunch of stuff without the sentence structure from what I can see? I'm not sure CLIP will handle those as well as a more "alt-text" like caption that has an English sentence structure along with most often adding things like "with a mountain in the background" or "in a black suit and tie". Haven't really looked that hard into the Deepbooru stuff though.

BLIP is doing some extra space magic to form the sentences and it might be possible to tweak/train the model to understand new tags and still get meaningful captions from it instead of just a plain list of tags I imagine.

[–]gwern 0 points1 point  (1 child)

Deepbooru seems to be doing weird things with underscores and just tags a bunch of stuff without the sentence structure from what I can see?

Yes, image boorus are tag-based. Just a set of tags describing an image, like a bag-of-words, with underscores for spaces.

So it doesn't come with any sentence structure by default, and it won't describe any relationships which aren't defined by a tag. (So an image captioner might write of OP, "a young girl crouching next to a young boy to the right standing on grass, low camera shot of them", but in Danbooru tag-ese, this would equate to something like 1girl 1boy crouching dutch_angle grass young_girl young_boy. Unless someone had defined a super-specific tag in advance like girl_left_of_boy, which is unlikely, the 'standing next to' part would be left untagged.)
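If you did want to feed booru tags to a text encoder expecting something closer to prose, the mechanical cleanup is trivial; a sketch:

```python
def tags_to_caption(tags):
    """deepbooru-style tags -> comma-separated caption: underscores become
    spaces, and tags are sorted so the output is deterministic."""
    return ", ".join(tag.replace("_", " ") for tag in sorted(tags))
```

This recovers readable phrases but, per the point above, not the relationships between them.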

This makes BLIP and DeepDanbooru complementary to some extent: BLIP can write more natural English captions which encode relationships, while DD can consistently and accurately flag parts of the image that a non-anime-tuned model might be confused by or ignore.

Of course, how much it matters is itself an empirical matter. Lots of image caption tools are too dumb to really encode such relationships, so they aren't much better than a bag-of-words to begin with. For example CLIP, due to the contrastive learning and small text encoder and perhaps other factors, is very bag-of-word-like already. And it apparently handles tag formatting pretty well, too, see Waifu-Diffusion. So combining BLIP captions & DD inferred tags may not be as useful as one might hope.

[–]Freonr2 0 points1 point  (0 children)

Yeah, I haven't done any anime stuff.

BLIP is fairly smart about describing artwork, video game screenshots, etc. It understands foreground and background, or things behind people ("standing in front of a window"), sky/clouds/sun/moon, sitting in chairs, standing next to each other (sometimes), stuff like that at least; it even seems to understand dragons, even if SD itself can't draw them very well. I haven't done anything with anime other than my conversations with Goldkoron and his project.

It still hilariously loves to say people have cellphones in their hands. I imagine it became schizophrenic from being trained on cropped images with captions describing out-of-frame objects...

[–]cyber-meow 0 points1 point  (2 children)

The whole point is that any general tagger or captioner does not have knowledge about a random anime, and would fall short when you want to prompt for a specific character or place, as the OP is trying to do here. That's why it should be better to train or fine-tune a tagger beforehand.

Alternatively, we can have unconditional models and separate taggers for guidance, but it seems that this approach generally has less success than incorporating it during training.

[–]JuusozArt 0 points1 point  (1 child)

I considered this at one point, and I came to the conclusion that training a tagger on a TV show or a movie really is not worth the effort. In order to train the tagger, you would need to extract the frames, tag them and then train the AI on them. That's what we needed to do for the image model to begin with!

[–]cyber-meow 0 points1 point  (0 children)

The key here is a good few-shot learning algorithm. If you have 5,000 images, the ideal situation is that you only need to tag 100 of them to train the tagger and then use it on the remaining 4,900.

[–]ohmusama 0 points1 point  (0 children)

I have deepbooru enabled on automatic's UI, but it never returns results. Any suggestions? (It's definitely installed, as it starts to do something and then just stops)

[–]pilgermann 0 points1 point  (6 children)

Very cool. Are you running locally or on a colab/server, and what sort of step load are you looking at to properly train that many images? I've only ever attempted single styles or subjects on a local 3090.

[–]Goldkoron[S] 0 points1 point  (5 children)

Locally on a 3090. Not sure what you mean by step load, but this is at 125 repeats with a sample set of 581 images and a batch size of 5. I consider 200 repeats (the number of times every image in the dataset gets trained on) to be fully trained for a model.

[–]pilgermann 1 point2 points  (4 children)

Sorry -- that's what I meant; I should have just said "how many steps". So 200 x 581, which is 116k or so. A lot of steps.
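Spelling that out, using the batch size OP mentions above (terminology varies between repos: some count optimizer steps, some count image presentations):

```python
import math

images, repeats, batch = 581, 200, 5

presentations = images * repeats             # total times any image is shown to the model
steps = math.ceil(images / batch) * repeats  # optimizer steps at this batch size
```

At batch size 5, the 116,200 image presentations collapse to 23,400 optimizer steps.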

So because this is forked from the JoePenna line of Dreambooths, I'm guessing this means it's not using all the xformers efficiencies (like here: https://github.com/TheLastBen/fast-stable-diffusion). So you must be running training like, what, 30 hoursish?

[–]Goldkoron[S] 0 points1 point  (3 children)

I don't think it is using xformers; last I heard, xformers didn't actually boost training speed at all, but that was weeks ago. As for training time with the current repo I am using, it trains about 2.3 images per second, about 14.8 hours to train 200 repeats with 581 images. It will more than double, though, when I complete the dataset.

[–]pilgermann 0 points1 point  (2 children)

Interesting about xformers (maybe it's something else speeding up these fundamentally Linux-only variants of Dreambooth?).

In any case, sounds like the victorchall repo you're using is operating more efficiently than the joepenna I've been using (given we have the same gpu and probably somewhat comparable rigs).

I am inspired to take on a similar project. Thanks for the info.

[–]Goldkoron[S] 0 points1 point  (1 child)

Yeah, I was using joepenna up until yesterday, actually. The victorchall repo seems to use a bit less VRAM at batch size 1, and since you can use higher batch sizes it can be faster overall; it uses 23GB at batch size 5. Batch size 6 is possible, but then you don't get much GPU headroom to do other things on the PC while it's training.

It's not without bugs/issues still though. I had to disable logging images (by setting frequency to 999999) because for some reason all my VRAM headroom would get eaten up after the first logging step and it doesn't give it back.

[–]pilgermann 0 points1 point  (0 children)

Good to know about the image logging. And actually, that probably explains xformers: the repos with xformers have lower VRAM overhead, so you can process more images at once, not process individual images faster.

Edit: Just so I'm not causing confusion for anyone else who stumbles on this thread: I was also running Joepenna at a low batch size and probably could have cranked it up higher, as that repo has also grown a bit more efficient since it was first released. All of the Dreambooths can run on lower-VRAM GPUs now, though some more so than others.

[–]rookan 9 points10 points  (0 children)

This image looks incredible

[–]InnoSang 8 points9 points  (2 children)

Wait, it isn't a real image from the show ?

[–]Goldkoron[S] 7 points8 points  (0 children)

Nope. It really does look like promotional art to me; I was surprised myself at how this image came out.

[–]prozacgod 0 points1 point  (0 children)

bruh, how, the kids got like 7 fingers!?

haha!

But seriously, it's damned good!

[–]ThrowawayBigD1234 1 point2 points  (0 children)

It looks just like the anime.

[–]Caffdy 1 point2 points  (0 children)

You sick fuck! /s

[–]blueSGL 1 point2 points  (1 child)

I'd be interested in this model less for the characters and more for the gorgeous landscape and creature shots.

[–]Goldkoron[S] 4 points5 points  (0 children)

That's one of the primary reasons I picked this anime. I shared some landscape shots in this post on the Made in Abyss subreddit: https://old.reddit.com/r/MadeInAbyss/comments/yjw8j4/early_look_at_made_in_abyss_ai_image_generation/

As for creatures, I am training some of the monsters that show up for more than a few seconds, but otherwise I am currently captioning brief frames of random creatures under the layer they are in.

[–]porkypuff 1 point2 points  (3 children)

can it do bondrewd yet

[–]Goldkoron[S] 4 points5 points  (1 child)

Not yet, need to get further into the series for that.

[–]porkypuff 0 points1 point  (0 children)

i wish you the best of luck, then

[–]wavymulder 4 points5 points  (0 children)

Dad of the Year

[–]remghoost7 1 point2 points  (2 children)

Super neat project. I adore the artwork of this show.

Any plans to release your model?

It'd be cool to play around with it even in an unfinished state.

[–]Goldkoron[S] 4 points5 points  (1 child)

You can download test models I upload to this folder: https://drive.google.com/drive/u/0/folders/1FxFitSdqMmR-fNrULmTpaQwKEefi4UGI. I'm currently uploading the 1-8 model, which should appear in an hour or two. It's still very unfinished, and I am now testing new captioning methods to see how they work out. If you want to follow the project more closely and ask questions about which tags to use, you can join me and my friend's small Discord for it: https://discord.gg/jcuYDNQ6

[–]remghoost7 0 points1 point  (0 children)

Very cool.

I might hop over to the discord.

Thank you!

[–]asking_for_a_friend0 0 points1 point  (1 child)

OH GOD PLZ NO

[–]MattVanAndel 3 points4 points  (0 children)

Ma ma maaaaaa!

[–]239990 0 points1 point  (2 children)

The only problem with this anime is that the characters are like 10 years old, and it's pretty creepy.

[–]NateBerukAnjing 0 points1 point  (1 child)

so people can't draw children now, ok

[–]JuusozArt 0 points1 point  (0 children)

I think it's more the fact that they are shown half-naked in some points.