all 48 comments

[–]Goldkoron[S] 19 points20 points  (28 children)

Training using https://github.com/victorchall/EveryDream-trainer which is sort of a deviation from dreambooth and somewhat of a hybrid between traditional training and dreambooth. I am using a large dataset consisting of frames from the anime and manually captioned hundreds of images.

For generating this image I generated a txt2img at 640x640, SD upscaled to 1280x1280, then inpainted both characters at full resolution to improve their overall accuracy and quality.

This test model is still only 60% trained and my dataset only covers episodes 1-8 so far, still working on testing different caption methods and expanding dataset to rest of the anime. The goal is to be able to generate any named character from the anime, and most of the locations.

[–]cyber-meow 5 points6 points  (23 children)

Great work. I think it would be nice if someone could build a general workflow to train from any anime. Of course it would not be as good as the ones with a lot of human intervention such as proper captioning and selection of the best frames, but many things can still be automated, such as character classification, tagging, auto-cropping, etc. It is good to know that someone is getting promising results when training from anime frames.

[–]Goldkoron[S] 6 points7 points  (22 children)

My friend uses something on his end to automatically extract frames of an anime for me, skip duplicates, and auto-crop and resize to 512x512. As for automating selection of the best frames, I don't know how to do that. I manually select about 60-80 frames per episode and then have to manually caption them all with character names, locations, and certain features like outfits, etc.
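For the duplicate-skipping step, a perceptual hash works well. Here's a minimal pure-Python sketch of the idea, assuming frames have already been downscaled to 8x8 grayscale grids (a real pipeline would more likely use a library like `imagehash` on the full frames, but the logic is the same):

```python
def average_hash(grid):
    """64-bit perceptual hash of an 8x8 grayscale grid (8 rows of 8 ints)."""
    flat = [p for row in grid for p in row]
    avg = sum(flat) / len(flat)
    return tuple(1 if p > avg else 0 for p in flat)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def skip_duplicates(frames, threshold=5):
    """Keep a frame only if its hash is far enough from every kept frame's."""
    kept, hashes = [], []
    for frame in frames:
        h = average_hash(frame)
        if all(hamming(h, kh) > threshold for kh in hashes):
            kept.append(frame)
            hashes.append(h)
    return kept
```

Consecutive near-identical frames (common in anime, where characters hold still for many frames) hash to nearly the same bits and get dropped.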

[–]cyber-meow 1 point2 points  (14 children)

I think the latter part is what can be partially solved in a fairly easy way by training a proper tagger beforehand. Yet selecting good images is a subjective question. Selecting diverse images that satisfy a certain criterion should be simpler but still difficult.

[–]MuskelMagier 0 points1 point  (13 children)

For anime there is already a good tagger with deepbooru

[–]gwern 1 point2 points  (8 children)

One could also generate descriptions with a good image captioner like BLIP. That's how the Pokemon & Naruto DB models were done.

EDIT: Also remember, you don't need text+image paired data to finetune SD. SD runs in unconditional mode just fine. You can just train on a large dump of unlabeled images. Or if the text captions are low quality in a large dataset, delete them. Or train with a large dump + small labeled dataset, if the built-in text understanding is inadequate. Or...
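A rough sketch of that mixed setup (all names here are illustrative, not from any particular trainer): give unlabeled images an empty caption, and optionally drop a fraction of the labeled captions too, so the unconditional pathway keeps getting trained:

```python
import random

def build_training_list(labeled, unlabeled, drop_prob=0.1, seed=0):
    """Combine (image, caption) pairs with caption-less images.

    Unlabeled images get an empty caption (pure unconditional training);
    labeled captions are also dropped with probability drop_prob so the
    model never forgets how to generate without text guidance.
    """
    rng = random.Random(seed)
    examples = [(img, "" if rng.random() < drop_prob else cap)
                for img, cap in labeled]
    examples += [(img, "") for img in unlabeled]
    rng.shuffle(examples)
    return examples
```

The same empty-caption trick is what makes classifier-free guidance possible at sampling time, so it pulls double duty.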

[–]MuskelMagier 1 point2 points  (4 children)

I ran the pic above through both, and deepbooru is far more detailed and accurate, especially because it is already trained on anime.

[–]gwern 1 point2 points  (3 children)

I'm sure, but that doesn't mean it's better. Waifu-Diffusion (which I assume OP is using although he doesn't mention it) is already quite knowledgeable in general, the point is to shovel as much Made in Abyss imagery through it as possible to capture all the possible scenes & styles & objects. The tags/descriptions aren't really that important. In fact, OP should probably be spending more compute on just training unconditional WD on random frames, rather than spending all his time curating and selecting and writing captions.

[–]Goldkoron[S] 0 points1 point  (2 children)

I am relying on the fact that the AI already recognizes elements in images just fine without captioning; that's why the only features I am captioning are things it can't possibly infer on its own, like specific character names, locations from the anime, and other features it has trouble with, like the different-color whistles.

[–]gwern 0 points1 point  (1 child)

Yes, that makes sense, but if you are training only on supervised data, you are using a small n, because you are labeling everything yourself and are but one man; learning those associations may be difficult if the underlying visual representation is weak (maybe all those whistles just blur together as random jewelry), because the labels are not very informative and it will take a lot of them to give good visual quality. Learning lots of things is hard with only a few hundred images. If you used unsupervised learning here, with as large an m as you can afford (just run your GPU 24/7 on literally every frame*), then the visual representation would be much better learned, including all the fine details like the different-color whistles, and your painfully hand-annotated small n would only need to do a little bit of work in associating the text "white whistle" with the already-learned visual concept. It is not hard to learn what 'white whistle' means when the model already knows 'white' and 'whistle' in general; it is harder to learn how to draw a white whistle which looks exactly, at every scale and angle, like a Made in Abyss white whistle. The former is what text captions teach the model, but for the latter, all you need is raw images.

How important is this? Dunno. But certainly seems worth a shot if your labeling pipeline is unable to saturate your GPU resources (which seems likely to me unless you've been abusing meth).

* an episode is, what, 3000 drawings? So with 2 seasons, and ballparking the non-clipshow movie at 5 episodes' worth of drawings (probably more, given the big budget), you should easily get m = 90,000. And you say you have 60 frames per episode labeled and 8 episodes done, so n = 480, or 187x smaller. You can see the advantage of adding in unconditional training instead of confining yourself solely to text-conditional training: there are not many things where adding 187x the data won't help.
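Spelling out the footnote's arithmetic (the 25-episode count across the two seasons is a ballpark assumption):

```python
episodes = 25 + 5    # ~25 episodes over two seasons, plus the movie at ~5 episodes' worth
m = episodes * 3000  # unlabeled frames: every drawing in the show
n = 60 * 8           # labeled frames: ~60 per episode, 8 episodes captioned so far
ratio = m // n       # how many times more unlabeled data there is
```

That gives m = 90,000 against n = 480, a 187x gap between what's available and what's hand-labeled.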

[–]Freonr2 0 points1 point  (2 children)

There's an autocaption with BLIP in the everydream tools repo here:

https://github.com/victorchall/EveryDream

It uses the trainer's naming convention of simply "your caption here.ext". There's also a script to replace generic pronouns like "a man" or "a person" that you can chain to automate more of that; it just gets tricky when you have multiple subjects in frame.
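The pronoun-replacement idea could be sketched like this (the function and the generics list are my own illustration, not the actual EveryDream script):

```python
import re

GENERIC_SUBJECTS = ["a person", "a man", "a woman", "a boy", "a girl"]

def personalize(caption, name):
    """Replace the first generic subject phrase with a specific name.

    Only one substitution is made, since with multiple subjects in frame
    there's no way to know which generic phrase refers to whom.
    """
    for generic in GENERIC_SUBJECTS:
        pattern = r"\b" + re.escape(generic) + r"\b"
        if re.search(pattern, caption):
            return re.sub(pattern, name, caption, count=1)
    return caption
```

A caption with no generic subject passes through unchanged, which is the safe default for the multi-subject case.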

Deepbooru seems to be doing weird things with underscores and just tags a bunch of stuff without the sentence structure from what I can see? I'm not sure CLIP will handle those as well as a more "alt-text" like caption that has an English sentence structure along with most often adding things like "with a mountain in the background" or "in a black suit and tie". Haven't really looked that hard into the Deepbooru stuff though.

BLIP is doing some extra space magic to form the sentences and it might be possible to tweak/train the model to understand new tags and still get meaningful captions from it instead of just a plain list of tags I imagine.

[–]gwern 0 points1 point  (1 child)

Deepbooru seems to be doing weird things with underscores and just tags a bunch of stuff without the sentence structure from what I can see?

Yes, image boorus are tag-based. Just a set of tags describing an image, like a bag-of-words, with underscores for spaces.

So it doesn't come with any sentence structure by default, and it won't describe any relationships which aren't defined by a tag. (So an image captioner might write of OP, "a young girl crouching next to a young boy to the right standing on grass, low camera shot of them", but in Danbooru tag-ese, this would equate to something like 1girl 1boy crouching dutch_angle grass young_girl young_boy. Unless someone had defined a super-specific tag in advance like girl_left_of_boy, which is unlikely, the 'standing next to' part would be left untagged.)
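If you did want to feed booru tags to a text encoder expecting something closer to prose, the mechanical cleanup is trivial; a sketch:

```python
def tags_to_caption(tags):
    """deepbooru-style tags -> comma-separated caption: underscores become
    spaces, and tags are sorted so the output is deterministic."""
    return ", ".join(tag.replace("_", " ") for tag in sorted(tags))
```

This recovers readable phrases but, per the point above, not the relationships between them.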

This makes BLIP and DeepDanbooru complementary to some extent: BLIP can write more natural English captions which encode relationships, while DD can consistently and accurately flag parts of the image that a non-anime-tuned model might be confused by or ignore.

Of course, how much it matters is itself an empirical matter. Lots of image caption tools are too dumb to really encode such relationships, so they aren't much better than a bag-of-words to begin with. For example CLIP, due to the contrastive learning and small text encoder and perhaps other factors, is very bag-of-word-like already. And it apparently handles tag formatting pretty well, too, see Waifu-Diffusion. So combining BLIP captions & DD inferred tags may not be as useful as one might hope.

[–]Freonr2 0 points1 point  (0 children)

Yeah, I haven't done any anime stuff.

BLIP is fairly smart about describing artwork, video game screenshots, etc. It understands foreground and background, or things behind people ("standing in front of a window"), sky/clouds/sun/moon, sitting in chairs, standing next to each other (sometimes), stuff like that at least; it even seems to understand dragons, even if SD itself can't draw them very well. I haven't done anything with anime other than my conversations with Goldkoron and his project.

It still hilariously loves to say people have cellphones in their hands. I imagine it became schizophrenic from being trained on cropped images with captions describing out-of-frame objects...

[–]cyber-meow 0 points1 point  (2 children)

The whole point is that any general tagger or captioner does not have knowledge about a random anime, and would fall short when you want to prompt for a specific character or place, as the OP is trying to do here. That's why it should be better to train or fine-tune a tagger beforehand.

Alternatively, we can have unconditional models and separate taggers for guidance, but it seems that this approach generally has less success than incorporating it during training.

[–]JuusozArt 0 points1 point  (1 child)

I considered this at one point, and I came to the conclusion that training a tagger on a TV show or a movie really is not worth the effort. In order to train the tagger, you would need to extract the frames, tag them and then train the AI on them. That's what we needed to do for the image model to begin with!

[–]cyber-meow 0 points1 point  (0 children)

The key here is a good few-shot learning algorithm. If you have 5,000 images, the ideal situation is that you only need to tag 100 of them to train the tagger and then use it on the remaining 4,900.

[–]ohmusama 0 points1 point  (0 children)

I have deepbooru enabled on automatic's UI, but it never returns results. Any suggestions? (It's definitely installed, as it starts to do something and then just stops)

[–]pilgermann 0 points1 point  (6 children)

Very cool. Are you running locally or on a colab/server, and what sort of step load are you looking at to properly train that many images? I've only ever attempted single styles or subjects on a local 3090.

[–]Goldkoron[S] 0 points1 point  (5 children)

Locally on a 3090. Not sure what you mean by step load, but this is at 125 repeats with a sample set of 581 images and a batch size of 5. I consider 200 repeats (the number of times every image in the dataset gets trained on) to be fully trained for a model.

[–]pilgermann 1 point2 points  (4 children)

Sorry -- that's what I meant; I should have just said "how many steps". So 200 x 581, which is 116k or so. A lot of steps.
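Spelling that out, using the batch size OP mentions above (terminology varies between repos: some count optimizer steps, some count image presentations):

```python
import math

images, repeats, batch = 581, 200, 5

presentations = images * repeats             # total times any image is shown to the model
steps = math.ceil(images / batch) * repeats  # optimizer steps at this batch size
```

At batch size 5, the 116,200 image presentations collapse to 23,400 optimizer steps.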

So because this is forked from the JoePenna line of Dreambooths, I'm guessing this means it's not using all the xformers efficiencies (like here: https://github.com/TheLastBen/fast-stable-diffusion). So you must be running training like, what, 30 hoursish?

[–]Goldkoron[S] 0 points1 point  (3 children)

I don't think it is using xformers; last I heard, xformers didn't actually boost training speed at all, but that was weeks ago. As for training time with the current repo I am using, it trains about 2.3 images per second, about 14.8 hours to train 200 repeats with 581 images. It will more than double, though, when I complete the dataset.

[–]pilgermann 0 points1 point  (2 children)

Interesting about xformers (maybe it's something else speeding up these fundamentally Linux-only variants of Dreambooth?).

In any case, sounds like the victorchall repo you're using is operating more efficiently than the joepenna I've been using (given we have the same gpu and probably somewhat comparable rigs).

I am inspired to take on a similar project. Thanks for the info.

[–]Goldkoron[S] 0 points1 point  (1 child)

Yeah, I was using joepenna up until yesterday, actually. The victorchall repo seems to use a bit less VRAM at batch size 1, and since you can use higher batch sizes it can be faster overall; it uses 23GB at batch size 5. Batch size 6 is possible, but then you don't get much GPU headroom to do other things on the PC while it's training.

It's not without bugs/issues still though. I had to disable logging images (by setting frequency to 999999) because for some reason all my VRAM headroom would get eaten up after the first logging step and it doesn't give it back.

[–]pilgermann 0 points1 point  (0 children)

Good to know about the image logging. And actually, that probably explains xformers: the repos with xformers have lower VRAM overhead, so you can process more images at once, not process individual images faster.

Edit: Just so I'm not causing confusion for anyone else who stumbles on this thread: I was also running Joepenna at a low batch size and probably could have cranked it up higher, as that repo has also grown a bit more efficient since it was first released. All of the Dreambooths can run on lower-VRAM GPUs now, though some more so than others.

[–]rookan 9 points10 points  (0 children)

This image looks incredible

[–]InnoSang 8 points9 points  (2 children)

Wait, it isn't a real image from the show ?

[–]Goldkoron[S] 7 points8 points  (0 children)

Nope. It really does look like promotional art to me; I was surprised myself at how this image came out.

[–]prozacgod 0 points1 point  (0 children)

bruh, how, the kids got like 7 fingers!?

haha!

But seriously, it's damned good!

[–]ThrowawayBigD1234 1 point2 points  (0 children)

It looks just like the anime.

[–]Caffdy 1 point2 points  (0 children)

You sick fuck! /s

[–]blueSGL 1 point2 points  (1 child)

I'd be interested in this model less for the characters and more for the gorgeous landscape and creature shots.

[–]Goldkoron[S] 4 points5 points  (0 children)

That's one of the primary reasons I picked this anime. I shared some landscape shots in this post on the Made in Abyss subreddit: https://old.reddit.com/r/MadeInAbyss/comments/yjw8j4/early_look_at_made_in_abyss_ai_image_generation/

As for creatures, I am training some of the monsters that show up for more than a few seconds, but otherwise I am currently captioning brief frames of random creatures under the layer they are in.

[–]porkypuff 1 point2 points  (3 children)

can it do bondrewd yet

[–]Goldkoron[S] 4 points5 points  (1 child)

Not yet, need to get further into the series for that.

[–]porkypuff 0 points1 point  (0 children)

i wish you the best of luck, then

[–]wavymulder 4 points5 points  (0 children)

Dad of the Year

[–]remghoost7 1 point2 points  (2 children)

Super neat project. I adore the artwork of this show.

Any plans to release your model?

It'd be cool to play around with it even in an unfinished state.

[–]Goldkoron[S] 4 points5 points  (1 child)

You can download test models I upload to this folder: https://drive.google.com/drive/u/0/folders/1FxFitSdqMmR-fNrULmTpaQwKEefi4UGI. I'm currently uploading the 1-8 model, which should appear in an hour or two. It's still very unfinished, and I am now testing new captioning methods to see how they work out. If you want to follow the project more closely and ask questions about which tags to use, you can join me and my friend's small Discord for it: https://discord.gg/jcuYDNQ6

[–]remghoost7 0 points1 point  (0 children)

Very cool.

I might hop over to the discord.

Thank you!

[–]asking_for_a_friend0 0 points1 point  (1 child)

OH GOD PLZ NO

[–]MattVanAndel 3 points4 points  (0 children)

Ma ma maaaaaa!

[–]239990 0 points1 point  (2 children)

The only problem with this anime is that the characters are like 10 years old, and it's pretty creepy.

[–]NateBerukAnjing 0 points1 point  (1 child)

so people can't draw children now, ok

[–]JuusozArt 0 points1 point  (0 children)

I think it's more the fact that they are shown half-naked in some points.