Yeah. Regarding Dall-E 3 specifically: Few people know that there was an unnamed model between Dall-E 2 and 3: “Dall-E 2.5”, as I like to call it. It was initially used in Microsoft’s Bing Image Creator. It often produced surprisingly good aesthetics, especially with very short or unspecific prompts.
Then it got replaced with Dall-E 3, which produces substantially fewer visual errors (e.g. too many limbs) and is much better at following complex prompts, but its style is also much worse than 2.5’s. Like apparently most other text-to-image models, Dall-E 3 largely produces tacky, aesthetically worthless kitsch. RIP Dall-E 2.5, which is now completely unavailable. :(
A few comparisons using old images I did a while ago:
Somewhat relatedly: when I started this post I planned to argue you should use Midjourney instead of Dall-E, but when I went to test it I found that Midjourney had become more generic in some way that was hard to place. I think it’s still better than default Dall-E, but not a slam dunk.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
In the case of Dall-E, the history was something like this:
Dall-E 1/2: No real style; generations presumably looked like an average prediction from the training sample, e.g. like a result from Google Images.
Dall-E 2.5: Mostly good aesthetics, e.g. portraits tend to have dramatic lighting and contrast.
Dall-E 3: Very tacky aesthetics, probably not intentional but a side effect of something else.
So I guess the bad style would in general be a mostly solvable problem (or one which could be weighed against other metrics), if the responsible people are even aware there is a problem. Which they might not be, given that they probably have a background in CS rather than art.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
Also, probably a lot of it is just mode collapse from simple preference-learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image-generation gains are illusory, and caused simply by a mode-collapse down onto a few well-rated points:
Our experiments suggest that realism and consistency can both be improved simultaneously; however, there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing the representation diversity.
Same problem as tuning LLMs. It’s a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it’s free, but in reality you’ve paid for every ‘gain’.
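A toy numerical sketch of that tradeoff (my own illustration, not from the paper or the thread): tilt a diverse base distribution over “styles” toward average rater scores, and the expected score climbs while the entropy (diversity) collapses.

```python
# Toy illustration: naively optimizing for average rater preference collapses
# diversity. We tilt a uniform base distribution over image "styles" toward
# rater scores; stronger tilting (heavier preference optimization) raises the
# mean score but drives entropy toward zero.
import numpy as np

rng = np.random.default_rng(0)
n_styles = 50
base = np.full(n_styles, 1.0 / n_styles)   # diverse base model
scores = rng.normal(size=n_styles)         # average rater score per style
scores[scores.argmax()] += 1.0             # one "safe, colorful" crowd-pleaser

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

for beta in [0.0, 1.0, 4.0, 16.0]:         # strength of the preference tilt
    tilted = base * np.exp(beta * scores)
    tilted /= tilted.sum()
    print(f"beta={beta:5.1f}  mean score={float(tilted @ scores):+.2f}  "
          f"entropy={entropy(tilted):.2f} (max {np.log(n_styles):.2f})")
```

The tilted distribution p(x) ∝ p₀(x)·exp(β·r(x)) is the closed-form optimum of KL-regularized reward maximization, so the collapse in the toy isn’t an artifact of the setup.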
I found that Midjourney had become more generic in some way that was hard to place.
What you can try doing is enabling the personalization (or using mine) to drag it away from the generic MJ look, and then experimenting with the chaos sampling option to find something more interesting you can then work with & vary.
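For concreteness, a made-up example of what that looks like in Discord, with --p enabling a personalization profile and --chaos (0–100, default 0) controlling how much the four initial candidates vary:

```
/imagine prompt: ink-and-wash illustration of a lighthouse in fog --p --chaos 40
```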
It’s sad that open-source models like Flux have a lot of potential for customized workflows and fine-tuning, but few people use them.
We’ve talked (a little) about integrating Flux more into LW, to make it easier to make good images (maybe with a soft nudge towards using “LessWrong watercolor style” by default if you don’t specify something else).
Although something habryka brought up is that a lot of people’s images seem to be coming from Substack, which has its own (bad) version of it.
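As a rough sketch of how little code a basic Flux workflow needs (model choice and settings are illustrative; assumes the Hugging Face diffusers library and a GPU):

```python
# A minimal sketch of a Flux text-to-image call via Hugging Face diffusers.
# Assumes `pip install torch diffusers transformers accelerate sentencepiece`
# and a GPU with enough VRAM; the prompt and settings are just examples.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",  # the lighter, few-step checkpoint
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a quiet watercolor landscape, muted palette, soft light",
    num_inference_steps=4,  # schnell is distilled for very few steps
    guidance_scale=0.0,     # schnell is meant to run without CFG
    height=768,
    width=768,
).images[0]

image.save("flux_watercolor.png")
```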
I’m told (by the ‘simple parameters’ section of this guide, which I haven’t had the opportunity to test but which to my layperson’s eye seems promisingly mechanistic in approach) that setting the stylize parameter lower than its default of 100 turns down the Midjourney house-style effect (at the cost of sometimes making things more collage-y and incoherent as values get lower), and that increasing the weird parameter above its default of 0 will push things away from the default style (more or less).
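For example, a hypothetical prompt that turns the house style down and the weirdness up from those defaults:

```
/imagine prompt: woodcut print of a crow on a telephone wire --stylize 50 --weird 250
```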
Beautiful, thanks for sharing.