Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
MAR: Autoregressive Image Generation without Vector Quantization
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Self-conditioned Image Generation via Generating Representations
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders (SSAT)
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
Generalizable Synthetic Image Detection via Language-guided Contrastive Learning
CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval
Masked Diffusion Transformer is a Strong Image Synthesizer
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
John Carmack’s ‘Different Path’ to Artificial General Intelligence
JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
Muse: Text-To-Image Generation via Masked Generative Transformers
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
PatchDropout: Economizing Vision Transformers Using Patch Dropout
CMAE: Contrastive Masked Autoencoders are Stronger Vision Learners
OmniMAE: Single Model Masked Pretraining on Images and Videos
M3AE: Multimodal Masked Autoencoders Learn Transferable Representations
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond
[Figure: 2022 MaskDistill, Table 1 — systematic comparison of masked image modeling methods by teacher, student, head, normalization, and loss function]
[Figure: 2022 Rust et al., Figure 3 — pixel reconstructions of predicted pixels of text samples over the course of training]
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
https://arxiv.org/abs/2409.16211#bytedance
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion
https://arxiv.org/abs/2305.09636#google
Masked Diffusion Transformer is a Strong Image Synthesizer
https://arxiv.org/abs/2301.07088#bytedance
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
https://arxiv.org/abs/2301.01296#microsoft
Muse: Text-To-Image Generation via Masked Generative Transformers
https://arxiv.org/abs/2301.00704#google
https://arxiv.org/abs/2212.05199#google
https://openreview.net/forum?id=wmGlMhaBe0
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
https://arxiv.org/abs/2211.09117#google
Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
https://arxiv.org/abs/2211.07636#baai
CMAE: Contrastive Masked Autoencoders are Stronger Vision Learners
https://arxiv.org/abs/2207.13532#bytedance
https://arxiv.org/abs/2207.06405#facebook
OmniMAE: Single Model Masked Pretraining on Images and Videos
https://arxiv.org/abs/2206.08356#facebook
M3AE: Multimodal Masked Autoencoders Learn Transferable Representations
https://arxiv.org/abs/2205.14204#google
https://arxiv.org/abs/2205.09113#facebook
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
https://arxiv.org/abs/2204.14217#baai
https://arxiv.org/abs/2111.09886#microsoft
https://arxiv.org/abs/2111.06377#facebook
Wikipedia Bibliography: