Bibliography:

  1. ‘autoencoder NN’ tag

  2. Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

  3. MaskBit: Embedding-free Image Generation via Bit Tokens

  4. Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

  5. Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

  6. MAR: Autoregressive Image Generation without Vector Quantization

  7. SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

  8. σ-GPTs: A New Approach to Autoregressive Models

  9. Rethinking Patch Dependence for Masked Autoencoders

  10. Rich Human Feedback for Text-to-Image Generation

  11. Self-conditioned Image Generation via Generating Representations

  12. Rethinking FID: Towards a Better Evaluation Metric for Image Generation

  13. Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

  14. Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders (SSAT)

  15. Vision Transformers Need Registers

  16. Diffusion Models Beat GANs on Image Classification

  17. Test-Time Training on Video Streams

  18. Rosetta Neurons: Mining the Common Units in a Model Zoo

  19. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

  20. Generalizable Synthetic Image Detection via Language-guided Contrastive Learning

  21. SoundStorm: Efficient Parallel Audio Generation

  22. A Cookbook of Self-Supervised Learning

  23. CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval

  24. Masked Diffusion Transformer is a Strong Image Synthesizer

  25. PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

  26. John Carmack’s ‘Different Path’ to Artificial General Intelligence

  27. JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

  28. MUG: Vision Learners Meet Web Image-Text Pairs

  29. TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models

  30. Muse: Text-To-Image Generation via Masked Generative Transformers

  31. MAGVIT: Masked Generative Video Transformer

  32. Scaling Language-Image Pre-training via Masking

  33. MaskDistill: A Unified View of Masked Image Modeling

  34. MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

  35. Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

  36. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

  37. Exploring Long-Sequence Masked Autoencoders

  38. TVLT: Textless Vision-Language Transformer

  39. Test-Time Training with Masked Autoencoders

  40. PatchDropout: Economizing Vision Transformers Using Patch Dropout

  41. CMAE: Contrastive Masked Autoencoders are Stronger Vision Learners

  42. PIXEL: Language Modeling with Pixels

  43. Masked Autoencoders that Listen

  44. OmniMAE: Single Model Masked Pretraining on Images and Videos

  45. M3AE: Multimodal Masked Autoencoders Learn Transferable Representations

  46. Masked Autoencoders As Spatiotemporal Learners

  47. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

  48. Should You Mask 15% in Masked Language Modeling?

  49. MaskGIT: Masked Generative Image Transformer

  50. SimMIM: A Simple Framework for Masked Image Modeling

  51. MAE: Masked Autoencoders Are Scalable Vision Learners

  52. Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond

  53. https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/

  54. https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/

  55. https://github.com/facebookresearch/jepa

  56. https://laion.ai/blog/paella/