Building a DreamBooth

Date

September 14, 2022 → January 12, 2023

All of the images on the home page were generated by DreamBooth.

Stable Diffusion Explained
Fine-Tuning Stable Diffusion with DreamBooth
Training DreamBooth
Appendix - Stable Diffusion On-Ramp
General Deep Learning Background
Transformers and Attention
Diffusion Models

Stable Diffusion is a machine learning model that generates detailed images from text descriptions. It was developed by the CompVis group at LMU Munich and released through a collaboration of Stability AI, CompVis LMU, and Runway. The model's code and weights are publicly available, and it can run on most consumer hardware equipped with a GPU. Beyond text-to-image generation, Stable Diffusion is used for inpainting, outpainting, and image-to-image translation guided by a text prompt. It is a latent diffusion model, a type of deep generative neural network.

DreamBooth is a fine-tuning technique for existing text-to-image models. It was developed by researchers from Google Research and Boston University in 2022, originally using Google's own Imagen text-to-image model, but it can be applied to other text-to-image models as well. After training on a small number of images of a subject, the fine-tuned model can generate personalized images of that subject from text descriptions.

Stable Diffusion Explained

Stable Diffusion is a deep learning model that uses a technique called latent diffusion to generate detailed images based on text descriptions. This technique is a variant of the diffusion model, which was introduced in 2015.

High-Resolution Image Synthesis with Latent Diffusion Models

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining.

arxiv.org

The Stable Diffusion research paper by Rombach et al.

Stable Diffusion has three main components: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses an input image into a lower-dimensional latent space, capturing its fundamental semantic meaning. During forward diffusion, Gaussian noise is then iteratively added to this latent representation.

Stable Diffusion process, conditioned on arbitrary text [source: Rombach et al.]
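
The forward diffusion step can be sketched in a few lines. In the standard DDPM formulation, a noisy latent at step t can be sampled in one shot as sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, where eps is Gaussian noise. The sketch below uses only the Python standard library. The linear beta schedule (1e-4 to 0.02 over 1000 steps) is an assumed DDPM-style default and not necessarily the schedule Stable Diffusion uses, the function names are hypothetical, and the 16-element list is only a stand-in for a flattened VAE latent.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(latent, t, alpha_bars, rng):
    """Sample z_t from z_0 directly: sqrt(a_bar) * z_0 + sqrt(1 - a_bar) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * z + sigma * e for z, e in zip(latent, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule())
latent = [1.0] * 16  # stand-in for a flattened VAE latent (illustrative only)
early, _ = add_noise(latent, 10, alpha_bars, rng)    # still mostly signal
late, _ = add_noise(latent, 999, alpha_bars, rng)    # almost pure noise
```

Because alpha_bar shrinks towards zero as t grows, the early sample stays close to the original latent while the final one is dominated by noise, which is what the U-Net later learns to undo.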

The U-Net, which is built on a ResNet backbone, then iteratively denoises the output of the forward diffusion process to recover a latent representation of the image. Finally, the VAE decoder converts this representation back into pixel space, producing the final image.

The denoising step in this process can be flexibly conditioned on various types of input data, such as text or images, using a cross-attention mechanism. The model is trained on pairs of images and captions from the LAION-5B dataset, which contains billions of image-text pairs scraped from the web. The trained model can then be used to generate images based on text prompts, or to perform other tasks such as image inpainting or translation.
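
To make the cross-attention idea concrete, here is a minimal standard-library sketch of scaled dot-product attention in which queries come from image latent positions and keys and values come from text-token embeddings. It is a toy illustration under stated assumptions: real models apply learned projection matrices to produce the queries, keys, and values, which are omitted here, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and mixes the value rows."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 2 latent positions attend over 3 text-token embeddings.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
attended, weights = cross_attention(latent_queries, text_keys, text_values)
```

Each row of weights sums to one, so every latent position receives a weighted blend of the text-token values. This is how the prompt can steer each denoising step.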

Fine-Tuning Stable Diffusion with DreamBooth

DreamBooth is a fine-tuning technique for existing text-to-image models that lets them generate more personalized and specific outputs. It was developed by researchers at Google Research and Boston University in 2022.

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts.

arxiv.org

The DreamBooth research paper by Ruiz et al.

To implement DreamBooth, a small set of images depicting a specific subject is used to fine-tune a pretrained text-to-image model. This usually involves three to five images, paired with text prompts that contain the name of the class the subject belongs to and a unique identifier. For example, if the subject is a person, the text prompt might be "person - Rukmal".
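
The pairing of a unique identifier with a class name can be sketched as a small helper. Ruiz et al. describe prompts of the form "a [identifier] [class noun]". The "a photo of ..." template and the "sks" identifier below are assumptions borrowed from common community examples, and build_prompts is a hypothetical name, not part of any library.

```python
def build_prompts(identifier, class_noun, template="a photo of {subject}"):
    """Return (instance_prompt, class_prompt) for DreamBooth-style training.

    The instance prompt names the specific subject via a rare identifier;
    the class prompt names only the generic class and is used for
    prior-preservation images.
    """
    instance_prompt = template.format(subject=f"{identifier} {class_noun}")
    class_prompt = template.format(subject=class_noun)
    return instance_prompt, class_prompt

# "sks" is an assumed placeholder identifier; any rare token could be used.
instance_prompt, class_prompt = build_prompts("sks", "person")
```

Keeping the class noun in the instance prompt lets the model reuse what it already knows about the class, while the identifier gives it a fresh handle for the specific subject.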

A class-specific prior preservation loss is applied during training, which encourages the model to keep generating diverse instances of the subject's class based on what it has already learned about that class. In the original Imagen-based setup, pairs of low-resolution and high-resolution images are also used to fine-tune the super-resolution components, so that fine details of the subject are maintained in the generated images.

The fine-tuning process used in DreamBooth [source: Ruiz et al.]
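
The prior-preservation idea can be written as a two-term loss: one term on the subject images and a weighted term on images of the generic class. The sketch below expresses it in the noise-prediction form commonly used with Stable Diffusion rather than the exact formulation in Ruiz et al. The function names, the prior_weight parameter, and the toy numbers are assumptions for illustration only.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def dreambooth_loss(instance_pred, instance_noise, class_pred, class_noise,
                    prior_weight=1.0):
    """Instance loss plus a weighted class-prior loss.

    instance_*: predicted vs. true noise for the subject images.
    class_*:    predicted vs. true noise for generic class images.
    """
    instance_loss = mse(instance_pred, instance_noise)
    prior_loss = mse(class_pred, class_noise)
    return instance_loss + prior_weight * prior_loss

# Toy numbers only, to show how the two terms combine.
example_loss = dreambooth_loss([0.5, -0.5], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
no_prior_loss = dreambooth_loss([0.5, -0.5], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0],
                                prior_weight=0.0)
```

Setting prior_weight to zero recovers plain fine-tuning on the subject images. A positive weight penalizes the model for forgetting how to render the generic class.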

Once fine-tuned, the model can be used to generate more specific and personalized images based on text prompts. For example, it could be used to generate images of specific individuals, or to render known subjects in different contexts and situations. This can be useful for a variety of applications, but is generally too computationally intensive for hobbyist users to implement.

Training DreamBooth

My implementation of DreamBooth was a fine-tuned version of Stable Diffusion v1-5. It was trained on a p3.8xlarge EC2 instance on AWS. Once the (still hacky) script was done, training took about 20 minutes, with a total of 18 training images. The authors recommended using 3-5 images, but I found that using around 20, combined with a lower learning rate, yielded better results.

Following recommendations by Ruiz et al., both the diffusion U-Net and the text encoder were fine-tuned with the training images. Additionally, prior preservation was used to avoid overfitting and language drift in the final model, as recommended by the original authors.
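
For readers who want to reproduce a similar setup without a custom script, the choices above map onto flags of the DreamBooth example script shipped with 🧨 Diffusers (train_dreambooth.py). The sketch below only assembles an argument list. It is not the script used for this project, and every value in it (model id, prompts, learning rate, step count, directories) is an illustrative placeholder rather than a setting reported here. dreambooth_args is a hypothetical helper name.

```python
def dreambooth_args(instance_dir, class_dir, output_dir,
                    identifier="sks", class_noun="person",
                    learning_rate=1e-6, max_train_steps=1000):
    """Build CLI arguments for the Diffusers DreamBooth example script.

    All defaults are placeholders chosen for illustration, not tuned values.
    """
    return [
        "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
        "--instance_data_dir", instance_dir,   # photos of the subject
        "--class_data_dir", class_dir,         # generic class images
        "--output_dir", output_dir,
        "--instance_prompt", f"a photo of {identifier} {class_noun}",
        "--class_prompt", f"a photo of {class_noun}",
        "--with_prior_preservation",           # enable the class-prior loss term
        "--prior_loss_weight", "1.0",
        "--train_text_encoder",                # fine-tune the text encoder as well as the U-Net
        "--learning_rate", str(learning_rate),
        "--max_train_steps", str(max_train_steps),
    ]

# Hypothetical directories, shown only to demonstrate the call.
args = dreambooth_args("data/instance", "data/class", "out/dreambooth")
```

The resulting list would typically be appended to an `accelerate launch train_dreambooth.py` command. Check the flag names against the version of the script you have installed, since the example scripts change over time.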

Appendix - Stable Diffusion On-Ramp

This section contains a list of resources that were extremely helpful in building the background knowledge necessary to build DreamBooth.

General Deep Learning Background

Andrej Karpathy’s series of lectures on Neural Networks, Backpropagation, and LLMs (Neural Networks: Zero to Hero)

The spelled-out intro to neural networks and backpropagation: building micrograd

This is the most step-by-step spelled-out explanation of backpropagation and training of neural networks. It only assumes basic knowledge of Python and a vague recollection of calculus from high school.

www.youtube.com

Deep Learning Basics - Introduction and Overview

Deep Learning Basics: Introduction and Overview

An introductory lecture for MIT course 6.S094 on the basics of deep learning including a few key ideas, subfields, and the big picture of why neural networks have inspired and energized an entire new generation of researchers.

www.youtube.com

Introduction to HuggingFace primitives, NLP models, and using 🤗 Diffusers

Introduction - Hugging Face Course

This course will teach you about natural language processing (NLP) using libraries from the Hugging Face ecosystem - 🤗 Transformers, 🤗 Datasets, 🤗 Tokenizers, and 🤗 Accelerate - as well as the Hugging Face Hub. It's completely free and without ads.

huggingface.co

Transformers and Attention

Transfer Learning and Transformer Models (Machine Learning Tech Talk from Google Research)

Transfer learning and Transformer models (ML Tech Talks)

In this session of Machine Learning Tech Talks, Software Engineer from Google Research, Iulia Turc, will walk us through the recent history of natural language processing, including the current state of the art architecture, the Transformer.

www.youtube.com

Stanford CS224N Lecture 14 - Transformers and Self-Attention

Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 - Transformers and Self-Attention

Professor Christopher Manning, Stanford University; Ashish Vaswani & Anna Huang, Google. For more information about Stanford's Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3niIw41

www.youtube.com

University of Waterloo CS480/680 Lecture 19: Attention and Transformer Networks

CS480/680 Lecture 19: Attention and Transformer Networks

Uploaded by Pascal Poupart on 2019-07-16.

www.youtube.com

DETR: End-to-End Object Detection with Transformers Paper Explained by Yannic Kilcher

DETR: End-to-End Object Detection with Transformers (Paper Explained)

Object detection in images is a notoriously hard task! Objects can be of a wide variety of classes, can be numerous or absent, they can occlude each other or be out of frame. All of this makes it even more surprising that the architecture in this paper is so simple.

www.youtube.com

Attention is All You Need Paper Explained by Halfling Wizard

Attention Is All You Need - Paper Explained

In this video, I'll try to present a comprehensive study on Ashish Vaswani and his coauthors' renowned paper, "Attention Is All You Need". This paper is a major turning point in deep learning research. The transformer architecture, which was introduced in this paper, is now used in a variety of state-of-the-art models in natural language processing and beyond.

www.youtube.com

Diffusion Models

Stable Diffusion with 🧨 Diffusers by Patil et al.

Stable Diffusion with 🧨 Diffusers

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists.

huggingface.co

HuggingFace 🧨 Diffusers library explanation notebook

Google Colaboratory

colab.research.google.com

The Illustrated Stable Diffusion by Jay Alammar

The Illustrated Stable Diffusion

(V2 Nov 2022: Updated images for more precise description of forward diffusion thanks to Jeremy and Hamel. A few more images in this version) AI image generation is the most recent AI capability blowing people's minds (mine included).

jalammar.github.io

The Annotated Diffusion Model by Rogge et al.

The Annotated Diffusion Model

In this blog post, we'll take a deeper look into Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models or simply autoencoders) as researchers have been able to achieve remarkable results with them for (un)conditional image/audio/video generation.

huggingface.co

What are Diffusion Models? By Lilian Weng

What are Diffusion Models?

[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references).] [Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen.] [Updated on 2022-08-31: Added latent diffusion model.] So far, I've written about three types of generative models, GAN, VAE, and Flow-based models.

lilianweng.github.io

fast.ai Practical Deep Learning for Coders - Part 2, Lecture 9: Deep Learning Foundations to Stable Diffusion

Lesson 9: Deep Learning Foundations to Stable Diffusion, 2022

Lesson 10: https://youtu.be/6StU6UtZEbU Lesson 9A (Deep dive): https://youtu.be/0_BBRNYInx8 Lesson 9B (Math of diffusion): https://youtu.be/mYpjmM7O-30 This lesson starts with a tutorial on how to use pipelines in the Diffusers library to generate images. Diffusers is (in our opinion!) the best library available at the moment for image generation.

www.youtube.com

Training Stable Diffusion with DreamBooth by Suraj Patil

Training Stable Diffusion with Dreambooth

Dreambooth is a technique to teach new concepts to Stable Diffusion using a specialized form of fine-tuning. Some people have been using it with a few of their photos to place themselves in fantastic situations, while others are using it to incorporate new styles. 🧨 Diffusers provides a Dreambooth training script, but some users reported that it was hard to get great results.

wandb.ai

Training Stable Diffusion with DreamBooth using 🧨 Diffusers by Patil et al.

Training Stable Diffusion with Dreambooth using Diffusers

Dreambooth is a technique to teach new concepts to Stable Diffusion using a specialized form of fine-tuning. Some people have been using it with a few of their photos to place themselves in fantastic situations, while others are using it to incorporate new styles. 🧨 Diffusers provides a Dreambooth training script.

huggingface.co