minDALL-E on Conceptual Captions
minDALL-E, named after minGPT, is a 1.3B text-to-image generation model trained on 14 million image-text pairs for non-commercial purposes.
Environment Setup
- Basic setup
PyTorch == 1.8.0
CUDA >= 10.1
- Other packages
pip install -r requirements.txt
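A quick way to confirm that the installed versions match the assumptions above (this is just a convenience snippet, not part of the repository):

```python
# Sanity-check the environment assumed above (PyTorch 1.8.0, CUDA >= 10.1).
import torch

print(torch.__version__)          # expected: 1.8.0
print(torch.version.cuda)         # expected: 10.1 or newer
print(torch.cuda.is_available())  # True if a compatible GPU and driver are present
```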
Model Checkpoint
- Model structure (two-stage autoregressive model)
- Stage1: Unlike the original DALL-E [1], we replace Discrete VAE with VQGAN [2] to generate high-quality samples effectively. We slightly fine-tune vqgan_imagenet_f16_16384, provided by the official VQGAN repository, on FFHQ [3] as well as ImageNet.
- Stage2: We train our 1.3B transformer from scratch on 14 million image-text pairs from CC3M [4] and CC12M [5]. For a more detailed model spec, please see configs/dalle-1.3B.yaml. A rough sketch of how the two stages compose at inference time follows this list.
- You can download the pretrained models, including the tokenizer, from this link. This requires about 5GB of disk space.
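As a rough, hypothetical illustration of the two-stage design (the names below are illustrative only and do not come from the minDALL-E codebase): the Stage-2 transformer autoregressively samples a grid of discrete codes conditioned on the text, and the Stage-1 VQGAN decoder maps those codes to pixels.

```python
# Hypothetical sketch of the two-stage generation flow; names are illustrative only.
from typing import Callable, List

def two_stage_generate(
    text_tokens: List[int],
    sample_next_code: Callable[[List[int]], int],  # Stage 2: one transformer sampling step
    decode_codes: Callable[[List[int]], object],   # Stage 1: VQGAN decoder (codes -> pixels)
    num_image_codes: int = 16 * 16,                # f16 VQGAN on 256x256 images -> 16x16 codes
) -> object:
    """Sample image codes autoregressively, then decode them into an image."""
    context = list(text_tokens)
    image_codes: List[int] = []
    for _ in range(num_image_codes):
        code = sample_next_code(context)  # condition on text tokens + codes generated so far
        image_codes.append(code)
        context.append(code)
    return decode_codes(image_codes)
```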
Sampling
- Given a text prompt, the code snippet below generates candidate images and re-ranks them using OpenAI’s CLIP [6].
- This has been tested on a single V100 with 32GB of memory. When using a GPU with less memory, lower num_candidates to avoid out-of-memory errors.
[…]
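A minimal sketch of the generate-then-rerank flow, assuming the repository's Dalle class, its sampling method, and the clip_score helper behave as described in the project README (these interfaces are assumptions, not verified here):

```python
# Sketch only: Dalle, Dalle.sampling, and clip_score are assumed to follow the
# interfaces described in the minDALL-E repository.
import numpy as np
import clip

from dalle.models import Dalle
from dalle.utils.utils import clip_score, set_seed

device = 'cuda:0'
set_seed(0)
prompt = "A painting of a monkey with sunglasses in the frame"

model = Dalle.from_pretrained('minDALL-E/1.3B')  # downloads the ~5GB checkpoint
model.to(device=device)

# Generate candidate images (lower num_candidates on GPUs with less memory).
images = model.sampling(prompt=prompt,
                        top_k=256,
                        top_p=None,
                        softmax_temperature=1.0,
                        num_candidates=96,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))  # NCHW -> NHWC

# Re-rank the candidates with CLIP [6] so the best-scoring images come first.
model_clip, preprocess_clip = clip.load("ViT-B/32", device=device)
rank = clip_score(prompt=prompt,
                  images=images,
                  model_clip=model_clip,
                  preprocess_clip=preprocess_clip,
                  device=device)
images = images[rank]
```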
Samples (Top-K=256, Temperature=1.0)
- “a painting of a {cat, dog} with sunglasses in the frame”
- “a large {pink, black} elephant walking on the beach”
- “Eiffel tower on a {desert, mountain}”
More
There is also dalle-mini, which provides a Colab notebook you can run to test it.