

Architecture
Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM). Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise from training images, and can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space into a smaller-dimensional latent space. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, built on a ResNet backbone, denoises the forward-diffusion outputs back into a compressed image representation in latent space, which the VAE decoder then converts back into an image in pixel space. The denoising step can be conditioned on a string of text, an image, or other data; an encoding of the conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder transforms prompts into an embedding space. Researchers point to reduced computational requirements for training and generation as an advantage of LDMs.
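The forward-diffusion step described above can be sketched in a few lines; the toy latent shape and linear noise schedule below are illustrative assumptions, not Stable Diffusion's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(latent, betas, rng):
    """Iteratively mix Gaussian noise into a latent, one schedule step
    at a time, as in the forward process of a diffusion model."""
    x = latent
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

latent = rng.standard_normal((4, 8, 8))   # stand-in for a VAE latent
betas = np.linspace(1e-4, 0.02, 50)       # toy linear noise schedule
noised = forward_diffusion(latent, betas, rng)
```

In the real model, the denoising U-Net is trained to invert exactly this kind of noising process, one step at a time, in the latent space produced by the VAE encoder.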

Training data
Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, in which 5 billion image-text pairs were classified by language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (i.e. subjective visual quality). The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of the model's training data found that, of a 12-million-image sample taken from the wider dataset, approximately 47% of the images came from 100 different domains, with Pinterest accounting for 8.5% of the sample, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons.

Training procedures
The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on "LAION-Aesthetics v2 5+", a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. Final rounds of training additionally dropped the text conditioning for 10% of training examples to improve classifier-free diffusion guidance.
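Dropping the text conditioning during training is what makes classifier-free guidance possible at sampling time: the model can produce both an unconditioned and a text-conditioned noise prediction, which are then combined. The sketch below is a minimal illustration of that combination, not Stable Diffusion's actual implementation.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the text-conditioned one. A scale of 1
    reproduces the conditioned prediction; larger scales push the
    output harder toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions standing in for U-Net outputs.
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)
guided = cfg_combine(eps_uncond, eps_cond, 7.5)
```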

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.
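As a rough sanity check of these figures (assuming, which the source does not state, that all 256 GPUs ran in parallel throughout):

```python
# Back-of-envelope arithmetic for the quoted training figures.
gpu_hours = 150_000
num_gpus = 256
cost_usd = 600_000

wall_clock_days = gpu_hours / num_gpus / 24   # total time with all GPUs in parallel
cost_per_gpu_hour = cost_usd / gpu_hours      # implied price per GPU-hour

print(round(wall_clock_days, 1))   # roughly 24 days
print(cost_per_gpu_hour)           # $4.00 per GPU-hour
```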

Technical limitations
Stable Diffusion exhibits degradation and inaccuracies in certain scenarios. Because the model was trained on a dataset of 512x512-resolution images, output quality tends to degrade when generation deviates from that "expected" 512x512 resolution. Another challenge is notable inaccuracy in generating human limbs, owing to poor data quality for limbs in the LAION dataset. The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the dataset, and prompting the model to generate such images can confound it. Generating animal limbs has been observed to be a challenge as well, with a reported failure rate of 25% when trying to generate an image of a horse.

Accessibility can also be a problem for individual developers. Customizing the model for new use cases not covered by the dataset, such as generating anime characters ("waifu diffusion"), requires new data and further training. However, this fine-tuning process is sensitive to data quality; low-resolution images, or images at resolutions different from the original training data, can not only fail to teach the model the new task but also degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run such models on consumer hardware. For example, waifu-diffusion requires a minimum of 30 GB of VRAM, which exceeds the memory of typical consumer graphics processing units (GPUs), such as NVIDIA's GeForce 30 series with around 12 GB.

Text to image generation
The text-to-image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt along with assorted optional parameters covering sampling types, output image dimensions, and seed values, and outputs an image file based on the model's interpretation of the prompt. Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, although this watermark loses its effectiveness if the image is resized or rotated.

Each txt2img generation involves a specific seed value which affects the output image; users may randomize the seed to explore different generated outputs, or reuse the same seed to reproduce a previously generated image. Users can also adjust the number of inference steps for the sampler; a higher value takes longer, while a smaller value may result in visual defects. Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt; more experimental or creative use cases may opt for a lower value, while use cases aiming for more specific outputs may use a higher value.
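The role of the seed can be illustrated with a toy example: sampling begins from seeded Gaussian noise, so reusing a seed reproduces the starting latent and, with all other settings held fixed, the same image. The latent shape below is an illustrative assumption.

```python
import numpy as np

def initial_latent(seed, shape=(4, 64, 64)):
    """Draw the starting Gaussian noise for a sampling run from a
    seeded random generator; the same seed yields the same start."""
    return np.random.default_rng(seed).standard_normal(shape)

a = initial_latent(42)
b = initial_latent(42)   # same seed: identical starting noise
c = initial_latent(43)   # different seed: different starting noise
```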

Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets. An alternative way of adjusting the weight of parts of the prompt is the use of "negative prompts", a feature included in some front-end implementations that lets the user specify content which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise appear in outputs because of the positive prompts provided by the user, or because of how the model was originally trained.
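One way bracket emphasis could be parsed is sketched below; the 1.1 multiplier per bracket level is an assumption modeled on common front-end conventions, and the parser itself is a toy, not part of Stable Diffusion.

```python
def token_weights(prompt, boost=1.1):
    """Toy parser for bracket emphasis: each enclosing '(' multiplies
    a phrase's weight by `boost`, each enclosing '[' divides it."""
    weights = []
    depth_up = depth_down = 0
    token = ""

    def flush():
        nonlocal token
        if token.strip():
            weights.append((token.strip(), boost ** depth_up / boost ** depth_down))
        token = ""

    for ch in prompt:
        if ch == "(":
            flush(); depth_up += 1
        elif ch == ")":
            flush(); depth_up -= 1
        elif ch == "[":
            flush(); depth_down += 1
        elif ch == "]":
            flush(); depth_down -= 1
        else:
            token += ch
    flush()
    return weights
```

For example, `token_weights("a ((red)) car")` gives "red" a weight of about 1.21 (two bracket levels) while "a" and "car" stay at 1.0.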

Image modification
Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided in the text prompt. The strength value denotes the amount of noise added to the original image: a higher strength value produces more variation within the output but may yield an image that is not semantically consistent with the provided prompt.
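How a strength value might translate into injected noise and denoising work can be sketched as follows; this is an illustrative simplification, not the exact scheduler math used by Stable Diffusion.

```python
import numpy as np

def img2img_init(latent, strength, total_steps, rng):
    """Sketch of img2img initialization: strength in [0, 1] controls
    how much Gaussian noise is mixed into the original latent and how
    many denoising steps are subsequently run over it."""
    denoise_steps = int(total_steps * strength)
    noise = rng.standard_normal(latent.shape)
    noised = np.sqrt(1.0 - strength) * latent + np.sqrt(strength) * noise
    return noised, denoise_steps

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))   # stand-in for the encoded input image
noised, steps = img2img_init(latent, 0.5, 50, rng)
```

At strength 0.0 the original latent passes through untouched and no denoising is performed; at strength 1.0 the input is replaced by pure noise, which reduces img2img to ordinary text-to-image sampling.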

The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized. The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially added to the image. Additionally, Stable Diffusion has been experimented with as a tool for image compression; compared to JPEG and WebP, compression with Stable Diffusion faces limitations in preserving small text and faces.

Additional use-cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided mask, which fills the masked space with newly generated content based on the provided prompt. Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.
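The compositing step of inpainting can be sketched as a simple masked blend; the arrays below are toy stand-ins for real images.

```python
import numpy as np

def inpaint_composite(original, generated, mask):
    """Final compositing step of inpainting: the masked region
    (mask == 1) is filled with generated content while the rest of
    the original image is kept unchanged. A minimal sketch."""
    return mask * generated + (1.0 - mask) * original

original = np.zeros((8, 8))     # stand-in for the user's image
generated = np.ones((8, 8))     # stand-in for newly generated content
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0            # user-provided mask selecting a square region
result = inpaint_composite(original, generated, mask)
```

Outpainting works analogously, except the "mask" covers newly added canvas beyond the original image borders rather than a region inside them.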

Image usage
Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals. Users forfeit any copyright and intellectual property claims to any of the model’s output, as generated images are considered to be in the public domain and available for use by any individual. The freedom provided to users over image usage has caused controversy over the ethics of ownership, as Stable Diffusion and other similar machine learning image synthesis products are trained from copyrighted images without the owner’s consent.

As visual styles and compositions are not subject to copyright, some argue that users of Stable Diffusion who generate images of artworks are not infringing upon the copyright of visually similar works; however, individuals depicted in generated images may still be protected by personality rights if their likenesses are used, and intellectual property such as recognizable brand logos remains protected by copyright. Nonetheless, visual artists have expressed concern that widespread use of image synthesis software such as Stable Diffusion may eventually cause human artists, along with photographers, models, cinematographers and actors, to gradually lose commercial viability against AI-based competitors.

Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, than comparable machine learning image synthesis products from other companies. Addressing concerns that the model may be used for abusive purposes, Stability AI CEO Emad Mostaque explains that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology", and that putting the capabilities of Stable Diffusion into the hands of the public would result in a net benefit overall, despite the potential negative consequences. Mostaque further argues that the intention behind the open availability of Stable Diffusion is to end the control and dominance of corporations that have previously only developed closed AI systems for image synthesis. This is reflected in the fact that any restrictions Stability AI places on the content users may generate can easily be bypassed due to the open-source nature of the license under which Stable Diffusion was released.

Limitations
Bias

Stable Diffusion was primarily trained on the laion2B-en subset, which consists mainly of images with English descriptions. As a result, generated images reinforce social biases and reflect a Western perspective, as the creators note that the model lacks data from other communities and cultures. The model produces more accurate results for prompts written in English than for those written in other languages, and Western or white cultures are often the default representation.