Recent text-to-image generation (T2I) models, such as Stable Diffusion and Imagen, have made significant progress in generating high-resolution images from text descriptions. However, many generated images still suffer from issues such as artifacts (e.g., distorted objects, text, and body parts), misalignment with the text description, and low aesthetic quality. For example, the prompt for the image below is “A panda riding a motorbike”, but the generated image shows two pandas, along with other undesired artifacts, including distorted panda noses and wheel spokes.
Inspired by the success of reinforcement learning from human feedback (RLHF) for large language models (LLMs), we explore whether learning from human feedback (LHF) can help improve image generation models. When applied to LLMs, human feedback can range from simple preference ratings (e.g., “thumb up or down”, “A or B”) to more detailed responses, such as rewriting a problematic answer. However, current work on LHF for T2I focuses mainly on simple responses like preference ratings, because fixing a problematic image often requires advanced skills (e.g., editing), which makes richer feedback too difficult and time-consuming to collect.
In “Rich Human Feedback for Text-to-Image Generation”, we design a process to obtain rich human feedback for T2I that is both specific (e.g., telling us what is wrong with the image and where) and easy to obtain. We demonstrate the feasibility and benefits of LHF for T2I. Our main contributions are threefold:
- We curate and release RichHF-18K, a human feedback dataset covering 18K images generated by Stable Diffusion variants.
- We train a multimodal transformer model, Rich Automatic Human Feedback (RAHF), to predict different types of human feedback, such as implausibility scores, heatmaps of artifact locations, and missing or misaligned text/keywords.
- We show that the predicted rich human feedback can be leveraged to improve image generation, and that the improvements generalize to models (such as Muse) beyond those used for data collection (Stable Diffusion variants).
To the best of our knowledge, this is the first rich feedback dataset and model for state-of-the-art text-to-image generation.
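To make the feedback types listed above more concrete, here is a minimal sketch of how one image's rich feedback could be represented. It is an illustration only, not the RichHF-18K schema or the RAHF output format: the class name `RichFeedback`, its field names, the score scale, the heatmap resolution, and the helper `artifact_mask` are all assumptions, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RichFeedback:
    """Hypothetical container for one image's feedback (illustrative names only)."""
    prompt: str
    implausibility_score: float          # assumed scale: higher means more implausible
    artifact_heatmap: List[List[float]]  # assumed values in [0, 1], one per image region
    misaligned_keywords: List[str] = field(default_factory=list)


def artifact_mask(feedback: RichFeedback, threshold: float = 0.5) -> List[List[bool]]:
    """Mark regions whose heatmap value is at or above the threshold."""
    return [[value >= threshold for value in row] for row in feedback.artifact_heatmap]


# Made-up values for the panda example; a real heatmap would be much larger.
example = RichFeedback(
    prompt="A panda riding a motorbike",
    implausibility_score=0.7,
    artifact_heatmap=[[0.1, 0.8], [0.6, 0.2]],
    misaligned_keywords=["panda"],
)
mask = artifact_mask(example)
```

Thresholding the heatmap into a mask is just one hypothetical way to turn "where" feedback into something a downstream step could consume; the threshold of 0.5 is arbitrary.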