October 23, 2023
Kevin Lin∗, Zhengyuan Yang∗, Linjie Li, Jianfeng Wang, Lijuan Wang∗♠
Microsoft Corporation
{keli,zhengyang,lindsey.li,jianfw,lijuanw}@microsoft.com
https://design-bench.github.io/
Abstract
We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios. Recent T2I models, such as DALL-E 3 [8, 67, 66], have demonstrated remarkable capabilities in generating photorealistic images that align closely with textual inputs. While the allure of creating visually captivating images is undeniable, our emphasis extends beyond mere aesthetic pleasure. We aim to investigate the potential of using these powerful models in authentic design contexts. In pursuit of this goal, we develop DEsignBench, which incorporates test samples designed to assess T2I models on both “design technical capability” and “design application scenario.” Each of these two dimensions is supported by a diverse set of specific design categories. We explore DALL-E 3, together with other leading T2I models, on DEsignBench, resulting in a comprehensive visual gallery for side-by-side comparisons. For DEsignBench benchmarking, we perform human evaluations on the generated images in the DEsignBench gallery against the criteria of image-text alignment, visual aesthetics, and design creativity. Our evaluation also considers other specialized design capabilities, including text rendering, layout composition, color harmony, 3D design, and medium style. In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V. This evaluator provides ratings that align well with human judgments, while being easily replicable and cost-efficient. A high-resolution version is available at this link.
Contents
1 Introduction
1.1 Motivation and Overview
2 DALL-E 3 Basics and DEsignBench Settings
2.1 DALL-E 3’s Working Modes
2.2 T2I Generation Capability Overview
3 Design Technical Capability
3.1 Text Rendering and Typography
3.2 Layout and Composition
3.3 Color Harmony
3.4 Medium and Style
3.5 3D and Cinematography
4 Design Scenario
4.1 Infographics Design
4.2 Animation/Gaming Design
4.3 Product Design
4.4 Visual Art Design
5 DEsignBench and Evaluation Results
5.1 Evaluation Method and Metric
5.2 Compared T2I Models
5.3 Evaluation Results
5.4 Limitations of DALL-E 3
6 Conclusions
Appendix: Comparisons among SDXL, Midjourney, Ideogram, Firefly 2, and DALL-E 3
List of Figures
- DEsignBench overview
- ChatGPT prompt expansion
- prompt following: detailed descriptions
- prompt following: uncommon scenes
- other challenge prompts
- text rendering: stylized text
- text rendering: low-frequency words
- text rendering: long text
- layout and composition: diagram, chart, table, calendar
- layout and composition: multi-panel layout
- color harmony: impression sunrise
- color harmony: starry night
- medium and style: cats 1
- medium and style: cats 2
- medium and style: cats 3
- 3D and cinematography: shape and lighting
- 3D and cinematography: lighting effect
- 3D and cinematography: camera view points
- 3D and cinematography: camera settings and lens
- 3D and cinematography: crowded scene 1
- 3D and cinematography: crowded scene 2
- infographics design: storybook, poster, and menu
- infographics design: industrial drafts, floorplans, and GUI
- infographics design: ads, marketing posters, and book covers
- infographics design: movie poster, ads
- infographics design: logo and postcards
- infographics design: greeting cards
- infographics design: coloring book
- product design: sticker
- animation design: cinematic scenes
- animation design: six-panel comic strip
- animation design: six-panel comic strip
- animation design: six-panel comic strip
- animation design: storyboard
- animation design: cartoon, emoji, anime
- gaming design: gaming 1
- gaming design: gaming 2
- product design: product and jewellery 1
- product design: product and jewellery 2
- product design: fashion
- product design: change clothes
- visual art design: 3D sculpture and historical art
- visual art design: historical art, time-space travel
- visual art design: knolling
- Human evaluation results on DEsignBench
- comparison between GPT-4V and human judgments on DEsignBench
- GPT-4V evaluation on DEsignBench
- GPT-4V evaluation on DEsignBench
- failure cases: uncommon scenes
- failure cases: document design
- failure cases: image generation
- text rendering comparisons
- text rendering comparisons
- layout and document comparisons
- layout and document comparisons
- color comparisons
- color comparisons
- artistic medium comparisons
- artistic medium comparisons
- style and 3D comparisons
- style and 3D comparisons
- camera settings comparisons
- color comparisons
- crowded scene comparisons
- crowded scene comparisons
- storybooks, academic posters, and menus comparisons
- storybooks, academic posters, and menus comparisons
- industrial drafts, floorplans, and GUI comparisons
- industrial drafts, floorplans, and GUI comparisons
- ads, posters, and book cover comparisons
- ads, posters, and book cover comparisons
- movie posters and ads comparisons
- movie posters and ads comparisons
- infographics design comparisons
- infographics design comparisons
- cinematic scene comparisons
- cinematic scene comparisons
- comic strip comparisons
- comic strip comparisons
- storyboard comparisons
- storyboard comparisons
- cartoon comparisons
- cartoon comparisons
- game design comparisons
- game design comparisons
- product design comparisons
- product design comparisons
- product design comparisons
- product design comparisons
- fashion design comparisons
- fashion design comparisons
- camera settings comparisons
- color comparisons
- 3d art comparisons
- 3d art comparisons
- historical art comparisons
- historical art comparisons
- knolling design comparisons
- knolling design comparisons
- DEsignBench logo design by DALL-E 3
1 Introduction
1.1 Motivation and Overview
Advancements in text-to-image (T2I) generation [1–3, 30, 37, 89, 24, 76, 96, 14, 85, 86, 78, 73, 8, 67, 42] have shown remarkable capabilities in generating high-fidelity images that follow user input text prompts. Many known challenges [58, 80, 26, 39], such as prompt following (e.g., the prompt “A horse riding an astronaut”), text rendering, and distortions in generated human faces and hands, have been significantly alleviated by recent advances, with examples deferred to Section 2. This rapid progress naturally raises a question: what is the next goal to make T2I generation even more practically valuable? In this work, we focus on design scenarios, and examine how state-of-the-art T2I models can assist visual design [82, 72, 50, 51, 38, 52, 70, 100], in addition to merely generating visually pleasant results.
To this end, we present a new evaluation benchmark named DEsignBench to examine T2I models’ capabilities in assisting visual design. In addition to the base T2I capabilities covered by standard T2I benchmarks [44, 80, 96, 34, 21, 39], DEsignBench evaluates visual design from two unique perspectives, i.e., the core design technical capability and the design application scenario. We then collect evaluation prompts organized into each category and aspect. We collect the results of state-of-the-art T2I models [73, 3, 2, 1, 8, 67] into our DEsignBench gallery, and perform both human and GPT-4V [68, 69, 93] evaluations on DEsignBench. Figure 1 overviews the DEsignBench structure, with each component detailed as follows.
DEsignBench topology. DEsignBench organizes the visual design abilities to examine into two categories, namely the design technical capability and the design application scenario. The design technical capability separately zooms into each core technical capability required for visual design, including text rendering and typography [55, 11], layout and composition [81, 71], color harmony [4, 63], medium and artistic style [56], and 3D and cinematography [60, 13]. We further define sub-categories under each capability, and manually craft text prompts accordingly. The design application scenario focuses on real design applications, which usually require the seamless integration of multiple design technical capabilities. Example categories include infographics, animation, gaming, product, and visual art.
DEsignBench data and gallery. Based on the DEsignBench topology, we organize samples into an evaluation set of 215 prompts, with corresponding design category tags, leading to a new challenging generation benchmark focused on visual design. We collect images generated by the state-of-the-art T2I models (SDXL v1.0 [73], Midjourney v5.2 [3], Ideogram [2], Firefly 2 [1], and DALL-E 3 [8, 67]), and compile them into the DEsignBench gallery for side-by-side qualitative comparisons.
DEsignBench evaluation. We conduct the human evaluation [75, 96, 80, 73] on images in the DEsignBench gallery, assessing them based on three primary criteria: visual aesthetics, image-text alignment, and design creativity. The design creativity aspect asks human annotators to evaluate if the generated image is a novel design, i.e., whether it showcases unique and innovative interpretations of the input prompt and brings a fresh perspective. Additionally, the evaluation also considers five other design-specific capabilities, i.e., text rendering, composition and layout, color harmony, 3D and cinematography, and medium and style, each paired with specific annotation guidelines.
Furthermore, we explore an automatic evaluation pipeline, which provides a more cost-effective approach with reproducible results. Automatic evaluation with large language models has shown promise in various natural language processing [18, 53, 27] and vision-language understanding tasks [97]. However, T2I evaluation is more complicated. It requires both a high-level semantic understanding (e.g., image-text alignment) and a detailed visual comparison across two images (e.g., visual aesthetic ranking), not to mention several other design-specific criteria. Following prior studies that take large multimodal models (LMMs) [68, 69, 93, 62] for T2I image-text alignment evaluation [8, 5, 95], we propose a pairwise model rating based on GPT-4V that comprehensively evaluates all aspects as a human annotator would. The high consistency with human ratings indicates the effectiveness of the proposed LMM-based T2I evaluation.
Our contributions are summarized as follows.
- We explore DALL-E 3 on imagining visual design. We then present DEsignBench, a new challenging text-to-image generation benchmark focusing on assisting visual design.
- We propose an automatic GPT-4V evaluation for DEsignBench evaluation, which provides reproducible results that align well with human ratings.
- We collect the DEsignBench gallery, which side-by-side compares the images generated by various state-of-the-art T2I models (SDXL, Midjourney, Ideogram, Firefly 2, DALL-E 3).
The remaining sections are organized as follows. Section 2 uses DALL-E 3 to provide an overview of the state of the art in T2I generation, and to justify the experiment settings in DEsignBench. Sections 3 and 4 introduce the design technical capability and the design application scenario, respectively, using insights from DALL-E 3. The human and GPT-4V quantitative evaluations are discussed in Section 5. Finally, the appendix shows the complete DEsignBench gallery, showcasing output comparisons among SDXL, Midjourney, Ideogram, Firefly 2, and DALL-E 3.
2 DALL-E 3 Basics and DEsignBench Settings
In this section, we overview the state-of-the-art T2I generation capability, with explorations on DALL-E 3. We then introduce the experiment settings in DEsignBench.
2.1 DALL-E 3’s Working Modes
ChatGPT prompt expansion. DALL-E 3 [8, 67, 66] adopts ChatGPT [65] for prompt expansion, i.e., converting an input user query into a more detailed text description. As shown in Figure 2, we empirically observe that this prompt expansion (cf. user input vs. expanded prompt) also benefits other compared T2I models, such as SDXL [73] and Midjourney [3]. Therefore, we take the “expanded prompt” as the default setting in DEsignBench.
In addition to DALL-E 3’s default prompt expansion behavior in ChatGPT, defined by the built-in system prompt (e.g., sequentially generating four prompts and producing four images), we find it helpful to add extra instructions to ChatGPT for specialized prompt drafting, illustrated by the two patterns below and the sketch that follows the list.
- Generate a detailed description and then generate one image: Longer and more detailed prompts generally lead to better images, i.e., more object details, correct scene text, and better image quality. We find it helpful to explicitly ask ChatGPT to provide a detailed description, and to ease the task by asking for one prompt instead of four; both encourage a more detailed T2I prompt. We find this instruction particularly helpful for generating complicated scenes, such as posters, books, and ads, which are otherwise almost impossible to create.
- Exactly repeat the same prompt for one image: In other cases, we may want to turn off ChatGPT’s prompt paraphrasing, e.g., to change a few attribute words in a controlled manner or to reproduce a previously generated image. To achieve that, we can simply ask ChatGPT to “exactly repeat the same prompt.”
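The two patterns above can be wrapped around a user query before it is sent to the ChatGPT + DALL-E 3 interface. Below is a minimal sketch of this wrapping; the helper name `build_t2i_request` and the exact instruction wording are illustrative assumptions and do not reproduce DALL-E 3’s actual system prompt.

```python
# Illustrative sketch only: paraphrases of the two prompting patterns above,
# not DALL-E 3's built-in system prompt.

DETAILED_SINGLE = (
    "Generate a detailed description of the scene below, "
    "then generate exactly one image from that description.\n\n{query}"
)
EXACT_REPEAT = (
    "Exactly repeat the same prompt below for one image, "
    "without paraphrasing or adding any details.\n\n{query}"
)

def build_t2i_request(user_query: str, mode: str = "detailed") -> str:
    """Return the instruction string to send to the ChatGPT + DALL-E 3 interface."""
    template = DETAILED_SINGLE if mode == "detailed" else EXACT_REPEAT
    return template.format(query=user_query)

# Example: encourage a long, detailed expansion for a complicated poster prompt.
request = build_t2i_request("a movie poster for a sci-fi thriller set on Mars")
```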
Multi-round dialogue-based T2I. DALL-E 3 with ChatGPT also naturally supports multi-round dialogue-based generation. The chat interface allows users to refer to the generation history when generating the next image. For example, one may refer to a specific generated image and give an editing instruction, such as “Change the cloth in the second image into the blue color,” and naturally continue with multi-round editing. Another example is to preserve arbitrary visual aspects of a generated image, such as keeping the character appearance or image style when generating a multi-image comic strip (e.g., in Figures 31–33).
2.2 T2I Generation Capability Overview
We next provide an overview of DALL-E 3’s generation capability, using popular testing prompts from existing benchmarks and community posts. Overall, we observe that DALL-E 3’s unprecedented prompt following ability allows it to effectively solve many well-known challenge cases. This observation motivates us to go a step further and construct DEsignBench, which considers the more challenging yet valuable scenarios of visual design.
Prompt following: detailed descriptions. Prompt following is one key challenge in T2I generation. Previous T2I models tend not to strictly follow the text prompt, leading to incorrect objects and attributes [26, 15, 10, 25]. We use well-known failure cases from PartiPrompts [96] to show DALL-E 3’s prompt following capability. As shown in Figure 3, DALL-E 3 generates images with correct object counts, relative sizes, global and local attributes, minimal object hallucination, and scene text. As further discussed throughout the paper, this unprecedented prompt following ability is critical for the imagined design scenarios, allowing designers to use arbitrary text for image control more confidently.
Prompt following: uncommon scenes. In addition to following complicated long prompts, prompt following also requires models to faithfully generate uncommon scenes, such as “A horse riding an astronaut.” Following prompts for uncommon scenes is essential for design scenarios, which usually involve imaginative creations with uncommon attributes and object combinations. In Figure 4, we examine representative challenging prompts from community posts [59]. DALL-E 3 shows the capability to generate uncommon spatial relationships, object shapes, attributes, etc. Such prompt following capability may assist designers in creating their imaginative pieces more easily.

Word-level Acc. (%) | Short Words | Challenging Words | Sentences | Total
Midjourney [3] | 0.0 | 0.0 | 4.3 | 1.1
SDXL [73] | 37.9 | 5.0 | 19.0 | 25.0
IF [41] | 62.5 | 15.8 | 39.4 | 45.0
DALL-E 3 | 83.3 | 31.7 | 62.4 | 65.2

Table 1: Word-level text rendering accuracy when selecting the best from N = 4 generated images.
Image generation: text rendering. Text rendering [49, 57, 16, 92, 84] is critical for design scenarios, yet remains challenging for previous T2I models [3, 73]. We empirically observe that DALL-E 3 can render text in images more reliably, though it is still not perfect on more complicated text. Table 1 provides a quantitative comparison of word-level scene text accuracy on 40 constructed samples. Specifically, we run the Microsoft Azure OCR system and count exact word matches against the text in the input prompt. We generate N = 4 images for each prompt and report the best result. We show additional qualitative results later in Figures 6–10.
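For clarity, a small sketch of one plausible reading of this word-level metric follows, assuming the OCR output for each generated image is already available as a list of words (in the paper, these come from the Microsoft Azure OCR system); the function names and toy data are hypothetical.

```python
import re

def word_accuracy(prompt_text: str, ocr_words: list[str]) -> float:
    """Fraction of prompt words exactly matched (case-insensitive) by the OCR output."""
    target = [w.lower() for w in re.findall(r"[\w']+", prompt_text)]
    recognized = {w.lower() for w in ocr_words}
    return sum(w in recognized for w in target) / max(len(target), 1)

def best_of_n(prompt_text: str, per_image_ocr_words: list[list[str]]) -> float:
    """Best word-level accuracy over the N generated images (N = 4 in Table 1)."""
    return max(word_accuracy(prompt_text, words) for words in per_image_ocr_words)

# Toy example with placeholder OCR outputs for two generated images.
print(best_of_n("gala apple 32oz 907g", [["gala", "apple", "32oz"], ["gala"]]))  # 0.75
```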
Image generation: other challenges. We also examine other common failures shared among previous T2I models, such as hand and face generation, unique art styles, challenging objects, etc. We empirically observe that DALL-E 3 works more reliably in these challenging cases. Figure 5 shows several examples of such “common failures” discussed in previous papers and community posts [58, 77, 80, 26, 39], e.g., detailed and uncommon attribute designs, uncommon scenes, etc. We group the explorations based on their usage in design scenarios and present them in the next section on design technical capability.
3 Design Technical Capability
Design encompasses a broad spectrum, from product and advertisement to logo and fashion design. Essential to any design tool is the capacity to produce text, shapes, charts, and diagrams [56, 55, 11]. Beyond these basics, the tool should be adept at crafting layouts that are not only semantically accurate but also aesthetically appealing [81]. Mastery of elements such as 3D, lighting, color palettes, and varied rendering materials and styles is indispensable [60, 71, 13]. In the following section, we highlight DALL-E 3’s competencies in addressing diverse design challenges.
3.1 Text Rendering and Typography
Figure 6 presents six diverse styled text renderings, spanning graffiti art, calligraphy, handwritten texts, mathematical symbols, multilingual scripts, and musical notations. While DALL-E 3 impressively renders English text across different styles, it exhibits some inaccuracies. The math equation, for instance, misinterprets certain operators and signs. While the layout for multilingual rendering appears organized, it struggles with certain languages, particularly Chinese and Japanese. The musical notation, while superficially resembling actual sheet music, includes several inaccuracies, underlining DALL-E 3’s constraints in this domain.
Figure 7 illustrates renderings of infrequently occurring text. This includes misspelled words such as “Happpy Hallooween” and “Baaabas,” and random character sequences like “CVD0p Sstpn6tsp”.
Figure 8 showcases renderings of extended text passages. For instance, “Hierarchical Text-Conditional Image Generation with CLIP Latents.” The compound text “gala apple NET NT 32oz (2 LB) 907g” poses a unique challenge with its amalgamation of words, numerals, and units. Yet, DALL-E 3 produces a layout reminiscent of a store price tag.
Effective typography is more than accurate spelling [11]. Font selection is vital, needing alignment with content and medium. The choice between serif and sans-serif hinges on the communication context. Font size is key, with hierarchy distinguishing headings, subheadings, and body text for clarity and visual definition. Figures 32 and 33 depict the rendering of the Pusheenish font in the dialogue balloons. Figure 24 showcases the font hierarchy rendering in sophisticated posters.
For clear visuals, colors must contrast well with the background and convey intended emotions. Uniform alignment ensures a cohesive, organized text presentation. Figure 23 displays various font colors in GUI design, while Figure 22 showcases DALL-E 3’s alignment capabilities in creating storybook design.
When these facets converge cohesively, typography elevates from a mere conveyance of information to a medium that enhances design aesthetics and user engagement. The “Born Pink” mug in Figure 39 exemplifies this, seamlessly blending handwritten and printed styles, harmonized by color and lighting choices.
3.2 Layout and Composition
Creating a compelling layout and composition in design demands a keen understanding and strategic implementation of several key elements [81], ensuring that the visual space effectively communicates and resonates with the viewer.
Figure 9 displays layouts including block diagrams, pie charts, flow charts, bar graphs, tables, and calendars. While DALL-E 3 generally crafts decent layouts, it sometimes struggles with intricate details.
Figure 10 illustrates multi-panel layouts such as storyboards, how-tos, memes, and comics. Consistency in elements, colors, and patterns is vital in multi-panel designs to unify the composition and guide viewers. Designers utilize flow and movement, directing the viewer’s eye using lines and element arrangements, to ensure a seamless experience.
3.3 Color Harmony
Color harmony is a vital principle in design that ensures various colors in a composition create a cohesive, aesthetically pleasing experience for the viewer [64, 9]. A harmonious color palette can evoke specific emotions, set the tone, and enhance the overall impact of a piece.
Figure 11 displays variations of color palettes in oil paintings inspired by “Impression Sunrise.” These range from Spring, Summer, Autumn, and Winter Palettes to a Romantic Palette and a monochromatic green shade. This serves as a test to see if DALL-E can adeptly control and render color palettes. DALL-E 3 effectively captures the distinct tones associated with different seasons and themes.
Figure 12 presents six color palette variations in oil paintings, inspired by “Starry Night,” testing complementary color harmonies. It’s striking how DALL-E captures and renders these vibrant starry scenes with such vitality and beauty.
3.4 Medium and Style
The artistic medium and style are crucial in visual graphic design [56], defining the work’s expressive potential and emotional resonance. The medium, encompassing the tools, materials, or digital platforms employed, sets the boundaries and opportunities for expression, shaping the tactile and sensory experiences of the audience.
Figure 13 shows examples of sketching a cat in different styles, including continuous line drawing, charcoal sketch, stippling sketch, brush and ink sketch, etc. Figures 14 and 15 demonstrate the capability of specifying different art media, including block print, folk art, paint-by-numbers, watercolor, wood carving, Lego style, glass blowing, calligraphy, etc. These examples are just a small subset of the art styles and media that DALL-E 3 covers. They provide a glimpse of DALL-E 3’s capability of rendering with a broad range of artistic media and styles.
3.5 3D and Cinematography
3D rendering [90] and cinematography [13] are transformative tools in the world of visual representation, allowing for the creation of intricate, lifelike scenes and stories. The depth, perspective, and dynamism brought about by these techniques offer a multi-dimensional view, enhancing the viewer’s experience and immersion.
Figure 16 shows examples of 3D rendering, including basic shapes, spatial relationships, lighting effects, shadow, reflections, and various viewing angles. DALL-E 3 proficiently captures self-shadows and cast shadows and effectively manages reflections on both flat and curved surfaces. The transition between two light sources is smooth. We find that DALL-E 3 sometimes does not follow view angles precisely. For example, the front view rendering is noticeably off.
In Figure 17, we show DALL-E 3’s capabilities of generating special lighting effects including chemiluminescent glow, bioluminescent glow, light-painting, and Aurora glimmering.
Figure 18 shows different camera angles and positions, including close-ups, bird’s-eye views, and low and side angles. For close-up shots, DALL-E 3 blurs the background appropriately to enhance the scene depth and puts the focus on the foreground.
Figure 19 shows examples of simulating fisheye and wide-angle lenses, slow and fast shutter speeds, instant cameras, and tilt-shift photography. At the bottom left, DALL-E 3 simulates an instant camera, whose photos are usually grainy. At the bottom right, DALL-E 3 simulates tilt-shift photography, focusing on the lady while gradually blurring her surroundings.
Figures 20 and 21 demonstrate DALL-E 3’s capabilities in rendering crowded scenes. Figure 20 shows renderings of different numbers of bears. DALL-E 3 correctly generates the desired number of bears when the number is small. When the number gets larger, however, DALL-E 3 makes mistakes (as shown in the last row). Figure 21 shows images of large human crowds on a variety of occasions. We find that DALL-E 3 does a nice job of positioning the texts and rendering them with the correct perspectives. At the bottom left, DALL-E 3 generates an exaggerated scene of a popular burger eatery with a super long serving counter and a large waiting crowd. The exaggeration looks plausible and conveys the popularity of the burger.
4 Design Scenario
In the evolving landscape of design [71, 56, 55], the prowess of AI models in various design domains has become an area of keen interest. This comprehensive analysis dives into DALL-E 3’s capabilities across diverse design spectrums, from the intricacies of infographics and the dynamism of animation and gaming to the finesse required in product design and the artistic nuances in visual art. Each subsection sheds light on specific challenges and achievements of DALL-E 3, presenting a holistic view of its strengths and areas of improvement. Through a series of illustrative figures and descriptions, we unravel the depth and breadth of DALL-E 3’s design proficiency, offering insights into its potential and limitations in reshaping the future of design.
4.1 Infographics Design
This section delves into DALL-E 3’s proficiency across a spectrum of infographic designs, from storybook pages and advertisements to menus, GUIs, movie posters, logos, etc.
In Figure 22, storybook pages, research posters, and menus are presented. DALL-E 3 crafts compelling layouts for each. The storybook pages feature text paragraphs, which is a significant challenge for image generation models. While DALL-E 3 struggles with paragraph perfection, individual letters are discernible and many words remain clear.
Figure 23 showcases industrial design drafts, floor plans, and GUI designs, with DALL-E 3 producing commendable text and layout renderings.
Figure 24 depicts assorted advertisement posters and book covers, each with varying text, fonts, and sizes. For example, the two conference posters in the middle row contain very small text at the bottom: “the international conference on learning representation” and “Computer Vision and Pattern Recognition.” It is impressive that DALL-E 3 adeptly renders the minute text, underscoring its meticulous detailing.
Figure 25 shows movie posters, photorealistic advertisement posters, and cartoon book pages. In the movie poster at the top left, DALL-E 3 does a nice job of rendering the main character in a way that smoothly transitions between the two very different color themes. In the advertisement image at the middle left, both the brand name “crispy” and the slogan “unleash the fizz” are spelled correctly, and their rendering follows the curvature of the soda can surface. In addition, the can that the person is holding has the same look as the “Crispy” soda.
Figures 26 and 27 offer glimpses into logo designs, postcards, and themed greeting cards. The logos are sleek, while the greeting cards aptly capture seasonal and cultural nuances.
Lastly, Figure 28 displays coloring book pages, where DALL-E 3 retains the signature black and white line drawing style. Figure 29 presents sticker designs set against a pristine background.
4.2 Animation/Gaming Design
This section explores DALL-E 3’s capabilities in animation and game designs, including cinematic scenes, comic strips, storyboards, and in-game scenes.
Figure 30 shows examples of cinematic scenes. DALL-E 3 does a decent job of using closeup shots, scene depth, and lighting to enhance the drama and intensity.
Figures 31, 32, and 33 present comic strips across multiple panels. Despite generating each panel independently, DALL-E 3 consistently retains character identities and adeptly positions dialogue bubbles with legible text.
Figure 34 shows a storyboard of two warriors going from fighting to reconciliation. There are 6 images, and each image is generated independently. DALL-E 3 successfully creates the gradual emotion changes of the two warriors. In addition, DALL-E 3 is able to maintain the identities of the two warriors across the panels.
Figure 35 highlights emojis and varied cartoon styles, spanning Comics, Anime, and Ghibli.
Lastly, Figures 36 and 37 show examples of various game-related scenarios. DALL-E 3 understands the difference between a game scene (e.g., middle left) and a game-playing environment (bottom left). In addition, it is able to generate a first-person shooter perspective with a GUI panel.
4.3 Product Design
This section explores DALL-E 3’s capabilities in product and fashion designs as well as clothing alterations.
Figures 38 and 39 show a variety of product designs. All the product images generated by DALL-E 3 look elegant, with appealing color and texture. The text font matches the corresponding product type very well. It is interesting to note that in the “Born Pink” mug image at the middle left of Figure 39, the letters “B” and “P” share half a letter. The sharing looks so natural that it is hardly noticeable.
Figure 40 presents fashion design examples. The line sketch style gives a professional look. The dresses look appropriate for the corresponding seasons.
Lastly, Figure 41 exhibits clothing alterations. DALL-E 3 adeptly interprets text prompts, adjusting garment colors and styles with precision.
4.4 Visual Art Design
This section explores DALL-E 3’s capabilities in 3D sculpture design, historical art recreation, and time-space travel.
Figure 42 shows examples of 3D sculpture designs. At the middle left, the prompt asks to add Sun Wukong, the beloved Monkey King from the Chinese novel “Journey to the West,” as the fifth statue on Mount Rushmore, but DALL-E 3 mistakenly adds three statues of Sun Wukong. Nonetheless, the generated image gives the illusion of being sculpted into the rock.
Figure 43 shows examples of recreating historical arts, including the city life of the capital city in the Tang dynasty and London in 1816. The image at the bottom right is an imagination of Times Square in 2075, which looks futuristic with green buildings and flying vehicles.
Figure 44 shows a variety of knolling examples. We find that DALL-E 3’s knolling design usually contains a lot of detailed elements. Even though the number of elements is sometimes very large, their geometric arrangement is always aesthetically pleasing.
5 DEsignBench and Evaluation Results
DEsignBench evaluates design from two perspectives: (i) Design technical capabilities: we measure the core technical capabilities for visual design, including text rendering and typography [55, 11], layout and composition [81, 71], color harmony [4, 63], medium and artistic style [56], and 3D and cinematography [60, 13]; (ii) Design application scenarios: we consider a variety of real design applications, such as infographic, animation, gaming, visual arts, and product design.
We collect text prompts that encompass a diverse range of design scenarios. In total, we collect 215 user inputs, systematically organized following the data topology introduced in Sections 3 and 4. Using the ChatGPT interface [65, 68] of DALL-E 3, we expand these collected user inputs into a more nuanced and detailed set of descriptions. As discussed in Section 2, we observe that the expanded text prompts help improve the design fidelity across all experimented T2I models [73, 3, 2, 1, 8, 67]. Therefore, we conduct the experiments and evaluation using the expanded text prompts.
In the Appendix, we present the DEsignBench gallery containing all images generated by the experimented state-of-the-art T2I models. All the text prompts and images used in the evaluation will be publicly available for future research.
5.1 Evaluation Method and Metric
Human evaluation. We conducted pairwise comparisons to assess the design technical capabilities of current text-to-image (T2I) models, involving five participants who have experience with T2I tools.
As shown in Table 2, each participant was presented with an expanded text prompt followed by two images, each generated by different T2I models. The participants were then instructed to perform a pairwise comparison, employing a diverse set of criteria to judge which of the two given images is preferred. To facilitate a detailed examination, participants were permitted to adjust the image view by zooming in or out, thereby inspecting finer visual details for informed judgment. We refer readers to Table 2 for details on the evaluation criterion and annotation instruction, i.e., the three overall ratings on image-text alignment, aesthetics, and design, and the other five design-specific capabilities.
For each criterion shown in Table 2, participants were directed to choose between two alternatives: (i) Image 1 or (ii) Image 2. Additionally, to glean deeper insights into their rationales, the participants were encouraged to supplement their choices with qualitative feedback.
We also note that certain design-specific capabilities are only evaluated on a subset of prompts. For instance, if a pair of images lacks rendered texts, such pairs are disregarded during the evaluation of the text rendering capability.
Given the rigorous nature of the evaluation process, characterized by an extensive set of inquiries (i.e., 8 questions per pairwise comparison), we strategically reduced a portion of the annotation workload. Consequently, participants were assigned to assess a specific subset of pairwise comparisons, namely: DALL-E 3 vs. Midjourney, DALL-E 3 vs. SDXL, Midjourney vs. SDXL, and Midjourney vs. Firefly 2.
GPT-4V evaluation. Recent studies [29, 53, 19, 18, 91, 101, 33, 20, 87, 23, 47, 46, 97, 45, 5, 31, 99, 54, 74] have underscored the promising capabilities of deploying Large Language Models (LLMs) [68, 65, 88] as automated evaluators across various language and vision-language tasks. With the emergence of Large Multimodal Models (LMMs) [68, 69, 93, 62] such as GPT-4V [69], an intriguing question arises: can GPT-4V be effectively harnessed for T2I evaluations? Following prior studies that take LMMs for image-text alignment evaluation [8, 5, 95], we propose a pairwise model rating based on GPT-4V that comprehensively evaluates all aspects as a human annotator would. Table 3 shows the prompt design we used for the experiments. First, GPT-4V takes two images and the text prompt as inputs. Then, GPT-4V compares the two images using the evaluation criteria listed in Table 2, addressing each criterion sequentially. Finally, GPT-4V describes its rationale and then selects one of the two images. In our experiments, we invoke GPT-4V five times and subsequently report the mean and variance of the results.
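Below is a minimal sketch of this pairwise rating loop using the OpenAI Python client; the model identifier, the question wording (a stand-in for the Table 3 template), and the answer parsing are assumptions for illustration, not the exact implementation used in the paper.

```python
import base64
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def _image_part(path: str) -> dict:
    """Encode a local image as a data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def gpt4v_pairwise(prompt_text: str, img1: str, img2: str, criterion: str,
                   n_calls: int = 5) -> tuple[float, float]:
    """Ask GPT-4V n_calls times which image is preferred under one criterion;
    return the mean and variance of Image 1's win indicator."""
    question = (
        f"Text prompt: {prompt_text}\n"
        f"Compare the two images on the criterion '{criterion}'. "
        "Explain your rationale, then end with 'Image 1' or 'Image 2'."
    )
    wins = []
    for _ in range(n_calls):
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed model identifier
            messages=[{"role": "user", "content": [
                {"type": "text", "text": question},
                _image_part(img1), _image_part(img2)]}],
        )
        answer = resp.choices[0].message.content or ""
        # Naive parsing: take whichever choice is mentioned last in the reply.
        wins.append(1.0 if answer.rfind("Image 1") > answer.rfind("Image 2") else 0.0)
    return statistics.mean(wins), statistics.pvariance(wins)
```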
Table 2: Example questionnaire for human evaluation. Participants were presented with a text prompt followed by two images. Participants were instructed to compare the two images and answer all the 8 questions. For each question, participants were asked to select one of two options: (i) Image 1 or (ii) Image 2. See Section 5.1 for more details.
5.2 Compared T2I Models
We compare DALL-E 3 with the recent state-of-the-art T2I models, including Midjourney V5.2 [3], Stable Diffusion XL 1.0 (SDXL) [73], Ideogram [2], and Adobe Firefly 2 [1]. Note that some of these models come as part of the integrated software programs with additional functionalities, such as image editing. We omit these features and evaluate exclusively their T2I capabilities.
Each T2I model takes the expanded text prompt as input and generates four image variations. We randomly select one image, without cherry-picking, for evaluation. Given 215 text prompts and five T2I models, we have 2150 pairs in total for pairwise comparison.
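As a sanity check on this count, five models give C(5,2) = 10 unordered model pairs, and 10 pairs times 215 prompts yields 2150 pairwise comparisons:

```python
from itertools import combinations

models = ["DALL-E 3", "Midjourney v5.2", "SDXL 1.0", "Ideogram", "Firefly 2"]
num_prompts = 215  # DEsignBench expanded prompts

model_pairs = list(combinations(models, 2))    # 10 unordered model pairs
assert len(model_pairs) * num_prompts == 2150  # total pairwise comparisons
```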
Table 3: Prompt design for GPT-4V-assisted evaluation, where I1 and I2 are the two images, and P is the expanded text prompt. Taking the prompt template filled with P, I1, and I2, GPT-4V will output its thought and select one of the two given images. We highlight the evaluation criterion considered in this example in yellow. The criterion can be replaced with any of the other ones listed in Table 2. The prompt design is inspired by [8].
5.3 Evaluation Results
Results on human evaluation. Figure 45 shows the category-specific comparison among DALL-E 3, Midjourney, SDXL, and Firefly 2 on DEsignBench. We observe that human annotators prefer the images generated by DALL-E 3 over those of Midjourney and SDXL in all eight categories considered. In addition, Midjourney garnered preference over SDXL in seven of the eight categories, except for text rendering. Midjourney proves slightly more favorable than Firefly 2 in five of the eight categories. These findings indicate a hierarchical preference, with DALL-E 3 emerging as the most favorable model. Midjourney and Firefly 2 occupy the second tier, demonstrating substantial competence, while SDXL appears positioned in the third tier in the DEsignBench evaluations.
Results on GPT-4V evaluation. To assess the efficacy of GPT-4V as an automated evaluator, we conduct a consistency analysis. Figure 46 illustrates the correlation between human preferences and the assessments made by GPT-4V on DEsignBench. This analysis involves invoking GPT-4V five times and subsequently reporting the mean and variance of the results. Our observations indicate that the judgments of GPT-4V predominantly concur with human evaluations, with sporadic discrepancies most notable in the evaluation of text rendering capabilities when comparing Midjourney vs. SDXL and Midjourney vs. Firefly 2. Despite these occasional divergences, GPT-4V exhibits relatively reliable performance across a spectrum of evaluative criteria, demonstrating its potential as an automated tool for T2I evaluation, particularly in pairwise comparisons.
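One way to organize such a consistency analysis, assuming each record stores the human choice and the five GPT-4V choices for a given prompt and model pair, is sketched below; this is an illustrative reconstruction rather than the exact analysis behind Figure 46.

```python
import statistics

def preference_rates(records: list[dict]) -> tuple[float, float, float]:
    """records: entries for one model pair (A vs. B), one per prompt, e.g.
    {'human': 1, 'gpt4v': [1, 1, 2, 1, 1]} where 1 means image A is preferred.
    Returns A's human win rate, GPT-4V's mean win rate for A over the five
    invocations, and the variance of that rate across invocations."""
    human_rate = statistics.mean(r["human"] == 1 for r in records)
    per_run = [
        statistics.mean(r["gpt4v"][k] == 1 for r in records)
        for k in range(len(records[0]["gpt4v"]))
    ]
    return human_rate, statistics.mean(per_run), statistics.pvariance(per_run)
```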
Figures 47–48 show the GPT-4V evaluation results comparing the five T2I models considered. In these experiments, we again invoke GPT-4V five times and report the mean and variance of the results.
DALL-E 3 stands out as the most favorable model, followed by Firefly 2 and Midjourney within the second tier. SDXL and Ideogram are positioned within the third tier. We observe a notable consistency in GPT-4V evaluation, given the absence of any cyclical anomalies in the pairwise comparisons reviewed.
Finally, we present example outputs of the GPT-4V evaluator in Tables 4–7. We observe that GPT-4V can correctly analyze the images and make reasonable assessments. Tables 8–9 show representative failure cases of the GPT-4V evaluator: GPT-4V may miscount the teddy bears in an occluded scene, and it may struggle to read small text, instead shifting its attention toward evaluating the overall aesthetics of the image.
Table 4: Example result from GPT-4V. Given the text prompt and two images, GPT-4V compares the two images and makes a reasonable assessment. The key rationale is highlighted in yellow. Note that Image 1 is generated by DALL-E 3, and Image 2 is generated by Midjourney.
Table 5: Example result from GPT-4V. Given the text prompt and two images, GPT-4V compares the two images and makes a reasonable assessment. The key rationale is highlighted in yellow. Note that Image 1 is generated by DALL-E 3, and Image 2 is generated by SDXL.
Table 6: Example result from GPT-4V. Given the text prompt and two images, GPT-4V compares the two images and makes a reasonable assessment. The key rationale is highlighted in yellow. Note that Image 1 is generated by DALL-E 3, and Image 2 is generated by Midjourney.
Table 7: Example result from GPT-4V. Given the text prompt and two images, GPT-4V compares the two images and makes a reasonable assessment. The key rationale is highlighted in yellow. Note that Image 1 is generated by DALL-E 3, and Image 2 is generated by Midjourney.
Table 8: Failure case in GPT-4V evaluation. Incorrect rationale is highlighted in red. Note that Image 1 is generated by Midjourney, and Image 2 is generated by SDXL.
Table 9: Failure case in GPT-4V evaluation. Incorrect rationale is highlighted in red. Note that Image 1 is generated by DALL-E 3, and Image 2 is generated by Midjourney.
5.4 Limitations of DALL-E 3
We next discuss the representative failure cases and model limitations. First, DALL-E 3 may still fail on certain challenging prompts that describe uncommon or complicated scenes. For example, “all buildings of the same height” in Figure 49(a), “guitar without string” in (b), “fork in the pumpkin” in (c), “quarter-sized pizza” in (d), “to the left of” in (e), and the green grass in the left- and right-most part of (f).
DALL-E 3 has shown an impressive performance in text rendering and layout composition. However, document generation still remains a formidable challenge, hindering the achievement of flawless design outputs. Further enhancing the model’s text rendering capabilities would significantly elevate the quality of visual design, as exemplified by the need for precise text generation in storybooks, posters, and book covers shown in Figure 50(a,c,e). In addition to generating accurate Latin characters, there is a need for the model to improve visual and scene text semantic alignments (e.g., the incorrect pie chart portion in (b)), incorporate support for customizable fonts (e.g., for the chart title in (d)), and extend its capabilities to include multiple languages as in (f).
We observe that the generation artifacts still exist in certain types of generated images. Notably, the skin texture in Figure 51(a), and the human faces in the crowded scene (b), appear to be somewhat unnatural. Additionally, the model might also misunderstand certain generation settings, such as the camera setting “fast shutter speed” in (c), and the person counts in (d).
Finally, DALL-E 3 currently has limited support for extended image generation functionalities [98, 95, 42], such as editing uploaded images [61, 32, 12, 36], concept customization [79, 40, 6, 17, 83],
style transfer [28, 35, 48, 22], region control [94, 43], spatial condition [98, 7], etc. Several of these extended functionalities may ease and enhance the visual design process. For example, the incorporation of image condition input could empower designers to refine and build upon existing designs, such as the “halo armor” in Figure 51(e) or their prior designs, instead of starting from scratch. The region control [94] may allow designers to more precisely place texts and other visual elements.
6 Conclusions
We have presented DEsignBench, a novel image generation benchmark constructed for visual design scenarios. This benchmark is systematically organized with samples categorized according to the design technical capability and application scenarios. We showcase DALL-E 3’s strong capability in assisting genuine visual design applications. Leveraging the comprehensive design category topology, curated evaluation samples, a visual gallery comprising state-of-the-art T2I models, and the easily replicable GPT-4V-powered evaluator, we aspire for DEsignBench to establish a solid foundation for design-centric generative models, thereby aiding designers more effectively in real-world tasks.
Acknowledgment
We express our gratitude to all contributors from OpenAI for their technical efforts on the DALL-E 3 project [8, 67, 66]. Our sincere appreciation goes to Aditya Ramesh, Li Jing, Tim Brooks, and James Betker at OpenAI, who have provided thoughtful feedback on this work. We are profoundly thankful to Misha Bilenko for his invaluable guidance and support. We also extend heartfelt thanks to our Microsoft colleagues for their insights, with special acknowledgment to Jamie Huynh, Nguyen Bach, Ehsan Azarnasab, Faisal Ahmed, Lin Liang, Chung-Ching Lin, Ce Liu, and Zicheng Liu.
References
[1] Firefly 2. https://firefly.adobe.com/, 2023. Accessed: 2023-10-10.
[2] Ideogram. https://ideogram.ai, 2023. Accessed: 2023-10-10.
[3] Midjourney v5.2. https://www.midjourney.com/, 2023. Accessed: 2023-10-10.
[4] Josef Albers. Interaction of color. Yale University Press, 2013.
[5] Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, and Jiebo Luo. Openleaf: Open-domain interleaved image-text generation and evaluation. arXiv preprint arXiv:2310.07749, 2023.
[6] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. arXiv preprint arXiv:2305.16311, 2023.
[7] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18370–18380, 2023.
[8] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. 2023.
[9] Faber Birren. Color Psychology and Color Therapy: A Factual Study of the Influence of Color on Human Life. Martino Fine Books, 2013.
[10] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
[11] Robert Bringhurst. The elements of typographic style. Point Roberts, WA: Hartley & Marks, Publishers, 2004.
[12] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[13] Blain Brown. Cinematography: theory and practice: image making for cinematographers and directors. Taylor & Francis, 2016.
[14] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
[15] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826, 2023.
[16] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855, 2023.
[17] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186, 2023.
[18] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023.
[19] Cheng-Han Chiang and Hung-yi Lee. A closer look into automatic evaluation using large language models. EMNLP 2023 findings, 2023.
[20] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
[21] Jaemin Cho, Abhay Zala, and Mohit Bansal. Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328, 2023.
[22] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
[23] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
[24] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[25] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. arXiv preprint arXiv:2305.16381, 2023.
[26] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2022.
[27] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
[28] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[29] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023.
[30] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 2020.
[31] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462, 2023.
[32] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2022.
[33] Fan Huang, Haewoon Kwak, and Jisun An. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. arXiv preprint arXiv:2302.07736, 2023.
[34] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.
[35] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[36] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[37] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[38] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. Large-scale text-to-image generation models for visual artists’ creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 919–933, 2023.
[39] Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. Imagenhub: Standardizing the evaluation of conditional image generation models, 2023.
[40] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[41] DeepFloyd Lab. Deepfloyd if. https://github.com/deep-floyd/IF, 2023.
[42] Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023.
[43] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
[44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[45] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
[46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
[47] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[48] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[49] Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering. arXiv preprint arXiv:2212.10562, 2022.
[50] Vivian Liu and Lydia B Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23, 2022.
[51] Vivian Liu, Han Qiao, and Lydia Chilton. Opal: Multimodal image generation for news illustration. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–17, 2022.
[52] Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 3dall-e: Integrating text-to-image ai in 3d design workflows. In Proceedings of the 2023 ACM Designing Interactive Systems Conference, pages 1955–1977, 2023.
[53] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
[54] Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. Calibrating llm-based evaluator. arXiv preprint arXiv:2309.13308, 2023.
[55] Ellen Lupton. Thinking with type: A critical guide for designers, writers, editors, & students. Chronicle Books, 2014.
[56] Ellen Lupton and Jennifer Cole Phillips. Graphic design: The new basics. Princeton Architectural Press, 2008.
[57] Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu, and Xiaodong Lin. Glyphdraw: Learning to draw chinese characters in image synthesis models coherently. arXiv preprint arXiv:2303.17870, 2023.
[58] Gary Marcus, Ernest Davis, and Scott Aaronson. A very preliminary analysis of dall-e 2. arXiv preprint arXiv:2204.13807, 2022.
[59] James McCammon. Can a horse ride an astronaut? 2023.
[60] Kent McQuilkin and Anne Powers. Cinema 4D: The Artist's Project Sourcebook. Taylor & Francis, 2011.
[61] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[62] Microsoft. Bing Chat. https://www.microsoft.com/en-us/edge/features/bing-chat, 2023.
[63] Patti Mollica. Color Theory: An essential guide to color-from basic principles to practical applications, volume 53. Walter Foster, 2013.
[64] Patti Mollica. Color Theory: An Essential Guide to Color-from Basic Principles to Practical Applications. Walter Foster Publishing, 2013.
[65] OpenAI. Introducing chatgpt. 2022.
[66] OpenAI. Dall·e 3 is now available in chatgpt plus and enterprise. 2023.
[67] OpenAI. Dall·e 3 system card. 2023.
[68] OpenAI. Gpt-4 technical report, 2023.
[69] OpenAI. Gpt-4v(ision) system card. 2023.
[70] Jonas Oppenlaender. The creativity of text-to-image generation. In Proceedings of the 25th International Academic Mindtrek Conference, pages 192–202, 2022.
[71] Alan Pipes. Production for graphic designers. Laurence King Publishing, 2005.
[72] Joern Ploennigs and Markus Berger. Ai art in architecture. AI in Civil Engineering, 2(1):8, 2023.
[73] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[74] Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, and Nan Duan. Gameeval: Evaluating llms on conversational games. arXiv preprint arXiv:2308.10032, 2023.
[75] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[76] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[77] Lance J Rips. Similarity, typicality, and categorization. 1989.
[78] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[79] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[80] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[81] Timothy Samara. Making and breaking the grid: A graphic design layout workshop. Rockport Publishers, 2023.
[82] Sachith Seneviratne, Damith Senanayake, Sanka Rasnayaka, Rajith Vidanaarachchi, and Jason Thompson. Dalle-urban: Capturing the urban design expertise of large text to image transformers. In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–9. IEEE, 2022.
[83] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411, 2023.
[84] Wataru Shimoda, Daichi Haraguchi, Seiichi Uchida, and Kota Yamaguchi. Towards diverse and consistent typography generation. arXiv preprint arXiv:2309.02099, 2023.
[85] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[86] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.
[87] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[88] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[89] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
[90] Jason Van Gumster. Blender for dummies. John Wiley & Sons, 2020.
[91] Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023.
[92] Yukang Yang, Dongnan Gui, Yuhui Yuan, Haisong Ding, Han Hu, and Kai Chen. Glyphcontrol: Glyph conditional control for visual text generation. arXiv preprint arXiv:2305.18259, 2023.
[93] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023.
[94] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255, 2023.
[95] Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img: Iterative self-refinement with gpt-4v(ision) for automatic image design and generation. arXiv preprint arXiv:2310.08541, 2023.
[96] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.
[97] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[98] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[99] Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862, 2023.
[100] Xujie Zhang, Yu Sha, Michael C Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, and Xiaodan Liang. Armani: Part-level garment-text alignment for unified cross-modal fashion design. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4525–4535, 2022.
[101] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
A DEsignBench Gallery: Comparisons among SDXL, Midjourney, Ideogram, Firefly2, and DALL-E 3
Figures 52–99 visualize the images in the DEsignBench gallery, containing generation results from SDXL v1.0 [73], Midjourney v5.2 [3], Ideogram [2], Firefly 2 [1], and DALL-E 3 [8, 67]. We use the Hugging Face Diffusers library to run SDXL inference, and obtain generation results for the remaining models via their respective web interfaces.
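For reference, the sketch below shows one minimal way to run SDXL v1.0 inference with Hugging Face Diffusers. The exact sampler settings, resolution, and seeds used to produce the gallery images are not specified here, so the model identifier, step count, guidance scale, and example prompt are illustrative assumptions rather than the benchmark's configuration.

```python
import torch
from diffusers import DiffusionPipeline

# Load the SDXL v1.0 base checkpoint in half precision (assumed model id).
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe = pipe.to("cuda")

# Generate a single image for a design-style prompt (hypothetical example;
# step count and guidance scale are illustrative defaults, not benchmark settings).
prompt = "A minimalist poster for a jazz festival with bold typography"
image = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("sdxl_output.png")
```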