ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022
Teaser figure: full-body human images synthesized by Text2Human from textual descriptions such as:
"The lady wears a short-sleeve T-shirt with the pure color pattern and a short denim skirt."
"The man wears a short-sleeve T-shirt with the pure color pattern and short pants with the pure color pattern."
"A lady wearing a sleeveless pure-color shirt and long jeans."
"The man wears a sleeveless shirt with the pure color pattern and short pants with the pure color pattern."
"The man wears a long floral shirt and long pants with the pure color pattern."
"A lady wears a short-sleeve pure-color T-shirt and long pure-color pants. She also wears a hat."
"A man wears a short-sleeve, short denim romper."
"A lady wears a short-sleeve, short floral dress."
"A man wears a shirt and long pure-color pants with an unzipped denim outer layer."
"A lady wears a sleeveless, long denim romper."
"The guy wears a short-sleeve shirt with the pure color pattern and long denim pants."
Abstract
Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for layman users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given texts describing the shapes of clothes, the human pose is first translated into a human parsing map. 2) The final human image is then generated by providing the system with more attributes about the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level captures the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture of experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated into human images by the decoder learned together with the hierarchical codebooks. The use of mixture-of-experts allows the generated image to be conditioned on the fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed Text2Human framework can generate more diverse and realistic human images compared to state-of-the-art methods.
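To make the two-stage pipeline described above concrete, the following Python sketch traces the data flow: a pose map plus shape attributes is mapped to a parsing map, then the parsing map plus texture attributes is mapped through coarse and fine codebook indices to an image. All module names, tensor shapes, codebook sizes, and the random index sampling are illustrative assumptions standing in for the actual networks; this is a minimal sketch, not the released implementation.

# Minimal sketch of the two-stage Text2Human inference flow (assumed interfaces).
import torch
import torch.nn as nn

class PoseToParsing(nn.Module):
    """Stage 1 (assumed): pose map + shape attributes -> per-pixel parsing map."""
    def __init__(self, num_parsing_classes=24, num_shape_attrs=8):
        super().__init__()
        self.net = nn.Conv2d(3 + num_shape_attrs, num_parsing_classes, 3, padding=1)

    def forward(self, pose, shape_attrs):
        # Broadcast the shape-attribute vector over the spatial grid,
        # then predict per-pixel parsing logits and take the argmax.
        b, _, h, w = pose.shape
        attrs = shape_attrs[:, :, None, None].expand(b, -1, h, w)
        return self.net(torch.cat([pose, attrs], dim=1)).argmax(dim=1)

class HierarchicalSampler(nn.Module):
    """Stage 2 (assumed): sample coarse codebook indices conditioned on the
    parsing map and texture attributes, then predict fine-level indices."""
    def __init__(self, coarse_codes=512, fine_codes=1024):
        super().__init__()
        self.coarse_codes, self.fine_codes = coarse_codes, fine_codes

    def forward(self, parsing, texture_attrs, coarse_hw=(16, 16), fine_hw=(32, 32)):
        b = parsing.shape[0]
        # Placeholder for the diffusion-based transformer sampler with mixture
        # of experts: random indices are drawn here purely to show the data flow.
        coarse_idx = torch.randint(self.coarse_codes, (b, *coarse_hw))
        fine_idx = torch.randint(self.fine_codes, (b, *fine_hw))
        return coarse_idx, fine_idx

class CodebookDecoder(nn.Module):
    """Assumed decoder: embed multi-scale indices and decode them to RGB."""
    def __init__(self, coarse_codes=512, fine_codes=1024, dim=64):
        super().__init__()
        self.coarse_emb = nn.Embedding(coarse_codes, dim)
        self.fine_emb = nn.Embedding(fine_codes, dim)
        self.to_rgb = nn.Conv2d(2 * dim, 3, 3, padding=1)

    def forward(self, coarse_idx, fine_idx):
        coarse = self.coarse_emb(coarse_idx).permute(0, 3, 1, 2)
        fine = self.fine_emb(fine_idx).permute(0, 3, 1, 2)
        # Upsample the coarse feature map to the fine resolution and fuse.
        coarse = nn.functional.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        return self.to_rgb(torch.cat([coarse, fine], dim=1))

if __name__ == "__main__":
    pose = torch.rand(1, 3, 256, 128)      # rendered pose map (assumed format)
    shape_attrs = torch.rand(1, 8)         # e.g. sleeve length, dress vs. pants
    texture_attrs = torch.rand(1, 8)       # e.g. pure color, floral, denim

    parsing = PoseToParsing()(pose, shape_attrs)                          # step 1
    coarse_idx, fine_idx = HierarchicalSampler()(parsing, texture_attrs)  # step 2
    image = CodebookDecoder()(coarse_idx, fine_idx)
    print(parsing.shape, image.shape)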
You can select the attributes to customize the synthesized human images.
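The attribute selectors assemble a full-sentence description that is fed to Text2Human. The sketch below shows one way such a template could be filled; the exact template string and attribute options are assumptions for illustration, not the options used on this page.

# Illustrative prompt builder; template and defaults are assumed, not official.
SENTENCE = "A {person} wearing a {sleeve} {top} with {top_pattern}, and {bottom} with {bottom_pattern}."

def build_prompt(person="lady", sleeve="short-sleeve", top="T-shirt",
                 top_pattern="the pure color pattern", bottom="long pants",
                 bottom_pattern="the pure color pattern"):
    """Fill the fixed sentence template with the selected attributes."""
    return SENTENCE.format(person=person, sleeve=sleeve, top=top,
                           top_pattern=top_pattern, bottom=bottom,
                           bottom_pattern=bottom_pattern)

if __name__ == "__main__":
    print(build_prompt())
    print(build_prompt(person="man", top="shirt", bottom="long denim pants",
                       bottom_pattern="the denim pattern"))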
DeepFashion-MultiModal
DeepFashion-MultiModal is a large-scale, high-quality human dataset with rich multi-modal annotations. It has the following properties (a loading sketch follows the list):
1. It contains 44,096 high-resolution human images, including 12,701 full-body human images.
2. For each full-body image, we manually annotate the human parsing labels of 24 classes.
3. For each full-body image, we manually annotate the keypoints.
4. We extract DensePose for each human image.
5. Each image is manually annotated with attributes for both clothes shapes and textures.
6. We provide a textual description for each image.
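A minimal loading sketch for one annotated sample is given below. The directory layout and file names used here are hypothetical placeholders; refer to the dataset release for the actual structure.

# Hypothetical layout: images/, parsing/ (24-class label maps), captions/.
from pathlib import Path
from PIL import Image

def load_sample(root, name):
    """Load one image with its (assumed) parsing map and textual description."""
    root = Path(root)
    image = Image.open(root / "images" / f"{name}.jpg").convert("RGB")
    parsing = Image.open(root / "parsing" / f"{name}.png")   # 24-class label map
    caption = (root / "captions" / f"{name}.txt").read_text().strip()
    return image, parsing, caption

if __name__ == "__main__":
    img, seg, text = load_sample("DeepFashion-MultiModal", "sample_0001")
    print(img.size, seg.size, text)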
Bibtex
@article{jiang2022text2human,
title={Text2Human: Text-Driven Controllable Human Image Generation},
author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
journal={ACM Transactions on Graphics (TOG)},
volume={41},
number={4},
articleno={162},
pages={1--11},
year={2022},
publisher={ACM New York, NY, USA},
doi={10.1145/3528223.3530104}
}
More Research
EVA3D proposes a compositional framework to generate 3D humans from 2D image collections.
StyleGAN-Human investigates the "data engineering" in unconditional human generation.
Talk-to-Edit proposes a StyleGAN-based method and a multi-modal dataset for dialog-based facial editing.
CelebA-Dialog is a large-scale visual-language face dataset with rich fine-grained labels and textual descriptions.