Authors: Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zhiyao Xu, Michael R. Lyu
Abstract: Converting webpage designs into functional UI code is a critical step in
building websites, but it can be labor-intensive and time-consuming. To automate
this design-to-code transformation process, various methods using
learning-based networks and multi-modal large language models (MLLMs) have been
proposed. However, these studies were evaluated only on a narrow range of
static web pages and ignored dynamic interaction elements, making them less
practical for real-world website deployment.
To fill this gap, we present the first systematic investigation of MLLMs
in generating interactive webpages. Specifically, we first formulate the
Interaction-to-Code task and build the Interaction2Code benchmark that contains
97 unique web pages and 213 distinct interactions, spanning 15 webpage types
and 30 interaction categories. We then conduct comprehensive experiments on
three state-of-the-art (SOTA) MLLMs using both automatic metrics and human
evaluations, and summarize the results as six findings. Our experimental
results highlight the limitations of MLLMs in generating fine-grained
interactive features and managing interactions with complex transformations and
subtle visual modifications. We further analyze failure cases and their
underlying causes, identifying 10 common failure types and assessing their
severity. Additionally, our findings reveal three critical influencing factors,
i.e., prompts, visual saliency, and textual descriptions, that can enhance the
interaction generation performance of MLLMs. Based on these findings, we derive
implications for researchers and developers, providing a foundation for future
advancements in this field. Datasets and source code are available at
https://github.com/WebPAI/Interaction2Code.
Source: http://arxiv.org/abs/2411.03292v1