An Exploratory Study on Multi-modal Generative AI in AR Storytelling

Submitted to CHI 2025

Authorship: First Author

Abstract

Storytelling in AR has gained significant attention due to the multi-modality and interactivity of the platform. Prior research has made substantial progress in deploying multi-modal content for AR Storytelling. However, generating multi-modal content for AR Storytelling still requires expertise and considerable time to achieve high quality and to accurately convey the narrator’s intention. We therefore conducted an exploratory study to investigate the impact of AI-generated multi-modal content on AR Storytelling. Based on an analysis of 223 videos of storytelling in AR, we identified a design space for multi-modal AR Storytelling. Drawing on this design space, we developed a testbed that supports both the modalities of content generation and the atomic elements of AR Storytelling. Through two studies with N=30 experienced storytellers and live presenters, we report participants’ preferences for the modalities used to augment each element, qualitative evaluations of their interactions with AI when generating content, and the overall quality of the AI-generated content in AR Storytelling. We further discuss design considerations for future AR Storytelling systems based on our results.

Contributions

Figure 1: The testbed workflow. 1) Content Generator interface: the user employs the content generator to create AI-generated content (AIGC) for AR Storytelling, with five supported modalities for the selected sentence. 2) AR interface: the user can view the text corresponding to the spoken words. Based on the transcribed speech, the user can interact with the AIGC using hand gestures.
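To make the trigger mechanism in the AR interface concrete, below is a minimal sketch of how spoken words could be matched against the keywords saved in the Content Generator. The data structure and function names (GeneratedAsset, match_triggers) are illustrative assumptions, not code from the testbed.

```python
# Minimal sketch of the keyword-trigger loop described in Figure 1.
# Assumes a transcript stream and a keyword -> asset map prepared in the
# Content Generator; names below are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass
class GeneratedAsset:
    modality: str  # "text", "image", "audio", "video", or "3d"
    uri: str       # location of the saved AI-generated content

def match_triggers(transcript_words: Iterable[str],
                   triggers: Dict[str, GeneratedAsset]) -> List[GeneratedAsset]:
    """Return the assets whose trigger keyword appears in the spoken words."""
    spoken = {w.lower().strip(".,!?") for w in transcript_words}
    return [asset for keyword, asset in triggers.items()
            if keyword.lower() in spoken]

# Example: the narrator saved an image under the keyword "dragon".
triggers = {"dragon": GeneratedAsset("image", "assets/dragon.png")}
print(match_triggers("A dragon appeared over the castle".split(), triggers))
```

The AR layer would call this on each new transcript segment and display any matched assets to the audience.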

Figure 2: The testbed. (a) The Multi-Modal Content Generator interface. (a-i) The textual input section: the user can import a story text file for storytelling by entering the story title. (a-ii) The loaded story: the user can view the loaded story and select a sentence for augmentation by dragging over it with the mouse cursor. (a-iii) The highlight (top) and save (bottom) buttons: the top button highlights the selected portion in yellow, indicating that content will be generated for this part; the bottom button saves the highlighted text to the backend for output generation. (a-iv) The modality selection: the user can choose the desired modality for content generation. (a-v) The output: the testbed displays the generated content, and the user can save it under a unique keyword, which acts as a trigger during the storytelling process. (a-vi) The user’s quality evaluation of the content. (b) The AR interface. (b-i) Image: the user can interact with the AIGC using hand landmarks; the content appears at the tip of the user’s index finger. (b-ii) The Speech-to-Text box: the interface displays the narrator’s words, which serve as triggers for the corresponding content. (b-iii) Text: this modality shows the main keyword and detailed information. (b-iv) Audio: the icon indicates that the corresponding audio is playing. (b-v) Video. (b-vi) 3D content.
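The fingertip anchoring in (b-i) could be implemented with an off-the-shelf hand tracker. The paper does not specify a tracking library; the sketch below assumes MediaPipe Hands and OpenCV as one possible choice, converting the normalized index-fingertip landmark to pixel coordinates where the content could be drawn.

```python
# Sketch of anchoring generated content to the index fingertip (Figure 2, b-i).
# MediaPipe Hands is an assumed implementation choice, not the testbed's code.

import cv2
import mediapipe as mp

def index_fingertip_px(frame_bgr):
    """Return the (x, y) pixel position of the index fingertip, or None."""
    h, w = frame_bgr.shape[:2]
    with mp.solutions.hands.Hands(static_image_mode=True,
                                  max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    tip = result.multi_hand_landmarks[0].landmark[
        mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP]
    return int(tip.x * w), int(tip.y * h)  # normalized -> pixel coordinates

# The AR layer would then render the triggered asset at this position each frame.
```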

  • Summarized a design space for multi-modal AR Storytelling and proposed a cognitive model for understanding the roles of authors and audiences in the storytelling process.
  • Developed an experimental AR Storytelling testbed (Figure 2) with AI-generated multi-modal content, integrating multiple state-of-the-art generative AI models (see the sketch after this list).
  • Investigated the impact of AI-generated multi-modal content on the creation and perception of AR Storytelling through an exploratory study.
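As referenced in the second contribution, the sketch below illustrates one way a highlighted sentence could be routed to a modality-specific generator and registered under its trigger keyword. The generator callables are placeholders standing in for text, image, audio, video, and 3D models, which are not named here.

```python
# Illustrative routing of a selected sentence to a modality-specific generator.
# The concrete models and APIs are placeholders, not the testbed's integrations.

from typing import Callable, Dict

def make_placeholder(modality: str) -> Callable[[str], str]:
    # Stands in for a real generative-model call; returns a fake asset URI.
    return lambda sentence: f"generated/{modality}/{hash(sentence) & 0xffff}.bin"

GENERATORS: Dict[str, Callable[[str], str]] = {
    m: make_placeholder(m) for m in ("text", "image", "audio", "video", "3d")
}

def generate(sentence: str, modality: str, keyword: str) -> dict:
    """Generate content for a highlighted sentence and register its trigger keyword."""
    uri = GENERATORS[modality](sentence)
    return {"keyword": keyword, "modality": modality, "uri": uri}

print(generate("A dragon appeared over the castle.", "image", "dragon"))
```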
Hyungjun Doh
Master’s Student

My research interests are Human-AI interaction and its practical applications, with a specific focus on Extended Reality, Task Guidance Systems, and AI-infused interfaces.