Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Golden, Alicia; Hsia, Samuel; Sun, Fei; Acun, Bilge; Hosmer, Basil; Lee, Yejin; DeVito, Zachary; Johnson, Jeff; Wei, Gu-Yeon; Brooks, David; Wu, Carole-Jean

arXiv:2312.14385 (cs)

[Submitted on 22 Dec 2023 (v1), last revised 6 May 2024 (this version, v2)]

Abstract: As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into two categories: Diffusion-based and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that, after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill phase of LLM inference and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models, which resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights into new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe that sequence length can vary by up to 4x in Diffusion model inference. We additionally observe that temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.
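The abstract does not spell out how sequence length is defined for TTV models, so the following is only a minimal sketch under a stated assumption: attention is factorized into a spatial pass (each frame attends over its own latent grid) and a temporal pass (each grid position attends across frames), as is common in diffusion-based video models. The function names, the `AttentionShape` type, and the example dimensions are hypothetical and are not taken from the paper. The sketch only shows how batch size and sequence length trade places between the two passes; it does not model runtime or reproduce any measurement reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    batch: int    # number of independent attention problems
    seq_len: int  # tokens attended over within each problem


def spatial_attention_shape(batch, frames, height, width):
    """Assumed layout: every frame attends over its own H*W latent positions."""
    return AttentionShape(batch=batch * frames, seq_len=height * width)


def temporal_attention_shape(batch, frames, height, width):
    """Assumed layout: every latent position attends across the F frames."""
    return AttentionShape(batch=batch * height * width, seq_len=frames)


def score_matrix_elements(shape):
    """Entries in the Q @ K^T score matrices, per attention head."""
    return shape.batch * shape.seq_len ** 2


if __name__ == "__main__":
    # Hypothetical latent video: 1 clip, 16 frames, 32x32 latent grid.
    dims = (1, 16, 32, 32)
    for name, fn in (("spatial", spatial_attention_shape),
                     ("temporal", temporal_attention_shape)):
        shape = fn(*dims)
        print(name, shape, score_matrix_elements(shape))
```

Under these assumptions the temporal pass has a short sequence (the frame count) but a very large batch, while the spatial pass has the opposite profile, which is one reason the two may stress hardware differently.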

Comments:	Published at 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2312.14385 [cs.DC]
	(or arXiv:2312.14385v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2312.14385

Submission history

From: Alicia Golden
[v1] Fri, 22 Dec 2023 02:21:26 UTC (1,117 KB)
[v2] Mon, 6 May 2024 03:54:58 UTC (1,298 KB)
