NExT-GPT/NExT-GPT
The first end-to-end multimodal large language model (MM-LLM) that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, and audio.
Core Features
Detailed Introduction
NExT-GPT is a pioneering end-to-end multimodal large language model (MM-LLM) capable of processing and generating content in any combination of text, image, video, and audio. It tackles the challenge of unified multimodal understanding and generation by combining established multimodal encoders for perception, a powerful LLM core for reasoning, and diffusion models for content generation. The accompanying paper was accepted as an oral at ICML 2024, and this repository provides the code, data, and model weights, enabling researchers and developers to explore and build upon its any-to-any multimodal capabilities.
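The encoder → LLM → decoder flow described above can be sketched as a toy pipeline. This is purely illustrative and not NExT-GPT's real API: every class and function name below (`Signal`, `encode`, `llm_core`, `decode`, `any_to_any`) is a hypothetical stand-in. In the actual system, the encoders are pretrained multimodal encoders, the core is an LLM, and the decoders are diffusion models.

```python
# Illustrative any-to-any pipeline sketch (hypothetical names, not the real API):
# modality encoders map each input into a shared token space, an LLM core
# reasons over the fused tokens, and per-modality decoders render the outputs.
from dataclasses import dataclass

MODALITIES = ("text", "image", "video", "audio")

@dataclass
class Signal:
    modality: str
    payload: str  # stand-in for raw bytes / pixels / waveform

def encode(signal: Signal) -> str:
    """Hypothetical modality encoder: projects any input into shared 'tokens'."""
    assert signal.modality in MODALITIES, f"unknown modality: {signal.modality}"
    return f"<{signal.modality}-tokens:{signal.payload}>"

def llm_core(tokens: list[str], requested: tuple[str, ...]) -> dict[str, str]:
    """Hypothetical LLM core: reasons over the fused token sequence and emits
    one generation instruction per requested output modality."""
    fused = " ".join(tokens)
    return {m: f"generate {m} conditioned on {fused}" for m in requested}

def decode(modality: str, instruction: str) -> str:
    """Hypothetical decoder (a diffusion model in the real system)."""
    return f"[{modality} output <- {instruction}]"

def any_to_any(inputs: list[Signal], outputs: tuple[str, ...]) -> dict[str, str]:
    """End-to-end: encode all inputs, reason once, decode each requested modality."""
    tokens = [encode(s) for s in inputs]
    plans = llm_core(tokens, outputs)
    return {m: decode(m, plan) for m, plan in plans.items()}

if __name__ == "__main__":
    result = any_to_any(
        [Signal("text", "a cat playing piano"), Signal("audio", "piano.wav")],
        outputs=("image", "video"),
    )
    for modality, out in result.items():
        print(modality, "->", out)
```

The key design point this sketch mirrors is that a single reasoning core sits between modality-agnostic encoders and modality-specific decoders, which is what makes arbitrary input/output combinations possible without training a separate model per modality pair.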