NExT-GPT/NExT-GPT
The first end-to-end multimodal large language model (MM-LLM) capable of perceiving and generating content in arbitrary combinations of text, image, video, and audio.
Core Features
Detailed Introduction
NExT-GPT is an end-to-end multimodal large language model (MM-LLM), presented as an oral paper at ICML 2024. Its defining feature is its "any-to-any" capability: it can perceive inputs and generate outputs in arbitrary combinations of text, image, video, and audio. It is built on existing pre-trained LLMs, multimodal encoders, and diffusion models, arranged in a three-stage architecture: multimodal encoding, LLM understanding and reasoning, and multimodal generation. The whole system is then refined through instruction tuning. The project is a step toward unified multimodal AI.
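The three stages described above can be pictured as a small pipeline. The following is only a conceptual sketch in plain Python, not the NExT-GPT codebase or its API: every name here (Input, Output, encode, reason, generate, any_to_any, DECODERS) is hypothetical, and the string manipulation merely stands in for the real encoders, LLM, and diffusion decoders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four modalities named in the description above.
MODALITIES = ("text", "image", "video", "audio")


@dataclass
class Input:
    modality: str
    payload: str  # placeholder for raw data (a prompt, a file path, ...)


@dataclass
class Output:
    modality: str
    content: str


def _check(modality: str) -> None:
    if modality not in MODALITIES:
        raise ValueError(f"unsupported modality: {modality}")


def encode(inputs: List[Input]) -> List[str]:
    """Stage 1 (hypothetical): turn each input into a token-like string."""
    for item in inputs:
        _check(item.modality)
    return [f"<{item.modality}:{item.payload}>" for item in inputs]


def reason(tokens: List[str], requested: List[str]) -> Dict[str, str]:
    """Stage 2 (hypothetical): stand-in for the LLM. It returns one
    generation instruction per requested output modality."""
    for modality in requested:
        _check(modality)
    context = " ".join(tokens)
    return {m: f"{m} response to {context}" for m in requested}


# Stage 3 (hypothetical): one decoder per modality. Real systems would
# call generative models here; these only format a string.
DECODERS: Dict[str, Callable[[str], str]] = {
    "text": lambda s: s,
    "image": lambda s: f"[image generated from: {s}]",
    "video": lambda s: f"[video generated from: {s}]",
    "audio": lambda s: f"[audio generated from: {s}]",
}


def generate(plan: Dict[str, str]) -> List[Output]:
    """Route each instruction to the decoder for its modality."""
    return [Output(m, DECODERS[m](instr)) for m, instr in plan.items()]


def any_to_any(inputs: List[Input], requested: List[str]) -> List[Output]:
    """Chain the three stages: encode, reason, generate."""
    return generate(reason(encode(inputs), requested))


if __name__ == "__main__":
    results = any_to_any(
        [Input("text", "describe this picture"), Input("image", "dog.png")],
        ["text", "audio"],
    )
    for r in results:
        print(r.modality, "->", r.content)
```

The point of the sketch is the routing: any mix of input modalities is reduced to one shared representation, and the reasoning stage decides which output decoders to invoke, which is what makes arbitrary input/output combinations possible.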