Tags: #any-to-any
Multimodal Large Language Model
3.6k
NExT-GPT/NExT-GPT
The first end-to-end multimodal large language model (MM-LLM) capable of perceiving and generating content in arbitrary combinations of text, image, video, and audio.