Tags: #multimodal-ai
pipecat-ai/pipecat
An open-source Python framework for building real-time, voice-first, and multimodal conversational AI agents with composable pipelines.
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models, optimizing inference throughput and latency.
ModelEngine-Group/nexent
Nexent is a zero-code platform for auto-generating production-grade AI agents, unifying tools, skills, memory, and orchestration with built-in controls.
TEN-framework/ten-framework
An open-source framework for building real-time, multimodal conversational AI agents with advanced features like voice assistance, diarization, and lip-sync.
lance-format/lance
An open lakehouse format for multimodal AI, offering high-performance random access, vector indexing, and data versioning.
OpenGVLab/Ask-Anything
An advanced multimodal AI chatbot framework that enables conversational interaction and deep understanding of video and image content, integrating various large language models.
lancedb/lancedb
An open-source, developer-friendly embedded retrieval library and multimodal AI lakehouse for fast, scalable vector search and data management.
NVIDIA-NeMo/Curator
A GPU-accelerated, scalable toolkit for multimodal data preprocessing and curation, designed to train better AI models faster.
roboflow/maestro
A streamlined tool to accelerate the fine-tuning of popular multimodal models like Florence-2, PaliGemma 2, and Qwen2.5-VL.
LianjiaTech/BELLE
BELLE is an open-source project for developing Chinese conversational large language models, aiming to make LLMs accessible to everyone.
SamurAIGPT/Generative-Media-Skills
Provides a multimodal toolset for AI agents to generate, edit, and display professional-grade images, videos, and audio using a CLI-powered architecture.
pixeltable/pixeltable
A declarative, transactional Python library for building multimodal AI applications with incremental data storage, transformation, indexing, and orchestration.
EvolvingLMMs-Lab/lmms-eval
A unified, reproducible, and efficient evaluation toolkit for large multimodal models across text, image, video, and audio tasks.
fikrikarim/parlor
Parlor is an on-device, real-time multimodal AI that enables natural voice and vision conversations, running entirely on your local machine.
facebookresearch/mmf
A modular and scalable PyTorch-based framework for state-of-the-art vision and language multimodal research from Facebook AI Research.
OpenGVLab/InternVideo
A series of video foundation models and large-scale datasets designed for comprehensive multimodal video understanding and generation.
chenking2020/FindTheChatGPTer
A curated directory of open-source alternatives to ChatGPT and GPT-4, covering both text-only and multimodal large language models.
microsoft/unilm
A comprehensive research hub for large-scale self-supervised pre-training of foundation models across diverse tasks, languages, and modalities.
InternLM/InternLM-XComposer
A comprehensive multimodal AI system specializing in long-term streaming video and audio interactions, offering advanced vision-language understanding and composition.
jina-ai/serve
A cloud-native framework for building and deploying high-performance multimodal AI applications with built-in scaling and orchestration.
deepseek-ai/Janus
Janus-Series is a family of unified autoregressive multimodal AI models designed for both understanding and generating content across various modalities, featuring a novel decoupled visual encoding strategy.
BIT-DataLab/Edit-Banana
Edit Banana transforms static, uneditable content like images of diagrams into fully manipulatable and editable assets using advanced AI.
open-mmlab/mmpretrain
MMPreTrain is an OpenMMLab project providing an open-source, PyTorch-based toolbox for pre-training and benchmarking computer vision and multimodal models.
OpenGVLab/InternGPT
InternGPT is an open-source, pointing-language-driven visual interactive system that improves the efficiency and accuracy of user communication with AI models like ChatGPT in complex vision-centric tasks.
haotian-liu/LLaVA
An open-source large language and vision assistant (LLaVA) built towards GPT-4V-level multimodal capabilities through visual instruction tuning.