Tags: #multimodal-ai
pipecat-ai/pipecat
An open-source Python framework for building real-time, voice-first, and multimodal conversational AI agents with ultra-low latency.
sgl-project/sglang
A high-performance serving framework that accelerates inference for large language models and vision-language models.
ModelEngine-Group/nexent
Nexent is a zero-code platform for auto-generating production-grade AI agents, unifying tools, skills, memory, and orchestration with built-in controls.
mudler/LocalAI
An open-source AI engine that allows running various AI models (LLMs, vision, voice, image, video) locally on any hardware, including CPU-only, with drop-in API compatibility for commercial services.
lance-format/lance
An open lakehouse format designed for multimodal AI, offering high-performance vector search, lightning-fast random access, and robust data versioning capabilities.
OpenGVLab/Ask-Anything
An AI project extending Large Language Models with video understanding capabilities, enabling conversational AI to process and respond to queries about video content.
lancedb/lancedb
An open-source, embedded retrieval library and multimodal AI lakehouse designed for fast, scalable vector search and data management in AI/ML applications.
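Vector search of the kind lancedb provides boils down to ranking stored embeddings by similarity to a query embedding. The sketch below illustrates that idea with plain Python and cosine similarity; it is a conceptual toy, not lancedb's actual API (which works over persisted tables and approximate indexes).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, vectors, k=1):
    """Return indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Toy "index" of three 3-d embeddings; the query is closest to the first.
index = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k([0.9, 0.1, 0.0], index, k=2))  # → [0, 2]
```

A real vector store replaces the linear scan above with an approximate-nearest-neighbor index so search stays fast at millions of rows.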
towhee-io/towhee
Towhee is a cutting-edge framework designed to simplify and accelerate neural data processing pipelines, particularly for unstructured multimodal data and LLM orchestration.
NVIDIA-NeMo/Curator
A GPU-accelerated, scalable toolkit for multimodal data preprocessing and curation, designed to train better AI models faster.
roboflow/maestro
A streamlined tool to accelerate the fine-tuning process for multimodal models like Florence-2, PaliGemma 2, and Qwen2.5-VL.
LianjiaTech/BELLE
BELLE is an open-source project dedicated to developing Chinese conversational large language models and making LLMs accessible to everyone.
PKU-Alignment/align-anything
A modular framework for aligning any-modality large models with human intentions and values using diverse fine-tuning and reinforcement learning methods.
SamurAIGPT/Generative-Media-Skills
A multimodal toolset enabling AI agents to generate, edit, and display professional-grade images, videos, and audio using a CLI-powered architecture.
vllm-project/vllm-omni
vLLM-Omni is an efficient, flexible, and easy-to-use framework extending vLLM to serve omni-modality models (text, image, video, audio) with high throughput and an OpenAI-compatible API.
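Because vLLM-Omni exposes an OpenAI-compatible API, clients can send standard chat-completions requests. The sketch below builds such a request body with only the Python standard library; the model name and image URL are placeholder assumptions for illustration, and in practice the JSON would be POSTed to the server's `/v1/chat/completions` endpoint.

```python
import json

def build_chat_request(model, prompt, image_url=None):
    """Build an OpenAI-style chat-completions payload as a JSON string.

    Against an OpenAI-compatible server, this body would be POSTed
    to /v1/chat/completions.
    """
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Multimodal inputs follow the OpenAI content-parts convention.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    })

# Example: a text+image request for a hypothetical served model.
body = build_chat_request(
    "qwen2.5-omni",  # placeholder model name, not from the project's docs
    "What is shown in this image?",
    image_url="http://example.com/cat.png",
)
```

Keeping the wire format OpenAI-compatible means existing SDKs and tooling work unchanged; only the base URL needs to point at the local server.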
pixeltable/pixeltable
A declarative, transactional Python library for building multimodal AI applications with incremental data storage, transformation, indexing, and orchestration.
EvolvingLMMs-Lab/lmms-eval
A unified, reproducible, and efficient multimodal evaluation toolkit for large language models across text, image, video, and audio tasks.
fikrikarim/parlor
Parlor is a real-time multimodal AI that enables natural voice and vision conversations, running entirely on-device.
facebookresearch/mmf
A modular PyTorch-based framework from Facebook AI Research for state-of-the-art vision and language multimodal AI research.
bytedance/UI-TARS-desktop
UI-TARS Desktop is an open-source application that provides a native GUI Agent, enabling AI to control local and remote computers and browsers through the UI-TARS model.
chenking2020/FindTheChatGPTer
A curated directory of open-source alternatives to ChatGPT and GPT-4, covering both text-only and multimodal large language models.
microsoft/unilm
A comprehensive research initiative by Microsoft focusing on large-scale self-supervised pre-training to develop advanced foundation models across diverse tasks, languages, and modalities.
InternLM/InternLM-XComposer
A comprehensive multimodal AI system specializing in long-term streaming video and audio interactions, offering advanced vision-language understanding and composition.
jina-ai/serve
A cloud-native framework for building, deploying, and scaling multimodal AI applications and services with gRPC, HTTP, and WebSockets.
deepseek-ai/Janus
Janus-Series is a family of unified autoregressive multimodal AI models designed for both understanding and generating content across various modalities, featuring a novel decoupled visual encoding strategy.
NexaAI/nexa-sdk
A high-performance local inference framework for running frontier multimodal AI models on various devices with minimal energy consumption.
OpenGVLab/InternVL
A pioneering open-source multimodal AI model family designed to serve as a high-performance alternative to commercial models like GPT-4o and GPT-5.