Tags: #multimodal
xorbitsai/inference
A unified, production-ready inference API for deploying and serving open-source language, speech, and multimodal AI models across a range of infrastructure.
OpenDCAI/Paper2Any
An AI-driven platform that transforms research papers, text, or topics into editable scientific figures, technical diagrams, and presentation slides with universal file support.
FlagOpen/FlagEmbedding
FlagEmbedding (BGE) is a comprehensive toolkit offering state-of-the-art embedding models for efficient search, Retrieval-Augmented Generation (RAG), and various multimodal AI applications.
xlang-ai/OSWorld
OSWorld is a benchmark and environment for evaluating multimodal AI agents on open-ended tasks within real computer operating systems.
datawhalechina/all-in-rag
A comprehensive, full-stack guide to Retrieval-Augmented Generation (RAG) technology for large language model application development, covering theory, practice, and engineering best practices.
aiming-lab/SimpleMem
SimpleMem offers an efficient, lifelong, and multimodal memory solution for LLM agents, featuring semantic lossless compression for diverse data types.
WangRongsheng/awesome-LLM-resources
A comprehensive, continuously updated collection of the best resources for Large Language Models (LLMs), covering various aspects from data processing to advanced applications.
vllm-project/vllm-omni
A framework for efficient, fast, and cheap serving of omni-modality (text, image, video, audio) AI models.
Eventual-Inc/Daft
A high-performance data engine for AI and multimodal workloads, processing diverse data types at scale with Python and Rust.
2U1/Qwen-VL-Series-Finetune
An open-source implementation for efficiently fine-tuning Alibaba Cloud's Qwen-VL series of multimodal large language models using HuggingFace and Liger-Kernel.
morphik-org/morphik-core
A comprehensive AI-native toolset for accurate document search and storage, designed to integrate complex context from visually rich and multimodal data into AI applications.
bytedance/UI-TARS-desktop
An open-source desktop application providing a native GUI Agent for human-like task completion through multimodal AI, enabling local and remote computer/browser automation.
emcf/thepipe
A Python library for extracting clean markdown, multimodal media, and structured data from complex documents using vision-language models.
PySpur-Dev/pyspur
PySpur is a visual playground designed to accelerate the iteration, debugging, and deployment of AI agents, helping engineers overcome common challenges like prompt hell and workflow blind spots.
atfortes/Awesome-LLM-Reasoning
A meticulously curated collection of academic papers and resources focused on enhancing and understanding the reasoning abilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
OpenGVLab/InternVL
A pioneering open-source multimodal large language model family aiming to match or exceed commercial models like GPT-4o/GPT-5 in performance.
souzatharsis/podcastfy
An open-source Python package that transforms multimodal content into captivating multilingual audio conversations using GenAI, serving as a programmatic alternative to tools like NotebookLM.