Tags: #model-serving
bentoml/BentoML
A Python library for building and deploying high-performance AI model inference APIs and multi-model serving systems with ease.
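A minimal sketch of a BentoML (1.2+) service, using its `@bentoml.service` and `@bentoml.api` decorators; the echo logic stands in for a real model:

```python
import bentoml

@bentoml.service
class EchoService:
    @bentoml.api
    def predict(self, text: str) -> str:
        # A real service would run model inference here.
        return text.upper()
```

Running `bentoml serve` against the file containing the class typically exposes `predict` as an HTTP endpoint.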
ollama/ollama
Easily run, manage, and interact with open-source large language models locally on your machine.
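A quick sketch of calling Ollama's local REST API, assuming the daemon is running on its default port (11434) and the model has already been pulled (e.g. `ollama pull llama3`):

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return a single JSON object instead of a stream
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```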
xorbitsai/inference
A unified, production-ready inference API for effortlessly deploying and serving open-source language, speech, and multimodal AI models across various environments.
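Xinference serves an OpenAI-compatible endpoint, so a standard `openai` client can talk to it; in this sketch the server address (9997 is the documented local default) and model name are assumptions about a model you have already launched:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-instruct",  # illustrative; use whatever model you launched
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```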
modular/modular
A unified, open platform for accelerating AI model serving and scaling GenAI deployments with industry-leading performance across various hardware.
containers/ramalama
RamaLama simplifies serving AI models from any source, locally and in production, by leveraging familiar container patterns and eliminating complex host system configuration.

clearml/clearml
ClearML is an open-source MLOps/LLMOps solution that streamlines the entire AI workflow, from experiment management and data versioning to pipeline orchestration and model serving.
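A minimal experiment-tracking sketch with the `clearml` SDK; the project name, task name, and logged values are arbitrary placeholders:

```python
from clearml import Task

task = Task.init(project_name="demo", task_name="baseline")
task.connect({"lr": 0.001, "epochs": 10})  # log hyperparameters
# ... training code would go here ...
task.get_logger().report_scalar("loss", "train", value=0.42, iteration=1)
```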
SeldonIO/seldon-core
An MLOps and LLMOps framework for deploying, managing, and scaling modular, data-centric AI applications and models on Kubernetes.
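Models deployed with Seldon are typically reached over the Open Inference Protocol (V2); this is a hedged sketch where the host, model name, and tensor shape are assumptions about a particular deployment:

```python
import requests

payload = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32",
         "data": [5.1, 3.5, 1.4, 0.2]}
    ]
}
resp = requests.post(
    "http://localhost:9000/v2/models/iris/infer", json=payload, timeout=30
)
print(resp.json()["outputs"])
```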
jundot/omlx
An optimized LLM inference server for Apple Silicon, featuring continuous batching, tiered KV caching, and macOS menu bar management for efficient local AI.
predibase/lorax
A multi-LoRA inference server designed to efficiently serve thousands of fine-tuned large language models on a single GPU, drastically cutting serving costs while maintaining high throughput and low latency.
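LoRAX routes each request to a LoRA adapter via the `adapter_id` parameter on its `/generate` endpoint; this sketch assumes a server running on the default port (8080) and uses an adapter id from the project's own examples:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "[INST] What is deep learning? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
        },
    },
)
print(resp.json()["generated_text"])
```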
vllm-project/vllm-omni
vLLM-Omni is an efficient, flexible, and easy-to-use framework extending vLLM to serve omni-modality models (text, image, video, audio) with high throughput and an OpenAI-compatible API.
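Since the project advertises an OpenAI-compatible API, a standard `openai` client call should work against it; the port and model name below are assumptions for a locally running server, not confirmed defaults:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",  # illustrative omni-modality model
    messages=[{"role": "user", "content": "Describe this image in one line."}],
)
print(resp.choices[0].message.content)
```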