LLM Optimization Toolkit

ModelCloud/GPTQModel

A toolkit for quantizing (compressing) Large Language Models (LLMs) with hardware acceleration across various GPUs and CPUs, integrating with popular inference frameworks.

Core Features

Comprehensive LLM quantization: Supports various methods like GPTQ, AWQ, ParoQuant, GGUF, FP8, EXL3, and 1-bit quantization.
Broad hardware acceleration: Leverages NVIDIA CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs for optimized inference.
Seamless integration: Works with popular LLM inference frameworks such as Hugging Face Transformers, vLLM, and SGLang.
Optimized performance: Features JIT-compiled CUDA kernels, faster CPU kernels (AMX, AVX2, AVX512), and specialized MoE expert quantization.
Extensive model support: Continuously adds support for new LLM architectures like GLM, Gemma, MiniCPM, Qwen, and more.
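To make the quantization idea behind these methods concrete, here is a toy sketch of group-wise 4-bit "absmax" quantization, the basic storage scheme that methods such as GPTQ build on: weights are split into fixed-size groups, each stored as 4-bit integer codes plus one float scale per group. This is an illustration only, not GPTQModel's actual algorithm.

```python
import numpy as np

def quantize_groupwise_int4(w: np.ndarray, group_size: int = 128):
    """Quantize a 1-D weight vector to signed 4-bit codes, one scale per group."""
    groups = w.reshape(-1, group_size)
    # Symmetric int4 uses codes in [-7, 7]; scale maps the group's max |w| to 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate fp32 weights from codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
codes, scales = quantize_groupwise_int4(w)
w_hat = dequantize(codes, scales)
max_err = float(np.abs(w - w_hat).max())
```

The rounding error per weight is bounded by half a scale step, which is why smaller group sizes (more scales) trade a little extra storage for better accuracy.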

Quick Start

pip install gptqmodel
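The package installed above exposes a Python API for quantizing a model. The sketch below shows the typical quantize-then-save flow; the class and method names (`GPTQModel.load`, `QuantizeConfig`, `quantize`, `save`) are assumptions based on the project's examples, so check the repository for the current API before relying on them.

```python
def quantize_to_4bit(model_id: str, out_dir: str, calibration_dataset):
    """Hedged sketch of a GPTQ quantization run with GPTQModel.

    The names used here are assumed from the project's examples,
    not verified against a specific release.
    """
    # Deferred import: requires `pip install gptqmodel` and a GPU/CPU backend.
    from gptqmodel import GPTQModel, QuantizeConfig

    config = QuantizeConfig(bits=4, group_size=128)  # 4-bit weights, 128-wide groups
    model = GPTQModel.load(model_id, config)         # load the full-precision base model
    model.quantize(calibration_dataset)              # calibrate and quantize layer by layer
    model.save(out_dir)                              # write the quantized checkpoint
```

The calibration dataset is a small sample of representative text; GPTQ-style methods use it to choose quantized weights that minimize layer output error rather than raw weight error.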

Detailed Introduction

GPTQModel is an open-source toolkit for optimizing Large Language Models (LLMs) through quantization and compression. By significantly reducing model size and memory footprint, it enables more efficient deployment and faster inference on a wide range of hardware, from high-end GPUs to consumer-grade CPUs. The project integrates with leading LLM ecosystems such as Hugging Face Transformers, vLLM, and SGLang, giving developers flexible tools to make large models more accessible and cost-effective. Development is active, with new quantization methods and hardware optimizations added regularly.
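The memory savings are easy to estimate back-of-the-envelope. Assuming a 7B-parameter model, 16-bit floats versus 4-bit codes with one fp16 scale per 128-weight group (illustrative numbers, not a benchmark):

```python
params = 7_000_000_000

# fp16: 2 bytes per weight
fp16_gib = params * 2 / 2**30

# int4 with group size 128: 4 bits per weight plus one fp16 scale per group
int4_bits = params * 4 + (params // 128) * 16
int4_gib = int4_bits / 8 / 2**30

print(f"fp16: {fp16_gib:.1f} GiB, int4 (g=128): {int4_gib:.2f} GiB")
```

The result is roughly a 3.9x reduction in weight storage, which is what lets a model that needs a data-center GPU in fp16 fit on a single consumer card or in CPU RAM once quantized.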
