AI Data Curation Toolkit
1.5k 2026-04-18
NVIDIA-NeMo/Curator
A GPU-accelerated, scalable toolkit for multimodal data preprocessing and curation, designed to train better AI models faster.
Core Features
Scalable, GPU-accelerated data processing
Multimodal support (Text, Image, Video, Audio)
Advanced data quality filtering and deduplication
Integration with NVIDIA NeMo ecosystem
Modular pipelines for various curation tasks
Quick Start
uv pip install "nemo-curator[text_cuda12]"; python tutorials/quickstart.py
Detailed Introduction
NVIDIA NeMo Curator is a GPU-accelerated data curation toolkit that helps developers and researchers prepare high-quality datasets for training AI models, including large language models (LLMs), vision-language models (VLMs), and world foundation models (WFMs). As part of the NVIDIA NeMo suite, it offers modular pipelines for text, image, video, and audio data, supporting tasks such as deduplication, quality filtering, and semantic analysis. Because it scales from laptops to multi-node clusters, the same toolkit can serve both small experiments and large-scale dataset preparation.
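To make the deduplication and quality-filtering tasks mentioned above concrete, here is a minimal plain-Python sketch of the two ideas chained as a pipeline. It does not use NeMo Curator's API: the function names (normalize, exact_dedup, word_count_filter, curate) and the min_words threshold are hypothetical and chosen only for illustration, and it runs on the CPU with no GPU acceleration.

```python
import hashlib
from typing import Iterable, Iterator, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def word_count_filter(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Keep documents with at least min_words whitespace-separated words."""
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def curate(docs: Iterable[str], min_words: int = 5) -> List[str]:
    """Chain the stages: deduplicate first, then apply the quality filter."""
    return list(word_count_filter(exact_dedup(docs), min_words))


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog.",  # near-identical copy
        "Too short.",
        "GPU acceleration speeds up large scale data curation.",
    ]
    for doc in curate(sample):
        print(doc)
```

Each stage is a generator that consumes the previous one, which mirrors the modular pipeline idea in miniature: stages can be reordered or swapped without changing the others. A real curation toolkit would replace these stages with distributed, GPU-backed implementations.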