Data Processing CLI Tool
4.4k 2026-04-18
rom1504/img2dataset
An efficient command-line tool to download, resize, and package vast collections of image URLs into ready-to-use datasets for machine learning.
Core Features
High-speed download and processing of millions of image URLs.
Automated image resizing and packaging into various formats (e.g., WebDataset).
Supports saving associated captions for image-text datasets.
Respects web opt-out directives (X-Robots-Tag) by default.
Scalable to billions of image-text pairs with distributed processing.
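The opt-out behavior listed above can be pictured as a header check applied to each HTTP response before an image is kept. The following is a minimal sketch of that idea, not img2dataset's actual implementation: the function name `is_opted_out` and the directive set are assumptions made for illustration.

```python
# Hypothetical sketch of an X-Robots-Tag opt-out check.
# The directive names below are assumptions chosen for illustration;
# consult the img2dataset documentation for the list it actually honors.
DISALLOWED_DIRECTIVES = frozenset({"noai", "noimageai", "noindex", "noimageindex"})


def is_opted_out(x_robots_tag_values, disallowed=DISALLOWED_DIRECTIVES):
    """Return True if any X-Robots-Tag header value carries a disallowed directive.

    x_robots_tag_values: iterable of raw header values, one per
    X-Robots-Tag header found on the response.
    """
    for value in x_robots_tag_values:
        # A single header may hold several comma-separated directives.
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & disallowed:
            return True
    return False


if __name__ == "__main__":
    print(is_opted_out(["noai"]))
    print(is_opted_out(["nofollow"]))
```

This simplified version ignores user-agent-scoped values (such as `somebot: noindex`); a fuller check would split off the agent token first and compare it with the downloader's own user agent.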
Quick Start
pip install img2dataset && img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256
Detailed Introduction
img2dataset is a command-line tool for building large-scale image datasets from lists of URLs. It downloads, resizes, and packages images together with their captions, scaling from millions of URLs on a single machine to billions with distributed processing. Its reported throughput is 100 million URLs in 20 hours on a single machine, which makes it practical for researchers and developers preparing training data for data-intensive models.
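To put the quoted throughput in perspective, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch: the helper names are hypothetical, and the one-billion-URL projection assumes the single-machine rate stays constant, which real network conditions will not guarantee.

```python
# Figures quoted in the text: 100 million URLs in 20 hours on one machine.
TOTAL_URLS = 100_000_000
HOURS = 20


def urls_per_second(total_urls: int, hours: float) -> float:
    """Average sustained rate implied by a total count and a wall-clock duration."""
    return total_urls / (hours * 3600)


def hours_needed(total_urls: int, rate_per_second: float) -> float:
    """Wall-clock hours to process total_urls at a constant rate."""
    return total_urls / rate_per_second / 3600


rate = urls_per_second(TOTAL_URLS, HOURS)
print(f"Implied rate: {rate:.0f} URLs/s")

# Hypothetical projection: one billion URLs at the same constant rate.
projected = hours_needed(1_000_000_000, rate)
print(f"One billion URLs at that rate: {projected:.0f} hours")
```

The implied rate is roughly 1,400 URLs per second, so a billion-URL job on one machine would take on the order of 200 hours, which is why distributed processing matters at that scale.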