Tags: #computer-vision
huggingface/transformers
A comprehensive library providing state-of-the-art pre-trained models for various machine learning tasks across text, vision, audio, and multimodal domains, facilitating both inference and training.
web-infra-dev/midscene
An AI-powered, vision-driven UI automation framework that enables natural language control and scripting across web, mobile, and custom interfaces.
Skyvern-AI/skyvern
Automates complex browser-based workflows using LLMs and computer vision, providing a resilient and adaptive solution for web interaction.
OpenGVLab/Ask-Anything
An advanced multimodal AI chatbot framework that enables conversational interaction and deep understanding of video and image content, integrating various large language models.
spmallick/learnopencv
A comprehensive repository offering C++ and Python code examples for computer vision, deep learning, and AI research articles from LearnOpenCV.com.
huggingface/datasets
A lightweight library providing one-line dataloaders and efficient pre-processing tools for a vast hub of AI datasets, supporting various ML frameworks.
microsoft/fara
An ultra-compact 7B parameter AI agent designed by Microsoft to automate multi-step computer tasks through visual perception and direct interface interaction.
adobe-research/custom-diffusion
Enables fast and efficient multi-concept customization of text-to-image diffusion models like Stable Diffusion using a few images.
tianrun-chen/SAM-Adapter-PyTorch
A PyTorch-based framework to adapt Meta AI's Segment Anything Model (SAM) for improved performance on challenging downstream computer vision tasks using adapters and prompts.
xtreme1-io/xtreme1
An all-in-one open-source platform for multimodal data labeling and annotation, supporting 3D LiDAR, image, and LLM training data with AI-fueled tools.
Hunyuan-PromptEnhancer/PromptEnhancer
A prompt rewriting tool that refines user prompts into clearer, structured versions to enhance the quality of text-to-image generation and image-to-image editing.
OpenGVLab/InternVideo
A series of video foundation models and large-scale datasets designed for comprehensive multimodal video understanding and generation.
ZhaoJ9014/face.evoLVe
A high-performance, comprehensive face recognition library built on PaddlePaddle and PyTorch.
Yutong-Zhou-cv/Awesome-Text-to-Image
A comprehensive curated list of resources, papers, datasets, and projects related to text-to-image generation and manipulation.
microsoft/unilm
A comprehensive research hub for large-scale self-supervised pre-training of foundation models across diverse tasks, languages, and modalities.
autodistill/autodistill
Autodistill automates the process of training small, fast supervised models from unlabeled images by leveraging large foundation models, eliminating the need for manual data labeling.
X-PLUG/mPLUG-Owl
A family of powerful multi-modal large language models (MLLMs) designed to advance AI's understanding and generation capabilities across various data types.
xlite-dev/lite.ai.toolkit
A lightweight C++ toolkit for deploying over 100 AI models across various inference engines.
wangkai930418/awesome-diffusion-categorized
A meticulously categorized collection of research papers on diffusion models, spanning various subareas from visual illusions to image restoration and text-guided editing.
Fanghua-Yu/SUPIR
SUPIR is an AI-driven project focused on developing practical algorithms for photo-realistic image restoration and upscaling in real-world scenarios.
OpenGVLab/InternVL
A pioneering open-source multimodal large language model family aiming to match or exceed commercial models like GPT-4o/GPT-5 in performance.
open-mmlab/mmpretrain
MMPreTrain is an OpenMMLab project providing a comprehensive, open-source PyTorch-based toolbox for pre-training and benchmarking various computer vision and multi-modal models.