Tags: #distributed-training

Deep Learning Framework

31.1k

Lightning-AI/pytorch-lightning

Streamlines complex deep learning engineering, enabling scalable AI model training and finetuning across diverse hardware with minimal code changes.

deep-learning pytorch ml-framework

AI/MLOps Platform

kubernetes

5.0k

tencentmusic/cube-studio

An open-source, cloud-native, all-in-one MLOps platform for managing the full lifecycle of machine learning, deep learning, and large language model development and deployment.

mlops ai-platform cloud-native

Replaces:

AWS SageMaker, Google Cloud AI Platform...

AI/Deep Learning Optimization Framework

NVIDIA GPUs

41.4k

hpcaitech/ColossalAI

An open-source framework designed to make large AI model training and inference cheaper, faster, and more accessible through advanced distributed computing and memory optimization techniques.

deep-learning distributed-training llm-optimization

Replaces:

OpenRouter

Reinforcement Learning Library for LLMs

Ray

3.1k

alibaba/ROLL

An efficient, user-friendly library for scaling reinforcement learning with large language models, aimed at improving performance on complex AI tasks.

reinforcement-learning large-language-models distributed-training
