果冻甜甜的

Introduction to NVIDIA Resiliency Extension (NVRx)

Posted on 2026-01-14 In Other

An introduction to the basics of NVRx

Reducing Energy Bloat in Large Model Training

Posted on 2025-12-28 In Paper Reading

Reducing energy waste in large model training: the Perseus system explained in detail

Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters

Posted on 2025-12-28 In Paper Reading

Rail-only: a low-cost, high-performance network architecture for training trillion-parameter LLMs

Reducing Activation Recomputation in Large Transformer Models

Posted on 2025-11-23 In Paper Reading

System-level optimization of activation recomputation in large-scale Transformers

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Posted on 2025-11-23 In Paper Reading

Megatron-LM: 3D parallelism in practice

InstructCoder: Instruction Tuning Large Language Models for Code Editing

Posted on 2025-11-22 In Paper Reading

InstructCoder: instruction tuning in practice for code editing

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Posted on 2025-11-22 In Paper Reading

Megatron-LM: training multi-billion parameter language models using model parallelism

Introduction to Tokens

Posted on 2025-09-07 In Other

An introduction to the basics of tokens

Stream and Event in PyTorch

Posted on 2025-09-07 In Distributed Fundamentals

Stream / Event and cross-stream synchronization in PyTorch: principles, usage, and runnable examples

Common Shell Commands on Ubuntu

Posted on 2025-08-17 Edited on 2025-08-24 In Other

A record of the most commonly used shell commands