# Gradient Usage Guide

An end-to-end guide to integrating Gradient into real training jobs: setup, attach modes, commit patterns, resume/fork workflows, CLI handoff, and the API reference.

## Install

Package and dependency:

```bash
pip install gradient-desc
```

Requires `torch` and `numpy`.

## Setup Paths

Two ways to initialize your project.

**Path A: CLI-initialized project** (explicit structure, shared teams).

```bash
gradient workspace init ./ml-experiments
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"
gradient workspace status
```

**Path B: Auto-create** (quick start). Pass both `workspace` and `repo` to `attach(...)` and Gradient creates any missing markers.
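The auto-create path boils down to writing marker files if they do not exist yet. A minimal sketch of that idea, assuming marker file names based on the on-disk layout shown later in this guide (the real engine may store more metadata):

```python
import json
from pathlib import Path


def ensure_context(workspace: str, repo: str) -> Path:
    """Create workspace/repo marker files if missing (hypothetical sketch)."""
    ws = Path(workspace)
    (ws / ".gradient").mkdir(parents=True, exist_ok=True)
    ws_cfg = ws / ".gradient" / "config.json"
    if not ws_cfg.exists():
        ws_cfg.write_text(json.dumps({"workspace": ws.name}))

    repo_dir = ws / repo
    (repo_dir / ".gradient-repo").mkdir(parents=True, exist_ok=True)
    repo_cfg = repo_dir / ".gradient-repo" / "config.json"
    if not repo_cfg.exists():
        repo_cfg.write_text(json.dumps({"repo_name": repo}))
    return repo_dir
```

Because the writes are guarded by existence checks, re-attaching to an already initialized workspace is a no-op.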

## Attach Patterns

How to bind Gradient to your training run.

### Auto-create attach

```python
import torch
import torch.nn as nn
import torch.optim as optim
from gradient import GradientEngine

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

engine = GradientEngine.attach(
    model,
    optimizer,
    workspace="./my_workspace",
    repo="my_model",
)
```

### Auto-discover attach

```python
# Run this inside an initialized repo directory
# Example cwd: ./ml-experiments/gpt4
from gradient import GradientEngine

engine = GradientEngine.attach(model, optimizer)
```

### Attach with explicit config

```python
from gradient import GradientEngine, GradientConfig

engine = GradientEngine.attach(
    model,
    optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./ml-experiments",
        repo_name="gpt4",
        branch="main",
        compression="auto",
        reanchor_interval=50,
    ),
)
```

## Training + Commit Patterns

Recommended checkpointing patterns during training.

### Periodic auto-commit

```python
engine.autocommit(every=10)
start = engine.current_step

for step in range(start + 1, start + 1001):
    loss = train_step(...)
    engine.maybe_commit(step)
```

### Manual milestone commits

```python
# Save explicit milestones with messages
start = engine.current_step

for step in range(start + 1, start + 501):
    loss = train_step(...)

    if step in {1, 50, 100, 250, 500}:
        engine.commit(step, message=f"milestone step {step}")
```
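The two patterns compose because `maybe_commit` is only a cadence check on top of `commit`. A minimal sketch of that assumed bookkeeping (not the engine's actual implementation):

```python
class CommitCadence:
    """Hypothetical sketch of autocommit/maybe_commit bookkeeping."""

    def __init__(self) -> None:
        self.every: int | None = None
        self.committed: list[int] = []

    def autocommit(self, every: int) -> None:
        # Remember the cadence; no checkpoint is written yet.
        self.every = every

    def maybe_commit(self, step: int) -> None:
        # Commit only on steps that land on the configured cadence.
        if self.every is not None and step % self.every == 0:
            self.commit(step)

    def commit(self, step: int, message: str = "") -> None:
        # Stand-in for writing a real checkpoint.
        self.committed.append(step)


cadence = CommitCadence()
cadence.autocommit(every=10)
for step in range(1, 31):
    cadence.maybe_commit(step)
# cadence.committed is now [10, 20, 30]
```

Calling `commit(...)` directly for a milestone therefore never conflicts with the periodic cadence; they simply append to the same history.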

## Resume + Fork Workflows

Continue training deterministically or branch experiments.

### Resume from the Python API

```python
# Resume an exact ref
engine.resume("main@100")

# Or resume the latest checkpoint on the active branch
engine.resume_latest()
```

### Resume/Fork from the CLI

```bash
# Resume from a specific checkpoint
gradient resume main@100 -- python train.py

# Resume from the latest on the current branch
gradient resume latest -- python train.py

# Fork a new branch from a checkpoint
gradient fork main@100 high_lr -- python train.py
```

### Fork from the API with options

```python
engine.fork(
    from_ref="main@100",
    new_branch="high_lr",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message="10x LR experiment",
)
```
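Conceptually, the reset flags decide which parts of the source checkpoint carry over into the fork. A hedged sketch of that semantics, using a plain dict as a stand-in for a checkpoint payload (the real on-disk format differs):

```python
from typing import Any, Optional


def build_fork_payload(
    source: dict[str, Any],
    reset_optimizer: bool = False,
    reset_scheduler: bool = False,
    reset_rng_seed: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical sketch: apply fork reset flags to a checkpoint payload."""
    payload = dict(source)  # the source checkpoint is never mutated
    if reset_optimizer:
        payload["optimizer"] = None  # forked run re-initializes optimizer state
    if reset_scheduler:
        payload["scheduler"] = None  # forked run re-initializes scheduler state
    if reset_rng_seed is not None:
        payload["rng_seed"] = reset_rng_seed  # reseed RNG for the forked run
    return payload
```

With all flags at their defaults, a fork is an exact copy of the source checkpoint under a new branch name.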

Checkpoint ref format:

| Ref | Description |
| --- | --- |
| `main@100` | Checkpoint at step 100 on branch `main`. |
| `experiment@50` | Checkpoint at step 50 on branch `experiment`. |
| `latest` | Highest step on the current branch. |

CLI handoff environment variables:

| Variable | Description |
| --- | --- |
| `GRADIENT_WORKSPACE` | Workspace path override for `attach(...)`. |
| `GRADIENT_REPO` | Repo name override for `attach(...)`. |
| `GRADIENT_RESUME_REF` | Auto-resume reference (`latest` or `branch@step`). |
| `GRADIENT_BRANCH` | Branch override for attach. |
| `GRADIENT_AUTOCOMMIT` | Auto-commit interval applied at attach. |
| `GRADIENT_AUTH_VERIFY_URL` | Optional token verification endpoint override for `gradient login`. |

## CLI Guide

Command reference for workspace/repo/training intents.

| Command | Description |
| --- | --- |
| `gradient workspace init [path]` | Create a workspace marker and workspace config. |
| `gradient workspace status` | Show workspace repos, branches, and checkpoint counts. |
| `gradient repo init <name> [-d DESC]` | Create a repo inside the current workspace. |
| `gradient repo list` | List repos in the current workspace. |
| `gradient status` | Show status for the current repo. |
| `gradient login [--token TOKEN] [--verify-url URL]` | Verify and store a CLI access token for remote API calls. |
| `gradient resume <ref> -- python train.py` | Set resume intent and launch the training command. |
| `gradient fork <ref> <branch> -- python train.py` | Set fork intent and launch the training command. |

## API Reference

Attach arguments, methods, and helpers.

### `GradientEngine.attach(...)`

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `torch.nn.Module` | required | Model parameters and buffers are checkpointed. |
| `optimizer` | `torch.optim.Optimizer` | required | Optimizer state is restored for deterministic continuation. |
| `scheduler` | `Any \| None` | `None` | Optional scheduler state capture and restore. |
| `workspace` | `str \| None` | `None` | Workspace path. With `repo`, can auto-create the hierarchy. |
| `repo` | `str \| None` | `None` | Repo name inside the workspace. |
| `config` | `GradientConfig \| None` | `None` | Explicit configuration object. |

### Checkpoint methods

| Method | Returns | Description |
| --- | --- | --- |
| `commit(step, message='')` | `str` | Writes an anchor/delta checkpoint and appends it to the manifest. |
| `resume(ref)` | `int` | Resumes from `branch@step` and returns the step. |
| `resume_latest()` | `int` | Resumes the highest step on the current branch. |
| `fork(...)` | `str` | Creates a new branch checkpoint from an existing ref. |

### Fork options

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `from_ref` | `str` | required | Source checkpoint ref, e.g. `main@100`. |
| `new_branch` | `str` | required | Branch name for the forked checkpoint. |
| `reset_optimizer` | `bool` | `False` | If true, drop optimizer state in the fork. |
| `reset_scheduler` | `bool` | `False` | If true, reset scheduler state in the fork. |
| `reset_rng_seed` | `Optional[int]` | `None` | If set, reseed RNG for the forked run. |
| `message` | `str` | | Commit message for the fork checkpoint. |

### Training helpers + properties

| Name | Returns/Type | Description |
| --- | --- | --- |
| `autocommit(every=N)` | `None` | Sets the periodic commit cadence. |
| `maybe_commit(step)` | `None` | Commits only when the step matches the cadence. |
| `register_state(name, getter, setter)` | `None` | Adds custom external state to checkpoints. |
| `current_step` | `int` | Starting step after attach/resume (0 for fresh runs). |
| `workspace_path` / `repo_name` / `repo_path` / `branch` | `str` | Resolved run context properties. |

## GradientConfig Reference

User-facing options for pathing and checkpoint behavior.

### Location + Branch

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `workspace_path` | `str` | required | Workspace root path. |
| `repo_name` | `str` | required | Repo name under the workspace. |
| `branch` | `str` | `main` | Active branch for commit/resume. |

### Checkpoint behavior

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `reanchor_interval` | `Optional[int]` | `None` | Forces a new full anchor after N delta checkpoints. |
| `compression` | `"off" \| "auto" \| "aggressive"` | `auto` | Delta compression mode. Use `off` to disable compression. |

## External State

Include non-model state for deterministic resume (RL env, curriculum, etc.):

```python
engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda state: env.set_state(state),
)
```
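At commit time each registered getter is snapshotted, and at resume time each setter pushes its snapshot back. A minimal stand-alone sketch of that assumed round trip (not Gradient's internals):

```python
class ExternalStateRegistry:
    """Hypothetical sketch of register_state capture/restore plumbing."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple] = {}

    def register_state(self, name, getter, setter) -> None:
        self._entries[name] = (getter, setter)

    def capture(self) -> dict:
        # Called at commit time: snapshot every registered state.
        return {name: getter() for name, (getter, _) in self._entries.items()}

    def restore(self, snapshot: dict) -> None:
        # Called at resume time: push each snapshot back through its setter.
        for name, (_, setter) in self._entries.items():
            if name in snapshot:
                setter(snapshot[name])
```

Getters should return copies (not live references), so a later mutation of the environment cannot retroactively alter an already captured snapshot.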

## Layout + Manifest

How Gradient stores checkpoints and metadata on disk:

```text
my_workspace/
├── .gradient/
│   └── config.json
├── gpt4/
│   ├── .gradient-repo/
│   │   └── config.json
│   ├── manifest.json
│   ├── ckpt_main_s0.pt
│   └── ckpt_main_s100.pt
└── llama/
    └── ...
```

Example repo `manifest.json`:

```json
{
  "repo_name": "my_model",
  "description": "",
  "created_at": "2026-02-10T12:34:56.000000",
  "checkpoints": [
    {
      "step": 10,
      "branch": "main",
      "file": "/abs/path/to/ckpt_main_s10.pt",
      "type": "delta"
    }
  ]
}
```

Python helpers for workspace/repo context:

```python
from gradient import (
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)

init_workspace("./ml-experiments")
init_repo("./ml-experiments/gpt4", name="gpt4")

print(find_workspace("."))
print(find_repo("."))
print(resolve_context("."))
```