# Gradient Usage Guide

An end-to-end guide to integrating Gradient into real training jobs: setup, attach modes, commit patterns, resume/fork workflows, CLI handoff, and the API reference.

## Install

Package and dependency:

```bash
pip install gradient-desc
```

Requires `torch` and `numpy`.

## Setup Paths

Two ways to initialize your project.

**Path A: CLI-initialized project** (explicit structure, shared teams).

```bash
gradient workspace init ./ml-experiments
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"
gradient workspace status
```

**Path B: Auto-create** (quick start). Pass both `workspace` and `repo` to `attach(...)` and Gradient creates any missing markers.
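The auto-create path boils down to writing marker files if they do not exist yet. A minimal sketch of that idea, assuming marker file names based on the on-disk layout shown later in this guide (the real engine may store more metadata):

```python
import json
from pathlib import Path


def ensure_context(workspace: str, repo: str) -> Path:
    """Create workspace/repo marker files if missing (hypothetical sketch)."""
    ws = Path(workspace)
    (ws / ".gradient").mkdir(parents=True, exist_ok=True)
    ws_cfg = ws / ".gradient" / "config.json"
    if not ws_cfg.exists():
        ws_cfg.write_text(json.dumps({"workspace": ws.name}))

    repo_dir = ws / repo
    (repo_dir / ".gradient-repo").mkdir(parents=True, exist_ok=True)
    repo_cfg = repo_dir / ".gradient-repo" / "config.json"
    if not repo_cfg.exists():
        repo_cfg.write_text(json.dumps({"repo_name": repo}))
    return repo_dir
```

Because the writes are guarded by existence checks, re-attaching to an already initialized workspace is a no-op.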

## Attach Patterns

How to bind Gradient to your training run.

### Auto-create attach

```python
import torch
import torch.nn as nn
import torch.optim as optim
from gradient import GradientEngine

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

engine = GradientEngine.attach(
    model,
    optimizer,
    workspace="./my_workspace",
    repo="my_model",
)
```

### Auto-discover attach

```python
# Run this inside an initialized repo directory
# Example cwd: ./ml-experiments/gpt4
from gradient import GradientEngine

engine = GradientEngine.attach(model, optimizer)
```

### Attach with explicit config

```python
from gradient import GradientEngine, GradientConfig

engine = GradientEngine.attach(
    model,
    optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./ml-experiments",
        repo_name="gpt4",
        branch="main",
        compression="auto",
        reanchor_interval=50,
    ),
)
```

## Training + Commit Patterns

Recommended checkpointing patterns during training.

### Periodic auto-commit

```python
engine.autocommit(every=10)
start = engine.current_step

for step in range(start + 1, start + 1001):
    loss = train_step(...)
    engine.maybe_commit(step)
```

### Manual milestone commits

```python
# Save explicit milestones with messages
start = engine.current_step

for step in range(start + 1, start + 501):
    loss = train_step(...)

    if step in {1, 50, 100, 250, 500}:
        engine.commit(step, message=f"milestone step {step}")
```
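The two patterns compose because `maybe_commit` is only a cadence check on top of `commit`. A minimal sketch of that assumed bookkeeping (not the engine's actual implementation):

```python
class CommitCadence:
    """Hypothetical sketch of autocommit/maybe_commit bookkeeping."""

    def __init__(self) -> None:
        self.every: int | None = None
        self.committed: list[int] = []

    def autocommit(self, every: int) -> None:
        # Remember the cadence; no checkpoint is written yet.
        self.every = every

    def maybe_commit(self, step: int) -> None:
        # Commit only on steps that land on the configured cadence.
        if self.every is not None and step % self.every == 0:
            self.commit(step)

    def commit(self, step: int, message: str = "") -> None:
        # Stand-in for writing a real checkpoint.
        self.committed.append(step)


cadence = CommitCadence()
cadence.autocommit(every=10)
for step in range(1, 31):
    cadence.maybe_commit(step)
# cadence.committed is now [10, 20, 30]
```

Calling `commit(...)` directly for a milestone therefore never conflicts with the periodic cadence; they simply append to the same history.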

## Resume + Fork Workflows

Continue training deterministically or branch experiments.

### Resume from the Python API

```python
# Resume an exact ref
engine.resume("main@100")

# Or resume the latest checkpoint on the active branch
engine.resume_latest()
```

### Resume/Fork from the CLI

```bash
# Resume from a specific checkpoint
gradient resume main@100 -- python train.py

# Resume from the latest on the current branch
gradient resume latest -- python train.py

# Fork a new branch from a checkpoint
gradient fork main@100 high_lr -- python train.py
```

### Fork from the API with options

```python
engine.fork(
    from_ref="main@100",
    new_branch="high_lr",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message="10x LR experiment",
)
```
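Conceptually, the reset flags decide which parts of the source checkpoint carry over into the fork. A hedged sketch of that semantics, using a plain dict as a stand-in for a checkpoint payload (the real on-disk format differs):

```python
from typing import Any, Optional


def build_fork_payload(
    source: dict[str, Any],
    reset_optimizer: bool = False,
    reset_scheduler: bool = False,
    reset_rng_seed: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical sketch: apply fork reset flags to a checkpoint payload."""
    payload = dict(source)  # the source checkpoint is never mutated
    if reset_optimizer:
        payload["optimizer"] = None  # forked run re-initializes optimizer state
    if reset_scheduler:
        payload["scheduler"] = None  # forked run re-initializes scheduler state
    if reset_rng_seed is not None:
        payload["rng_seed"] = reset_rng_seed  # reseed RNG for the forked run
    return payload
```

With all flags at their defaults, a fork is an exact copy of the source checkpoint under a new branch name.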

Checkpoint ref format:

| Ref | Description |
| --- | --- |
| `main@100` | Checkpoint at step 100 on branch `main`. |
| `experiment@50` | Checkpoint at step 50 on branch `experiment`. |
| `latest` | Highest step on the current branch. |

CLI handoff environment variables:

| Variable | Description |
| --- | --- |
| `GRADIENT_WORKSPACE` | Workspace path override for `attach(...)`. |
| `GRADIENT_REPO` | Repo name override for `attach(...)`. |
| `GRADIENT_RESUME_REF` | Auto-resume reference (`latest` or `branch@step`). |
| `GRADIENT_BRANCH` | Branch override for attach. |
| `GRADIENT_AUTOCOMMIT` | Auto-commit interval applied at attach. |
| `GRADIENT_AUTH_VERIFY_URL` | Optional token verification endpoint override for `gradient login`. |

## CLI Guide

Command reference for workspace/repo/training intents.

| Command | Description |
| --- | --- |
| `gradient workspace init [path]` | Create a workspace marker and workspace config. |
| `gradient workspace status` | Show workspace repos, branches, and checkpoint counts. |
| `gradient repo init <name> [-d DESC]` | Create a repo inside the current workspace. |
| `gradient repo list` | List repos in the current workspace. |
| `gradient status` | Show status for the current repo. |
| `gradient login [--token TOKEN] [--verify-url URL]` | Verify and store a CLI access token for remote API calls. |
| `gradient resume <ref> -- python train.py` | Set resume intent and launch the training command. |
| `gradient fork <ref> <branch> -- python train.py` | Set fork intent and launch the training command. |

## API Reference

Attach arguments, methods, and helpers.

### `GradientEngine.attach(...)`

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `torch.nn.Module` | required | Model parameters and buffers are checkpointed. |
| `optimizer` | `torch.optim.Optimizer` | required | Optimizer state is restored for deterministic continuation. |
| `scheduler` | `Any \| None` | `None` | Optional scheduler state capture and restore. |
| `workspace` | `str \| None` | `None` | Workspace path. With `repo`, can auto-create the hierarchy. |
| `repo` | `str \| None` | `None` | Repo name inside the workspace. |
| `config` | `GradientConfig \| None` | `None` | Explicit configuration object. |

### Checkpoint methods

| Method | Returns | Description |
| --- | --- | --- |
| `commit(step, message='')` | `str` | Writes an anchor/delta checkpoint and appends it to the manifest. |
| `resume(ref)` | `int` | Resumes from `branch@step` and returns the step. |
| `resume_latest()` | `int` | Resumes the highest step on the current branch. |
| `fork(...)` | `str` | Creates a new branch checkpoint from an existing ref. |

### Fork options

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `from_ref` | `str` | required | Source checkpoint ref, e.g. `main@100`. |
| `new_branch` | `str` | required | Branch name for the forked checkpoint. |
| `reset_optimizer` | `bool` | `False` | If true, drop optimizer state in the fork. |
| `reset_scheduler` | `bool` | `False` | If true, reset scheduler state in the fork. |
| `reset_rng_seed` | `Optional[int]` | `None` | If set, reseed RNG for the forked run. |
| `message` | `str` | | Commit message for the fork checkpoint. |

### Training helpers + properties

| Name | Returns/Type | Description |
| --- | --- | --- |
| `autocommit(every=N)` | `None` | Sets the periodic commit cadence. |
| `maybe_commit(step)` | `None` | Commits only when the step matches the cadence. |
| `register_state(name, getter, setter)` | `None` | Adds custom external state to checkpoints. |
| `current_step` | `int` | Starting step after attach/resume (0 for fresh runs). |
| `workspace_path` / `repo_name` / `repo_path` / `branch` | `str` | Resolved run context properties. |

## GradientConfig Reference

User-facing options for pathing and checkpoint behavior.

### Location + Branch

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `workspace_path` | `str` | required | Workspace root path. |
| `repo_name` | `str` | required | Repo name under the workspace. |
| `branch` | `str` | `main` | Active branch for commit/resume. |

### Checkpoint behavior

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `reanchor_interval` | `Optional[int]` | `None` | Forces a new full anchor after N delta checkpoints. |
| `compression` | `"off" \| "auto" \| "aggressive"` | `auto` | Delta compression mode. Use `off` to disable compression. |

## External State

Include non-model state for deterministic resume (RL env, curriculum, etc.):

```python
engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda state: env.set_state(state),
)
```
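At commit time each registered getter is snapshotted, and at resume time each setter pushes its snapshot back. A minimal stand-alone sketch of that assumed round trip (not Gradient's internals):

```python
class ExternalStateRegistry:
    """Hypothetical sketch of register_state capture/restore plumbing."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple] = {}

    def register_state(self, name, getter, setter) -> None:
        self._entries[name] = (getter, setter)

    def capture(self) -> dict:
        # Called at commit time: snapshot every registered state.
        return {name: getter() for name, (getter, _) in self._entries.items()}

    def restore(self, snapshot: dict) -> None:
        # Called at resume time: push each snapshot back through its setter.
        for name, (_, setter) in self._entries.items():
            if name in snapshot:
                setter(snapshot[name])
```

Getters should return copies (not live references), so a later mutation of the environment cannot retroactively alter an already captured snapshot.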

## Layout + Manifest

How Gradient stores checkpoints and metadata on disk:

```text
my_workspace/
├── .gradient/
│   └── config.json
├── gpt4/
│   ├── .gradient-repo/
│   │   └── config.json
│   ├── manifest.json
│   ├── ckpt_main_s0.pt
│   └── ckpt_main_s100.pt
└── llama/
    └── ...
```

Example repo `manifest.json`:

```json
{
  "repo_name": "my_model",
  "description": "",
  "created_at": "2026-02-10T12:34:56.000000",
  "checkpoints": [
    {
      "step": 10,
      "branch": "main",
      "file": "/abs/path/to/ckpt_main_s10.pt",
      "type": "delta"
    }
  ]
}
```

Python helpers for workspace/repo context:

```python
from gradient import (
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)

init_workspace("./ml-experiments")
init_repo("./ml-experiments/gpt4", name="gpt4")

print(find_workspace("."))
print(find_repo("."))
print(resolve_context("."))
```