Gradient Usage Guide
End-to-end guide for integrating Gradient into real training jobs. It covers setup, attach modes, commit patterns, resume/fork workflows, CLI handoff, and API reference.
Install
Package and dependency
pip install gradient-desc
Requires `torch` and `numpy`.
Setup Paths
Two ways to initialize your project
Path A: CLI-initialized project (explicit structure, suited to teams sharing a workspace).
gradient workspace init ./ml-experiments
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"
gradient workspace status
Path B: Auto-create (quick start). Pass both workspace and repo to attach(...) and Gradient creates missing markers.
Attach Patterns
How to bind Gradient to your training run
Auto-create attach
import torch
import torch.nn as nn
import torch.optim as optim
from gradient import GradientEngine
model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
engine = GradientEngine.attach(
    model,
    optimizer,
    workspace="./my_workspace",
    repo="my_model",
)
Auto-discover attach
# Run this inside an initialized repo directory
# Example cwd: ./ml-experiments/gpt4
from gradient import GradientEngine
engine = GradientEngine.attach(model, optimizer)
Attach with explicit config
from gradient import GradientEngine, GradientConfig
engine = GradientEngine.attach(
    model,
    optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./ml-experiments",
        repo_name="gpt4",
        branch="main",
        compression="auto",
        reanchor_interval=50,
    ),
)
Training + Commit Patterns
Recommended checkpointing patterns during training
Periodic auto-commit
engine.autocommit(every=10)
start = engine.current_step
for step in range(start + 1, start + 1001):
    loss = train_step(...)
    engine.maybe_commit(step)
Manual milestone commits
# Save explicit milestones with messages
for step in range(start + 1, start + 501):
    loss = train_step(...)
    if step in {1, 50, 100, 250, 500}:
        engine.commit(step, message=f"milestone step {step}")
Resume + Fork Workflows
Continue training deterministically or branch experiments
Resume from Python API
# Resume exact ref
engine.resume("main@100")
# Or resume latest checkpoint on active branch
engine.resume_latest()
Resume/Fork from CLI
# Resume from a specific checkpoint
gradient resume main@100 -- python train.py
# Resume from latest on current branch
gradient resume latest -- python train.py
# Fork a new branch from a checkpoint
gradient fork main@100 high_lr -- python train.py
Fork from API with options
engine.fork(
    from_ref="main@100",
    new_branch="high_lr",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message="10x LR experiment",
)
Checkpoint ref format:
| Ref | Description |
|---|---|
| `main@100` | Checkpoint at step 100 on branch `main`. |
| `experiment@50` | Checkpoint at step 50 on branch `experiment`. |
| `latest` | Highest step on the current branch. |
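The ref grammar above is simple enough to sketch in a few lines. This parser is purely illustrative (it mirrors the table, not Gradient's internals), and `parse_ref` is a hypothetical helper name:

```python
def parse_ref(ref: str):
    """Split a checkpoint ref into (branch, step); 'latest' has neither."""
    if ref == "latest":
        return None, None  # resolved later against the current branch
    # rsplit tolerates '@' in branch names by splitting on the last one
    branch, step = ref.rsplit("@", 1)
    return branch, int(step)

print(parse_ref("main@100"))       # ('main', 100)
print(parse_ref("experiment@50"))  # ('experiment', 50)
```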
CLI handoff environment variables:
| Variable | Description |
|---|---|
| `GRADIENT_WORKSPACE` | Workspace path override for `attach(...)`. |
| `GRADIENT_REPO` | Repo name override for `attach(...)`. |
| `GRADIENT_RESUME_REF` | Auto-resume reference (`latest` or `branch@step`). |
| `GRADIENT_BRANCH` | Branch override for attach. |
| `GRADIENT_AUTOCOMMIT` | Auto-commit interval applied at attach. |
| `GRADIENT_AUTH_VERIFY_URL` | Optional token verification endpoint override for `gradient login`. |
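A minimal sketch of how a launcher hands context to a training script through these variables. The variable names come from the table above; the precedence (environment beats the keyword argument, which beats a default) is an assumption for illustration, not documented behavior, and `resolve_workspace` is a hypothetical helper:

```python
import os

def resolve_workspace(explicit=None, default="./my_workspace"):
    # Assumed order: env var override, then explicit argument, then default.
    return os.environ.get("GRADIENT_WORKSPACE") or explicit or default

# A launcher (e.g. `gradient resume`) would export handoff variables
# before exec'ing `python train.py`:
os.environ["GRADIENT_WORKSPACE"] = "/tmp/ws"
os.environ["GRADIENT_RESUME_REF"] = "latest"

print(resolve_workspace(explicit="./fallback"))  # /tmp/ws
```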
CLI Guide
Command reference for workspace/repo/training intents
| Command | Description |
|---|---|
| `gradient workspace init [path]` | Create a workspace marker and workspace config. |
| `gradient workspace status` | Show workspace repos, branches, and checkpoint counts. |
| `gradient repo init <name> [-d DESC]` | Create a repo inside the current workspace. |
| `gradient repo list` | List repos in the current workspace. |
| `gradient status` | Show status for the current repo. |
| `gradient login [--token TOKEN] [--verify-url URL]` | Verify and store a CLI access token for remote API calls. |
| `gradient resume <ref> -- python train.py` | Set resume intent and launch the training command. |
| `gradient fork <ref> <branch> -- python train.py` | Set fork intent and launch the training command. |
API Reference
Attach arguments, methods, and helpers
`GradientEngine.attach(...)`
| Name | Type | Default | Description |
|---|---|---|---|
| model | torch.nn.Module | required | Model parameters and buffers are checkpointed. |
| optimizer | torch.optim.Optimizer | required | Optimizer state is restored for deterministic continuation. |
| scheduler | Any \| None | None | Optional scheduler state capture and restore. |
| workspace | str \| None | None | Workspace path. With `repo`, can auto-create the hierarchy. |
| repo | str \| None | None | Repo name inside the workspace. |
| config | GradientConfig \| None | None | Explicit configuration object. |
Checkpoint methods
| Method | Returns | Description |
|---|---|---|
| commit(step, message='') | str | Writes an anchor/delta checkpoint and appends to the manifest. |
| resume(ref) | int | Resumes from `branch@step` and returns the step. |
| resume_latest() | int | Resumes the highest step on the current branch. |
| fork(...) | str | Creates a new branch checkpoint from an existing ref. |
Fork options
| Name | Type | Default | Description |
|---|---|---|---|
| from_ref | str | required | Source checkpoint ref, e.g. `main@100`. |
| new_branch | str | required | Branch name for the forked checkpoint. |
| reset_optimizer | bool | False | If true, drop optimizer state in the fork. |
| reset_scheduler | bool | False | If true, reset scheduler state in the fork. |
| reset_rng_seed | Optional[int] | None | If set, reseed RNG for the forked run. |
| message | str | '' | Commit message for the fork checkpoint. |
Training helpers + properties
| Name | Type | Description |
|---|---|---|
| autocommit(every=N) | -> None | Sets the periodic commit cadence. |
| maybe_commit(step) | -> None | Commits only when the step matches the cadence. |
| register_state(name, getter, setter) | -> None | Adds custom external state to checkpoints. |
| current_step | int | Starting step after attach/resume (0 for fresh runs). |
| workspace_path / repo_name / repo_path / branch | str | Resolved run-context properties. |
GradientConfig Reference
User-facing options for pathing and checkpoint behavior
Location + Branch
| Name | Type | Default | Description |
|---|---|---|---|
| workspace_path | str | required | Workspace root path. |
| repo_name | str | required | Repo name under the workspace. |
| branch | str | main | Active branch for commit/resume. |
Checkpoint behavior
| Name | Type | Default | Description |
|---|---|---|---|
| reanchor_interval | Optional[int] | None | Forces a new full anchor after N delta checkpoints. |
| compression | "off" \| "auto" \| "aggressive" | auto | Delta compression mode. Use `off` to disable compression. |
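One way to picture `reanchor_interval`: resume cost stays bounded because no more than N deltas ever sit between a checkpoint and its nearest anchor. The sketch below assumes (this is not verified against the implementation) that a fresh run starts with a full anchor and that every N delta checkpoints force a new one; `checkpoint_types` is a hypothetical helper:

```python
def checkpoint_types(num_commits, reanchor_interval):
    """Predict anchor/delta labels for a sequence of commits."""
    types = []
    deltas_since_anchor = reanchor_interval  # force an anchor on the first commit
    for _ in range(num_commits):
        if deltas_since_anchor >= reanchor_interval:
            types.append("anchor")
            deltas_since_anchor = 0
        else:
            types.append("delta")
            deltas_since_anchor += 1
    return types

print(checkpoint_types(5, 2))  # ['anchor', 'delta', 'delta', 'anchor', 'delta']
```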
External State
Include non-model state for deterministic resume (RL env, curriculum, etc.)
engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda state: env.set_state(state),
)
Layout + Manifest
How Gradient stores checkpoints and metadata on disk
my_workspace/
├── .gradient/
│ └── config.json
├── gpt4/
│ ├── .gradient-repo/
│ │ └── config.json
│ ├── manifest.json
│ ├── ckpt_main_s0.pt
│ └── ckpt_main_s100.pt
└── llama/
    └── ...
Example `manifest.json`:
{
"repo_name": "my_model",
"description": "",
"created_at": "2026-02-10T12:34:56.000000",
"checkpoints": [
{
"step": 10,
"branch": "main",
"file": "/abs/path/to/ckpt_main_s10.pt",
"type": "delta"
}
]
}
Python helpers for workspace/repo context:
from gradient import (
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)
init_workspace("./ml-experiments")
init_repo("./ml-experiments/gpt4", name="gpt4")
print(find_workspace("."))
print(find_repo("."))
print(resolve_context("."))
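Tooling can also read `manifest.json` directly. Here is a sketch, based on the schema shown above, of how a `latest` ref might resolve to a checkpoint step; the sample data and `resolve_latest` helper are illustrative, not part of the library:

```python
# Sample manifest following the schema above (paths shortened, and the
# "anchor" type for step 0 is an assumption about fresh runs).
manifest = {
    "repo_name": "my_model",
    "checkpoints": [
        {"step": 0, "branch": "main", "type": "anchor"},
        {"step": 10, "branch": "main", "type": "delta"},
        {"step": 5, "branch": "high_lr", "type": "anchor"},
    ],
}

def resolve_latest(manifest, branch="main"):
    """Return the highest step recorded for a branch, or None if absent."""
    steps = [c["step"] for c in manifest["checkpoints"] if c["branch"] == branch]
    return max(steps) if steps else None

print(resolve_latest(manifest))             # 10
print(resolve_latest(manifest, "high_lr"))  # 5
```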