SDK — Job Lifecycle & Outputs

Overview

On a successful submission, the SDK returns a thin Job handle. From then on, progress tracking happens in the web dashboard, but you can still read status and artefacts from Python if you need to.

States

JobState enum values: PENDING, RUNNING, FINISHED, FAILED, CANCELLED. Jobs start in PENDING while the kernel allocates resources, then transition to RUNNING as soon as workers are ready.
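To reason about these states in your own code, a local stand-in for the enum can be sketched as below. This is not the SDK's own class: the string values and the `is_terminal` helper are assumptions for illustration, and treating FINISHED, FAILED, and CANCELLED as terminal is an inference from the lifecycle described above.

```python
from enum import Enum


class JobState(Enum):
    """Local stand-in mirroring the documented state names (values are assumed)."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Assumption: a job in one of these states will not change state again.
TERMINAL_STATES = frozenset(
    {JobState.FINISHED, JobState.FAILED, JobState.CANCELLED}
)


def is_terminal(state: JobState) -> bool:
    """Return True when the job has stopped for good."""
    return state in TERMINAL_STATES
```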

What rt_submit Returns

rt_submit returns the Job handle described above. Open job.dashboard_url to get the full tracking UI: rounds, logs, file browser, model download, and worker telemetry.
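A minimal sketch of jumping from the handle to the dashboard, assuming only the `dashboard_url` attribute mentioned above. The `open_dashboard` helper and its `opener` parameter are hypothetical conveniences, not part of the SDK.

```python
import webbrowser
from typing import Any, Callable


def open_dashboard(job: Any, opener: Callable[[str], Any] = webbrowser.open) -> str:
    """Open the job's tracking page in the default browser and return its URL.

    `job` is the handle returned on submission; only `job.dashboard_url`
    is read. Pass a different `opener` to log or print the URL instead.
    """
    url = job.dashboard_url
    opener(url)
    return url
```

On a headless machine, `open_dashboard(job, opener=print)` prints the URL so you can paste it into a browser elsewhere.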

Checking Status From Python

Use sdk.jobs.get(job.id) for a one-shot refresh; the SDK has no built-in polling helpers.

The SDK used to ship Job.stream() and Job.wait(); both were removed in 0.2.0 to keep the SDK focused on submission. If you want polling logic in your code, write the loop around sdk.jobs.get() yourself.
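One way to write such a polling loop is sketched below. It takes the fetch function as an argument (in practice `sdk.jobs.get`) so it does not depend on SDK internals. The `.state` attribute name, the interval, and the timeout are assumptions; adjust them to what your handle actually exposes.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELLED"}


def state_name(job: Any) -> str:
    """Assumption: the handle exposes `.state` as an enum member or a plain string."""
    state = getattr(job, "state")
    return getattr(state, "name", str(state))


def wait_for_job(
    fetch: Callable[[str], Any],
    job_id: str,
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Re-fetch the job until it reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(job_id)  # one-shot refresh, e.g. sdk.jobs.get(job.id)
        if state_name(job) in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state_name(job)} after {timeout_s}s")
        sleep(interval_s)
```

Typical use would be `final = wait_for_job(sdk.jobs.get, job.id)`, followed by a check of the final state before reading outputs.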

Listing Your Jobs

list() paginates with the limit and offset parameters.
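Walking every page can be sketched as follows. The helper takes the listing function as an argument (in practice `sdk.jobs.list`, an assumed location for `list()`), and it assumes each call returns a sequence and that a short or empty page marks the end. `iter_jobs` and `page_size` are hypothetical names.

```python
from typing import Any, Callable, Iterator, Sequence


def iter_jobs(
    list_page: Callable[..., Sequence[Any]],
    page_size: int = 50,
) -> Iterator[Any]:
    """Yield every job by advancing `offset` one page at a time."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        if not page:
            return  # nothing left to read
        yield from page
        if len(page) < page_size:
            return  # short page: assumed to be the last one
        offset += len(page)
```

For example, `for job in iter_jobs(sdk.jobs.list, page_size=100): ...` visits all jobs without loading them in a single call.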

Output Locations

Once a job reaches FINISHED:

Path	Contents
jobs/<slug>/files_out/	Training logs, per-round metrics, anything your train() wrote.
jobs/<slug>/model_out/	Final aggregated .pt checkpoint.

Everything under jobs/<slug>/ stays in your bucket until you delete it (subject to your storage quota).

From the web UI

Dashboard → Jobs → click your job → Browse Job Files opens the file browser at jobs/<slug>/.
Click files_out/training_log.txt to open it in the Monaco viewer.
Click model_out/checkpoint.pt → Download for the final model.

From Python
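The SDK call for downloading objects is not shown in this section, so the sketch below only builds the object keys from the layout in the table above. `job_output_keys` is a hypothetical helper; pass the resulting keys to whatever storage client you use for your bucket.

```python
import posixpath
from typing import Dict


def job_output_keys(slug: str) -> Dict[str, str]:
    """Return the bucket keys for a finished job's outputs.

    File names follow the examples on this page (training_log.txt,
    checkpoint.pt); your train() may write additional files to files_out/.
    """
    if not slug or "/" in slug:
        raise ValueError("slug must be a single, non-empty path segment")
    root = posixpath.join("jobs", slug)
    return {
        "files_out": posixpath.join(root, "files_out") + "/",
        "model_out": posixpath.join(root, "model_out") + "/",
        "training_log": posixpath.join(root, "files_out", "training_log.txt"),
        "checkpoint": posixpath.join(root, "model_out", "checkpoint.pt"),
    }
```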

Failed Jobs

On FAILED, the dashboard shows the error message and worker logs. Common causes:

Worker hit CUDA OOM → raise estimated_memory_mb or pick a larger GPU class.
Shard count mismatch → the backend needs exactly N workers for N shards. Re-upload to match or use auto_select_workers=True.
model_def.py missing or broken → see the source-extraction notes.
requirements.txt resolution failure → check spelling and version pins; prefer exact versions for reproducibility.

Cancelling

Stop a running job from the dashboard (Job detail → Stop) or via the backend API directly:
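The exact endpoint is not reproduced here, so the following is only a sketch of building such a request with the standard library. The stop URL is a parameter you must take from your backend API reference, and the POST method and bearer-token header are assumptions, not documented behaviour. `build_stop_request` is a hypothetical helper.

```python
import urllib.request


def build_stop_request(
    stop_url: str,
    token: str,
    method: str = "POST",  # assumption: check the backend API reference
) -> urllib.request.Request:
    """Build (but do not send) an authenticated stop request for a job."""
    return urllib.request.Request(
        stop_url,
        method=method,
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
```

Sending it would look like `urllib.request.urlopen(build_stop_request(url, token), timeout=30)`; afterwards, refresh the job with sdk.jobs.get(job.id) to confirm it reached CANCELLED.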

The SDK doesn't expose a dedicated stop() wrapper yet; the backend API call uses the same endpoint as the UI's Stop button.

Archival

Bucket contents are not auto-deleted. Once you have the outputs you want:

Or use the file browser's folder delete in the web UI (recursive by default).
