
Jobs & Results

Track job progress in real time, understand job states, read training logs, and download model checkpoints.

Job States

| State | What is happening | What to do |
| --- | --- | --- |
| PENDING | Backend is validating paths, selecting workers, generating presigned shard URLs, and dispatching containers. Usually resolves in under 60 seconds. | Wait. Check that worker counts appear. |
| RUNNING | NVFlare server is up, all required workers are connected, and FL rounds are executing. Worker count shows N/N connected. | Monitor progress via logs or the NVFlare dashboard. |
| FINISHED | All rounds completed, or the job was manually stopped. Output is written to files_out/ and model_out/ in your workspace. | Download your model from model_out/. |
| FAILED | A worker crashed, timed out, or a validation error occurred. The error message is stored on the job. Partial output may exist in files_out/. | Read the error message. Check the logs. Fix and resubmit. |
| CANCELLED | Job was manually cancelled by the user. Partial output may exist. | Resubmit if needed. Check partial logs in files_out/. |

Jobs page — all states visible with worker counts and timestamps

Understanding Worker Counts

The worker count panel shows four numbers in real time:

| Counter | Meaning |
| --- | --- |
| Total | Number of GPU workers assigned to this job |
| Connected | Workers currently online and sending heartbeats to the NVFlare server |
| Errors | Workers that reported an error or crashed |
| Disconnected | Workers that lost connectivity (may be reassigned automatically) |

A healthy running job shows Connected = Total and Errors = 0. If Connected drops below the minimum client threshold for too long, the job may be marked FAILED.

Note: If a worker disconnects, the Coordinator detects the missing heartbeat and automatically selects a replacement. The replacement loads the last checkpoint and rejoins the FL round. You do not need to take any action.

Finding Your Results

When a job reaches FINISHED, output is written to two folders in your bucket:

| Path | Contents |
| --- | --- |
| jobs/{name}/files_out/ | Training logs, per-round metrics JSON, and any files your fl_train_model() writes to out_dir (see the sketch below) |
| jobs/{name}/model_out/ | Final aggregated model checkpoint after the last FL round (global_model.pt or similar) |
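
For orientation, here is a minimal sketch of writing a custom artifact from training code. This is hypothetical: only the fl_train_model() name and the out_dir destination come from this page; the signature and file contents are illustrative.

```python
import json
import os

def fl_train_model(model, data_loader, out_dir, **kwargs):
    # Signature is assumed: only the function name and out_dir are documented.
    # ... local training loop elided ...

    # Anything written under out_dir is collected into jobs/{name}/files_out/
    # alongside the platform's own logs and metrics.
    with open(os.path.join(out_dir, "custom_metrics.json"), "w") as f:
        json.dump({"epochs": 5, "final_loss": 0.123}, f)
```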

Option A — from the Job Detail page

1. Dashboard → Jobs → click the finished job.

2. Click "Browse Job Files". This opens the Files page pre-navigated to jobs/{name}/.

Job Detail page — Browse Job Files button
The button appears on finished and aborted jobs.

Option B — from the Files page directly

1. Dashboard → Files → click your job folder.

2. Click files_out/ for logs, or model_out/ for the checkpoint.

Files page — files_out/ listing with training logs and metrics

Reading Logs & Metrics

Click any .txt or .json in files_out/ to open it in the Monaco viewer — no download needed.

File Viewer — training_log.txt with per-round metrics
Timestamped log: round number, loss, and metric values per round.

metrics.json contains structured per-round data:

File Viewer — metrics.json with best_round and best_miou
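
You can also parse the file outside the viewer. A minimal sketch, assuming the best_round and best_miou fields visible in the screenshot above plus an illustrative rounds list (adapt the keys to your actual file):

```python
import json

# Parse per-round metrics fetched from files_out/ (local path is illustrative).
with open("metrics.json") as f:
    metrics = json.load(f)

# best_round / best_miou are shown in the viewer screenshot above.
print(f"Best round: {metrics['best_round']} (mIoU {metrics['best_miou']:.4f})")

# A per-round list is an assumption; adjust to the real structure.
for entry in metrics.get("rounds", []):
    print(entry)
```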

Log types

| File | Contents |
| --- | --- |
| server.log | NVFlare FL server log: round start/end, aggregation events, client connections |
| client_{n}.log | Per-worker client log: local training steps, loss values, checkpoint writes |
| metrics.json | Structured per-round data: loss, accuracy, sample counts, round duration |
| fl_round_complete.json | Summary of each completed round, including all client contributions |

Downloading the Final Model

1. Navigate to model_out/.

2. Click the .pt checkpoint file.

3. Click "Download". A 1-hour presigned GET URL is generated and the download starts directly from Garage.

File Viewer — checkpoint.pt with Download button highlighted
Note: The checkpoint is a plain PyTorch state_dict. Load it at inference time with model.load_state_dict(torch.load("model_round_N.pt")). No NVFlare dependency required.
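
Spelled out as a small inference script (the model class name and input shape are placeholders; import your actual architecture, e.g. from your model_def.py):

```python
import torch
from model_def import MyModel  # placeholder: your own model class

# Rebuild the architecture, then load the aggregated FL weights.
model = MyModel()
state_dict = torch.load("model_round_N.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # inference mode

with torch.no_grad():
    output = model(torch.randn(1, 3, 224, 224))  # dummy input shape; use your own
```
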
Warning: Presigned download URLs expire after 1 hour. If the download times out on a slow connection, click Download again to get a fresh URL.
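
If you script downloads instead of clicking through the browser, the presigned URL works with any HTTP client while it is valid. A sketch using Python's requests (the URL is a placeholder; paste the real one from the Download button):

```python
import requests

# Placeholder: the presigned GET URL generated by the Download button.
presigned_url = "https://<garage-endpoint>/jobs/<name>/model_out/global_model.pt?X-Amz-Signature=..."

# Stream to disk so large checkpoints never sit fully in memory.
with requests.get(presigned_url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("global_model.pt", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```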

Handling Failed Jobs

If a job reaches FAILED state, check the error message on the Job Detail page first. Common causes:

| Error | Most likely cause | Fix |
| --- | --- | --- |
| Worker timeout | Shard download too slow, model_def.py import error, or missing dependency | Check your model_def.py imports and requirements.txt. Verify the shard file structure. |
| OOM crash | Batch size too large for the allocated GPU VRAM | Reduce batch_size in Step 2 and resubmit, or request a GPU with more VRAM. |
| Missing file | Required file not found in your bucket at the expected path | Re-upload the missing file and resubmit. Check the exact path in your config. |
| NCCL error | Transient network issue between workers on multi-node jobs | Retry; transient network errors typically resolve on resubmit. |
| name is not defined (server-side) | Missing import in model_def.py (e.g. import torch not present) | Add all required imports to the same file as your model class. |
| FileNotFoundError: manifest.ndjson | Shard ZIP does not contain manifest.ndjson at its root | Verify your shard ZIP structure (see the check below): the manifest must be at the root, not in a subdirectory. |
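
For the manifest error, a quick local check of the shard before resubmitting (the ZIP filename is illustrative):

```python
import zipfile

# manifest.ndjson must sit at the ZIP root, not inside a subdirectory.
with zipfile.ZipFile("shard_000.zip") as z:
    names = z.namelist()
    if "manifest.ndjson" in names:
        print("OK: manifest.ndjson is at the root")
    else:
        print("Not at root; first entries:", names[:5])
```
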
Note: Partial output in files_out/ may still be useful: logs from completed rounds are preserved even on failure. Check server.log for the exact error that triggered the failure.