
Jobs & Results

Track job progress in real time, understand job states, read training logs, and download model checkpoints.

Job States

| State | What is happening | What to do |
| --- | --- | --- |
| PENDING | Backend is validating paths, selecting workers, generating presigned shard URLs, and dispatching containers. Usually resolves in under 60 seconds. | Wait. Check that worker counts appear. |
| RUNNING | NVFlare server is up, all required workers are connected, and FL rounds are executing. Worker count shows N/N connected. | Monitor progress via logs or the NVFlare dashboard. |
| FINISHED | All rounds completed, or the job was manually stopped. Output is written to files_out/ and model_out/ in your workspace. | Download your model from model_out/. |
| FAILED | A worker crashed, timed out, or a validation error occurred. The error message is stored on the job. Partial output may exist in files_out/. | Read the error message. Check the logs. Fix and resubmit. |
| CANCELLED | Job was manually cancelled by the user. Partial output may exist. | Resubmit if needed. Check partial logs in files_out/. |

Jobs page — all states visible with worker counts and timestamps

Understanding Worker Counts

The worker count panel shows four numbers in real time:

| Counter | Meaning |
| --- | --- |
| Total | Number of GPU workers assigned to this job |
| Connected | Workers currently online and sending heartbeats to the NVFlare server |
| Errors | Workers that reported an error or crashed |
| Disconnected | Workers that lost connectivity (may be reassigned automatically) |

A healthy running job shows Connected = Total and Errors = 0. If Connected drops below the minimum client threshold for too long, the job may be marked FAILED.

Note: If a worker disconnects, the Coordinator detects the missing heartbeat and automatically selects a replacement. The replacement loads the last checkpoint and rejoins the FL round. You do not need to take any action.

Finding Your Results

When a job reaches FINISHED, output is written to two folders in your bucket:

| Path | Contents |
| --- | --- |
| jobs/{name}/files_out/ | Training logs, per-round metrics JSON, and any files your fl_train_model() writes to out_dir (see the sketch below) |
| jobs/{name}/model_out/ | Final aggregated model checkpoint after the last FL round (global_model.pt or similar) |
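
For orientation, here is a minimal sketch of writing a custom artifact from training code. This is hypothetical: only the fl_train_model() name and the out_dir destination come from this page; the signature and file contents are illustrative.

```python
import json
import os

def fl_train_model(model, data_loader, out_dir, **kwargs):
    # Signature is assumed: only the function name and out_dir are documented.
    # ... local training loop elided ...

    # Anything written under out_dir is collected into jobs/{name}/files_out/
    # alongside the platform's own logs and metrics.
    with open(os.path.join(out_dir, "custom_metrics.json"), "w") as f:
        json.dump({"epochs": 5, "final_loss": 0.123}, f)
```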

Option A — from the Job Detail page

1. Dashboard → Jobs → click the finished job.

2. Click "Browse Job Files". This opens the Files page pre-navigated to jobs/{name}/.

Job Detail page — Browse Job Files button
The button appears on finished and aborted jobs.

Option B — from the Files page directly

1. Dashboard → Files → click your job folder.

2. Click files_out/ for logs, or model_out/ for the checkpoint.

Files page — files_out/ listing with training logs and metrics

Reading Logs & Metrics

Click any .txt or .json in files_out/ to open it in the Monaco viewer — no download needed.

File Viewer — training_log.txt with per-round metrics
Timestamped log: round number, loss, and metric values per round.

metrics.json contains structured per-round data:

File Viewer — metrics.json with best_round and best_miou
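
You can also parse the file outside the viewer. A minimal sketch, assuming the best_round and best_miou fields visible in the screenshot above plus an illustrative rounds list (adapt the keys to your actual file):

```python
import json

# Parse per-round metrics fetched from files_out/ (local path is illustrative).
with open("metrics.json") as f:
    metrics = json.load(f)

# best_round / best_miou are shown in the viewer screenshot above.
print(f"Best round: {metrics['best_round']} (mIoU {metrics['best_miou']:.4f})")

# A per-round list is an assumption; adjust to the real structure.
for entry in metrics.get("rounds", []):
    print(entry)
```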

Log types

| File | Contents |
| --- | --- |
| server.log | NVFlare FL server log: round start/end, aggregation events, client connections |
| client_{n}.log | Per-worker client log: local training steps, loss values, checkpoint writes |
| metrics.json | Structured per-round data: loss, accuracy, sample counts, round duration |
| fl_round_complete.json | Summary of each completed round, including all client contributions |

Downloading the Final Model

1. Navigate to model_out/.

2. Click the .pt checkpoint file.

3. Click "Download". A 1-hour presigned GET URL is generated and the download starts directly from Garage.

File Viewer — checkpoint.pt with Download button highlighted
Note: The checkpoint is a plain PyTorch state_dict. Load it at inference time with model.load_state_dict(torch.load("model_round_N.pt")). No NVFlare dependency required.
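
Spelled out as a small inference script (the model class name and input shape are placeholders; import your actual architecture, e.g. from your model_def.py):

```python
import torch
from model_def import MyModel  # placeholder: your own model class

# Rebuild the architecture, then load the aggregated FL weights.
model = MyModel()
state_dict = torch.load("model_round_N.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # inference mode

with torch.no_grad():
    output = model(torch.randn(1, 3, 224, 224))  # dummy input shape; use your own
```
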
Warning: Presigned download URLs expire after 1 hour. If the download times out on a slow connection, click Download again to get a fresh URL.
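
If you script downloads instead of clicking through the browser, the presigned URL works with any HTTP client while it is valid. A sketch using Python's requests (the URL is a placeholder; paste the real one from the Download button):

```python
import requests

# Placeholder: the presigned GET URL generated by the Download button.
presigned_url = "https://<garage-endpoint>/jobs/<name>/model_out/global_model.pt?X-Amz-Signature=..."

# Stream to disk so large checkpoints never sit fully in memory.
with requests.get(presigned_url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("global_model.pt", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```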

Handling Failed Jobs

If a job reaches FAILED state, check the error message on the Job Detail page first. Common causes:

| Error | Most likely cause | Fix |
| --- | --- | --- |
| Worker timeout | Shard download too slow, model_def.py import error, or missing dependency | Check your model_def.py imports and requirements.txt. Verify the shard file structure. |
| OOM crash | Batch size too large for the allocated GPU VRAM | Reduce batch_size in Step 2 and resubmit, or request a GPU with more VRAM. |
| Missing file | Required file not found in your bucket at the expected path | Re-upload the missing file and resubmit. Check the exact path in your config. |
| NCCL error | Transient network issue between workers on multi-node jobs | Retry; transient network errors typically resolve on resubmit. |
| name is not defined (server-side) | Missing import in model_def.py (e.g. import torch not present) | Add all required imports to the same file as your model class. |
| FileNotFoundError: manifest.ndjson | Shard ZIP does not contain manifest.ndjson at its root | Verify your shard ZIP structure (see the check below): the manifest must be at the root, not in a subdirectory. |
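
For the manifest error, a quick local check of the shard before resubmitting (the ZIP filename is illustrative):

```python
import zipfile

# manifest.ndjson must sit at the ZIP root, not inside a subdirectory.
with zipfile.ZipFile("shard_000.zip") as z:
    names = z.namelist()
    if "manifest.ndjson" in names:
        print("OK: manifest.ndjson is at the root")
    else:
        print("Not at root; first entries:", names[:5])
```
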
Note: Partial output in files_out/ may still be useful: logs from completed rounds are preserved even on failure. Check server.log for the exact error that triggered the failure.