Requirements
The Python SDK handles all boilerplate: config generation, file packaging, upload via S3, and job submission. You provide your model class and training loop — the SDK generates everything else.
How Federated Training Works
The platform calls your training function once per FL round. Each round follows the same sequence:
| Phase | Who acts | What happens |
|---|---|---|
| Initialization | FL Server | CustomPersistor instantiates your model class. Initial weights come from here (e.g. ImageNet pretrained backbone). |
| Distribute | FL Server → all workers | Server sends current global weights to all FL clients via gRPC. |
| Local train | Each worker | Your fl_train_model() is called with payload containing global weights, data path, round number, and hyperparams. |
| Return | Each worker → FL Server | Your function returns updated weights + sample count. |
| Aggregate | FL Server | FedAvg: weighted average of all client updates by sample count. New global model saved. |
| Repeat | FL Server | Next round begins. Repeat for num_rounds total. |
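The aggregation step in the table above can be sketched in a few lines. This is a minimal pure-Python illustration of FedAvg's sample-count weighting, not the platform's actual implementation (which operates on full model tensors):

```python
def fedavg(client_updates):
    """Weighted-average client weight vectors by sample count (FedAvg).

    client_updates: list of (weights, samples), where weights is a flat
    list of floats and samples is that client's local sample count.
    """
    total = sum(samples for _, samples in client_updates)
    n = len(client_updates[0][0])
    new_global = [0.0] * n
    for weights, samples in client_updates:
        for i, w in enumerate(weights):
            new_global[i] += w * (samples / total)
    return new_global

# Two clients: the one holding 300 samples pulls the average toward its update.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(fedavg(updates))  # [2.5, 3.5]
```

This is also why returning an accurate sample count matters: a client that reports an inflated count dominates the weighted average.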
Round 0 detail
In round 0 there is nothing to aggregate yet: CustomPersistor instantiates your model class, and those initial weights (e.g. an ImageNet-pretrained backbone) are what the server distributes to every client.
Python SDK Quick Start
The SDK handles config generation, workspace upload, and job submission in a single call. You only need to provide your model class and shard directory:
The SDK generates model_def.py, NVFlare configs, and executor scripts, then uploads everything to your S3 bucket.
SDK Configuration Reference
ResonTechConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | str | — | REST API base URL (e.g. https://api.beta.reson.tech) |
| email | str | — | Account email |
| password | str | — | Account password |
| sftp_key_file | str | ~/.ssh/id_ed25519 | Private key path (legacy — SFTP is deprecated, use S3) |
TrainingConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_classes | int | 0 | Output classes (0 = infer from manifest.ndjson) |
| local_epochs | int | 2 | Local training epochs per FL round |
| batch_size | int | 32 | Mini-batch size |
| learning_rate | float | 0.001 | Optimizer learning rate |
| img_size | int | 224 | Input image size (square) |
| manifest_file | str | manifest.ndjson | Manifest filename inside each shard ZIP |
FederationConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_rounds | int | 5 | Total FL training rounds |
| min_clients | int | 1 | Minimum clients required per round |
| wait_time_after_min_received | int | 10 | Seconds to wait after min clients respond before aggregating |
| heart_beat_timeout | int | 600 | Client heartbeat timeout in seconds — clients missing this are dropped |
| job_name | str | rt_job | Name stored in job metadata |
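Putting the three tables together, a configuration might look like the sketch below. Only the field names and defaults come from the tables above; the import path is an assumption — check your installed SDK for the real module name:

```python
# Hypothetical import path -- adjust to match your installed SDK.
from reson_tech_sdk import ResonTechConfig, TrainingConfig, FederationConfig

rt = ResonTechConfig(
    base_url="https://api.beta.reson.tech",
    email="you@example.com",
    password="...",            # account password
)

training = TrainingConfig(
    num_classes=0,             # 0 = infer from manifest.ndjson
    local_epochs=2,
    batch_size=32,
    learning_rate=0.001,
    img_size=224,
)

federation = FederationConfig(
    num_rounds=5,
    min_clients=2,             # wait for at least two workers per round
    heart_beat_timeout=600,
    job_name="rt_job",
)
```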
Writing model_def.py Manually
If you prefer to write model_def.py directly (rather than using the SDK's auto-generation), it must follow this structure:
Your training logic lives inside fl_train_model(). The only difference from standalone training: receive weights at the start and return updated weights at the end.
Payload Schema
The payload dict your function receives each round:
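A minimal skeleton of fl_train_model() is sketched below. Only the initial_weights key (and the returned weights + sample count) are confirmed elsewhere in this guide; the other key names, and the return-dict keys, are assumptions to make the sketch runnable — check the generated executor for the actual schema:

```python
def fl_train_model(payload):
    """One FL round of local training (sketch; key names other than
    'initial_weights' are assumptions, not the official schema)."""
    weights = payload["initial_weights"]       # global weights from the server
    round_num = payload.get("round", 0)        # assumed key name
    lr = payload.get("learning_rate", 0.001)   # assumed key name

    # Stand-in for a real training loop: nudge each weight slightly.
    # Replace this with your optimizer steps over the local shard.
    updated = [w - lr * 0.5 for w in weights]

    # Return your shard's REAL sample count -- it drives FedAvg weighting.
    n_samples = 128

    return {"weights": updated, "samples": n_samples}

out = fl_train_model({"initial_weights": [1.0, 2.0], "round": 3})
```

Note the ordering: weights are loaded before any training happens, matching the distribute → local train → return sequence described earlier.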
Manifest Dataset Format (manifest.ndjson)
The SDK's default dataset adapter expects a manifest.ndjson file at the root of each shard ZIP. Each line is a JSON object:
| Field | Type | Description |
|---|---|---|
| id | str | Unique identifier for this sample |
| uri | str | Image URI — file:// for local paths, https:// for remote URLs (downloaded and cached) |
| y | list[int] | Label indices for multi-label classification (supports single-label too) |
| meta | dict | Optional metadata (not used by default adapter but available in payload) |
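To illustrate the format, here is a tiny manifest written and re-read with the standard library, including the max(y) + 1 class-count inference the SDK applies when num_classes is 0. This is a sketch of the file format, not SDK code:

```python
import json

samples = [
    {"id": "img-001", "uri": "file:///data/cats/001.jpg", "y": [0]},
    {"id": "img-002", "uri": "https://example.com/dog.jpg", "y": [1, 3],
     "meta": {"source": "web"}},
]

# ndjson: one JSON object per line, no enclosing array.
with open("manifest.ndjson", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Read it back and infer the class count: at least max(y) + 1.
with open("manifest.ndjson") as f:
    rows = [json.loads(line) for line in f]
num_classes = max(max(r["y"]) for r in rows) + 1
print(num_classes)  # 4
```

The multi-element y in the second sample shows the multi-label case; a single-label sample simply uses a one-element list.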
The num_classes value in TrainingConfig must be at least max(y) + 1 across all samples. Set it to 0 to let the SDK infer it from the manifest.
Advanced: Custom Executor & Persistor
Pass your own NVFlare subclasses to override the platform defaults. This is useful for custom aggregation strategies, non-standard model architectures, or custom checkpoint formats:
Advanced: Manual build + upload
Inspect the generated job folder before uploading:
Common Mistakes & Troubleshooting
Do not call model.train() before loading weights. Always load weights first: model.set_weights(payload["initial_weights"]), then switch to train mode.
The samples value you return affects FedAvg gradient weighting. Return the actual number of samples in your shard, not a constant. Incorrect sample counts cause one worker's updates to dominate the aggregation.
Keep all imports (import torch, import torch.nn as nn, etc.) in the same file as your model class when using the SDK. The SDK extracts them automatically — imports in another cell or file will not be included.
| Error | Cause | Fix |
|---|---|---|
| name 'nn' is not defined | Missing import in model_def.py | Add import torch.nn as nn to the same file as your class |
| FileNotFoundError: manifest.ndjson | Manifest not at root of shard ZIP | Ensure manifest.ndjson is at the root, not inside a subdirectory |
| WorkspaceError: SFTP connection failed | SSH key not registered (legacy SFTP path) | Use S3/rclone for file transfers instead |
| NCCL timeout | Transient network issue between workers | Retry — auto-recovery handles transient failures |
| OOM on worker | Batch size too large for GPU VRAM | Reduce batch_size in TrainingConfig |