Requirements
The Python SDK handles all boilerplate: config generation, file packaging, upload via S3, and job submission. You provide your model class and training loop — the SDK generates everything else.
How Federated Training Works
The platform calls your training function once per FL round. Each round follows the same sequence:
| Phase | Who acts | What happens |
|---|---|---|
| Initialization | FL Server | CustomPersistor instantiates your model class. Initial weights come from here (e.g. ImageNet pretrained backbone). |
| Distribute | FL Server → all workers | Server sends current global weights to all FL clients via gRPC. |
| Local train | Each worker | Your fl_train_model() is called with payload containing global weights, data path, round number, and hyperparams. |
| Return | Each worker → FL Server | Your function returns updated weights + sample count. |
| Aggregate | FL Server | FedAvg: weighted average of all client updates by sample count. New global model saved. |
| Repeat | FL Server | Next round begins. Repeat for num_rounds total. |
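The aggregation step in the table above can be sketched in a few lines. This is a minimal pure-Python illustration of FedAvg's sample-count weighting, not the platform's actual implementation (which operates on full model tensors):

```python
def fedavg(client_updates):
    """Weighted-average client weight vectors by sample count (FedAvg).

    client_updates: list of (weights, samples), where weights is a flat
    list of floats and samples is that client's local sample count.
    """
    total = sum(samples for _, samples in client_updates)
    n = len(client_updates[0][0])
    new_global = [0.0] * n
    for weights, samples in client_updates:
        for i, w in enumerate(weights):
            new_global[i] += w * (samples / total)
    return new_global

# Two clients: the one holding 300 samples pulls the average toward its update.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(fedavg(updates))  # [2.5, 3.5]
```

This is also why returning an accurate sample count matters: a client that reports an inflated count dominates the weighted average.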
Round 0 detail
In round 0 there is nothing to aggregate yet: CustomPersistor instantiates your model class, and those initial weights (e.g. an ImageNet-pretrained backbone) are what the server distributes to every client.
Python SDK Quick Start
The SDK handles config generation, workspace upload, and job submission in a single call. You only need to provide your model class and shard directory:
The SDK generates model_def.py, NVFlare configs, and executor scripts, then uploads everything to your S3 bucket.
SDK Configuration Reference
ResonTechConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | str | — | REST API base URL (e.g. https://api.beta.reson.tech) |
| email | str | — | Account email |
| password | str | — | Account password |
| sftp_key_file | str | ~/.ssh/id_ed25519 | Private key path (legacy — SFTP is deprecated, use S3) |
TrainingConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_classes | int | 0 | Output classes (0 = infer from manifest.ndjson) |
| local_epochs | int | 2 | Local training epochs per FL round |
| batch_size | int | 32 | Mini-batch size |
| learning_rate | float | 0.001 | Optimizer learning rate |
| img_size | int | 224 | Input image size (square) |
| manifest_file | str | manifest.ndjson | Manifest filename inside each shard ZIP |
FederationConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_rounds | int | 5 | Total FL training rounds |
| min_clients | int | 1 | Minimum clients required per round |
| wait_time_after_min_received | int | 10 | Seconds to wait after min clients respond before aggregating |
| heart_beat_timeout | int | 600 | Client heartbeat timeout in seconds — clients missing this are dropped |
| job_name | str | rt_job | Name stored in job metadata |
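Putting the three tables together, a configuration might look like the sketch below. Only the field names and defaults come from the tables above; the import path is an assumption — check your installed SDK for the real module name:

```python
# Hypothetical import path -- adjust to match your installed SDK.
from reson_tech_sdk import ResonTechConfig, TrainingConfig, FederationConfig

rt = ResonTechConfig(
    base_url="https://api.beta.reson.tech",
    email="you@example.com",
    password="...",            # account password
)

training = TrainingConfig(
    num_classes=0,             # 0 = infer from manifest.ndjson
    local_epochs=2,
    batch_size=32,
    learning_rate=0.001,
    img_size=224,
)

federation = FederationConfig(
    num_rounds=5,
    min_clients=2,             # wait for at least two workers per round
    heart_beat_timeout=600,
    job_name="rt_job",
)
```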
Writing model_def.py Manually
If you prefer to write model_def.py directly (rather than using the SDK's auto-generation), it must follow this structure:
Your training logic lives inside fl_train_model(). The only difference from standalone training: receive weights at the start and return updated weights at the end.
Payload Schema
The payload dict your function receives each round:
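A minimal skeleton of fl_train_model() is sketched below. Only the initial_weights key (and the returned weights + sample count) are confirmed elsewhere in this guide; the other key names, and the return-dict keys, are assumptions to make the sketch runnable — check the generated executor for the actual schema:

```python
def fl_train_model(payload):
    """One FL round of local training (sketch; key names other than
    'initial_weights' are assumptions, not the official schema)."""
    weights = payload["initial_weights"]       # global weights from the server
    round_num = payload.get("round", 0)        # assumed key name
    lr = payload.get("learning_rate", 0.001)   # assumed key name

    # Stand-in for a real training loop: nudge each weight slightly.
    # Replace this with your optimizer steps over the local shard.
    updated = [w - lr * 0.5 for w in weights]

    # Return your shard's REAL sample count -- it drives FedAvg weighting.
    n_samples = 128

    return {"weights": updated, "samples": n_samples}

out = fl_train_model({"initial_weights": [1.0, 2.0], "round": 3})
```

Note the ordering: weights are loaded before any training happens, matching the distribute → local train → return sequence described earlier.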
Manifest Dataset Format (manifest.ndjson)
The SDK's default dataset adapter expects a manifest.ndjson file at the root of each shard ZIP. Each line is a JSON object:
| Field | Type | Description |
|---|---|---|
| id | str | Unique identifier for this sample |
| uri | str | Image URI — file:// for local paths, https:// for remote URLs (downloaded and cached) |
| y | list[int] | Label indices for multi-label classification (supports single-label too) |
| meta | dict | Optional metadata (not used by default adapter but available in payload) |
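To illustrate the format, here is a tiny manifest written and re-read with the standard library, including the max(y) + 1 class-count inference the SDK applies when num_classes is 0. This is a sketch of the file format, not SDK code:

```python
import json

samples = [
    {"id": "img-001", "uri": "file:///data/cats/001.jpg", "y": [0]},
    {"id": "img-002", "uri": "https://example.com/dog.jpg", "y": [1, 3],
     "meta": {"source": "web"}},
]

# ndjson: one JSON object per line, no enclosing array.
with open("manifest.ndjson", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Read it back and infer the class count: at least max(y) + 1.
with open("manifest.ndjson") as f:
    rows = [json.loads(line) for line in f]
num_classes = max(max(r["y"]) for r in rows) + 1
print(num_classes)  # 4
```

The multi-element y in the second sample shows the multi-label case; a single-label sample simply uses a one-element list.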
The num_classes value in TrainingConfig must be at least max(y) + 1 across all samples. Set it to 0 to let the SDK infer it from the manifest.
Advanced: Custom Executor & Persistor
Pass your own NVFlare subclasses to override the platform defaults. This is useful for custom aggregation strategies, non-standard model architectures, or custom checkpoint formats:
Advanced: Manual build + upload
Inspect the generated job folder before uploading:
Common Mistakes & Troubleshooting
Do not call model.train() before loading weights. Always load weights first: model.set_weights(payload["initial_weights"]), then switch to train mode.
The samples value you return affects FedAvg gradient weighting. Return the actual number of samples in your shard, not a constant. Incorrect sample counts cause one worker's updates to dominate the aggregation.
Keep all imports (import torch, import torch.nn as nn, etc.) in the same file as your model class when using the SDK. The SDK extracts them automatically — imports in another cell or file will not be included.
| Error | Cause | Fix |
|---|---|---|
| name 'nn' is not defined | Missing import in model_def.py | Add import torch.nn as nn to the same file as your class |
| FileNotFoundError: manifest.ndjson | Manifest not at root of shard ZIP | Ensure manifest.ndjson is at the root, not inside a subdirectory |
| WorkspaceError: SFTP connection failed | SSH key not registered (legacy SFTP path) | Use S3/rclone for file transfers instead |
| NCCL timeout | Transient network issue between workers | Retry — auto-recovery handles transient failures |
| OOM on worker | Batch size too large for GPU VRAM | Reduce batch_size in TrainingConfig |