
FL Integration Guide

How to adapt your PyTorch model and training loop for the ResonTech federated learning platform — from model_def.py to the Python SDK.

Requirements

The Python SDK handles all boilerplate: config generation, file packaging, upload via S3, and job submission. You provide your model class and training loop — the SDK generates everything else.

How Federated Training Works

The platform calls your training function once per FL round. Each round follows the same sequence:

| Phase | Who acts | What happens |
| --- | --- | --- |
| Initialization | FL Server | CustomPersistor instantiates your model class. Initial weights come from here (e.g. an ImageNet-pretrained backbone). |
| Distribute | FL Server → all workers | Server sends current global weights to all FL clients via gRPC. |
| Local train | Each worker | Your fl_train_model() is called with a payload containing global weights, data path, round number, and hyperparams. |
| Return | Each worker → FL Server | Your function returns updated weights + sample count. |
| Aggregate | FL Server | FedAvg: weighted average of all client updates by sample count. New global model saved. |
| Repeat | FL Server | Next round begins. Repeat for num_rounds total. |
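The aggregation step above can be sketched in a few lines of plain Python: FedAvg averages each client's weights, weighted by that client's sample count. This is an illustrative sketch with flat weight lists, not the platform's implementation.

```python
# Illustrative FedAvg sketch (not the platform's implementation): average
# each position of the clients' weight vectors, weighted by sample count.
def fedavg(client_updates):
    """client_updates: list of (weights, sample_count) tuples."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(w[i] * n for w, n in client_updates) / total
        for i in range(dim)
    ]

# Two clients: one with 30 samples, one with 10 -> the first dominates.
new_global = fedavg([([1.0, 0.0], 30), ([0.0, 1.0], 10)])
# new_global == [0.75, 0.25]
```

This is also why returning an accurate sample count matters: the count is the weight in this average.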

Round 0 detail: in the first round there is no aggregated model yet, so clients receive the initial weights produced by the persistor (e.g. an ImageNet-pretrained backbone).

Python SDK Quick Start

The SDK handles config generation, workspace upload, and job submission in a single call. You only need to provide your model class and shard directory:

submit_job.py
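A minimal sketch of what a submission script might look like. It assumes the config classes documented in the reference below; the import path resontech_sdk, the submit_job entry point, and its keyword names (model_class, shard_dir, account, training, federation) are assumptions, not the SDK's confirmed API.

```python
# Sketch only: `resontech_sdk`, `submit_job`, and the keyword argument
# names are assumptions -- check the SDK reference for the real entry points.
import torch.nn as nn

from resontech_sdk import (  # assumed import path
    ResonTechConfig, TrainingConfig, FederationConfig, submit_job,
)

class MyModel(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(224 * 224 * 3, num_classes))

    def forward(self, x):
        return self.net(x)

submit_job(
    model_class=MyModel,                     # SDK extracts this class's source
    shard_dir="./shards",                    # directory of shard ZIPs
    account=ResonTechConfig(
        base_url="https://api.beta.reson.tech",
        email="you@example.com",
        password="...",
    ),
    training=TrainingConfig(num_classes=0,   # 0 = infer from manifest.ndjson
                            local_epochs=2, batch_size=32),
    federation=FederationConfig(num_rounds=5, min_clients=1),
)
```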
Note: The SDK extracts your model class source code automatically; no hand-written config files are needed. It generates model_def.py, NVFlare configs, and executor scripts, then uploads everything to your S3 bucket.

SDK Configuration Reference

ResonTechConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| base_url | str | | REST API base URL (e.g. https://api.beta.reson.tech) |
| email | str | | Account email |
| password | str | | Account password |
| sftp_key_file | str | ~/.ssh/id_ed25519 | Private key path (legacy; SFTP is deprecated, use S3) |

TrainingConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_classes | int | 0 | Output classes (0 = infer from manifest.ndjson) |
| local_epochs | int | 2 | Local training epochs per FL round |
| batch_size | int | 32 | Mini-batch size |
| learning_rate | float | 0.001 | Optimizer learning rate |
| img_size | int | 224 | Input image size (square) |
| manifest_file | str | manifest.ndjson | Manifest filename inside each shard ZIP |

FederationConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_rounds | int | 5 | Total FL training rounds |
| min_clients | int | 1 | Minimum clients required per round |
| wait_time_after_min_received | int | 10 | Seconds to wait after min clients respond before aggregating |
| heart_beat_timeout | int | 600 | Client heartbeat timeout in seconds; clients missing it are dropped |
| job_name | str | rt_job | Name stored in job metadata |

Writing model_def.py Manually

If you prefer to write model_def.py directly (rather than using the SDK's auto-generation), it must follow this structure:

model_def.py
Note: Your existing training loop (epochs, early stopping, scheduler, checkpointing) runs entirely inside fl_train_model(). The only difference: receive weights at the start and return updated weights at the end.
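A minimal, framework-agnostic sketch of that contract. The payload keys ("initial_weights", "hyperparams") and returned keys ("weights", "samples") follow this guide's descriptions, but the flat weight list and one-step update below are toy stand-ins for a real PyTorch state dict and training loop.

```python
# Sketch of the fl_train_model() contract; the toy update stands in for a
# real PyTorch loop, and key names should be verified against the SDK.
def fl_train_model(payload):
    # 1. Load the global weights FIRST (before model.train() etc.).
    weights = list(payload["initial_weights"])
    lr = payload.get("hyperparams", {}).get("learning_rate", 0.001)

    # 2. Your usual local loop goes here: epochs, scheduler, early stopping.
    #    Toy stand-in: one gradient step with an all-ones "gradient".
    weights = [w - lr * 1.0 for w in weights]

    # 3. Return updated weights plus the TRUE local sample count;
    #    it drives FedAvg weighting, so never return a constant.
    num_samples = payload["num_samples"]  # stand-in; in practice, len(dataset)
    return {"weights": weights, "samples": num_samples}
```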

Payload Schema

The payload dict your function receives each round:
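As an illustration, a payload consistent with the round description earlier in this guide (global weights, data path, round number, hyperparameters) might look like the following; the key names and value shapes are assumptions.

```python
# Hypothetical payload shape; key names are assumptions based on the
# round description earlier in this guide.
payload = {
    "initial_weights": {"backbone.conv1.weight": "...tensor..."},  # global state dict
    "data_path": "/workspace/shards/site-1.zip",                   # this worker's shard
    "round": 3,                                                    # current FL round
    "hyperparams": {"learning_rate": 0.001, "local_epochs": 2,
                    "batch_size": 32},
}
```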

Manifest Dataset Format (manifest.ndjson)

The SDK's default dataset adapter expects a manifest.ndjson file at the root of each shard ZIP. Each line is a JSON object:

| Field | Type | Description |
| --- | --- | --- |
| id | str | Unique identifier for this sample |
| uri | str | Image URI: file:// for local paths, https:// for remote URLs (downloaded and cached) |
| y | list[int] | Label indices for multi-label classification (supports single-label too) |
| meta | dict | Optional metadata (not used by the default adapter but available in the payload) |
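Two illustrative manifest lines (paths and labels are made up), parsed with the standard json module to show the one-JSON-object-per-line format:

```python
import json

# Illustrative manifest.ndjson content: one JSON object per line.
ndjson = """\
{"id": "img-0001", "uri": "file:///data/shard1/img-0001.jpg", "y": [0], "meta": {"source": "cam-a"}}
{"id": "img-0002", "uri": "https://example.com/img-0002.jpg", "y": [1, 3], "meta": {}}
"""

records = [json.loads(line) for line in ndjson.splitlines()]
```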
Note: The num_classes value in TrainingConfig must be at least max(y) + 1 across all samples. Set it to 0 to let the SDK infer it from the manifest.
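That inference rule amounts to taking the largest label index across all samples and adding one; a sketch of the computation over parsed manifest records (the SDK's actual implementation is not shown here):

```python
# num_classes must be >= max(y) + 1; with num_classes=0 the SDK infers it.
# Sketch of that inference over parsed manifest records:
records = [{"y": [0]}, {"y": [1, 3]}, {"y": [2]}]
num_classes = 1 + max(label for r in records for label in r["y"])
```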

Advanced: Custom Executor & Persistor

Pass your own NVFlare subclasses to override the platform defaults. This is useful for custom aggregation strategies, non-standard model architectures, or custom checkpoint formats:
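A skeleton of such subclasses using NVFlare's public base classes. The base classes and method names are NVFlare's, but how the ResonTech SDK consumes these classes (for example via executor/persistor arguments at submission time) is an assumption.

```python
# Skeleton only: the base classes come from NVFlare; the bodies you fill in
# (and how the SDK accepts these classes) depend on your use case.
from nvflare.apis.executor import Executor
from nvflare.app_common.abstract.model_persistor import ModelPersistor

class MyPersistor(ModelPersistor):
    def load_model(self, fl_ctx):
        ...  # e.g. build the model and load a custom checkpoint format

    def save_model(self, ml, fl_ctx):
        ...  # e.g. persist the aggregated model in your own format

class MyExecutor(Executor):
    def execute(self, task_name, shareable, fl_ctx, abort_signal):
        ...  # custom per-round client-side logic
```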

Advanced: Manual build + upload

Inspect the generated job folder before uploading:

Common Mistakes & Troubleshooting

Warning: Do not call model.train() before loading weights. Always load the weights first with model.set_weights(payload["initial_weights"]), then switch to train mode.

Warning: The samples value you return drives the weighting in FedAvg aggregation. Return the actual number of samples in your shard, not a constant; incorrect sample counts cause one worker's updates to dominate the aggregation.

Warning: Put all imports (import torch, import torch.nn as nn, etc.) in the same file as your model class when using the SDK. The SDK extracts them automatically; imports in another cell or file will not be included.
| Error | Cause | Fix |
| --- | --- | --- |
| name 'nn' is not defined | Missing import in model_def.py | Add import torch.nn as nn to the same file as your class |
| FileNotFoundError: manifest.ndjson | Manifest not at root of shard ZIP | Ensure manifest.ndjson is at the root, not inside a subdirectory |
| WorkspaceError: SFTP connection failed | SSH key not registered (legacy SFTP path) | Use S3/rclone for file transfers instead |
| NCCL timeout | Transient network issue between workers | Retry; auto-recovery handles transient failures |
| OOM on worker | Batch size too large for GPU VRAM | Reduce batch_size in TrainingConfig |