Approach — Federated Learning on ResonTech

Overview

Every submitted job — whether through the SDK or the web wizard — renders the same six files and posts them to the platform: three scripts (model_def.py, custom_client_executor.py, custom_persistor.py) and three configs (config_fed_client.json, config_fed_server.json, meta.json). The platform takes them from there.

This page explains why the defaults look the way they do — what each piece does on a worker, how rounds aggregate, and the empirical tuning that produced the recipe behind the examples catalog.

Anatomy of an FL Job

Every submitted job has the same six artefacts. Everything else (shards, requirements, optional warm-start) wraps around them.

Two of the three Python scripts (custom_client_executor.py and custom_persistor.py) are byte-identical across all examples in the job catalog; only model_def.py changes. The FL plumbing is task-agnostic.

Server Workflow — Aggregating Rounds

The default server workflow is a Scatter-and-Gather loop. In one full FL round, the server scatters the current global weights out to the clients, then gathers their updated weights back in.

The default aggregator is classic FedAvg, weighted by each site's sample count. The shareable generator packages the full state dict into the transport envelope.
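As a minimal sketch of sample-weighted averaging (not the platform's aggregator code), the function below averages per-client state dicts represented as plain Python lists. The name `fedavg` and the list-based stand-in for tensors are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a tensor state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def fedavg(client_states: List[StateDict], sample_counts: List[int]) -> StateDict:
    """Return the sample-weighted mean of the clients' state dicts."""
    if not client_states or len(client_states) != len(sample_counts):
        raise ValueError("need one sample count per client state")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: StateDict = {}
    for key in client_states[0]:
        length = len(client_states[0][key])
        # Each client's value is weighted by its share of the total samples.
        averaged[key] = [
            sum(n * state[key][i] for state, n in zip(client_states, sample_counts)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    site_a = {"w": [1.0, 0.0]}  # trained on 100 samples
    site_b = {"w": [3.0, 4.0]}  # trained on 300 samples
    print(fedavg([site_a, site_b], [100, 300]))
```

A site that trained on three times as many samples pulls the global weights three times as hard, which is why the executor reports the sample count alongside the state dict.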

Swap the workflow

FederationConfig exposes the class paths for the workflow and the aggregator, so you can drop in FedProx, FedOpt, cyclic, hierarchical, or any other compatible workflow.

See the training hyperparameters reference for the full field surface.

Client Executor — The Per-Round Loop

On each worker, custom_client_executor.py drives the per-round loop. The default GenericClientExecutor:

1. Receives the global weights from the server.
2. Loads them into your model class (imported from model_def.py).
3. Reads training args from config_fed_client.json → executor.args (your TrainingConfig).
4. Plumbs CURRENT_ROUND, NUM_ROUNDS, SITE_NAME, DATA_ROOT, STATE_DIR, plus every key from TrainingConfig.extra, into env.
5. Calls your fl_train_model(model, env, out_dir, logger) (or whichever name you set in ModelConfig.train_fn).
6. Returns the new state dict and the sample count back to the server.

STATE_DIR is a per-site, per-job directory that survives across rounds on the same worker. Use it to persist optimizer state, EMA buffers, dataset cursors — anything that needs continuity within a client's view.
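The skeleton below is a sketch of how a training function might read the round information from env and keep a small piece of state in STATE_DIR between rounds. The file name `cursor.json`, the `int(...)` casts, the fixed step count, and the return value are all assumptions for illustration; the function's real return contract is not described on this page. Optimizer state or EMA buffers would follow the same restore-then-save pattern.

```python
import json
import logging
import os
import tempfile


def fl_train_model(model, env, out_dir, logger):
    """Skeleton only: reads round info from env and keeps a counter in STATE_DIR."""
    current_round = int(env["CURRENT_ROUND"])  # assumption: values may arrive as strings
    num_rounds = int(env["NUM_ROUNDS"])
    state_path = os.path.join(env["STATE_DIR"], "cursor.json")  # hypothetical file name

    # Restore whatever the previous round on this worker left behind.
    cursor = {"steps_done": 0}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            cursor = json.load(fh)

    steps_this_round = 148  # illustrative; real code would iterate a data loader
    cursor["steps_done"] += steps_this_round
    logger.info(
        "site=%s round=%d of %d steps_done=%d",
        env["SITE_NAME"], current_round, num_rounds, cursor["steps_done"],
    )

    # Persist for the next round on the same worker.
    os.makedirs(env["STATE_DIR"], exist_ok=True)
    with open(state_path, "w") as fh:
        json.dump(cursor, fh)
    return cursor["steps_done"]  # assumption: return value chosen for the demo only


if __name__ == "__main__":
    state_dir = tempfile.mkdtemp()
    demo_env = {"CURRENT_ROUND": "0", "NUM_ROUNDS": "10",
                "SITE_NAME": "site-1", "STATE_DIR": state_dir}
    log = logging.getLogger("demo")
    print(fl_train_model(None, demo_env, state_dir, log))  # first round
    demo_env["CURRENT_ROUND"] = "1"
    print(fl_train_model(None, demo_env, state_dir, log))  # counter continues
```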

Persistor — The Global Model on Disk

custom_persistor.py runs on the server side and is the only component that touches the global model file:

Initial weights: if you passed model_checkpoint=, the persistor loads it before round 0. Otherwise it instantiates your model class fresh.
Per-round saves: after each aggregation, the global state dict is written to files_out/GLOBAL_MODEL.pt (rolling) and files_out/GLOBAL_MODEL_round_N.pt (per-round).
Final output: on workflow end, the last aggregated weights land in model_out/checkpoint.pt — that's the file the web UI's Download button serves.

FedAvg — Strengths and Failure Modes

FedAvg is the default for one reason: it works without a server-side optimizer state, so the workflow stays stateless and the persistor stays simple. But it has a known dilution effect — empirically observed in the platform's reference job:

Per-round averaging is a regularizer

Four clients each take 148 local steps, then the global model becomes the mean of the four resulting weight tensors. The global trajectory is therefore much shorter than 4 × 148 sequential steps would have been. In practice this acts as an implicit regularizer — on the Food-101 reference recipe, the FL version beat a single-GPU H100 centralized run on the same total step count because the centralized recipe overfit by epoch 5 while FedAvg's averaging prevented that collapse. See the run log.

Two client-side fixes close most of the gap to centralized

Direct measurement on the reference job, vs. the HuggingFace centralized recipe:

Continuous LR schedule across rounds. If the LR lambda recomputes its position from current_round × steps_per_round + local_step instead of resetting per round, you don't hit peak LR num_rounds times. Closes ~62 % of the top-1 gap on its own.
Persisted per-client Adam state. Adam's m/v exponential averages need hundreds of steps to stabilize; rebuilding them every 148 steps gives every round a 10-15 step warmup tax. Save to STATE_DIR/optimizer_state.pt at end of round, restore at start of next.

The default GenericClientExecutor ships both of these fixes built in — they're the platform's recommended baseline.
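The continuous schedule can be sketched as a pure function of the round and the local step. This is an illustrative version assuming linear warmup followed by linear decay to zero; the function name and the `warmup_steps` parameter are hypothetical and not taken from GenericClientExecutor.

```python
def global_lr_factor(current_round: int, local_step: int, steps_per_round: int,
                     num_rounds: int, warmup_steps: int = 0) -> float:
    """LR multiplier computed from the position in the whole run, not in the round."""
    total_steps = num_rounds * steps_per_round
    # Position in the full run: this is what keeps the schedule continuous.
    global_step = current_round * steps_per_round + local_step

    if global_step < warmup_steps:
        return global_step / warmup_steps
    remaining = total_steps - global_step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    # With 10 rounds of 148 steps, the factor at the start of each round keeps
    # falling instead of jumping back to its peak.
    for rnd in (0, 5, 9):
        print(rnd, global_lr_factor(rnd, 0, 148, 10))
```

In a PyTorch training function, a closure over `current_round` that calls this helper could be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, with the scheduler's own step counter supplying `local_step`.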

Empirical Lessons (from the job_classify Tuning Playbook)

These are non-obvious failure modes observed during the platform's reference port of the HuggingFace Food-101 recipe to 4-worker FedAvg. Full source: TUNING_PLAYBOOK.md.

EMA over short FL rounds is a trap

The standard EMA decay of 0.999 was tuned for runs of thousands of steps. With only 74-148 local steps per round, 0.999¹⁴⁸ ≈ 0.86, so about 86 % of the EMA's weight still sits on the round-start weights after 148 steps. Submitting that EMA back to FedAvg means each round contributes only ~14 % of the intended progress. The fix is either (a) a lower decay (~0.95) so the EMA actually moves with training, or (b) persisting the EMA across rounds so the decay accumulates over the full run.
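The arithmetic is easy to check. The snippet below is a small sketch (the helper name is hypothetical) that computes how much of the EMA is still the round-start weights after a given number of steps, assuming the EMA is re-initialised from the global weights at the start of each round.

```python
def round_start_weight(decay: float, steps: int) -> float:
    """Fraction of the EMA still attributable to the weights it started from."""
    return decay ** steps


if __name__ == "__main__":
    for decay, steps in [(0.999, 74), (0.999, 148), (0.95, 148)]:
        frac = round_start_weight(decay, steps)
        print(f"decay={decay} steps={steps} round-start share={frac:.4f}")
```

At decay 0.999 the round-start share stays above 0.86 across the 74-148 step range, while at 0.95 it is effectively zero after 148 steps.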

Label smoothing inflates measured val_loss

Even when it doesn't hurt top-1, a smoothing-trained model is calibrated for soft targets; eval uses hard CE so the reported val_loss goes up. Only use label smoothing if your headline metric is computed with smoothing.
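A worked number makes the effect concrete. With smoothing factor ε spread uniformly over K classes, a model that fits the smoothed targets exactly puts probability 1 − ε + ε/K on the true class, so its hard cross-entropy cannot drop below −log(1 − ε + ε/K). The sketch below evaluates that floor for Food-101's 101 classes; ε = 0.1 is an assumed value, not one taken from the reference job.

```python
import math


def hard_ce_floor(epsilon: float, num_classes: int) -> float:
    """Hard-label CE of a model that matches uniformly smoothed targets exactly."""
    p_true = 1.0 - epsilon + epsilon / num_classes
    return -math.log(p_true)


if __name__ == "__main__":
    # epsilon=0.1 is an illustrative assumption; 101 is the Food-101 class count.
    print(round(hard_ce_floor(0.1, 101), 4))
    print(hard_ce_floor(0.0, 101))  # no smoothing: the floor is zero
```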

More rounds is not free

Every round adds aggregation overhead (upload + aggregate + redistribute, ~50 s per round per worker on production hardware). Doubling rounds doubles that overhead even when local work shrinks. Worth it only if the regularization effect outweighs the comm cost.
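A back-of-the-envelope estimate shows the trade-off. The sketch below assumes a fixed total amount of local work split across rounds and a hypothetical per-step time; only the ~50 s per-round overhead comes from the text above.

```python
def estimated_wall_clock(total_local_steps: int, num_rounds: int,
                         seconds_per_step: float,
                         overhead_per_round: float = 50.0) -> float:
    """Per-worker wall clock: compute time plus per-round aggregation overhead."""
    compute = total_local_steps * seconds_per_step
    communication = num_rounds * overhead_per_round
    return compute + communication


if __name__ == "__main__":
    # seconds_per_step=1.0 is an illustrative assumption.
    for rounds in (5, 10, 20):
        print(rounds, estimated_wall_clock(740, rounds, seconds_per_step=1.0))
```

The compute term stays constant while the communication term grows linearly with the round count.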

Bigger batch ≠ free speed in FL

The reference job hit a wall-clock anomaly: batch 256 / 10 rounds took longer than batch 128 / 10 rounds despite half the local steps. Larger-batch paths can have higher per-batch overhead in the federated wrapper that doesn't amortize linearly. Validate batch/LR scaling in isolation before stacking it with other changes.

TTA at eval is free top-1

Horizontal-flip (hflip) averaging at inference time bought ~0.2-0.4 pp top-1 across every measured checkpoint, with no training change. It works for symmetry-invariant tasks (most images), but not for class-asymmetric ones where mirroring changes the label (text rotations, etc.).
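Hflip averaging amounts to running the model on the image and on its mirror, then averaging the two outputs. The sketch below uses nested lists as a stand-in for image tensors and an arbitrary `predict` callable; both are assumptions for illustration, not the platform's eval code.

```python
from typing import Callable, List

Image = List[List[float]]  # rows of pixel values; hypothetical stand-in for a tensor


def hflip(image: Image) -> Image:
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def predict_with_hflip_tta(predict: Callable[[Image], List[float]],
                           image: Image) -> List[float]:
    """Average the class scores of the original and the mirrored image."""
    original = predict(image)
    mirrored = predict(hflip(image))
    return [(a + b) / 2 for a, b in zip(original, mirrored)]


if __name__ == "__main__":
    # Toy "model" whose scores depend on which side of the image is brighter.
    def toy_predict(img: Image) -> List[float]:
        return [img[0][0], img[0][-1]]

    print(predict_with_hflip_tta(toy_predict, [[0.2, 0.8]]))
```

The toy model is deliberately flip-sensitive, so averaging cancels its left/right bias; for a task where mirroring changes the label, the same averaging would blur the signal instead.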

When tuning a similar federated port, change one thing at a time. Stacked deviations (e.g. EMA + label smoothing + 10 rounds in one run) regressed by ~2 pp on the reference job and the root cause could only be isolated by reverting and re-running each component alone.

Literature Context

The platform's defaults align with the published FL literature:

Server-side optimizers (FedAdam, FedYogi) are the canonical next step beyond client-only fixes — Reddi et al., Adaptive Federated Optimization, ICLR 2021.
Linear LR decay to zero is empirically strongest for transformer fine-tuning — Why linearly decaying the learning rate to zero works best (2025). Matches the default the platform ships.
PEFT / LoRA for massive models — Empowering Federated Learning for Massive Models (Roth et al., 2024). Cuts comm cost dramatically by transmitting only adapter weights — the right approach for job_llm and job_agent.
Beware server-side optimizers with BatchNorm. BN running stats aren't in model.named_parameters(), so a server-side optimizer never touches them — they end up uncoordinated across clients. Doesn't bite ViT-based jobs (LayerNorm only), but if you swap to a ResNet/MobileNet backbone, use GroupNorm instead (per Reddi et al.).

Cookbook — Porting a Centralized Recipe

If you're starting from a centralized HuggingFace recipe and porting it to N-worker FedAvg on the platform, in roughly this order:

1. Port literally first. Build the FL job with the exact centralized recipe (same backbone, optimizer, LR, batch, AMP, grad-clip, seed, preprocessing, no extra augmentation). Run it. Measure the gap.
2. Fix the schedule. If LR resets to peak each round, plumb current_round + total_rounds from env into your training fn and compute LR globally. The default GenericClientExecutor already does this.
3. Persist the optimizer. Save Adam's state dict to STATE_DIR/optimizer_state.pt at end of round; restore at start of next. Combined with (2), typically closes 50-70 % of the centralized-vs-FL gap.
4. Add TTA at eval. Free 0.2-0.4 pp top-1 for symmetry-invariant tasks. One flag, no retraining.
5. Don't add EMA / label smoothing / aggressive batch-LR scaling without isolated A/B tests. Each has FL-specific failure modes.
6. If a residual gap remains, try a server-side optimizer. Swap the aggregator via FederationConfig.aggregator_path. Caveat: incompatible with BatchNorm.

Next Steps

Submit your first FL job with the SDK
Browse the FL job catalog — nine ready-to-run recipes.
Custom executor / persistor for behaviour the defaults can't express.
