Overview
Every submitted job — whether through the SDK or the web wizard — renders the same six files and posts them to the platform: three scripts (model_def.py, custom_client_executor.py, custom_persistor.py) and three configs (config_fed_client.json, config_fed_server.json, meta.json). The platform takes them from there.
This page explains why the defaults look the way they do — what each piece does on a worker, how rounds aggregate, and the empirical tuning that produced the recipe behind the examples catalog.
Anatomy of an FL Job
Every submitted job has the same six artefacts. Everything else (shards, requirements, optional warm-start) wraps around them.
Of the three Python scripts, custom_client_executor.py and custom_persistor.py are byte-identical across all examples in the job catalog; only model_def.py changes. The FL plumbing is task-agnostic.
Server Workflow — Aggregating Rounds
The default server workflow is a Scatter-and-Gather loop: in each FL round the server scatters the current global weights out to the clients, then gathers their updated weights back in.
The default aggregator is classic FedAvg, weighted by each site's samples count. The shareable generator packages the full state dict into the transport envelope.
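The sample-weighted averaging described above can be sketched in a few lines. This is a minimal illustration, not the platform's aggregator: it assumes each client update arrives as a (state dict, samples) pair and uses plain Python lists in place of tensors. The name fedavg and the StateDict alias are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def fedavg(updates: List[Tuple[StateDict, int]]) -> StateDict:
    """Return the mean of the client state dicts, weighted by sample count."""
    total = sum(samples for _, samples in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: StateDict = {}
    for name, values in updates[0][0].items():
        acc = [0.0] * len(values)
        for state, samples in updates:
            weight = samples / total  # this site's share of all samples
            for i, v in enumerate(state[name]):
                acc[i] += weight * v
        averaged[name] = acc
    return averaged


updates = [
    ({"w": [1.0, 2.0]}, 100),  # site A trained on 100 samples
    ({"w": [3.0, 6.0]}, 300),  # site B trained on 300 samples
]
print(fedavg(updates))  # {'w': [2.5, 5.0]}
```

Site B contributes three quarters of the samples, so the result sits three quarters of the way from site A's weights to site B's.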
Swap the workflow
FederationConfig exposes the workflow and aggregator class paths, so you can drop in FedProx, FedOpt, cyclic, hierarchical, or any other compatible workflow.
See the training hyperparameters reference for the full field surface.
Client Executor — The Per-Round Loop
On each worker, custom_client_executor.py drives the per-round loop. The default GenericClientExecutor:
- Receives the global weights from the server.
- Loads them into your model class (imported from model_def.py).
- Reads training args from config_fed_client.json → executor.args (your TrainingConfig).
- Plumbs CURRENT_ROUND, NUM_ROUNDS, SITE_NAME, DATA_ROOT, STATE_DIR, plus every key from TrainingConfig.extra, into env.
- Calls your fl_train_model(model, env, out_dir, logger) (or whichever name you set in ModelConfig.train_fn).
- Returns the new state dict + samples count back to the server.
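As a rough sketch of the training-function side of this loop, the skeleton below reads the plumbed keys out of env. It assumes env behaves like a mapping and that values may arrive as strings; neither is confirmed here. RoundContext and read_round_context are hypothetical helpers, and the training loop and the function's return contract are deliberately left out.

```python
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping


@dataclass
class RoundContext:
    """Hypothetical container for the per-round values found in env."""
    current_round: int
    num_rounds: int
    site_name: str
    data_root: Path
    state_dir: Path


def read_round_context(env: Mapping[str, Any]) -> RoundContext:
    # Assumption: values may be strings, so cast defensively.
    return RoundContext(
        current_round=int(env["CURRENT_ROUND"]),
        num_rounds=int(env["NUM_ROUNDS"]),
        site_name=str(env["SITE_NAME"]),
        data_root=Path(env["DATA_ROOT"]),
        state_dir=Path(env["STATE_DIR"]),
    )


def fl_train_model(model, env, out_dir, logger):
    """Skeleton only: training and the platform's return contract are omitted."""
    ctx = read_round_context(env)
    logger.info("site=%s round=%d of %d", ctx.site_name, ctx.current_round, ctx.num_rounds)
    # ... train `model` on data under ctx.data_root; keep cross-round state in ctx.state_dir ...


example_env = {
    "CURRENT_ROUND": "3", "NUM_ROUNDS": "10", "SITE_NAME": "site-1",
    "DATA_ROOT": "/data", "STATE_DIR": "/state",
}
print(read_round_context(example_env))
```

Collecting the keys once at the top of the function keeps the rest of the training code independent of how the executor encodes them.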
STATE_DIR is a per-site, per-job directory that survives across rounds on the same worker. Use it to persist optimizer state, EMA buffers, dataset cursors: anything that needs continuity within a client's view.
Persistor — The Global Model on Disk
custom_persistor.py runs on the server side and is the only component that touches the global model file:
- Initial weights: if you passed model_checkpoint=, the persistor loads it before round 0. Otherwise it instantiates your model class fresh.
- Per-round saves: after each aggregation, the global state dict is written to files_out/GLOBAL_MODEL.pt (rolling) and files_out/GLOBAL_MODEL_round_N.pt (per-round).
- Final output: on workflow end, the last aggregated weights land in model_out/checkpoint.pt, which is the file the web UI's Download button serves.
FedAvg — Strengths and Failure Modes
FedAvg is the default for one reason: it works without a server-side optimizer state, so the workflow stays stateless and the persistor stays simple. But it has a known dilution effect — empirically observed in the platform's reference job:
Per-round averaging is a regularizer
Four clients each take 148 local steps, then the global model becomes the mean of the four resulting weight tensors. The global trajectory is therefore much shorter than 4 × 148 sequential steps would have been. In practice this acts as an implicit regularizer — on the Food-101 reference recipe, the FL version beat a single-GPU H100 centralized run on the same total step count because the centralized recipe overfit by epoch 5 while FedAvg's averaging prevented that collapse. See the run log.
Two client-side fixes close most of the gap to centralized
Direct measurement on the reference job, vs. the HuggingFace centralized recipe:
- Continuous LR schedule across rounds. If the LR lambda recomputes its position from current_round × steps_per_round + local_step instead of resetting per round, you don't hit peak LR num_rounds times. Closes ~62 % of the top-1 gap on its own.
- Persisted per-client Adam state. Adam's m/v exponential averages need hundreds of steps to stabilize; rebuilding them every 148 steps gives every round a 10-15 step warmup tax. Save to STATE_DIR/optimizer_state.pt at end of round, restore at start of next.
The default GenericClientExecutor ships both of these fixes built in — they're the platform's recommended baseline.
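To make the schedule fix concrete, here is a minimal sketch of an LR multiplier that computes its position from current_round × steps_per_round + local_step, next to one that resets every round. It assumes a linear decay to zero with no warmup and a 0-based current_round; the function names are hypothetical and this is not the executor's actual code.

```python
def global_lr_scale(current_round: int, local_step: int,
                    steps_per_round: int, num_rounds: int) -> float:
    """LR multiplier from the position in the whole run (linear decay to zero)."""
    total_steps = steps_per_round * num_rounds
    global_step = current_round * steps_per_round + local_step
    return max(0.0, 1.0 - global_step / total_steps)


def per_round_lr_scale(local_step: int, steps_per_round: int) -> float:
    """The problematic variant: the schedule restarts at peak LR every round."""
    return max(0.0, 1.0 - local_step / steps_per_round)


# Compare both at the first step of each round, for 148 steps x 10 rounds.
for rnd in (0, 5, 9):
    print(rnd, global_lr_scale(rnd, 0, 148, 10), per_round_lr_scale(0, 148))
```

With the global variant, round 5 starts at half the peak LR; with the per-round variant, every round starts at the peak again. A closure over current_round could pass the global variant to a scheduler such as torch.optim.lr_scheduler.LambdaLR.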
Empirical Lessons (from the job_classify Tuning Playbook)
These are non-obvious failure modes observed during the platform's reference port of the HuggingFace Food-101 recipe to 4-worker FedAvg. Full source: TUNING_PLAYBOOK.md.
EMA over short FL rounds is a trap
Standard EMA decay 0.999 was tuned for thousand-step training. Over 74-148 local steps per round the EMA barely moves: 0.999^148 ≈ 0.86, so about 86 % of the EMA weight still sits on the round-start weights. Submitting that EMA back to FedAvg means each round contributes only ~14 % of intended progress. The fix is either (a) lower decay (~0.95) so EMA actually moves with training, or (b) persist EMA across rounds so the decay accumulates over the full run.
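The arithmetic behind this can be checked directly. The sketch below assumes the EMA is re-initialized from the round-start weights each round and updated as ema = decay × ema + (1 − decay) × w, so after n steps the round-start weights keep a share of decay^n. The helper name is hypothetical.

```python
def round_start_share(decay: float, steps: int) -> float:
    """Fraction of the EMA still attributable to the round-start weights."""
    return decay ** steps


for decay in (0.999, 0.95):
    for steps in (74, 148):
        share = round_start_share(decay, steps)
        print(f"decay={decay} steps={steps} round-start share={share:.4f}")
```

At decay 0.999 the share stays above 0.86 even after 148 steps, while at decay 0.95 it falls below 0.001, which is why the lower decay lets the EMA track training within a single round.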
Label smoothing inflates measured val_loss
Even when it doesn't hurt top-1, a smoothing-trained model is calibrated for soft targets; eval uses hard CE so the reported val_loss goes up. Only use label smoothing if your headline metric is computed with smoothing.
More rounds is not free
Every round adds aggregation overhead (upload + aggregate + redistribute, ~50 s per round per worker on production hardware). Doubling rounds doubles that overhead even when local work shrinks. Worth it only if the regularization effect outweighs the comm cost.
Bigger batch ≠ free speed in FL
The reference job hit a wall-clock anomaly: batch 256 / 10 rounds took longer than batch 128 / 10 rounds despite half the local steps. Larger-batch paths can have higher per-batch overhead in the federated wrapper that doesn't amortize linearly. Validate batch/LR scaling in isolation before stacking it with other changes.
TTA at eval is free top-1
Hflip averaging at inference time bought ~0.2-0.4 pp top-1 across every measured checkpoint, with no training change. Works for symmetry-invariant tasks (most images); not for class-asymmetric ones (text rotations, etc.).
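Hflip averaging amounts to running the model on the image and on its mirror, then averaging the two probability vectors. The sketch below is illustrative only: it represents an image as rows of floats and takes any predict callable, and the names hflip, predict_with_hflip_tta and toy_predict are hypothetical rather than platform APIs.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # hypothetical: H rows of W pixel values


def hflip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def predict_with_hflip_tta(predict: Callable[[Image], Sequence[float]],
                           image: Image) -> List[float]:
    """Average class probabilities over the original and the flipped image."""
    probs = predict(image)
    probs_flipped = predict(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(probs, probs_flipped)]


def toy_predict(image: Image) -> List[float]:
    """Toy two-class stand-in for a model, driven by the top-left pixel."""
    score = image[0][0]
    return [score, 1.0 - score]


print(predict_with_hflip_tta(toy_predict, [[0.8, 0.4]]))
```

The averaged output is still a valid probability vector, because each of the two inputs sums to one.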
Literature Context
The platform's defaults align with the published FL literature:
- Server-side optimizers (FedAdam, FedYogi) are the canonical next step beyond client-only fixes — Reddi et al., Adaptive Federated Optimization, ICLR 2021.
- Linear LR decay to zero is empirically strongest for transformer fine-tuning — Why linearly decaying the learning rate to zero works best (2025). Matches the default the platform ships.
- PEFT / LoRA for massive models — Empowering Federated Learning for Massive Models (Roth et al., 2024). Cuts comm cost dramatically by transmitting only adapter weights — the right approach for job_llm and job_agent.
- Beware server-side optimizers with BatchNorm. BN running stats aren't in model.named_parameters(), so a server-side optimizer never touches them and they end up uncoordinated across clients. Doesn't bite ViT-based jobs (LayerNorm only), but if you swap to a ResNet/MobileNet backbone, use GroupNorm instead (per Reddi et al.).
Cookbook — Porting a Centralized Recipe
If you're starting from a centralized HuggingFace recipe and porting it to N-worker FedAvg on the platform, work through these steps in roughly this order:
1. Port literally first. Build the FL job with the exact centralized recipe (same backbone, optimizer, LR, batch, AMP, grad-clip, seed, preprocessing, no extra augmentation). Run it. Measure the gap.
2. Fix the schedule. If LR resets to peak each round, plumb current_round + total_rounds from env into your training fn and compute LR globally. The default GenericClientExecutor already does this.
3. Persist the optimizer. Save Adam's state dict to STATE_DIR/optimizer_state.pt at end of round; restore at start of next. Combined with (2), typically closes 50-70 % of the centralized-vs-FL gap.
4. Add TTA at eval. Free 0.2-0.4 pp top-1 for symmetry-invariant tasks. One flag, no retraining.
5. Don't add EMA / label smoothing / aggressive batch-LR scaling without isolated A/B tests. Each has FL-specific failure modes.
6. If a residual gap remains, try a server-side optimizer. Swap the aggregator via FederationConfig.aggregator_path. Caveat: incompatible with BatchNorm.
Next Steps
- Submit your first FL job with the SDK
- Browse the FL job catalog — nine ready-to-run recipes.
- Custom executor / persistor for behaviour the defaults can't express.