Dataset Format & Sharding

How to split your dataset into shard ZIPs, how to structure the manifest, and what the worker sees at data_root.

How Workers Receive Data

At job dispatch, each worker receives a presigned GET URL, valid for 1 hour, for its assigned shard ZIP in your S3 bucket. The worker:

  • Downloads the ZIP using the presigned URL (directly from Garage, no proxy)
  • Unpacks it flat into /var/tmp/nvflare/data/{shard_index}/

Your training code then accesses this path via payload["dataset"]["data_root"].
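Reading the unpacked shard from training code can be sketched as follows. This is a minimal sketch, assuming payload is a plain dict with the dataset.data_root key described above; the helper name list_shard_files is hypothetical and not part of the platform.

```python
from pathlib import Path


def list_shard_files(payload: dict) -> list[Path]:
    """Return every file under the shard's data_root, sorted by path."""
    data_root = Path(payload["dataset"]["data_root"])
    if not data_root.is_dir():
        raise FileNotFoundError(f"data_root does not exist: {data_root}")
    # rglob walks the whole unpacked shard; directories are filtered out.
    return sorted(p for p in data_root.rglob("*") if p.is_file())
```

Calling this at the start of training is a cheap way to confirm that the shard was unpacked where your code expects it.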

The number of shard ZIPs you upload to jobs/{name}/shards/ equals the number of GPU workers allocated. Upload 4 shards → 4 workers run in parallel, each training on its own partition.

Note: Shards are assigned to workers in sort order by filename. Name them consistently: shard_0001.zip, shard_0002.zip, etc.
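Zero-padding matters if the sort is a plain string sort, which is an assumption here since the exact comparison is not specified. The sketch below shows how unpadded names would reorder once there are ten or more shards; shard_name is a hypothetical helper.

```python
def shard_name(index: int, width: int = 4) -> str:
    """Zero-padded shard filename, e.g. index 1 -> 'shard_0001.zip'."""
    return f"shard_{index:0{width}d}.zip"


# Under a plain string sort, unpadded names put shard 10 before shard 2.
unpadded = sorted(f"shard_{i}.zip" for i in (1, 2, 10))
padded = sorted(shard_name(i) for i in (1, 2, 10))

print(unpadded)  # ['shard_1.zip', 'shard_10.zip', 'shard_2.zip']
print(padded)    # ['shard_0001.zip', 'shard_0002.zip', 'shard_0010.zip']
```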

ZIP Structure Rules

The ZIP is unpacked flat into data_root. There must be no wrapper folder.

Warning: Do not add a wrapper folder inside the ZIP. The manifest and data must be at the root of the archive.
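A quick local check for the wrapper-folder mistake might look like the sketch below. It only verifies that manifest.ndjson sits at the archive root, which is what the default adapter needs; has_flat_layout is a hypothetical helper, not a platform function.

```python
import zipfile


def has_flat_layout(zip_path: str, manifest_name: str = "manifest.ndjson") -> bool:
    """True if the manifest is at the archive root rather than inside a folder."""
    with zipfile.ZipFile(zip_path) as zf:
        # A wrapped archive would list 'some_folder/manifest.ndjson' instead.
        return manifest_name in zf.namelist()
```

Archives created by zipping a directory from its parent typically fail this check, because every entry then carries the directory name as a prefix.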

Manifest Format (manifest.ndjson)

The default dataset adapter (used by rt_submit()) expects a manifest.ndjson file at the root of each shard ZIP. Each line is a JSON object describing one training sample:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | str | Yes | Unique identifier for this sample (any string) |
| uri | str | Yes | Image URI. Use file:// for local paths relative to the worker, https:// for remote URLs (downloaded and cached on first access) |
| y | list[int] | Yes | Label indices. Multi-label supported. Must be 0-indexed integers. |
| meta | dict | No | Optional metadata; available in payload but not used by the default adapter |
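The field rules above can be checked locally before packing a shard. The following is a sketch of such a check based only on the table, not the platform's own validation; validate_record and the values in the example line are hypothetical.

```python
import json


def validate_record(line: str) -> dict:
    """Parse one manifest.ndjson line and check it against the documented fields."""
    record = json.loads(line)
    for field in ("id", "uri", "y"):
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    if not isinstance(record["id"], str) or not isinstance(record["uri"], str):
        raise ValueError("id and uri must be strings")
    labels = record["y"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(labels, list) or not all(
        isinstance(v, int) and not isinstance(v, bool) and v >= 0 for v in labels
    ):
        raise ValueError("y must be a list of non-negative integers")
    if "meta" in record and not isinstance(record["meta"], dict):
        raise ValueError("meta must be an object when present")
    return record


# Illustrative line only; the id, URL, and labels are made up.
example = '{"id": "sample-0001", "uri": "https://example.com/cat.jpg", "y": [0, 3]}'
validate_record(example)
```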

Local file URIs

When using file:// URIs, the referenced files must sit under the worker's data_root, and the URI must give the absolute path on the worker, that is, a path starting with /var/tmp/nvflare/data/{shard_index}/.
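Building such a URI can be sketched as below. The shard index 0 and the filename are placeholders for illustration, and file_uri is a hypothetical helper; substitute the data_root your shard will actually be unpacked into.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def file_uri(data_root: str, filename: str) -> str:
    """Build an absolute file:// URI for a file that sits in the worker's data_root."""
    path = PurePosixPath(data_root) / filename
    if not path.is_absolute():
        raise ValueError(f"data_root must be an absolute path: {data_root}")
    # quote() percent-encodes spaces and other unsafe characters, keeping '/'.
    return "file://" + quote(str(path))


# Hypothetical shard index and filename.
print(file_uri("/var/tmp/nvflare/data/0", "cat_001.jpg"))
# file:///var/tmp/nvflare/data/0/cat_001.jpg
```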

Note: The manifest-based format is required for the default adapter and rt_submit(). If you write your own model_def.py manually, you can use any dataset format as long as your code reads from payload["dataset"]["data_root"].

Creating Shards

Split your dataset into N roughly equal partitions, one ZIP per worker.
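One simple way to do the split is round-robin assignment, sketched below. This is an illustrative choice rather than a platform requirement, and partition is a hypothetical helper; shuffle the sample list first if its order is correlated with the labels.

```python
def partition(samples: list, num_shards: int) -> list[list]:
    """Deal samples into num_shards lists whose sizes differ by at most one."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Slice i takes elements i, i + num_shards, i + 2 * num_shards, ...
    return [samples[i::num_shards] for i in range(num_shards)]


print(partition(list(range(10)), 4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```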

Creating a manifest programmatically
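The sketch below writes one shard ZIP with its images and a manifest.ndjson at the archive root. Several details are assumptions: write_shard is a hypothetical helper, the id scheme and the renaming of images to numbered files are illustrative, and the caller must supply the worker's data_root for this shard because the URIs are absolute.

```python
import json
import zipfile
from pathlib import Path, PurePosixPath
from urllib.parse import quote


def write_shard(zip_path: str, samples: list, worker_data_root: str) -> int:
    """Write images plus manifest.ndjson into zip_path; return the sample count.

    samples is a list of (local_image_path, labels) pairs.
    """
    root = PurePosixPath(worker_data_root)
    if not root.is_absolute():
        raise ValueError("worker_data_root must be an absolute path")
    shard_stem = Path(zip_path).stem
    lines = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, (image_path, labels) in enumerate(samples):
            # Store every file at the archive root: no wrapper folder.
            arcname = f"{i:06d}{Path(image_path).suffix}"
            zf.write(image_path, arcname)
            record = {
                "id": f"{shard_stem}-{i:06d}",  # assumed id scheme
                "uri": "file://" + quote(str(root / arcname)),
                "y": [int(v) for v in labels],
            }
            lines.append(json.dumps(record))
        zf.writestr("manifest.ndjson", "".join(line + "\n" for line in lines))
    return len(lines)
```

The returned count is the number of samples in the shard, which is convenient to record alongside the ZIP for later sanity checks.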

Note: Shards do not need to be equal in size. FedAvg weights each worker's contribution by its sample count, so just return the correct samples value from your fl_train_model().
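To see why the samples value matters, here is a sketch of the standard FedAvg weighting on flat lists of floats. It illustrates the arithmetic only and is not the platform's aggregation code; the fedavg function and its input shape are assumptions.

```python
def fedavg(updates: list) -> list[float]:
    """Average worker weights, weighting each worker by its sample count.

    updates is a list of (weights, samples) pairs with equal-length weights.
    """
    total = sum(samples for _, samples in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][0])
    return [
        sum(weights[k] * samples for weights, samples in updates) / total
        for k in range(dim)
    ]


# A worker with 300 samples pulls the average three times as hard as one with 100.
print(fedavg([([1.0, 0.0], 100), ([0.0, 1.0], 300)]))
# [0.25, 0.75]
```

A worker that reports the wrong sample count would be over- or under-weighted in exactly this ratio.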

Uploading Shards

Upload shard ZIPs to your bucket under jobs/{name}/shards/.
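With an S3-compatible client the upload can be sketched as below. The function expects an object exposing upload_file(Filename, Bucket, Key), as a boto3 S3 client does; the bucket name, job name, and upload_shards helper are placeholders, and the endpoint and credentials for your bucket are not covered here.

```python
from pathlib import Path


def upload_shards(client, bucket: str, job_name: str, shard_dir: str) -> list[str]:
    """Upload every *.zip in shard_dir to jobs/{job_name}/shards/; return the keys."""
    keys = []
    # Sort locally so the upload order matches the filename order used for assignment.
    for path in sorted(Path(shard_dir).glob("*.zip")):
        key = f"jobs/{job_name}/shards/{path.name}"
        client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

Passing the client in as a parameter keeps the function independent of how you authenticate against your bucket.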

Verifying shard count

Before submitting, verify that the number of shards in your bucket matches your intended worker count. The platform will allocate exactly N workers for N shard files.
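Counting the uploaded shards can be sketched with the list_objects_v2 paginator that boto3 S3 clients provide. The count_shards helper is hypothetical, and it assumes that only keys ending in .zip count as shards, which may differ from how the platform counts files under the prefix.

```python
def count_shards(client, bucket: str, job_name: str) -> int:
    """Count the .zip objects stored under jobs/{job_name}/shards/."""
    prefix = f"jobs/{job_name}/shards/"
    paginator = client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects carry no 'Contents' key.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".zip"):
                count += 1
    return count
```

Comparing the result with your intended worker count before submitting catches missing or stray uploads early.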

Non-Image Datasets

The manifest format and default adapter are designed for image classification. For other data types (text, tabular, audio, time series), write a custom model_def.py that reads directly from payload["dataset"]["data_root"].

Pack your data files directly into the shard ZIP without a manifest. The data_root will contain whatever files you put in the ZIP root.
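For a tabular shard, the reading side might look like the sketch below. The filename train.csv and the load_rows helper are hypothetical, and how the rows feed into your model_def.py is left to your own code.

```python
import csv
from pathlib import Path


def load_rows(payload: dict, filename: str = "train.csv") -> list[dict]:
    """Read a CSV file placed at the root of the shard ZIP into a list of dicts."""
    data_root = Path(payload["dataset"]["data_root"])
    with open(data_root / filename, newline="", encoding="utf-8") as f:
        # DictReader uses the first row as column names; values stay strings.
        return list(csv.DictReader(f))
```

The length of the returned list is the shard's sample count, which is the number to report back as samples from training.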