SDK — Troubleshooting

Known failure modes with actionable fixes — auth, storage, submission, source extraction, networking, notebooks.

Auth & Login

AuthError: 401 Unauthorized: Invalid credentials

Wrong email or password. The SDK retries with a refresh token once before raising.

AuthError: Login response did not contain an access token

Almost always a wrong base_url. Open it in a browser — a healthy deployment returns a 404 page, not a login portal or a blank 200.
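The browser check can also be scripted. The sketch below uses only the standard library; `probe_base_url` and `interpret_status` are hypothetical helper names, not part of the SDK, and the interpretation only encodes the rule of thumb stated above (a 404 at the root is the expected healthy answer).

```python
import urllib.error
import urllib.request
from typing import Optional


def probe_base_url(base_url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of a GET on base_url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # A 4xx/5xx response still proves that a server answered.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error and similar.
        return None


def interpret_status(status: Optional[int]) -> str:
    """Map a probe result to a short verdict (assumed rule: 404 at the root is healthy)."""
    if status is None:
        return "unreachable"
    if status == 404:
        return "looks-healthy"
    return "suspicious"
```

A call such as `interpret_status(probe_base_url(base_url))` returning "suspicious" suggests the URL points at something other than the backend, for example a login portal that answers with 200.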

Storage / Bucket

StorageError: No S3 bucket configured

You haven't provisioned a bucket yet. Open the web UI → Profile → Storage → Provision Bucket, pick an alias and quota, and paste the returned credentials (s3_access_key_id and s3_secret_access_key) into ResonTechConfig.

StorageError: Could not resolve your storage bucket

Login succeeded but GET /api/users/storage/bucket returned 404. Same fix — provision a bucket.

StorageError: s3_access_key_id and s3_secret_access_key are required

You constructed ResonTechConfig with empty strings. Pass both values before creating ResonTech(config).
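One way to catch empty values before they reach the SDK is to validate them where they are loaded. This is a minimal sketch that assumes the credentials live in environment variables; the variable names `RT_S3_ACCESS_KEY_ID` and `RT_S3_SECRET_ACCESS_KEY` and the helper `load_s3_credentials` are hypothetical, not something the SDK defines.

```python
import os
from typing import Mapping, Tuple


def load_s3_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read both S3 values and fail early if either is missing or blank."""
    # Hypothetical variable names; use whatever your deployment defines.
    key_id = env.get("RT_S3_ACCESS_KEY_ID", "").strip()
    secret = env.get("RT_S3_SECRET_ACCESS_KEY", "").strip()

    missing = [
        name
        for name, value in (
            ("s3_access_key_id", key_id),
            ("s3_secret_access_key", secret),
        )
        if not value
    ]
    if missing:
        raise ValueError("Missing S3 credential(s): " + ", ".join(missing))
    return key_id, secret
```

The returned pair can then be passed as s3_access_key_id and s3_secret_access_key when building ResonTechConfig.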

StorageError: … AccessDenied

Credentials in s3_access_key_id / s3_secret_access_key don't match the bucket on your account. Rotate the key from Profile → Storage → Rotate Key and update your config.

StorageError: … InvalidAccessKeyId

The access key no longer exists in Garage. Someone (maybe you) already rotated it. Rotate again to get a fresh pair.

Slow or stalled shard upload

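Check your upstream bandwidth. The SDK uploads in 50 MB parts with 5 concurrent threads. A single 50 MB part takes about 40 s on a 10 Mbit/s link, and when five parts share that link each one takes roughly five times as long, so progress can appear stalled between updates. Watch stderr: the _ProgressPrinter shows per-file percent.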
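The arithmetic behind that estimate is easy to reproduce for your own link. The sketch below assumes 1 MB = 10^6 bytes, no protocol overhead, and that concurrent parts split the link evenly; the function names are illustrative only.

```python
def transfer_seconds(size_mb: float, link_mbit_s: float) -> float:
    """Seconds to send size_mb megabytes over a link of link_mbit_s megabits/s."""
    return size_mb * 8 / link_mbit_s


def per_part_seconds(part_mb: float, link_mbit_s: float, concurrent_parts: int) -> float:
    """Seconds per part when concurrent_parts uploads share the link evenly."""
    return transfer_seconds(part_mb, link_mbit_s / concurrent_parts)


print(transfer_seconds(50, 10))      # 40.0  (one part with the whole link)
print(per_part_seconds(50, 10, 5))   # 200.0 (five parts in flight)
print(transfer_seconds(1000, 10))    # 800.0 (a 1 GB shard in total)
```

Concurrency does not change the total time on a saturated link; it only changes how long each individual part takes to complete.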

Submission

ValidationError: No shard zip files found in './shards'

shards_dir must contain at least one *.zip. The filename pattern is free-form — shard_0.zip, part-01.zip, anything with a .zip suffix works.

ValidationError: shards_dir is not a directory

Typo or wrong path — confirm Path(shards_dir).is_dir() locally first.
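Both shard checks can be run locally before submitting. The helper below is a sketch using only pathlib; `find_shards` is a hypothetical name, and its error messages merely imitate the SDK's wording.

```python
from pathlib import Path
from typing import List


def find_shards(shards_dir: str) -> List[Path]:
    """Return the *.zip files directly inside shards_dir, or raise ValueError."""
    root = Path(shards_dir)
    if not root.is_dir():
        raise ValueError(f"shards_dir is not a directory: {root}")

    # Any filename with a .zip suffix counts; subdirectories are not searched.
    shards = sorted(p for p in root.glob("*.zip") if p.is_file())
    if not shards:
        raise ValueError(f"No shard zip files found in '{root}'")
    return shards
```

Calling `find_shards("./shards")` before submission also gives you the shard count, which is useful when passing explicit worker IDs.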

ValidationError: model_checkpoint must be a .pt file

The backend enforces exactly one .pt in model/. Rename or convert your checkpoint before passing it.

HTTP 400: Folder not found in your S3 bucket: "/jobs/foo/scripts"

The submit backend re-verifies every path. Usually means an upload silently failed earlier — re-run rt_submit or inspect the bucket with sdk.storage.list("jobs/foo/").

HTTP 400: Shard count (5) does not match worker count (3)

You passed explicit worker_ids=[...] with a different length than the shard count. Either change the number of uploaded shards to match or pass exactly one worker per shard. With auto_select_workers=True, the backend picks the workers for you.
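The mismatch can be caught before the request is sent. This sketch assumes worker_ids is a plain sequence and that leaving it as None corresponds to auto-selection; `check_worker_count` is a hypothetical helper, not an SDK function.

```python
from typing import Optional, Sequence


def check_worker_count(shard_count: int, worker_ids: Optional[Sequence[object]]) -> None:
    """Raise ValueError if an explicit worker list does not match the shard count."""
    if worker_ids is None:
        # No explicit list: assumed to mean the backend selects workers itself.
        return
    if len(worker_ids) != shard_count:
        raise ValueError(
            f"Shard count ({shard_count}) does not match "
            f"worker count ({len(worker_ids)})"
        )
```

Five shards with three worker IDs fails here with the same numbers the backend would report, without waiting for the upload to finish.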

Source Extraction

ValueError: Could not extract source of 'MyModel'

Happens in some notebook kernels where torch patches inspect. Workarounds: save the class to a .py file and set ModelConfig(model_class="my_module.MyModel"); or restart the kernel before importing torch and define the class first.
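Whether a class is extractable can be tested in the kernel before submitting. The sketch below assumes the SDK relies on something equivalent to `inspect.getsource`, which the error message suggests but the source does not confirm; `can_extract_source` is a hypothetical helper.

```python
import inspect


def can_extract_source(obj: object) -> bool:
    """Return True if inspect can locate the source text of obj."""
    try:
        inspect.getsource(obj)
    except (OSError, TypeError):
        # OSError: the source file or definition could not be found.
        # TypeError: the object looks built-in or has no backing file.
        return False
    return True
```

If `can_extract_source(MyModel)` is False in the notebook but True after moving the class into a .py file and importing it, the file-based workaround above is the more reliable route.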

NameError: name 'nn' is not defined (server-side)

Imports live in a different cell from the class. Consolidate them into the same cell and resubmit. Applies to model=, executor=, persistor=.

NameError: name 'torch' is not defined on the worker

Same fix — put import torch in the same cell as your class.
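A rough pre-flight check for both NameError cases is to parse the cell text and look for names that are used but never imported or defined in it. This is a heuristic sketch built on the standard `ast` module: it ignores scoping rules and a few binding forms (such as `except ... as e`), and `missing_names` is a hypothetical helper rather than anything the SDK provides.

```python
import ast
import builtins
from typing import Set


def missing_names(cell_source: str) -> Set[str]:
    """Names read in cell_source that the cell itself never imports or binds."""
    bound = set(dir(builtins))
    used: Set[str] = set()

    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import torch.nn" binds "torch"; "import torch.nn as nn" binds "nn".
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                bound.add(node.id)
    return used - bound


class_only = (
    "class MyModel(nn.Module):\n"
    "    def forward(self, x):\n"
    "        return torch.relu(x)\n"
)
with_imports = "import torch\nfrom torch import nn\n" + class_only

print(sorted(missing_names(class_only)))    # ['nn', 'torch']
print(sorted(missing_names(with_imports)))  # []
```

An empty result does not guarantee the cell runs on the worker, but a non-empty one is a strong hint that an import sits in a different cell.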

Network / Infrastructure

ResonTechError: Cannot reach <base_url>

Backend is down or your network blocks it. Try the same URL in a browser.

Presigned shard URLs return 403 on the worker

Garage CORS or the bucket's platform-write-key grant is misconfigured. This is a platform-side issue — contact your admin.

Uploads fail with InvalidRequest: The Content-Md5 you specified was invalid

Rare boto3 regression on old versions. Upgrade: pip install -U boto3 (≥ 1.34 required).
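To confirm which boto3 version the current interpreter actually sees, the standard `importlib.metadata` module can be queried. The helper names below are illustrative, and the comparison assumes a plain numeric "major.minor.patch" version string, which is how boto3 releases are normally numbered.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Version string of an installed distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version: str, minimum: Tuple[int, int]) -> bool:
    """Compare the major.minor prefix of version against a minimum pair."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


found = installed_version("boto3")
if found is None:
    print("boto3 is not installed in this interpreter")
elif not meets_minimum(found, (1, 34)):
    print(f"boto3 {found} is older than 1.34; upgrade it")
```

Run this inside the same environment (or notebook kernel) that performs the upload, since several Python installations can carry different boto3 versions.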

Jupyter / Notebook Issues

ModuleNotFoundError: No module named 'resontech'

The SDK is not visible to the Python interpreter that the notebook kernel is running. Compare sys.executable inside the notebook with the interpreter the SDK was installed into; the two need to match. The example notebooks use sys.executable when wiring up dependencies to avoid this mismatch.
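The comparison can be done from a notebook cell. In this sketch the module name `resontech` comes from the error message above, while the distribution name to install is left to the caller because the source does not state it; both helper functions are hypothetical.

```python
import importlib.util
import sys
from typing import List


def sdk_visible(module_name: str = "resontech") -> bool:
    """True if the running interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package: str) -> List[str]:
    """Build a pip command bound to the interpreter this kernel is running."""
    return [sys.executable, "-m", "pip", "install", package]


print("kernel interpreter:", sys.executable)
print("SDK importable:", sdk_visible())
```

Running the list returned by `pip_install_command(...)` with `subprocess.run` installs into the kernel's own interpreter, which sidesteps a `pip` on PATH that belongs to a different Python.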

Kernel ignores SDK source changes after editing

Use importlib.reload after editing SDK source — the example notebooks include a reload cell at the top.
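`importlib.reload` refreshes a single module, so a package with submodules needs one call per loaded module. The following sketch shows one way to do that; `reload_package` is a hypothetical helper and is not necessarily what the example notebooks' reload cell contains.

```python
import importlib
import sys
from typing import List


def reload_package(prefix: str) -> List[str]:
    """Reload every already-imported module named prefix or prefix.<something>."""
    names = sorted(
        name for name in sys.modules
        if name == prefix or name.startswith(prefix + ".")
    )
    # Reload submodules before their parents so that a parent's
    # "from .child import x" re-binds to the refreshed child.
    for name in sorted(names, key=lambda n: n.count("."), reverse=True):
        importlib.reload(sys.modules[name])
    return names
```

After editing the SDK source, `reload_package("resontech")` would refresh the loaded modules. Objects created before the reload keep referring to the old class definitions, so re-create them afterwards.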

Where Do I See the Real Error?

The dashboard job detail page (job.dashboard_url) is the source of truth for runtime errors. Worker logs, training stdout, Python tracebacks — they all end up there. The SDK only surfaces errors from the submission pipeline itself.