The Kernel Concept
ResonTech is structured around one central metaphor: a kernel. Like an operating system kernel, it sits between your code and the raw hardware, managing resources, isolating workloads, and handling failures transparently.
The kernel serves both primary ML workload types from a single control plane:
| Training Runtime | Inference Runtime |
|---|---|
| Distributed DDP | Endpoint Serving |
| FSDP Sharding | OpenAI-Compatible API |
| Federated Learning (NVFlare) | Autoscale to Zero |
| Checkpoint Management | Cold Start < 10s |
| Gradient Sync | Load Balancing |
Both runtimes share a common infrastructure layer: Coordinator, Dispatcher, Message Broker, Worker Registry, Job Scheduler, and Billing Metering. You submit a training job and an inference deployment through the same API, manage them on the same dashboard, and pay on the same pay-as-you-go invoice.
The Two-Sided Platform
ResonTech operates as a two-sided marketplace with the kernel at the center. GPU providers supply raw compute; ML teams consume it. The kernel routes, orchestrates, and guarantees delivery.
| Aspect | GPU Provider sees | ML Team sees |
|---|---|---|
| Onboarding | Docker agent install + node registration | Account creation + S3 bucket provisioning |
| Daily interface | Node health dashboard, earnings tracker | Web platform, Python SDK, REST API |
| Training data | Never — raw data never leaves client workspace | Full control of shards, scripts, configs in S3 |
| Job control | Accept/reject policies; automatic execution | Submit, monitor, cancel, retrieve artifacts |
| Inference | Node serves requests automatically | Live endpoint URL, autoscaling, metrics |
| Billing | Payouts per completed work unit and uptime tier | Pay-as-you-go or predictable reserved plans |
| Failure handling | Automatic — platform reassigns if node drops | Transparent — job continues, no manual action |
Core Service Components
Main Backend (NestJS API)
The central control plane and client-facing API. Built with NestJS (TypeScript), it is the primary entry point for ML clients, the web dashboard, and the Python SDK.
| Domain | Responsibility |
|---|---|
| Job Lifecycle | Submit → Validate → Assign → Monitor → Complete → Archive |
| User Management | Registration, SSH key storage, workspace provisioning, billing tracking |
| Worker Registry | Tracks every registered GPU provider node and its current status |
| Dispatcher Bridge | Proxies cluster provisioning commands to Dispatcher via gRPC |
| Inference Management | Endpoint lifecycle, autoscaling rules, health monitoring, traffic routing |
Coordinator (Intelligent Job Router)
The scheduling brain of the platform. When a job is submitted, the Coordinator runs the following selection steps (sketched in code below):
- Parses GPU requirements (VRAM hint, GPU type, shard count)
- Filters eligible workers (free VRAM ≥ job hint, availability = true)
- Ranks candidates by a composite score of uptime, hardware tier, geographic affinity, and current load
- Selects the top N workers (N = shard count for training, N = replica count for inference)
- Assigns jobs via the async Message Broker
- Monitors heartbeats and triggers replacement on failure
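A minimal sketch of that selection loop is below. The field names and the exact weighting are illustrative assumptions, not the production scoring algorithm; the only source-confirmed rules are the VRAM/availability filter and the top-N cut.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Worker:
    # Illustrative registry record; real field names may differ.
    node_id: str
    free_vram_gb: float
    available: bool
    uptime: float          # rolling uptime ratio, 0.0-1.0
    hardware_tier: float   # e.g. 1.0 consumer card, 2.0 datacenter card
    geo_affinity: float    # 0.0-1.0, higher = closer to the client's data
    current_load: float    # 0.0 idle ... 1.0 fully loaded

def select_workers(workers: List[Worker], vram_hint_gb: float, n: int) -> List[Worker]:
    """Filter eligible nodes, rank by a composite score, return the top N."""
    eligible = [w for w in workers if w.available and w.free_vram_gb >= vram_hint_gb]

    def score(w: Worker) -> float:
        # Assumed weighting: uptime x tier x geo affinity, discounted by load.
        return w.uptime * w.hardware_tier * w.geo_affinity * (1.0 - w.current_load)

    return sorted(eligible, key=score, reverse=True)[:n]
```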
Dispatcher (Kubernetes Cluster Manager)
Bridges high-level job intent into running Kubernetes infrastructure via gRPC. Responsible for the full cluster lifecycle: creating K8s resources for each training job, managing inference deployments with HPA and load balancers, and tearing down clusters after completion.
Worker Agent (Node-Side Runtime)
A Docker container installed on each GPU provider's machine. It registers with the Coordinator, sends heartbeat telemetry, and executes job commands: pull the FLARE image, initialize the FL client, begin training on the assigned shard, write checkpoints, serve inference.
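Conceptually, the agent is a register-then-heartbeat loop. The sketch below is a simplification under assumed endpoint paths and payload fields; the real agent runs inside the provider's Docker container and reads telemetry from the GPU driver.

```python
import time
import requests  # HTTP client; pip install requests

COORDINATOR = "https://coordinator.resontech.example"   # hypothetical URL

def read_telemetry(node_id: str) -> dict:
    # Stub: a real agent would query nvidia-smi / NVML for VRAM and utilization.
    return {"node_id": node_id, "free_vram_gb": 22.5, "gpu_util": 0.15}

def run_agent(node_id: str, interval_s: int = 5) -> None:
    # 1. Register the node with the Coordinator's worker registry (illustrative path).
    requests.post(f"{COORDINATOR}/workers/register", json={"node_id": node_id}, timeout=10)
    # 2. Send heartbeat telemetry forever; a missed heartbeat is what triggers replacement.
    while True:
        requests.post(f"{COORDINATOR}/workers/heartbeat",
                      json=read_telemetry(node_id), timeout=10)
        time.sleep(interval_s)
```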
S3 Workspace & Storage Layer
Every user account gets a private S3-compatible bucket automatically provisioned at account creation. The bucket is backed by Garage — a self-hosted, geographically distributed S3-compatible object store. Your data stays under ResonTech infrastructure, never in a third-party public cloud unless you configure a Private Cluster with your own storage.
Workspace folder layout
How uploads work
The API generates presigned PUT URLs (valid 15 minutes) for uploads. Your browser or SDK sends data directly to Garage — zero bytes flow through the API server.
For large files (≥ 10 MB), the platform automatically initiates multipart upload: files are split into 50 MB chunks, up to 5 parts uploaded in parallel, with automatic abort on failure. Maximum supported upload size: 500 GB.
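A minimal client-side sketch of the single-part flow: the endpoint used to request the presigned URL is an assumption for illustration, but the direct PUT to object storage is standard S3 behavior.

```python
import requests

API = "https://api.resontech.example"        # hypothetical API base URL
TOKEN = "..."                                 # your API token

# 1. Ask the API for a presigned PUT URL (valid 15 minutes). Path is illustrative.
resp = requests.post(f"{API}/storage/presign-upload",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json={"key": "jobs/my-job/shards/shard_0.zip"})
resp.raise_for_status()
upload_url = resp.json()["url"]

# 2. PUT the file directly to Garage; no bytes pass through the API server.
with open("shard_0.zip", "rb") as f:
    requests.put(upload_url, data=f).raise_for_status()
```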
How workers access data
At job dispatch, each worker receives a presigned 1-hour GET URL for its assigned shard ZIP. The worker downloads and unpacks the ZIP into /var/tmp/nvflare/data. Your training code accesses this path via payload["dataset"]["data_root"]. The presigned URL expires after the download — workers never have persistent access to your bucket.
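A sketch of the worker-side handoff, assuming the data_root path above; the download helper and chunk size are illustrative.

```python
import os
import zipfile
import requests

DATA_ROOT = "/var/tmp/nvflare/data"

def fetch_shard(presigned_get_url: str) -> str:
    """Download the shard ZIP via the one-hour presigned URL and unpack it."""
    os.makedirs(DATA_ROOT, exist_ok=True)
    zip_path = os.path.join(DATA_ROOT, "shard.zip")
    with requests.get(presigned_get_url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
                f.write(chunk)
    zipfile.ZipFile(zip_path).extractall(DATA_ROOT)
    return DATA_ROOT

# Inside your training code, the same location arrives via the job payload:
# data_root = payload["dataset"]["data_root"]   # == "/var/tmp/nvflare/data"
```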
How results are returned
After each FL round, partial metrics are written to files_out/. After the final round, the aggregated model checkpoint is written to model_out/. Downloads use presigned 1-hour GET URLs served directly from Garage — no proxy bottleneck. Large checkpoints download at full network speed.
Training Pipeline — End to End
| Phase | Step | What happens |
|---|---|---|
| Data Prep | Shard dataset | Split into N ZIP archives (one per GPU worker). Upload to jobs/{name}/shards/ in your S3 bucket. |
| Submission | Job submitted | Main Backend validates all referenced paths, configs, and file existence in your bucket. |
| Routing | Worker selection | Coordinator scores all available workers and selects N optimal nodes based on VRAM, locality, and uptime. |
| Provisioning | Cluster created | Dispatcher creates a Kubernetes cluster via gRPC. NVFlare FL Server pod deployed. |
| Execution | Distributed training | FLARE server distributes the model. Each worker downloads its shard, trains locally, sends weight updates back. |
| Aggregation | FedAvg round | Server aggregates weight updates (weighted by sample count), saves global model, begins next round. |
| Delivery | Artifacts written | Final checkpoint written to model_out/, per-round metrics to files_out/ in your bucket. |
| Billing | Invoice generated | Cluster torn down. Pay-as-you-go invoice generated for actual GPU time used. |
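To make the pipeline concrete, a submission from the Python SDK could look roughly like this. The client class, method names, and parameters are hypothetical placeholders, not the real SDK interface; consult the SDK reference for the actual calls.

```python
# Hypothetical SDK usage; package, class, and method names are illustrative.
from resontech import Client   # assumed package name

client = Client(api_key="...")   # your API token

job = client.training.submit(
    name="llama-finetune",
    shards="jobs/llama-finetune/shards/",     # N shard ZIPs already in your bucket
    config="jobs/llama-finetune/config.yaml", # illustrative config path
    gpu_type="A100-80GB",
    num_workers=4,                            # N workers = N shards
)

job.wait()                       # block until the final FedAvg round completes
print(job.status, job.artifacts) # checkpoint in model_out/, metrics in files_out/
```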
Fault recovery during training
When a worker node loses connectivity:
- Coordinator detects the missing heartbeat within seconds
- A replacement worker is selected from the available pool
- The FLARE server handles the partial round gracefully, aggregating the updates it has already received
- Replacement node loads the last successful checkpoint
- Training resumes from the last checkpoint, so no completed work before that checkpoint is lost (resume sketch below)
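On the replacement node, resuming amounts to standard checkpoint loading. A minimal PyTorch sketch, assuming the last global checkpoint is available locally; the file name and dictionary keys are illustrative.

```python
import os
import torch

CKPT_PATH = "model_out/global_model_latest.pt"   # illustrative checkpoint location

def load_or_init(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Resume from the last successful checkpoint if one exists, else start fresh."""
    start_round = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_round = ckpt["round"] + 1          # continue from the next FL round
    return start_round
```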
Inference Pipeline — End to End
| Step | What happens |
|---|---|
| Model Loading | Specify a HuggingFace repo or workspace checkpoint path. The model is fetched and loaded onto the selected GPU. |
| Endpoint Config | Set GPU type, initial replica count, and autoscale policy (min/max replicas, scale-to-zero timeout). |
| Dispatcher Deploy | Dispatcher creates a Kubernetes Deployment + HPA + Load Balancer. Health check endpoints configured. |
| Serving Runtime | OpenAI-compatible REST API live: /v1/completions, /v1/chat/completions, /v1/embeddings. Cold start < 10s. |
| Autoscaling | HPA monitors RPS and GPU utilization. Scales out when RPS > threshold. Scales to zero after configurable idle timeout. |
| Response | Text completions, embeddings, or custom tensor outputs served with P50/P95/P99 latency monitoring. |
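Because the serving API is OpenAI-compatible, the standard openai Python client works against the endpoint by overriding its base URL. The endpoint URL and model name below are placeholders.

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://<your-endpoint>.resontech.example/v1",  # placeholder endpoint URL
    api_key="...",                                             # your ResonTech API token
)

resp = client.chat.completions.create(
    model="my-finetuned-llama",   # placeholder name of the deployed model
    messages=[{"role": "user", "content": "Summarize federated averaging in one line."}],
)
print(resp.choices[0].message.content)
```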
Inference autoscaling architecture
Incoming traffic hits a Load Balancer, which round-robins across healthy replicas. An HPA Controller monitors both RPS and GPU utilization and applies the following rules (decision logic sketched below):
- RPS > threshold → scale out (add replicas)
- RPS < threshold → scale in (remove replicas)
- RPS = 0 for N minutes → scale to zero (billing stops)
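These rules reduce to a small decision function. The sketch below uses RPS only (the GPU-utilization term is omitted for brevity), and the threshold values are illustrative assumptions, not platform defaults.

```python
import math

def desired_replicas(rps: float, idle_minutes: float,
                     rps_per_replica: float = 50.0,        # illustrative threshold
                     min_replicas: int = 0, max_replicas: int = 8,
                     scale_to_zero_after_min: float = 15.0) -> int:
    """Target replica count under the rules listed above (values are assumptions)."""
    if rps == 0 and idle_minutes >= scale_to_zero_after_min:
        return min_replicas                                 # scale to zero: billing stops
    needed = math.ceil(rps / rps_per_replica)               # keep each replica under threshold
    return min(max_replicas, max(1, needed))                # scale out / in within bounds
```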
| Specification | Value |
|---|---|
| Cold start time | < 10 seconds from zero replicas |
| API protocol | REST (HTTP/1.1 and HTTP/2) |
| LLM API format | OpenAI-compatible (/v1/completions, /v1/chat/completions, /v1/embeddings) |
| Autoscaling basis | RPS and GPU utilization; configurable min/max replicas |
| Scale-to-zero | Yes — replicas removed after configurable idle timeout |
| Model sources | HuggingFace Hub, workspace checkpoint, custom container |
| Load balancing | Round-robin across healthy replicas; health-check-aware |
| Monitoring | RPS, latency (P50/P95/P99), GPU utilization, replica count, error rate |
Framework Compatibility
- Training frameworks: PyTorch, TensorFlow, JAX, HuggingFace Trainer, Fastai
- Distributed strategies: DDP, FSDP, DeepSpeed ZeRO-2/ZeRO-3, model parallelism, NVIDIA FLARE federated learning
- Model hubs: HuggingFace Hub, local workspace checkpoints, custom containers
- Observability: Weights & Biases, TensorBoard, MLflow (bring your own tracker)
- Inference APIs: OpenAI-compatible REST endpoints for LLM serving, raw tensor outputs for custom models
- Runtime: CUDA 11.8+, NVIDIA GPUs from GTX 1080 Ti to H100 SXM5
| Model size | GPU count | Recommended strategy |
|---|---|---|
| Fits in one GPU VRAM | 1–8 GPUs | DDP |
| Larger than one GPU, fits in cluster VRAM | 4–64 GPUs | FSDP or DeepSpeed ZeRO-2 |
| Very large (70B+ parameters) | 8–128 GPUs | FSDP or DeepSpeed ZeRO-3 |
| Privacy-sensitive training data | Any count | NVIDIA FLARE (Federated) |
| Heterogeneous hardware across sites | Any count | NVIDIA FLARE (Federated) |
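The strategy choice maps directly onto how the model is wrapped in your training script. A minimal PyTorch sketch, assuming the launcher (torchrun or the FLARE client) has already initialized the process group and placed one process per GPU:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module, strategy: str) -> torch.nn.Module:
    # Assumes dist.init_process_group() has already been called by the launcher.
    local_rank = dist.get_rank() % torch.cuda.device_count()
    model = model.cuda(local_rank)
    if strategy == "ddp":    # model fits in one GPU: replicate weights, sync gradients
        return DDP(model, device_ids=[local_rank])
    if strategy == "fsdp":   # model larger than one GPU: shard params, grads, optimizer state
        return FSDP(model)
    raise ValueError(f"unknown strategy: {strategy}")
```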