
Solution

How the ResonTech kernel works end-to-end — training pipeline, inference pipeline, S3 workspace, fault tolerance, and the two-sided marketplace.

The Kernel Concept

ResonTech is structured around one central metaphor: a kernel. Like an operating system kernel, it sits between your code and the raw hardware, managing resources, isolating workloads, and handling failures transparently.

The kernel serves both primary ML workload types from a single control plane:

| Training Runtime | Inference Runtime |
| --- | --- |
| Distributed DDP | Endpoint Serving |
| FSDP Sharding | OpenAI-Compatible API |
| Federated Learning (NVFlare) | Autoscale to Zero |
| Checkpoint Management | Cold Start < 10s |
| Gradient Sync | Load Balancing |

Both runtimes share a common infrastructure layer: Coordinator, Dispatcher, Message Broker, Worker Registry, Job Scheduler, and Billing Metering. You submit a training job and an inference deployment through the same API, manage them on the same dashboard, and pay on the same pay-as-you-go invoice.
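
As a sketch of what "same API" means in practice, the snippet below submits a training job and an inference deployment through one client. The resontech package, method names, and parameters are hypothetical placeholders, not the documented SDK surface:

```python
# Hypothetical SDK usage: names and signatures are illustrative only.
from resontech import Client  # assumed package name

client = Client(api_key="rt_...")

# Submit a federated training job (shards already uploaded to your bucket).
job = client.training.submit(
    name="sentiment-ft",
    shards="jobs/sentiment-ft/shards/",  # path inside your S3 workspace
    gpu_type="A100",
    shard_count=4,
)

# Deploy a checkpoint as an autoscaling inference endpoint on the same account.
endpoint = client.inference.deploy(
    model="model_out/final.ckpt",
    min_replicas=0,  # scale to zero when idle
    max_replicas=4,
)

print(job.status, endpoint.url)  # one dashboard, one invoice
```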

The Two-Sided Platform

ResonTech operates as a two-sided marketplace with the kernel at the center. GPU providers supply raw compute; ML teams consume it. The kernel routes, orchestrates, and guarantees delivery.

| Aspect | GPU Provider sees | ML Team sees |
| --- | --- | --- |
| Onboarding | Docker agent install + node registration | Account creation + S3 bucket provisioned |
| Daily interface | Node health dashboard, earnings tracker | Web platform, Python SDK, REST API |
| Training data | Never; raw data never leaves the client workspace | Full control of shards, scripts, configs in S3 |
| Job control | Accept/reject policies; automatic execution | Submit, monitor, cancel, retrieve artifacts |
| Inference | Node serves requests automatically | Live endpoint URL, autoscaling, metrics |
| Billing | Payouts per completed work unit and uptime tier | Pay-as-you-go or predictable reserved plans |
| Failure handling | Automatic; platform reassigns if a node drops | Transparent; the job continues with no manual action |

Core Service Components

Main Backend (NestJS API)

The central control plane and client-facing API. Built with NestJS (TypeScript), it is the primary entry point for ML clients, the web dashboard, and the Python SDK.

| Domain | Responsibility |
| --- | --- |
| Job Lifecycle | Submit → Validate → Assign → Monitor → Complete → Archive |
| User Management | Registration, SSH key storage, workspace provisioning, billing tracking |
| Worker Registry | Maintains the registry of all registered GPU providers and their current status |
| Dispatcher Bridge | Proxies cluster provisioning commands to the Dispatcher via gRPC |
| Inference Management | Endpoint lifecycle, autoscaling rules, health monitoring, traffic routing |
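
The Job Lifecycle row describes a linear state machine. A minimal encoding (state names from the table; the transition map and any failure states are assumptions):

```python
from enum import Enum, auto

class JobState(Enum):
    SUBMITTED = auto()
    VALIDATED = auto()
    ASSIGNED = auto()
    MONITORED = auto()
    COMPLETED = auto()
    ARCHIVED = auto()

# Forward transitions only; a production system would also model
# failure and cancellation states (assumption, not documented above).
NEXT_STATE = {
    JobState.SUBMITTED: JobState.VALIDATED,
    JobState.VALIDATED: JobState.ASSIGNED,
    JobState.ASSIGNED: JobState.MONITORED,
    JobState.MONITORED: JobState.COMPLETED,
    JobState.COMPLETED: JobState.ARCHIVED,
}
```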

Coordinator (Intelligent Job Router)

The scheduling brain of the platform. When a job is submitted, the Coordinator:

  • Parses GPU requirements (VRAM hint, GPU type, shard count)
  • Filters eligible workers (free VRAM ≥ job hint, availability = true)
  • Ranks candidates by score: uptime × hardware tier × geo affinity × current load
  • Selects the top N workers (N = shard count for training, N = replica count for inference)
  • Assigns jobs via the async Message Broker
  • Monitors heartbeats and triggers replacement on failure
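
A minimal sketch of the filter-and-rank step under the stated criteria (field names are illustrative, and the score treats lower current load as better, which the formula above leaves implicit):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    id: str
    free_vram_gb: float
    available: bool
    uptime: float          # 0..1 fraction of time online
    hardware_tier: float   # e.g. 1.0 consumer card .. 3.0 datacenter card
    geo_affinity: float    # 0..1, higher means closer to the client
    load: float            # 0..1 current utilization

def select_workers(workers, vram_hint_gb, n):
    """Filter eligible workers, rank by the multiplicative score, take the top N."""
    eligible = [w for w in workers
                if w.available and w.free_vram_gb >= vram_hint_gb]
    # score = uptime x hardware tier x geo affinity x (inverse of current load)
    ranked = sorted(
        eligible,
        key=lambda w: w.uptime * w.hardware_tier * w.geo_affinity * (1.0 - w.load),
        reverse=True,
    )
    return ranked[:n]
```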

Dispatcher (Kubernetes Cluster Manager)

Translates high-level job intent into running Kubernetes infrastructure, receiving commands from the Main Backend via gRPC. It owns the full cluster lifecycle: creating K8s resources for each training job, managing inference deployments with HPA and load balancers, and tearing down clusters after completion.

Worker Agent (Node-Side Runtime)

A Docker container installed on each GPU provider's machine. It registers with the Coordinator, sends heartbeat telemetry, and executes job commands: pull the FLARE image, initialize the FL client, begin training on the assigned shard, write checkpoints, serve inference.
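
The agent's core duty is a heartbeat loop that keeps the Coordinator informed. A simplified sketch (transport, endpoint path, and interval are assumptions):

```python
import time
import requests  # assumed HTTP transport for this sketch

COORDINATOR_URL = "https://coordinator.example.invalid"  # placeholder
HEARTBEAT_INTERVAL_S = 10  # assumed interval

def heartbeat_loop(node_id: str, get_telemetry) -> None:
    """Report GPU telemetry periodically so the Coordinator sees this node as alive."""
    while True:
        requests.post(
            f"{COORDINATOR_URL}/workers/{node_id}/heartbeat",
            json=get_telemetry(),  # e.g. free VRAM, utilization, temperature
            timeout=5,
        )
        time.sleep(HEARTBEAT_INTERVAL_S)
```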

S3 Workspace & Storage Layer

Every user account gets a private S3-compatible bucket automatically provisioned at account creation. The bucket is backed by Garage — a self-hosted, geographically distributed S3-compatible object store. Your data stays under ResonTech infrastructure, never in a third-party public cloud unless you configure a Private Cluster with your own storage.

Workspace folder layout
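
A representative layout, reconstructed from the paths referenced on this page (jobs/{name}/shards/, files_out/, model_out/); the exact nesting inside your bucket may differ:

```
workspace-bucket/
└── jobs/
    └── {name}/
        ├── shards/      # dataset ZIP archives, one per GPU worker
        ├── files_out/   # per-round partial metrics
        └── model_out/   # final aggregated model checkpoint
```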

How uploads work

The API generates presigned PUT URLs (valid for 15 minutes) for uploads. Your browser or SDK sends data directly to Garage; zero bytes flow through the API server.

For large files (≥ 10 MB), the platform automatically initiates multipart upload: files are split into 50 MB chunks, up to 5 parts uploaded in parallel, with automatic abort on failure. Maximum supported upload size: 500 GB.
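
For reference, generating a presigned PUT URL against any S3-compatible store (Garage included) looks like this with boto3; the endpoint, bucket, and credentials are placeholders, and in normal use the platform API hands you the URL:

```python
import boto3

# Placeholders: in practice the ResonTech API returns the presigned URL to you.
s3 = boto3.client(
    "s3",
    endpoint_url="https://garage.example.invalid",  # hypothetical Garage endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-workspace", "Key": "jobs/myjob/shards/shard_0.zip"},
    ExpiresIn=900,  # 15 minutes, matching the platform's upload window
)

# The upload then goes straight to the object store, bypassing the API server:
# requests.put(url, data=open("shard_0.zip", "rb"))
```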

How workers access data

At job dispatch, each worker receives a presigned 1-hour GET URL for its assigned shard ZIP. The worker downloads and unpacks the ZIP into /var/tmp/nvflare/data. Your training code accesses this path via payload["dataset"]["data_root"]. The presigned URL expires after one hour, so workers never have persistent access to your bucket.
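
Inside your training script, data loading therefore reduces to reading from that directory. A minimal sketch (only the payload["dataset"]["data_root"] key is documented above; the rest of the payload handling is assumed):

```python
import os

def list_shard_files(payload: dict) -> list[str]:
    """Resolve the unpacked shard directory handed to this worker."""
    data_root = payload["dataset"]["data_root"]  # e.g. /var/tmp/nvflare/data
    files = sorted(os.listdir(data_root))
    print(f"training on {len(files)} files from {data_root}")
    return [os.path.join(data_root, f) for f in files]
```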

How results are returned

After each FL round, partial metrics are written to files_out/. After the final round, the aggregated model checkpoint is written to model_out/. Downloads use presigned 1-hour GET URLs served directly from Garage — no proxy bottleneck. Large checkpoints download at full network speed.
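
Downloading an artifact is then a plain HTTP GET against the presigned URL; streaming the response keeps memory flat for multi-gigabyte checkpoints. A minimal sketch with requests:

```python
import requests

def download_artifact(presigned_url: str, dest_path: str, chunk_mb: int = 16) -> None:
    """Stream a large artifact (e.g. a checkpoint) from a presigned GET URL to disk."""
    with requests.get(presigned_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_mb * 1024 * 1024):
                f.write(chunk)
```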

Note: Your raw training data never traverses the ResonTech control plane. Workers receive time-limited presigned URLs, and only model weight updates travel between worker nodes and the FLARE server during federated aggregation.

Training Pipeline — End to End

| Phase | Step | What happens |
| --- | --- | --- |
| Data Prep | Shard dataset | Split into N ZIP archives (one per GPU worker). Upload to jobs/{name}/shards/ in your S3 bucket. |
| Submission | Job submitted | Main Backend validates all referenced paths, configs, and file existence in your bucket. |
| Routing | Worker selection | Coordinator scores all available workers and selects N optimal nodes based on VRAM, locality, and uptime. |
| Provisioning | Cluster created | Dispatcher creates a Kubernetes cluster via gRPC. NVFlare FL Server pod deployed. |
| Execution | Distributed training | FLARE server distributes the model. Each worker downloads its shard, trains locally, sends weight updates back. |
| Aggregation | FedAvg round | Server aggregates weight updates (weighted by sample count), saves the global model, begins the next round. |
| Delivery | Artifacts written | Final checkpoint written to model_out/, per-round metrics to files_out/ in your bucket. |
| Billing | Invoice generated | Cluster torn down. Pay-as-you-go invoice generated for actual GPU time used. |
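
The aggregation step is standard FedAvg: the new global weights are the sample-count-weighted average of the workers' updated weights. A minimal NumPy sketch of the arithmetic:

```python
import numpy as np

def fedavg(updates):
    """updates: list of (weights, num_samples) pairs, one per worker.

    Returns the sample-count-weighted average of the weights (FedAvg).
    """
    total = sum(n for _, n in updates)
    keys = updates[0][0].keys()
    return {k: sum(w[k] * (n / total) for w, n in updates) for k in keys}

# Example: worker 2 trained on 3x as many samples, so it counts 3x as much.
w1 = {"layer.weight": np.array([1.0, 2.0])}
w2 = {"layer.weight": np.array([3.0, 4.0])}
print(fedavg([(w1, 100), (w2, 300)]))  # {'layer.weight': array([2.5, 3.5])}
```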

Fault recovery during training

When a worker node loses connectivity:

  • Coordinator detects the missing heartbeat within seconds
  • A replacement worker is selected from the available pool
  • FLARE server handles the partial round gracefully, aggregating the updates it has received
  • The replacement node loads the last successful checkpoint
  • Training resumes; at most the work since the last checkpoint is repeated
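
On the Coordinator side, failure detection reduces to checking heartbeat age against a timeout. A toy sketch (the timeout value is illustrative; the text above only says detection happens within seconds):

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # illustrative; detection is "within seconds"

def find_dead_workers(last_seen: dict[str, float]) -> list[str]:
    """last_seen maps worker id to the unix time of its latest heartbeat."""
    now = time.time()
    return [wid for wid, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT_S]
```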

Inference Pipeline — End to End

| Step | What happens |
| --- | --- |
| Model Loading | Specify a HuggingFace repo or workspace checkpoint path. The model is fetched and loaded onto the selected GPU. |
| Endpoint Config | Set GPU type, initial replica count, and autoscale policy (min/max replicas, scale-to-zero timeout). |
| Dispatcher Deploy | Dispatcher creates a Kubernetes Deployment + HPA + Load Balancer. Health check endpoints configured. |
| Serving Runtime | OpenAI-compatible REST API live: /v1/completions, /v1/chat/completions, /v1/embeddings. Cold start < 10s. |
| Autoscaling | HPA monitors RPS and GPU utilization. Scales out when RPS > threshold. Scales to zero after a configurable idle timeout. |
| Response | Text completions, embeddings, or custom tensor outputs served with P50/P95/P99 latency monitoring. |
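
Because the serving API is OpenAI-compatible, any OpenAI client works once pointed at your endpoint. A sketch using the official openai Python package (the endpoint URL, API key, and model name below are placeholders):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ep-12345.example.invalid/v1",  # your endpoint URL
    api_key="rt_...",                                # your ResonTech key
)

resp = client.chat.completions.create(
    model="my-finetuned-model",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize FedAvg in one sentence."}],
)
print(resp.choices[0].message.content)
```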

Inference autoscaling architecture

Incoming traffic hits a Load Balancer which round-robins across healthy replicas. An HPA Controller monitors both RPS and GPU utilization:

  • RPS > threshold → scale out (add replicas)
  • RPS < threshold → scale in (remove replicas)
  • RPS = 0 for N minutes → scale to zero (billing stops)
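
That policy can be condensed into a few lines. The sketch below is illustrative only: real HPA logic also weighs GPU utilization, and the per-replica RPS capacity here is an assumed parameter:

```python
import math

def desired_replicas(rps, rps_per_replica=50, min_replicas=0, max_replicas=8,
                     idle_minutes=0, scale_to_zero_after=15):
    """Illustrative scale-out / scale-in / scale-to-zero rule based on request rate."""
    if rps == 0 and idle_minutes >= scale_to_zero_after:
        return 0  # scale to zero: billing stops
    needed = max(1, math.ceil(rps / rps_per_replica))  # scale out/in with load
    return min(max(needed, min_replicas), max_replicas)
```
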
| Specification | Value |
| --- | --- |
| Cold start time | < 10 seconds from zero replicas |
| API protocol | REST (HTTP/1.1 and HTTP/2) |
| LLM API format | OpenAI-compatible (/v1/completions, /v1/chat/completions, /v1/embeddings) |
| Autoscaling basis | RPS and GPU utilization; configurable min/max replicas |
| Scale-to-zero | Yes; replicas removed after a configurable idle timeout |
| Model sources | HuggingFace Hub, workspace checkpoint, custom container |
| Load balancing | Round-robin across healthy replicas; health-check-aware |
| Monitoring | RPS, latency (P50/P95/P99), GPU utilization, replica count, error rate |

Framework Compatibility

  • Training frameworks: PyTorch, TensorFlow, JAX, HuggingFace Trainer, Fastai
  • Distributed strategies: DDP, FSDP, DeepSpeed ZeRO-2/ZeRO-3, model parallelism, NVIDIA FLARE federated learning
  • Model hubs: HuggingFace Hub, local workspace checkpoints, custom containers
  • Observability: Weights & Biases, TensorBoard, MLflow (bring your own tracker)
  • Inference APIs: OpenAI-compatible REST endpoints for LLM serving, raw tensor outputs for custom models
  • Runtime: CUDA 11.8+, NVIDIA GPUs from GTX 1080 Ti to H100 SXM5

| Model size | GPU count | Recommended strategy |
| --- | --- | --- |
| Fits in one GPU's VRAM | 1–8 GPUs | DDP |
| Larger than one GPU, fits in cluster VRAM | 4–64 GPUs | FSDP or DeepSpeed ZeRO-2 |
| Very large (70B+ parameters) | 8–128 GPUs | FSDP + DeepSpeed ZeRO-3 |
| Privacy-sensitive training data | Any count | NVIDIA FLARE (Federated) |
| Heterogeneous hardware across sites | Any count | NVIDIA FLARE (Federated) |
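
Read as a decision procedure, the table reduces to a short function. A sketch (the boolean inputs encode the "model size" column; determining them from VRAM figures is left to the caller):

```python
def recommend_strategy(fits_one_gpu, fits_cluster,
                       privacy_sensitive=False, heterogeneous_sites=False):
    """Encode the strategy table above as a decision procedure."""
    if privacy_sensitive or heterogeneous_sites:
        return "NVIDIA FLARE (Federated)"
    if fits_one_gpu:
        return "DDP"                       # 1-8 GPUs
    if fits_cluster:
        return "FSDP or DeepSpeed ZeRO-2"  # 4-64 GPUs
    return "FSDP + DeepSpeed ZeRO-3"       # 70B+ parameters, 8-128 GPUs
```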