The Kernel Concept
ResonTech is structured around one central metaphor: a kernel. Like an operating system kernel, it sits between your code and the raw hardware, managing resources, isolating workloads, and handling failures transparently.
The kernel serves both primary ML workload types from a single control plane:
| Training Runtime | Inference Runtime |
|---|---|
| Distributed DDP | Endpoint Serving |
| FSDP Sharding | OpenAI-Compatible API |
| Federated Learning (NVFlare) | Autoscale to Zero |
| Checkpoint Management | Cold Start < 10s |
| Gradient Sync | Load Balancing |
Both runtimes share a common infrastructure layer: Coordinator, Dispatcher, Message Broker, Worker Registry, Job Scheduler, and Billing Metering. You submit a training job and an inference deployment through the same API, manage them on the same dashboard, and pay on the same pay-as-you-go invoice.
The Two-Sided Platform
ResonTech operates as a two-sided marketplace with the kernel at the center. GPU providers supply raw compute; ML teams consume it. The kernel routes, orchestrates, and guarantees delivery.
| Aspect | GPU Provider sees | ML Team sees |
|---|---|---|
| Onboarding | Docker agent install + node registration | Account creation + S3 bucket provisioning |
| Daily interface | Node health dashboard, earnings tracker | Web platform, Python SDK, REST API |
| Training data | Never — raw data never leaves client workspace | Full control of shards, scripts, configs in S3 |
| Job control | Accept/reject policies; automatic execution | Submit, monitor, cancel, retrieve artifacts |
| Inference | Node serves requests automatically | Live endpoint URL, autoscaling, metrics |
| Billing | Payouts per completed work unit and uptime tier | Pay-as-you-go or predictable reserved plans |
| Failure handling | Automatic — platform reassigns if node drops | Transparent — job continues, no manual action |
Core Service Components
Main Backend (NestJS API)
The central control plane and client-facing API. Built with NestJS (TypeScript), it is the primary entry point for ML clients, the web dashboard, and the Python SDK.
| Domain | Responsibility |
|---|---|
| Job Lifecycle | Submit → Validate → Assign → Monitor → Complete → Archive |
| User Management | Registration, SSH key storage, workspace provisioning, billing tracking |
| Worker Registry | Tracks every registered GPU provider node and its current status |
| Dispatcher Bridge | Proxies cluster provisioning commands to Dispatcher via gRPC |
| Inference Management | Endpoint lifecycle, autoscaling rules, health monitoring, traffic routing |
Coordinator (Intelligent Job Router)
The scheduling brain of the platform. When a job is submitted, the Coordinator runs the following selection steps (sketched in code below):
- Parses GPU requirements (VRAM hint, GPU type, shard count)
- Filters eligible workers (free VRAM ≥ job hint, availability = true)
- Ranks candidates by a composite score of uptime, hardware tier, geographic affinity, and current load
- Selects the top N workers (N = shard count for training, N = replica count for inference)
- Assigns jobs via the async Message Broker
- Monitors heartbeats and triggers replacement on failure
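A minimal sketch of that selection loop is below. The field names and the exact weighting are illustrative assumptions, not the production scoring algorithm; the only source-confirmed rules are the VRAM/availability filter and the top-N cut.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Worker:
    # Illustrative registry record; real field names may differ.
    node_id: str
    free_vram_gb: float
    available: bool
    uptime: float          # rolling uptime ratio, 0.0-1.0
    hardware_tier: float   # e.g. 1.0 consumer card, 2.0 datacenter card
    geo_affinity: float    # 0.0-1.0, higher = closer to the client's data
    current_load: float    # 0.0 idle ... 1.0 fully loaded

def select_workers(workers: List[Worker], vram_hint_gb: float, n: int) -> List[Worker]:
    """Filter eligible nodes, rank by a composite score, return the top N."""
    eligible = [w for w in workers if w.available and w.free_vram_gb >= vram_hint_gb]

    def score(w: Worker) -> float:
        # Assumed weighting: uptime x tier x geo affinity, discounted by load.
        return w.uptime * w.hardware_tier * w.geo_affinity * (1.0 - w.current_load)

    return sorted(eligible, key=score, reverse=True)[:n]
```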
Dispatcher (Kubernetes Cluster Manager)
Bridges high-level job intent into running Kubernetes infrastructure via gRPC. Responsible for the full cluster lifecycle: creating K8s resources for each training job, managing inference deployments with HPA and load balancers, and tearing down clusters after completion.
Worker Agent (Node-Side Runtime)
A Docker container installed on each GPU provider's machine. It registers with the Coordinator, sends heartbeat telemetry, and executes job commands: pull the FLARE image, initialize the FL client, begin training on the assigned shard, write checkpoints, serve inference.
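Conceptually, the agent is a register-then-heartbeat loop. The sketch below is a simplification under assumed endpoint paths and payload fields; the real agent runs inside the provider's Docker container and reads telemetry from the GPU driver.

```python
import time
import requests  # HTTP client; pip install requests

COORDINATOR = "https://coordinator.resontech.example"   # hypothetical URL

def read_telemetry(node_id: str) -> dict:
    # Stub: a real agent would query nvidia-smi / NVML for VRAM and utilization.
    return {"node_id": node_id, "free_vram_gb": 22.5, "gpu_util": 0.15}

def run_agent(node_id: str, interval_s: int = 5) -> None:
    # 1. Register the node with the Coordinator's worker registry (illustrative path).
    requests.post(f"{COORDINATOR}/workers/register", json={"node_id": node_id}, timeout=10)
    # 2. Send heartbeat telemetry forever; a missed heartbeat is what triggers replacement.
    while True:
        requests.post(f"{COORDINATOR}/workers/heartbeat",
                      json=read_telemetry(node_id), timeout=10)
        time.sleep(interval_s)
```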
S3 Workspace & Storage Layer
Every user account gets a private S3-compatible bucket automatically provisioned at account creation. The bucket is backed by Garage — a self-hosted, geographically distributed S3-compatible object store. Your data stays under ResonTech infrastructure, never in a third-party public cloud unless you configure a Private Cluster with your own storage.
Workspace folder layout
How uploads work
The API generates presigned PUT URLs (valid 15 minutes) for uploads. Your browser or SDK sends data directly to Garage — zero bytes flow through the API server.
For large files (≥ 10 MB), the platform automatically initiates multipart upload: files are split into 50 MB chunks, up to 5 parts uploaded in parallel, with automatic abort on failure. Maximum supported upload size: 500 GB.
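A minimal client-side sketch of the single-part flow: the endpoint used to request the presigned URL is an assumption for illustration, but the direct PUT to object storage is standard S3 behavior.

```python
import requests

API = "https://api.resontech.example"        # hypothetical API base URL
TOKEN = "..."                                 # your API token

# 1. Ask the API for a presigned PUT URL (valid 15 minutes). Path is illustrative.
resp = requests.post(f"{API}/storage/presign-upload",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json={"key": "jobs/my-job/shards/shard_0.zip"})
resp.raise_for_status()
upload_url = resp.json()["url"]

# 2. PUT the file directly to Garage; no bytes pass through the API server.
with open("shard_0.zip", "rb") as f:
    requests.put(upload_url, data=f).raise_for_status()
```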
How workers access data
At job dispatch, each worker receives a presigned 1-hour GET URL for its assigned shard ZIP. The worker downloads and unpacks the ZIP into /var/tmp/nvflare/data. Your training code accesses this path via payload["dataset"]["data_root"]. The presigned URL expires after the download — workers never have persistent access to your bucket.
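A sketch of the worker-side handoff, assuming the data_root path above; the download helper and chunk size are illustrative.

```python
import os
import zipfile
import requests

DATA_ROOT = "/var/tmp/nvflare/data"

def fetch_shard(presigned_get_url: str) -> str:
    """Download the shard ZIP via the one-hour presigned URL and unpack it."""
    os.makedirs(DATA_ROOT, exist_ok=True)
    zip_path = os.path.join(DATA_ROOT, "shard.zip")
    with requests.get(presigned_get_url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
                f.write(chunk)
    zipfile.ZipFile(zip_path).extractall(DATA_ROOT)
    return DATA_ROOT

# Inside your training code, the same location arrives via the job payload:
# data_root = payload["dataset"]["data_root"]   # == "/var/tmp/nvflare/data"
```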
How results are returned
After each FL round, partial metrics are written to files_out/. After the final round, the aggregated model checkpoint is written to model_out/. Downloads use presigned 1-hour GET URLs served directly from Garage — no proxy bottleneck. Large checkpoints download at full network speed.
Training Pipeline — End to End
| Phase | Step | What happens |
|---|---|---|
| Data Prep | Shard dataset | Split into N ZIP archives (one per GPU worker). Upload to jobs/{name}/shards/ in your S3 bucket. |
| Submission | Job submitted | Main Backend validates all referenced paths, configs, and file existence in your bucket. |
| Routing | Worker selection | Coordinator scores all available workers and selects N optimal nodes based on VRAM, locality, and uptime. |
| Provisioning | Cluster created | Dispatcher creates a Kubernetes cluster via gRPC. NVFlare FL Server pod deployed. |
| Execution | Distributed training | FLARE server distributes the model. Each worker downloads its shard, trains locally, sends weight updates back. |
| Aggregation | FedAvg round | Server aggregates weight updates (weighted by sample count), saves global model, begins next round. |
| Delivery | Artifacts written | Final checkpoint written to model_out/, per-round metrics to files_out/ in your bucket. |
| Billing | Invoice generated | Cluster torn down. Pay-as-you-go invoice generated for actual GPU time used. |
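To make the pipeline concrete, a submission from the Python SDK could look roughly like this. The client class, method names, and parameters are hypothetical placeholders, not the real SDK interface; consult the SDK reference for the actual calls.

```python
# Hypothetical SDK usage; package, class, and method names are illustrative.
from resontech import Client   # assumed package name

client = Client(api_key="...")   # your API token

job = client.training.submit(
    name="llama-finetune",
    shards="jobs/llama-finetune/shards/",     # N shard ZIPs already in your bucket
    config="jobs/llama-finetune/config.yaml", # illustrative config path
    gpu_type="A100-80GB",
    num_workers=4,                            # N workers = N shards
)

job.wait()                       # block until the final FedAvg round completes
print(job.status, job.artifacts) # checkpoint in model_out/, metrics in files_out/
```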
Fault recovery during training
When a worker node loses connectivity:
- Coordinator detects the missing heartbeat within seconds
- A replacement worker is selected from the available pool
- The FLARE server handles the partial round gracefully, aggregating the updates it has already received
- Replacement node loads the last successful checkpoint
- Training resumes from the last checkpoint, so no completed work before that checkpoint is lost (resume sketch below)
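On the replacement node, resuming amounts to standard checkpoint loading. A minimal PyTorch sketch, assuming the last global checkpoint is available locally; the file name and dictionary keys are illustrative.

```python
import os
import torch

CKPT_PATH = "model_out/global_model_latest.pt"   # illustrative checkpoint location

def load_or_init(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Resume from the last successful checkpoint if one exists, else start fresh."""
    start_round = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_round = ckpt["round"] + 1          # continue from the next FL round
    return start_round
```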
Inference Pipeline — End to End
| Step | What happens |
|---|---|
| Model Loading | Specify a HuggingFace repo or workspace checkpoint path. The model is fetched and loaded onto the selected GPU. |
| Endpoint Config | Set GPU type, initial replica count, and autoscale policy (min/max replicas, scale-to-zero timeout). |
| Dispatcher Deploy | Dispatcher creates a Kubernetes Deployment + HPA + Load Balancer. Health check endpoints configured. |
| Serving Runtime | OpenAI-compatible REST API live: /v1/completions, /v1/chat/completions, /v1/embeddings. Cold start < 10s. |
| Autoscaling | HPA monitors RPS and GPU utilization. Scales out when RPS > threshold. Scales to zero after configurable idle timeout. |
| Response | Text completions, embeddings, or custom tensor outputs served with P50/P95/P99 latency monitoring. |
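Because the serving API is OpenAI-compatible, the standard openai Python client works against the endpoint by overriding its base URL. The endpoint URL and model name below are placeholders.

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://<your-endpoint>.resontech.example/v1",  # placeholder endpoint URL
    api_key="...",                                             # your ResonTech API token
)

resp = client.chat.completions.create(
    model="my-finetuned-llama",   # placeholder name of the deployed model
    messages=[{"role": "user", "content": "Summarize federated averaging in one line."}],
)
print(resp.choices[0].message.content)
```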
Inference autoscaling architecture
Incoming traffic hits a Load Balancer, which round-robins across healthy replicas. An HPA Controller monitors both RPS and GPU utilization and applies the following rules (decision logic sketched below):
- RPS > threshold → scale out (add replicas)
- RPS < threshold → scale in (remove replicas)
- RPS = 0 for N minutes → scale to zero (billing stops)
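These rules reduce to a small decision function. The sketch below uses RPS only (the GPU-utilization term is omitted for brevity), and the threshold values are illustrative assumptions, not platform defaults.

```python
import math

def desired_replicas(rps: float, idle_minutes: float,
                     rps_per_replica: float = 50.0,        # illustrative threshold
                     min_replicas: int = 0, max_replicas: int = 8,
                     scale_to_zero_after_min: float = 15.0) -> int:
    """Target replica count under the rules listed above (values are assumptions)."""
    if rps == 0 and idle_minutes >= scale_to_zero_after_min:
        return min_replicas                                 # scale to zero: billing stops
    needed = math.ceil(rps / rps_per_replica)               # keep each replica under threshold
    return min(max_replicas, max(1, needed))                # scale out / in within bounds
```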
| Specification | Value |
|---|---|
| Cold start time | < 10 seconds from zero replicas |
| API protocol | REST (HTTP/1.1 and HTTP/2) |
| LLM API format | OpenAI-compatible (/v1/completions, /v1/chat/completions, /v1/embeddings) |
| Autoscaling basis | RPS and GPU utilization; configurable min/max replicas |
| Scale-to-zero | Yes — replicas removed after configurable idle timeout |
| Model sources | HuggingFace Hub, workspace checkpoint, custom container |
| Load balancing | Round-robin across healthy replicas; health-check-aware |
| Monitoring | RPS, latency (P50/P95/P99), GPU utilization, replica count, error rate |
Framework Compatibility
- Training frameworks: PyTorch, TensorFlow, JAX, HuggingFace Trainer, Fastai
- Distributed strategies: DDP, FSDP, DeepSpeed ZeRO-2/ZeRO-3, model parallelism, NVIDIA FLARE federated learning
- Model hubs: HuggingFace Hub, local workspace checkpoints, custom containers
- Observability: Weights & Biases, TensorBoard, MLflow (bring your own tracker)
- Inference APIs: OpenAI-compatible REST endpoints for LLM serving, raw tensor outputs for custom models
- Runtime: CUDA 11.8+, NVIDIA GPUs from GTX 1080 Ti to H100 SXM5
| Model size | GPU count | Recommended strategy |
|---|---|---|
| Fits in one GPU VRAM | 1–8 GPUs | DDP |
| Larger than one GPU, fits in cluster VRAM | 4–64 GPUs | FSDP or DeepSpeed ZeRO-2 |
| Very large (70B+ parameters) | 8–128 GPUs | FSDP or DeepSpeed ZeRO-3 |
| Privacy-sensitive training data | Any count | NVIDIA FLARE (Federated) |
| Heterogeneous hardware across sites | Any count | NVIDIA FLARE (Federated) |
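The strategy choice maps directly onto how the model is wrapped in your training script. A minimal PyTorch sketch, assuming the launcher (torchrun or the FLARE client) has already initialized the process group and placed one process per GPU:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module, strategy: str) -> torch.nn.Module:
    # Assumes dist.init_process_group() has already been called by the launcher.
    local_rank = dist.get_rank() % torch.cuda.device_count()
    model = model.cuda(local_rank)
    if strategy == "ddp":    # model fits in one GPU: replicate weights, sync gradients
        return DDP(model, device_ids=[local_rank])
    if strategy == "fsdp":   # model larger than one GPU: shard params, grads, optimizer state
        return FSDP(model)
    raise ValueError(f"unknown strategy: {strategy}")
```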