
Infrastructure

Three GPU cluster types in depth — Shared Pool, Managed Cluster, and Private Cluster — plus hardware specs, FLARE architecture, and Kubernetes isolation.

Three GPU Runtime Pools

ResonTech exposes three distinct GPU execution environments. Each serves a different workload profile, with a consistent API surface across all three: resontech submit and resontech infer work identically regardless of which pool you target.

|  | Shared Pool | Managed Cluster | Private Cluster |
| --- | --- | --- | --- |
| Tenancy | Multi-tenant | Dedicated | Sovereign |
| Provisioning | Boot < 60s | Pre-reserved nodes | Your hardware |
| Billing | Pay-as-you-go for actual job time | Reserved capacity pricing | Enterprise contract |
| SLA | Best effort | 99.9% node availability | Enterprise SLA |
| Data stays on your infra | ◑ (ResonTech DC) | ◑ (ResonTech DC) | ✓ |
| Dedicated capacity | ✗ | ✓ | ✓ |
| No queue contention | ✗ | ✓ | ✓ |
| Custom GPU config | ✗ | ✓ | ✓ |
| Air-gapped mode | ✗ | ✗ | ✓ |
| Compliance (HIPAA, SOC2) | ✗ | ✗ | ✓ |
| Priority queue access | ✗ | ✓ | ✓ |
| Idle cost | Zero between runs | Reserved rate | Your hardware cost |

Shared Pool (Public Pool)

Multi-tenant GPU pool. Instant provisioning.

Jobs are routed to any available GPU node in the shared network. Each job runs in its own Kubernetes namespace with strict workload isolation — other tenants cannot observe or interfere with your workload. Ephemeral compute: no persistent state between jobs.

How it works

  • Submit a job → Coordinator selects the N best available nodes in real time
  • Dispatcher provisions a dedicated K8s namespace and network policies
  • Cold start under 60 seconds from submission to first training step
  • Cluster is torn down immediately after job completion or cancellation
  • Inference endpoints scale to zero between requests — no idle GPU cost

Billing

Pay-as-you-go for actual GPU time used, from job start to job end. Zero idle cost between runs. Inference endpoints scale to zero — billing stops when no requests are in flight.
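
As a rough illustration of the billing model (the hourly rate below is a made-up placeholder, not ResonTech pricing):

```python
from datetime import datetime, timedelta

# Hypothetical rate per GPU-hour; actual Shared Pool pricing is set by ResonTech.
RATE_PER_GPU_HOUR = 2.50

def job_cost(start: datetime, end: datetime, gpus: int) -> float:
    """Pay-as-you-go: billed only from job start to job end, zero idle cost."""
    hours = (end - start).total_seconds() / 3600
    return round(hours * gpus * RATE_PER_GPU_HOUR, 2)

start = datetime(2025, 1, 1, 9, 0)
print(job_cost(start, start + timedelta(hours=3, minutes=30), gpus=4))  # 35.0
```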

Best for

  • Experiments, prototyping, hyperparameter sweeps
  • One-off training runs and bursty research workloads
  • Teams without predictable GPU demand
  • Getting started without infrastructure commitment

Managed Cluster (Dedicated)

Reserved nodes. Isolated kernel. Production SLA.

Nodes are reserved exclusively for your organization and managed by an isolated job kernel that persists node state between jobs. Custom GPU configurations are available, with topology-aware placement. There is no queue contention with other teams — your jobs always run immediately on your reserved capacity.

How it works

  • Reserved GPU nodes are allocated exclusively to your organization
  • A dedicated kernel instance manages job scheduling within your cluster
  • Node state can persist between jobs (pre-loaded CUDA environment, warm model caches)
  • Topology-aware placement: multi-node jobs placed on nodes with NVLink or InfiniBand interconnects
  • Priority queue: managed cluster jobs always preempt shared pool jobs for the same hardware
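
As a toy illustration of the topology-aware placement described in the list above (the node inventory and bandwidth numbers are made up; the real kernel works from live cluster telemetry):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gpus_free: int
    interconnect: str          # "nvlink", "infiniband", or "ethernet"
    interconnect_gbps: float   # peak inter-node bandwidth

# Illustrative inventory only.
NODES = [
    Node("node-a", 8, "nvlink", 600.0),
    Node("node-b", 8, "infiniband", 200.0),
    Node("node-c", 4, "ethernet", 10.0),
]

def place_multi_node_job(nodes: list[Node], gpus_needed: int) -> list[Node]:
    """Prefer nodes with the fastest interconnect, then fill until the GPU count is met."""
    ranked = sorted(nodes, key=lambda n: n.interconnect_gbps, reverse=True)
    placement, remaining = [], gpus_needed
    for node in ranked:
        if remaining <= 0:
            break
        if node.gpus_free > 0:
            placement.append(node)
            remaining -= node.gpus_free
    if remaining > 0:
        raise RuntimeError("not enough free GPUs for this job")
    return placement

print([n.name for n in place_multi_node_job(NODES, gpus_needed=12)])  # ['node-a', 'node-b']
```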

SLA

  • 99.9% uptime on reserved node availability
  • Priority queue access — no waiting behind shared pool jobs
  • Dedicated support channel and account manager

Best for

  • Production training pipelines with predictable GPU demand
  • Continuous model retraining on a schedule
  • Inference serving at production scale with uptime guarantees
  • Teams that need consistent performance without queue variability
Typical GPU utilization improvement: from ~40% to ~80%+ by replacing manual scheduling with the kernel's intelligent job assignment.

Private Cluster (Sovereign Mode)

Your hardware. Your data. Your compliance team won't complain.

The ResonTech worker agent is installed on your existing GPU nodes. The kernel runs entirely within your network perimeter. Zero data egress — compute and storage stay local. Air-gapped mode is available: no inbound internet required after initial setup.

How it works

  • Install the ResonTech worker agent (Docker container) on your GPU nodes
  • Agents register with your private Coordinator instance, deployed within your perimeter
  • The kernel orchestrates jobs across your nodes — same API, same dashboard
  • Training data never leaves your network — workers access local storage directly
  • In air-gap mode: after initial deployment, no external internet connectivity required

Compliance

  • HIPAA, SOC 2, GDPR, DORA configurations available
  • Full RBAC and SSO integration with your identity provider
  • Complete audit logs for all job submissions, data access, and model outputs
  • Data residency: compute and storage remain within your jurisdiction

Best for

  • Organizations with strict data residency or sovereignty requirements
  • Healthcare, finance, defense, and government ML teams
  • Enterprises with existing GPU fleets that want orchestration without migration
  • Teams training on sensitive data (PII, PHI, proprietary datasets)

NVIDIA FLARE Architecture

ResonTech uses NVIDIA FLARE (NVFlare) as the production-grade distributed training framework for federated learning workloads. It manages all execution on the GPU cluster provisioned by the Dispatcher.

Why federated learning?

  • Privacy by architecture — Data shards stay on individual worker nodes. Only model weights travel between nodes and the aggregation server.
  • Heterogeneous hardware compatibility — Federated learning's round-based communication model is tolerant of heterogeneous hardware and variable network quality.
  • Enterprise & regulatory compliance — Training on sensitive data is only possible if that data never moves. The federated paradigm satisfies strict data residency requirements.

FLARE concepts in ResonTech

| FLARE Concept | What it means in ResonTech |
| --- | --- |
| FL Server | Central aggregation node. Deployed by the Dispatcher as a K8s Pod. Manages lifecycle and merges model updates. |
| FL Client | One per worker GPU. Trains on its data shard, sends updated weights back to the server. |
| Round (num_rounds) | One cycle: distribute model → local training → collect weights → aggregate. |
| FedAvg | Default aggregation: weighted average of all client updates by sample count. |
| Scatter-and-Gather | The FLARE workflow pattern. Server scatters work; gathers results and repeats. |
| CustomPersistor | User-supplied Python class defining how to save the global model after each round. |
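
The CustomPersistor row is ultimately a small Python class with two responsibilities: build the initial global model and save the aggregated model after each round. A framework-free sketch of that responsibility (the real class subclasses NVFlare's persistor base and receives FLARE context objects, which are omitted here; the ResNet backbone and checkpoint path are purely illustrative):

```python
import torch
from torchvision.models import resnet18

class CustomPersistorSketch:
    """Illustrative stand-in for a ResonTech CustomPersistor."""

    def __init__(self, ckpt_path: str = "global_model.pt"):
        self.ckpt_path = ckpt_path  # illustrative path

    def load_model(self) -> dict:
        # Build the initial global model, e.g. an ImageNet-pretrained backbone
        # (downloads weights on first use).
        model = resnet18(weights="DEFAULT")
        return model.state_dict()

    def save_model(self, global_weights: dict) -> None:
        # Called after each round's aggregation to persist the new global model.
        torch.save(global_weights, self.ckpt_path)
```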

Training round sequence

  • FL Server starts → CustomPersistor instantiates the model (e.g., ImageNet pretrained backbone)
  • Server sends initial weights to all FL clients via gRPC
  • Each client calls fl_train_model(payload, ...) with the global weights
  • Each client trains locally on its shard for N local epochs
  • Each client returns updated weights + sample count
  • Server aggregates all updates (FedAvg) → saves new global model → begins next round
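
The aggregation step in this sequence is FedAvg: each weight tensor is averaged across clients, weighted by how many samples each client trained on. A minimal NumPy sketch (the client payloads are invented for illustration; NVFlare performs this server-side):

```python
import numpy as np

def fedavg(client_updates):
    """FedAvg: average each tensor across clients, weighted by local sample count.

    client_updates: list of (weights_dict, num_samples) pairs returned by FL clients.
    """
    total = sum(n for _, n in client_updates)
    keys = client_updates[0][0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in client_updates)
        for k in keys
    }

# Two illustrative clients with different shard sizes.
client_a = ({"layer.weight": np.ones((2, 2))}, 300)
client_b = ({"layer.weight": np.zeros((2, 2))}, 100)
print(fedavg([client_a, client_b])["layer.weight"])  # 0.75 everywhere
```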

Kubernetes Isolation & Security

Each training job and inference endpoint runs in its own isolated Kubernetes environment. Per-job isolation provides strong security boundaries and resource guarantees.

  • Dedicated Kubernetes namespace per job — no resource sharing with other jobs
  • Network policies enforce zero cross-namespace traffic
  • Resource quotas prevent runaway jobs from starving other workloads
  • Cluster teardown is automatic after job completion — no orphaned pods
  • Client data stays in client workspace — only encrypted gradients cross boundaries
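
A rough sketch of these isolation primitives using the official Kubernetes Python client (the namespace name, policy name, and quota values are invented for illustration; in ResonTech this is handled by the Dispatcher, not by users):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
core, net = client.CoreV1Api(), client.NetworkingV1Api()

job_ns = "job-abc123"  # hypothetical per-job namespace name

# 1. Dedicated namespace for the job.
core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=job_ns)))

# 2. Allow ingress only from pods in the same namespace (zero cross-namespace traffic).
net.create_namespaced_network_policy(
    namespace=job_ns,
    body=client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="same-namespace-only"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),   # applies to every pod in the namespace
            policy_types=["Ingress"],
            ingress=[client.V1NetworkPolicyIngressRule(
                _from=[client.V1NetworkPolicyPeer(pod_selector=client.V1LabelSelector())]
            )],
        ),
    ),
)

# 3. Resource quota so a runaway job cannot starve other workloads.
core.create_namespaced_resource_quota(
    namespace=job_ns,
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="job-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.nvidia.com/gpu": "8", "limits.memory": "512Gi"}
        ),
    ),
)
```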

Communication architecture

| Channel | Protocol | Purpose |
| --- | --- | --- |
| Main Backend ↔ Dispatcher | gRPC over HTTP/2 | Cluster provisioning, status polling, scaling, termination |
| Coordinator ↔ Workers | Async message broker (RabbitMQ/NATS) | Job commands, heartbeats — decoupled so transient connectivity issues never fail jobs |
| Platform ↔ Client | WebSocket | Real-time updates: epoch progress, loss values, GPU utilization, checkpoint events |
| Workers ↔ FL Server | gRPC (NVFlare protocol) | Model weight distribution and gradient collection |
| Client ↔ S3 | HTTPS presigned URLs | Data upload and artifact download — never through the API |
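
For the Client ↔ S3 row, a presigned URL lets the client move bytes straight to object storage over HTTPS instead of through the API. A generic boto3 sketch, with bucket and key names as placeholders:

```python
import boto3
import requests

s3 = boto3.client("s3")

# Server side: mint a short-lived upload URL for the client's dataset (names are illustrative).
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "resontech-datasets", "Key": "org-42/job-abc123/train.tar"},
    ExpiresIn=3600,  # valid for one hour
)

# Client side: upload directly to S3 over HTTPS; the bytes never pass through the API.
with open("train.tar", "rb") as f:
    requests.put(upload_url, data=f).raise_for_status()
```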

Performance Characteristics

| Optimization | How it works | Result |
| --- | --- | --- |
| Multi-node distribution | Submit with a GPU count and the kernel splits your job across nodes automatically — no manual NCCL setup. | 8× throughput gain |
| Topology-aware scheduling | Multi-node jobs are placed on nodes with high-bandwidth interconnects — NVLink, InfiniBand. | < 15s node selection |
| Data sharding | Your dataset is automatically sharded across worker nodes at job start. | 3× data throughput |
| Checkpoint-aware recovery | Node failure triggers automatic rescheduling. Job resumes from last checkpoint — not epoch 0. | 100% recovery success |
| GPU utilization improvement | Intelligent job assignment replaces manual scheduling. | 40% → 80%+ utilization |
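
Checkpoint-aware recovery assumes the training loop writes resumable checkpoints: the kernel reschedules the job, and the code picks up where it left off. A minimal PyTorch pattern (the model, path, and epoch count are placeholders):

```python
import os
import torch
import torch.nn as nn

CKPT = "checkpoint.pt"  # illustrative path; in practice a shared or object-store location
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Resume from the last checkpoint if one exists (e.g. after a node failure and reschedule).
start_epoch = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 10):
    loss = model(torch.randn(32, 10)).pow(2).mean()  # stand-in training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Write a checkpoint every epoch so a restart never goes back to epoch 0.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT)
```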

Hardware Requirements (GPU Providers)

GPU providers join the network by installing the ResonTech worker agent. Supported hardware spans all NVIDIA GPU generations — from GTX 1080 Ti gaming cards to H100 data center nodes.

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | GTX 1080 Ti (11 GB) | RTX 3090 / A100 (24 GB+) |
| VRAM | 8 GB | 24 GB+ |
| System RAM | 32 GB | 128 GB+ |
| Storage | 100 GB NVMe | 200 GB NVMe |
| Network | 100 Mbps | 10 Gbps (for multi-node jobs) |
| OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
| CUDA | 11.8+ | Latest (12.x) |
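
A prospective provider can sanity-check a node against the minimum column with a short script like the one below (thresholds mirror the table; this is illustrative, not the verification the ResonTech worker agent itself performs):

```python
import shutil
import psutil
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
gpu = torch.cuda.get_device_properties(0)
vram_gb = gpu.total_memory / 1e9
ram_gb = psutil.virtual_memory().total / 1e9
disk_gb = shutil.disk_usage("/").free / 1e9

print(f"GPU: {gpu.name}, VRAM: {vram_gb:.0f} GB, RAM: {ram_gb:.0f} GB, free disk: {disk_gb:.0f} GB")
print("CUDA runtime:", torch.version.cuda)

# Minimums from the table above.
assert vram_gb >= 8, "Need at least 8 GB VRAM"
assert ram_gb >= 32, "Need at least 32 GB system RAM"
assert disk_gb >= 100, "Need at least 100 GB free NVMe storage"
```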

Provider availability tiers

| Tier | Uptime | Hardware | Job priority | Earnings |
| --- | --- | --- | --- | --- |
| Platinum | 99%+ | H100 / A100 | Maximum — first assignment | Highest multiplier |
| Gold | 95%+ | RTX 4090 / RTX 3090 | Standard assignment | Standard multiplier |
| Silver | 90%+ | Older generation GPUs | Spot/batch assignment | Reduced rate |
| Unverified | N/A | Any | Test/benchmark only | Zero until verified |
Contact office@reson.tech to join the supplier network.