
Infrastructure

Three GPU cluster types in depth — Shared Pool, Managed Cluster, and Private Cluster — plus hardware specs, FLARE architecture, and Kubernetes isolation.

Three GPU Runtime Pools

ResonTech exposes three distinct GPU execution environments. Each serves a different workload profile, with a consistent API surface across all three: resontech submit and resontech infer work identically regardless of which pool you target.

|  | Shared Pool | Managed Cluster | Private Cluster |
| --- | --- | --- | --- |
| Tenancy | Multi-tenant | Dedicated | Sovereign |
| Provisioning | Boot < 60s | Pre-reserved nodes | Your hardware |
| Billing | Pay-as-you-go for actual job time | Reserved capacity pricing | Enterprise contract |
| SLA | Best effort | 99.9% node availability | Enterprise SLA |
| Data stays on your infra | ◑ (ResonTech DC) | ◑ (ResonTech DC) | ✓ |
| Dedicated capacity | ✗ | ✓ | ✓ |
| No queue contention | ✗ | ✓ | ✓ |
| Custom GPU config | ✗ | ✓ | ✓ |
| Air-gapped mode | ✗ | ✗ | ✓ |
| Compliance (HIPAA, SOC2) | ✗ | ✗ | ✓ |
| Priority queue access | ✗ | ✓ | ✓ |
| Idle cost | Zero between runs | Reserved rate | Your hardware cost |

Shared Pool (Public Pool)

Multi-tenant GPU pool. Instant provisioning.

Jobs are routed to any available GPU node in the shared network. Each job runs in its own Kubernetes namespace with strict workload isolation — other tenants cannot observe or interfere with your workload. Ephemeral compute: no persistent state between jobs.

How it works

  • Submit a job → Coordinator selects the N best available nodes in real time
  • Dispatcher provisions a dedicated K8s namespace and network policies
  • Cold start under 60 seconds from submission to first training step
  • Cluster is torn down immediately after job completion or cancellation
  • Inference endpoints scale to zero between requests — no idle GPU cost

Billing

Pay-as-you-go for actual GPU time used, from job start to job end. Zero idle cost between runs. Inference endpoints scale to zero — billing stops when no requests are in flight.
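
As a rough illustration of the billing model (the hourly rate below is a made-up placeholder, not ResonTech pricing):

```python
from datetime import datetime, timedelta

# Hypothetical rate per GPU-hour; actual Shared Pool pricing is set by ResonTech.
RATE_PER_GPU_HOUR = 2.50

def job_cost(start: datetime, end: datetime, gpus: int) -> float:
    """Pay-as-you-go: billed only from job start to job end, zero idle cost."""
    hours = (end - start).total_seconds() / 3600
    return round(hours * gpus * RATE_PER_GPU_HOUR, 2)

start = datetime(2025, 1, 1, 9, 0)
print(job_cost(start, start + timedelta(hours=3, minutes=30), gpus=4))  # 35.0
```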

Best for

  • Experiments, prototyping, hyperparameter sweeps
  • One-off training runs and bursty research workloads
  • Teams without predictable GPU demand
  • Getting started without infrastructure commitment

Managed Cluster (Dedicated)

Reserved nodes. Isolated kernel. Production SLA.

Nodes are reserved exclusively for your organization and managed by an isolated job kernel that persists node state between jobs. Custom GPU configurations are available, with topology-aware placement. There is no queue contention with other teams — your jobs always run immediately on your reserved capacity.

How it works

  • Reserved GPU nodes are allocated exclusively to your organization
  • A dedicated kernel instance manages job scheduling within your cluster
  • Node state can persist between jobs (pre-loaded CUDA environment, warm model caches)
  • Topology-aware placement: multi-node jobs placed on nodes with NVLink or InfiniBand interconnects
  • Priority queue: managed cluster jobs always preempt shared pool jobs for the same hardware
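
As a toy illustration of the topology-aware placement described in the list above (the node inventory and bandwidth numbers are made up; the real kernel works from live cluster telemetry):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gpus_free: int
    interconnect: str          # "nvlink", "infiniband", or "ethernet"
    interconnect_gbps: float   # peak inter-node bandwidth

# Illustrative inventory only.
NODES = [
    Node("node-a", 8, "nvlink", 600.0),
    Node("node-b", 8, "infiniband", 200.0),
    Node("node-c", 4, "ethernet", 10.0),
]

def place_multi_node_job(nodes: list[Node], gpus_needed: int) -> list[Node]:
    """Prefer nodes with the fastest interconnect, then fill until the GPU count is met."""
    ranked = sorted(nodes, key=lambda n: n.interconnect_gbps, reverse=True)
    placement, remaining = [], gpus_needed
    for node in ranked:
        if remaining <= 0:
            break
        if node.gpus_free > 0:
            placement.append(node)
            remaining -= node.gpus_free
    if remaining > 0:
        raise RuntimeError("not enough free GPUs for this job")
    return placement

print([n.name for n in place_multi_node_job(NODES, gpus_needed=12)])  # ['node-a', 'node-b']
```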

SLA

  • 99.9% uptime on reserved node availability
  • Priority queue access — no waiting behind shared pool jobs
  • Dedicated support channel and account manager

Best for

  • Production training pipelines with predictable GPU demand
  • Continuous model retraining on a schedule
  • Inference serving at production scale with uptime guarantees
  • Teams that need consistent performance without queue variability
Typical GPU utilization improvement: from ~40% to ~80%+ by replacing manual scheduling with the kernel's intelligent job assignment.

Private Cluster (Sovereign Mode)

Your hardware. Your data. Your compliance team won't complain.

The ResonTech worker agent is installed on your existing GPU nodes. The kernel runs entirely within your network perimeter. Zero data egress — compute and storage stay local. Air-gapped mode is available: no inbound internet required after initial setup.

How it works

  • Install the ResonTech worker agent (Docker container) on your GPU nodes
  • Agents register with your private Coordinator instance, deployed within your perimeter
  • The kernel orchestrates jobs across your nodes — same API, same dashboard
  • Training data never leaves your network — workers access local storage directly
  • In air-gap mode: after initial deployment, no external internet connectivity required

Compliance

  • HIPAA, SOC 2, GDPR, DORA configurations available
  • Full RBAC and SSO integration with your identity provider
  • Complete audit logs for all job submissions, data access, and model outputs
  • Data residency: compute and storage remain within your jurisdiction

Best for

  • Organizations with strict data residency or sovereignty requirements
  • Healthcare, finance, defense, and government ML teams
  • Enterprises with existing GPU fleets that want orchestration without migration
  • Teams training on sensitive data (PII, PHI, proprietary datasets)

NVIDIA FLARE Architecture

ResonTech uses NVIDIA FLARE (NVFlare) as the production-grade distributed training framework for federated learning workloads. It manages all execution on the GPU cluster provisioned by the Dispatcher.

Why federated learning?

  • Privacy by architecture — Data shards stay on individual worker nodes. Only model weights travel between nodes and the aggregation server.
  • Heterogeneous hardware compatibility — Federated learning's round-based communication model is tolerant of heterogeneous hardware and variable network quality.
  • Enterprise & regulatory compliance — Training on sensitive data is only possible if that data never moves. The federated paradigm satisfies strict data residency requirements.

FLARE concepts in ResonTech

| FLARE Concept | What it means in ResonTech |
| --- | --- |
| FL Server | Central aggregation node. Deployed by the Dispatcher as a K8s Pod. Manages lifecycle and merges model updates. |
| FL Client | One per worker GPU. Trains on its data shard, sends updated weights back to the server. |
| Round (num_rounds) | One cycle: distribute model → local training → collect weights → aggregate. |
| FedAvg | Default aggregation: weighted average of all client updates by sample count. |
| Scatter-and-Gather | The FLARE workflow pattern. Server scatters work; gathers results and repeats. |
| CustomPersistor | User-supplied Python class defining how to save the global model after each round. |
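
The CustomPersistor row is ultimately a small Python class with two responsibilities: build the initial global model and save the aggregated model after each round. A framework-free sketch of that responsibility (the real class subclasses NVFlare's persistor base and receives FLARE context objects, which are omitted here; the ResNet backbone and checkpoint path are purely illustrative):

```python
import torch
from torchvision.models import resnet18

class CustomPersistorSketch:
    """Illustrative stand-in for a ResonTech CustomPersistor."""

    def __init__(self, ckpt_path: str = "global_model.pt"):
        self.ckpt_path = ckpt_path  # illustrative path

    def load_model(self) -> dict:
        # Build the initial global model, e.g. an ImageNet-pretrained backbone
        # (downloads weights on first use).
        model = resnet18(weights="DEFAULT")
        return model.state_dict()

    def save_model(self, global_weights: dict) -> None:
        # Called after each round's aggregation to persist the new global model.
        torch.save(global_weights, self.ckpt_path)
```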

Training round sequence

  • FL Server starts → CustomPersistor instantiates the model (e.g., ImageNet pretrained backbone)
  • Server sends initial weights to all FL clients via gRPC
  • Each client calls fl_train_model(payload, ...) with the global weights
  • Each client trains locally on its shard for N local epochs
  • Each client returns updated weights + sample count
  • Server aggregates all updates (FedAvg) → saves new global model → begins next round
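
The aggregation step in this sequence is FedAvg: each weight tensor is averaged across clients, weighted by how many samples each client trained on. A minimal NumPy sketch (the client payloads are invented for illustration; NVFlare performs this server-side):

```python
import numpy as np

def fedavg(client_updates):
    """FedAvg: average each tensor across clients, weighted by local sample count.

    client_updates: list of (weights_dict, num_samples) pairs returned by FL clients.
    """
    total = sum(n for _, n in client_updates)
    keys = client_updates[0][0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in client_updates)
        for k in keys
    }

# Two illustrative clients with different shard sizes.
client_a = ({"layer.weight": np.ones((2, 2))}, 300)
client_b = ({"layer.weight": np.zeros((2, 2))}, 100)
print(fedavg([client_a, client_b])["layer.weight"])  # 0.75 everywhere
```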

Kubernetes Isolation & Security

Each training job and inference endpoint runs in its own isolated Kubernetes environment. Per-job isolation provides strong security boundaries and resource guarantees.

  • Dedicated Kubernetes namespace per job — no resource sharing with other jobs
  • Network policies enforce zero cross-namespace traffic
  • Resource quotas prevent runaway jobs from starving other workloads
  • Cluster teardown is automatic after job completion — no orphaned pods
  • Client data stays in client workspace — only encrypted gradients cross boundaries
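
A rough sketch of these isolation primitives using the official Kubernetes Python client (the namespace name, policy name, and quota values are invented for illustration; in ResonTech this is handled by the Dispatcher, not by users):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
core, net = client.CoreV1Api(), client.NetworkingV1Api()

job_ns = "job-abc123"  # hypothetical per-job namespace name

# 1. Dedicated namespace for the job.
core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=job_ns)))

# 2. Allow ingress only from pods in the same namespace (zero cross-namespace traffic).
net.create_namespaced_network_policy(
    namespace=job_ns,
    body=client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="same-namespace-only"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),   # applies to every pod in the namespace
            policy_types=["Ingress"],
            ingress=[client.V1NetworkPolicyIngressRule(
                _from=[client.V1NetworkPolicyPeer(pod_selector=client.V1LabelSelector())]
            )],
        ),
    ),
)

# 3. Resource quota so a runaway job cannot starve other workloads.
core.create_namespaced_resource_quota(
    namespace=job_ns,
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="job-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.nvidia.com/gpu": "8", "limits.memory": "512Gi"}
        ),
    ),
)
```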

Communication architecture

| Channel | Protocol | Purpose |
| --- | --- | --- |
| Main Backend ↔ Dispatcher | gRPC over HTTP/2 | Cluster provisioning, status polling, scaling, termination |
| Coordinator ↔ Workers | Async message broker (RabbitMQ/NATS) | Job commands, heartbeats — decoupled so transient connectivity issues never fail jobs |
| Platform ↔ Client | WebSocket | Real-time updates: epoch progress, loss values, GPU utilization, checkpoint events |
| Workers ↔ FL Server | gRPC (NVFlare protocol) | Model weight distribution and gradient collection |
| Client ↔ S3 | HTTPS presigned URLs | Data upload and artifact download — never through the API |
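
For the Client ↔ S3 row, a presigned URL lets the client move bytes straight to object storage over HTTPS instead of through the API. A generic boto3 sketch, with bucket and key names as placeholders:

```python
import boto3
import requests

s3 = boto3.client("s3")

# Server side: mint a short-lived upload URL for the client's dataset (names are illustrative).
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "resontech-datasets", "Key": "org-42/job-abc123/train.tar"},
    ExpiresIn=3600,  # valid for one hour
)

# Client side: upload directly to S3 over HTTPS; the bytes never pass through the API.
with open("train.tar", "rb") as f:
    requests.put(upload_url, data=f).raise_for_status()
```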

Performance Characteristics

| Optimization | How it works | Result |
| --- | --- | --- |
| Multi-node distribution | Submit with a GPU count and the kernel splits your job across nodes automatically — no manual NCCL setup. | 8× throughput gain |
| Topology-aware scheduling | Multi-node jobs are placed on nodes with high-bandwidth interconnects — NVLink, InfiniBand. | < 15s node selection |
| Data sharding | Your dataset is automatically sharded across worker nodes at job start. | 3× data throughput |
| Checkpoint-aware recovery | Node failure triggers automatic rescheduling. Job resumes from last checkpoint — not epoch 0. | 100% recovery success |
| GPU utilization improvement | Intelligent job assignment replaces manual scheduling. | 40% → 80%+ utilization |
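
Checkpoint-aware recovery assumes the training loop writes resumable checkpoints: the kernel reschedules the job, and the code picks up where it left off. A minimal PyTorch pattern (the model, path, and epoch count are placeholders):

```python
import os
import torch
import torch.nn as nn

CKPT = "checkpoint.pt"  # illustrative path; in practice a shared or object-store location
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Resume from the last checkpoint if one exists (e.g. after a node failure and reschedule).
start_epoch = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 10):
    loss = model(torch.randn(32, 10)).pow(2).mean()  # stand-in training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Write a checkpoint every epoch so a restart never goes back to epoch 0.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT)
```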

Hardware Requirements (GPU Providers)

GPU providers join the network by installing the ResonTech worker agent. Supported hardware spans all NVIDIA GPU generations — from GTX 1080 Ti gaming cards to H100 data center nodes.

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | GTX 1080 Ti (11 GB) | RTX 3090 / A100 (24 GB+) |
| VRAM | 8 GB | 24 GB+ |
| System RAM | 32 GB | 128 GB+ |
| Storage | 100 GB NVMe | 200 GB NVMe |
| Network | 100 Mbps | 10 Gbps (for multi-node jobs) |
| OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
| CUDA | 11.8+ | Latest (12.x) |
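
A prospective provider can sanity-check a node against the minimum column with a short script like the one below (thresholds mirror the table; this is illustrative, not the verification the ResonTech worker agent itself performs):

```python
import shutil
import psutil
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
gpu = torch.cuda.get_device_properties(0)
vram_gb = gpu.total_memory / 1e9
ram_gb = psutil.virtual_memory().total / 1e9
disk_gb = shutil.disk_usage("/").free / 1e9

print(f"GPU: {gpu.name}, VRAM: {vram_gb:.0f} GB, RAM: {ram_gb:.0f} GB, free disk: {disk_gb:.0f} GB")
print("CUDA runtime:", torch.version.cuda)

# Minimums from the table above.
assert vram_gb >= 8, "Need at least 8 GB VRAM"
assert ram_gb >= 32, "Need at least 32 GB system RAM"
assert disk_gb >= 100, "Need at least 100 GB free NVMe storage"
```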

Provider availability tiers

| Tier | Uptime | Hardware | Job priority | Earnings |
| --- | --- | --- | --- | --- |
| Platinum | 99%+ | H100 / A100 | Maximum — first assignment | Highest multiplier |
| Gold | 95%+ | RTX 4090 / RTX 3090 | Standard assignment | Standard multiplier |
| Silver | 90%+ | Older generation GPUs | Spot/batch assignment | Reduced rate |
| Unverified | N/A | Any | Test/benchmark only | Zero until verified |
Contact office@reson.tech to join the supplier network.