Three GPU Runtime Pools
ResonTech exposes three distinct GPU execution environments. Each serves a different workload profile, and the API surface is consistent across all three: resontech submit and resontech infer work identically regardless of which pool you target (a minimal sketch follows the comparison table).
| | Shared Pool | Managed Cluster | Private Cluster |
|---|---|---|---|
| Tenancy | Multi-tenant | Dedicated | Sovereign |
| Provisioning | Boot < 60s | Pre-reserved nodes | Your hardware |
| Billing | Pay-as-you-go for actual job time | Reserved capacity pricing | Enterprise contract |
| SLA | Best effort | 99.9% node availability | Enterprise SLA |
| Data stays on your infra | ✗ | ◑ (ResonTech DC) | ✓ |
| Dedicated capacity | ✗ | ✓ | ✓ |
| No queue contention | ✗ | ✓ | ✓ |
| Custom GPU config | ✗ | ✓ | ✓ |
| Air-gapped mode | ✗ | ✗ | ✓ |
| Compliance (HIPAA, SOC 2) | ✗ | ◑ | ✓ |
| Priority queue access | ✗ | ✓ | ✓ |
| Idle cost | Zero between runs | Reserved rate | Your hardware cost |
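A minimal sketch of targeting each pool from Python by wrapping the documented CLI. Only resontech submit itself appears in this page; the --pool flag, its values, and the train.yaml job spec are assumptions made for illustration.

```python
# Sketch: the same submission call shape, regardless of target pool.
# Assumption: a "--pool" flag selects the runtime pool; "train.yaml" is a placeholder job spec.
import subprocess

def submit(pool: str, job_spec: str) -> None:
    """Submit a training job to the given runtime pool via the resontech CLI."""
    subprocess.run(
        ["resontech", "submit", job_spec, "--pool", pool],
        check=True,  # raise if the CLI exits non-zero
    )

submit("shared", "train.yaml")
submit("managed", "train.yaml")
submit("private", "train.yaml")
```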
Managed Cluster (Dedicated)
Reserved nodes. Isolated kernel. Production SLA.
Nodes are reserved exclusively for your organization and managed by an isolated job kernel that persists node state between jobs. Custom GPU configurations are available with topology-aware placement. There is no queue contention with other teams: your jobs start immediately on your reserved capacity.
How it works
- Reserved GPU nodes are allocated exclusively to your organization
- A dedicated kernel instance manages job scheduling within your cluster
- Node state can persist between jobs (pre-loaded CUDA environment, warm model caches); a sketch of how a training job can exploit this follows the list
- Topology-aware placement: multi-node jobs placed on nodes with NVLink or InfiniBand interconnects
- Priority queue: managed cluster jobs always preempt shared pool jobs for the same hardware
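Persistent node state is what lets back-to-back jobs skip cold starts. The following is a minimal sketch of how a training entrypoint could exploit a warm cache; the /cache/models path and the helper names are illustrative assumptions, not a documented ResonTech contract.

```python
import os
import torch

CACHE_DIR = "/cache/models"  # assumption: a node-local path that survives between jobs

def build_from_scratch(name: str) -> torch.nn.Module:
    # Stand-in for whatever download / instantiation a cold start would require.
    return torch.nn.Linear(256, 10)

def load_backbone(name: str) -> torch.nn.Module:
    """Reuse weights staged by a previous job on this node, if present."""
    model = build_from_scratch(name)
    cached = os.path.join(CACHE_DIR, f"{name}.pt")
    if os.path.exists(cached):
        model.load_state_dict(torch.load(cached, map_location="cpu"))  # warm start: no download
    else:
        os.makedirs(CACHE_DIR, exist_ok=True)
        torch.save(model.state_dict(), cached)  # stage weights for the next job on this node
    return model
```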
SLA
- 99.9% uptime on reserved node availability
- Priority queue access — no waiting behind shared pool jobs
- Dedicated support channel and account manager
Best for
- Production training pipelines with predictable GPU demand
- Continuous model retraining on a schedule
- Inference serving at production scale with uptime guarantees
- Teams that need consistent performance without queue variability
Private Cluster (Sovereign Mode)
Your hardware. Your data. Your compliance team won't complain.
The ResonTech worker agent is installed on your existing GPU nodes. The kernel runs entirely within your network perimeter. Zero data egress — compute and storage stay local. Air-gapped mode is available: no inbound internet required after initial setup.
How it works
- Install the ResonTech worker agent (Docker container) on your GPU nodes; see the sketch after this list
- Agents register with your private Coordinator instance, deployed within your perimeter
- The kernel orchestrates jobs across your nodes — same API, same dashboard
- Training data never leaves your network — workers access local storage directly
- In air-gap mode: after initial deployment, no external internet connectivity required
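A minimal sketch of starting the worker agent on a node with the Docker SDK for Python. The image name, environment variables, and data mount are assumptions for illustration, not the documented install procedure.

```python
import docker

client = docker.from_env()

# Assumption: image tag, env var names, and the local data mount are illustrative placeholders.
client.containers.run(
    "resontech/worker-agent:latest",  # hypothetical image name
    detach=True,
    restart_policy={"Name": "always"},
    environment={
        "COORDINATOR_URL": "https://coordinator.internal.example",  # your private Coordinator
        "NODE_LABELS": "gpu=a100,tier=private",
    },
    volumes={"/data": {"bind": "/data", "mode": "ro"}},  # workers read local storage directly
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],  # expose GPUs
)
```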
Compliance
- HIPAA, SOC 2, GDPR, DORA configurations available
- Full RBAC and SSO integration with your identity provider
- Complete audit logs for all job submissions, data access, and model outputs
- Data residency: compute and storage remain within your jurisdiction
Best for
- Organizations with strict data residency or sovereignty requirements
- Healthcare, finance, defense, and government ML teams
- Enterprises with existing GPU fleets that want orchestration without migration
- Teams training on sensitive data (PII, PHI, proprietary datasets)
NVIDIA FLARE Architecture
ResonTech uses NVIDIA FLARE (NVFlare) as the production-grade distributed training framework for federated learning workloads. FLARE manages all execution on the GPU cluster provisioned by the Dispatcher.
Why federated learning?
- Privacy by architecture — Data shards stay on individual worker nodes. Only model weights travel between nodes and the aggregation server.
- Heterogeneous hardware compatibility — Federated learning's round-based communication model is tolerant of heterogeneous hardware and variable network quality.
- Enterprise & regulatory compliance — Training on sensitive data is only possible if that data never moves. The federated paradigm satisfies strict data residency requirements.
FLARE concepts in ResonTech
| FLARE Concept | What it means in ResonTech |
|---|---|
| FL Server | Central aggregation node. Deployed by Dispatcher as a K8s Pod. Manages lifecycle and merges model updates. |
| FL Client | One per worker GPU. Trains on its data shard, sends updated weights back to the server. |
| Round (num_rounds) | One cycle: distribute model → local training → collect weights → aggregate. |
| FedAvg | Default aggregation: weighted average of all client updates by sample count. |
| Scatter-and-Gather | The FLARE workflow pattern. Server scatters work; gathers results and repeats. |
| CustomPersistor | User-supplied Python class defining how to save the global model after each round. |
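FedAvg is a sample-count-weighted mean of client updates. A framework-free sketch of that aggregation step, shown outside NVFlare for clarity; in the platform it runs inside the server's scatter-and-gather workflow.

```python
import numpy as np

def fedavg(updates: list[tuple[dict[str, np.ndarray], int]]) -> dict[str, np.ndarray]:
    """Aggregate (weights_dict, num_samples) pairs into new global weights."""
    total = sum(n for _, n in updates)
    keys = updates[0][0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in updates)  # each client weighted by its share of samples
        for k in keys
    }

# Two clients with different shard sizes contribute proportionally.
client_a = ({"layer.weight": np.ones((2, 2))}, 300)
client_b = ({"layer.weight": np.zeros((2, 2))}, 100)
global_weights = fedavg([client_a, client_b])  # layer.weight == 0.75 everywhere
```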
Training round sequence
- FL Server starts → CustomPersistor instantiates the model (e.g., ImageNet pretrained backbone)
- Server sends initial weights to all FL clients via gRPC
- Each client calls fl_train_model(payload, ...) with the global weights
- Each client trains locally on its data shard for N local epochs (the client-side work is sketched after this sequence)
- Each client returns updated weights + sample count
- Server aggregates all updates (FedAvg) → saves new global model → begins next round
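The exact signature and internals of fl_train_model belong to the platform and are not reproduced here; the following framework-free sketch only illustrates the client-side shape of one round: take the global weights, run a few local epochs on the local shard, and return updated weights plus the sample count used for FedAvg weighting.

```python
import numpy as np

def client_update(global_weights: np.ndarray, shard_x: np.ndarray, shard_y: np.ndarray,
                  local_epochs: int = 2, lr: float = 0.01) -> dict:
    """One client's contribution to a round (illustrative linear model, not fl_train_model itself)."""
    w = global_weights.copy()                        # start from the weights the server scattered
    for _ in range(local_epochs):                    # N local epochs on this shard only
        preds = shard_x @ w
        grad = shard_x.T @ (preds - shard_y) / len(shard_x)
        w -= lr * grad
    return {"weights": w, "num_samples": len(shard_x)}  # what the server gathers and aggregates
```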
Kubernetes Isolation & Security
Each training job and inference endpoint runs in its own isolated Kubernetes environment. Per-job isolation provides strong security boundaries and resource guarantees.
- Dedicated Kubernetes namespace per job, with no resource sharing with other jobs (the pattern is sketched after this list)
- Network policies enforce zero cross-namespace traffic
- Resource quotas prevent runaway jobs from starving other workloads
- Cluster teardown is automatic after job completion — no orphaned pods
- Client data stays in client workspace — only encrypted gradients cross boundaries
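A minimal sketch of the per-job isolation pattern using the official Kubernetes Python client: one namespace and one resource quota per job. The platform automates this; the job name and quota values here are illustrative assumptions.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

job_id = "job-42a7"  # hypothetical job identifier

# One namespace per job: nothing is shared with other jobs.
core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=job_id)))

# A quota so a runaway job cannot starve other workloads.
core.create_namespaced_resource_quota(
    namespace=job_id,
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name=f"{job_id}-quota"),
        spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "8", "pods": "16"}),
    ),
)
```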
Communication architecture
| Channel | Protocol | Purpose |
|---|---|---|
| Main Backend ↔ Dispatcher | gRPC over HTTP/2 | Cluster provisioning, status polling, scaling, termination |
| Coordinator ↔ Workers | Async Message Broker (RabbitMQ/NATS) | Job commands, heartbeats — decoupled so transient connectivity issues never fail jobs |
| Platform ↔ Client | WebSocket | Real-time updates: epoch progress, loss values, GPU utilization, checkpoint events |
| Workers ↔ FL Server | gRPC (NVFlare protocol) | Model weight distribution and gradient collection |
| Client ↔ S3 | HTTPS presigned URLs | Data upload and artifact download — never through the API |
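Uploads go straight to object storage rather than through the API. A minimal sketch of a presigned-URL upload with requests; how the presigned URL is obtained (an earlier API response) is assumed here.

```python
import requests

# Assumption: the platform has already returned a presigned PUT URL for this dataset.
presigned_url = "https://s3.amazonaws.com/bucket/datasets/train.tar?X-Amz-Signature=..."

with open("train.tar", "rb") as f:
    resp = requests.put(presigned_url, data=f)  # bytes go directly to S3, never through the API
resp.raise_for_status()
```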
Performance Characteristics
| Optimization | How it works | Result |
|---|---|---|
| Multi-node distribution | Submit with a GPU count and the kernel splits your job across nodes automatically — no manual NCCL setup. | 8× throughput gain |
| Topology-aware scheduling | Multi-node jobs are placed on nodes with high-bandwidth interconnects — NVLink, InfiniBand. | < 15s node selection |
| Data sharding | Your dataset is automatically sharded across worker nodes at job start. | 3× data throughput |
| Checkpoint-aware recovery | Node failure triggers automatic rescheduling. Job resumes from last checkpoint — not epoch 0. | 100% recovery success |
| GPU utilization improvement | Intelligent job assignment replaces manual scheduling. | 40% → 80%+ utilization |
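Checkpoint-aware recovery depends on the training loop writing resumable state. A minimal PyTorch-style sketch of that pattern; the checkpoint path and per-epoch save cadence are assumptions, not platform requirements.

```python
import os
import torch

CKPT = "/artifacts/checkpoint.pt"  # assumption: a path that survives rescheduling

def train(model, optimizer, loader, epochs: int) -> None:
    start_epoch = 0
    if os.path.exists(CKPT):                          # rescheduled after a node failure?
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1              # resume here, not at epoch 0
    for epoch in range(start_epoch, epochs):
        for batch, target in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(batch), target)
            loss.backward()
            optimizer.step()
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, CKPT)            # one durable checkpoint per epoch
```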
Hardware Requirements (GPU Providers)
GPU providers join the network by installing the ResonTech worker agent. Supported hardware spans all NVIDIA GPU generations — from GTX 1080 Ti gaming cards to H100 data center nodes.
| Component | Minimum | Recommended |
|---|---|---|
| GPU | GTX 1080 Ti (11 GB) | RTX 3090 / A100 (24 GB+) |
| VRAM | 8 GB | 24 GB+ |
| System RAM | 32 GB | 128 GB+ |
| Storage | 100 GB NVMe | 200 GB NVMe |
| Network | 100 Mbps | 10 Gbps (for multi-node jobs) |
| OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
| CUDA | 11.8+ | Latest (12.x) |
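A quick preflight sketch a provider could run before installing the agent. The thresholds mirror the minimums above; the script itself is illustrative, not an official ResonTech tool (Linux only).

```python
import os
import shutil
import subprocess

def preflight() -> dict:
    """Rough check against the minimum provider requirements listed above."""
    vram_mib = int(subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    ).splitlines()[0])
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    free_gb = shutil.disk_usage("/").free / 1e9
    return {
        "vram_ok": vram_mib >= 8 * 1024,   # >= 8 GB VRAM
        "ram_ok": ram_gb >= 32,            # >= 32 GB system RAM
        "storage_ok": free_gb >= 100,      # >= 100 GB free storage
    }

print(preflight())
```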
Provider availability tiers
| Tier | Uptime | Hardware | Job priority | Earnings |
|---|---|---|---|---|
| Platinum | 99%+ | H100 / A100 | Maximum — first assignment | Highest multiplier |
| Gold | 95%+ | RTX 4090 / RTX 3090 | Standard assignment | Standard multiplier |
| Silver | 90%+ | Older generation GPUs | Spot/batch assignment | Reduced rate |
| Unverified | N/A | Any | Test/benchmark only | Zero until verified |