The GPU Network
Behind the Speed
Distributed GPU pools, intelligent scheduling, fault-tolerant execution. Three cluster types for every scale of ML work.
Three execution modes.
One API.
Public Pool, Managed Cluster, and Private Cluster have fundamentally different architectures. Same API surface. Different guarantees.
Public Pool
Shared GPU network. Multi-tenant isolation.
- Jobs routed to any available GPU node in the network
- Multi-tenant with strict workload isolation
- Ephemeral compute — no persistent state between jobs
- Node capacity reported to the kernel in real time
- FIFO queue with configurable priority lanes
Your workspace is mounted read-only at job start. Outputs are written back. The node is wiped clean after job completion.
Experiments, prototyping, one-off training runs
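From inside a job, that workspace lifecycle looks roughly like the sketch below. This is a minimal sketch only, assuming illustrative /workspace (read-only input) and /outputs (written back) paths rather than the actual mount points:

```python
# Minimal sketch of a public-pool job: read from the read-only workspace
# mount, write results to an output directory that is synced back when the
# job completes. The paths are assumptions, not the real mount points.
from pathlib import Path
import json

WORKSPACE = Path("/workspace")   # assumed: mounted read-only at job start
OUTPUTS = Path("/outputs")       # assumed: written back after completion

def main() -> None:
    config = json.loads((WORKSPACE / "config.json").read_text())
    # ... training / evaluation work happens here ...
    OUTPUTS.mkdir(parents=True, exist_ok=True)
    (OUTPUTS / "metrics.json").write_text(json.dumps({"status": "done", "config": config}))

if __name__ == "__main__":
    main()
```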
Decentralized.
Not just cloud-hosted.
GPU capacity is aggregated across a distributed network — better availability, broader hardware diversity, no single vendor lock-in.
Nodes Register
GPU suppliers install the ResonTech worker. It reports GPU specs, VRAM, network speed, and availability to the kernel.
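A registration report might carry fields along these lines. The field names below are illustrative assumptions, not the worker's actual wire format:

```python
# Illustrative shape of a node registration report. All field names are
# assumptions made for this sketch, not the worker's real schema.
from dataclasses import dataclass, asdict

@dataclass
class NodeReport:
    node_id: str
    gpu_model: str        # e.g. "A100-SXM4-80GB"
    gpu_count: int
    vram_gb: int          # per-GPU memory
    interconnect: str     # "nvlink", "infiniband", or "ethernet"
    uplink_gbps: float    # measured network speed
    available: bool       # currently inside an availability window?

report = NodeReport("node-7f3a", "A100-SXM4-80GB", 8, 80, "nvlink", 100.0, True)
payload = asdict(report)  # what the worker would send to the kernel
```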
Smart Matching
When a job is submitted, the kernel matches it to optimal nodes — by GPU type, locality, bandwidth, and current load.
Distributed Execution
Large jobs are split across multiple nodes. Data is sharded. Workers communicate via high-bandwidth interconnects.
Self-Healing
Heartbeat monitoring detects node failure in seconds. Jobs reschedule automatically to healthy nodes, resuming from checkpoint.
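In outline, failure detection comes down to tracking last-seen heartbeats and handing a silent node's jobs back to the scheduler. A sketch only; the timeout value and the reschedule hook are assumptions:

```python
# Sketch of heartbeat-based failure detection. The 10-second timeout and the
# reschedule() callback are assumptions for illustration, not kernel internals.
import time

HEARTBEAT_TIMEOUT_S = 10.0
last_seen: dict[str, float] = {}    # node_id -> time of last heartbeat
running: dict[str, list[str]] = {}  # node_id -> job_ids currently on that node

def on_heartbeat(node_id: str) -> None:
    last_seen[node_id] = time.monotonic()

def sweep(reschedule) -> None:
    """Mark silent nodes as failed and requeue their jobs from checkpoint."""
    now = time.monotonic()
    for node_id, seen in list(last_seen.items()):
        if now - seen > HEARTBEAT_TIMEOUT_S:
            for job_id in running.pop(node_id, []):
                reschedule(job_id, resume_from="latest_checkpoint")
            del last_seen[node_id]
```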
Simplified job routing flow. Jobs are dispatched to best-fit nodes across the network.
Why we're faster.
The technical reasons.
Multi-node automatic distribution
Submit with --gpus 8 and we split your job across nodes automatically. No manual NCCL setup, no rank configuration.
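In code, the same submission could look something like this. The resontech package, Client class, and submit() signature here are hypothetical, sketched only to show how little the caller has to specify:

```python
# Hypothetical client-side submission. The resontech package, Client class,
# and submit() signature are assumptions for illustration, not a published SDK.
from resontech import Client  # assumed import

client = Client()
job = client.submit(
    command="python train.py",
    gpus=8,            # equivalent of the --gpus 8 flag
    gpu_type="A100",   # optional preference; routed to matching nodes first
)
print(job.id, job.status)  # NCCL setup and rank assignment happen for you
```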
Topology-aware scheduling
Multi-node jobs are placed on nodes with high-bandwidth interconnects — NVLink, InfiniBand. Slower nodes picked last.
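One way to picture that placement decision is as a per-node score that heavily weights interconnect bandwidth. The ranking and weights below are made up for illustration, not the kernel's actual policy:

```python
# Toy placement score for multi-node jobs. The interconnect ranking and the
# weights are illustrative assumptions, not the real scheduling policy.
INTERCONNECT_RANK = {"nvlink": 3, "infiniband": 2, "ethernet": 1}

def placement_score(node: dict, job_gpu_type: str) -> float:
    score = 10.0 * INTERCONNECT_RANK.get(node["interconnect"], 0)
    if node["gpu_model"].startswith(job_gpu_type):
        score += 5.0                      # exact GPU-type matches rank first
    score -= 2.0 * node["current_load"]   # lightly loaded nodes are preferred
    return score

nodes = [
    {"gpu_model": "A100-SXM4-80GB", "interconnect": "nvlink", "current_load": 0.4},
    {"gpu_model": "A100-PCIE-40GB", "interconnect": "ethernet", "current_load": 0.1},
]
best = max(nodes, key=lambda n: placement_score(n, "A100"))  # picks the NVLink node
```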
Data sharding at mount time
Your dataset is automatically sharded across worker nodes. No single staging bottleneck, no redundant transfers.
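Conceptually, each worker sees only its own slice of the dataset, along the lines of the round-robin assignment below (a sketch of the idea, not the actual shard layout):

```python
# Round-robin shard assignment: worker k of n gets every n-th file, so no
# single node has to stage the whole dataset. A sketch of the idea only.
def shard_for_worker(files: list[str], rank: int, world_size: int) -> list[str]:
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

files = [f"part-{i:05d}.parquet" for i in range(10)]
print(shard_for_worker(files, rank=0, world_size=4))  # parts 00000, 00004, 00008
```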
Checkpoint-aware recovery
Node failure triggers automatic rescheduling. The job resumes from its last checkpoint, not epoch 0, so only the work since that checkpoint is repeated.
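On the user side, the only contract is that the training loop resumes from the newest checkpoint it finds. A minimal sketch, with the checkpoint directory and file format assumed:

```python
# Minimal resume-from-checkpoint loop. The checkpoint directory and JSON
# format are assumptions; any periodic checkpointing scheme works the same way.
from pathlib import Path
import json

CKPT_DIR = Path("/outputs/checkpoints")  # assumed location that survives rescheduling

def next_epoch() -> int:
    ckpts = sorted(CKPT_DIR.glob("epoch-*.json"))
    return json.loads(ckpts[-1].read_text())["epoch"] + 1 if ckpts else 0

def train(total_epochs: int = 100) -> None:
    CKPT_DIR.mkdir(parents=True, exist_ok=True)
    for epoch in range(next_epoch(), total_epochs):
        # ... one epoch of training ...
        (CKPT_DIR / f"epoch-{epoch:04d}.json").write_text(json.dumps({"epoch": epoch}))
```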
Jobs requesting specific GPU types are routed to matching nodes first
Multi-node jobs prefer nodes with NVLink or InfiniBand interconnects
Lower-priority jobs yield to high-priority ones; they resume from checkpoint
Inference endpoints scale replicas up/down based on request throughput
Reclaimed spot nodes trigger automatic rescheduling, not failure
The kernel can prefer cheaper nodes when latency is not the constraint
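Taken together, the routing rules above amount to a priority queue with checkpoint-aware preemption. A toy sketch of that one rule, with invented priorities and function names:

```python
# Toy priority-lane queue illustrating preemption: a lower-priority job yields,
# goes back to the queue, and later resumes from checkpoint. Illustration only.
import heapq
from typing import Optional

queue: list[tuple[int, str]] = []  # (priority, job_id); lower number = higher priority

def submit(job_id: str, priority: int) -> None:
    heapq.heappush(queue, (priority, job_id))

def maybe_preempt(running_job: str, running_priority: int) -> Optional[str]:
    """If a higher-priority job is waiting, requeue the running job and return the new one."""
    if queue and queue[0][0] < running_priority:
        submit(running_job, running_priority)  # it will resume from checkpoint later
        return heapq.heappop(queue)[1]         # the job that takes over the GPUs
    return None
```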
Have idle GPUs?
Put them to work.
Join the ResonTech supplier network. Install the worker and your nodes are registered into the distributed GPU pool, running real ML workloads.
Register Your Hardware
Sign up as a GPU supplier. Provide hardware specs, location, and availability windows.
Install the Worker
One-line install of the ResonTech node worker. It registers your GPUs with the network and receives jobs routed by the kernel.
Configure Availability
Define GPU availability windows and resource allocation. The kernel registers your node capacity and factors it into job routing.
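An availability configuration could look something like the dictionary below. The keys and values are illustrative, not the worker's actual config format:

```python
# Illustrative availability and allocation config for a supplier node.
# Keys and values are assumptions for the sketch, not the real schema.
availability_config = {
    "windows": [
        {"days": ["sat", "sun"], "start": "00:00", "end": "24:00"},
        {"days": ["mon", "tue", "wed", "thu", "fri"], "start": "20:00", "end": "06:00"},
    ],
    "allocation": {
        "gpus": 8,                 # how many of the node's GPUs to offer
        "max_vram_fraction": 1.0,  # share of VRAM jobs may use
        "allow_spot": True,        # reclaimable capacity; jobs reschedule on reclaim
    },
}
```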
Monitor Utilization
Jobs route to your hardware automatically. Full dashboard visibility into node utilization, job history, and resource metrics.
Minimum specs.
And what we recommend.
For both running jobs on the network and supplying GPUs to the network.
Note: Requirements vary by job type. Inference serving requires less RAM than large training runs. Multi-node training requires NVLink or InfiniBand for best performance. Contact us for specific hardware validation.
Ready to run on the network?
Start in 60 seconds.
Start with the public pool — deploy instantly, no configuration required. Book a demo for managed or private cluster access.