Machine Type | GPU Options | VRAM | RAM | vCPUs | Hobby Price per Hour | Pro Price per Hour |
---|---|---|---|---|---|---|
Medium | T4, A4000 | 16GB | 16GB | 8 | $0.99 | $0.79 |
Large | A10G, A5000 | 24GB | 32GB | 8 | $1.75 | $1.39 |
X-Large | A6000 | 48GB | 48GB | 28 | $2.50 | $1.99 |
X-Large Plus | L40S, L40 | 48GB | 64GB | 28 | $2.99 | $2.15 |
2X-Large | A100 | 80GB | 96GB | 28 | $4.99 | $3.99 |
2X-Large Plus | H100 | 80GB | 180GB | 28 | $7.49 | $5.99 |
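
For rough budgeting, the hourly rates above translate directly into monthly figures. Below is a minimal sketch; the rates are copied from the Pro column of the table, while the ~730 hours/month figure and the helper function are illustrative assumptions, not a platform API:

```python
# Rough monthly cost estimate from the hourly rates above.
# Rates are the Pro prices from the table; HOURS_PER_MONTH (~730)
# and this helper are illustrative assumptions, not a platform API.
PRO_HOURLY_RATES = {
    "Medium": 0.79,
    "Large": 1.39,
    "X-Large": 1.99,
    "X-Large Plus": 2.15,
    "2X-Large": 3.99,
    "2X-Large Plus": 5.99,
}

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(machine_type: str, instances: int = 1) -> float:
    """Estimated cost of running `instances` machines continuously for a month."""
    return PRO_HOURLY_RATES[machine_type] * HOURS_PER_MONTH * instances

# One always-on X-Large at the Pro rate: 1.99 * 730 = $1,452.70/month
print(f"${monthly_cost('X-Large'):,.2f}")
```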
How you're billed depends on your `minimum_instances` and `maximum_instances` settings:
- **Persistent instances:** If you set `minimum_instances` > 0 in your deployment settings, that number of instances is created and stays active even without requests. Billing for these persistent instances starts shortly after you save the change and continues until you reduce `minimum_instances` (causing the excess instances to scale down). The full duration, including idle time, is billed. This setup avoids cold starts for your baseline capacity, enabling immediate responses, but balance the performance gain against the cost for your specific workload.
- **On-demand instances:** These spin up to handle demand beyond `minimum_instances` (or all demand if `minimum_instances` = 0) and scale down completely when idle after the keep-warm period. You're billed only for their active time during requests: cold start, execution, and keep-warm durations. This model aligns costs directly with demand, making it ideal for variable loads, non-time-sensitive apps, and keeping costs down under sporadic usage. If `maximum_instances` > `minimum_instances`, these on-demand instances provide elastic scaling on top of any persistent baseline (the sketch after this list compares the two models).
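
To compare the two billing models concretely, here is a minimal sketch. It assumes per-second proration of the hourly rate and that every request pays a full cold start and keep-warm window (a worst case for on-demand); the rate, timings, and function names are illustrative assumptions:

```python
# Illustrative comparison of the two billing models described above.
# Per-second proration and these helpers are assumptions for the sketch,
# not a documented API.

HOURLY_RATE = 1.39  # e.g. a Large machine at the Pro rate

def persistent_cost(min_instances: int, hours: float) -> float:
    """Persistent instances bill for the full duration, idle time included."""
    return min_instances * HOURLY_RATE * hours

def on_demand_cost(requests: int, cold_start_s: float,
                   execution_s: float, keep_warm_s: float) -> float:
    """On-demand instances bill only for active time per request:
    cold start + execution + keep-warm (worst case: a full cold start
    and keep-warm window for every request)."""
    billed_s = requests * (cold_start_s + execution_s + keep_warm_s)
    return HOURLY_RATE * billed_s / 3600

# 24 hours of always-on baseline vs. 100 sporadic requests in the same day:
print(f"persistent: ${persistent_cost(1, 24):.2f}")   # $33.36
print(f"on-demand:  ${on_demand_cost(100, cold_start_s=20, execution_s=5, keep_warm_s=60):.2f}")  # $3.28
```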
Note: On your Request page, you may also see Queue Time, the period an instance waits before it starts work. This can happen while GPU resources are being allocated, when concurrency limits have been reached, or while earlier tasks are still running. Queue time is not billed, since the instance has not yet begun active use.
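
For example, a request's total lifecycle splits into a billed and an unbilled portion; the field names below are illustrative, not the actual Request page schema:

```python
# Breakdown of a single request's lifecycle, as described in the note above.
# Field names are illustrative, not the actual Request page schema.
request = {
    "queue_s": 12.0,       # waiting for GPU allocation / a concurrency slot: not billed
    "cold_start_s": 20.0,  # billed
    "execution_s": 5.0,    # billed
    "keep_warm_s": 60.0,   # billed
}

billed_s = request["cold_start_s"] + request["execution_s"] + request["keep_warm_s"]
total_s = request["queue_s"] + billed_s

print(f"total lifecycle: {total_s:.0f}s, billed: {billed_s:.0f}s (queue time excluded)")
```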