Note: If you don’t want to use the web page, you can create a deployment via the API. See the Deployment Endpoints documentation.
Select a Workflow
To deploy as an API, choose either a custom workflow from your My Workflows page (built or modified by you and cloud-saved with all dependencies included) or a community workflow from the Explore page (pre-saved and ready to use). For guidance on creating custom workflows, see the Custom Workflows section.
Configure Hardware
Choose GPU hardware based on your workflow’s VRAM requirements and performance demands. Test in a ComfyUI session beforehand to estimate usage and prevent runtime errors. Options include:
- 16GB: T4 or A4000
- 24GB: A10G or A5000
- 48GB: A6000
- 48GB Plus: L40S or L40
- 80GB: A100
- 80GB Plus: H100
Autoscaling
Configure autoscaling to automatically scale the number of instances (running containerized versions of your workflow) up or down based on the volume of incoming requests. This helps maintain low latency during traffic spikes while optimizing costs for quieter periods—ideal for apps with variable workloads like user-generated AI content creation.
- Minimum instances (0–3): The baseline number of instances that remain active at all times, even with no requests. Setting it to 1 keeps one instance always warm, avoiding cold-start delays (3–5 minutes to boot), but incurs ongoing costs. For infrequent traffic, use 0 to save costs, though users may experience initial waits.
- Maximum instances (1–10): The upper limit on simultaneous instances during peak demand. For example, 3 means no more than three instances will run at once—excess requests queue instead. This caps costs while handling moderate bursts. For heavier loads (>10 concurrent jobs), contact hi@runcomfy.com to discuss custom limits.
- Queue size (1+): The number of pending requests before spinning up a new instance. If set to 1, the arrival of a second request triggers an additional instance to reduce wait times. Ideal for latency-sensitive apps, though may increase costs during short spikes.
- Keep warm (seconds): How long an idle instance stays active after its last job before shutting down. For example, 60 means it lingers for one minute, ready for reuse without a cold start. Slight costs may occur if no new requests arrive.
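As a rough mental model of how the minimum, maximum, and queue-size settings interact, the sketch below estimates how many instances the autoscaler would target for a given number of in-flight requests. This is an illustration of the behavior described above, not RunComfy’s actual scheduling algorithm.

```python
import math

def desired_instances(pending_jobs: int, minimum: int, maximum: int,
                      queue_size: int) -> int:
    """Estimate the autoscaler's target instance count (illustrative).

    A new instance is assumed to be requested whenever pending jobs
    exceed queue_size per running instance; the result is clamped to
    the [minimum, maximum] range.
    """
    if pending_jobs <= 0:
        return minimum  # idle: only the always-warm baseline remains
    wanted = math.ceil(pending_jobs / queue_size)
    return max(minimum, min(maximum, wanted))

# With the defaults (minimum 0, maximum 3 here for illustration, queue size 1):
print(desired_instances(0, 0, 3, 1))   # idle -> scale to zero
print(desired_instances(2, 0, 3, 1))   # a second request adds an instance
print(desired_instances(10, 0, 3, 1))  # burst -> capped at the maximum
```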
Tip: Start with defaults (minimum 0, maximum 1, queue size 1, keep warm 60) for most setups. Then monitor traffic patterns via request and billing data to fine-tune.
Deploy
Review your selections for workflow, hardware, and scaling, then click Deploy. You’ll be taken to the deployment details page, where your deployment_id is displayed—use this in all API calls to the provided endpoints.
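A minimal sketch of how the deployment_id slots into an API call is shown below. The endpoint URL, payload fields, and auth scheme are placeholders invented for this example — consult the Deployment Endpoints documentation for the real request shape.

```python
import json
from urllib.request import Request

def build_inference_request(deployment_id: str, api_token: str,
                            inputs: dict) -> Request:
    """Assemble (but do not send) a POST request for a deployment.

    The URL path and payload structure here are hypothetical; replace
    them with the shapes documented in the Deployment Endpoints docs.
    """
    url = f"https://api.example.com/v1/deployments/{deployment_id}/inference"
    payload = json.dumps({"inputs": inputs}).encode("utf-8")
    return Request(
        url,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request("dep_123", "YOUR_TOKEN", {"prompt": "a red fox"})
print(req.full_url)       # deployment_id is embedded in the URL path
print(req.get_method())   # POST
```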