In a serverless GPU environment, what is a "Cold Start"? How does the size of your model weights (e.g., a 70B model) impact the time it takes for a new instance to start serving traffic?

Question

Accepted Answer

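A cold start occurs when a serverless provider has to provision a new GPU instance and load the model weights before it can answer the first query. The delay grows roughly in proportion to the size of the weights: a 70B model at 16-bit precision is about 140 GB, so downloading it and loading it into GPU memory can take minutes. Ways to mitigate this include storing the weights on a persistent storage volume so they are not re-downloaded on every start, or keeping a "warm" instance always running so requests never wait on a cold one.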
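As a rough illustration of why weight size dominates, the estimate below divides the weight size by the available bandwidth for each stage. It is a back-of-the-envelope sketch, not a benchmark: the function names, the 1 GB/s download rate, the 5 GB/s disk-to-GPU load rate, and the 30 s provisioning time are all assumptions chosen for illustration, and real numbers vary widely by provider and hardware.

```python
def weights_size_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    2 bytes per parameter corresponds to 16-bit weights (FP16/BF16).
    """
    return params_billion * bytes_per_param


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb at a sustained bandwidth, ignoring overheads."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def cold_start_seconds(
    params_billion: float,
    bytes_per_param: float = 2.0,
    download_gb_per_s: float = 1.0,   # assumed network throughput
    load_gb_per_s: float = 5.0,       # assumed disk-to-GPU throughput
    provision_s: float = 30.0,        # assumed time to allocate the GPU
    weights_cached: bool = False,     # True if a persistent volume holds the weights
) -> float:
    """Estimate cold-start time as provisioning + download + load."""
    size = weights_size_gb(params_billion, bytes_per_param)
    download = 0.0 if weights_cached else transfer_seconds(size, download_gb_per_s)
    load = transfer_seconds(size, load_gb_per_s)
    return provision_s + download + load


if __name__ == "__main__":
    for cached in (False, True):
        t = cold_start_seconds(70, weights_cached=cached)
        print(f"70B, weights_cached={cached}: ~{t:.0f} s")
```

Under these assumed rates, skipping the download with a persistent volume removes the largest term, which is why it is a common mitigation. A warm instance removes the whole delay at the cost of paying for idle GPU time.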
