Similar Questions in Deployment & Cost (AI-Ops)
Medium: In a high-concurrency environment, how does PagedAttention prevent the GPU from running out of memory (OOM) when multiple users are chatting simultaneously?
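A toy sketch of the paged KV-cache bookkeeping idea behind PagedAttention (as popularized by vLLM): instead of reserving one max-length slab per conversation, tokens are stored in small fixed-size blocks drawn from a shared pool, so memory grows with actual usage. Block and pool sizes here are hypothetical illustrative numbers, and real implementations manage GPU memory rather than Python lists.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache (not real GPU memory)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size           # tokens stored per block
        self.free = list(range(num_blocks))    # shared free-block pool
        self.tables = {}                       # request id -> list of block ids
        self.lengths = {}                      # request id -> tokens written

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:           # current block full (or none yet)
            if not self.free:
                # Pool exhausted: a real scheduler would preempt or swap a request
                # instead of letting the GPU OOM.
                raise MemoryError("KV-cache pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def finish(self, req: str) -> None:
        # Blocks return to the shared pool for other concurrent users.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("user-a")
print(len(cache.tables["user-a"]))  # 20 tokens occupy 2 blocks, not a max-length slab
```

The key property is that a short chat never pins memory it hasn't used, which is what lets many conversations share one GPU without over-reserving.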
Medium: In a serverless GPU environment, what is a "cold start"? How does the size of your model weights (e.g., a 70B model) affect the time it takes for a new instance to start serving traffic?
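A back-of-the-envelope estimate of why weight size dominates cold starts: before the first request can be served, the full set of weights must be streamed from storage into GPU memory. The bandwidth figure below is an assumed illustrative value, not a measurement of any provider.

```python
# Cold-start weight-loading estimate. All figures are illustrative assumptions.
PARAMS = 70e9            # 70B-parameter model
BYTES_PER_PARAM = 2      # fp16/bf16 weights
LOAD_GBPS = 2.0          # assumed effective storage-to-GPU bandwidth, GB/s

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # total weight size in GB
load_seconds = weights_gb / LOAD_GBPS         # time just to stream weights
print(f"{weights_gb:.0f} GB of weights, ~{load_seconds:.0f} s to load")
```

Container pull, CUDA initialization, and graph compilation add further delay on top of this, which is why large models are often kept warm rather than scaled to zero.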
Medium: When is it more cost-effective to use a pay-per-token API (like OpenAI's) versus hosting your own model on a dedicated cloud instance (like an AWS g5 instance)?
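The comparison usually reduces to a break-even throughput: the instance's fixed hourly cost divided by the API's per-token price. The prices below are hypothetical placeholders; substitute current provider rates before drawing any conclusion.

```python
# Rough break-even sketch: pay-per-token API vs. dedicated GPU instance.
# Both prices are hypothetical illustrative values, not real quotes.
API_COST_PER_1K_TOKENS = 0.002   # assumed $/1k tokens for the API
INSTANCE_COST_PER_HOUR = 1.50    # assumed $/hour for a dedicated instance

breakeven_tokens_per_hour = INSTANCE_COST_PER_HOUR / API_COST_PER_1K_TOKENS * 1000
print(f"Break-even at ~{breakeven_tokens_per_hour:,.0f} tokens/hour")
```

Below that sustained throughput the API is cheaper because you pay only for usage; above it, a well-utilized dedicated instance can win, though ops overhead and idle hours push the real break-even point higher.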