Deployment & Cost (AI-Ops)Medium

If your inference latency is high because the model is too big for one GPU, do you scale horizontally or vertically? What if the latency is high because you have too many concurrent users?

Practice Your Response