Ask HN: Best way to deploy an LLM for 100s of user

If you have any experience of deploying an LLM what resources/tools did you use?

do you have a link to your project?

Anything will help.

13 points | by sujayk_33 546 days ago

3 comments

eschnou 546 days ago
Could you detail what you mean by deploying LLMs ? Is it about integrating commercial LLMs in an enterprise context? Or running self-hosted LLM for a small company (e.g. Ollama + Ollama Web UI)? Or integrating an Agentic approach to existing software stack?
[-]
- sujayk_33 546 days ago
  Self hosting for a small company kind
  [-]
  - eschnou 546 days ago
    For pure self-hosting, I'd look into Ollama (ollama.com) or llamafile (https://github.com/Mozilla-Ocho/llamafile) on the LLM side and then picking a UI such as https://openwebui.com/ or one of the many in this list: https://github.com/ollama/ollama?tab=readme-ov-file#web--des...
    However, the issue you will quickly encounter is resources/costs. For a simple mode like llama3-7b you need at least a g5.2xlarge on AWS. If you want a 'chat gpt equivalent' model, you need something like llama3-70b or command-r-plus etc. These will require at least a g5.48xlarge that will cost you $20 an hour
    An alternate approach is going hybrid: self-hosted UI which takes care of user access, shared documents (RAG), custom prompts etc but that you hook to a LLM provider where you pay per token (could be OpenAI platform or anything from Huggings).
    Let me know if this helps! Also note that I'm lead dev of an open source project addressing these kind of needs: https://opengpa.org - Feel free to jump on our Discord to discuss.
    [-]
    - clark-kent 544 days ago
      If you are already on AWS, then it's better to run llama3 using AWS Bedrook. With Bedrock, you only pay for what you use instead of paying for an always-on EC2 instance.
gardnr 546 days ago
Not enough info.
Do they want near-realtime responses? Will they all hit it at the same time? Can you put some workloads in an overnight batch queue?
K0IN 546 days ago
so for fast responses we usea a rtx4090 with vllm, but yeah it depends on your use case
[-]
- samspenc 545 days ago
  Agreed, this works great for a single or small set of users (2-3), but OP is asking about self-hosting an open-source / open-weight LLM locally to serve a larger set of users (100s per post title). I suspect they may need a larger server farm with a lot of GPUs to handle the load.