Real-world minimum server specs for OpenWebUI
-
Q for those running OpenWebUI on a VPS with Cloudron.
Just wondering what server specs your VPS has. I have tried OpenWebUI twice on my Cloudron VPS, which is a dedicated Hetzner box: 62Gb RAM, 1TB disk, only 30% used. But OpenWebUI runs soooo slooow. Unusable, frankly.
Would love to have self-hosted AI but not currently viable for me, using other AI chat systems for now.
-
@timconsidine said in Real-world minimum server specs for OpenWebUI:
I have tried OpenWebUI twice on my Cloudron VPS which is a dedicated Hetzner box, 62Gb RAM, 1TB disk only 30% used. But OpenWebUI runs soooo slooow. Unusable frankly.
I'm not using it at present but did play with it on my dedicated Hetzner server with similar specs and agree that it was pretty slow. There was quite a bit of variation in the speeds of different models though.
I recently stumbled across this via a post on Mastodon, in which someone tests various smaller models on a Raspberry Pi: https://itsfoss.com/llms-for-raspberry-pi/ It sounds like Qwen2.5 (3b) might be worth a look as something that apparently works quite well and quickly.
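If you want to reproduce that kind of comparison on your own box, a rough timing check with the ollama Python client might look like the sketch below. The model tag is just an example; use whatever small model you've actually pulled.

```python
# Rough timing of a small local model via the ollama Python client (pip install ollama).
# "qwen2.5:3b" is an example tag -- substitute whatever small model you have pulled.
import time
import ollama

prompt = "Explain what a VPS is in two sentences."

start = time.perf_counter()
reply = ollama.chat(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start

print(reply["message"]["content"])
print(f"Response took {elapsed:.1f}s on this hardware")
```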
-
This is kinda very hard to assess. It depends on the expected outcome (quality and speed of the produced answers), on the system's ability to augment with extra (and more up-to-date) sources via RAG, and of course on which pre-trained model and flavor of that is used.
-
@jdaviescoates There has been success using the DeepSeek reasoning engine as a controller for Qwen and other models, which optimizes the prompting via its thought process. This one-two punch makes even lesser models provide better output.
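For anyone curious, a minimal sketch of that "reasoner as controller" idea with the ollama Python client could look like this. The model tags and the two-step flow are assumptions; adjust them to whatever you actually have pulled locally.

```python
# Sketch: use a reasoning model to refine the prompt, then hand it to a smaller model.
# Model tags are assumptions -- use whatever you actually have pulled in ollama.
import ollama

REASONER = "deepseek-r1:7b"   # assumed tag for a local DeepSeek reasoning model
WORKER = "qwen2.5:3b"         # assumed tag for a small Qwen model

def ask(question: str) -> str:
    # Step 1: the reasoning model rewrites the raw question into a sharper prompt.
    refined = ollama.chat(
        model=REASONER,
        messages=[{
            "role": "user",
            "content": "Rewrite this question as a precise, well-structured prompt "
                       f"for a smaller model. Question: {question}",
        }],
    )["message"]["content"]

    # Step 2: the smaller, faster model answers the refined prompt.
    return ollama.chat(
        model=WORKER,
        messages=[{"role": "user", "content": refined}],
    )["message"]["content"]

print(ask("Summarise the trade-offs of running LLMs on CPU-only servers."))
```

The point isn't the exact prompt wording, just that the heavier model runs once per question while the cheaper one does the bulk of the generation.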
-
@timconsidine Are you trying to use locally hosted ollama models, or have you wired up API keys for the public cloud models like ChatGPT or DeepSeek(*) in your OpenWebUI instance?
If you're experiencing unusable slowness for locally hosted models, it might be because OpenWebUI on Cloudron out of the box is running with CPU+RAM only (not GPU+VRAM). Even for tiny models, that's going to be very slow even with very fast CPUs.
I'd be surprised if you're finding OpenWebUI slow with the public cloud models. There will be some latency from the API calls between your Cloudron server and the online model, but I'd expect it to feel nearly as fast as using the online hosted versions directly.
(*) By the way, if you're using DeepSeek online and not self-hosted, please assume every interaction is being read at the other end. There are no privacy controls. And even with ChatGPT and the others, I'd suggest reading the terms and conditions of your API usage carefully and considering which jurisdiction you're sending your data and chats to.
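To put a number on that latency point, here's a rough sketch of timing a single chat completion against an OpenAI-compatible cloud endpoint with the openai Python package. The base URL, model name and environment variable are placeholders for whichever provider you've actually wired up.

```python
# Rough round-trip latency check against an OpenAI-compatible cloud endpoint.
# base_url, model name and env var are placeholders -- use your actual provider's values.
import os
import time
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.deepseek.com/v1",   # assumed provider URL
    api_key=os.environ["DEEPSEEK_API_KEY"],   # assumed env var name
)

start = time.perf_counter()
reply = client.chat.completions.create(
    model="deepseek-chat",                    # assumed model name
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
elapsed = time.perf_counter() - start

print(reply.choices[0].message.content)
print(f"Round trip took {elapsed:.1f}s")
```

Compare that figure with the minutes-per-response you'll typically see from CPU-only local inference.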
-
Sorry, I didn't answer your original question directly...
Real world server specs for OpenWebUI itself are very low. My Cloudron OpenWebUI app instance fits into a few Gb of storage, barely uses any CPU on its own, and runs in well under 1 Gb of RAM.
But if you want to use the embedded ollama system to interact with locally hosted LLMs, your server needs to support the actual LLMs as well as OpenWebUI. So you need all of this:
- Enough disk storage for all the models you want to use.
  - Individual models you can typically run locally for a reasonable cost range from 2-3 Gb (e.g. for a 3B model) up to 40-50Gb (e.g. for a 70B model). You might want to store multiple models.
- Enough RAM (or VRAM) to fully load the model you want to use into memory, separately for each concurrent chat.
  - To roughly calculate, you need the size of the model file plus some room for chat context, depending on how much you want it to know/remember during chats, e.g. 3-6Gb per chat for 3-8B models, more for the bigger ones (a rough calculator sketch follows this list).
- Enough CPU (or GPU) compute power to run the model fast.
  - For tiny (3-8B) models, expect 1-2 minutes per chat response on a typical CPU+RAM system (and don't imagine you can use bigger models at all), or seconds per chat response using GPU+VRAM. (Note: You might do better than that on the very latest CPUs, but GPU+VRAM is still going to be hundreds of times faster.)
- If you're using CPU+RAM (as opposed to GPU+VRAM):
  - You'll find that your disk I/O gets hammered too (particularly during model loading), so you'll want very fast SSDs.
  - Expect your CPU and your RAM to be fully consumed during inference (chats), so don't expect to be running other apps on your server at the same time.
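To make the sizing arithmetic above concrete, here's a tiny back-of-the-envelope calculator. The numbers are the same rough heuristics as in the list, not exact figures; quantisation, context length and runtime overhead all move them.

```python
# Back-of-the-envelope RAM estimate for self-hosting a model, following the
# rough rule above: model file size plus per-chat context headroom, per concurrent chat.
# Ballpark heuristics only, not exact figures.

def estimate_ram_gb(model_file_gb: float, context_headroom_gb: float,
                    concurrent_chats: int = 1) -> float:
    """Approximate RAM (or VRAM) needed to serve `concurrent_chats` at once."""
    return concurrent_chats * (model_file_gb + context_headroom_gb)

# Example: a 3B model (~2.5 Gb file) with ~1.5 Gb of context headroom, two users chatting.
print(estimate_ram_gb(2.5, 1.5, concurrent_chats=2))   # ~8.0 Gb

# Example: a 70B model (~40 Gb file) with ~10 Gb of headroom, single chat.
print(estimate_ram_gb(40.0, 10.0))                     # ~50.0 Gb
```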
In short, I'm not sure that a VPS-hosted OpenWebUI instance running only on CPU+RAM is ever going to be useful for self-hosted LLMs.
Unfortunately, even if you have a GPU on your virtual server and get under the hood to install the GPU drivers on the Ubuntu operating system, Cloudron's OpenWebUI app installation currently won't use it. So on Cloudron you're stuck with CPU+RAM.
But that is not as gloomy as it sounds... To answer your next question...
@timconsidine said in Real-world minimum server specs for OpenWebUI:
Using “out the box” with local default model.
Is there any point to the app to use with publicly hosted model?
Yes, there is a point. Your use cases for handling data privately are certainly more limited, but there are some outstanding advantages to doing this, particularly on Cloudron.
- You're storing your data (including chats and RAG data) on a system you control.
- Although you're still sending your chats and data within them to the public model, you at least control what you can do with the storage of your chats and data.
- You can download, back up, and always access your chats, or move them to a different OpenWebUI server, even if your connection to the public model is severed.
- You can interact with multiple public and private models via a single interface, even within each chat. None of the public platforms let you talk to the others.
  - E.g. OpenWebUI has some pretty cool features to let you split chat threads among different models, and let models "compete" with each other using "arena" chats. We've found this to be invaluable in our business because a lot of optimizing AI usage is about experimentation and finding the best tool for the task at hand.
- You can install and manage your own prompt libraries, system prompts, workspaces (like “GPTs” in ChatGPT), and coded tools and functions (OpenWebUI has some cool integrated Python coding capabilities in this area; a minimal sketch of such a tool follows this list), in a standard way across every LLM that you interact with, and without storing your code and extended data in the public cloud.
- You can brand your chat UI according to your company or client, and modify/integrate it in other ways. OpenWebUI is flexible and open source.
- You can centrally connect to other apps that you self-host for various workloads including data access and agent/workflow automation without needing to upload and manage all that stuff in public systems.
  - E.g. some apps running on Cloudron that can give your AI interactions super powers include:
    - N8N for workflow automation
    - Nextcloud for data storage and management
    - Chat and notification apps
    - BI apps like Baserow and Grafana
- You can manage and segregate branded multi-user access to different chats and different AIs, either in a single OpenWebUI instance, or (since app management on Cloudron is so bloody easy), different instances on different URLs.
- In the future, when you switch to self-hosted LLMs or other integrations, there's little to no migration. You just switch off the public API connectors and redirect them to your own models and tools, because you managed your data, chats, and code integrations locally from the outset.
- And more. I'm sure I didn't think of everything.
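As mentioned above, here's a minimal sketch of the kind of coded tool you can attach to OpenWebUI. The class/method conventions shown are an assumption based on how current OpenWebUI versions register Python tools; check the docs for your release before copying it.

```python
# Minimal sketch of a custom OpenWebUI-style Python tool.
# The Tools-class convention is an assumption about current OpenWebUI versions --
# check the OpenWebUI documentation for the exact format your release expects.
import datetime

class Tools:
    def get_server_time(self) -> str:
        """Return the current server date and time as an ISO 8601 string."""
        return datetime.datetime.now().isoformat(timespec="seconds")

    def word_count(self, text: str) -> int:
        """Count the whitespace-separated words in the given text."""
        return len(text.split())
```

Once registered, the model can call functions like these during a chat, and the same tool works across every LLM you've connected, local or cloud.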
By the way, plenty of these advantages are either because of or enhanced by running on Cloudron. Cloudron is great.
I haven’t tried DeepSeek locally, but it might be worth a shot for privacy. I wouldn’t use it otherwise.
I agree with that decision wholeheartedly. Well, unless you're talking with DeepSeek about stuff that you want the whole world to know and learn from. Then, go nuts.