Old CPU / No GPU / Ollama Language Model?

LoudLemur

gemma3:1b

This works. We gave it a whopping 32 GB of RAM, but you might be able to get it to run with less.

qwen3:4b was too slow and hit the proxy timeout.

In the Ollama terminal you can set some environment variables to help too:

export OLLAMA_KEEP_ALIVE=24h
export OLLAMA_FLASH_ATTENTION=false

OLLAMA_KEEP_ALIVE=24h: keeps the model loaded in RAM (prevents reloading on every request)
OLLAMA_FLASH_ATTENTION=false: more stable on older CPUs
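
If changing the server's environment is awkward, the Ollama API also accepts a per-request keep_alive field, and a chat request with an empty messages list only loads the model without generating anything. A minimal sketch, to be run once the model has been pulled. OLLAMA_URL and OLLAMA_TOKEN are placeholder variables you would set yourself, and preload_payload / preload_model are hypothetical helper names, not part of Ollama:

```shell
# Hypothetical helpers. OLLAMA_URL and OLLAMA_TOKEN are placeholders, not Ollama settings.

# Build a request body that loads the model ($1) and keeps it in RAM for $2 (default 24h).
preload_payload() {
  jq -n --arg model "$1" --arg keep "${2:-24h}" \
    '{model: $model, messages: [], keep_alive: $keep}'
}

# Send the body to /api/chat. With an empty messages list nothing is generated.
preload_model() {
  preload_payload "$@" |
    curl -sS -X POST "${OLLAMA_URL}/api/chat" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${OLLAMA_TOKEN}" \
      -d @-
}

# Usage (commented out so that sourcing this makes no network call):
# OLLAMA_URL="https://YOUR_REAL_OLLAMA_URL"; OLLAMA_TOKEN="YOUR_TOKEN"
# preload_model gemma3:1b 24h
```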

Once Ollama is running on Cloudron and you have its API key, open the Ollama terminal and pull the model:

ollama pull gemma3:1b
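
To confirm the pull worked before testing from outside, something like the following might help. It is only a sketch: model_is_pulled is a hypothetical helper, and it assumes `ollama list` prints one header line followed by rows whose first column is the model tag.

```shell
# Hypothetical helper: succeeds if the tag in $1 appears in the first column of `ollama list`.
model_is_pulled() {
  ollama list | awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Usage, in the Ollama terminal:
# model_is_pulled gemma3:1b && echo "gemma3:1b is available"
# ollama ps   # after a first request: lists loaded models and how much memory they use
```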

Then, using your own URL and your own API token, you can run this from your local machine to get gemma to tell you a joke and see if it is working:

curl -X POST "https://YOUR_REAL_OLLAMA_URL/api/chat" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "model": "gemma3:1b",
    "messages": [{"role": "user", "content": "Hello! Tell me a short joke."}],
    "stream": false,
    "options": {
      "num_ctx": 1024,
      "num_thread": 6
    }
  }' | jq

You will hopefully see a joke in the output and maybe some smilies laughing!
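
On a slow CPU it may be more comfortable to stream the reply, so text shows up as it is generated instead of after the whole answer is done. Whether that also avoids a proxy timeout depends on how the proxy is configured, so treat that as a guess. With "stream": true, /api/chat returns one JSON object per line, each carrying a piece of text in message.content. A minimal sketch, where OLLAMA_URL and OLLAMA_TOKEN are placeholder variables and print_stream / chat_stream are hypothetical helper names:

```shell
# Hypothetical helpers. OLLAMA_URL and OLLAMA_TOKEN are placeholders you set yourself.

# Read streamed /api/chat chunks on stdin and print only their text, joined together.
print_stream() {
  jq --unbuffered -rj '.message.content // empty'
  echo
}

# Ask the model ($1) a question ($2) and print the answer as it arrives.
chat_stream() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, stream: true,
      messages: [{role: "user", content: $prompt}],
      options: {num_ctx: 1024}}' |
    curl -sS -N -X POST "${OLLAMA_URL}/api/chat" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${OLLAMA_TOKEN}" \
      -d @- |
    print_stream
}

# Usage (commented out so that sourcing this makes no network call):
# OLLAMA_URL="https://YOUR_REAL_OLLAMA_URL"; OLLAMA_TOKEN="YOUR_TOKEN"
# chat_stream gemma3:1b "Hello! Tell me a short joke."
```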

joseph

I tried something like this on my 14-year-old CPU. It's still writing out the joke. I also only have 16 GB of RAM to give.

timconsidine

Sadly, there's currently no substitute for RAM or VRAM.
My Apple silicon laptop does an OK-ish job with 24 GB of RAM (unified CPU/GPU memory).
But mostly I just accept defeat and use Ollama Cloud models (or Venice T2EE cloud models).

LoudLemur

@joseph

Hey, I hope it tells you that joke some day, Joseph!

In the meantime, here is one it told me:

Why did the chicken cross the playground?

To get to the other slide!

It is pretty "low-VRAM" humour!

robi

There's a gemma3 270m instruct model that is fast, but it's also not very smart.

LoudLemur

@robi Can it tell jokes?

I had to find out!
You be the judge:

Tell us a quick joke about a chicken

Why don't chicken birds fly?
Because they are too small.

robi

@LoudLemur Of course it can, and at 200 tok/sec no less.

It just makes a lot of mistakes. It had trouble with tool calling and web access.

LoudLemur

@robi said:

200 tok/sec no less

nice!
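
For anyone who wants to measure this on their own hardware: a non-streaming Ollama response includes eval_count (tokens generated) and eval_duration (in nanoseconds), so a rough figure can be worked out from those. A small sketch, where tokens_per_second is a hypothetical helper name:

```shell
# Hypothetical helper: read one Ollama response on stdin and print generated tokens per second.
# eval_duration is reported in nanoseconds, hence the 1e9 factor. The result is rounded down.
tokens_per_second() {
  jq -r 'select(.eval_duration > 0) | .eval_count * 1e9 / .eval_duration | floor'
}

# Usage: replace the final "| jq" of a non-streaming curl request with
#   | tokens_per_second
```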

robi

Yes, compare it to LFM2.5-270M and 350M, which are being geared toward on-phone inference. The prompts you are used to need to change, though, and be much more explicit so the models don't feel so dumb.
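
As an illustration of "much more explicit", one option is to spell out the exact output format in a system message. This is only a sketch: explicit_payload is a hypothetical helper, the system prompt wording is made up for the example, and the gemma3:270m tag in the usage line is an assumption about what is available to you.

```shell
# Hypothetical helper: build a /api/chat body for model $1 and task $2
# with a system message that states the expected output format explicitly.
explicit_payload() {
  jq -n --arg model "$1" --arg task "$2" \
    '{model: $model, stream: false,
      messages: [
        {role: "system",
         content: "Reply with exactly one joke: one question line, then one answer line. No preamble and no explanation."},
        {role: "user", content: $task}
      ]}'
}

# Usage (model tag is an assumption; URL and token are placeholders):
# explicit_payload gemma3:270m "Tell a joke about a chicken." |
#   curl -sS -X POST "https://YOUR_REAL_OLLAMA_URL/api/chat" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer YOUR_TOKEN" -d @- | jq -r '.message.content'
```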

LoudLemur

@joseph Thanks for this story. We asked a smallish model (Qwen 9B) running on a lot of VRAM to tell us a joke.

Qwen didn't tell us a joke, it just started thinking about which chicken joke to tell us.

We looked at its thinking and it had created and considered over 200 chicken jokes before we decided the best thing to do was ... not wait for it!

We wish we had left it running to find out which joke it would have eventually chosen for us!
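
For thinking models, recent Ollama versions accept a "think" field in the request, and the num_predict option caps how many tokens are generated. Both might help with runaway deliberation like this, though whether "think": false is honoured depends on the Ollama version and the model, so this is a sketch under those assumptions. bounded_payload is a hypothetical helper name:

```shell
# Hypothetical helper: build a /api/chat body for model $1 and prompt $2 that
# asks for no thinking and caps the reply at $3 tokens (default 128).
bounded_payload() {
  jq -n --arg model "$1" --arg prompt "$2" --argjson max "${3:-128}" \
    '{model: $model, stream: false, think: false,
      messages: [{role: "user", content: $prompt}],
      options: {num_predict: $max}}'
}

# Usage: pipe the output into the same curl call as before, with -d @-
# bounded_payload YOUR_MODEL_TAG "Tell us a quick joke about a chicken" 64
```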

timconsidine

Why did the AI engine search for a chicken joke?

Because it was looking for poultry in motion!

LoudLemur

@timconsidine

LoudLemur

@andreasdueren Thank you for Hermes! It is a great choice for us and it also tells a funny chicken joke!

hermes-4.3-36b

Sure! Here's a clucktastic one:

Why did the chicken join a band?

To learn how to make some "eggcellent" beats!

(If you want more, just say the word!)

Cloudron Forum
