Serge - LLaMa made easy 🦙 - self-hosted AI Chat

App Wishlist · 18 Posts · 7 Posters · 7.5k Views

marcusquinn
    • https://github.com/nsarrazin/serge

A chat interface based on llama.cpp for running Alpaca models. Entirely self-hosted, no API keys needed. Fits in 4GB of RAM and runs on the CPU.

• SvelteKit frontend
• MongoDB for storing chat history & parameters
• FastAPI + beanie for the API, wrapping calls to llama.cpp
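To make that architecture concrete, here is a minimal sketch of the pattern: a FastAPI endpoint that shells out to the llama.cpp CLI. This is not Serge's actual code; the binary and model paths are assumptions, and Serge adds streaming, session handling, and MongoDB persistence on top.

```python
# Minimal sketch of the FastAPI-wrapping-llama.cpp pattern (not Serge's code).
# Assumes a compiled llama.cpp binary at ./main and a local GGML model file.
import subprocess

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    text: str
    n_predict: int = 128  # max tokens to generate

@app.post("/chat")
def chat(prompt: Prompt) -> dict:
    # Shell out to llama.cpp; Serge wraps this more carefully
    # (streaming output, chat history stored in MongoDB, etc.).
    result = subprocess.run(
        ["./main", "-m", "models/ggml-alpaca-7b-q4.bin",
         "-p", prompt.text, "-n", str(prompt.n_predict)],
        capture_output=True, text=True, timeout=300,
    )
    return {"answer": result.stdout}
```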


LoudLemur
#3

@marcusquinn Have you been able to get the ggml-alpaca-30B-q4_0 model to work? For me the best it can do is show an egg timer or give me a "loading" message.

    @JOduMonT I think you might like this project.


marcusquinn
#4

@LoudLemur Not tried yet. Just spotted it on my Reddit travels, skimming past a post on it in the self-hosting subreddit.

      Web Design https://www.evergreen.je
      Development https://brandlight.org
      Life https://marcusquinn.com


LoudLemur
#5

        @marcusquinn said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

@LoudLemur Not tried yet. Just spotted it on my Reddit travels, skimming past a post on it in the self-hosting subreddit.

        It was a RAM issue. If you don't have a lot of RAM, it is best not to multi-task with it.
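Since running out of memory just stalls generation, it is worth checking headroom before loading a model. A quick sketch (psutil is a third-party package, pip install psutil):

```python
# Check available RAM before loading a multi-GB model.
import psutil

free_gb = psutil.virtual_memory().available / 1024**3
print(f"{free_gb:.1f} GB of RAM available")
# Rough guide: a 30B model at 4-bit quantization wants around 20GB;
# a 7B model fits in about 4-5GB.
```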



LoudLemur
#6

          @marcusquinn There is now a medical Alpaca too:

          https://teddit.net/r/LocalLLaMA/comments/12c4hyx/introducing_medalpaca_language_models_for_medical/

We could ask it health-related questions, or perhaps it could ask us questions...

lars1134
#7

            Is there any interest in getting this in the app store?


LoudLemur
#8

              @lars1134 said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

              Is there any interest in getting this in the app store?

              There is also this, but I couldn't see the code for the GUI:
              https://lmstudio.ai/


lars1134
#9

@LoudLemur

It looks like that only runs locally, but I like the program.

I liked Serge as I have quite a bit of RAM left unused on one of my servers.

timconsidine
App Dev
#10

This might be a good one for @Kubernetes?


Kubernetes
App Dev
#11

@timconsidine I already played with it on my local machines. I still find the quality lacking for languages other than English. It is also very RAM- and disk-hungry, so I think it is better to run it on a dedicated server instead of sharing resources with other applications.

humptydumpty
#12

I'm on Hugging Face and the library is huge! My laptop has 32GB of RAM and an empty 500GB secondary SSD. Which model would be a good GPT-3.5 alternative?


LoudLemur
#13

                        @humptydumpty said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

I'm on Hugging Face and the library is huge! My laptop has 32GB of RAM and an empty 500GB secondary SSD. Which model would be a good GPT-3.5 alternative?

This model is based on the recently released Llama 2, is about as uncensored as they come, and works well, even better with longer prompts:

                        https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b
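For use with llama.cpp-based tools like Serge you would grab a GGML conversion rather than the raw FP16 weights. A hedged sketch with huggingface_hub; the repo and filename below are assumptions, so check Hugging Face for the current quantized builds:

```python
# Fetch a single quantized GGML file for CPU inference.
# Repo and filename are assumptions; browse Hugging Face for current builds.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Nous-Hermes-Llama2-GGML",          # assumed GGML conversion
    filename="nous-hermes-llama2-13b.ggmlv3.q4_0.bin",   # assumed quantization
)
print("Model saved to:", path)
```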

lars1134
#14

@Kubernetes Thank you for looking into it. I understand the limitations. I run a few servers with more than 10 EPYC cores and 50GB of RAM left unused; those servers have far more available resources than our local clients. But I understand that not many are in a similar situation.


robi
#15

@lars1134 The tests I've run in a LAMP app show these can run in less than 5GB of RAM (depending on the model) with the right combination of software, all CPU-only.
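For anyone wanting to reproduce that kind of footprint, here is a minimal CPU-only sketch using llama-cpp-python (pip install llama-cpp-python); the model path is an assumption, and a 7B model at q4_0 lands in roughly the 4-5GB range mentioned above:

```python
# CPU-only inference with a 4-bit quantized GGML model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/ggml-alpaca-7b-q4.bin",  # assumed local model file
    n_ctx=512,     # small context window keeps memory usage down
    n_threads=4,   # match your available CPU cores
)
out = llm("Q: What does Serge do? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```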

                            Conscious tech


LoudLemur
#16

                              @robi said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

                              using the right combo of sw all on CPU only.

The GGML versions of the models are designed to do the work on the CPU and RAM and, if a GPU is available, to offload some of it there too. Q6 versions produce better output quality than e.g. Q4, but are a bit slower.
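In llama-cpp-python that CPU/GPU split is a single knob; a sketch (the model path is assumed, and n_gpu_layers only has an effect when the library is built with GPU support):

```python
# Partial GPU offload: n_gpu_layers moves that many transformer layers
# to the GPU; the rest stay in system RAM on the CPU. 0 = pure CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin",  # assumed path
    n_gpu_layers=20,  # raise until VRAM is full, or set to 0 for CPU-only
)
```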

LoudLemur
#17

From Claude AI:

"Here is a quick comparison of FP16, GPTQ and GGML model versions:

FP16 (Half precision float 16):
• Uses 16-bit floats instead of 32-bit floats to represent weights and activations in a neural network.
• Reduces model size and memory usage by half compared to FP32 models.
• May lower model accuracy slightly compared to FP32, but often accuracy is very close.
• Supported on most modern GPUs and TPUs for efficient training and inference.

GPTQ (Quantization aware training):
• Quantizes weights and/or activations to low bitwidths like 8-bit during training.
• Further compresses model size over FP16.
• Accuracy is often very close to FP32 model.
• Requires quantization aware training techniques.
• Currently primarily supported on TPUs.

GGML (Mixture of experts):
• Partitions a single large model into smaller expert models.
• Reduces compute requirements for inference since only one expert is used per sample.
• Can maintain accuracy of original large model.
• Increases model size due to overhead of gating model and expert models.
• Requires changes to model architecture and training procedure.

In summary, FP16 is a straightforward way to reduce model size with minimal accuracy loss. GPTQ can further compress models through quantization-aware training. GGML can provide inference speedups through model parallelism while maintaining accuracy. The best choice depends on hardware constraints, accuracy requirements and inference latency needs."
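A caveat on that summary: GPTQ is actually post-training quantization (no retraining required), and GGML is the tensor library and file format llama.cpp uses for CPU inference, not a mixture-of-experts scheme. The size trade-off is easy to ballpark, though; a quick sketch (the bits-per-weight figures are approximations):

```python
# Rough size estimates for a model by precision. Quantized formats carry
# per-block scale overhead, so q4_0 is ~4.5 bits per weight, not 4.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for name, bits in [("fp16", 16.0), ("q6_K", 6.6), ("q4_0", 4.5)]:
    print(f"13B @ {name}: ~{model_size_gb(13, bits):.1f} GB")
```

For a 13B model that works out to roughly 24GB at FP16 versus about 7GB at q4_0, which is why the quantized GGML builds are the ones that fit on modest servers.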


lars1134
#18

@robi said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

@lars1134 The tests I've run in a LAMP app show these can run in less than 5GB of RAM (depending on the model) with the right combination of software, all CPU-only.

Awesome, I will definitely look into that today and tomorrow. Thank you for the inspiration 😄. I will let you know how it goes for me.
