Cloudron Forum

App Wishlist

Serge - LLaMa made easy 🦙 - self-hosted AI Chat

LoudLemur #5

    @marcusquinn said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

@LoudLemur Not tried yet. Just spotted it on my Reddit travels, skimming past a post on it in the self-hosting subreddit.

    It was a RAM issue. If you don't have a lot of RAM, it is best not to multi-task with it.

marcusquinn:

https://github.com/nsarrazin/serge

A chat interface based on llama.cpp for running Alpaca models. Entirely self-hosted, no API keys needed. Fits in 4GB of RAM and runs on the CPU.

SvelteKit frontend
MongoDB for storing chat history & parameters
FastAPI + beanie for the API, wrapping calls to llama.cpp

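To make that stack more concrete, here is a minimal sketch of the "FastAPI wrapping llama.cpp" pattern described above, using the llama-cpp-python bindings. This is not Serge's actual code; the /chat endpoint, model file name, and generation parameters are illustrative assumptions.

```python
# Minimal sketch of an HTTP API wrapping llama.cpp, in the spirit of Serge's
# FastAPI layer. NOT Serge's real code: the /chat endpoint, model file name
# and generation parameters are assumptions for illustration only.
from fastapi import FastAPI
from llama_cpp import Llama  # pip install llama-cpp-python fastapi uvicorn

app = FastAPI()

# Load a quantized Alpaca/LLaMA model once at startup; CPU-only by default.
llm = Llama(model_path="weights/alpaca-7b-q4_0.gguf", n_ctx=2048)

@app.post("/chat")
def chat(prompt: str, max_tokens: int = 256):
    # Run inference on the CPU and return the generated continuation.
    result = llm(prompt, max_tokens=max_tokens, stop=["###"])
    return {"reply": result["choices"][0]["text"]}
```

Run with `uvicorn main:app` (assuming the file is saved as main.py); Serge layers MongoDB-backed chat history and a SvelteKit UI on top of this kind of API.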

LoudLemur #6

      @marcusquinn There is now a medical Alpaca too:

      https://teddit.net/r/LocalLLaMA/comments/12c4hyx/introducing_medalpaca_language_models_for_medical/

We could ask it health-related questions, or perhaps it could ask us questions...

lars1134 #7

        Is there any interest in getting this in the app store?

LoudLemur #8

          @lars1134 said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

          Is there any interest in getting this in the app store?

          There is also this, but I couldn't see the code for the GUI:
          https://lmstudio.ai/

lars1134 #9

            @LoudLemur

It looks like that one only runs locally, but I like the program.

I like Serge, as I have quite a bit of RAM left unused on one of my servers.

timconsidine (App Dev) #10

This might be a good one for @Kubernetes?

Kubernetes (App Dev) #11

@timconsidine I have already played with it on my local machines. The quality for languages other than English is still lacking, and it consumes a lot of RAM and disk. So I think it is better to run it on a dedicated machine rather than in shared mode alongside other applications.

humptydumpty #12

I'm on Hugging Face and the library is huge! My laptop has 32GB of RAM and an empty 500GB secondary SSD. Which model would be a good GPT-3.5 alternative?

LoudLemur #13

                    @humptydumpty said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

I'm on Hugging Face and the library is huge! My laptop has 32GB of RAM and an empty 500GB secondary SSD. Which model would be a good GPT-3.5 alternative?

This model is based on the recently released Llama 2, is fairly uncensored, and works well, especially with longer prompts:

                    https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b
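If you want to try the model linked above, a minimal sketch using the huggingface_hub client looks like this. Note the full-precision repo is tens of gigabytes; for CPU use with Serge or llama.cpp you would normally grab a third-party quantized GGML/GGUF conversion instead.

```python
# Hedged sketch: fetch the full Nous-Hermes-Llama2-13b repo from Hugging Face.
# The FP16 weights are tens of GB; quantized GGML/GGUF conversions are much
# smaller and are what CPU-only tools like Serge/llama.cpp actually consume.
from huggingface_hub import snapshot_download  # pip install huggingface_hub

local_dir = snapshot_download(repo_id="NousResearch/Nous-Hermes-Llama2-13b")
print("Model files downloaded to:", local_dir)
```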

lars1134 #14

@Kubernetes Thank you for looking into it. I understand the limitations. I run a few servers with more than 10 EPYC cores and 50GB of RAM left unused; those servers have far more available resources than our local clients. But I understand that not many people are in a similar situation.

robi #15

@lars1134 In the tests I've run in a LAMP app, these run in less than 5GB of RAM (depending on the model) using the right combination of software, all on CPU only.


LoudLemur #16

                          @robi said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

using the right combination of software, all on CPU only.

The GGML versions of the models are designed to run on the CPU and system RAM and, if a GPU is available, can offload part of the work to it as well. Q6 quantizations retain more of the original model's quality than e.g. Q4, but they are larger and a bit slower.
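As a rough illustration of that CPU/GPU split, here is a hedged sketch using llama-cpp-python; the model filename and parameter values are examples only, and n_gpu_layers has an effect only if the library was built with GPU support.

```python
# Sketch: loading a quantized GGML/GGUF model with llama-cpp-python.
# Filename and parameter values are examples, not a tested recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="nous-hermes-llama2-13b.Q6_K.gguf",  # Q6 keeps more quality than Q4, but is larger/slower
    n_ctx=4096,       # context window
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=20,  # 0 = pure CPU; >0 offloads that many layers to a GPU if available
)

out = llm("Explain the Q4 vs Q6 trade-off in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```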

LoudLemur #17

                            From Claude AI:

                            "
                            Here is a quick comparison of FP16, GPTQ and GGML model versions:

                            FP16 (Half precision float 16):

                            Uses 16-bit floats instead of 32-bit floats to represent weights and activations in a neural network.
                            Reduces model size and memory usage by half compared to FP32 models.
                            May lower model accuracy slightly compared to FP32, but often accuracy is very close.
                            Supported on most modern GPUs and TPUs for efficient training and inference.

                            GPTQ (Quantization aware training):

                            Quantizes weights and/or activations to low bitwidths like 8-bit during training.
                            Further compresses model size over FP16.
                            Accuracy is often very close to FP32 model.
                            Requires quantization aware training techniques.
                            Currently primarily supported on TPUs.

                            GGML (Mixture of experts):

                            Partitions a single large model into smaller expert models.
                            Reduces compute requirements for inference since only one expert is used per sample.
                            Can maintain accuracy of original large model.
                            Increases model size due to overhead of gating model and expert models.
                            Requires changes to model architecture and training procedure.
                            In summary, FP16 is a straightforward way to reduce model size with minimal accuracy loss. GPTQ can further compress models through quantization-aware training. GGML can provide inference speedups through model parallelism while maintaining accuracy. The best choice depends on hardware constraints, accuracy requirements and inference latency needs."
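To make the size differences concrete, here is a rough back-of-the-envelope sketch (my own estimates, not figures from the thread) of the weight memory for a 13-billion-parameter model at different precisions:

```python
# Rough weight-memory estimates for a 13B-parameter model at various precisions.
# Real files differ a little (vocabulary, quantization block overhead); treat
# these as ballpark numbers only.
params = 13e9
for name, bits_per_weight in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("Q6 (~6.6 bpw)", 6.6), ("Q4 (~4.5 bpw)", 4.5)]:
    gib = params * bits_per_weight / 8 / 2**30
    print(f"{name:>14}: ~{gib:5.1f} GiB")
```

The Q4-level result (roughly 7 GiB for 13B, and under 4 GiB for a 7B model) is consistent with the "fits in a few GB of RAM" claims earlier in the thread.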

lars1134 #18

                              @robi said in Serge - LLaMa made easy 🦙 - self-hosted AI Chat:

@lars1134 In the tests I've run in a LAMP app, these run in less than 5GB of RAM (depending on the model) using the right combination of software, all on CPU only.

Awesome, I will definitely look into that today and tomorrow. Thank you for the inspiration 😄. I will let you know how it goes for me.
