Cloudron Forum

WordPress (Developer)
Is there a way to rate limit connections to a site for certain user agent strings?

9 Posts 6 Posters 122 Views
  • d19dotca
    #1

    Hello,

    I have a particular website that for the last 2+ days has been reaching max memory and restarting frequently, a dozen times a day. I've tried increasing the memory, which has helped of course, but that's only a temporary workaround. The issue started when (according to the logs) the site started receiving an onslaught of traffic from Facebook crawler bots, specifically Meta-ExternalAgent/1.1.

    What I'd like to do is try to rate limit (within Cloudron if possible) the requests from certain user agents, to maybe 10 a minute for example instead of several a second (which is currently what I'm seeing). If this is possible, I'd love to know.

    I may be able to use a plugin in WordPress to do that, but my thinking is that those requests would still occupy Apache connections, which can still saturate the server. In fact, I tried to do this in .htaccess using something ChatGPT recommended, but that only slows down the data rate and doesn't really slow down the indexing from Facebook / Meta, so I suspect it will actually increase connection saturation if each request takes longer to respond to.

    # BEGIN Meta-ExternalHit Throttling
    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Detect the Meta-ExternalHit user agent (case-insensitive)
        RewriteCond %{HTTP_USER_AGENT} "Meta-ExternalHit" [NC]
        # mod_ratelimit reads the lowercase "rate-limit" env var (KiB/s)
        RewriteRule ^ - [E=rate-limit:50]
    </IfModule>
    
    <IfModule mod_ratelimit.c>
        <IfModule mod_filter.c>
            # The filter only throttles requests where "rate-limit" is set
            AddOutputFilterByType RATE_LIMIT text/html text/plain text/xml application/json application/xml image/jpeg image/png image/webp image/avif
        </IfModule>
    </IfModule>
    # END Meta-ExternalHit Throttling
    

    This is also leading to health checks taking over 7000 ms, which I see in the logs.
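
    [Editor's note: since bandwidth throttling keeps each connection open longer, an alternative sketch is to refuse matching requests outright in the same .htaccess — an early 403 ties up an Apache worker for milliseconds instead of the full (slowed) response. The user-agent tokens below are the ones mentioned in this thread; this is an untested sketch, not a confirmed Cloudron-supported approach.]

    # BEGIN Meta crawler blocking (sketch: deny instead of throttle)
    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Match either Meta crawler token, case-insensitively
        RewriteCond %{HTTP_USER_AGENT} "Meta-ExternalAgent|Meta-ExternalHit" [NC]
        # [F] answers 403 Forbidden before WordPress/PHP ever runs
        RewriteRule ^ - [F]
    </IfModule>
    # END Meta crawler blocking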

    Thank you in advance for any advice.

    --
    Dustin Dauncey
    www.d19.ca

  • necrevistonnezr
    #2

      And of course, Meta ignores your robots.txt? It's such a sh*t company…

      Just FYI: If you were in the EU and/or Germany:

      • You can legally prevent AI companies from using your website's data. The legal basis is the right to opt-out of Text and Data Mining (TDM) under Art. 4 of the EU Copyright Directive.

      • Germany has implemented the directive in Art. 44b UrhG: „Uses in accordance with subsection (2) sentence 1 are permitted only if they have not been reserved by the rightholder. A reservation of use in the case of works which are available online is effective only if it is made in a machine-readable format.“

      • Your objection must be machine-readable. A simple text disclaimer on your site (e.g., in the legal notice) is legally insufficient.

      • The standard method is to use the above-mentioned robots.txt.

      • A comprehensive, community-maintained list can be found at projects like ai.robots.txt on GitHub.

      • While major companies respect robots.txt, compliance is not guaranteed from all crawlers, but it is the recognized legal and technical standard for opting out.
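
      [Editor's note: as a concrete sketch, a robots.txt opt-out for Meta's crawlers could look like the following. Meta-ExternalAgent is the token from the logs in this thread; FacebookBot is another token Meta documents — verify the current list against Meta's own crawler documentation before relying on it.]

      User-agent: Meta-ExternalAgent
      Disallow: /

      User-agent: FacebookBot
      Disallow: /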

  • luckow (translator)
    #3

        @d19dotca Install your own WAF. We have been testing https://www.bunkerweb.io/ for almost a month. And it works.

        Pronouns: he/him | Primary language: German

  • robi
    #4

    Another thought is to inspect their robots.txt for any directives for their bots, which you may adapt for your needs.

          Conscious tech

  • necrevistonnezr
    #5

            @luckow said in Is there a way to rate limit connections to a site for certain user agent strings?:

            @d19dotca Install your own WAF. We have been testing https://www.bunkerweb.io/ for almost a month. And it works.

            Would that be interesting as a Cloudron service?

  • jdaviescoates
    #6

              @luckow said in Is there a way to rate limit connections to a site for certain user agent strings?:

              @d19dotca Install your own WAF. We have been testing https://www.bunkerweb.io/ for almost a month. And it works.

              Sounds good. How?

              I use Cloudron with Gandi & Hetzner

  • luckow (translator)
    #7

                @jdaviescoates
                The good old traditional method: https://docs.bunkerweb.io/latest/integrations/#linux
                Runs on a CX22 on https://www.hetzner.com/cloud/.

    BunkerWeb acts as a reverse proxy for a Cloudron app that sits "behind" it. Currently, we only use it in front of our own website, mainly because we are still learning — e.g. what happens when we block bots? Oh, there is no longer support for link previews in Rocket.Chat. In my next spare moment, I'll try out what happens when a complete Cloudron instance is behind BunkerWeb. It should work. From what I've heard, this is the case with Cloudflare, and BunkerWeb is similar (only self-hosted) 🙂

                Pronouns: he/him | Primary language: German

  • andreasdueren
    #8

    @d19dotca Think what you want of Cloudflare, but their caching is pretty good, plus they also hate AI bots and have specific options to block them: https://developers.cloudflare.com/ai-crawl-control/

  • d19dotca
    #9

                    Thank you all for the suggestions! Good ideas!

    It turns out that overnight, after my message to the forum, the bot traffic finally went back to normal levels, and the app has been stable ever since. But this definitely reminded me that getting a good WAF (or improving the robots.txt at a minimum) can be important and needs to be evaluated.

    Hopefully Cloudron can integrate a simple WAF into the system directly in the future (maybe even using BunkerWeb if possible). 🤞

                    --
                    Dustin Dauncey
                    www.d19.ca
