


Cloudron Forum


How to dedupe lines in a file efficiently

Discuss · logs
10 Posts 3 Posters 1.6k Views
    robi
    wrote on last edited by girish
    #1
    awk '!seen[$0]++' file.txt
    

    seen is an associative array; Awk (including gawk) evaluates the expression for every line of the file, using the whole line ($0) as the key.
    If a line isn't yet in the array, seen[$0] evaluates to false (unset entries default to 0).
    The ! is the logical NOT operator and inverts that false to true; Awk prints every line for which the expression is true. The post-increment ++ then bumps the counter, so seen[$0] == 1 after a line's first occurrence, 2 after its second, and so on.
    Awk treats everything except 0 and "" (the empty string) as true, so for a duplicate line seen[$0] is already non-zero, !seen[$0] evaluates to false, and the line is not written to the output.
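A quick illustration of the behavior described above:

```shell
# The one-liner keeps the first occurrence of each line, preserving order.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# Prints:
# a
# b
# c
```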

    @staff Can we use this for deduping lines in logs? Or shall I ask for a feature to handle printing duplicate lines differently?
    We'd have to find a way to ignore the timestamps.
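One way to ignore them, assuming each line starts with a single space-delimited timestamp token (a sketch, not a tested Cloudron feature):

```shell
# Dedupe on everything after the first space-delimited field (the
# timestamp), but print the original line with its timestamp intact.
printf '2024-01-01T00:00:01 disk full\n2024-01-01T00:00:02 disk full\n' \
  | awk '{ key = $0; sub(/^[^ ]+ /, "", key) } !seen[key]++'
# Prints only the first line.
```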

    Conscious tech

      girish
      Staff
      wrote on last edited by
      #2

      @robi Oh, is this for "collapsing" similar lines in the log view? Like how the browser console does it?

        robi
        wrote on last edited by
        #3

        @girish no, not in the view, but on disk. Some apps generate tens of MB of logs for not a lot of good reasons, which makes troubleshooting take longer and submitting relevant logs problematic.

        Conscious tech

          girish
          Staff
          wrote on last edited by
          #4

          @robi Ah, I see. I guess one can write a logrotate script with the awk code to remove duplicate lines or something.
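A sketch of what such a pass could look like; dedupe_log is an illustrative name, not an existing Cloudron or logrotate feature, and a temp file keeps the rewrite atomic:

```shell
# Hypothetical dedupe pass that a logrotate postrotate block could call.
dedupe_log() {
  log="$1"
  tmp="$(mktemp)"
  # Keep the first occurrence of every line, then replace the log atomically.
  awk '!seen[$0]++' "$log" > "$tmp" && mv "$tmp" "$log"
}

# Demo on a throwaway file:
demo="$(mktemp)"
printf 'up\nup\ndown\nup\n' > "$demo"
dedupe_log "$demo"
cat "$demo"
# Prints:
# up
# down
```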

            robi
            wrote on last edited by
            #5

            @girish that is the question: should we either

            1. post-process the logs after they're written, or
            2. pre-process the log stream before it's written?

            Which feature request makes more sense for Cloudron long term?

            Conscious tech

              girish
              Staff
              wrote on last edited by
              #6

              @robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the apache log output for 200 responses). Currently, there is also no way to store that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter to emit logs better.
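For what it's worth, a run-length pass could record the repetition instead of discarding it, similar to syslog's "last message repeated N times" (a sketch, not a Cloudron feature):

```shell
# Collapse consecutive duplicate lines into "count line" pairs.
printf 'health ok\nhealth ok\nhealth ok\ndisk low\n' |
  awk 'NR == 1   { prev = $0; n = 1; next }
       $0 == prev { n++; next }
                  { print n, prev; prev = $0; n = 1 }
       END        { if (NR) print n, prev }'
# Prints:
# 3 health ok
# 1 disk low
```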

                robi
                wrote on last edited by
                #7

                @girish box to start with, then other common, easily repeatable ones like apache.

                Conscious tech

                  girish
                  Staff
                  wrote on last edited by
                  #8

                  @robi Are you referring to the box:apphealthmonitor app health lines? Those lines are needed (for us) to know the apps responded at that instant.

                    ochoseis
                    wrote on last edited by
                    #9

                    Don’t the log lines have time stamps that would make each unique? If not, cat log.txt | sort | uniq is another handy one-liner. If so, you could throw cut in to split the line.
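A sketch of that combination, assuming the timestamp is the first space-delimited field; note it drops the timestamps and the original line order, unlike the awk one-liner above:

```shell
# Strip the first field, then sort and dedupe what's left.
printf '10:00 up\n10:01 up\n10:02 down\n' | cut -d' ' -f2- | sort | uniq
# Prints:
# down
# up
```

(`sort -u` would do the same as `sort | uniq` here.)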

                      robi
                      wrote on last edited by
                      #10

                      @girish I am referring to any that don't need to be written to disk.

                      You can use the lines for what you need in memory.

                      Conscious tech
