    Cloudron Forum

    How to dedupe lines in a file efficiently

    • robi
      robi last edited by girish

      awk '!seen[$0]++' file.txt
      

      seen is an associative array (supported by any POSIX awk, gawk included) that is indexed by the full text of each input line, $0.
      The first time a line appears, seen[$0] has no entry yet, so it evaluates to 0, which awk treats as false.
      The ! is the logical NOT operator and inverts that false to true. Awk prints every line for which the expression is true, because printing is the default action when a pattern has no action block. The ++ is a post-increment: it bumps the counter after the old value has been read, so seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
      Awk treats everything except 0 and "" (the empty string) as true. When a duplicate line comes in, seen[$0] is already nonzero, so !seen[$0] evaluates to false and the line is not written to the output.
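
A small worked example of the one-liner, as a sketch: the function name dedupe_lines and the sample lines are made up for illustration. It shows that only the first occurrence of each line is printed and that the original order is kept.

```shell
# Hypothetical wrapper around the one-liner so it can be reused.
# With no file arguments, awk reads standard input.
dedupe_lines() {
    # seen[$0]++ yields 0 (false) the first time a line appears, so
    # !seen[$0]++ is true and awk's default action prints the line.
    # Later occurrences yield a nonzero count and are skipped.
    awk '!seen[$0]++' "$@"
}

# Made-up sample input with two repeated lines.
printf '%s\n' \
    'app started' \
    'health check ok' \
    'health check ok' \
    'app started' \
    'request failed' | dedupe_lines
# prints:
# app started
# health check ok
# request failed
```

Note that the array holds every distinct line in memory, so memory use grows with the number of unique lines, not with the file size.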

      @staff Can we use this for deduping lines in logs? Or shall I ask for a feature to handle printing duplicate lines differently?
      We'd have to find a way to ignore the timestamps.
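
One possible way to ignore the timestamps, as a sketch only: it assumes every log line starts with a single timestamp field followed by a space, which the thread does not confirm. The function name and the sample lines are hypothetical. The idea is to build the dedupe key from the line with its first field removed, while still printing the full original line.

```shell
# Assumption: the timestamp is the first space-delimited field of each line.
dedupe_ignoring_timestamp() {
    awk '{
        key = $0
        # Strip the first field and the spaces after it from the key only;
        # $0 itself is left untouched.
        sub(/^[^ ]+ +/, "", key)
        # Print the full line (timestamp included) the first time its key is seen.
        if (!seen[key]++) print
    }' "$@"
}

# Made-up sample lines in an assumed "timestamp message" layout.
printf '%s\n' \
    '2021-03-01T10:00:00Z health check ok' \
    '2021-03-01T10:00:10Z health check ok' \
    '2021-03-01T10:00:20Z request failed' | dedupe_ignoring_timestamp
# prints:
# 2021-03-01T10:00:00Z health check ok
# 2021-03-01T10:00:20Z request failed
```

Only the timestamp of the first occurrence survives, so the information about when the repeats happened is lost.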

      Life of Advanced Technology

      • girish
        girish Staff @robi last edited by

        @robi Oh, is this for "collapsing" similar lines in the log view? Like how the browser console does it?

        • robi
          robi @girish last edited by

          @girish No, not in the view, but on disk. Some apps generate tens of MB of logs for no good reason, which makes troubleshooting take longer and makes it hard to submit only the relevant logs.

          • girish
            girish Staff @robi last edited by

            @robi Ah, I see. I guess one can write a logrotate script with the awk code to remove duplicate lines or something.
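
A rough sketch of the helper such a script might call on an already rotated file. The function name dedupe_in_place is hypothetical, and the wiring into logrotate (for example from a postrotate script) is left out because it depends on the local configuration. It writes the deduplicated lines to a temporary file first and only then overwrites the original.

```shell
# Hypothetical helper: remove duplicate lines from one file in place.
# Intended for files that are no longer being written to.
dedupe_in_place() {
    dedupe_file=$1
    # Temporary file next to the original, so it lives on the same filesystem.
    dedupe_tmp=$(mktemp "${dedupe_file}.XXXXXX") || return 1
    if awk '!seen[$0]++' "$dedupe_file" > "$dedupe_tmp"; then
        # Copy the contents back instead of using mv, so the original
        # file keeps its owner and permissions.
        cat "$dedupe_tmp" > "$dedupe_file"
        dedupe_status=$?
    else
        dedupe_status=1
    fi
    rm -f "$dedupe_tmp"
    return "$dedupe_status"
}

# Demo on a throwaway file with made-up content.
demo_log=$(mktemp)
printf '%s\n' 'connection reset' 'connection reset' 'retrying' > "$demo_log"
dedupe_in_place "$demo_log"
cat "$demo_log"
rm -f "$demo_log"
```

Copying back with cat is not atomic, which is why this sketch is only meant for rotated files and not for a log that a process still has open.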

            • robi
              robi @girish last edited by

              @girish That is the question. Should we either:

              1. post-process the logs after they are written
              2. pre-process the log output before it is written

              Which feature request makes more sense for Cloudron long term?

              • girish
                girish Staff @robi last edited by

                @robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the apache log output for 200 responses). Currently, there is also no way to store that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter to emit logs better.
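
For illustration, one way a post-processing step could keep the repeat count rather than drop it, similar in spirit to uniq -c. This is a sketch, not something Cloudron does: the function name and the "(repeated N times)" suffix are made up, and it only collapses consecutive identical lines, so lines that differ by timestamp would not be merged.

```shell
# Hypothetical helper: collapse runs of identical consecutive lines
# into one line annotated with the run length.
collapse_repeats() {
    awk '
        # Same as the previous line: extend the current run.
        # Appending "" forces a string comparison.
        NR > 1 && ($0 "") == (prev "") { count++; next }
        # A different line arrived: flush the finished run.
        NR > 1 { emit() }
        # Start a new run with the current line.
        { prev = $0; count = 1 }
        # Flush the last run, unless the input was empty.
        END { if (NR > 0) emit() }

        function emit() {
            if (count > 1) print prev " (repeated " count " times)"
            else print prev
        }
    ' "$@"
}

# Made-up sample input.
printf '%s\n' 'GET / 200' 'GET / 200' 'GET / 200' 'GET /missing 404' | collapse_repeats
# prints:
# GET / 200 (repeated 3 times)
# GET /missing 404
```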

                • robi
                  robi @girish last edited by

                  @girish box to start with, then other common, easily repeatable ones like Apache.

                  • girish
                    girish Staff @robi last edited by

                    @robi Are you referring to the box:apphealthmonitor app health lines? Those lines are needed (for us) to know that the apps responded at that instant.

                    • ochoseis
                      ochoseis @girish last edited by

                      Don’t the log lines have timestamps that would make each one unique? If not, cat log.txt | sort | uniq is another handy one-liner. If so, you could throw cut in to split the timestamp off the line first.
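
A sketch of the cut variant, assuming the timestamp is the first field and fields are separated by a single space (an assumption about the log format). The function name is hypothetical. sort -u does the same job as sort | uniq, and passing the file directly to cut makes the cat unnecessary.

```shell
# Assumption: "timestamp message" layout with a single space after the timestamp.
dedupe_sorted_without_timestamp() {
    # cut keeps field 2 onward, dropping the timestamp;
    # sort -u then sorts the remainder and removes duplicates.
    cut -d ' ' -f 2- "$@" | sort -u
}

# Made-up sample input.
printf '%s\n' \
    '10:00:00 request failed' \
    '10:00:05 health check ok' \
    '10:00:10 request failed' | dedupe_sorted_without_timestamp
# prints:
# health check ok
# request failed
```

Unlike the awk approach, this reorders the lines and drops the timestamps from the output, so it suits a quick summary of distinct messages more than a cleaned-up log.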

                      • robi
                        robi @girish last edited by

                        @girish I am referring to any that don't need to be written to disk.

                        You can use the lines for what you need in memory.
