Cloudron Forum


How to dedupe lines in a file efficiently

logs
10 Posts 3 Posters 336 Views
  • robi wrote (last edited by girish)
    #1
    awk '!seen[$0]++' file.txt
    awk '!seen[$0]++' file.txt
    

    seen is an associative array; Awk (gawk) evaluates the pattern against every line of the file.
    The first time a line appears, seen[$0] is unset and evaluates to false.
    The ! is the logical NOT operator and inverts that false to true, and Awk prints lines for which the expression is true. The ++ then increments the counter, so seen[$0] == 1 after the first occurrence of a line, seen[$0] == 2 after the second, and so on.
    Awk treats everything except 0 and "" (the empty string) as true, so for any duplicate line seen[$0] is non-zero, !seen[$0] evaluates to false, and the line is not written to the output.
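    A quick demonstration of the one-liner (the file name and contents are illustrative):

```shell
# Build a small sample file containing duplicate lines
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' > /tmp/dedupe-demo.txt

# Print each line only the first time it is seen, preserving order
awk '!seen[$0]++' /tmp/dedupe-demo.txt
# → alpha
#   beta
#   gamma
```

    Unlike sort | uniq, this keeps the original line order and makes a single pass, at the cost of holding every distinct line in memory.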

    @staff Can we use this for deduping lines in logs? Or shall I ask for a feature to handle printing duplicate lines differently?
    We'd have to find a way to ignore the timestamps.

    Life of sky tech

  • girish (Staff) replied to robi
    #2

    @robi Oh, is this for "collapsing" similar lines in the log view? Like how the browser console does it?

  • robi replied to girish
    #3

    @girish no, not in the view, but on disk. Some apps generate tens of MB of logs for not a lot of good reasons, which makes troubleshooting take longer and submitting relevant logs problematic.

  • girish (Staff) replied to robi
    #4

    @robi Ah, I see. I guess one can write a logrotate script with the awk code to remove duplicate lines or something.
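    A minimal sketch of that idea, assuming a hypothetical log path (/tmp/app.log stands in for a real app log; this is not an existing Cloudron feature):

```shell
# Illustrative only: /tmp/app.log stands in for a real app log path
LOG=/tmp/app.log
printf 'dup line\nunique line\ndup line\n' > "$LOG"

# Write the deduped stream to a temp file, then replace the original
awk '!seen[$0]++' "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"
```

    A real logrotate hook would run the awk step from a postrotate script against the rotated file rather than the live log.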

  • robi replied to girish
    #5

    @girish that is the question. Should we either:

    1. post-process the logs after they are written, or
    2. pre-process the logging before it is written?

    Which feature request makes more sense for Cloudron long term?

  • girish (Staff) replied to robi
    #6

    @robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the apache log output for 200 responses). Currently, there is also no way to store that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter to emit logs better.
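    For what it's worth, coreutils uniq can keep the "line x was repeated N times" information when collapsing adjacent duplicates (the sample data below is made up):

```shell
# Sample log with adjacent repeated lines
printf 'ready\nready\nready\ndone\n' > /tmp/count-demo.log

# -c prefixes each surviving line with its repeat count
uniq -c /tmp/count-demo.log
# prints "3 ready" and "1 done" (counts are left-padded)
```

    This only collapses adjacent repeats, so it fits logs where the duplicates arrive back to back.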

  • robi replied to girish
    #7

    @girish box to start with, then other common, easily repeatable ones like apache.

  • girish (Staff) replied to robi
    #8

    @robi Are you referring to the box:apphealthmonitor app health lines? Those lines are needed (for us) to know that the apps responded at that instant.

  • ochoseis replied to girish
    #9

    Don’t the log lines have timestamps that would make each unique? If not, sort log.txt | uniq is another handy one-liner. If so, you could throw cut in to strip the timestamp first.
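    Building on that: if each line starts with a timestamp, an awk variant can key its seen array on everything after the first field, so repeated messages collapse no matter when they were logged (the log format here is hypothetical):

```shell
# Two "ready" messages that differ only in their timestamp
printf '2024-01-01T10:00:01 ready\n2024-01-01T10:00:02 ready\n2024-01-01T10:00:03 done\n' > /tmp/ts-demo.log

# Strip the leading timestamp field to form the dedupe key,
# but print the original line (timestamp included)
awk '{ key = $0; sub(/^[^ ]+ /, "", key) } !seen[key]++' /tmp/ts-demo.log
# → 2024-01-01T10:00:01 ready
#   2024-01-01T10:00:03 done
```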

  • robi replied to girish
    #10

    @girish I am referring to any that don't need to be written to disk.

    You can use the lines for what you need in memory.

