How to dedupe lines in a file efficiently
robi last edited by girish
awk '!seen[$0]++' file.txt
seen is an associative array; Awk (gawk) uses each input line ($0) as a key into it.
If a line hasn't been seen yet, seen[$0] is unset and evaluates to false.
The ! is the logical NOT operator and inverts that false to true. Awk prints every line for which the expression evaluates to true. The ++ then increments the counter, so seen[$0] == 1 after the first occurrence of a line, seen[$0] == 2 after the second, and so on.
Awk treats everything except 0 and "" (the empty string) as true. So when a duplicate line arrives, seen[$0] is already non-zero, !seen[$0] evaluates to false, and the line is not written to the output.
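A quick demo of the one-liner (the file name and contents are just an example):

```shell
# Create a sample file containing duplicate lines.
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' > file.txt

# Print each line only the first time it is seen;
# the order of first occurrences is preserved.
awk '!seen[$0]++' file.txt
# alpha
# beta
# gamma
```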
@staff Can we use this for deduping lines in logs? Or shall I ask for a feature to handle printing duplicate lines differently?
We'd have to find a way to ignore the timestamps.
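One way to do that, as a sketch: assuming the timestamp is the first space-separated field (the log format below is made up for illustration), key the seen array on everything after it instead of on the whole line:

```shell
# Sample log where only the timestamp differs between repeats.
printf '%s\n' \
  '2024-01-01T10:00:00 app healthy' \
  '2024-01-01T10:00:05 app healthy' \
  '2024-01-01T10:00:10 disk full' > app.log

# Strip the first field (the timestamp) into a separate key,
# so repeats that differ only in time are dropped. The original
# line, timestamp included, is what gets printed.
awk '{ key = $0; sub(/^[^ ]+ /, "", key) } !seen[key]++' app.log
# 2024-01-01T10:00:00 app healthy
# 2024-01-01T10:00:10 disk full
```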
@robi Oh, is this for "collapsing" similar lines in the log view? Like how the browser console does it?
@girish no, not in the view, but on disk. Some apps generate tens of MB of logs for not a lot of good reasons, which makes troubleshooting take longer and submitting relevant logs problematic.
@robi Ah, I see. I guess one can write a logrotate script with the awk code to remove duplicate lines or something.
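A minimal sketch of that idea (the path and schedule are hypothetical, not a tested Cloudron setup; note it throws away repeat counts):

```
/var/log/myapp/app.log {
    daily
    rotate 7
    postrotate
        # Dedupe the freshly rotated file in place (illustrative only).
        awk '!seen[$0]++' /var/log/myapp/app.log.1 > /var/log/myapp/app.log.1.tmp \
            && mv /var/log/myapp/app.log.1.tmp /var/log/myapp/app.log.1
    endscript
}
```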
@girish that is the question: should we either
- post-process the logs after they're written, or
- pre-process the logging before it's written?
Which feature request makes more sense for Cloudron long term?
@robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the apache log output for 200 responses). Currently, there is also no way to store that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter to emit logs better.
@girish box to start with, then other common, easily repeatable ones like apache.
@robi Are you referring to the box:apphealthmonitor app health lines? Those lines are needed (for us) to know the apps responded at that instant.
Don't the log lines have timestamps that would make each unique? If not, cat log.txt | sort | uniq is another handy one-liner. If so, you could throw cut in to split the line.
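For instance, a sketch assuming the timestamp is the first space-separated field (the log format here is invented for illustration):

```shell
# Sample log: two lines identical except for the timestamp.
printf '%s\n' \
  '10:00:00 app healthy' \
  '10:00:05 app healthy' > app.log

# Drop field 1 (the timestamp) with cut, then sort and dedupe.
# Unlike the awk one-liner, this does not preserve original order.
cut -d' ' -f2- app.log | sort | uniq
# app healthy
```

(`sort -u` would do the same as `sort | uniq` in one step, and `uniq -c` would keep a count of how often each line repeated.)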
@girish I am referring to any that don't need to be written to disk.
You can use the lines for what you need in memory.