How to dedupe lines in a file efficiently
-
awk '!seen[$0]++' file.txt
seen is an associative array, and $0 is the current line, so Awk (gawk) uses each line of the file as a key into seen.
If a line is not yet in the array, seen[$0] evaluates to false.
The ! is the logical NOT operator and inverts that false to true. The pattern has no action block, so Awk applies its default action and prints every line for which the expression is true. The ++ is a post-increment: it runs after the value has been read, so seen[$0] == 1 once a line has been seen for the first time, seen[$0] == 2 after the second time, and so on.
Awk treats every value except 0 and "" (the empty string) as true. When a duplicate line comes in, seen[$0] is already nonzero, so !seen[$0] evaluates to false and the line is not written to the output.
@staff Can we use this for deduping lines in logs? Or should I ask for a feature to handle printing duplicate lines differently?
We'd have to find a way to ignore the timestamps.
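One way to ignore timestamps could be sketched as follows. It assumes every log line starts with a timestamp that forms a single whitespace-delimited field (for example an ISO 8601 stamp); real log formats may differ, and the function name dedupe_ignoring_timestamp is made up for illustration. The timestamp is stripped only from the lookup key, so the line that gets printed keeps it.

```shell
# Hypothetical helper: print a line only if its text, minus the leading
# timestamp field, has not been seen before.
# Assumption: the timestamp is the first whitespace-delimited field.
dedupe_ignoring_timestamp() {
  awk '{
    key = $0
    sub(/^[^ \t]+[ \t]+/, "", key)   # strip the first field from the key only
    if (!seen[key]++) print          # print the original line, timestamp intact
  }' "$@"
}

# Usage with made-up sample lines:
printf '%s\n' \
  '2024-01-01T00:00:01Z app started' \
  '2024-01-01T00:00:02Z app started' \
  '2024-01-01T00:00:03Z app stopped' | dedupe_ignoring_timestamp
# prints:
# 2024-01-01T00:00:01Z app started
# 2024-01-01T00:00:03Z app stopped
```

Note that seen grows with the number of distinct lines, so memory use is bounded by the unique content of the file rather than its total size.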
-
@robi Oh, is this for "collapsing" similar lines in the log view? Like how the browser console does it?
-
@girish No, not in the view, but on disk. Some apps generate tens of MB of logs for no good reason, which makes troubleshooting take longer and makes submitting relevant logs problematic.
-
@robi Ah, I see. I guess one can write a logrotate script with the awk code to remove duplicate lines or something.
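A post-processing step of that kind might look roughly like the sketch below. The function name dedupe_file is hypothetical, and how it gets invoked is an assumption too: with logrotate it could be called on the rotated file from a postrotate script. It relies on mktemp, which is widely available but not part of POSIX.

```shell
# Hypothetical helper: rewrite a log file in place, keeping only the
# first copy of each line and preserving the original order.
dedupe_file() {
  file=$1
  tmp=$(mktemp) || return 1
  # Only overwrite the original if awk finished successfully.
  if awk '!seen[$0]++' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # write back through the existing file so owner and mode are kept
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

This is only safe on a file that is no longer being written to, such as an already rotated log; running it on a live log could drop lines appended while the copy is in progress.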
-
@girish That is the question. Should we either:
- post-process the logs after they are written
- pre-process the log output before it is written
Which feature request makes more sense for Cloudron long term?
-
@robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the apache log output for 200 responses). Currently, there is also no way to store that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter to emit logs better.
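For illustration only, a filter that keeps the repeat count instead of discarding it could look like the sketch below. It collapses runs of consecutive identical lines, similar in spirit to uniq -c, and appends the count to the line. The function name collapse_repeats and the "(repeated N times)" suffix are invented for this example, and it compares whole lines, so lines that differ only by timestamp would not be collapsed.

```shell
# Hypothetical helper: collapse runs of identical consecutive lines
# into one line with a repeat count appended.
collapse_repeats() {
  awk '
    function emit() {
      if (count > 1)       print prev " (repeated " count " times)"
      else if (count == 1) print prev
    }
    NR > 1 && $0 == prev { count++; next }   # same as the previous line: just count it
    { emit(); prev = $0; count = 1 }         # new line: emit the finished run, start another
    END { emit() }                           # emit the last run
  ' "$@"
}

# Usage with made-up input:
printf '%s\n' a a a b a | collapse_repeats
# prints:
# a (repeated 3 times)
# b
# a
```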
-
@robi Are you referring to the "box:apphealthmonitor app health" lines? Those lines are needed (for us) to know the apps responded at that instant.