How to dedupe lines in a file efficiently
-
awk '!seen[$0]++' file.txt
seen is an associative array. Awk (gawk, for example) uses each line of the file, $0, as a key into it.
If a line isn't in the array yet, seen[$0] is empty, which Awk treats as false.
The ! is the logical NOT operator and inverts that false to true. Because the expression is a pattern with no action, Awk prints every line for which it is true. The ++ is a post-increment: it runs after the value has been tested, so seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
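To make the evaluation order concrete, here is a small sketch on made-up input. The temp file and its contents are purely illustrative, and the long-hand version is just one way to spell out the same logic.

```shell
# Hypothetical sample input; the contents are for illustration only.
tmp=$(mktemp)
printf '%s\n' apple banana apple cherry banana > "$tmp"

# The one-liner: seen[$0]++ yields the old count (0 on first sight),
# so !0 is true exactly once per distinct line and awk prints it.
deduped=$(awk '!seen[$0]++' "$tmp")
printf '%s\n' "$deduped"
# prints apple, banana, cherry (first occurrences, original order kept)

# The same logic written out long-hand.
longhand=$(awk '{ if (seen[$0] == 0) print; seen[$0] = seen[$0] + 1 }' "$tmp")

rm -f "$tmp"
```

Unlike sort -u, this keeps the original line order, at the cost of holding every distinct line in memory.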
Awk treats every value except 0 and "" (the empty string) as true. When a duplicate line comes along, seen[$0] is already non-zero, so !seen[$0] evaluates to false and the line is not written to the output.
-
@staff Can we use this for deduping lines in logs? Or shall I ask for a feature to handle printing duplicate lines differently?
We'd have to find a way to ignore the timestamps. -
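One possible way to ignore the timestamps, as a rough sketch only: it assumes each log line starts with a single space-free timestamp field followed by spaces, which may not match the real log format. The sample lines are invented. The idea is to build the dedupe key from a copy of the line with that first field stripped, while still printing the original line.

```shell
# Invented sample log; real formats will differ.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2024-01-01T10:00:00Z disk usage at 91%
2024-01-01T10:00:05Z disk usage at 91%
2024-01-01T10:00:07Z connection reset by peer
2024-01-01T10:00:09Z disk usage at 91%
EOF

# key = the line minus its first field (assumed to be the timestamp).
# The second rule has no action, so awk prints the untouched $0
# the first time each key is seen.
log_deduped=$(awk '{ key = $0; sub(/^[^ ]+ +/, "", key) } !seen[key]++' "$tmp")
printf '%s\n' "$log_deduped"

rm -f "$tmp"
```

Only the first occurrence of each message survives, with its original timestamp; later repeats are dropped without any count, so information is lost.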
@robi Not sure. Which app creates lots of redundant lines? Generally, the redundant lines are useful (for example, the Apache log output for 200 responses). Currently, there is also no way to record that line x was repeated 10 times in a specific timeframe. I fear losing log information would only make debugging harder. Maybe we can make the specific app smarter so that it emits better logs.