[Trisquel-users] Re : Sort and Uniq fail to remove all duplicates from a list of hostnames and their IPv4 addresses

lcerf at dcc.ufmg.br lcerf at dcc.ufmg.br
Sun Feb 17 17:38:08 CET 2019


Sort -u by itself had no effect on the file size.

If you give 'sort' a text file as an argument, it will not be modified, if that
is what you mean.  That is how all text-processing commands work (well, 'sed'
actually has an --in-place option): you must redirect their output to a file
(or use the option --output, if the command has such an option) and you can
then move that file onto the original one, if you are certain you do not need
it anymore.
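
Here is a minimal sketch of that workflow.  The file name "hosts.txt" and its
content are made up for the illustration, and a temporary directory is used so
that nothing of yours gets overwritten:

```shell
# Work in a throwaway directory (an assumption of this sketch).
cd "$(mktemp -d)"

# Hypothetical sample file with one duplicated line.
printf '%s\n' 'b.example 10.0.0.2' 'a.example 10.0.0.1' 'b.example 10.0.0.2' > hosts.txt

# sort writes to its standard output; hosts.txt itself is left untouched.
sort -u hosts.txt > hosts.txt.sorted

# Same result with sort's output option (-o is the short form of --output).
sort -u -o hosts.txt.sorted hosts.txt

# Only once you are certain you do not need the original anymore:
mv hosts.txt.sorted hosts.txt
cat hosts.txt
```

After the 'mv', hosts.txt contains two lines instead of three.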

Since 'sort -u' removes duplicates, its output is always at most as large as  
its input, but it can be much smaller: one single line if the input only  
contains this line repeated many times.
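
If you want to see that extreme case for yourself, here is a small sketch (the
hostname and the address are made-up sample data):

```shell
# Build 1000 copies of the same line.
input=$(awk 'BEGIN { for (i = 0; i < 1000; i++) print "example.org 192.0.2.1" }')

# Remove the duplicates.
deduplicated=$(printf '%s\n' "$input" | sort -u)

printf '%s\n' "$input" | wc -l          # 1000 lines in
printf '%s\n' "$deduplicated" | wc -l   # 1 line out
printf '%s\n' "$deduplicated"
```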

I'm OK until sed 's/ $//' because in man sed it looks like
it might be sed 's///' instead. Am I missing something ?

What is between the first and the second / is a regular expression the
substitution command ('s') will substitute.  "$" at the end of a regular
expression means "end of the line".  So " $" is a single space at the end of
the line (in the command line I gave, there cannot be more than one space
because tr -s ' ' squeezes each sequence of spaces into a single space).  It
is substituted by nothing (what is between the second and the third /), i.e.,
it is removed.
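
A small sketch to try it, assuming a made-up input line with repeated and
trailing spaces (the "|" is only printed to make the end of the line visible):

```shell
# Hypothetical line: four spaces in the middle, three at the end.
line='example.org    192.0.2.1   '

# tr -s ' ' squeezes each run of spaces into one space;
# sed 's/ $//' then removes the single space left at the end of the line.
cleaned=$(printf '%s\n' "$line" | tr -s ' ' | sed 's/ $//')

printf '%s|\n' "$cleaned"
# prints: example.org 192.0.2.1|
```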

If you want to learn text-processing commands, I have slides:  
https://dcc.ufmg.br/~lcerf/en/mda.html#slides (5 to 9).

Is there a "sort" in
any freedom-compatible application which can put numbers in numerical order ?

'sort' can do that (and much more).  I assume there can be any number of dots  
in a hostname.  That is why, below, I use 'awk' rather than 'tr' and 'sed' to  
not only remove the supernumerary spaces, but also to write the IPv4  
addresses before the hostnames (in this way, 1, 2, 3 and 4 are the numbers of  
the dot-separated columns containing the four numbers in an IPv4 address):
$ awk '{ print $2 "." $1 }' file | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4'
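
To see what each stage does, here is the same pipeline run on made-up sample
data (the hostnames and addresses are assumptions for the illustration, with
the hostname in the first column and the IPv4 address in the second):

```shell
# Hypothetical input: one duplicate and some irregular spacing.
sample='host-b.example.org   10.0.0.20
host-a.example.org 9.255.0.1
host-b.example.org 10.0.0.20
host-c.example.org 10.0.0.3'

sorted=$(printf '%s\n' "$sample" |
  awk '{ print $2 "." $1 }' |                        # e.g. 10.0.0.20.host-b.example.org
  sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 |  # numeric on fields 1 to 4, then the hostname
  sed 's/\./ /4')                                    # turn the 4th dot back into a space

printf '%s\n' "$sorted"
# prints:
# 9.255.0.1 host-a.example.org
# 10.0.0.3 host-c.example.org
# 10.0.0.20 host-b.example.org
```

A plain 'sort' without the numeric keys would compare the addresses as text and
put 10.0.0.20 before both 10.0.0.3 and 9.255.0.1.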

If you want to swap back the two columns, just append "| awk '{ print $2, $1
}'" to the command line.
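
For instance, on one made-up line of sorted output:

```shell
# Hypothetical line as produced by the sorting pipeline: address first.
swapped=$(printf '%s\n' '10.0.0.3 host-c.example.org' | awk '{ print $2, $1 }')

printf '%s\n' "$swapped"
# prints: host-c.example.org 10.0.0.3
```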

If you plan to reuse the command:

Write the following in a new file in a directory listed in your PATH variable,
e.g., in /usr/local/bin/sort-n-remove-duplicates:
#!/bin/sh
awk '{ print $2 "." $1 }' | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4'
Save.
Make the file executable (e.g., using 'chmod +x' or using your file browser).


You can then execute 'sort-n-remove-duplicates < my-file > my-file.sorted',  
where "my-file" is whatever file you want to process and "my-file.sorted" is  
wherever you want to redirect the output.
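
The steps above can be tried without root access with the following sketch.
It assumes a throwaway directory in place of /usr/local/bin and made-up
content for "my-file":

```shell
# Assumption: a temporary directory stands in for /usr/local/bin.
bindir=$(mktemp -d)

# Write the script (the quoted 'EOF' keeps $1 and $2 from being expanded here).
cat > "$bindir/sort-n-remove-duplicates" <<'EOF'
#!/bin/sh
awk '{ print $2 "." $1 }' | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4'
EOF

# Make it executable and put its directory in PATH for this shell session.
chmod +x "$bindir/sort-n-remove-duplicates"
PATH="$bindir:$PATH"

# Hypothetical input file with one duplicated line.
printf '%s\n' 'b.example 10.0.0.2' 'a.example 10.0.0.1' 'b.example 10.0.0.2' > "$bindir/my-file"

sort-n-remove-duplicates < "$bindir/my-file" > "$bindir/my-file.sorted"
cat "$bindir/my-file.sorted"
# prints:
# 10.0.0.1 a.example
# 10.0.0.2 b.example
```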

