[Trisquel-users] Sort and Uniq fail to remove all duplicates from a list of hostnames and their IPv4 addresses

amenex at amenex.com amenex at amenex.com
Fri Feb 15 02:57:16 CET 2019

At Magic Banana's suggestion, I applied sort -u to the test file in my  
posting, with a perfect elimination of the extra files, but when I tried the
same approach on another set of data that I had first sorted with LibreOffice
Calc, sort -u reduced the number of rows from 67 to 46, but four of these were
still duplicates in pairs and in threes.
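
When rows that look identical survive sort -u, one way to check is to print
the file with its invisible bytes made visible. What follows is only a sketch,
assuming GNU coreutils; the helper name show_hidden and the sample hostname and
address are made up for illustration. cat -A marks each line end with $, a tab
as ^I and a carriage return as ^M, so a stray trailing space or a DOS-style
line ending shows up at once.

```shell
# Hypothetical helper: print a file (or stdin) with invisible bytes made visible.
# GNU cat -A shows line ends as $, tabs as ^I and carriage returns as ^M.
show_hidden() {
    cat -A -- "$@"
}

# Demonstration on a throwaway file; hostname and address are placeholders.
tmp=$(mktemp)
printf 'host.example.com 192.0.2.1\n'   >  "$tmp"   # clean line
printf 'host.example.com 192.0.2.1 \n'  >> "$tmp"   # trailing space
printf 'host.example.com 192.0.2.1\r\n' >> "$tmp"   # trailing carriage return

# Byte-wise comparison treats these as three different lines.
LC_ALL=C sort -u "$tmp" | wc -l

# The differences become visible just before the $ end-of-line marker.
show_hidden "$tmp"
rm -f "$tmp"
```

Any line whose $ is preceded by a space, ^I or ^M differs from its clean twin,
which would explain why both copies are kept.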

In the meantime I used the hands-on approach to the original body of the test
file (2,000+ rows), first cutting it down to about half that number of rows
with the duplicate-removal function supplied by LibreOffice Calc, which left
me with about 500 pairs of duplicate rows, with a few singles mixed in. Took
about an hour to select and delete the rows, one at a time.

Still, sort -u is a big improvement if it gets rid of most of the duplicates.

I had thought that Leafpad would cleanse a text file of LibreOffice Calc
formatting, but there may be some stuff still in there. The test file had
additional cleansing, what with its trip to Spain and back via much commotion
that would have lost the invisible stuff, compared to the text file that I was
working with, fresh from a LibreOffice Calc file.

I tried sort -u on the main body of the test file, with the output literally
full of duplicates. There's something about LibreOffice Calc that's leaving
invisible characters that fool the sorting software ...
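
If the culprit really is leftover whitespace or line-ending bytes, which is
only a guess at this point, the list could be normalized before sorting. The
sketch below assumes one hostname and its address per line; the function name
clean_unique and the file names in the usage comment are hypothetical. It does
not deal with other possible leftovers such as non-breaking spaces or a
byte-order mark.

```shell
# Hypothetical helper: normalize a list read from stdin, then drop duplicates.
clean_unique() {
    # 1. delete carriage returns left over from DOS-style line endings
    # 2. squeeze each run of spaces/tabs to one space, trim both line ends
    # 3. compare raw bytes (LC_ALL=C) and keep one copy of each line
    tr -d '\r' |
        sed -e 's/[[:blank:]][[:blank:]]*/ /g' -e 's/^ //' -e 's/ $//' |
        LC_ALL=C sort -u
}

# Usage (file names are placeholders):
#   clean_unique < hostnames.txt > hostnames-unique.txt
```

Setting LC_ALL=C makes sort compare bytes rather than apply locale collation
rules, so two rows count as duplicates only when they are exactly the same
after the cleanup.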

Thanks for thinking about this.

BTW, my original example.com raw data was about 18,000 rows, which I put into
LibreOffice Calc, alphabetized, and then divided into about thirty smaller,
more manageable files, each starting with one of the letters of the alphabet.
The extra four files hold the hostnames beginning with numbers and three
sets of macro-multiples of example.com like the test file. There will be many
thousands of unresolvable hostnames represented in the final result.

George Langford

