[Trisquel-users] The join command is missing the IPv4 addresses in long mixed lists of strings

amenex at amenex.com amenex at amenex.com
Sat Jul 6 19:48:37 CEST 2019


Since at present there are only eighteen source files to be joined, I
applied my two-step routine to each of them: check the sort status of the
input file, then sort it and re-check it, for example:

 > $ sort -c HBgky.txt
 > sort: HBgky.txt:2: disorder: 4532218122
 > $ sort HBgky.txt > HBgkyA.txt
 > $ mv HBgkyA.txt HBgky.txt
 > $ sort -c HBgky.txt [null response]
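
Doing this by hand for eighteen files is what took the half hour, so here is a minimal sketch of the same check / sort / re-check routine as a loop. The function name fix_sort and the *.txt glob are my own placeholders, and setting LC_ALL=C is an assumption: the point is only that sort and join should run under the same locale so they agree on the ordering.

```shell
#!/bin/sh
# Sketch only: check each file's sort order, re-sort the ones that fail,
# then re-check. LC_ALL=C is an assumption; what matters is that sort and
# join use the same locale.
export LC_ALL=C

fix_sort() {
    # $1: file to check; it is rewritten in place if it is out of order
    if ! sort -c "$1" 2>/dev/null; then
        tmp=$(mktemp) || return 1
        sort "$1" > "$tmp" && mv "$tmp" "$1"
    fi
    # Final re-check: silent, with exit status 0, when the file is sorted
    sort -c "$1"
}

# Hypothetical usage over all input files in the current directory:
# for f in *.txt; do fix_sort "$f" || echo "still unsorted: $f"; done
```

Note that this only repairs the sort order; it does not look for duplicate lines.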

All eighteen had sorting flaw(s), and it took about half an hour to fix
them; but read on ... LVgky.txt also had duplicates !

Then I set about the task of removing duplicates from my 154 joined pairs,
starting with the largest one, on the theory that the largest ones are the
ones with duplicates and the smaller ones won't have any:

 > tr -s ' ' < Join-LVgky-NWrkr.txt | sed 's/ $//' | sort -u > LVgky-NWrkr.txt
==> from 1400 kB to 698 kB, properly sorted ... OK so far

 > join LVgky-NWrkr.txt NWrkr.txt > LVgky-NWrkr-NWrkr.txt
==> LVgky-NWrkr-NWrkr.txt file length matches LVgky-NWrkr.txt file length ... Great !

At this point, I rejoiced: the duplicates were all still common to both files  
... being suspicious, I continued:

 > join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky.txt
==> LVgky-NWrkr-LVgky.txt file length is doubled to 1400 kB ... WaitaMinnit !

Aha ! LVgky.txt was the culprit ... it was full of duplicates.
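
The doubling is what join does by design: it writes one output line for each pair of input lines with the same join field, so a key that appears twice in one file produces two output lines. A small sketch with made-up data (the addresses and file names below are hypothetical, not taken from my files) shows the effect, and how uniq -d can reveal the repeated lines in a sorted file:

```shell
#!/bin/sh
# Sketch with hypothetical sample data: duplicate lines in one input
# multiply the lines that join writes.
export LC_ALL=C
work=$(mktemp -d)

printf '10.0.0.1 hostA\n10.0.0.2 hostB\n' > "$work/clean.txt"
printf '10.0.0.1 x\n10.0.0.1 x\n10.0.0.2 y\n10.0.0.2 y\n' > "$work/dupes.txt"

# Each key in clean.txt pairs with two lines in dupes.txt
n_before=$(join "$work/clean.txt" "$work/dupes.txt" | wc -l)

# uniq -d prints one copy of every line that is repeated in sorted input
repeated=$(uniq -d "$work/dupes.txt")

# Remove the duplicates, then join again
sort -u "$work/dupes.txt" > "$work/dupes-fixed.txt"
n_after=$(join "$work/clean.txt" "$work/dupes-fixed.txt" | wc -l)

echo "joined lines before: $n_before, after: $n_after"
echo "repeated lines found:"
echo "$repeated"
```

Running uniq -d on each sorted input file before joining would have flagged LVgky.txt right away.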

 > tr -s ' ' < LVgky-NWrkr-LVgky.txt | sed 's/ $//' | sort -u > LVgky-NWrkr-LVgky-test.txt
==> File length halved, as with the first no-dupes script.

Magic Banana suspected that:

 > join's output certainly has duplicates because the input files have
 > duplicates (is that normal?). Just add the option --unique (or simply -u)
 > to the sort commands.
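
One way this suggestion might be applied is sketched below. It is not necessarily the command Magic Banana had in mind: the helper name join_unique is hypothetical, and it relies only on the fact that join accepts '-' for one of its two inputs. Both files are cleaned with the same tr / sed / sort -u pipeline I used on the joined output, and only then joined.

```shell
#!/bin/sh
# Sketch only: sort both inputs with -u before joining, so identical lines
# cannot multiply the output. join_unique is a hypothetical helper name.
export LC_ALL=C

join_unique() {
    # $1, $2: the two files to join on their first field
    tmp=$(mktemp) || return 1
    # Squeeze repeated spaces, drop a trailing space, sort, drop duplicates
    tr -s ' ' < "$1" | sed 's/ $//' | sort -u > "$tmp"
    # The second file goes through the same pipeline and reaches join on stdin
    tr -s ' ' < "$2" | sed 's/ $//' | sort -u | join "$tmp" -
    status=$?
    rm -f "$tmp"
    return $status
}

# Hypothetical usage:
# join_unique LVgky.txt NWrkr.txt > LVgky-NWrkr.txt
```

Note that sort -u only drops lines that are identical as a whole; two lines with the same first field but different remaining fields are both kept, and join will still pair each of them.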

My repair task was therefore reduced to the removal of duplicates from all  
the joined pairs of LVgky.txt and the other seventeen input files,
vastly easier than redoing the 154 join pairs that I had created ... just  
eighteen repairs in all, including the repair to LVgky.txt.

Re-joining the repaired output pair against the repaired LVgky.txt
(remembering that both files have been checked for duplicates and their sort  
status):

 > join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky-test.txt
==> 698 kB as for NWrkr.txt. OK !

All this hassle would have been avoided with the modified join command in  
spite of the dire warning in info join:

 >> ‘join’ writes to standard output a line for each pair of input lines
 >> that have identical join fields.  Synopsis:
 >>
 >>     join [OPTION]... FILE1 FILE2
 >>
 >>  Either FILE1 or FILE2 (but not both) can be ‘-’, meaning standard
 >> input.  FILE1 and FILE2 should be sorted on the join fields.

Here are my processing choices:

Sort each input file with my two-step routine, then apply the simple join
command (which is where I forgot to check the input files for duplicates) ...
or
Apply Magic Banana's "sort --unique" together with the modified join
command:

 > join 

