[Trisquel-users] The join command is missing the IPv4 addresses in long mixed lists of strings
amenex at amenex.com
Sat Jul 6 19:48:37 CEST 2019
Considering that at present there are only eighteen source files to be
joined, I applied my two-step approach to the task of
first checking the sort status of my input files, then sorting them, and
finally re-checking them, for example:
> $ sort -c HBgky.txt
> sort: HBgky.txt:2: disorder: 4532218122
> $ sort HBgky.txt > HBgkyA.txt
> $ mv HBgkyA.txt HBgky.txt
> $ sort -c HBgky.txt [null response]
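The check, sort, re-check routine can be looped over every input file rather than typed eighteen times. This is only a minimal sketch: the scratch directory and sample.txt are hypothetical stand-ins for the real files, and setting LC_ALL=C is my assumption, made so that sort and join agree on one ordering. It uses "sort -o", which may write to its own input file, instead of the temporary file and mv.

```shell
# Hypothetical demo data: a scratch directory with one unsorted file.
workdir=$(mktemp -d)
printf '4532218122 b\n1.2.3.4 a\n' > "$workdir/sample.txt"

# Assumption: one fixed locale so that sort and join use the same ordering.
export LC_ALL=C

for f in "$workdir"/*.txt; do
    # sort -c only checks; it reports the first out-of-order line and fails.
    if ! sort -c "$f" 2>/dev/null; then
        # sort -o can safely overwrite the file it is reading.
        sort -o "$f" "$f"
    fi
    # Re-check: silent and successful once the file is in order.
    sort -c "$f" && echo "$f: sorted"
done
```

Note that this checks sort order only; it does nothing about duplicate lines.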
All eighteen had sorting flaws ... it took about half an hour to perform these
fixes; but read on ... LVgky.txt had duplicates !
Then I set about removing duplicates from my 154 joined pairs, starting with
the largest, on the theory that the largest ones are the ones with duplicates
while the smaller ones are free of them ... :
> tr -s ' ' < Join-LVgky-NWrkr.txt | sed 's/ $//' | sort -u > LVgky-NWrkr.txt
==> from 1400 kB to 698 kB, properly sorted ... OK so far
> join LVgky-NWrkr.txt NWrkr.txt > LVgky-NWrkr-NWrkr.txt ==>
LVgky-NWrkr-NWrkr.txt file length matches LVgky-NWrkr.txt file length ...
At this point, I rejoiced: the duplicates were all still common to both files
... being suspicious, I continued:
> join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky.txt ==>
LVgky-NWrkr-LVgky.txt file length is doubled to 1400 kB ... WaitaMinnit !
Aha ! LVgky.txt was the culprit ... it was full of duplicates.
> tr -s ' ' < LVgky-NWrkr-LVgky.txt | sed 's/ $//' | sort -u > LVgky-NWrkr-LVgky-test.txt
==> File length halved as for the first no-dupes
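A sorted input file can also be tested for duplicates directly, before any joining, with uniq -d. The sketch below uses a hypothetical sample file in place of something like LVgky.txt; the addresses and host names are made up for illustration.

```shell
# Hypothetical sample standing in for an already-sorted input file.
workdir=$(mktemp -d)
printf '10.0.0.1 host-a\n10.0.0.1 host-a\n10.0.0.2 host-b\n' > "$workdir/sample.txt"

# uniq -d prints one copy of each line that repeats on adjacent lines,
# which is why the input has to be sorted first.
dupes=$(uniq -d "$workdir/sample.txt")

if [ -n "$dupes" ]; then
    echo "duplicate lines found:"
    echo "$dupes"
fi
```

An empty result means the file has no repeated lines.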
Magic Banana suspected that:
> join's output certainly has duplicates because the input files have
> duplicates (is that normal?). Just add the option --unique (or simply -u)
> to the sort commands.
My repair task was therefore reduced to the removal of duplicates from all
the joined pairs of LVgky.txt and the other seventeen input files,
vastly easier than redoing the 154 join pairs that I had created ... just
eighteen repairs in all, including the repair to LVgky.txt.
Re-joining the repaired output pair against the repaired LVgky.txt
(remembering that both files have been checked for duplicates and their sort
status):
> join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky-test.txt
==> 698 kB as for NWrkr.txt. OK !
All this hassle would have been avoided with the modified join command in
spite of the dire warning in info join:
>> ‘join’ writes to standard output a line for each pair of input lines
>> that have identical join fields. Synopsis:
>> join [OPTION]... FILE1 FILE2
>> Either FILE1 or FILE2 (but not both) can be ‘-’, meaning standard
>> input. FILE1 and FILE2 should be sorted on the join fields.
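The phrase "a line for each pair of input lines" is what explains the doubled file length: a line that occurs twice in one input pairs twice with its match in the other. A small sketch with made-up data (the file names and contents are hypothetical, and LC_ALL=C is my assumption for a consistent ordering):

```shell
workdir=$(mktemp -d)

# Hypothetical data: file1 holds the same line twice, file2 holds the key once.
printf '10.0.0.1 host-a\n10.0.0.1 host-a\n' > "$workdir/file1.txt"
printf '10.0.0.1 seen\n' > "$workdir/file2.txt"

# join emits one line per matching pair, so 2 lines x 1 line = 2 output lines.
LC_ALL=C join "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/joined.txt"

cat "$workdir/joined.txt"
```

Both output lines are identical, which is why a later sort -u halves the file again.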
Here are my processing choices:
Sort each input file with my two-step routine, then apply the simple join
command (forgetting to check the input files for duplicates)...
Apply Magic Banana's "sort --unique" concurrently to the modified join
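The second choice might look something like the sketch below. The exact modified join command is not reproduced in this message, so the shape here is only my assumption: sort -u feeds join through "-" (standard input, as the info text allows), and the other file is sorted and de-duplicated beforehand. The file names, the sample data and the LC_ALL=C setting are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: unsorted, and a.txt contains a duplicate line.
printf '10.0.0.2 host-b\n10.0.0.1 host-a\n10.0.0.1 host-a\n' > "$workdir/a.txt"
printf '10.0.0.2 y\n10.0.0.1 x\n' > "$workdir/b.txt"

# Assumption: one fixed locale so that sort and join use the same ordering.
export LC_ALL=C

# Sort and de-duplicate the second file once, up front.
sort -u "$workdir/b.txt" > "$workdir/b.sorted.txt"

# Sort and de-duplicate the first file on the fly; "-" makes join read it
# from standard input.
sort -u "$workdir/a.txt" | join - "$workdir/b.sorted.txt" > "$workdir/joined.txt"

cat "$workdir/joined.txt"
```

With both inputs sorted and free of duplicate lines, each address in this sample appears once in the output.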