[Trisquel-users] The join command is missing the IPv4 addresses in long mixed lists of strings
amenex at amenex.com
Thu Jul 4 21:52:20 CEST 2019
When I look for matching strings in a pair of one-column files, join has been
ignoring the entries that are IPv4 addresses.
When I try the join command with short files, admittedly seeded with inserted
data to complement the existing matches, the
result includes all the matching IPv4 addresses.
When I apply the same command to a pair of much longer files, all the IPv4
addresses are ignored.
Here are the two commands:
First:
    join -1 1 -2 1 file01.txt file02.txt &> Join-01-02.txt
[It doesn't matter whether or not I add --nocheck-order.]
Second:
    join -1 1 -2 1 /pathtofileA/fileA.txt /pathtofileB/fileB.txt &> Join-A-B.txt
[Sorting the files twice doesn't help.]
The joined output in the first instance includes the matching IPv4 addresses;
in the second, no IPv4 addresses are listed,
but I'm sure there are many matching fields.
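
One way to confirm that matching IPv4 entries really exist, independent of join and of any sort order, is a whole-line fixed-string comparison with grep. This is only a sketch: the file names a.txt and b.txt and their contents are hypothetical stand-ins for the real one-column files, and the IPv4 pattern is a loose dotted-quad test that does not check that each octet is at most 255.

```shell
# Hypothetical sample data standing in for the two one-column files.
tmpdir=$(mktemp -d)
printf '%s\n' example.com 10.0.0.5 192.168.1.1 a-host.net 9.9.9.9 > "$tmpdir/a.txt"
printf '%s\n' 9.9.9.9 zeta.org example.com 192.168.1.1 > "$tmpdir/b.txt"

# -F: fixed strings, -x: match whole lines, -f: read the patterns from a.txt.
# The result is every line of b.txt that also occurs in a.txt, in b.txt's order.
common=$(grep -Fxf "$tmpdir/a.txt" "$tmpdir/b.txt")

# Keep only the common lines that look like dotted-quad IPv4 addresses.
common_ipv4=$(printf '%s\n' "$common" | grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$')
ipv4_count=$(printf '%s\n' "$common_ipv4" | wc -l | tr -d ' ')

printf 'common IPv4 entries (%s):\n%s\n' "$ipv4_count" "$common_ipv4"
rm -r "$tmpdir"
```

If this count is well above zero for the large files while join reports no IPv4 lines, the matches exist and the problem lies in how the files are ordered or compared, not in the data.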
When I visually picked sequences in each original file that encompass several
confirmed matches, including both alphanumeric
and plain unencumbered IPv4 addresses in four-octet format, the joined output
file includes both types of strings.
I checked whether the paths interfere ... it doesn't matter whether the input
files are in the same folder or in different
folders. But for the larger files (6MB and 2MB) the join command skips all
the IPv4 entries; when they're in the same
directory, the system takes twice as long (0.008 sec.) as when they're in
different directories (0.004 sec.). Adding the
--nocheck-order argument doesn't change anything but removes join's
complaints about the sorting.
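
A possible explanation, offered as an assumption and not as a confirmed diagnosis: join makes a single pass through both files and expects them to be in the collation order of its own locale. With --nocheck-order the complaint goes away, but entries that are out of that order can still be skipped silently. The sketch below uses two hypothetical files, numeric.txt and bytewise.txt, that hold the same three addresses in different orders.

```shell
# Hypothetical sample data; both files hold the same three addresses.
tmpdir=$(mktemp -d)
# numeric.txt is in numeric order, as "sort -n" or "sort -V" might leave it.
printf '%s\n' 9.9.9.9 10.0.0.5 192.168.1.1 > "$tmpdir/numeric.txt"
# bytewise.txt is in the byte order that join expects under LC_ALL=C.
printf '%s\n' 10.0.0.5 192.168.1.1 9.9.9.9 > "$tmpdir/bytewise.txt"

# With the order check disabled, join walks both files once and never backs up,
# so matches that appear out of the expected order are skipped without a warning.
partial=$(LC_ALL=C join --nocheck-order "$tmpdir/numeric.txt" "$tmpdir/bytewise.txt")
printf 'joined with --nocheck-order:\n%s\n' "$partial"

# sort -c exits non-zero when a file is not in the order sort itself would produce.
if LC_ALL=C sort -c "$tmpdir/numeric.txt" 2>/dev/null; then
    order_ok=yes
else
    order_ok=no
fi
printf 'numeric.txt in expected order: %s\n' "$order_ok"
rm -r "$tmpdir"
```

Here all three addresses are present in both files, yet only one of them survives the join. Running sort -c on each real input, under the same locale as join, would show whether this is what is happening with the large files.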
I even tried viewing the System Monitor during the large-file sorting.
Then I realized that it might be better to put the smaller of the files to be
joined first ... nope.
Lastly, I split each file at a common matching IPv4 address near the halfway
point. The sorting places most of the numerical IPv4 addresses at the top of
each file, with nearly all the alphanumeric entries from there on to the end.
I had >||< this much success: the joined front-half pair contains no IPv4
addresses at all, not even the known one, so some known matches are missing.
The joined back-half pair contains a few matched IPv4 addresses, but other
known matches are not represented in the joined output. The matched IPv4
address at the break points is the same for all four files.
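
Partial results like these are what one would expect if the files were sorted under one collation and joined under another. A minimal sketch of a consistent pipeline follows, assuming that a locale mismatch is the cause (the post does not confirm this). The names a.txt and b.txt and their contents are hypothetical; the point is only that sort and join both run with LC_ALL=C.

```shell
# Hypothetical sample data standing in for the real one-column files.
tmpdir=$(mktemp -d)
printf '%s\n' 10.0.0.5 example.com 9.9.9.9 192.168.1.1 a-host.net > "$tmpdir/a.txt"
printf '%s\n' 192.168.1.1 zeta.org 9.9.9.9 example.com > "$tmpdir/b.txt"

# Sort both inputs with the same byte-wise collation that join will use.
# -u also drops duplicate lines, which would otherwise multiply the output.
LC_ALL=C sort -u "$tmpdir/a.txt" > "$tmpdir/a.sorted"
LC_ALL=C sort -u "$tmpdir/b.txt" > "$tmpdir/b.sorted"

# Join under the same locale, leaving the order check on so that any
# remaining ordering problem is reported instead of hidden.
joined=$(LC_ALL=C join "$tmpdir/a.sorted" "$tmpdir/b.sorted")
printf '%s\n' "$joined"
rm -r "$tmpdir"
```

With this sample data the output lists both the IPv4 addresses and the host name that the two files share. Note that byte order puts 192.168.1.1 before 9.9.9.9, which looks wrong numerically but is the order join needs here.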
One of the longer files is short enough for testing: GBsmt-front.txt
My ThinkPad T420 has 8GB of memory, running Trisquel's flidas operating