[Trisquel-users] The join command is missing the IPv4 addresses in long mixed lists of strings

amenex at amenex.com
Thu Jul 4 21:52:20 CEST 2019


When I look for matching strings in a pair of one-column files, join has been  
ignoring the entries that are IPv4 addresses.

When I try the join command with short files, admittedly seeded with inserted  
data to complement the existing matches, the
result includes all the matching IPv4 addresses.

When I apply the same command to a pair of much longer files, all the IPv4  
addresses are ignored.

Here are the two commands:

First:  join -1 1 -2 1 file01.txt file02.txt &> Join-01-02.txt
[It doesn't matter whether or not I add --nocheck-order.]

Second: join -1 1 -2 1 /pathtofileA/fileA.txt /pathtofileB/fileB.txt &> Join-A-B.txt
[Sorting the files twice doesn't help.]
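
One thing that may be worth checking here (an assumption, not something this post confirms): join expects both inputs to be sorted in the same collation order that join itself uses, and in a mixed list of IPv4 addresses and names that order depends on the locale. The sketch below uses made-up sample data and placeholder file names, and forces the byte-wise C locale for both sort and join so the two agree.

```shell
# Hypothetical sample data; these are not the poster's files.
workdir=$(mktemp -d)
printf '%s\n' 10.0.0.5 192.168.1.1 example.com 9.9.9.9 > "$workdir/a.txt"
printf '%s\n' 9.9.9.9 mail.example.org 192.168.1.1 > "$workdir/b.txt"

# Sort both inputs and run join under the same byte-wise (C) locale.
LC_ALL=C sort -u "$workdir/a.txt" > "$workdir/a.sorted"
LC_ALL=C sort -u "$workdir/b.txt" > "$workdir/b.sorted"
joined=$(LC_ALL=C join "$workdir/a.sorted" "$workdir/b.sorted")

# Print the entries common to both files.
printf '%s\n' "$joined"

rm -rf "$workdir"
```

With this sample data the two shared addresses, 192.168.1.1 and 9.9.9.9, both appear in the output. Note also that `&>` in the commands above sends join's error messages into the output file along with the matches.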

The joined output in the first instance includes the matching IPv4 addresses;  
in the second, no IPv4 addresses are listed,
but I'm sure there are many matching fields.

When I visually picked sequences from each original file that encompass
several confirmed matches, including both alphanumeric entries and plain
unencumbered IPv4 addresses in four-octet format, and joined just those
excerpts, the joined output file included both types of strings.

I checked whether the paths interfere: it doesn't matter whether the input
files are in the same folder or in different folders. For the larger files
(6MB and 2MB), though, the join command skips all the IPv4 entries. When the
files are in the same directory, the run takes twice as long (0.008 sec.) as
when they're in different directories (0.004 sec.). Adding the
--nocheck-order argument changes nothing except that it silences join's
complaints about the sorting.
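
Since --nocheck-order only suppresses the order check and does not make join cope with unsorted input, it can help to test the order directly. A minimal sketch, assuming GNU sort and a made-up three-line file: `sort -c` exits non-zero when a file is not in the order the current locale expects. Here the list is in numeric order, which is not byte-wise order.

```shell
# Hypothetical sample: addresses in numeric order, not byte-wise order.
workdir=$(mktemp -d)
printf '%s\n' 9.9.9.9 10.0.0.5 192.168.1.1 > "$workdir/numeric.txt"

# sort -c only checks the order; it writes no sorted output.
if LC_ALL=C sort -c "$workdir/numeric.txt" 2>/dev/null; then
    order_status=sorted
else
    order_status=unsorted
fi
echo "$order_status"

rm -rf "$workdir"
```

For this sample the check reports `unsorted`, because "9.9.9.9" compares greater than "10.0.0.5" character by character. Running the same check on each real input, under the same locale used for join, would show whether join's complaints are justified.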

I even tried viewing the System Monitor during the large-file sorting.  
Nothing ...

Then I realized that it might be better to put the smaller of the files to be  
joined first ... nope.

Lastly, I split each file at a common matching IPv4 address at about the
halfway point. The sorting places most of the numerical IPv4 addresses at
the top of the file, with nearly all alphanumeric entries from there on to
the end of the file. I had only >||< this much success: in the joined
front-half pair there are no IPv4 addresses at all, not even the known one,
so known matches are missing. In the joined back-half pair there are a few
matched IPv4 addresses, but other known matches are not represented in the
joined output. The matched IPv4 address at the break points is the same for
all four files.
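
As an order-independent cross-check on which entries really match, a fixed-string, whole-line grep can stand in for join on one-column files. This is a sketch with made-up sample data, not the poster's method: `-F` treats each pattern literally (so the dots in an address are not wildcards), `-x` requires the whole line to match, and `-f` reads the patterns from the first file.

```shell
# Hypothetical sample data, deliberately left unsorted.
workdir=$(mktemp -d)
printf '%s\n' example.com 9.9.9.9 10.0.0.5 192.168.1.1 > "$workdir/a.txt"
printf '%s\n' 192.168.1.1 mail.example.org 9.9.9.9 > "$workdir/b.txt"

# Lines of b.txt that exactly equal some line of a.txt, in any input order.
matches=$(grep -Fxf "$workdir/a.txt" "$workdir/b.txt" | LC_ALL=C sort)
printf '%s\n' "$matches"

rm -rf "$workdir"
```

If this lists IPv4 addresses that join leaves out for the same pair of files, the inputs' sort order (rather than the data itself) is the more likely suspect.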

One of the longer files is short enough for testing: GBsmt-front.txt

My ThinkPad T420 has 8GB of memory and runs Trisquel's flidas operating
system.

George Langford


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: GBsmt-front.txt
URL: <http://listas.trisquel.info/pipermail/trisquel-users/attachments/20190704/1cc97b80/attachment-0001.txt>

