[Trisquel-users] Grep consumes all my RAM and swap during a big job

amenex at amenex.com amenex at amenex.com
Sun Jul 28 22:20:40 CEST 2019


Hmmm. We seem both to be writing at once ...

Magic Banana is saying:

Quoting amenex:
 > I want to guard against double-counting, as with 01j01.txt or 01j02.txt
 > vs 02j01.txt, and that requires some heavy-duty concentration.

 >> "My" solution (since my first post in this thread) joins one file with all  
the other files. Not pairwise.
 >> There is nothing to concatenate at the end.

amenex again:
 > I have a script that does a nice job of grouping the duplicated
 > hostnames, but it won't separate them with blank lines ... (yet).

 >> "My" solution (since my first post in this thread) outputs the hostnames  
in order. They are already grouped.
 >> To prepend them with blank lines, the output of every join can be piped  
to:
 >>> awk '$1 != p { p = $1; print "" } { print }'
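
amenex: So, just to be sure I follow, one of those joins piped through that
awk would look roughly like the line below (the exact join options being
whichever ones are already in use; the point is only the trailing awk, which
prints a blank line whenever the hostname in the first field changes):

join 01.txt 02.txt | awk '$1 != p { p = $1; print "" } { print }'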

 >> However, I believe I have finally understood the whole task and I do
 >> not see much point in having the repetitions on several lines
 >> (uselessly repeating the hostname). AWK can count the number of other
 >> files where the hostname is found, print that count, the hostname
 >> (once) and the rest (the number and the file name). 'sort' can then
 >> sort in decreasing order of count. The whole solution is:

amenex:
I'll try that later ... right now I'm thinking that the problem may be
analyzed another way: simply concatenate all the Recent Visitors into one
[huge] file while retaining each hostname's association with the domains'
Webalizer data, then group the Recent Visitor hostnames according to the
number of their occurrences, and thereafter discard the hostnames with the
smallest numbers of duplicates. The data total 39 MB.

With the current directory set to the one in which the numerically coded
two-column hostname lists reside:

time awk '{print FILENAME"\t"$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt 45.txt > Joins/ProcessedVisitorLists/FILENAME.txt ... 46.2 MB; 1,038,048 rows (0.067 sec.)
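
Incidentally, if the shell is bash (an assumption on my part), brace
expansion can generate the 45 zero-padded names instead of typing them all
out; something like this should produce the same tagged file:

time awk '{print FILENAME"\t"$0}' {01..45}.txt > Joins/ProcessedVisitorLists/FILENAME.txt

({01..45} expands to 01 02 ... 45, with the leading zeros, in bash 4 and later.)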

 > time sort -k3 FILENAME.txt > Sorted.txt (0.112 sec.)
 > time awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt ... 7.0 MB; 168,976 rows (0.093 sec.)
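
For what it's worth, uniq only uses the skipped fields for the comparison
and passes the lines through untouched, so the tab-stripping and
tab-restoring awk steps may not be needed. Assuming the NR >= 2 above is
only there to drop a stray first line (tail -n +2 does the same), something
like this should yield essentially the same Duplicates.txt in one pass:

time sort -k3 FILENAME.txt | tail -n +2 | uniq --skip-fields=2 --all-repeated=none > Duplicates.txt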

Forgive me for the unsophisticated script ... the groups are in the
appropriate bunches, but the bunches are in alphabetical order, and all the
original domain data are still present. Sure beats grep, though ...

Print only the hostname column:
 > time awk '{ print $3 }' 'Duplicates02.txt' > CountsOrder.txt (now 5.5 MB; 0.016 sec.)

Now do the counting step:
 > time uniq -c CountsOrder.txt > OrderCounts.txt ... back down to 1.1 MB; 0.009 sec.

Finally, sort them according to count frequency:
 > time sort -rg OrderCounts.txt > SortedByFrequency.txt ... still 1.1 MB; 0.003 sec.
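
(These last three steps also chain together; assuming Duplicates02.txt is
the duplicates file produced above, a single pipeline along these lines
should land at the same SortedByFrequency.txt, since the hostnames are
already grouped and uniq -c can count them without a re-sort:)

time awk '{ print $3 }' Duplicates02.txt | uniq -c | sort -rg > SortedByFrequency.txt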

Truncate to include only counts greater than 2:
 > SortedByFrequencyGT2.txt ... 714 KB (attached)
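
(I left out the truncation command itself; assuming the count is the first
field of SortedByFrequency.txt, a one-liner such as this would do it:)

awk '$1 > 2' SortedByFrequency.txt > SortedByFrequencyGT2.txt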

There are a lot of high-count repetitions.



