[Trisquel-users] Grep consumes all my RAM and swap doing a big job

amenex at amenex.com amenex at amenex.com
Mon Jul 29 01:22:07 CEST 2019


Magic Banana politely requested:

 >> please express the actual problem clearly. [paraphrasing] in less than ten paragraphs

The following scheme worked OK with a list of about a million visitors' hostnames.

(1) Collect Recent Visitor data with a Google search on {stats "view all  
sites"}. Pick a recent month (such as 201906); then copy and save the usually  
very long list of hostnames (last column) and the number of occurrences  
(first column). Discard the middle columns and retain the files 01.txt ...  
NN.txt

(2) Concatenate all the resulting two-column files into one multi-megabyte  
text file and add the filenames of the component files in the first column:
 >> awk '{print FILENAME "\t" $0}' 01.txt 02.txt ... NN.txt > Joins/FILENAME.txt
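As a small illustration of step (2), here is a sketch run on two tiny made-up files. The hostnames and counts are hypothetical sample data, not taken from any real visitor list, and the scratch directory is only there so nothing real gets overwritten.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical two-column inputs: count first, hostname last.
printf '5\thost-a.example.net\n2\thost-b.example.net\n' > 01.txt
printf '7\thost-a.example.net\n' > 02.txt

# Same command as step (2): prefix every line with the name of its source file.
mkdir -p Joins
awk '{print FILENAME "\t" $0}' 01.txt 02.txt > Joins/FILENAME.txt

cat Joins/FILENAME.txt
```

Each output line then carries three tab-separated columns: source file, count, hostname.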

(3) Sort the resulting multi-megabyte text file on its third column:
 >> time sort -k3 FILENAME.txt > Sorted.txt

(4) Collect all the many duplicate hostnames in the sorted multi-megabyte  
text file. A workable script is:
 >> awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt

This is a good file to keep for research purposes, as the domains subjected  
to the repeated visits are still associated with the hostnames of the  
visitors.
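To see what step (4) keeps and what it drops, here is a sketch on a few made-up lines. The sample data is hypothetical, and the `NR >= 2` condition is left out only because this sample has no first line to skip. It assumes GNU uniq, since the long options are GNU extensions.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical input, already sorted on the third column (hostname).
printf '01.txt\t5\thost-a.example.net\n'  > Sorted.txt
printf '02.txt\t7\thost-a.example.net\n' >> Sorted.txt
printf '01.txt\t2\thost-b.example.net\n' >> Sorted.txt
printf '03.txt\t1\thost-c.example.net\n' >> Sorted.txt
printf '03.txt\t4\thost-c.example.net\n' >> Sorted.txt

# Compare lines on everything after the first two fields and
# print every line whose hostname occurs more than once.
awk '{ print $1, $2, $3 }' Sorted.txt \
  | uniq --skip-fields=2 --all-repeated=none \
  | awk '{ print $1 "\t" $2 "\t" $3 }' > Duplicates.txt

cat Duplicates.txt
```

Here host-b.example.net appears only once, so it is dropped, while both lines for host-a and host-c survive along with their source files.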

(5) Strip all but the third column from Duplicates.txt with
 >> awk '{ print $3 }' 'Duplicates.txt' > CountsOrder.txt

(6) Count the occurrences of each hostname in the resulting (still sorted) file with:
 >> uniq -c CountsOrder.txt > OrderCounts.txt

(7) Sort this file to place the hostnames with the most numerous counts at  
the top:
 >> sort -rg OrderCounts.txt > SortedByFrequency.txt
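Steps (5) through (7) could also be chained into a single pipeline, which avoids the two intermediate files. This is only a sketch on hypothetical sample data; the result should be the same as running the three steps one at a time.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical Duplicates.txt: source file, count, hostname (sorted by hostname).
printf '01.txt\t5\thost-a.example.net\n'  > Duplicates.txt
printf '02.txt\t7\thost-a.example.net\n' >> Duplicates.txt
printf '04.txt\t1\thost-a.example.net\n' >> Duplicates.txt
printf '03.txt\t1\thost-c.example.net\n' >> Duplicates.txt
printf '03.txt\t4\thost-c.example.net\n' >> Duplicates.txt

# Keep the hostname column, count adjacent repeats, sort by count, largest first.
awk '{ print $3 }' Duplicates.txt | uniq -c | sort -rg > SortedByFrequency.txt

cat SortedByFrequency.txt
```

Because the input is already sorted by hostname, `uniq -c` sees all repeats of a hostname as adjacent lines and counts them together.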

(8) Truncate the megabyte-plus SortedByFrequency.txt file to keep only hostnames repeated three or more times, giving SortedByFrequencyGT2.txt:
 >> https://trisquel.info/files/SortedByFrequencyGT2.txt
  ... the scholar will add a suitable script for this task when NN increases  
to an unmanageably large number.
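One possible script for step (8), offered as a sketch rather than as the method actually used to make the posted file: let awk keep only the lines whose leading count is three or more. The sample lines are hypothetical and mimic the `uniq -c` output format.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical SortedByFrequency.txt in "count hostname" form.
printf '      5 host-a.example.net\n'  > SortedByFrequency.txt
printf '      3 host-c.example.net\n' >> SortedByFrequency.txt
printf '      2 host-b.example.net\n' >> SortedByFrequency.txt

# Keep only lines whose first field (the count) is at least 3.
awk '$1 >= 3' SortedByFrequency.txt > SortedByFrequencyGT2.txt

cat SortedByFrequencyGT2.txt
```

This reads the file one line at a time, so it should not matter how large NN becomes.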

(9) Find all the IPv4 addresses which resolve to the most numerously repeated  
visitor hostnames. That's another 

