[Trisquel-users] Grep consumes all my RAM and swap doing a big job

amenex at amenex.com amenex at amenex.com
Tue Jul 23 17:22:34 CEST 2019


Here's the task at hand:

I've collected forty-five sets of Webalizer Recent Visitor files with Google,
and I'm trying to find which hostnames occur most frequently across those
sets.

I've managed to do this once:

First, create a single-column hostname list from the original two-column  
list:
 >> time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp
(the first column is Webalizer's count of the occurrence of each hostname)
	
Second, find every occurrence of the hostnames from the temp file in the
forty-five Recent Visitor lists:
 >> time grep -f HNusage/HNs.bst.lt/temp *.txt > HNusage/HNs.bst.lt/HNs.bst.lt.visitors.txt

This script managed to create a 159,100 row file without saturating memory  
and swap, but other combinations can't finish.
I'll address the analysis of that 159,100 row file in a separate posting.

It appears that grep is storing all its output in memory without posting  
intermediate results.
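
One variation on the grep step that may be worth trying, offered as an untested assumption rather than a diagnosis: by default grep treats every line of the pattern file as a regular expression, so each dot in a hostname matches any character. With -F the lines are taken as fixed strings instead, which may also need less memory for a pattern list this long. The function name and argument order below are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the second step, using fixed-string matching.
#   $1 = pattern file (one hostname per line, e.g. HNusage/HNs.bst.lt/temp)
#   $2 = output file
#   remaining arguments = the files to search (e.g. *.txt)
search_fixed_strings() {
    patterns=$1
    output=$2
    shift 2
    # -F: each line of the pattern file is a literal string, not a regex
    # -H: prefix every match with the name of the file it came from
    # grep exits with status 1 when nothing matches; that is not treated
    # as an error here.
    grep -F -H -f "$patterns" -- "$@" > "$output" || true
}
```

It would be called as `search_fixed_strings HNusage/HNs.bst.lt/temp HNusage/HNs.bst.lt/HNs.bst.lt.visitors.txt *.txt`. Note that an empty line in the pattern file is an empty pattern, which matches every line of every file searched, so it is worth checking the temp file for blank lines first.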

How do I rewrite my script so that grep takes just one hostname at a time
as its pattern when searching the 45 other lists?
When I do this by hand, each such single search takes ~0.015 seconds; the  
input file has ~2000 hostnames, so the total search
time would be ~30 seconds. My failed grep script, on the other hand, took ~23  
seconds of processor time and 15 minutes real time,
but ran out of memory (7.7 GB RAM, 7.8 GB swap, both 100% at the end).
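
The one-hostname-at-a-time search described above could be sketched as a plain shell loop. This is a minimal sketch, not a tested replacement for the script: the function name and argument order are hypothetical, and it assumes the pattern file holds one hostname per line.

```shell
#!/bin/sh
# Hypothetical helper: search the target files for one hostname at a time.
#   $1 = pattern file (one hostname per line, e.g. HNusage/HNs.bst.lt/temp)
#   $2 = output file
#   remaining arguments = the files to search (e.g. *.txt)
search_one_at_a_time() {
    patterns=$1
    output=$2
    shift 2
    : > "$output"                      # start with an empty output file
    while IFS= read -r host; do
        [ -n "$host" ] || continue     # skip blank lines in the pattern file
        # -F: treat the hostname as a fixed string, so "." is a literal dot
        # -H: prefix every match with the name of the file it came from
        # grep exits with status 1 when nothing matches; that is not an
        # error here, so the loop carries on with the next hostname.
        grep -F -H -e "$host" -- "$@" >> "$output" || true
    done < "$patterns"
}
```

It would be called as `search_one_at_a_time HNusage/HNs.bst.lt/temp HNusage/HNs.bst.lt/HNs.bst.lt.visitors.txt *.txt`. Results are appended after each hostname, so the output file grows while the loop runs. Unlike a single `grep -f` run, the output is grouped by hostname rather than by file, and a line that contains two of the hostnames is written twice.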

Here's my best effort, which starts OK, but shows no progress ...
 >> time awk '-f  HNusage/HNs.bst.lt/temp { for (i = 0; ++i   
HNusage/HNs.bst.lt/HNs.bst.lt.visitors08.txt
(the path statements keep the output file away from the
Recent Visitor files that are in the current directory)

Attached are the master file (HNs.bst.lt.txt) and two target files out of the  
45-file list.

George Langford


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: HNs.bst_.lt_.txt
URL: <http://listas.trisquel.info/pipermail/trisquel-users/attachments/20190723/f5c6ac19/attachment-0003.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: HNs.www_.barcodeus.com_.txt
URL: <http://listas.trisquel.info/pipermail/trisquel-users/attachments/20190723/f5c6ac19/attachment-0004.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: HNs.www_.outwardbound.net_.txt
URL: <http://listas.trisquel.info/pipermail/trisquel-users/attachments/20190723/f5c6ac19/attachment-0005.txt>

