[Trisquel-users] Grep consumes all my RAM and swap doing a big job

amenex at amenex.com amenex at amenex.com
Wed Jul 24 20:23:40 CEST 2019


Followup:

Starting with the setup script:

 > time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp
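
For readers following along, here is a small sketch of what that setup step
produces. The file names (sample.txt, patterns.txt) and the two-column layout
(a count followed by a hostname) are assumptions made up for the example; they
are not the real contents of HNs.bst.lt.txt.

```shell
# Hypothetical two-column input: a count followed by a hostname.
workdir=$(mktemp -d)
printf '%s\n' '12 host-a.example.net' '7 host-b.example.net' > "$workdir/sample.txt"

# Same form as the setup script: keep only the second field of each line.
awk < "$workdir/sample.txt" '{print $2}' > "$workdir/patterns.txt"

# The result is one pattern per line, ready to be passed to grep -f.
cat "$workdir/patterns.txt"
```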

When the grep command acts on a group of files that have been sorted, the
script runs much more quickly and uses much less RAM, with no need for swap:

 > time grep -f HNusage/HNs.bst.lt/temp HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt
However, the names of the searched files are not recorded in the output.

My attempt to force inclusion of the searched files' filenames:

 > time grep -H -f HNusage/HNs.bst.lt/temp HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt
This lists a single name, /dev/fd/63: (a file descriptor path rather than a
regular file), not the names of the data sets associated with each match.
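
The commands as archived do not show it, but a /dev/fd/NN prefix is what
grep -H prints when the file it searches is a bash process substitution such
as <(sort ...). The following is a minimal sketch under that assumption; the
data set, the pattern file and their names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical data set and pattern file, created only for this demonstration.
workdir=$(mktemp -d)
printf '%s\n' 'host-b.example.net 7' 'host-a.example.net 12' > "$workdir/set1.txt"
printf '%s\n' 'host-a.example.net' > "$workdir/patterns.txt"

# Searching a process substitution: grep only ever sees a /dev/fd path.
from_fd=$(grep -H -f "$workdir/patterns.txt" <(sort "$workdir/set1.txt"))

# Searching the file by name: grep can report the real file name.
from_file=$(grep -H -f "$workdir/patterns.txt" "$workdir/set1.txt")

printf '%s\n%s\n' "$from_fd" "$from_file"
```

If the data has to be sorted first, writing the sorted copy to a named file
and searching that file keeps a usable name in the output.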

After some opposition from bash, I combined the setup and grep scripts:

 > time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp | time grep -H -f HNusage/HNs.bst.lt/temp HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt

But the output still shows that /dev/fd/63 filename, which I suspect stands
in for all the target filenames ...

However, man grep states:

 >> Output Line Prefix Control
 >>        -H, --with-filename
 >>              Print the file name for each match.  This is the default when
 >>              there is more than one file to search.
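
That man page statement can be checked on a small scale. This is only a
sketch: set1.txt and set2.txt are hypothetical stand-ins for the real data
sets, and the pattern is given directly instead of through -f.

```shell
# Two hypothetical data sets that both contain the same hostname.
workdir=$(mktemp -d)
printf '%s\n' 'host-a.example.net 12' > "$workdir/set1.txt"
printf '%s\n' 'host-a.example.net 3'  > "$workdir/set2.txt"

# One file, no -H: the matching line is printed without a file name.
one_file=$(cd "$workdir" && grep 'host-a' set1.txt)

# One file with -H: the file name is prefixed.
one_file_H=$(cd "$workdir" && grep -H 'host-a' set1.txt)

# Two files, no -H: the file name is prefixed by default.
two_files=$(cd "$workdir" && grep 'host-a' set1.txt set2.txt)

printf '%s\n' "$one_file" "$one_file_H" "$two_files"
```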

With my initial grep script, the filenames were included, apparently at the
expense of RAM and swap ... All I'm doing now is sorting the data sets that
are being searched. I suspect that I'll have to sort each of the forty-five
data sets, one at a time, before starting the grep script:

 > time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp | sort -k 2 *.txt | time grep -H -f HNusage/HNs.bst.lt/temp *.txt > HNusage/HNs.bst.lt/HNs.bst.lt.visitors20.txt
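
One thing to keep in mind about the pipeline above: grep is given file names
(*.txt), so it reads those files directly and does not read what sort writes
into the pipe. A possible way to sort each data set one at a time and then
search the sorted copies by name is sketched below. The directory layout, file
names and sample data are all hypothetical, and this is not a tested
replacement for the command above.

```shell
# Hypothetical layout: unsorted data sets in data/, sorted copies in sorted/.
workdir=$(mktemp -d)
mkdir "$workdir/data" "$workdir/sorted"
printf '%s\n' '12 host-b.example.net' '7 host-a.example.net' > "$workdir/data/set1.txt"
printf '%s\n' '3 host-c.example.net' '9 host-a.example.net' > "$workdir/data/set2.txt"
printf '%s\n' 'host-a.example.net' > "$workdir/patterns.txt"

# Step 1: sort each data set on its second field, one file at a time.
for f in "$workdir"/data/*.txt; do
    sort -k 2 "$f" > "$workdir/sorted/$(basename "$f")"
done

# Step 2: once the sorting has finished, search the sorted copies by name,
# so that grep can prefix each match with the real file name.
matches=$(cd "$workdir/sorted" && grep -H -f ../patterns.txt *.txt)
printf '%s\n' "$matches"
```

Running the steps one after another (rather than joined by pipes) also
ensures the pattern file is complete before grep opens it.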

With a little feedback from time:
 >> 702.25user 1.98system 11:48.88elapsed 99%CPU (0avgtext+0avgdata 2960936maxresident)k
 >> 0inputs+20104outputs (0major+740271minor)pagefaults 0swaps
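
To put those figures in more familiar units, the arithmetic can be done with
awk. The numbers are copied from the output above; the only assumption is that
maxresident is reported in kilobytes, which is what the trailing "k" in this
output format indicates.

```shell
# Figures copied from the time output above.
user=702.25; sys=1.98; maxrss_kb=2960936
elapsed_min=11; elapsed_sec=48.88

summary=$(awk -v u="$user" -v s="$sys" -v m="$elapsed_min" \
              -v e="$elapsed_sec" -v k="$maxrss_kb" 'BEGIN {
    wall = m * 60 + e                              # elapsed time in seconds
    printf "elapsed: %.2f s\n", wall
    printf "cpu: %.0f%%\n", 100 * (u + s) / wall   # share of one CPU used
    printf "peak resident memory: %.2f GiB\n", k / 1024 / 1024
}')
printf '%s\n' "$summary"
```

So the run kept one CPU busy for nearly the whole twelve minutes and peaked at
roughly 2.8 GiB of resident memory, with zero swaps reported.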

But this very long script manages to retain the domains' filenames along with
the two columns of matches: 10.3 MB worth, in eleven minutes, without taxing
RAM or using any swap. Those 10.3 MB match my earlier and more
RAM-extravagant results.



