[Trisquel-users] Re : Grep consumes all my RAM and swap doing a big job

lcerf at dcc.ufmg.br lcerf at dcc.ufmg.br
Mon Jul 29 02:31:01 CEST 2019


That is not a problem.  That is an algorithm whose first step is not even  
clear: a Google search on {stats "view all sites"} returns a "normal  
response", a list of 224,000 pages from different websites.

Still assuming that the three files you posted in the original post are a  
sample of your input, I actually wonder if all you want is not simply:
$ awk '{ print FILENAME, $0 }' *.txt | sort -k 3 |
  awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
       { ++c; r = r " " $1 " " $2 }
       END { if (p != "") print c, p r }' |
  sort -nrk 1,1 > out

If *.txt catches the three files, "out" is (with the same semantics as  
explained in my previous post, except that the file name is now before the  
number):
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2
(...)
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
1 014199116180.ctinets.com HNs.www_.outwardbound.net_.txt 3
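
To check the pipeline on something small, here is a toy run.  It is only a sketch: the files a.txt and b.txt and their hostnames are made up, and I assume every input line has the form "number hostname".  It includes an END rule so that the last group in sorted order is also printed.

```shell
# Work in a scratch directory so that *.txt only catches the toy files.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Made-up inputs, one "number hostname" pair per line (assumed format).
printf '1 host1.example\n5 host2.example\n' > a.txt
printf '2 host1.example\n' > b.txt

# Prefix each line with its file name, group the lines by hostname (field 3),
# count the files per hostname, then sort by decreasing count.
awk '{ print FILENAME, $0 }' *.txt | sort -k 3 |
  awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
       { ++c; r = r " " $1 " " $2 }
       END { if (p != "") print c, p r }' |
  sort -nrk 1,1 > out

cat out
# 2 host1.example a.txt 1 b.txt 2
# 1 host2.example a.txt 5
```

host1.example is in both files, hence the leading 2; host2.example is only in a.txt.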

The input files can then be removed: all the information is in "out".  You  
can query it with 'grep' and 'awk'.  For instance:

To only get the lines with hostnames in "HNs.bst_.lt_.txt" (hence a selection  
with as many lines as "HNs.bst_.lt_.txt"):
$ grep -F ' HNs.bst_.lt_.txt ' out
To additionally require that the selected hostnames appear in at least one  
other file (as in the problem I stated):
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1'
To only keep the hostnames of the previous output:
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1 { print $2 }'
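
If you also want the number recorded for one given file, a possible variant (only a sketch, relying on the layout of "out" described above: file names in fields 3, 5, 7, ..., each followed by its number) walks through the fields with 'awk' instead of 'grep'.  The file "out.sample" is a hypothetical name; it holds two lines copied from the output above so that the command can be run as is.

```shell
# Two lines taken from the listing above, saved under a hypothetical name.
cat > out.sample <<'EOF'
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
EOF

# For each line, look for the file name f among fields 3, 5, 7, ... and
# print the hostname followed by the number stored right after f.
result=$(awk -v f='HNs.bst_.lt_.txt' \
  '{ for (i = 3; i < NF; i += 2) if ($i == f) print $2, $(i + 1) }' out.sample)
echo "$result"
# xhsjs.preferdrive.net 1
```

The comparison $i == f is an exact match on the whole field, so the dots in the file name need no escaping.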


