[Trisquel-users] Re : Grep consumes all my RAM and swap doing a big job

lcerf at dcc.ufmg.br lcerf at dcc.ufmg.br
Wed Jul 24 01:34:02 CEST 2019


Again: you have no idea how much time you would save by stopping for ~10
hours and actually learning what you use: the commands (e.g., as I have
already told you, the "system" function must be used in AWK to call system
commands... but you do not need that here), regular expressions (unless
escaped, a dot means "any character"; 'grep -F' must be used to interpret the
patterns as fixed strings), simpler commands (your first AWK program just
does 'cut -f 2'; 'grep' is rarely needed with structured files), etc.

Here, the simplest solution is to use 'join', which was also the solution in
the previous thread you created on this forum. However, your input looks
wrong: on line 674 of HNs.bst_.lt_.txt, the second column only contains the
character 0, so your 'grep' selects (among others) every line that includes
this character. I assume you want whole-domain matches.
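
To make the two 'grep' pitfalls above concrete, here is a small sketch. The four sample lines are made up for illustration (they are not taken from your files), and '-x' is only one way to ask for whole-line matches:

```shell
#!/bin/sh
# Hypothetical sample data, one entry per line (not from the real files).
data=$(printf '%s\n' 'a.b.com' 'aXb.com' '10.example.org' '0')

# As a regular expression, the unescaped dot matches any character,
# so both a.b.com and aXb.com are selected.
regex_hits=$(printf '%s\n' "$data" | grep 'a.b')

# With -F the pattern is a fixed string: only a.b.com is selected.
fixed_hits=$(printf '%s\n' "$data" | grep -F 'a.b')

# The pattern "0" selects every line that contains a 0 somewhere...
loose_hits=$(printf '%s\n' "$data" | grep -F '0')

# ...unless -x requires the whole line to be equal to the pattern.
exact_hits=$(printf '%s\n' "$data" | grep -F -x '0')

printf 'regex:\n%s\nfixed:\n%s\nloose:\n%s\nexact:\n%s\n' \
    "$regex_hits" "$fixed_hits" "$loose_hits" "$exact_hits"
```

With two-column files, '-x' would not be enough, since the pattern must match one field and not the whole line. That is one more reason to prefer a field-aware tool.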

The solution with 'join':
$ cut -f 2 HNs.bst_.lt_.txt | sort > temp
$ sort -k 2 HNs.www_.* | join -1 2 - temp
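
To see what these two commands do, here is a sketch on tiny made-up inputs. The file names, the "count<TAB>domain" layout and the values are assumptions for illustration only, and LC_ALL=C is set so that 'sort' and 'join' agree on the ordering:

```shell
#!/bin/sh
# Use the same byte-wise collation for 'sort' and 'join'.
export LC_ALL=C

# Private scratch directory, removed when the script exits.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' 0

# Hypothetical inputs: tab-separated, domain in the second column.
printf '3\texample.org\n7\tgnu.org\n' > "$dir/searched.txt"
printf '12\tgnu.org\n5\ttrisquel.info\n9\texample.org\n' > "$dir/other.txt"

# Sorted list of the domains to search for.
cut -f 2 "$dir/searched.txt" | sort > "$dir/temp"

# Sort the other file on its second column, then keep the lines whose
# second field equals a whole line of temp.
joined=$(sort -k 2 "$dir/other.txt" | join -1 2 - "$dir/temp")
printf '%s\n' "$joined"
```

Only whole-field matches are kept: trisquel.info is dropped because it is not in the searched list. Note that 'join' prints the join field first and separates the fields with a space.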

If you want the output formatted like the inputs, append this to the last
command:
| awk '{ print $2 "\t" $1 }'

If HNs.bst_.lt_.txt may be much larger and you care about execution time,
you can have all four commands run in parallel by using a named pipe,
created with 'mkfifo', instead of a temporary file. Here it is as a shell
script that takes as arguments first the file with the domain names to
search, then all the other files:
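#!/bin/sh

# Require the searched-domain file and at least one file to search in.
if [ -z "$2" ]
then
     printf 'Usage: %s searched-domain-file file1 [file2 ...]\n' "$0" >&2
     exit 1
fi

searched="$1"
shift

# -u only generates an unused name: mkfifo fails if the file already exists.
TMP=$(mktemp -u)
trap 'rm -f "$TMP"' 0

mkfifo "$TMP" || exit 1
# The writer blocks until 'join' opens the named pipe for reading.
cut -f 2 "$searched" | sort > "$TMP" &
sort -k 2 "$@" | join -1 2 - "$TMP"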
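
If you want to try the named-pipe idea on something small first, here is a self-contained sketch. The sample files and their "count<TAB>domain" contents are made up, and creating the pipe inside a private directory from 'mktemp -d' is my own variation for this demo, not a requirement:

```shell
#!/bin/sh
# Use the same byte-wise collation for 'sort' and 'join'.
export LC_ALL=C

# Private directory holding the sample files and the named pipe.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' 0

# Hypothetical inputs: tab-separated, domain in the second column.
printf '3\texample.org\n7\tgnu.org\n' > "$dir/searched.txt"
printf '12\tgnu.org\n5\ttrisquel.info\n9\texample.org\n' > "$dir/other.txt"

mkfifo "$dir/fifo"

# Writer side, in the background: it blocks until a reader opens the pipe.
cut -f 2 "$dir/searched.txt" | sort > "$dir/fifo" &

# Reader side: 'join' reads the sorted domains from the pipe while the
# other file is being sorted, so nothing is written to a temporary file.
result=$(sort -k 2 "$dir/other.txt" | join -1 2 - "$dir/fifo")
wait

printf '%s\n' "$result"
```

The gain only shows on large inputs: both 'sort' commands then work at the same time instead of one after the other.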

