[Trisquel-users] Re-order groups of identical strings separated by blank lines so that the most frequently repeated strings are at the top

amenex at amenex.com amenex at amenex.com
Tue Jul 23 18:02:35 CEST 2019


Starting with a 109,000-row LibreOffice Calc file containing many groups of
identical strings separated by blank rows, how can I re-order the file so
that the largest groups are at the top, still separated by blank rows?

Here's how that original file was created:

First, sort the original 159,000 row file on the third (hostname) column:
 >> sort -k3 HNs.bst.lt.visitors03.txt > HNs.bst.lt.sorted03.txt

The original file had two fields separated by a colon (:); I replaced the
colons with tabs in Leafpad, which took ~30 minutes.
It now has the Current Visitor filename in the first column, the occurrence
count in the second column, and the master file's hostnames in the third
column, which is the one on which this file has to be sorted.
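
The colon-to-tab step could probably be done on the command line instead of
in Leafpad. What follows is only a sketch: the function name colon_to_tab is
made up for illustration, and it assumes that the first colon on each line is
the field separator and that any later colons belong to the data.

```shell
# colon_to_tab: read lines on stdin and replace only the FIRST colon on
# each line with a tab; any later colons are left untouched.
# (Hypothetical helper name; assumes the first colon is the separator.)
colon_to_tab() {
    awk '{ sub(/:/, "\t"); print }'
}
```

It would be used as a filter, reading the colon-separated file on standard
input and writing the tab-separated version to standard output.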

Second, gather the duplicates into bunches separated by blank lines,  
retaining the original general sorted sequence:
 >> time awk 'NR >= 2 { print $1, $2, $3 }' 'HNs.bst.lt.sorted03.txt' | \
      uniq -f 2 --all-repeated=separate | \
      awk '{ print $1 "\t" $2 "\t" $3 }' > HNs.bst.lt.Duplicates03.txt

This makes a very pretty .ods file that's slightly larger than the original  
file, but there's so much there that it
would be tedious to rank the most-often-repeated Recent Visitors' hostnames  
by hand.

How can I re-sort the .ods file so that the longest identical-hostname lists
are at the top? I want to retain the associations with the original domains'
Recent Visitor data. This is a task that has to be repeated many times ...
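
One possible approach, offered only as a sketch, is to work on the
tab-separated text file rather than on the .ods file: tag every row with the
size of its group, sort on that tag, then strip the tags and put the blank
lines back. The function name reorder_groups is made up for illustration. The
sketch assumes tab-separated input in which any line containing only spaces
and tabs marks a group boundary (the last awk in the pipeline above turns the
blank separator lines into lines holding two tabs). Groups of equal size keep
their original relative order, and whole rows are carried along, so the
filename and count columns stay attached to their hostnames.

```shell
# reorder_groups: read blank-line-separated groups on stdin and print them
# largest-first, separated by single blank lines.
# (Hypothetical helper name; assumes tab-separated rows.)
reorder_groups() {
    awk '
        BEGIN { g = 1 }
        # A line with nothing but spaces/tabs closes the current group.
        /^[ \t]*$/ { if (n > 0) { g++; n = 0 }; next }
        # Remember each data row, its group number, and the group size.
        { n++; size[g]++; line[NR] = $0; grp[NR] = g; last = NR }
        END {
            # Prefix every row with: group size, group number, row number.
            for (i = 1; i <= last; i++)
                if (i in line)
                    printf "%d\t%d\t%d\t%s\n", size[grp[i]], grp[i], i, line[i]
        }
    ' |
    # Largest groups first; ties keep the original group and row order.
    sort -t "$(printf '\t')" -k1,1nr -k2,2n -k3,3n |
    # Re-insert a blank line wherever the group number changes.
    awk -F'\t' 'NR > 1 && $2 != prev { print "" } { prev = $2; print }' |
    # Drop the three helper columns; blank lines pass through unchanged.
    cut -f4-
}
```

The function would read HNs.bst.lt.Duplicates03.txt on standard input and
write the re-ordered rows to standard output, which can be redirected to a
new file and opened in LibreOffice Calc as before. The first awk holds the
whole file in memory, which should be no problem at ~109,000 rows.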

BTW, the longest identical-hostname list that I found visually in the .ods  
file was 42 out of 45 original Recent Visitor lists.
All the original data is freely available on the Internet.

George Langford

