[Trisquel-users] Re-order groups of identical strings separated by blank lines so that the most frequently repeated strings are at the top

amenex at amenex.com
Wed Jul 24 13:38:12 CEST 2019


Magic Banana wonders:

 > The original file had two fields separated by a colon (:) that I
 > converted to tabs with Leafpad in ~30 minutes.

 >> I bet 'tr : \\t' takes at most seconds...

Alas, there are IPv6 fields in the data; when I asked LibreOffice Calc
to treat the colon (:) as a field separator, chaos ensued. That succinct
tr command would do the same, since it replaces every colon, including
the ones inside the IPv6 addresses ...
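
One way around the IPv6 colons might be to convert only the first colon
on each line. This is a minimal sketch, assuming that the file name is
always the first field and never contains a colon itself; the helper
name first_colon_to_tab and the sample lines are made up for
illustration and are not from the real data.

```shell
# Hypothetical helper: turn only the FIRST colon on each line into a tab,
# so the colons inside an IPv6 address in the second field are left alone.
first_colon_to_tab() {
    tab=$(printf '\t')      # a literal tab, portable across sed versions
    sed "s/:/$tab/"         # no 'g' flag: only the first match per line
}

# Made-up sample lines (assumption: "file name:entry" layout).
printf '%s\n' 'HNs.example.txt:2001:db8::1' \
              'HNs.example.txt:host.example.net' | first_colon_to_tab
```

Unlike tr, which translates every colon it sees, sed without the g flag
stops after the first substitution on each line.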

Also:

 > All the original data is freely available on the Internet.

 >> Where is HNs.bst.lt.visitors03.txt?

It's 10.3 MB, so you'll have to grow your own ...

That starts here: http://www.bst.lt/webalizer/site_201906.html
Then take out the middle seven columns from the text file, separate the
IPv4 entries from that data, and run a grep search against a number of
other webalizer data sets. Those are discoverable with a Google search
on webalizer and "view all sites"; the middle seven columns have to be
removed from each of them as well. Finally, remove the pesky colons that
appear after the HNs.domain.txt file names. I don't know how those
IPv6's crept in ...

My processing of the mixed online data, which separates each data set
into IPv4 and hostname fractions, looks tedious from the number of
steps, as it involves swapping the positions of two columns of data
several times, but each step is in the blink-of-an-eye category. I use
one of Magic Banana's scripts to identify the first four octets of each
entry, followed by a join operation that matches the results of that
script to the original data, giving me a list of pure IPv4 addresses.
Separating the hostnames uses another join script, which retrieves the
hostnames from the original data set by listing the entries that are
not IPv4's.
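
For comparison, the same split can be sketched with grep alone. This is
not the join-based method described above, just a shorter stand-in,
assuming one entry per line; the function names and the sample entries
are made up for illustration.

```shell
# Loose pattern: four dot-separated groups of 1-3 digits filling the
# whole line. It does NOT check that each group is 255 or less.
ipv4_re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'

# Hypothetical helpers: keep only the IPv4 lines, or everything else.
only_ipv4() { grep -E  "$ipv4_re"; }
not_ipv4()  { grep -vE "$ipv4_re"; }

# Made-up sample entries.
printf '%s\n' '192.0.2.7' 'host.example.net' '2001:db8::1' | only_ipv4
printf '%s\n' '192.0.2.7' 'host.example.net' '2001:db8::1' | not_ipv4
```

Because the pattern is anchored at both ends, a hostname that merely
begins with four octets stays on the hostname side. Note that not_ipv4
also passes IPv6 addresses through, so those would still need separate
treatment.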



