[Trisquel-users] Re : finding particular pages within PDFs

magicbanana at gmail.com magicbanana at gmail.com
Sat Aug 30 05:23:23 CEST 2014


See the attachment. You owe me a beer. ;-)

An example of use:
$ pdf-page-grep *.pdf
regexp: GNU
OR regexp (empty to stop): [fF]ree
OR regexp (empty to stop):

matching pages in "Agglomerating Local Patterns Hierarchically with ALPHA.pdf":
matching pages in "A Parameter-Free Associative Classification Method.pdf": 1 3 5 7 9 11 12
matching pages in "Artificial Regulatory Networks Evolution.pdf": 2
matching pages in "Closed and Noise-Tolerant Patterns in n-ary Relations.pdf": 8 9 11 20
matching pages in "Closed Patterns Meet n-ary Relations.pdf": 16 21 23 27
matching pages in "Complete Discovery of High-Quality Patterns in Large Numerical Tensors.pdf": 7
matching pages in "Constraint-Based Search of Different Kinds of Discriminative Patterns.pdf": 4 6
matching pages in "Constraint-Based Search of Straddling Biclusters and Discriminative Patterns.pdf": 6 10
matching pages in "Data-Peeler: Constraint-Based Closed Pattern Mining in n-ary Relations.pdf": 7 9
matching pages in "Descoberta de n-Conjuntos Fechados Eficiente e Restrita a Grupos de Interesse.pdf": 8
matching pages in "Discovering Descriptive Rules in Relational Dynamic Graphs.pdf": 12 19
matching pages in "Discovering Inter-Dimensional Rules in Dynamic Graphs.pdf": 8 12
matching pages in "Discovering Relevant Cross-Graph Cliques in Dynamic Networks.pdf": 7
matching pages in "Distributed Skycube Computation with Anthill.pdf": 3 5
matching pages in "Exploiting Temporal Locality to Determine User Bias in Microblogging Platforms.pdf":
matching pages in "Extraction de motifs fermés dans des relations n-aires bruitées.pdf":
matching pages in "Mining Constrained Cross-Graph Cliques in Dynamic Networks.pdf": 8 20
matching pages in "Multidimensional Association Rules in Boolean Tensors.pdf": 7 12
matching pages in "Parameter-free classification in multi-class imbalanced data sets.pdf": 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
matching pages in "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach.pdf": 4
matching pages in "Sémantiques et Calculs de Règles Descriptives dans une Relation n-aire.pdf": 2 14 15
matching pages in "Tackling Closed Pattern Relevancy In n-ary Relations.pdf": 5
matching pages in "Un nouveau cadre de travail pour la classification associative dans les données aux classes disproportionnées.pdf": 4
matching pages in "Watch me Playing, I am a Professional: A First Study on Video Game Live Streaming.pdf":

Output written to "Un nouveau cadre de travail pour la classification associative dans les données aux classes disproportionnées-matches.pdf"

It is actually pretty fast: a little over 10 s for the example above, which
processes 24 documents (341 pages in total) and generates a 62-page PDF.
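
For readers without the attachment, here is a minimal sketch of how the
per-page matching could be done. It is not the attached script: the function
name matching_pages is hypothetical, and the sketch assumes the text comes
from pdftotext, which separates pages with form feeds. The regexps are joined
with "|" so that a page matching any of them is reported, and awk treats them
as extended regular expressions.

```shell
# Hypothetical helper, not taken from the attached script.
# Reads text on stdin in which pages are separated by form feeds and prints,
# on one line, the numbers of the pages matching at least one of the regexps
# given as arguments.
matching_pages() {
    # Join the regexps with "|": a page matching any of them counts.
    pattern=$1
    shift
    for re in "$@"; do
        pattern="$pattern|$re"
    done
    # Pass the pattern through the environment so that awk does not
    # reinterpret its backslashes. Each form-feed-terminated record is one
    # page, so NR is the page number.
    PATTERN=$pattern awk -v RS='\f' '
        BEGIN { re = ENVIRON["PATTERN"] }
        $0 ~ re { printf "%s%s", sep, NR; sep = " " }
        END { print "" }'
}
```

Assuming pdftotext is installed, it could be used like this:
pdftotext some.pdf - | matching_pages GNU '[fF]ree'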

As you can see, I decided to name the output after the "basename" of the last
matching PDF, followed by "-matches.pdf". I initially wanted to concatenate
the names of all matching PDFs, but the result could then exceed the length
limit for file names!
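
The naming rule can be sketched in a few lines of shell. This is an
illustration under assumptions, not the code of the attached script: the
function name output_name is hypothetical.

```shell
# Hypothetical helper: derive the output name from the last matching PDF.
output_name() {
    name=${1##*/}       # drop any leading directory part
    name=${name%.pdf}   # drop the ".pdf" suffix
    printf '%s-matches.pdf\n' "$name"
}
```

On many GNU/Linux file systems a file name cannot exceed 255 bytes;
"getconf NAME_MAX ." reports the limit for the current directory.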

I do not know whether you really wanted regular expressions (instead of
simple strings). Maybe you wanted whole-word matches and/or to ignore case.
Those are simple options to add to the 'grep' command.
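
For reference, the grep options in question are -F (fixed string instead of
regexp), -w (whole words only) and -i (ignore case). The snippet below tries
them on a made-up sentence standing in for the text of one page; the variable
page and the function page_matches are hypothetical names used only for this
illustration.

```shell
# Made-up sample text standing in for one page of extracted text.
page='Freedom matters. This text is free (as in speech).'

# Hypothetical helper: succeeds if the sample page matches;
# all arguments (options and pattern) are passed to grep.
page_matches() {
    printf '%s\n' "$page" | grep -q "$@"
}

# Default: the pattern is a regexp, so "." matches any character ("free").
page_matches 'fre.'       && echo 'regexp "fre.": match'
# -F: the pattern is a plain string, so "." must be a literal dot.
page_matches -F 'fre.'    || echo 'fixed string "fre.": no match'
# -w: whole words only, so "Free" inside "Freedom" does not count.
page_matches -w 'Free'    || echo 'whole word "Free": no match'
# -i: ignore case; combined with -w, "FREE" matches the word "free".
page_matches -i -w 'FREE' && echo 'whole word, any case "FREE": match'
```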

