While looking for ways to parse the contents of a PDF document (whether it contains images, tables, text, etc.), I came across this answer by Jay Riggs: https://stackoverflow.com/a/2554230/3270873
You can’t ‘parse’ an existing PDF file using iText, you can only ‘read’ it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren’t any ‘iText-objects’ in a PDF file. In each page there will probably be a number of ‘Strings’, but you can’t reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can’t retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText. Post your question on the newsgroup news://comp.text.pdf and maybe you will get some answers from people that have built tools that can parse PDF and extract some of its contents, but don’t expect tools that will perform a bullet-proof conversion to structured text.
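To make the "canvas of strings" point concrete, here is a minimal sketch in Python. It does not use iText: `Fragment`, `group_into_lines` and `y_tolerance` are hypothetical names invented for illustration, and the coordinates are made up. It assumes a reader that hands back each string with its position on the page, and shows that rebuilding lines from those strings is a guess based on geometry, not something stored in the file.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned string, as a page-by-page reader might report it."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF origin is at the bottom-left)


def group_into_lines(fragments, y_tolerance=2.0):
    """Heuristically rebuild lines of text.

    Fragments whose baseline is within y_tolerance of the first fragment
    of the current line are treated as belonging to that line.
    """
    lines = []  # list of (baseline_y, [fragments on that line])
    # Visit fragments from the top of the page downwards (larger y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within each line, order fragments left to right and join with spaces.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Made-up page content: the strings arrive in no particular order.
page = [
    Fragment("world", 80.0, 700.0),
    Fragment("Hello", 40.0, 700.5),
    Fragment("Second", 40.0, 686.0),
    Fragment("line", 85.0, 686.0),
]
print(group_into_lines(page))
```

The heuristic works on this toy page, but it breaks easily: two columns printed side by side share the same baselines and would be merged into single lines, and a table is just more strings plus some drawn lines. That is consistent with the answer's warning not to expect a bullet-proof conversion to structured text.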
The quoted answer, together with my limited knowledge of iText and the tests I ran while trying to get this working, leads me to think he is right. I'll leave it here for you.
PS: if anyone knows a way to do it, please share it.