This PDF was produced by Abbyy Finereader 10:

You can copy & paste the first sentence and get this (very good) text result:

Der »Bund Deutscher Gymnastik-Schulleiter« wurde am 20. November 1955 anläßlich einer Zusammenkunft der Leiterinnen und Leiter der privaten deutschen Gymnastik-Ausbildungsstätten gegründet.

After some processing with Ghostscript 9.02 (64 bit Windows) I get this file:

Now the first sentence looks strange: there is an extra space before the last character of each word.

Der »Bun d Deutsche r GymnastikSchulleiter […] Novembe r 195 5 anläßlic h eine r Zusammenkunf t der Leiterinne n un d Leite r de r private n deutsche n GymnastikAusbildungsstätte

The main negative effect is that you cannot search for whole words in Acrobat Reader.

I can reproduce the effect with the following minimal parameter set for Ghostscript:

    -sDEVICE=pdfwrite ^

I found this an interesting problem and had a closer look.

First, I used the qpdf command-line tool to uncompress the PDF data streams so I could better see the source code of both files:

    qpdf.exe ^

Looking at one of the first occurrences where an extra space gets inserted (the original string "Bund Deutscher Gymnastik-Schulleiter" turning into "Bun d Deutsche r GymnastikSchulleiter"), I find the following PDF snippets.

In qdf-from_abbyy.pdf:

    ( Deutsche) Tj
    1 0 0 1 143.236 265.140 Tm    %% Tm = 'text matrix' operator

In qdf-after_ghostscript.pdf:

    ( Deutsche)Tj
    36.235 0 Td    %% extra Td = 'move text current point' operator
    2.16501 0 Td   %% Td = 'move text current point' instead of Tm

As you can see, Ghostscript replaced the original Tm (text matrix) operator by a Td (move text current point) one, and it also added an extra 2.16501 0 Td.

To give you a little idea of what the PDF graphic operators used here mean, here is a short list:

    Tm - text matrix
    Td - move text current point
    Tj - show text

On Linux also, the text is not correctly searchable, and it also shows the extra spaces when doing copy'n'paste. Note, however, that this problem does not occur if I use the Linux Acrobat Reader 9.4.2 and the menu action "File -> Save as Text...". In this case, there are no additional spaces (but a few extra linebreaks).

I'll submit a bug report to Ghostscript's bugzilla and see if they are interested in solving it. I'll update here with the bug number when I've done it.

After pondering a bit more about the replaced Tm operator, I now think this shouldn't be the root of the problem. When realizing that, I tried to make the conversion with Ghostscript v8.71 instead of v9.02. And what should I say? The copy'n'paste problem does not occur with v8.71 output! That means there is a problem in Ghostscript 9.02 that wasn't there in 8.71. Most likely it has to do with the font metrics embedded in the output PDF, because the PDF snippets quoted above are the same in v8.71 output as in v9.02 output.

URL of bug entry in Ghostscript's bugzilla:

This bug does seem to have been fixed meanwhile. I do not see it happen with the Ghostscript versions I've tested again: neither with current Git (v9.10GIT) nor with Ghostscript v9.06.

If you scan a page with text into a PDF and run an OCR application on it, then the text will be added to the page, but the "text rendering mode" is set to invisible. It's there, but it's not rendered on screen (or on paper if printed). What you see or print is the original scanned image.

How can we make the invisible text visible? The PDF code to set text rendering to invisible is this:

    3 Tr

You cannot find this string (yet) in the original from_abbyy.pdf nor in from_ghostscript.pdf because parts of the PDFs are compressed. So we uncompress them as far as possible with the help of qpdf:

    qpdf \

Now we can find the above string easily (and there is only one occurrence in each file). Let's switch this to one of the visible modes of text rendering.
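The text mentions switching the invisible text rendering mode (3 Tr) to a visible one. Here is a minimal Python sketch of that step, assuming the PDF has already been uncompressed with qpdf so that the operator appears as plain bytes. The function name make_text_visible and the file name visible-from_abbyy.pdf are made up for illustration, and this is not necessarily how the original author did it. Mode 0 (fill), 1 (stroke) and 2 (fill, then stroke) are the plain visible text rendering modes in PDF. Since "3 Tr" and "0 Tr" have the same length, the byte offsets in the file stay unchanged.

```python
import re

# Matches a standalone "3 Tr" operator (invisible text). The lookbehind keeps
# operands such as "13 Tr" or "0.3 Tr" from matching by accident.
INVISIBLE_TR = re.compile(rb"(?<![\w.+-])3(\s+)Tr\b")


def make_text_visible(pdf_bytes, mode=0):
    """Replace every '3 Tr' with '<mode> Tr'.

    Returns a tuple (new_bytes, number_of_replacements).
    Hypothetical helper; only sensible for uncompressed content streams.
    """
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0 (fill), 1 (stroke) or 2 (fill+stroke)")
    digit = str(mode).encode("ascii")
    # Keep the original whitespace so the total length does not change.
    return INVISIBLE_TR.subn(lambda m: digit + m.group(1) + b"Tr", pdf_bytes)


# Example with the uncompressed file name used in the text (output name is
# an assumption):
#
#   with open("qdf-from_abbyy.pdf", "rb") as f:
#       data, count = make_text_visible(f.read())
#   with open("visible-from_abbyy.pdf", "wb") as f:
#       f.write(data)
```

A caveat: this is a blind byte-level replacement, so binary data that is still embedded in the file (for example the scanned image) could in principle contain the same byte sequence. With only one occurrence per file, as noted above, it is easy to check the replacement count before writing the result.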