Customizing PDF Import Options in Data Prep Studio

When you import a PDF file into Monarch Data Prep Studio, the application analyzes the file to determine the best method of transforming the data accurately. In most cases, Monarch Data Prep Studio’s auto-detection routines will produce the best results. Under certain conditions, however, adjustments to the PDF import options may be necessary.

Earlier PDF engines in Monarch relied on the concepts of monospaced and free-form text flow to adjust text alignment. These older engines are usually adequate for:

  • PDF files containing tables with tightly compacted columns.

  • PDF files containing multiple font sizes in which the data of interest is in a smaller font than most of the other text, thereby causing the auto-calculated font size to be too large.

  • PDF files containing mixed monospaced and variable-spaced fonts in which the data of interest uses monospaced fonts.

  • PDF files containing mixed free-form and tabular data.

However, in newer PDF reports:

  • Text alignment on pages with sparse text is inconsistent.

  • Text wrapping may cause horizontal misalignment.

  • Alignment of centered text is unpredictable.

Moreover, PDF reports are now published by numerous software products and may vary unpredictably in their use of fonts, backgrounds, and line colors. Thus, a rendering engine that can tolerate any combination of fonts (including both monospaced and free-form fonts) and background colors is required.

Monarch introduces a new PDF engine (version 4.5) that improves the accuracy of text extraction by identifying graphical elements, such as vertical and horizontal lines and rectangles, on rendered PDF pages and using these elements to form grids to which text will be aligned. This new feature addresses alignment issues that render some trapping operations in PDF files extremely difficult.
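The grid idea described above can be illustrated with a minimal Python sketch. This is not Monarch's actual implementation, which is not documented here; the function name snap_left, the grid positions, and the sample text fragments are all hypothetical. The sketch only shows the general principle of aligning each piece of text to the nearest detected vertical line on its left.

```python
from bisect import bisect_right


def snap_left(x, vertical_lines):
    """Return the nearest grid line at or to the left of x.

    If no grid line lies at or to the left of x, x is returned unchanged.
    """
    lines = sorted(vertical_lines)
    i = bisect_right(lines, x)
    return lines[i - 1] if i > 0 else x


# Hypothetical x-positions (in points) of vertical lines detected on a page.
grid = [72.0, 200.0, 340.0]

# Hypothetical text fragments: (text, x-position of the left edge).
# Each fragment sits slightly to the right of a grid line, as cell padding would cause.
fragments = [("Bach", 74.5), ("1685", 203.1), ("Germany", 341.2)]

# After snapping, every fragment in a column shares exactly the same x-position.
snapped = [(text, snap_left(x, grid)) for text, x in fragments]
```

Because every fragment in a column ends up at the same x-position, fields can then be trapped at a consistent horizontal offset, regardless of small differences in padding between rows.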

  1. Open a PDF report in the Report Design window. In this example, we'll use Composers.pdf, which is usually available in C:\Users\Public\Documents\Altair Monarch\Reports.

    As highlighted in the image below, when this report is opened in the Report Design window, the second column appears skewed if an earlier PDF engine version is used.

     

  2. Click the Document Options tab to display PDF import settings.

  3. Specify the desired settings for the following options:

    • Auto Adjust - Select this button to have Data Prep Studio automatically select the optimum settings for the displayed sample page. Note that if you have changed any of the PDF import settings, clicking this will likely restore the original settings.

    • Text Flow

      • Monospaced (for PDF Engine versions 4.4 and below) - This setting specifies that a monospaced font (also called a fixed-width or non-proportional font) was used in the PDF file. In a monospaced font, every character has the same width; for example, the "o" and "i" characters take up the same amount of horizontal space on a line. The opposite of monospacing is proportional spacing, in which different characters have different widths; in a proportionally spaced font, the letter "o" is wider than the letter "i".

        When you import a PDF file into Monarch Data Prep Studio, the application tries to detect when monospaced fonts are used and optimizes the conversion accordingly. In some cases, Monarch Data Prep Studio may not detect that monospaced fonts were used, usually because the same PDF file mixes monospaced and proportional fonts. If you know that the PDF file uses monospaced fonts, but the text is not displaying correctly, you can select this setting to force Monarch Data Prep Studio to optimize for monospaced fonts. While proportionally spaced fonts look more appealing, monospaced fonts are superior for tabular data because the uniform width of each character makes alignment of columns easier.

        In general, PDF files generated using monospaced fonts will convert more successfully, so if you are trying to optimize your PDF-producing application for Monarch Data Prep Studio, use monospaced fonts. Some of the more common ones are: Andale Mono, Anonymous, Crystal, Bitstream Vera Sans Mono, Courier, Courier New, Elronet Monospace, Everson Mono Latin 6, Fixedsys, Lucida Sans Typewriter, Lucida Console, and PrestigeFixed.

      • Free-form (for PDF Engine versions 4.4 and below) - This option tries to optimize text that is more free-form than columnar or grouped columnar text. A columnar document is a simple table format, while grouped columnar might be something similar to one of the Monarch Data Prep Studio sample reports, such as Betty’s Music Store (Classic.pdf). A typical document that might benefit from using this setting would be an academic report that is 95% text, but which also contains a few tables that you want to extract. Note: This setting will sometimes work effectively on columnar documents when the default settings are not producing a good result.

      • Snap Text Left (PDF Engine version 4.5) - Select this option to align the text to the left of the imputed PDF grid.

      • Snap Text Up (PDF Engine version 4.5) - Select this option to align the text to the top of the imputed PDF grid.

      • Always Align Left (PDF Engine version 4.5) - Select this option to always align text to the left of the imputed PDF grid.

    • Suppress Left Whitespaces (PDF Engines v4.2–4.5) - Instructs Monarch to remove all left-side white spaces when displaying the report.

    • Stretch - This option governs how much spacing is used during the conversion process. When Monarch Data Prep Studio analyzes the PDF file, it tries to match the spacing of the original document as closely as possible, but many factors can make it necessary to introduce more spacing into the conversion than appears to exist in the original PDF file. One such factor is hidden data, i.e., data that is not visible on screen but still exists within the PDF file itself. This can be the result of columns that truncate the data, for example. At first glance, it is not apparent that any data is missing, but Monarch Data Prep Studio converts all the data in the PDF file, not just what might be visible in a PDF viewing application. In this case, to maintain proper column justification, Monarch Data Prep Studio has to recalculate and pad the spacing, as the original column spacing would not be enough to hold the data safely.

      In general, Monarch Data Prep Studio uses a larger amount of spacing than in the PDF file. When viewed in the Report window, this will make the document look like it is stretched wider than the original PDF file, but Monarch Data Prep Studio errs on the side of caution so that columns won’t run into each other. This is also done so that if a later iteration of the same report (or a similar one) contains wider data values, the model will likely still work with it.

      If you know your reports well, you can decrease the stretch value to make the reports look more presentable, thereby avoiding very small font sizes in the Report window or the need for horizontal scrolling.

      Use the + and - buttons provided to specify a stretch value.

    • Crop - Select to crop extra space from the PDF page. Use the + and - buttons provided to specify a crop value.

  4. When you have finished specifying PDF import options, click the Accept button to save them and apply them to the PDF file. In the example below, the options Always Align Left and Suppress Left Whitespaces were selected.

     

  5. Monarch Data Prep Studio imports the PDF file using the import options you specified.
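To make the Stretch and Suppress Left Whitespaces options described in step 3 more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not Monarch's conversion algorithm: the functions render_line and suppress_left_whitespace, the char_width parameter, and the sample data are hypothetical, and the sketch assumes that "left-side white spaces" means the left margin shared by all lines. It shows how a stretch factor widens the gaps between columns and how text that overflows its column (such as hidden, truncated data) forces extra padding.

```python
def render_line(fragments, char_width, stretch=1.0):
    """Place (text, x) fragments on a character grid.

    Each fragment starts at column round(x / char_width * stretch). A fragment
    that would touch or overlap the text already placed is pushed right so
    that exactly one space separates it from the previous fragment.
    """
    line = ""
    for text, x in sorted(fragments, key=lambda f: f[1]):
        col = round(x / char_width * stretch)
        if line and col <= len(line):
            col = len(line) + 1  # pad rather than let columns run together
        line = line.ljust(col) + text
    return line


def suppress_left_whitespace(lines):
    """Remove the left margin shared by all non-blank lines (an assumed reading)."""
    margin = min(
        (len(l) - len(l.lstrip()) for l in lines if l.strip()),
        default=0,
    )
    return [l[margin:] for l in lines]


# Hypothetical row: two fragments at x = 60 and x = 120 points, 6 points per character.
row = [("Name", 60.0), ("Amount", 120.0)]

normal = render_line(row, 6.0)                  # columns start at 10 and 20
stretched = render_line(row, 6.0, stretch=1.5)  # columns start at 15 and 30

# A value wider than its column pushes the next fragment to the right.
overflow = render_line([("ABCDEFGHIJ", 0.0), ("X", 30.0)], 6.0)

trimmed = suppress_left_whitespace([normal, stretched])
```

In this sketch, a larger stretch value leaves more room between columns, which is why a stretched report looks wider than the original PDF but is less likely to break when a later report contains wider values.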

 

 

 

© 2024 Altair Engineering Inc. All Rights Reserved.
