• ahall889

Compressing and Optimizing PDFs in Pure Golang using UniPDF

Updated: Oct 13, 2020

With the release of UniPDF v3, the library gained support for optimizing PDFs, composite fonts (Unicode characters), digital signatures, and a powerful text and image extraction feature. Unicode support allows the library to process and create more complex PDF documents that contain Unicode text and symbols. Smaller additions in v3 include styled paragraphs, invoice generation, tables of contents, and more, which you can read about in the v3 press release.

The ability to optimize (compress) PDF output was a fundamental update and also a difficult one. Optimization is a multi-step procedure made up of a few mostly independent steps:

  1. Combining duplicate objects and streams (lossless)

  2. Combining indirect objects into compressed object streams (lossless)

  3. Reducing resolution of images (near lossless for specified display resolutions)

  4. Higher compression of images and objects (lossy)
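To make the first step more concrete, here is a minimal sketch of combining duplicate streams by hashing their contents. It is not UniPDF's internal implementation; the function name dedupeStreams and the use of SHA-256 are assumptions made for illustration, and a real PDF optimizer would also have to rewrite every reference to the removed objects.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupeStreams (hypothetical helper) returns the unique payloads in
// first-seen order, plus a mapping from each original index to the index
// of its payload in the unique slice.
func dedupeStreams(streams [][]byte) (unique [][]byte, remap []int) {
	seen := make(map[[sha256.Size]byte]int) // content hash -> index in unique
	remap = make([]int, len(streams))
	for i, s := range streams {
		sum := sha256.Sum256(s)
		idx, ok := seen[sum]
		if !ok {
			// First time this content appears: keep it.
			idx = len(unique)
			seen[sum] = idx
			unique = append(unique, s)
		}
		remap[i] = idx
	}
	return unique, remap
}

func main() {
	streams := [][]byte{
		[]byte("logo image bytes"),
		[]byte("page 1 content"),
		[]byte("logo image bytes"), // the same logo embedded a second time
	}
	unique, remap := dedupeStreams(streams)
	fmt.Println(len(unique), remap) // prints: 2 [0 1 0]
}
```

Because nothing is thrown away except exact copies, this kind of step is lossless: the third stream simply points at the first one.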

The compression feature allows you to select the optimization level that best suits your needs. You can opt for lossy optimization, which compresses the PDF very well but can degrade text and images. That is why we generally recommend that you select non-lossy optimizations and customize the settings to best fit your application.
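As a small illustration of what "non-lossy" means in practice, the sketch below compresses a stream with Go's standard compress/zlib package (the zlib/deflate format that PDF Flate-encoded streams use) and checks that decompression returns exactly the original bytes. The helper names deflate, inflate and roundTrip are hypothetical, and this is not how UniPDF itself is wired internally.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// deflate compresses data in zlib format at the highest compression level.
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := zlib.NewWriterLevel(&buf, zlib.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the remaining data
		return nil, err
	}
	return buf.Bytes(), nil
}

// inflate reverses deflate.
func inflate(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// roundTrip reports the compressed size and whether the bytes survive
// compression and decompression unchanged.
func roundTrip(data []byte) (int, bool, error) {
	packed, err := deflate(data)
	if err != nil {
		return 0, false, err
	}
	unpacked, err := inflate(packed)
	if err != nil {
		return 0, false, err
	}
	return len(packed), bytes.Equal(data, unpacked), nil
}

func main() {
	// A repetitive, made-up content stream compresses well.
	data := bytes.Repeat([]byte("BT /F1 12 Tf (Hello) Tj ET\n"), 200)
	size, same, err := roundTrip(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("original bytes:", len(data), "compressed bytes:", size, "identical:", same)
}
```

Lossy options, by contrast, trade some image fidelity for size, so there is no equivalent round-trip guarantee.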

How to Optimize PDFs using UniPDF

Let's see how we can optimize a PDF by using the library. First of all, go to the unipdf-examples GitHub repository and download the compression example code.

The pdf_optimize.go file contains the optimization code that we will be using in this tutorial. Optimization of PDF output is implemented in the UniPDF writer and is configured through optimize.Options.

You can select the options according to the level of optimization needed. We allow you to select the quality of images, which ranges from 1 (lowest) to 100 (highest). You can be even more selective and set the pixels per inch (PPI) of images in the PDF. This gives you fine-grained control over the quality of your PDFs. Other options include compressing streams and objects and combining duplicate streams and objects.
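To get a feel for what a PPI limit implies, the following sketch computes an image's effective resolution from its pixel width and the width at which it is drawn on the page, then derives the pixel dimensions after capping it. The helper names and the formula (pixels divided by displayed inches, with 72 points per inch) are assumptions for illustration; UniPDF's own resampling logic may differ.

```go
package main

import (
	"fmt"
	"math"
)

// effectivePPI returns the pixels per inch of an image that is pixelWidth
// pixels wide and drawn widthPts points wide on the page (72 points = 1 inch).
func effectivePPI(pixelWidth int, widthPts float64) float64 {
	return float64(pixelWidth) / (widthPts / 72.0)
}

// targetSize returns the pixel dimensions after capping the effective
// resolution at upperPPI. Images already at or below the cap are unchanged.
func targetSize(pixelW, pixelH int, widthPts, upperPPI float64) (int, int) {
	if widthPts <= 0 || upperPPI <= 0 {
		return pixelW, pixelH // nothing sensible to compute
	}
	ppi := effectivePPI(pixelW, widthPts)
	if ppi <= upperPPI {
		return pixelW, pixelH
	}
	scale := upperPPI / ppi
	w := int(math.Round(float64(pixelW) * scale))
	h := int(math.Round(float64(pixelH) * scale))
	return w, h
}

func main() {
	// An 1800x1200 pixel image drawn 432 points (6 inches) wide.
	w, h := targetSize(1800, 1200, 432, 100)
	fmt.Println(effectivePPI(1800, 432), w, h) // prints: 300 600 400
}
```

In this made-up case the image carries three times the resolution the cap allows, so it can shrink to a ninth of its pixel count before any further compression is applied.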

If you simply want to run the default example, download the pdf_optimize.go file from the GitHub link and run it with Go, passing the input and output paths on the command line. Just make sure that Go is installed on your system.

Code Breakdown

Let's break the code down into chunks so that it is easier to follow. We'll use the example found in the repo and explain how you can customize it according to your requirements.

At the start, we import the relevant packages, including the UniPDF model package, which contains the reader and the writer. The model package lets you work with PDFs easily, provided you have an understanding of the PDF format and structure. You can read more about the model package in the v2 release. The usage variable describes how to run the executable and which parameters it accepts.

Reading and Preparing Writer

The main function starts by reading the information of the input file, which we use at the end to provide statistics of compression. The reader is then used to read the PDF file and get the number of pages in the input PDF.

After we've stored the number of pages in the pages variable, we create a new PDF writer, writer, and add each page of the input PDF to it in a loop. When the loop completes, the writer holds all of the pages.

If the input PDF has AcroForms, then the writer will transfer the AcroForms to the output PDF.

Set Optimizer

Now comes the code that does the magic of optimizing the PDF. It's as simple as calling SetOptimizer(...) on the writer with an optimizer configured through optimize.Options{...}.

In the code, we've called the optimizer function and set its parameters. The output PDF is then optimized according to these parameters. You can adjust the parameters according to your requirements.

After the optimizer has been set, we simply use the os package to create the output file. The file is based on the output path provided in the command line.

If the file has been created successfully, the writer will write to the output file.

Optimization Statistics

The last few lines of the code highlight the result of optimization. The code displays the compression ratio, the time it took to complete the optimization and a few other details. It does so by getting the output PDF info and comparing it with the input PDF info, which was extracted at the start.
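The size comparison boils down to simple arithmetic. The sketch below shows one way to compute the percentage reduction from the input and output file sizes; the function name reductionPercent and the sample sizes are made up for illustration and are not taken from the example file.

```go
package main

import "fmt"

// reductionPercent reports how much smaller the output is than the input,
// as a percentage of the input size. It returns 0 for a non-positive input size.
func reductionPercent(inSize, outSize int64) float64 {
	if inSize <= 0 {
		return 0
	}
	return 100 * float64(inSize-outSize) / float64(inSize)
}

func main() {
	// Hypothetical sizes in bytes, e.g. as reported by os.Stat(...).Size().
	in, out := int64(2000000), int64(500000)
	fmt.Printf("Compression: %.2f%%\n", reductionPercent(in, out)) // prints: Compression: 75.00%
}
```

A negative result would mean the "optimized" file came out larger than the original, which is worth checking for when experimenting with parameters.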

Compression Example

Now let's test the code on a real-life example. We downloaded the United Nations Secretary-General's report on the 2019 Climate Action Summit and passed it through the pdf_optimize.go code.

These were the results:

The example ran within 4.5 seconds on a 38-page report that includes colorful graphics on every page. The UniPDF library compressed the report by 87.47%, from 8 MB to approximately 1 MB. Note that this uses the default parameters. You can experiment with the optimization parameters to see how they affect the output quality and the processing time.

Optimization while Creating or Modifying PDFs

If you're using UniPDF to create or modify PDF documents, you can optimize the newly created or modified document with the same SetOptimizer(...) function. The current code examples for creating documents with UniPDF do not include optimization, but it can be added quite easily.

In the example that creates a new document, the creator builds the document, and we can call the optimizer function directly on the creator. The pdf_report.go example shows where this fits.

Near the end of that example, where the footer is set, we can use the creator c, which was created earlier, to set the optimization for the file that is written in the next step. The creator handles the rest. The parameters can be adjusted to get the desired level of compression. This feature might become the default way of operating in the future.

Use UniCLI to Try Without Writing Any Code

You can use UniCLI if you want to avoid working with the code. UniCLI is a companion tool to the UniPDF library that lets you use the library's functions without writing much code.

To start using UniCLI, simply clone the relevant repo and build it with Go. Having Go installed on your system is a requirement for using any of the UniPDF libraries. You can read more about how to install and use the CLI by visiting its repository page.

Optimization using UniCLI

To get started quickly, you can use UniCLI, which also allows you to optimize PDFs in a batch by selecting a directory as input. The CLI will then handle the rest and optimize all of the PDFs found in the directory. If you want the CLI to process files in subdirectories as well, simply pass the recursive flag -r in the command. The CLI is mostly intended for prototyping and as a handy tool.
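If you would rather build batch processing into your own program, the sketch below shows one way to collect the PDF files in a directory, descending into subdirectories only when a recursive option is set. It uses only the Go standard library and is not UniCLI's implementation; findPDFs and demoTree are hypothetical helpers, and the demo tree exists only so the example can run on its own.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// findPDFs lists the .pdf files under dir. Subdirectories are visited
// only when recursive is true.
func findPDFs(dir string, recursive bool) ([]string, error) {
	var found []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if path != dir && !recursive {
				return filepath.SkipDir // stay in the top-level directory
			}
			return nil
		}
		if strings.EqualFold(filepath.Ext(path), ".pdf") {
			found = append(found, path)
		}
		return nil
	})
	return found, err
}

// demoTree creates a temporary directory with one PDF at the top level,
// one text file, and one PDF in a subdirectory.
func demoTree() (string, func(), error) {
	dir, err := os.MkdirTemp("", "pdfbatch")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	sub := filepath.Join(dir, "archive")
	if err := os.Mkdir(sub, 0o755); err != nil {
		cleanup()
		return "", nil, err
	}
	files := []string{
		filepath.Join(dir, "a.pdf"),
		filepath.Join(dir, "notes.txt"),
		filepath.Join(sub, "b.PDF"),
	}
	for _, p := range files {
		if err := os.WriteFile(p, nil, 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return dir, cleanup, nil
}

func main() {
	dir, cleanup, err := demoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	flat, _ := findPDFs(dir, false)
	deep, _ := findPDFs(dir, true)
	fmt.Println(len(flat), len(deep)) // prints: 1 2
}
```

Each path returned by findPDFs could then be passed to the same read, optimize and write sequence used in pdf_optimize.go.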

You can run the optimization by simply running the following command in the CLI:

The command will optimize the files using the default parameters.

What's Next?

We're adding more optimization options in the near future and are particularly focused on scanned documents. We have already added support for CCITT encoding, which has improved our ability to implement lossless compression of image files. We are also currently implementing JBIG2 encoding, which will further improve the compression ratio of PDF documents without loss of quality and is particularly good for scanned files and image masks. Future optimization options will take advantage of both encodings.

You can check out the example scripts on the UniPDF GitHub page. The examples will help you get started with UniPDF. If you feel more examples are needed or you find a bug, open a new issue in the examples repository or contact us.