UniPDF JBIG2 Encoding Support in Golang
Updated: Oct 13, 2020
UniPDF keeps improving. A while ago we added CCITT encoding, which allowed our users to better optimize PDFs. We’ve now reached another milestone and integrated the JBIG2 compression standard into our library. When it comes to optimizing black and white images, JBIG2 is recognized as the best option available and can offer compression ratios of up to 100:1.
JBIG2 is the standard for bi-level image compression, developed by the Joint Bi-level Image Experts Group. It is designed to compress black and white images in both lossless and lossy modes with better performance than the traditional JBIG and Fax Group 4 standards.
For PDF users, the integration of JBIG2 means smaller file sizes without loss of quality or readability. JBIG2 is highly effective for scanned documents and is the industry standard used by Acrobat when working with PDFs.
As an example, take a bi-level checkerboard image stored in JPEG format, with dimensions of 1920x1920 pixels and a size of 142,794 bytes.
Currently, UniPDF's JBIG2 encoder allows users to encode in lossless mode. Compressed with the UniPDF JBIG2 encoder, the checkerboard image takes only 377 bytes. This gives a compression ratio (original size / compressed size) of 378.76, which corresponds to a space saving of about 99.74% for the given example. Inserting images into PDF files with the JBIG2 encoder filter leads to a comparable result: 160,121 bytes versus 1,686 bytes, which is 94.97 times smaller than the original and takes 98.947% less space.
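The figures above can be reproduced with a few lines of arithmetic. The sketch below is only a worked calculation on the byte counts quoted in this article; the helper names `compressionRatio` and `spaceSavings` are made up for illustration and are not part of UniPDF.

```go
package main

import "fmt"

// compressionRatio returns original/compressed, i.e. how many times
// smaller the compressed data is than the original.
func compressionRatio(original, compressed int) float64 {
	return float64(original) / float64(compressed)
}

// spaceSavings returns the share of the original size that was saved,
// expressed as a percentage.
func spaceSavings(original, compressed int) float64 {
	return (1 - float64(compressed)/float64(original)) * 100
}

func main() {
	// Checkerboard image: 142,794 bytes as JPEG, 377 bytes as JBIG2.
	fmt.Printf("image: ratio %.2f, savings %.3f%%\n",
		compressionRatio(142794, 377), spaceSavings(142794, 377))

	// PDF example: 160,121 bytes versus 1,686 bytes.
	fmt.Printf("pdf:   ratio %.2f, savings %.3f%%\n",
		compressionRatio(160121, 1686), spaceSavings(160121, 1686))
}
```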
A typical document scanned at 300 dpi and stored as TIFF takes around 75 KB to 125 KB per image. The same document encoded using JBIG2 would be about 5 to 10 times smaller (10 KB to 15 KB per image).
The performance also depends on the method used by the encoder. UniPDF allows users to encode black and white images using a generic lossless-encoding method, which is fast and has a relatively good compression ratio. For the future, we are focusing on building a proprietary lossy-encoding method, which would have a better compression ratio, especially for scanned text documents.
The JBIG2 standard allows bi-level images to be encoded in two modes:
- lossless - the image quality is the same as the original, no data is lost
- lossy - better compression ratio, but some image details might be lost
LOSSLESS GENERIC REGION ENCODING
The UniPDF library allows users to encode black and white images losslessly by providing a generic method. The encoder takes the whole image as a single generic region and encodes it using the arithmetic coder. It can reduce the file size further by encoding each line that duplicates the previous one with a single bit. This is enabled by setting DuplicateLinesRemoval. The more lines are duplicated, the better the compression ratio. This method is relatively fast and provides basic compression.
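To illustrate why duplicated lines compress so well, here is a minimal sketch that flags every row of a packed 1bpp bitmap that is identical to the row above it. It is a simplified illustration of the idea, not the encoder's actual code: the function name `duplicateRowFlags` is hypothetical, and treating the row above the image as all zeros is an assumption of this sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// duplicateRowFlags reports, for each row of a packed 1bpp bitmap, whether
// the row is byte-for-byte identical to the row directly above it.
// The first row is compared against a virtual all-zero row (an assumption
// made for this sketch).
func duplicateRowFlags(data []byte, rowStride int) []bool {
	if rowStride <= 0 {
		return nil
	}
	rows := len(data) / rowStride
	flags := make([]bool, rows)
	prev := make([]byte, rowStride) // virtual row above the image
	for y := 0; y < rows; y++ {
		cur := data[y*rowStride : (y+1)*rowStride]
		flags[y] = bytes.Equal(cur, prev)
		prev = cur
	}
	return flags
}

func main() {
	// Four rows, 16 pixels wide (2 bytes per row).
	data := []byte{
		0xFF, 0x00,
		0xFF, 0x00, // same as row 0
		0xFF, 0x00, // same as row 1
		0x00, 0xFF,
	}
	fmt.Println(duplicateRowFlags(data, 2)) // prints [false true true false]
}
```

Every `true` entry marks a row that an encoder using this technique could signal with one bit instead of coding all of its pixels.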
LOSSY - CLASSIFIED - COMPONENT ENCODING (UPCOMING)
UniPDF is working on a classified-component, lossy-encoding method. The lossy encoder would read and scan all pages of the provided document. The content of each image is decomposed into symbols, which are matched against similar occurrences and grouped into classes. Each symbol class is encoded with an arithmetic coder and stored in a Symbol Dictionary segment at a given class index.
The encoder then takes all occurrences of the symbol classes and stores their positions in a segment called 'Text Region'. This encoding method performs best on images of text documents, i.e. scans. However, raw scans are imperfect, and two occurrences of the same letter may differ by a few pixels. Because of this, there is a correlation threshold parameter that allows the encoder to match 'similar' symbols even if they differ slightly. The value of that parameter is in the range [0.0 - 1.0], where:
- 0.0 - the symbols may be completely different and still match (this value should never be used)
- 1.0 - the symbols need to be exactly the same in order to match
For most scenarios, a threshold in the range of 0.7 - 0.95 should provide the best results. This parameter directly affects the compression ratio: the lower the parameter, the better the compression ratio. However, we should be very careful when lowering its value, as this encoding method is lossy. The lower the threshold, the more lossy the results.
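The role of the threshold can be shown with a small sketch. It assumes two symbols of identical size stored as packed 1bpp data and uses a correlation score of the form |A AND B|^2 / (|A| * |B|), modeled on the kind of correlation scoring found in image libraries such as Leptonica. The function names are hypothetical, and the upcoming encoder may compute similarity differently.

```go
package main

import (
	"fmt"
	"math/bits"
)

// correlationScore compares two packed 1bpp bitmaps of identical size.
// It returns |a AND b|^2 / (|a| * |b|), where |x| is the number of set
// pixels. Identical bitmaps score 1.0, disjoint bitmaps score 0.0.
func correlationScore(a, b []byte) float64 {
	if len(a) != len(b) {
		return 0
	}
	var both, countA, countB int
	for i := range a {
		both += bits.OnesCount8(a[i] & b[i])
		countA += bits.OnesCount8(a[i])
		countB += bits.OnesCount8(b[i])
	}
	if countA == 0 || countB == 0 {
		return 0
	}
	return float64(both*both) / float64(countA*countB)
}

// sameClass decides whether two symbols are similar enough to share a class.
func sameClass(a, b []byte, threshold float64) bool {
	return correlationScore(a, b) >= threshold
}

func main() {
	a := []byte{0xF0, 0xF0} // 8 set pixels
	b := []byte{0xF0, 0xE0} // 7 set pixels, one pixel differs from a

	fmt.Println(correlationScore(a, b)) // 49/56 = 0.875
	fmt.Println(sameClass(a, b, 0.85))  // true
	fmt.Println(sameClass(a, b, 0.95))  // false
}
```

With a threshold of 0.85 the two slightly different symbols would be merged into one class (smaller output, one pixel lost), while 0.95 keeps them apart.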
The JBIG2 encoder allows you to store multiple black and white pages in a single JBIG2 document. This makes it possible to use an entity called 'Global Symbols', which acts as a single Symbol Dictionary shared by all encoded pages, each of which is stored as a separate byte stream.
Having a single store for the common symbols allows the encoder to reduce the size of the resulting JBIG2 byte stream for each page. It can be compared to a single, globally defined dictionary of letters, where each page stores only a reference to a letter rather than the letter itself.
Global Symbols can only be used with the upcoming lossy, classified-component method.
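The dictionary analogy can be sketched in a few lines. This is a conceptual illustration only: the types `globalSymbols` and `placement` are hypothetical, strings stand in for symbol bitmaps, and nothing here reflects the actual JBIG2 segment layout.

```go
package main

import "fmt"

// placement records where one instance of a dictionary symbol appears on a page.
type placement struct {
	SymbolID int // index into the shared dictionary
	X, Y     int
}

// globalSymbols is a hypothetical dictionary shared by all pages.
type globalSymbols struct {
	index   map[string]int
	symbols []string // strings stand in for symbol bitmaps
}

func newGlobalSymbols() *globalSymbols {
	return &globalSymbols{index: make(map[string]int)}
}

// id returns the index of the symbol, adding it to the dictionary if needed.
func (g *globalSymbols) id(symbol string) int {
	if id, ok := g.index[symbol]; ok {
		return id
	}
	id := len(g.symbols)
	g.index[symbol] = id
	g.symbols = append(g.symbols, symbol)
	return id
}

func main() {
	g := newGlobalSymbols()
	pages := [][]string{{"H", "E", "L", "L", "O"}, {"H", "O", "L", "E"}}
	for p, glyphs := range pages {
		var refs []placement
		for i, glyph := range glyphs {
			refs = append(refs, placement{SymbolID: g.id(glyph), X: i * 10, Y: 0})
		}
		fmt.Printf("page %d stores %d references\n", p+1, len(refs))
	}
	fmt.Printf("shared dictionary holds %d symbols\n", len(g.symbols))
}
```

Nine symbol occurrences across two pages are covered by four dictionary entries, which is where the per-page size reduction comes from.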
EXAMPLE - ENCODE IMAGE INTO PDF FILE
UniPDF Converting Go Images to JBIG2Images
The UniPDF JBIG2 encoder accepts only black and white images as input, with the data represented at 1 bit per pixel (1bpp). The bits are written row by row, starting from the top left corner. Within a single byte, the leftmost pixels occupy the most significant bits. For images whose width is not divisible by 8, the last byte of each row is padded with extra bits.
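The layout described above can be sketched as follows. This is a minimal illustration using only the standard library; `rowStride` and `packBits` are hypothetical helper names, and representing a black pixel as a set bit is an assumption of this sketch.

```go
package main

import "fmt"

// rowStride returns the number of bytes per row for a 1bpp image of the
// given width, with each row padded to a whole byte.
func rowStride(width int) int {
	return (width + 7) / 8
}

// packBits packs row-major pixels (true = black, an assumption of this
// sketch) into 1bpp data. Rows start at the top left corner and the
// leftmost pixel of each byte is stored in the most significant bit.
func packBits(pixels []bool, width, height int) []byte {
	stride := rowStride(width)
	out := make([]byte, stride*height)
	for y := 0; y < height; y++ {
		for x := 0; x < width; x++ {
			if pixels[y*width+x] {
				out[y*stride+x/8] |= byte(0x80) >> uint(x%8)
			}
		}
	}
	return out
}

func main() {
	const width, height = 10, 2
	pixels := make([]bool, width*height)
	pixels[0] = true // row 0, leftmost pixel
	pixels[9] = true // row 0, rightmost pixel
	for x := 0; x < width; x++ {
		pixels[width+x] = true // row 1 is entirely black
	}
	// 10 pixels need 2 bytes per row; the last 6 bits of each row are padding.
	fmt.Printf("% X\n", packBits(pixels, width, height)) // prints 80 40 FF C0
}
```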
UniDoc's UniPDF allows users to convert a Golang image.Image into a core.JBIG2Image. This is done by the core.GoImageToJBIG2 function, which takes two parameters:
- image.Image - an input image to convert into binary JBIG2 format
- Threshold - used to convert the image into black and white pixels only. It takes a value in the range [0.0 - 1.0] and represents how likely a pixel is to be treated as white. In the special case when its value is 0.0, the converter computes the histogram of the image and derives the threshold value from it.
UniPDF Supports All JBIG2 Decoding Formats
The UniPDF Golang library allows you to decode JBIG2 encoded files and byte streams. It decodes byte streams produced with any combination of the MMR, arithmetic and Huffman table coders, and it decodes all types of JBIG2 segments.
DECODER EXAMPLE WITH GLOBALS
The integration of JBIG2 encoding is the latest improvement in UniPDF and allows our users to produce better optimized PDF documents. With JBIG2 you can store your black and white scanned documents more efficiently and save storage without fear of losing quality. Looking ahead, we plan to develop a smart PDF compression system that can identify whether images are 1 bit or close to 1 bit and automatically compress them using JBIG2. We are also developing the lossy JBIG2 encoding method, which would provide a better compression ratio. We are focused on building the most optimized PDF builder for our customers and will keep perfecting UniPDF.
We gratefully acknowledge the following open source projects that served as references during development.
Apache Java PDFBox JBIG2 Decoder, Apache License 2.0. To achieve full support in the JBIG2 decoder, it was necessary to implement all decoding combinations defined in the JBIG2 standard, also known as ITU-T T.88 and ISO/IEC 14492. Since no open source Golang JBIG2 package was available, we decided it would be best to base our own implementation on a solid and reliable library. The Apache PDFBox JBIG2 library fulfilled all our requirements. Its code is of really good quality, with detailed comments on each function and class. It also implements MMR, Huffman table and arithmetic decompressors along with all JBIG2 segments.
AGL JBIG2 Encoder, Apache License 2.0. The complexity of the JBIG2 encoding process and the lack of comprehensive documentation for it led us to look at the AGL JBIG2 Encoder library. At the time we implemented our encoder, it was the only open source JBIG2 encoder. It’s a C++ based library that implements both lossless and lossy encoding methods, where most of the image operations are done using Dan Bloomberg's Leptonica library. The core encoding processes in the UniPDF JBIG2 encoder were based on that well documented and solid library.
Dan Bloomberg's Leptonica, 2-Clause BSD License. Leptonica is an amazing C/C++ open source library. It provides raster operations, binary expansion and reduction, JBIG2 component creators, correlation scoring and many more thoroughly commented image operation functions. That library was used as a very solid base for the image operation algorithms used by our JBIG2 encoder.