UniPDF version 3 released
Updated: Oct 13, 2020
We are pleased to release version 3.0.0 of our PDF library, UniPDF (formerly UniDoc). UniPDF is available on GitHub at https://github.com/unidoc/unipdf. The old unidoc repository will remain available as read-only for backward compatibility at https://github.com/unidoc/unidoc, although all new development is shifting to the unipdf repository.
The reason for the name change is twofold:
We recently added a new product, UniOffice, for working with Office formats (docx, xlsx, pptx), so the unidoc name for our PDF library was confusing.
With Go modules becoming an important part of Go going forward, we wanted to release version 3 as a Go module, and it seemed like a clean approach to start with a fresh repository supporting modules and semantic import paths from scratch.
The journey from version 2 to version 3 has been a pretty long one, starting with a compositefonts branch supporting composite fonts (as used by many international fonts), with most of the work then being done in the development branch v3. The latest v2 release was v2.2.0 on 17th November 2018 (524 commits), whereas the work on v3 started in July 2018 and is being released now at 1450 commits. The release post for v2 was published on July 26, 2017, so it has been almost 2 years since the last major version release.
Composite fonts support (unicode character support)
The original plan with v3 was to add support for composite fonts to be able to handle more complex documents, both for creation and for processing. In v2, we had decent support for simple fonts, which are fonts where each glyph (symbol) is represented by an 8-bit character code. In composite fonts, each glyph can be represented by multiple bytes. The raw bytes in the PDF contents are mapped to character codes, and the character codes are mapped to glyph IDs. There are multiple ways of doing this, and multiple types of fonts.
In addition, the process is not the same for extraction (which uses ToUnicode maps) as for creation, so it turned out to be a nontrivial exercise. Version 3 now supports the most common types of fonts and encodings. We will continue to work on fonts and add support for the less common types going forward, but the support is already quite comprehensive.
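To make the chain from raw bytes to character codes to text concrete, here is a minimal Go sketch. It is not UniPDF code: decodeCodes, toText, and the toy lookup table are hypothetical names chosen for illustration, and the sketch assumes fixed 2-byte big-endian codes, whereas real composite fonts may use variable-width codes defined by a CMap.

```go
package main

import "fmt"

// decodeCodes splits raw content bytes into 2-byte big-endian character
// codes. Assumption: fixed-width codes; a trailing odd byte is dropped.
func decodeCodes(raw []byte) []uint16 {
	codes := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		codes = append(codes, uint16(raw[i])<<8|uint16(raw[i+1]))
	}
	return codes
}

// toText maps character codes to runes through a ToUnicode-like table
// (hypothetical, hand-written here). Unmapped codes become U+FFFD.
func toText(codes []uint16, toUnicode map[uint16]rune) string {
	out := make([]rune, 0, len(codes))
	for _, c := range codes {
		r, ok := toUnicode[c]
		if !ok {
			r = '\uFFFD'
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	toUnicode := map[uint16]rune{0x0001: '你', 0x0002: '好'}
	raw := []byte{0x00, 0x01, 0x00, 0x02}
	fmt.Println(toText(decodeCodes(raw), toUnicode))
}
```

For creation the lookup runs the other way (from runes to codes and glyph IDs), which is one reason the two directions need separate handling.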
A few important use cases of this are:
Extracting unicode text from PDF files
Creating PDFs with unicode text and symbols
While we were working on the complex font support, we also wanted to add a few more features that had the potential to lead to breaking changes so we decided to work on those in the v3 branch as well.
Optimizing PDF output (aka PDF compression)
Optimizing PDF documents when writing them out requires a multi-step approach, for example:
Combine duplicate objects and streams
Combine many uncompressed indirect objects into compressed object streams
Reduce the resolution of overly large images down to a specified pixels-per-inch threshold
Higher compression of objects and images (sometimes lossy)
Other approaches are possible as well, since some image encodings give better results than others. Each optimization has its own options, and the optimizations are chained together.
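As an illustration of the first step in that list, combining duplicate objects, the following Go sketch hashes each object payload and keeps only one copy per distinct hash. It is a simplified model rather than UniPDF's optimizer: dedupe and its remap table are hypothetical, and objects are treated as plain byte slices.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupe returns the distinct payloads plus a remap table: remap[i] is the
// index in unique that references to original object i should point to.
func dedupe(objects [][]byte) (unique [][]byte, remap []int) {
	seen := make(map[[sha256.Size]byte]int)
	remap = make([]int, len(objects))
	for i, obj := range objects {
		sum := sha256.Sum256(obj)
		idx, ok := seen[sum]
		if !ok {
			// First time this content is seen: keep it.
			idx = len(unique)
			seen[sum] = idx
			unique = append(unique, obj)
		}
		remap[i] = idx
	}
	return unique, remap
}

func main() {
	objs := [][]byte{[]byte("font-A"), []byte("image-1"), []byte("font-A")}
	unique, remap := dedupe(objs)
	fmt.Println(len(unique), remap)
}
```

In a chained design, a step like this would run first so that later steps (object streams, recompression) operate on fewer objects.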
The optimization provides significant compression in many cases, and we generally recommend using the non-lossy optimizations when writing out PDFs. We may consider making those the default going forward (currently, no optimization is the default).
Digital signatures and append mode (revisions)
Digital signatures are already important in many PDF tasks. We wanted to add basic support for creating and validating signatures to start our journey into this area and make it possible in Go.
An important feature of digital signatures is that the hash of the entire document (outside the signature contents itself) is calculated. Thus we soon learned that multiple signatures in a row would lead to an invalid hash unless we added a feature called incremental writing (or appending).
The idea with incremental writing is that upon signing a PDF, the original content is left unchanged, and only the changes since the previous revision are appended. This enables creating many revisions where the hash of each signature remains valid.
As a result, we created PdfAppender, which is capable of writing in append mode and supports digital signatures. In addition, we added support for different ways of obtaining signatures (PKCS11 and an external signing provider). We have prepared a few examples of digital signatures that are available at https://github.com/unidoc/unipdf-examples/tree/v3/signatures.
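The effect of append mode on a signature hash can be sketched in a few lines of Go. This is a toy model, not PdfAppender: the byteRange type, the digest helper, and the fake document bytes are all hypothetical, standing in for the two spans of a signed file that surround the signature contents.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// byteRange describes the two spans covered by a signature: everything
// before the signature contents and everything after it, up to the end of
// the file as it was at signing time.
type byteRange struct{ start1, len1, start2, len2 int }

// digest hashes only the bytes inside the two spans.
func digest(doc []byte, br byteRange) [sha256.Size]byte {
	h := sha256.New()
	h.Write(doc[br.start1 : br.start1+br.len1])
	h.Write(doc[br.start2 : br.start2+br.len2])
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	rev1 := []byte("%PDF body <SIG:0000> trailer\n")
	open := bytes.Index(rev1, []byte("<SIG:")) + len("<SIG:")
	closeIdx := bytes.IndexByte(rev1[open:], '>') + open
	br := byteRange{0, open, closeIdx, len(rev1) - closeIdx}
	signed := digest(rev1, br)

	// Append mode: earlier bytes are untouched, new data goes at the end.
	rev2 := append(append([]byte{}, rev1...), []byte("revision 2 objects\n")...)

	// Rewriting: bytes inside the signed range change.
	rewritten := bytes.Replace(rev1, []byte("body"), []byte("BODY"), 1)

	fmt.Println(signed == digest(rev2, br), signed == digest(rewritten, br))
}
```

Because the first signature's spans end where the first revision ended, anything appended later falls outside them, while any in-place change to the earlier bytes alters the digest.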
pdf_sign_generate_keys.go: Signing using generated private/public key pair.
pdf_sign_pkcs12.go: Signing using PKCS12 (.p12/.pfx) file.
pdf_sign_external.go: PKCS7 signing with an external service with an interim step, creating a PDF with a blank signature and then replacing the blank signature with the actual signature from the signing service.
pdf_sign_pkcs11.go: Signing with a PKCS11 service using SoftHSM and the crypto11 package.
pdf_sign_appearance.go: Creating signature appearance fields.
pdf_sign_validate.go: Signature validation.
Powerful text and image extraction
We introduced the extractor package in v2, which had the capability for simple text extraction. It worked great on many basic PDF files using Western character encoding and simple fonts. However, it did not work well on advanced PDF files with more complex fonts.
Our work on composite font support has significantly enhanced the text extraction, coupled with better handling of ToUnicode maps, more advanced processing and improved testing. Internally, we now use a content stream processor, which essentially works like a basic renderer, except that it is focused only on text extraction. It is capable of tracking coordinates, among other things.
We expect to keep improving the text extractor further, and our goal is to provide vectorized text extraction, i.e. to be able to provide both text and font information as well as bounding box coordinates. While much of this information is already available internally, we decided to leave it unexported while we finalize the internals and design the API the way we want it to be.
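The idea of a content stream processor that tracks coordinates can be illustrated with a heavily simplified Go sketch. It is not the extractor package's implementation: textMark, tokenize, and extract are hypothetical names, only the Td and Tj operators are interpreted, string escapes are ignored, and the horizontal advance caused by showing text is not modeled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// textMark is a piece of text with the position it was shown at.
type textMark struct {
	Text string
	X, Y float64
}

// tokenize splits a content stream into tokens, keeping "(...)" string
// literals whole. Escapes and nested parentheses are not handled.
func tokenize(s string) []string {
	var toks []string
	for i := 0; i < len(s); {
		switch {
		case s[i] == ' ' || s[i] == '\n':
			i++
		case s[i] == '(':
			end := strings.IndexByte(s[i:], ')')
			if end < 0 {
				return toks
			}
			toks = append(toks, s[i:i+end+1])
			i += end + 1
		default:
			j := i
			for j < len(s) && s[j] != ' ' && s[j] != '\n' && s[j] != '(' {
				j++
			}
			toks = append(toks, s[i:j])
			i = j
		}
	}
	return toks
}

// extract pushes operands on a stack, moves the text position on Td, and
// records a mark at the current position on Tj.
func extract(stream string) []textMark {
	var marks []textMark
	var stack []string
	var x, y float64
	for _, tok := range tokenize(stream) {
		switch tok {
		case "BT":
			x, y = 0, 0
			stack = stack[:0]
		case "Td":
			if len(stack) >= 2 {
				dx, _ := strconv.ParseFloat(stack[len(stack)-2], 64)
				dy, _ := strconv.ParseFloat(stack[len(stack)-1], 64)
				x, y = x+dx, y+dy
			}
			stack = stack[:0]
		case "Tj":
			if len(stack) >= 1 {
				lit := stack[len(stack)-1]
				marks = append(marks, textMark{strings.Trim(lit, "()"), x, y})
			}
			stack = stack[:0]
		case "ET":
			stack = stack[:0]
		default:
			stack = append(stack, tok)
		}
	}
	return marks
}

func main() {
	stream := "BT /F1 12 Tf 72 700 Td (Hello World) Tj 0 -14 Td (Second line) Tj ET"
	for _, m := range extract(stream) {
		fmt.Printf("%q at (%.0f, %.0f)\n", m.Text, m.X, m.Y)
	}
}
```

A real processor also has to track the full text and graphics state (matrices, font, size, spacing) and decode the shown bytes through the font's encoding before positions and text are reliable.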
Examples related to text extraction:
pdf_extract_text.go: Extracting text from PDF.
pdf_search_replace.go: Search and replace text in PDF.
Image extraction has also been added to the extractor providing a single method for extracting images with position and dimension.
pdf_extract_images.go: Extract images from PDF.
The basic image extraction code is now:
Many improvements in the creator package
The creator package has been significantly enhanced. Frankly, it's on a totally different level than before with styled paragraph support, an invoice component, automatically generated outlines (bookmarks) for table of contents, subtables, and many fixes and enhancements.
We will be writing more about these enhancements individually in upcoming blog posts. We recently wrote an introduction to the new invoice component in a blog post: Simple invoice creation.
Other notable additions include form filling (via both FDF merging and JSON importing), appearance generation, CCITTFaxDecode support, and multiple bug fixes and enhancements.
Next steps - the journey ahead
Going forward we would like to establish a more regular release schedule and get new features and fixes to our users quicker.
This essentially means that we need to have shorter-lived branches and regularly merge the development branch into master. We plan to follow semantic versioning on this journey and may release major versions up to 2 times per year. In the past, the rate has been more like one major version per year, but we expect this frequency may increase as we release features more rapidly. Rapid releases also mean there is less chance to undo a poor API design decision once it has been released, at least until the next major version.
For instructions on getting started, see: https://github.com/unidoc/unipdf
Remember to star the repository, and watch it to get notifications for new tagged releases.
UniPDF examples repository is available at: https://github.com/unidoc/unipdf-examples
For inquiries, feel free to contact us: email@example.com