Filling in PDF forms with Python (2018)

Recently, I worked on a project that involves filling in PDF forms programmatically.

Spoiler alert: I had a horrible experience.

But if you have any experience dealing with PDFs, you already knew that.

We inherited a Django and an Ionic/Cordova codebase. One of the things the Cordova app does is send data to the Django backend, which in turn creates a PDF.

The situation is that we have a number of PDFs which are basically forms that get filled in based on the data sent from the mobile app. The nature of the project is such that we cannot change how the PDF looks.

The mobile app isn’t used all year round, but only during certain sales cycles. Each sales cycle we are given a set of PDFs that not only look different from each other – but also look different from previous sales cycles.

That means for this particular part of the project, we can’t standardise the PDF that gets generated.

Essentially, we need to prepare the form for each sales cycle.

Typically, in the form there are a number of text fields, a lot of checkboxes, a number of radio buttons.

One unusual quirk was that the form also requires a signature, which is captured from the mobile app.

Given these constraints, I tried a few things.

Idea #1: Use LibreOffice

I tried using LibreOffice.

This was actually how things were done before we inherited the codebase.

The idea was that we can put in named fields in an ODT document, which we can then fill in via something called UNO.

Basically the workflow goes like this:

Step 1: Convert all the pages in the form to images, and make sure it fits in A4.
Step 2: Create a new document in LibreOffice, and set the page styles to use these images as the background.
Step 3: Put in named fields in the text fields, checkboxes, and radio buttons.
Step 4: Use UNO to put text into the fields. For checkboxes, we can put in a Unicode checkbox character.
Step 5: Use UNO to convert the ODT document to PDF
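
To make steps 4 and 5 a bit more concrete, here is a minimal sketch. It is not the code from the inherited project: the function names are hypothetical, and for step 5 it swaps UNO for LibreOffice's headless command-line converter, assuming a `soffice` binary is on the PATH.

```python
import subprocess

CHECKED = "\u2611"    # BALLOT BOX WITH CHECK
UNCHECKED = "\u2610"  # BALLOT BOX


def field_text(value):
    """Return the text to place in a named field for a Python value."""
    if isinstance(value, bool):
        # Checkboxes are faked with a Unicode character (step 4).
        return CHECKED if value else UNCHECKED
    return "" if value is None else str(value)


def convert_command(odt_path, out_dir):
    """Build the command that converts an ODT file to PDF without a GUI."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, odt_path]


def convert_to_pdf(odt_path, out_dir):
    # Needs LibreOffice installed; raises CalledProcessError on failure.
    subprocess.run(convert_command(odt_path, out_dir), check=True)
```

Filling the named fields themselves still has to go through UNO (or some other ODT manipulation), which this sketch leaves out.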

Now, in theory that approach was promising – however there were a number of problems.

Firstly, have you tried aligning anything in an ODT document? It’s freaking awful.

Preparing the forms this way could easily take days, and it is super annoying to do for 12 different forms. And I had to do that for each sales cycle.

Also, I wasn’t actually able to get this technique to work on my local machine, which has a newer version of LibreOffice. I was told OpenOffice works better.

Using images as the background of an ODT document also results in pretty poor quality documents. We could use less compression, of course, but we end up with massive files that are hard to work with within LibreOffice.

So after a couple of false starts, I abandoned this approach.

Idea #1.5: Use HTML-to-PDF or RML/ReportLab

So I looked at a number of different ideas. One of them was to use some HTML-to-PDF tools.

Unfortunately, that meant recreating the form in HTML and CSS. We definitely can’t do that.

That also rules out approaches that use RML or ReportLab.

Idea #2: Use FDF

So, we know fillable PDF forms are a thing.

Adobe actually has several standards to fill in PDF documents. One of them is called the Forms Data Format, or FDF.

I wish I could tell you I understood the FDF specification, but the information I see online isn’t very helpful. Just go read the Wikipedia page on it.

The idea is that an FDF file contains data either from a submitted PDF form, or data that is meant to be filled into a PDF (usually the former, from what I can gather).

That means given an FDF, it should be possible to merge it with a PDF form to create a final filled-in PDF form.
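
For a feel of what such a file looks like, here is a minimal sketch that writes an FDF for plain text fields by hand. The function names are made up for illustration, and it only handles ASCII text; real-world FDF generators also deal with other encodings, checkboxes, and nested field names.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def build_fdf(fields):
    """Return FDF file contents (bytes) for a dict of field name -> text."""
    entries = "\n".join(
        "<< /T ({}) /V ({}) >>".format(escape_pdf_string(name),
                                       escape_pdf_string(value))
        for name, value in fields.items()
    )
    body = (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n{}\n] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    ).format(entries)
    # ASCII only in this sketch; non-ASCII text raises UnicodeEncodeError.
    return body.encode("ascii")


if __name__ == "__main__":
    with open("data.fdf", "wb") as fh:
        fh.write(build_fdf({"full_name": "Ada Lovelace"}))
```

Each entry pairs a field name (`/T`) with its value (`/V`); the names have to match the field names defined in the PDF form.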

So, I talked to my employer who let me purchase Adobe Acrobat Pro DC for about $30 a month, and then used that to add PDF form fields to the PDFs the client gave me.

Great, so all I need to do is to find some Python libraries to create an FDF and merge it with the PDF form I created.

I found a few:

PyPDFtk

The PyPDFtk Python library is a Python wrapper for a server application called PDFtk.

It generates the (X)FDF for you, and merges it by calling PDFtk shell commands. It largely hides the implementation from you.
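
Roughly, the merge step boils down to a `pdftk ... fill_form ... output ...` call. The sketch below builds that command directly with `subprocess` rather than going through PyPDFtk; the helper names and file names are hypothetical, and it assumes the `pdftk` binary is on the PATH.

```python
import subprocess


def fill_form_command(form_pdf, fdf_path, output_pdf, flatten=True):
    """Build the pdftk command that merges an FDF file into a PDF form."""
    cmd = ["pdftk", form_pdf, "fill_form", fdf_path, "output", output_pdf]
    if flatten:
        # Flattening bakes the values in so the fields are no longer editable.
        cmd.append("flatten")
    return cmd


def fill_form(form_pdf, fdf_path, output_pdf, flatten=True):
    # Needs pdftk installed; raises CalledProcessError if pdftk fails.
    subprocess.run(fill_form_command(form_pdf, fdf_path, output_pdf, flatten),
                   check=True)
```

Calling `fill_form("form.pdf", "data.fdf", "filled.pdf")` would then produce the filled-in, flattened PDF.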

For most things, this works fine. Textboxes, radio buttons, checkboxes – all fine.

However, it doesn’t handle image fields.

Unfortunately, one of the requirements was to paste a signature to the form, so I can’t actually use this.

PdfJinja

This one does allow you to paste images, which was great!

Here’s the repo: https://github.com/rammie/pdfjinja

I was actually really surprised to see a Python library that does this. I was under the impression that it was a really obscure niche.

I’m not actually 100% sure how it achieves this, but it looks like it uses an API that allows for creating watermarks on PDFs. Watermarks are essentially images, so it does the job.

But… the library itself doesn’t seem to work for checkboxes and radio buttons.

In fact, I pulled the examples from the repo and ran them. The text fields and the signature were filled in fine, but the checkboxes didn’t work.

Perhaps it worked at some point, but it has obviously stopped working for checkboxes since.

But I also knew that FDFs can select checkboxes correctly – PyPDFtk worked fine.

So, I looked at the differences in the implementations between PyPDFtk and PdfJinja, and what I found was that for some reason, if you want a checkbox to be checked – the FDF output needs to contain the specific value of “Yes”.

I created my own fork, and replaced the value with the string “Yes”.
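
The idea can be sketched as a small normalisation step before the FDF is generated. This is not the code from the fork: the function names are hypothetical, and it assumes the form's checkboxes use "Yes" as their on-state (the off-state in PDF forms is conventionally "Off"). A form built with a different export value would need that value instead.

```python
def checkbox_value(checked, on_state="Yes"):
    """Map a Python boolean to the value a checkbox field expects."""
    return on_state if checked else "Off"


def normalise_form_data(data, on_state="Yes"):
    """Convert a dict of Python values into FDF-ready string values."""
    return {
        name: checkbox_value(value, on_state) if isinstance(value, bool)
        else str(value)
        for name, value in data.items()
    }


if __name__ == "__main__":
    print(normalise_form_data({"agree_terms": True, "age": 30}))
```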

That fixed checkboxes for me, but radio buttons were still broken.

So I pulled out my debugger to see exactly what went wrong, and it turns out the library PdfJinja uses, called PDFMiner, handles some grouped form controls like radio buttons slightly differently from normal text fields.

I changed that to handle that specific case, and now radio buttons worked for me.
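
I can't say for sure this is the whole story, but one common way grouped controls differ is structural: in a PDF, a radio group is usually a single named field whose individual buttons hang off it as unnamed "kids", whereas a text field is typically its own widget. The sketch below models that with plain dicts. It is a simplification for illustration, not PDFMiner's actual data structures, and `collect_widgets` is a made-up name.

```python
def collect_widgets(fields):
    """Map each field name to the list of widget dicts that draw it."""
    result = {}
    for field in fields:
        name = field["T"]
        kids = field.get("Kids", [])
        # A radio group keeps its name on the parent and has one unnamed
        # kid per button; a plain text field is its own single widget.
        result[name] = kids if kids else [field]
    return result


example_fields = [
    {"T": "full_name", "FT": "Tx"},
    {"T": "colour", "FT": "Btn", "Kids": [{"AS": "Off"}, {"AS": "Off"}]},
]
widgets = collect_widgets(example_fields)
```

Code that only looks at top-level fields with their own widget would handle `full_name` but never touch the individual buttons under `colour`.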

I was hesitant to open a pull request against the original repo because:

I didn’t really understand what was happening
I didn’t know if my change actually breaks something else

In the end I decided to maintain my own fork.

Other weird shit

Sometimes, I run into some errors related to fonts. Searching the error on StackOverflow gave me this gem from 2014: https://stackoverflow.com/questions/23948647/font-issue-with-pdftk

The original developer of PDFtk (whose name is in the Java stacktrace) commented on the question with a recommendation to use a newer incarnation of PDFtk.

Unfortunately, that doesn’t have a Python SDK, which I could in theory create, but that seems like too much effort.

Someone else found a workaround – the solution is to open the PDF form in Preview (on Mac), type something in the text forms, remove the text, then save it again.

And then it magically works.

Turns out sometimes it happens for checkboxes too. Sometimes, I’ve had to open the PDF form, click on the checkboxes, unselect them, and then save. Magically works.

I really hate dealing with PDFs at this point.

Conclusion

For a file format as widely used as PDF, the experience was horrible.

If you’re looking to do some open source, there’s a lot of low hanging fruit here.

Another interesting path to go down would be to create a new document format with an API that is friendlier to developers – and then make it possible to convert that to PDFs. Seems like a herculean task, though.

If anyone is interested in looking at sample code, I created a Django project that does this, with signatures and all: https://github.com/yoongkang/pdffun

EDIT 29 May 2018: Some people seem to be interested in my fork of PdfJinja, which works for checkboxes and radio buttons. Here is the link to it: https://github.com/yoongkang/pdfjinja
