Recently, I worked on a project that involves filling in PDF forms programmatically.
Spoiler alert: I had a horrible experience.
But if you have any experience dealing with PDFs, you already knew that.
We inherited a Django and an Ionic/Cordova codebase. One of the things the Cordova app does is send data to the Django backend, which in turn creates a PDF.
The situation is that we have a number of PDFs which are basically forms that get filled in based on the data sent from the mobile app. The nature of the project is such that we cannot change how the PDFs look.
The mobile app isn’t used all year round, but only during certain sales cycles. Each sales cycle we are given a set of PDFs that not only look different from each other – but also look different from previous sales cycles.
That means for this particular part of the project, we can’t standardise the PDF that gets generated.
Essentially, we need to prepare the form for each sales cycle.
Typically, a form has a number of text fields, a lot of checkboxes, and a number of radio buttons.
One unusual quirk was that it also requires a signature, which is captured from the mobile app.
Given these constraints, I tried a few things.
Idea #1: Use LibreOffice
I tried using LibreOffice.
This was actually how things were done before we inherited the codebase.
The idea was that we can put in named fields in an ODT document, which we can then fill in via something called UNO.
Basically the workflow goes like this:
- Step 1: Convert all the pages in the form to images, and make sure it fits in A4.
- Step 2: Create a new document in LibreOffice, and set the page styles to use these images as the background
- Step 3: Put in named fields in the text fields, checkboxes, and radio buttons.
- Step 4: Use UNO to put text into the fields. For checkboxes, we can put in a unicode checkbox character.
- Step 5: Use UNO to convert the ODT document to PDF
Now, in theory that approach was promising – however there were a number of problems.
Firstly, have you tried aligning anything in an ODT document? It’s freaking awful.
Preparing the forms could easily take days, not to mention that it’s super annoying to do for 12 different forms. And I had to do that for each sales cycle.
Also, I wasn’t actually able to get this technique to work on my local machine, which has a newer version of LibreOffice. I was told OpenOffice works better.
Using images as the background of an ODT document also results in pretty poor quality documents. We could use less compression, of course, but we end up with massive files that are hard to work with within LibreOffice.
So after a couple of false starts, I abandoned this approach.
Idea #1.5: Use HTML-to-PDF or RML/ReportLab
So I looked at a number of different ideas. One of them was to use some HTML-to-PDF tools.
Unfortunately, that meant recreating the form in HTML and CSS. We definitely can’t do that.
That also rules out approaches that use RML or ReportLab.
Idea #2: Use FDF
So, we know fillable PDF forms are a thing.
Adobe actually has several standards to fill in PDF documents. One of them is called the Forms Data Format, or FDF.
I wish I could tell you I understood the FDF specification, but the information I see online isn’t very helpful. Just go read the Wikipedia page on it.
The idea is that an FDF file contains data either from a submitted PDF form, or data that is meant to be filled into a PDF (usually the former, from what I can gather).
That means given an FDF, it should be possible to merge it with a PDF form to create a final filled-in PDF form.
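To make the idea concrete, here is a minimal sketch of what the data side can look like. It builds XFDF, the XML flavour of FDF, using only the standard library. The field names are made up for illustration, and `build_xfdf` is a hypothetical helper, not something taken from any of the libraries discussed here.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(fields):
    """Return an XFDF document (as a string) for a flat {field name: value} mapping."""
    # The namespace is written as a plain attribute to keep the sketch simple.
    root = ET.Element("xfdf", {"xmlns": XFDF_NS, "xml:space": "preserve"})
    fields_el = ET.SubElement(root, "fields")
    for name, value in fields.items():
        field_el = ET.SubElement(fields_el, "field", {"name": name})
        value_el = ET.SubElement(field_el, "value")
        value_el.text = str(value)  # ElementTree escapes special characters
    return ET.tostring(root, encoding="unicode")


# Hypothetical field names; a real form uses whatever names were set in Acrobat.
xfdf = build_xfdf({"customer_name": "Jane Doe", "agree_terms": "Yes"})
print(xfdf)
```

The field names have to match the names given to the form fields in the PDF, otherwise the merge step has nothing to fill in.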
So, I talked to my employer who let me purchase Adobe Acrobat Pro DC for about $30 a month, and then used that to add PDF form fields to the PDFs the client gave me.
Great, so all I need to do is to find some Python libraries to create an FDF and merge it with the PDF form I created.
I found a few:
PyPDFtk
The PyPDFtk Python library is a Python wrapper for a server application called PDFtk.
It generates the (X)FDF for you, and merges it by calling PDFtk shell commands. It largely hides the implementation from you.
For most things, this works fine. Textboxes, radio buttons, checkboxes – all fine.
However, it doesn’t handle image fields.
Unfortunately, one of the requirements was to paste a signature to the form, so I can’t actually use this.
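For reference, the merge that a wrapper like this hides boils down to a single PDFtk command. The sketch below calls PDFtk directly with `subprocess` rather than going through PyPDFtk, so it is not how the wrapper is implemented internally. It assumes the `pdftk` binary is on the PATH; the file names and both function names are placeholders.

```python
import shutil
import subprocess


def pdftk_fill_command(form_pdf, data_file, output_pdf, flatten=True):
    """Build the pdftk command that merges an FDF/XFDF file into a PDF form."""
    cmd = ["pdftk", form_pdf, "fill_form", data_file, "output", output_pdf]
    if flatten:
        # Flattening bakes the values into the page so the fields are no longer editable.
        cmd.append("flatten")
    return cmd


def fill_form(form_pdf, data_file, output_pdf, flatten=True):
    """Run pdftk; raises if pdftk is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    subprocess.run(
        pdftk_fill_command(form_pdf, data_file, output_pdf, flatten),
        check=True,
    )


# Placeholder file names, shown for illustration only:
# fill_form("form.pdf", "data.xfdf", "filled.pdf")
```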
PdfJinja
This one does allow you to paste images, which was great!
Here’s the repo: https://github.com/rammie/pdfjinja
I was actually really surprised to see a Python library that does this. I was under the impression that it was a really obscure niche.
I’m not actually 100% sure how it achieves this, but it looks like it uses an API that allows for creating watermarks on PDFs. Watermarks are essentially images, so it does the job.
But… the library itself doesn’t seem to work for checkboxes and radio buttons.
In fact, I pulled the examples from the repo and ran them. It filled in the text fields and the signature fine, but the checkboxes didn’t work.
Perhaps it worked at some point, but it has obviously stopped working for checkboxes.
But I also knew that FDFs can select checkboxes correctly – PyPDFtk worked fine.
So, I looked at the differences in the implementations between PyPDFtk and PdfJinja, and what I found was that for some reason, if you want a checkbox to be checked – the FDF output needs to contain the specific value of “Yes”.
I created my own fork, and replaced the value with the string “Yes”.
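This is not the code from the fork, just a minimal sketch of the idea: convert Python booleans into the strings the form expects before generating the FDF. It assumes the checkboxes use “Yes” as their on state, which is what worked for me; other forms may use a different export value. “Off” is the conventional name for the unchecked state. Both function names are hypothetical.

```python
def checkbox_value(checked, on_state="Yes"):
    """Map a boolean to the state name written into the FDF for a checkbox."""
    return on_state if checked else "Off"


def normalize_form_data(data, on_state="Yes"):
    """Return a copy of data with booleans replaced by checkbox state names."""
    normalized = {}
    for name, value in data.items():
        if isinstance(value, bool):
            normalized[name] = checkbox_value(value, on_state)
        else:
            normalized[name] = value
    return normalized


# Hypothetical field names, for illustration only.
form_data = normalize_form_data(
    {"customer_name": "Jane Doe", "agree_terms": True, "newsletter": False}
)
print(form_data)
```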
That fixed checkboxes for me, but radio buttons were still broken.
So I pulled out my debugger to see exactly what went wrong. It turns out that PdfMiner, the library PdfJinja uses, handles some grouped form controls like radio buttons in a subtly different way from normal text fields.
I changed that to handle that specific case, and now radio buttons worked for me.
I was hesitant to open a pull request against the original repo because:
- I didn’t really understand what was happening
- I didn’t know if my change actually breaks something else
In the end I decided to maintain my own fork.
Other weird shit
Sometimes, I run into some errors related to fonts. Searching the error on StackOverflow gave me this gem from 2014: https://stackoverflow.com/questions/23948647/font-issue-with-pdftk
The original developer of PDFtk (whose name is in the Java stacktrace) commented on the question with a recommendation to use a newer incarnation of PDFtk.
Unfortunately, that doesn’t have a Python SDK, which I could in theory create, but that seems like too much effort.
Someone else found a workaround – the solution is to open the PDF form in Preview (on Mac), type something in the text forms, remove the text, then save it again.
And then it magically works.
Turns out sometimes it happens for checkboxes too. Sometimes, I’ve had to open the PDF form, click on the checkboxes, unselect them, and then save. Magically works.
I really hate dealing with PDFs at this point.
Conclusion
For a file format as widely used as PDF, the experience was horrible.
If you’re looking to do some open source, there’s a lot of low hanging fruit here.
Another interesting path to go down would be to create a new document format with an API that is friendlier to developers – and then make it possible to convert that to PDFs. Seems like a herculean task, though.
If anyone is interested in looking at sample code, I created a Django project that does this, with signatures and all: https://github.com/yoongkang/pdffun
EDIT 29 May 2018: Some people seem to be interested in my fork of PdfJinja, which works for checkboxes and radio buttons. Here is the link to it: https://github.com/yoongkang/pdfjinja