How to Work With a PDF in Python (Summary)

How to Work With a PDF in Python Andrew Stephen 01:41

The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and leverage its capabilities to help you do your job better!

In this course, you learned how to do the following:

Extract metadata from a PDF
Rotate pages
Merge and split PDFs
Add watermarks
Add encryption
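
As a rough recap of several of the operations listed above (reading metadata, rotating pages, splitting, and encrypting), here is a minimal sketch. It assumes the PyPDF2 3.x naming (`PdfReader`, `PdfWriter`, `add_page`, `rotate`); older PyPDF2 releases use camelCase names such as `PdfFileReader` instead. The helper names `chunk_ranges` and `split_and_rotate`, the output file naming scheme, and all parameters are hypothetical and are not taken from the course code.

```python
from typing import List, Tuple


def chunk_ranges(num_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page index ranges for splitting."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_and_rotate(src_path: str, out_prefix: str, chunk_size: int = 1,
                     angle: int = 0, password: str = "") -> List[str]:
    """Split a PDF into chunks, optionally rotating and encrypting each one."""
    # Imported here so chunk_ranges() can be used without PyPDF2 installed.
    # Assumption: PyPDF2 3.x naming; older releases use PdfFileReader etc.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    # Metadata may be missing entirely, so guard against None.
    print("Title:", reader.metadata.title if reader.metadata else None)

    written = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            page = reader.pages[index]
            if angle:
                page.rotate(angle)  # angle must be a multiple of 90
            writer.add_page(page)
        if password:
            writer.encrypt(password)  # sets the user password
        # Hypothetical naming scheme: 1-based, inclusive page numbers.
        out_path = f"{out_prefix}_{start + 1}-{stop}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        written.append(out_path)
    return written


# Planning a split of a 5-page document into 2-page chunks:
print(chunk_ranges(5, 2))  # [(0, 2), (2, 4), (4, 5)]
```

Keeping the page-range arithmetic in its own function makes it easy to check on its own, while the function that touches files stays small. A call such as `split_and_rotate("input.pdf", "part", chunk_size=2, angle=90)` would write one rotated file per two-page chunk, assuming `input.pdf` exists.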

Also keep an eye on the newer PyPDF4 package as it will likely replace PyPDF2 soon. You might also want to check out pdfrw, which can do many of the same things that PyPDF2 can do.

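If you’d like to learn more about working with PDFs in Python, check out the following resources:

The PyPDF2 website
The PyPDF4 GitHub page
The pdfrw GitHub page
The ReportLab website
The PDFMiner GitHub page
Camelot: PDF Table Extraction for Humans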

Downloads:

Course Slides (.pdf), 153.8 KB
Sample Code (.zip), 2.7 KB
Course Documents (.zip), 3.4 MB

Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.

00:00 Welcome to the sixth and final part of the Real Python course on how to work with PDFs in Python. This course covered PyPDF2 history, an alternative PDF manipulation package called pdfrw, and the installation of the PyPDF2 module. Extracting document metadata was then covered, followed by rotating pages, merging and splitting PDFs, adding watermarks, and encryption.

00:28 This was all done using the PyPDF2 module. Moving forward, you should also keep an eye out for PyPDF4, which is likely to replace PyPDF2 soon. As mentioned earlier, checking out pdfrw might also help you as it has much the same functionality and capabilities as PyPDF2, barring encryption.

00:51 If you’d like to learn more about using Python to work with PDFs, you should check out the following resources, all of which are linked below the video. Again, if you download the slide presentation, each line is itself a link. The list starts with the PyPDF2 website, followed by the GitHub pages for PyPDF4 and pdfrw, the ReportLab website, the GitHub page for PDFMiner, which as mentioned earlier is a more robust option for extracting text from a PDF, and Camelot: PDF Table Extraction for Humans. Well done on completing the Real Python course for working with PDFs in Python. I’m Andrew Stephen, and thanks for joining me on this road of PDF manipulation. See you next time.

mikesult on March 1, 2020

Thank you Andrew for a great and very useful tutorial. I learned a lot about working with PDFs. I use pdf files as music charts quite a bit and these techniques will be very useful to split, merge and organize charts from pdf books. I appreciate your links to additional resources too.

fahmico on March 5, 2020

Thank you for the tutorial! You explain very well. ^_^ This is really worth learning.

Andrew Stephen RP Team on March 6, 2020

Hi @mikesult. Thanks for the feedback, glad you enjoyed the course and that you will be getting almost immediate real world use from what you have learnt.

Andrew Stephen RP Team on March 6, 2020

Hi @fahmico, Thanks for the kind words. Glad you enjoyed it!

rgusaas on March 7, 2020

Ditto on excellent presentation. The ReportLab reference was a real eye opener. Greatly appreciated.

Perhaps another lesson on reading a PDF’s contents. I wrote a PDF reader that would split a 100+ Page invoice document into separate pages and pulled the account manager name, invoice number and job number for the output file naming convention. Seems that most of the world struggles with how to strip out contents or search the contents of PDF files.

sion on March 23, 2020

Many thanks for an excellent and useful presentation. Some years ago I scraped PDFs for this information. It was MESSY. Now, “never again.” Thank you.

Alan ODannel on April 14, 2020

Very informative lesson. I’ll be able to put this to use in the near future.

dthomas01 on April 14, 2020

I’m late to the party... really enjoyed this tutorial. Thought I would mention that PyPDF2 hangs in the middle of writing out the encrypted PDF file. Switching to the newer PyPDF4 you mentioned earlier solved that issue. I’m using Python 3.7 on Windows 10 Pro. The rest of the programs ran flawlessly. Very impressive and hope you keep up the good work, Andrew!

Felix M on May 24, 2020

Very informative course. Thank you!

andresfmesad on Sept. 14, 2021

Very well explained! Is there a way to write a pandas dataframe to a PDF file and specify some format?

Hugh Tipping on Sept. 14, 2021

Very happy with this presentation. It gives a solid foundation in starting to work with PDFs with enough outside reference material to keep me busy for a long time. Many thanks.
