Extracting Metadata and Rotating Pages

How to Work With a PDF in Python Andrew Stephen 11:50

Now that we have the PyPDF2 package installed, let’s take a look at how to extract document metadata and begin to manipulate PDFs, starting with page rotation.

00:00 Welcome to part 3 of working with PDFs in Python. Now that we have covered the history and installation of PyPDF2, let’s take a look at extracting some document metadata.

00:11 This can be a useful task if you are doing certain types of automation on your pre-existing PDFs. Currently, the types of data that can be extracted are these: author, creator, producer, subject, title, and number of pages. Now, for the example we’re going to look at, you need to find a PDF. Any PDF that you have on your PC will do.

00:35 However, to keep things consistent, I have grabbed the same PDF as was used in the written tutorial of this course, which was a free sample of a book by Michael Driscoll through Leanpub.

00:46 There will be a link to that document below the video if you wish to get the same document.

00:51 The document is called reportlab-sample.pdf. Now, let’s take a look at some code that will give you access to the attributes that I mentioned earlier. Here, you import PdfFileReader from PyPDF2.

01:07 That’s line 3 here. This is a class that has several methods for interacting with PDF files. In this example, you call the .getDocumentInfo() method, which will return a DocumentInformation instance.

01:21 This line just here.

01:24 This contains most of the information that was mentioned earlier, which of course was author, creator, producer, subject, title, and number of pages. You also call .getNumPages() on the reader object, which returns the number of pages within the document.

01:40 That’s the only thing that’s not returned by the .getDocumentInfo() line. The information variable has several instance attributes that you can use to get the rest of the metadata you are after. You then print out that information as well as return it for future use if needed. The information that is extracted from the document is then put into an f-string, or formatted string literal. Now, in case you’re unfamiliar with what an f-string is—as they were only introduced in Python 3.6—these are string literals that have an f at the beginning, as you can see just here, and curly braces containing expressions that will be replaced with their values. For example, f'{pdf_path}', f'{information.author}', f'{information.creator}', et cetera, et cetera, as you go down.
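The on-screen script is not reproduced in this transcript, so here is a minimal sketch of what the narration describes. It assumes a PyPDF2 release older than 3.0, where PdfFileReader, .getDocumentInfo(), and .getNumPages() are available under those names. The variable names and the exact layout of the printed text are assumptions. The import sits inside the function only so that the sketch can be loaded on a machine without PyPDF2; the narration places it at the top of the file.

```python
def extract_information(pdf_path):
    """Print and return the metadata of the PDF at pdf_path."""
    # Assumption: PyPDF2 < 3.0. Imported here (rather than at the top of the
    # file, as in the video) so this sketch loads even without PyPDF2.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

        # Build the report while the file is still open
        txt = (
            f"Information about {pdf_path}:\n"
            f"Author: {information.author}\n"
            f"Creator: {information.creator}\n"
            f"Producer: {information.producer}\n"
            f"Subject: {information.subject}\n"
            f"Title: {information.title}\n"
            f"Number of pages: {number_of_pages}\n"
        )

    print(txt)
    return information
```

Calling extract_information('reportlab-sample.pdf') with the sample file in the working folder should print one line per attribute and hand back the information object for later use.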

02:26 These f-strings are evaluated at runtime, so they allow any and all valid Python expressions within them. This could include anything from inline arithmetic, calling functions or methods, or even using an object that is created from a class.
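As a small illustration of the kinds of expressions mentioned here, the sketch below puts inline arithmetic, a method call, and an attribute of an object inside f-strings. The Document class is hypothetical and exists only for this example; the title and page count are the values reported for the sample PDF in this lesson.

```python
class Document:
    """Hypothetical stand-in for an object holding PDF metadata."""

    def __init__(self, title, pages):
        self.title = title
        self.pages = pages


doc = Document("ReportLab - PDF Processing with Python", 54)

# Inline arithmetic inside the curly braces
arithmetic = f"Half of the pages: {doc.pages // 2}"

# Calling a method inside the curly braces
method_call = f"Upper-case title: {doc.title.upper()}"

# Reading an attribute of an object created from a class
attribute = f"Title: {doc.title}"

print(arithmetic)
print(method_call)
print(attribute)
```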

02:41 If you want to learn more about f-strings, there is—of course—a Real Python tutorial on the subject, which is linked below the video. If you wanted to attempt to extract text from a document, then it is recommended that you instead use PDFMiner as opposed to PyPDF2.

02:57 While PyPDF2 does have a .extractText() method, which can be used on a page object, it is a little inconsistent in that when extracting text, it will occasionally return an empty string.
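One way to cope with the empty-string behavior is to check the result before using it. The helper below is a hypothetical sketch, not part of PyPDF2: it assumes a reader object from PyPDF2 older than 3.0 that offers .getPage(), and page objects that offer .extractText(). Treating whitespace-only results as empty is an assumption made for this example.

```python
def page_text_or_none(pdf_reader, page_number):
    """Return the text of one page, or None if nothing was extracted."""
    page = pdf_reader.getPage(page_number)
    text = page.extractText()
    # An empty or whitespace-only result is treated as "no text found"
    return text if text.strip() else None
```

A caller can then test for None and fall back to another tool for that page.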

03:09 PDFMiner is just a little more robust and was specifically designed for extracting text from PDFs. A link to PDFMiner is below the video if you do wish to check it out. If we take another look at this example, and we quickly have a look at the final three lines—so lines 25, 26, and 27—you get if __name__.

03:31 Now, keep in mind in Python that double underscore (__) is referred to as dunder, so moving forward in this course, if I need to refer to it again, I will be calling it dunder.

03:41 So if __name__ == '__main__': then path = 'reportlab-sample.pdf', and then it calls extract_information(), passing it the path which is 'reportlab-sample.pdf'.

03:59 Now, this code block is used to control when the extract_information() function runs. Of course, this was the one that we defined up here. With this small code block, extract_information() will run when the file is executed as a script from the command line, but not when the file is imported as a module into another program, at least not until you call it yourself. For a deeper dive into this matter, check out the Real Python tutorial on the subject. A link to it is below the video, and I must say that it’s a really, really great tutorial. Now, they all are on Real Python, don’t get me wrong, but this is actually the tutorial that finally made me understand the purpose of that little code block when I first got into Python.
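The behavior described above can be seen with a small stand-in. In this sketch extract_information() only reports what it would do, so it runs without any PDF or third-party package; the script name in the comment is an assumption.

```python
def extract_information(pdf_path):
    # Stand-in for the real function: it only reports what it would do
    message = f"Extracting information from {pdf_path}"
    print(message)
    return message


if __name__ == "__main__":
    # Runs only when this file is executed directly, for example with
    # `python extract_doc_info.py` (hypothetical file name), and not when
    # the file is imported as a module
    path = "reportlab-sample.pdf"
    extract_information(path)
```

When another file imports this one, the function becomes available but nothing is printed until that file calls it.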

04:40 It’s a really great tutorial and everything is explained very, very well. While we’re on this little code block at the bottom, please note that the path variable needs to match the file name of the PDF you wish to extract the metadata from. The easiest way to do this is to have the file itself in the same folder as the script, and then just set your path variable to the file name, which is what I’ve done here: 'reportlab-sample.pdf'.

05:07 If the document is in a different folder, then you’ll need to include the path of the file in the path variable, and that can get a little bit awkward at times, so it’s easier just to have them both in the same working folder.
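If the PDF does live in another folder, the standard library's pathlib module can take some of the awkwardness out of building the path. The helper below is a hypothetical sketch; its name and the early existence check are assumptions made for illustration.

```python
from pathlib import Path


def locate_pdf(filename, folder="."):
    """Return the full path to filename inside folder, or raise if missing."""
    pdf_path = Path(folder) / filename
    # Fail early with a clear message instead of deep inside the PDF library
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")
    return pdf_path
```

Wrapping the result in str() gives a plain string that can be assigned to the path variable.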

05:19 So, if we take a quick look at this and run the current script, there you go! You can see that it extracts the information about reportlab-sample.pdf, with the Author: Michael Driscoll, the Creator, and the Producer. Subject, interestingly, is None. The Title is ReportLab - PDF Processing with Python, and there are 54 pages. Very good!

05:44 Now let’s take a look at rotating pages. Every now and then you will encounter a PDF that has pages that are in landscape mode instead of portrait mode, or vice versa, or potentially even upside down.

05:56 Now, you could print that out and read the paper version, or you could make the pages rotate using the power of Python. For this example, you can print out a Real Python article. Again, I have used the same article from the written version of this tutorial, and a link can be found below the video to the tutorial that I took a printed version of. The PDF itself will also be available for download.

06:21 Let’s take a look at how we can rotate a few of the pages within this article. Here, you need to import the PdfFileWriter as well as the PdfFileReader from PyPDF2, because you need to write to a new PDF.

06:36 The rotate_pages() function starts by taking in the path of the PDF that you want to modify. Now, within that function, you need to create a reader object and a writer object, which is what the first two lines do. These are named pdf_reader and pdf_writer, respectively. Next, you use the .getPage() method to navigate to the desired page, which is just here. Here, we use page 0, which is the first page of the PDF.

07:07 Then, you call the .rotateClockwise() method and pass it 90 degrees.

07:13 You then pass the rotated page, which is called page_1, as you can see here, to the .addPage() method of the pdf_writer object created earlier. Then, for page_2, you call .rotateCounterClockwise() and pass it 90 degrees as well, just here. You then do the same thing as with page_1 and pass it to the writer object with the .addPage() method.

07:43 At this point, it’s important to note that the PyPDF2 package only allows rotations in multiples of 90 degrees. If you use a number like 80 or 45, or basically any number that’s not a multiple of 90, then you will encounter an AssertionError.
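If the angle comes from user input, it can be checked before it ever reaches PyPDF2. The helper below is a hypothetical sketch of that idea; it raises a ValueError with a readable message, which is a design choice for this example rather than what PyPDF2 itself does.

```python
def check_rotation_angle(angle):
    """Return angle unchanged if it is a multiple of 90, else raise."""
    if angle % 90 != 0:
        raise ValueError(
            f"Rotation must be a multiple of 90 degrees, got {angle}"
        )
    return angle
```

For example, check_rotation_angle(180) returns 180, while check_rotation_angle(45) raises the ValueError.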

08:03 You can see here that for page_3, I take page 3 from the article and rotate it counterclockwise again, but this time I pass it 180 degrees.

08:14 This is just to show that it does rotate all the way around. As mentioned, 180 is a multiple of 90, so this will be okay. Alternatively, you could also call the .rotateCounterClockwise() method twice on the same page and pass it 90. Either approach will work; this one is just a little bit cleaner. And again, you then add that page, page_3, to the pdf_writer object using the .addPage() method. Finally, we add a page in normal orientation, for lack of a better word.

08:47 Because we’re not manipulating this page in any way, we can just add it directly by passing the .getPage() method of the reader object to the .addPage() method of the writer object.

09:00 And that’s what this line here is doing.
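The claim that two 90-degree counterclockwise turns equal one 180-degree turn can be checked with simple modular arithmetic. The function below is a hypothetical helper that does not touch PyPDF2; the sign convention (positive for clockwise, negative for counterclockwise) is an assumption made for this sketch.

```python
def net_rotation(*turns):
    """Combine signed rotation angles into one clockwise angle from 0 to 359.

    Positive values are clockwise, negative values are counterclockwise.
    """
    return sum(turns) % 360


# Two counterclockwise quarter turns and one counterclockwise half turn
twice_90 = net_rotation(-90, -90)
once_180 = net_rotation(-180)

# A clockwise half turn lands on the same orientation
clockwise_180 = net_rotation(180)

print(twice_90, once_180, clockwise_180)
```

All three values come out as 180, which is why the direction does not matter when a page is simply being turned upside down.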

09:03 If we then take a look at these two lines here, lines 20 and 21, with the with open() notation: this code block creates a context manager, which means that the script will automatically acquire and release resources as required. In this case, it gives you a file handle, which is what fh stands for. It will open the 'rotate_pages.pdf' file, or create it if it doesn’t already exist, in write and binary mode, and that’s what 'wb' means.

09:34 So the open() function takes two arguments here: the file you want to open and the mode you want to open it in. Binary mode means that the contents are read and written as raw bytes, without any text encoding or decoding.
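To see the 'wb' mode and the context manager at work without involving a PDF library, the sketch below writes a few raw bytes to a file in a temporary folder and reads them back. The file name and the bytes written are arbitrary choices for this example.

```python
import os
import tempfile

# Work in a temporary folder so no real file is overwritten
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "example.bin")

    # 'wb' creates the file (or empties an existing one) for binary writing
    with open(target, "wb") as fh:
        fh.write(b"%PDF-1.4\n")

    # Leaving the with block has already closed the file
    file_was_closed = fh.closed

    # 'rb' reads the same raw bytes back, with no decoding applied
    with open(target, "rb") as fh:
        data = fh.read()

print(file_was_closed, data)
```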

09:47 If you want more information on this, you can check the official Python documentation. A link to that is going to be below the video. Once the file is open, the script calls the .write() method of the pdf_writer object, which writes all the pages that have been added to it to the 'rotate_pages.pdf' file.
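Putting the narrated steps together, a sketch of the rotation script might look like the following. It is a reconstruction from the narration, not the code shown on screen, and it assumes a PyPDF2 release older than 3.0. The output_path parameter and the choice of index 3 for the unrotated page are assumptions, and the import sits inside the function only so the sketch can be loaded without PyPDF2 installed.

```python
def rotate_pages(pdf_path, output_path="rotate_pages.pdf"):
    """Write a new PDF containing four pages of pdf_path, three rotated."""
    # Assumption: PyPDF2 < 3.0, which provides these class and method names
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(pdf_path)
    pdf_writer = PdfFileWriter()

    # First page: 90 degrees clockwise
    page_1 = pdf_reader.getPage(0).rotateClockwise(90)
    pdf_writer.addPage(page_1)

    # Second page: 90 degrees counterclockwise
    page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
    pdf_writer.addPage(page_2)

    # Third page: 180 degrees, which turns it upside down
    page_3 = pdf_reader.getPage(2).rotateCounterClockwise(180)
    pdf_writer.addPage(page_3)

    # Fourth page (index assumed): added in its normal orientation
    pdf_writer.addPage(pdf_reader.getPage(3))

    # Write every added page to the output file in binary mode
    with open(output_path, "wb") as fh:
        pdf_writer.write(fh)
```

Note that the reader is created from the pdf_path parameter, so the function does not depend on any variable defined outside it.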

10:03 The context manager will then automatically close the file and release the resources it was using once the with block has finished. An important note about the name of the output file, just here, is that Python won’t ask you to confirm that you wish to overwrite an existing document if a document by that name already exists.

10:23 So what you need to do is make sure that the name you want to use is not already used by a different file, because you will lose the existing file if you run this script without checking first. Just a quick note. Now, finally, within the script we have this now-familiar code block that checks __name__ == '__main__' and then calls rotate_pages(), passing it the path 'Jupyter_Notebook.pdf'. Again, make sure that your path matches the file name of the document you want to manipulate, and make sure that the document is in the same folder as your script. It just makes life a little bit easier. Now, if we run this script, there we go, and check what the rotate_pages.pdf file looks like, we get something like this. Page 1 is rotated clockwise 90 degrees.

11:19 Page 2 is counter-clockwise 90 degrees.

11:23 And page 3 is upside down, or rotated counterclockwise 180 degrees. Of course, you can use clockwise 180 degrees—they are essentially the same in that you get the same result, so if you’re just turning a page upside down, it makes no difference.

11:38 And then you’ve got your final page, which is in the correct orientation.
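Regarding the earlier warning about silently overwriting an existing file: Python's open() also supports an exclusive-creation mode, 'x', which raises FileExistsError instead of replacing a file. The helper below is a hypothetical sketch of using it for the output PDF; it is not part of the script shown in the video.

```python
def open_new_file(path):
    """Open path for binary writing, refusing to overwrite an existing file."""
    # 'x' means exclusive creation: open() raises FileExistsError
    # if something already exists at path
    return open(path, "xb")
```

It could stand in for the plain open() call, as in `with open_new_file('rotate_pages.pdf') as fh:`, so that a second run stops with an error rather than replacing the earlier output.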

11:42 Next up, you’re going to learn how to merge and split PDFs, but not until the next video. Hope to see you there.

rolandgarceau on Feb. 26, 2020

Where is the deep-dive for the explanation on the python interpreter from @4:15 in the video?

Chris Bailey RP Team on Feb. 26, 2020

Hi @rolandgarceau, The link that Andrew is referencing @4:15 is the one called “Defining Main Functions in Python”.

Phil M on Feb. 26, 2020

Thank you, I had that same question. :)

mikesult on Feb. 29, 2020

I think line 7 should read

pdf_reader = PdfFileReader(pdf_path)

instead of

pdf_reader = PdfFileReader(path)

since the parameter passed in to rotate_pages() was named ‘pdf_path’.

It works as written but I think it’s because ‘path’ from the if __name__ == '__main__' section is available to rotate_pages() in this situation.

Is this correct?

Chris Bailey RP Team on March 1, 2020

Hi @mikesult, You are correct on both accounts. If you were to import this file as a module and try to use rotate_pages('Juypter_Notebook_An_Introduction.pdf') it would fail as path is not defined. I will mention this to the team. Thanks.

gracetan on May 6, 2020

Hi Chris, I am wondering how I can find the edited documents? For example, if I rotated the documents, I want to see the updated documents.
