Creating and Modifying PDF Files
The PDF, or Portable Document Format, is one of the most common formats for sharing documents over the internet. PDF files can contain text, images, tables, forms, and even rich media like videos and animations, all in a single file. The format is platform independent: a PDF looks the same regardless of the software, hardware, or operating system used to view it. Fortunately, the Python ecosystem has great packages for reading, manipulating, and creating PDF files.
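Note: You need to install a package named “PyPDF2” that can handle files with the ‘.pdf’ extension. The class and method names used below (PdfFileReader, getNumPages(), and so on) belong to the older PyPDF2 API and were removed in PyPDF2 3.0, so install a version earlier than 3.0:
- python3 -m pip install "PyPDF2<3.0"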
Extract Text From a PDF
For practice, the examples below assume a file named “abc_python.pdf” in your home directory.
# Open IDLE’s interactive window and import the PdfFileReader class from the PyPDF2 package:
from PyPDF2 import PdfFileReader
# To create a new instance of the PdfFileReader class, you’ll need the path to the PDF file that you want to open. Let’s get that now using the pathlib module:
from pathlib import Path
pdf_path = Path.home() / "abc_python.pdf"
# You may need to change pdf_path so that it corresponds to the location of the file on your computer.
# Now create the PdfFileReader instance:
pdf = PdfFileReader(str(pdf_path))
# pdf_path is converted to a string because PdfFileReader doesn’t know how to read from a pathlib.Path object.
# You don’t need to worry about opening or closing the PDF file. The PdfFileReader object does all of this for you.
# The getNumPages() method returns the number of pages contained in the PDF file:
pdf.getNumPages()
# You can also access some document information using the .documentInfo attribute:
pdf.documentInfo
# To get the title, use the title attribute:
pdf.documentInfo.title
The documentInfo object contains the PDF metadata, which is set when a PDF is created. The PdfFileReader class is the gateway to working with PDF files in Python. It provides all the necessary methods and attributes needed to access data in a PDF file.
Putting the above code together:
from PyPDF2 import PdfFileReader
from pathlib import Path
# Path to the PDF file
pdf_path = Path.home() / "abc_python.pdf"
# Create a PdfFileReader instance
pdf = PdfFileReader(str(pdf_path))
# Get the number of pages
num_pages = pdf.getNumPages()
# Access document information
document_info = pdf.documentInfo
title = document_info.title
# Extract text from the first page
first_page_text = pdf.getPage(0).extractText()
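Because PdfFileReader is handed a plain string, a wrong location only shows up when the file is opened. As a minimal sketch, the check below validates the path before the reader is created. The name require_pdf is hypothetical and not part of PyPDF2; it uses only the standard library.

```python
from pathlib import Path


def require_pdf(path):
    """Return path as a Path if it points to an existing .pdf file.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a .pdf file: {path}")
    return path
```

With this in place, the reader could be created as PdfFileReader(str(require_pdf(pdf_path))), so a mistyped path fails early with a clear message.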
Extract Text From a Page
PDF pages are represented in PyPDF2 with the PageObject class. You use PageObject instances to interact with pages in a PDF file. You can get an object representing a specific page by passing the page’s index to the PdfFileReader.getPage() method:
first_page = pdf.getPage(0)
type(first_page)
<class 'PyPDF2.pdf.PageObject'>
# You can extract the page’s text with the PageObject.extractText() method:
first_page.extractText()
# Use a for loop to loop over all the pages in the PDF and print their text:
for page in pdf.pages:
    print(page.extractText())
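If you would rather collect the text than print it, the same loop can be wrapped in a function. This is only a sketch: collect_text is a hypothetical name, and it assumes each page object exposes the extractText() method used above.

```python
def collect_text(pages, separator="\n"):
    """Join the extracted text of every page into one string.

    Hypothetical helper: `pages` is any iterable of objects that provide
    an extractText() method, such as pdf.pages in the examples above.
    """
    return separator.join(page.extractText() for page in pages)
```

For example, all_text = collect_text(pdf.pages) gives you a single string that you can search or save to a text file.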
Extract Pages From a PDF
Now you’ll learn how to extract a page, or a range of pages, from an existing PDF and save them to a new PDF.
The PdfFileWriter class is used to create a new PDF file. In IDLE’s interactive window, import the PdfFileWriter class and create a new instance called pdf_writer:
from PyPDF2 import PdfFileWriter
# Create a PdfFileWriter instance
pdf_writer = PdfFileWriter()
# Add a blank page
blank_page = pdf_writer.addBlankPage(width=72, height=72)
# Write the contents to a new PDF file
with Path("blank.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)
This creates a new file called blank.pdf in your current working directory.
PdfFileWriter objects can write to new PDF files, but they can’t create new content from scratch other than blank pages.
Extracting a Single Page From a PDF
We’ll open the PDF using a PdfFileReader class instance, extract the first page of the PDF, and then create a new PDF file containing just the single extracted page.
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
# Now open the abc_python.pdf file with a PdfFileReader instance:
pdf_path = Path.home() / "abc_python.pdf"
input_pdf = PdfFileReader(str(pdf_path))
first_page = input_pdf.getPage(0)
pdf_writer = PdfFileWriter()
pdf_writer.addPage(first_page)
# Now write the contents of pdf_writer to a new file called first_page.pdf:
with Path("first_page.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)
You now have a new PDF file saved in your current working directory with the name first_page.pdf that contains the first page of the abc_python.pdf file.
Extract Multiple Pages From a PDF File
# Using for loops, you can extract multiple pages from a PDF file and save them to a new PDF.
from PyPDF2 import PdfFileReader, PdfFileWriter
from pathlib import Path
pdf_path = Path.home() / "abc_python.pdf"
input_pdf = PdfFileReader(str(pdf_path))
# Our goal is to extract the pages at indices 1, 2, and 3, add these to a new PdfFileWriter instance, and then write them to a new PDF file.
pdf_writer = PdfFileWriter()
for n in range(1, 4):
    page = input_pdf.getPage(n)
    pdf_writer.addPage(page)
# Now pdf_writer has three pages added to it, which you can check with the .getNumPages() method:
pdf_writer.getNumPages()
# Finally, you can write the extracted pages to a new PDF file:
with Path("chapter1.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)
Now you can open the chapter1.pdf file in your current working directory to read just the three pages extracted from abc_python.pdf.
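Page indices start at 0, so the indices 1, 2, and 3 used above are the second, third, and fourth pages of the document. As a hedged sketch, the hypothetical helper below converts 1-based page numbers into the range of 0-based indices that getPage() expects, and checks them against the page count.

```python
def page_indices(first_page, last_page, num_pages):
    """Convert an inclusive range of 1-based page numbers to 0-based indices.

    Hypothetical helper; num_pages would come from getNumPages().
    """
    if not 1 <= first_page <= last_page <= num_pages:
        raise ValueError(
            f"Pages {first_page}-{last_page} are outside 1-{num_pages}"
        )
    return range(first_page - 1, last_page)


# Pages 2 through 4 of a 10-page document are indices 1, 2, and 3.
print(list(page_indices(2, 4, num_pages=10)))  # prints [1, 2, 3]
```

The result could replace range(1, 4) in the loop above, for instance page_indices(2, 4, input_pdf.getNumPages()).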
Create a PDF File From Scratch
The PyPDF2 package is great for reading and modifying existing PDF files, but it has a major limitation: apart from blank pages, you can’t use it to create new PDF content.
We use the ReportLab toolkit to generate PDF files from scratch. ReportLab is a full-featured PDF creation solution.
To get started, you need to install ReportLab with pip:
- python3 -m pip install reportlab
from reportlab.pdfgen.canvas import Canvas
# Create a new Canvas instance for hello.pdf
canvas = Canvas("hello.pdf")
# Add text to the PDF
canvas.drawString(72, 72, "Hello World")
# Save the PDF to a file
canvas.save()
You now have a PDF file called hello.pdf in your current working directory. You can open it with a PDF reader and see the text Hello World at the bottom of the page.
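The two numbers passed to drawString() are x and y coordinates in points (1/72 of an inch). ReportLab measures them from the bottom-left corner of the page, which is why the text lands near the bottom. To position text relative to the top instead, subtract from the page height. The sketch below shows only the arithmetic. y_from_top is a hypothetical helper, and the 792-point height assumes an 11-inch-tall page.

```python
def y_from_top(page_height, distance_from_top):
    """Return the y coordinate (in points, measured from the bottom edge)
    of a line that sits distance_from_top points below the top edge.

    Hypothetical helper for illustration; not part of ReportLab.
    """
    return page_height - distance_from_top


# One inch (72 points) below the top of an 11-inch-tall page (792 points).
y = y_from_top(792, 72)
print(y)  # prints 720
```

Passing that value as the second argument, as in canvas.drawString(72, y, "Hello World"), would place the text one inch from the top of such a page.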
Setting The Page Size
You can change the page size when you instantiate a Canvas object with the optional pagesize parameter. Sizes are given in points (1/72 of an inch), so to set the page size to 8.5 inches wide by 11 inches tall, which is 612 by 792 points, you would create the following canvas:
# Create a Canvas with a custom page size
custom_canvas = Canvas("hello_custom.pdf", pagesize=(612.0, 792.0))
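The conversion from inches to points is a multiplication by 72. A minimal sketch of that conversion, using only the standard library (inches_to_points is a hypothetical name, not a ReportLab function):

```python
POINTS_PER_INCH = 72.0


def inches_to_points(width_in, height_in):
    """Convert a page size in inches to a (width, height) tuple in points."""
    return (width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)


letter = inches_to_points(8.5, 11)
print(letter)  # prints (612.0, 792.0)
```

ReportLab also ships ready-made values for this: reportlab.lib.units.inch is the number of points per inch, and reportlab.lib.pagesizes contains common sizes such as LETTER and A4.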
Conclusion:
With PyPDF2 and ReportLab, you now have the tools to manipulate existing PDFs and create new ones from scratch: extracting text, copying pages into new documents, and generating PDFs with custom page sizes.