Creating and Modifying PDF Files

The PDF, or Portable Document Format, is one of the most common formats for sharing documents over the internet. PDF files can contain text, images, tables, forms, and even rich media like videos and animations, all in a single file. The format is platform independent: a PDF is designed to display the same way regardless of the software, hardware, or operating system used to view it. Fortunately, the Python ecosystem has great packages for reading, manipulating, and creating PDF files.

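Note: To work with files that have the ‘.pdf’ extension, you need to install a package named “PyPDF2”. The examples in this article use the PdfFileReader and PdfFileWriter class names, which were removed in PyPDF2 3.0, so install a version below 3.0 to run them as written.

  • python3 -m pip install "PyPDF2<3.0"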

Extract Text From a PDF

The examples below assume a file named “abc_python.pdf” in your home directory.

# Open IDLE’s interactive window and import the PdfFileReader class from the PyPDF2 package:

from PyPDF2 import PdfFileReader

# To create a new instance of the PdfFileReader class, you’ll need the path to the PDF file that you want to open. Let’s get that now using the pathlib module:

from pathlib import Path

pdf_path = Path.home() / "abc_python.pdf"

# You may need to change pdf_path so that it corresponds to the location of the file on your computer.

# Now create the PdfFileReader instance: 

pdf = PdfFileReader(str(pdf_path))

# pdf_path is converted to a string because PdfFileReader doesn’t know how to read from a pathlib.Path object.

# You don’t need to worry about opening or closing the PDF file. The PdfFileReader object does all of this for you.

# The getNumPages() method returns the number of pages contained in the PDF file: 

pdf.getNumPages()

# You can also access some document information using the .documentInfo attribute: 

pdf.documentInfo

# To get the title, use the title attribute:

pdf.documentInfo.title 

The documentInfo object contains the PDF metadata, which is set when a PDF is created. The PdfFileReader class is the gateway to working with PDF files in Python. It provides all the necessary methods and attributes needed to access data in a PDF file.

Putting the above code together:

from PyPDF2 import PdfFileReader

from pathlib import Path

# Path to the PDF file

pdf_path = Path.home() / "abc_python.pdf"

# Create a PdfFileReader instance

pdf = PdfFileReader(str(pdf_path))

# Get the number of pages

num_pages = pdf.getNumPages()

# Access document information

document_info = pdf.documentInfo

title = document_info.title

# Extract text from the first page

first_page_text = pdf.getPage(0).extractText()
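
Before handing a path to PdfFileReader, it can help to confirm that the file exists and has the expected extension. The following is a minimal sketch that uses only the standard library. The helper name check_pdf_path is hypothetical and not part of PyPDF2, and the demonstration uses a throwaway placeholder file rather than a real PDF.

```python
from pathlib import Path
import tempfile


def check_pdf_path(path):
    """Return the path as a string if it points to an existing .pdf file.

    Hypothetical helper for illustration; it only inspects the file
    name and does not validate the file's contents.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a .pdf file: {path}")
    return str(path)


# Demonstrate with a placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "abc_python.pdf"
    sample.write_bytes(b"%PDF-1.4\n")  # placeholder bytes, not a complete PDF
    checked = check_pdf_path(sample)
    print(checked.endswith("abc_python.pdf"))
```

The returned string could then be passed to PdfFileReader in place of str(pdf_path).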

Extract Text From a Page 

PDF pages are represented in PyPDF2 with the PageObject class. You use PageObject instances to interact with pages in a PDF file. You can get an object representing a specific page by passing the page’s index to the PdfFileReader.getPage() method:

first_page = pdf.getPage(0)

type(first_page)

<class 'PyPDF2.pdf.PageObject'>

# You can extract the page’s text with the PageObject.extractText() method: 

first_page.extractText() 

# Use a for loop to loop over all the pages in the PDF and print their text:

for page in pdf.pages:
    print(page.extractText())
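
If you would rather collect the text than print it, the same loop can build a single string. This is a rough sketch, not a PyPDF2 feature: join_page_text is a hypothetical helper that accepts any iterable of objects with an extractText() method, and StandInPage is a made-up stand-in so that the example runs without a PDF file.

```python
def join_page_text(pages):
    """Join the text of every page, labeling each page with its number.

    `pages` can be any iterable of objects with an extractText() method,
    such as the pdf.pages sequence used above.
    """
    parts = []
    for number, page in enumerate(pages, start=1):
        parts.append(f"--- Page {number} ---\n{page.extractText()}")
    return "\n".join(parts)


class StandInPage:
    """Hypothetical stand-in for a PDF page so the sketch runs without a PDF."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


combined = join_page_text([StandInPage("First page"), StandInPage("Second page")])
print(combined)
```

With a real document, you would call join_page_text(pdf.pages) and could write the result to a text file.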

Extract Pages From a PDF

Now you’ll learn how to extract a page, or a range of pages, from an existing PDF and save them to a new PDF.

The PdfFileWriter class is used to create a new PDF file. In IDLE’s interactive window, import the PdfFileWriter class and create a new instance called pdf_writer:

from PyPDF2 import PdfFileWriter

# Create a PdfFileWriter instance

pdf_writer = PdfFileWriter()

# Add a blank page

blank_page = pdf_writer.addBlankPage(width=72, height=72)

# Write the contents to a new PDF file

with Path("blank.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)

This creates a new file called blank.pdf in your current working directory.

PdfFileWriter objects can write to new PDF files, but they can’t create new content from scratch other than blank pages.

Extracting a Single Page From a PDF 

We’ll open the PDF using a PdfFileReader class instance, extract the first page of the PDF, and then create a new PDF file containing just the single extracted page.

from pathlib import Path 

from PyPDF2 import PdfFileReader, PdfFileWriter 

# Now open the abc_python.pdf file with a PdfFileReader instance:

pdf_path = Path.home() / "abc_python.pdf"

input_pdf = PdfFileReader(str(pdf_path))

# Get the first page (index 0) and add it to a new PdfFileWriter instance:

first_page = input_pdf.getPage(0)

pdf_writer = PdfFileWriter()

pdf_writer.addPage(first_page)

# Now write the contents of pdf_writer to a new file called first_page.pdf:

with Path("first_page.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)

You now have a new PDF file saved in your current working directory with the name first_page.pdf that contains the first page of the abc_python.pdf file.

Extract Multiple Pages From a PDF File 

# Using for loops, you can extract multiple pages from a PDF file and save them to a new PDF.

from PyPDF2 import PdfFileReader, PdfFileWriter 

from pathlib import Path 

pdf_path = Path.home() / "abc_python.pdf"

input_pdf = PdfFileReader(str(pdf_path)) 

# Our goal is to extract the pages at indices 1, 2, and 3, add these to a new PdfFileWriter instance, and then write them to a new PDF file.

pdf_writer = PdfFileWriter() 

for n in range(1, 4):
    page = input_pdf.getPage(n)
    pdf_writer.addPage(page)

# Now pdf_writer has three pages added to it, which you can check with the .getNumPages() method: 

pdf_writer.getNumPages() 

# Finally, you can write the extracted pages to a new PDF file: 

with Path("chapter1.pdf").open(mode="wb") as output_file:
    pdf_writer.write(output_file)

Now you can open the chapter1.pdf file in your current working directory to read just those three pages of abc_python.pdf.
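
Page indices start at 0, so range(1, 4) selects the second, third, and fourth pages of the document. If you think in page numbers as shown in a PDF viewer, a small conversion helps avoid off-by-one mistakes. The sketch below is only an illustration, and page_indices is a hypothetical helper name, not something PyPDF2 provides.

```python
def page_indices(first_page, last_page):
    """Convert 1-based, inclusive page numbers to 0-based page indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return range(first_page - 1, last_page)


# Pages 2 through 4 in a viewer correspond to indices 1, 2, and 3
indices = list(page_indices(2, 4))
print(indices)  # [1, 2, 3]
```

Each index in the result could be passed to getPage() in the loop above.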

Create a PDF File From Scratch

The PyPDF2 package is great for reading and modifying existing PDF files, but it has a major limitation: apart from blank pages, you can’t use it to create new content from scratch.

To generate PDF files from scratch, we use the ReportLab toolkit, a full-featured PDF creation solution.

To get started, you need to install ReportLab with pip: 

  • python3 -m pip install reportlab

from reportlab.pdfgen.canvas import Canvas

# Create a new Canvas instance for hello.pdf

canvas = Canvas("hello.pdf")

# Add text to the PDF, 72 points (1 inch) from the left and bottom edges

canvas.drawString(72, 72, "Hello World")

# Save the PDF to a file

canvas.save()

You now have a PDF file called hello.pdf in your current working directory. You can open it with a PDF reader and see the text Hello World near the bottom-left corner of the page.
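
The text lands near the bottom because the canvas measures positions in points (72 per inch) from the bottom-left corner of the page. Below is a minimal sketch of converting a position measured from the top-left corner into canvas coordinates. It assumes a page that is 792 points (11 inches) tall, and from_top_left is a hypothetical helper name rather than a ReportLab function.

```python
POINTS_PER_INCH = 72
LETTER_HEIGHT = 11 * POINTS_PER_INCH  # assumed page height: 792 points


def from_top_left(x_inches, y_inches, page_height=LETTER_HEIGHT):
    """Convert inches measured from the top-left corner into canvas points."""
    x = x_inches * POINTS_PER_INCH
    y = page_height - y_inches * POINTS_PER_INCH
    return x, y


# One inch in from the left edge and one inch down from the top edge
position = from_top_left(1, 1)
print(position)  # (72, 720)
```

On a canvas whose page is 792 points tall, passing these two values as the first arguments to drawString() would place the text near the top of the page instead.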

Setting The Page Size

You can change the page size when you instantiate a Canvas object with the optional pagesize parameter. The size is given as a (width, height) tuple in points, where one point is 1/72 of an inch. For example, to set the page size to 8.5 inches wide by 11 inches tall, which is 612 by 792 points, you would create the following canvas:

# Create a Canvas with a custom page size

custom_canvas = Canvas("hello_custom.pdf", pagesize=(612.0, 792.0))
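
Since page sizes are expressed in points, with 72 points to the inch, the numbers can be computed rather than typed by hand. A small sketch of that arithmetic follows; inches_to_points is a hypothetical helper name used only for illustration.

```python
POINTS_PER_INCH = 72.0


def inches_to_points(width_inches, height_inches):
    """Convert a page size in inches to a (width, height) tuple in points."""
    return (width_inches * POINTS_PER_INCH, height_inches * POINTS_PER_INCH)


# 8.5 x 11 inches is the US Letter size
page_size = inches_to_points(8.5, 11)
print(page_size)  # (612.0, 792.0)
```

The resulting tuple can be passed as the pagesize argument. ReportLab also ships predefined sizes such as LETTER and A4 in its reportlab.lib.pagesizes module.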

Conclusion:

With PyPDF2 and ReportLab, you now have the tools to manipulate existing PDFs and create new ones from scratch. You’ve seen how to read metadata and extract text, copy pages into a new file, write text to a new document, and set its page size.
