Antonia's Website - Removing Background Images from PDF Documents

While board game makers have gotten extremely good at posting easily accessible PDF rules on their websites, the PDFs they offer usually have some fancy background that fits with the game theme. This may look great when printed as the glossy paper booklet that comes in the box, however, the lack of contrast yields a less than ideal experience when viewed on an e-ink screen due to the low contrast between the text and the background.

A Remarkable 2 showing the rules of a board game with a gray background decreasing contrast

So, to improve the readability of the rules, I thought that the easiest way would be removing the background images. However, due to the way PDF is designed and me being a total PDF noob with no clue how the format works turned what seemed like a ten minute task into several hours of work, but I think I’ve found a workflow that works okay and could be made a lot easier with a tiny bit of programming, based off this reddit thread.

Requirements §

For the workflow, we need qpdf, mutool and ghostscript, which can be installed with the following command when running Debian:

apt-get install qpdf mupdf-tools ghostscript

Finding the background images §

The first step is to convert the PDF to a format that is more inspectable:

qpdf -qdf Input.pdf Input_e.pdf

Then, using this file, we can inspect it using mutool:

mutool info -I Input_e.pdf

This command outputs a massive and hard to understand list of all the images used inside the PDF document. It should look like this:

[...]
Retrieving info from pages 1-20...
Images (570):
	1	(9 0 R):	[ DCT ] 796x1126 8bpc ICC (61 0 R)
	1	(9 0 R):	[ Raw ] 139x45 8bpc ICC (63 0 R)
	2	(10 0 R):	[ DCT ] 796x1126 8bpc ICC (75 0 R)
	2	(10 0 R):	[ Raw ] 806x1146 8bpc Idx (4441 0 R)
	2	(10 0 R):	[ DCT ] 806x1146 8bpc ICC (4443 0 R)
	2	(10 0 R):	[ DCT ] 51x49 8bpc ICC (4445 0 R)
	3	(11 0 R):	[ DCT ] 226x451 8bpc ICC (630 0 R)
	3	(11 0 R):	[ DCT ] 145x209 8bpc ICC (632 0 R)
	3	(11 0 R):	[ DCT ] 188x266 8bpc ICC (634 0 R)
	3	(11 0 R):	[ Raw ] 38x36 8bpc ICC (636 0 R)
	3	(11 0 R):	[ Raw ] 38x36 8bpc ICC (638 0 R)
	3	(11 0 R):	[ Raw ] 38x36 8bpc ICC (640 0 R)
	3	(11 0 R):	[ Raw ] 51x48 8bpc ICC (642 0 R)
[...]

The most relevant parameters are the first number in each row, which is the page number the image first occurs in, and the number in the last bracked, which is the object number, as well as the image resolution. Now, finding which images make up the background will require quite a bit of guesswork, a good heuristic is to take images that appear early in the PDF and have a very large resolution. The images can be inspected using

mutool extract Input_e.pdf <object number>

to double-check whether we have selected the right images. In the example PDF, it turned out that the images we were looking for are images 4441 and 4443. Image 61 looked like a good suspect at first glance, however, it turned out to be the entire cover of the rulebook.

Removing the images §

After having found the images, we can remove the object references using sed:

sed -e "s/<object number> 0 obj/999 0 obj/g" -i"" Input_e.pdf

This results in a broken pdf with broken references. Viewing it, at least using evince results in random images being used as the background. These issues can be fixed by running

mutool clean -g Input_e.pdf Output_broken.pdf
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=Output.pdf Output_broken.pdf

This should yield a PDF with white background, which is more readable on e-ink screens, especially in poor light environments:

A Remarkable 2 showing the rules of a board game, which are more readable due to the removed background

Conclusions §

There are probably way better ways to do this with more knowledge of the PDF standard, and I also have my doubts at how generalizable this method is. It worked fine on the PDFs I tried it on, but there are probably plenty of PDFs out there that won’t work. Furthermore, the process is quite convoluted but could be streamlined by writing a program that previews the images to be deleted to make spotting the background image much faster and easier.