As a Kindle user, one of the most daunting tasks I face is converting a PDF (completely readable on a computer, etc.) to a readable format for my e-reader. Out of almost every file format I’ve come across, PDF remains the most difficult and, sadly, one of the most widely used. This isn’t great for my Kindle, which doesn’t like PDF and prefers books in either .mobi or .epub format.
However, after visiting several guides and searching for three days for a solution, I’ve finally found one that works for me. I would love to share the procedure with you, but I have to warn you that not all PDF files share the same layout, and this guide may not be applicable in some situations. Throughout the whole method, you only need one application: Calibre. It is a highly efficient ebook library (and converter) and a must-have for all e-reading bookworms.
These are the components I’ll be discussing in this blog-post:
- When you convert a PDF file to another format, each line seems to form a paragraph on its own, causing an extremely frustrating layout. This will be cured using the line-unwrapping feature.
- Sometimes, you have page numbers in the footer of the PDF, and the whole alternating title/author text in the header. Unfortunately, when you convert this to another format, the text remains within the book itself, causing a painful and jolting reading experience. This will be fixed using the Search & Replace function.
- There are usually random line-breaks just above the header/footer, caused by an inconsistency in the newly formatted file. In order to fix this, we’ll be using the Search & Replace function as well.
If you have any other problems regarding the conversion, feel free to describe them in the comment section below. I am no expert, but I will try my best to assist you with the matter.
For the conversion, I will be using one of my favourite books of all time, The Picture of Dorian Gray, downloaded from Planet eBook. The following is what the format looks like:
As you can see, there is actually no header for this file; just a footer with alternating text (title and the source) and the page numbers. However, you will find controlling the header is more or less the same.
The first thing to do, after you’ve installed Calibre, is to click the “Add Books” button (the very first) at the top of the screen. Add the PDF file you wish to convert to your library. Secondly, click on the file, then click the “Convert Books” button on the toolbar, and then choose the “Convert Individually” option. You can see the instructions in the picture below.
You should get the following dialog box.
Now, you’re ready for the conversion process. 🙂 (PS: For the rest of the tutorial, please refer back to the above image. I will help you find the tools you need.) Also, I am converting this file to ePub, but it should work with other output formats as well.
The Unwrap Factor
To fix the fragmented sentences, click on the “PDF Input” button in the scroll box on the left side; it is eighth on the list. After clicking on it, you will see a Line-Unwrapping Factor with a default value of 0.4 next to it. Type “0.1” into the box, and you should be able to get rid of the fragmented lines. (If you still want to remove the headers/footers, don’t click “Okay” just yet.)
Here is the uncertain part: every PDF needs a different line-unwrapping factor, and finding it is really a process of guess-and-check with different values. If 0.1 doesn’t work for you, you may need to vary the number. If you find lines that should stay separate are being merged into paragraphs, raise the number. If you find the text is still fragmented and disjointed, lower the number. However, 0.1 has not failed for me yet, and I sincerely hope it works for you.
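If you get tired of clicking through the dialog for every guess, the same trial-and-error can be scripted. This is only a rough sketch: it assumes the ebook-convert command-line tool that is installed alongside Calibre and its --unwrap-factor option for PDF input (check ebook-convert --help on your version), and the file names are placeholders I made up.

```python
import shlex


def build_convert_command(pdf_path, epub_path, unwrap_factor=0.1):
    """Return the ebook-convert argument list for one PDF-to-ePub conversion."""
    if not 0 <= unwrap_factor <= 1:
        raise ValueError("unwrap factor must be between 0 and 1")
    return [
        "ebook-convert",  # assumed to be on your PATH after installing Calibre
        pdf_path,
        epub_path,
        "--unwrap-factor", str(unwrap_factor),
    ]


# One command per candidate value, each writing to its own output file,
# so the results can be compared side by side.
candidates = [0.1, 0.25, 0.4]
commands = [
    build_convert_command("dorian-gray.pdf", f"dorian-gray-{value}.epub", value)
    for value in candidates
]

for command in commands:
    print(shlex.join(command))
```

The script only prints the commands. To actually run one, you could pass the list to subprocess.run(command, check=True), and then open each output file to see which value gives the cleanest paragraphs.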
The Search and Replace Function
This is a slightly more complicated method, but I assure you, once you understand it, you’ll be fine. It uses very basic computer skills, but fear not if you’re not a technology person; just skip the explanations and copy instead. Click on the “Search and Replace” button on the side and then click on the “magic wand” image next to the “Search Regular Expression” box.
Use the following image below for reference.
As this is a tutorial on how to remove the page numbers, I won’t touch upon skipping pages, every odd/even page, etc. Firstly, click on the Heuristic Processing button on the sidebar, then check the box “Enable Heuristic Processing.” Change the Line-Unwrapping factor to 0.1 (or the one that works for you). Then, click “Italicize Common Words and Patterns.”
Simply copy the following expression into the Regex Builder.
If it matches a fair number of results, then you have officially identified the page number code (assuming you have page numbers in the footer that are converted into text). I am not entirely sure if it’s the same for every PDF file, but it has been for the ones I’ve handled. Click “Okay” (with the green check-mark) and make sure to keep the “Replacement Text” line absolutely blank.
You need to identify all the different pieces of code that surround the page numbers/footers, and if you have alternating text (eg. title/author), you need to deal with two different sets of lines.
>>> Example of what I mean
NOTE: The <a name=90></a> is also within the set. Sorry for any inconveniences.
Now, it’s time for the Search & Replace functions for the header and footer, but it’s basically a matter of copying/pasting what’s there on the sheet.
For Set 1 Only:
Basically, I am going to select the expression: Free eBooks a<a href="http://www.planetebook.com/">t Planet eBook.</a>com<br> (for one line). Make sure that the <br> is properly at the end of the expression, as this gets rid of the breaks caused by headers/footers. Also make sure the quotation marks are plain straight quotes, exactly as they appear in the converted file. Click “Okay” and add it to the list of Search/Replace functions (ensuring that the “Replace” box is blank).
For Set 2 Only:
Same procedure, just a simple The Picture of Dorian Gray<br> (All I did was copy/paste the line, and hopefully it works for you)
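One thing worth knowing when you copy/paste a line into the search box: the box treats it as a regular expression, so a character like . means “any character” rather than a literal full stop. Pasted lines still match themselves, which is why the copy/paste approach works, but they can also match look-alike text. Here is a small sketch using Python’s re module (as far as I know, Calibre uses the same regex syntax). The footer line is a simplified sample, and re.escape is just one way to build a strictly literal expression, not a step that Calibre requires.

```python
import re

# Simplified sample of a footer line (not the exact text from the converted file).
footer_line = "Free eBooks at Planet eBook.com<br>"

loose = footer_line              # pasted as-is: "." matches any character
strict = re.escape(footer_line)  # every special character is made literal

lookalike = "Free eBooks at Planet eBookXcom<br>"

print(bool(re.search(loose, lookalike)))     # True
print(bool(re.search(strict, lookalike)))    # False
print(bool(re.search(strict, footer_line)))  # True
```

For footers like these it rarely matters, but if the “Test” button shows more matches than you expected, an unescaped special character is a likely culprit.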
For Both Set 1 and 2:
These are expressions that are not exclusive to either set, but apply to the document as a whole (eg. you can see the presence of <hr/> in both sets, so it fits into this category).
- <hr/> (Add this to the list. This is the code for a horizontal rule, which shows up as a gap in between text)
- <a name=[0-9]+></a> (It’s basically the same method used for the page-numbers, but just to suit the <a> code)
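To see what these expressions do when they are all applied with a blank replacement, here is a minimal sketch in Python. The sample text is a made-up fragment modelled on the two sets above (the body lines are placeholders), and the loop is only my approximation of what the Search & Replace list does during conversion.

```python
import re

# Hypothetical fragment of the converted file, containing both footer sets.
sample = (
    'body text line one<br>\n'
    '<hr/>\n'
    '<a name=90></a>'
    'Free eBooks a<a href="http://www.planetebook.com/">t Planet eBook.</a>com<br>\n'
    'body text line two<br>\n'
    '<hr/>\n'
    '<a name=91></a>'
    'The Picture of Dorian Gray<br>\n'
    'body text line three<br>\n'
)

# The expressions from this section, each replaced with nothing.
patterns = [
    r'Free eBooks a<a href="http://www.planetebook.com/">t Planet eBook.</a>com<br>',
    r'The Picture of Dorian Gray<br>',
    r'<hr/>',
    r'<a name=[0-9]+></a>',
]

cleaned = sample
for pattern in patterns:
    cleaned = re.sub(pattern, "", cleaned)

# Show only the lines that still contain text.
for line in cleaned.splitlines():
    if line.strip():
        print(line)
```

Only the three body lines should be left over; the footers, the <hr/> rules and the <a name=...> anchors are all gone.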
Extra Note: I’ve noticed some inconsistencies in the document, in places where there’s a random (space)(space)0 or (space)0. This does not happen with normal PDFs, but if you have inconsistencies too, they’re easy to fix: simply copy the expression you want to get rid of, click “Test” and skim the results to check you’re not deleting any vital info. It’s very simple to address these minor issues.
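The “skim before you delete” habit is worth illustrating, because a short expression can easily match more than you intended. This is a rough sketch of the idea in Python; preview_matches is a helper I made up for illustration (it is not part of Calibre), and the sample text is invented.

```python
import re


def preview_matches(pattern, text, context=15):
    """Return each match with some surrounding text, for a quick visual check."""
    previews = []
    for match in re.finditer(pattern, text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        previews.append(text[start:end])
    return previews


# Invented sample: two stray zeros at line ends, plus a real "10" in the text.
sample = (
    "end of a line  0<br>\n"
    "Chapter 10 begins here<br>\n"
    "another line 0<br>\n"
)

# Too broad: a bare "0" also hits the zero in "Chapter 10".
print(len(preview_matches(r"0", sample)))        # 3

# Narrower: one or more spaces, a zero, then the line break.
print(len(preview_matches(r" +0<br>", sample)))  # 2
```

If the number of matches is higher than the number of stray characters you can see, tighten the expression before adding it to the list.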
And when I save this all, have a look at the final result:
No headers, no footers and no page-numbers; just easy, flowing text. However, as you’ve probably noticed, it’s a very manual process and requires a bit of time for each document (as each has different footer/header text), so it’s not that efficient. If you are looking to convert just a few documents (like me), this is a free method, and very easy once you get the hang of it. 🙂