Ebook Text formatter is a small, handy tool that cleans up text files before you copy them to your favorite eBook reader, which improves the reading experience. It addresses several issues:
- Paragraph breaks. Many websites provide texts that are preformatted for a specific screen width. When an eReader such as the Kindle or Sony PRS-505 reflows such a file, lines that are too long to fit the width of the screen have one or two words wrapped to the next line. This looks ugly and makes the book much less readable. Sometimes such files have extra spaces added to align the right edge; these don’t look nice on eReaders either. Ebook Text formatter restores paragraphs by joining all lines of text that don’t begin with whitespace, while preserving empty lines. It also collapses each run of consecutive whitespace characters into a single space. Although simple, this produces nice results in 99% of cases.
- HTML markup. Some files on websites like lib.ru contain HTML markup. Ebook Text formatter can replace every HTML tag with a space; the resulting runs of spaces are then collapsed into a single space as described above.
- Character encoding. Sometimes a text file is not in the encoding that the eReader expects, so you see garbage instead of text. For Russian characters the Kindle expects codepage 1251, while the Sony PRS-505 expects UTF-8. Ebook Text formatter can change the character encoding to the one your eBook reader expects.
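The cleanup rules above can be illustrated with a minimal Python sketch. This is not the tool's actual source (the tool itself is a .NET program); the names `strip_html` and `reflow` are hypothetical, and the default indent of 3 spaces mirrors the paragraphOffset setting described under Configuration. The sketch assumes that a line starting with whitespace begins a new paragraph and that every other non-empty line continues the current one.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")     # anything that looks like an HTML tag
SPACE_RE = re.compile(r"[ \t]+")    # a run of spaces and/or tabs


def strip_html(text):
    """Replace every HTML tag with a space; reflow() collapses the extras."""
    return TAG_RE.sub(" ", text)


def reflow(text, paragraph_offset=3):
    """Join hard-wrapped lines into paragraphs and collapse whitespace."""
    output = []    # finished output lines
    current = []   # lines of the paragraph being assembled

    def flush():
        # Emit the paragraph collected so far as a single indented line.
        if current:
            joined = SPACE_RE.sub(" ", " ".join(current)).strip()
            output.append(" " * paragraph_offset + joined)
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            output.append("")        # empty lines are preserved
        elif line[0] in " \t":
            flush()                  # leading whitespace starts a new paragraph
            current.append(line)
        else:
            current.append(line)     # continuation of the current paragraph
    flush()
    return "\n".join(output)
```

For example, `reflow(strip_html(raw_text))` would turn a file wrapped at 70 columns into one line per paragraph, which the eReader can then reflow to its own screen width.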
Usage:
- Download Ebook Text formatter and extract it to a directory of your choosing on your hard drive. It may be a good idea to put it into a directory referenced by the PATH environment variable so that you can run it from any directory.
- You will need the .NET Framework 3.5 runtime to run this program. It can be downloaded from the Microsoft website and installed free of charge.
- Open a command prompt by pressing Win-R, typing cmd and pressing ‘Enter’.
- Run EBookTextFormatter.exe with the following arguments: inFile [outFile] [inEncoding] [outEncoding]
  - inFile – required. Specifies either the local path to the source file that you would like to clean up or an HTTP URL from which the file can be downloaded. Please note that if the file on the Internet requires some kind of authentication (by password or cookie), the download will most likely fail.
  - outFile – required if inFile is a local path; it specifies the path where the resulting file is stored. If the input file is being downloaded from the Internet, you can skip this parameter and it will default to the file name on the website. Specifying ‘-’ (dash) as outFile will cause Ebook Text formatter to output the resulting text to the console.
  - inEncoding – either a codepage number (e.g. 1251) or utf8. Specifies which encoding the input file is in. If omitted, the encoding is chosen based on the config file (see below).
  - outEncoding – either a codepage number (e.g. 1251) or utf8. Specifies the output file encoding. For Russian text it should be 1251 for Kindle and utf8 for the Sony eBook reader. If omitted, the encoding is chosen based on the config file (see below).
  - Please note that you can’t skip a parameter if you want to specify one that comes after it. For example, if you would like to use a non-default input encoding, you must also specify outFile (even if you are happy with its default value).
- The output file is created. Copy it to your eBook reader if it’s not already there.
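The positional-argument rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's real code: `resolve_args`, `codec_name` and `transcode` are hypothetical names, the config-file defaults are passed in as plain strings, and a codepage number N is assumed to map to Python's `cpN` codec.

```python
import posixpath
from urllib.parse import urlparse


def resolve_args(argv, config_in, config_out):
    """Map 'inFile [outFile] [inEncoding] [outEncoding]' to four values."""
    if not argv:
        raise ValueError("inFile is required")
    in_file = argv[0]
    is_url = in_file.lower().startswith("http://")
    if len(argv) > 1:
        out_file = argv[1]
    elif is_url:
        # Default to the file name on the website.
        out_file = posixpath.basename(urlparse(in_file).path)
    else:
        raise ValueError("outFile is required when inFile is a local path")
    # Arguments are positional, so an encoding can only be given
    # when every argument before it is given too.
    in_enc = argv[2] if len(argv) > 2 else config_in
    out_enc = argv[3] if len(argv) > 3 else config_out
    return in_file, out_file, in_enc, out_enc


def codec_name(encoding):
    """Translate 'utf8' or a codepage number such as '1251' to a codec name."""
    enc = encoding.strip().lower()
    return "utf-8" if enc in ("utf8", "utf-8") else "cp" + enc


def transcode(data, in_enc, out_enc):
    """Decode bytes from the input encoding and re-encode for the reader."""
    text = data.decode(codec_name(in_enc))
    # Characters the target codepage cannot represent become '?'.
    return text.encode(codec_name(out_enc), errors="replace")
```

With these helpers, `resolve_args(["http://lib.ru/STRUGACKIE/wolny.txt"], "1251", "1251")` yields `wolny.txt` as the output name, matching the default described for downloaded files.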
Configuration: Besides EBookTextFormatter.exe there is also an EBookTextFormatter.exe.config file. It is an XML file that can be edited with any text editor (e.g. notepad.exe) and contains the following useful parameters:
- inEncoding – default value for the input file encoding. Either a codepage number or ‘utf8’. See the inEncoding command-line argument under usage above.
- outEncoding – default value for the output file encoding. Either a codepage number or ‘utf8’. See the outEncoding command-line argument under usage above. Normally you should set it according to the device that you have. For Russian text use 1251 for Kindle and utf8 for Sony PRS-500 and PRS-505.
- paragraphOffset – number of spaces to insert before the first line of each paragraph. Default value is 3.
- stripHtml – either True or False. If set to True, Ebook Text formatter will strip all HTML tags from the input file.
- overrideWebEncoding – either True or False. Normally the web server tells the client which encoding a particular file is in. However, this information is not always correct, and sometimes it is missing. Setting this parameter to True will cause Ebook Text formatter to ignore the encoding reported by the web server and use the encoding from either the command line (the inEncoding argument) or the config file (the inEncoding parameter).
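How these settings interact when choosing the input encoding can be sketched like this. The function name `pick_input_encoding` is hypothetical, and the precedence shown when overrideWebEncoding is False (server-reported encoding first, then command line, then config file) is an assumption drawn from the description above rather than from the program's source.

```python
def pick_input_encoding(server_charset, cli_encoding, config_encoding,
                        override_web_encoding):
    """Choose the encoding used to decode the input file.

    server_charset  -- charset reported by the web server, or None
    cli_encoding    -- inEncoding from the command line, or None
    config_encoding -- inEncoding from the config file
    """
    # The command line wins over the config file.
    local_choice = cli_encoding or config_encoding
    if server_charset and not override_web_encoding:
        # Assumption: trust the server unless told otherwise.
        return server_charset
    return local_choice
```

For instance, if a server mislabels a codepage 1251 file, setting overrideWebEncoding to True makes the sketch fall back to the command-line or config value.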
Example:
- Plug in your Kindle.
- Assuming it is mounted as drive K:, type K: in the command prompt window.
- Type cd \documents
- Type EBookTextFormatter.exe http://lib.ru/STRUGACKIE/wolny.txt
- Assuming you copied EBookTextFormatter.exe to a location referenced in the PATH environment variable, you didn’t change the default config file, and you have the Unicode font hack installed, you should now see a readable version of “Волны гасят ветер”, listed as wolny in your book list, ready to read on your Kindle once you unplug the USB cable.
I’ve been using this script personally for a while. It is not as polished as a true software product should be, but it’s usable. Let me know if you have problems using it or would like a feature added, and I’ll see what I can do. Source code is supplied with the executable file, and you are free to use it in any way you like. If you make changes to it, please let me know and I’ll try to incorporate them into the version that is available here on BlogKindle.com.