Textstat / newline fix

Input


Export


Info

Goal: This script preprocesses texts such as OCRed books and fixes their false linebreaks, so that the Enter and Space counts are more realistic for textstats and logical layout optimization. It replaces each single linebreak with a space but keeps double linebreaks intact.
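For illustration, the rule described above can be sketched in a few lines of Python without any regular expressions. This is only a rough sketch, not the code behind this page: the function name `fix_linebreaks` is made up for the example, and the handling of line-ending normalization and of runs longer than two linebreaks (collapsed to exactly two here) is an assumption.

```python
def fix_linebreaks(text: str) -> str:
    """Join hard-wrapped lines with a space; keep blank-line paragraph breaks.

    Hypothetical helper for illustration only.
    """
    # Assumption: treat Windows (\r\n) and old Mac (\r) line endings like \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    paragraphs = []
    for block in text.split("\n\n"):
        # Within one paragraph, every remaining linebreak is a false one.
        lines = [line.strip() for line in block.split("\n")]
        joined = " ".join(line for line in lines if line)
        # Assumption: empty blocks (from 3+ linebreaks in a row) are dropped,
        # so any longer run collapses to exactly one double linebreak.
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)


sample = "Ein einfacher junger\nMensch reiste im\nHochsommer.\n\nZweiter\nAbsatz."
fixed = fix_linebreaks(sample)
print(fixed)
print("Enter count before:", sample.count("\n"), "after:", fixed.count("\n"))
```

In the sample, the Enter count drops from 5 to 2, and each removed Enter shows up as a Space instead, which is the shift in the statistics the script is after.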

This is what I'm talking about (20+ linebreaks instead of just 2-4):

You can come across similar text files e.g. on Project Gutenberg; the example above is from Thomas Mann's Der Zauberberg.

This could probably be done with a regexp one-liner, but here is an interface for less regexp-savvy folks.
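For the regexp-savvy, one possible form of such a one-liner in Python is shown below. It is a sketch, not necessarily what this page runs: the function name `join_single_newlines` is made up, and the text is assumed to use plain `\n` line endings. The pattern matches a linebreak that is neither preceded nor followed by another linebreak.

```python
import re


def join_single_newlines(text: str) -> str:
    # (?<!\n) : not preceded by a linebreak
    # (?!\n)  : not followed by a linebreak
    # Only isolated linebreaks match; runs of two or more are left untouched.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


print(join_single_newlines("first\nline\n\nsecond\nparagraph"))
```

Note that a single linebreak at the very end of the text also counts as isolated and becomes a space, so you may want to strip the result afterwards.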

Todo