cross-posted from: https://sopuli.xyz/post/8936481
I would like to get to the bottom of what I am doing wrong that leads to black and white documents having a bigger filesize than color.
My process for a color TIFF is like this:
① tiff2pdf
② ocrmypdf
③ pdf2djvu
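For anyone wanting to reproduce this, the three steps can be sketched roughly as below. The function name and the temporary file names are placeholders of mine, and every tool runs with default options, which may not match the exact invocation used here.

```shell
#!/bin/sh
# Hypothetical helper: TIFF -> PDF -> OCR'd PDF -> DjVu, all with default options.
# Usage: tiff_to_djvu input.tif output.djvu
tiff_to_djvu() {
    in_tiff=$1
    out_djvu=$2
    work=$(mktemp -d) || return 1

    # Step 1: wrap the TIFF in a PDF.
    tiff2pdf -o "$work/raw.pdf" "$in_tiff" &&
    # Step 2: add an OCR text layer.
    ocrmypdf "$work/raw.pdf" "$work/ocr.pdf" &&
    # Step 3: convert the OCR'd PDF to DjVu.
    pdf2djvu -o "$out_djvu" "$work/ocr.pdf"
    status=$?

    rm -rf "$work"
    return $status
}
```

To inspect the intermediate PDFs, skip the cleanup line and keep the work directory around.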
Resulting color DjVu file is ~56k. When pdfimages -all runs on the intermediate PDF file, it shows CCITT (fax) is inside.
My process for a black and white TIFF is the same:
① tiff2pdf
② ocrmypdf
③ pdf2djvu
Resulting black and white DjVu file is ~145k (almost 3× the color size). When pdfimages -all runs on the intermediate PDF file, it shows a PNG file is inside. If I replace step ① with ImageMagick’s convert, the first PDF is 10 MB, but in the end the resulting djvu file is still ~145k. And PNG is still inside the intermediate PDF.
I can get the bitonal (bilevel) image smaller by using cjb2 -clean, which goes straight from TIFF to DjVu, but then I can’t OCR it due to the lack of a PDF intermediate version. And the size is still bigger than the color doc (~68k).
#askFedi
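A quicker way to compare what each intermediate PDF holds, without extracting anything, is pdfimages -list, which prints one row per embedded image with an enc column. A minimal sketch, assuming the usual poppler table layout (two header lines, encoding in the ninth column); the function name is made up. As far as I know, streams that pdfimages -all writes out as PNG are reported there as plain "image", while fax-compressed ones show up as "ccitt".

```shell
#!/bin/sh
# Hypothetical helper: print the distinct image encodings found in a PDF.
# Assumes poppler's `pdfimages -list` layout: 2 header lines, "enc" in column 9.
pdf_image_encodings() {
    pdfimages -list "$1" | awk 'NR > 2 { print $9 }' | sort -u
}
```

Running it on the PDF before and after the ocrmypdf step should show at which step the encoding changes.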
update
I think I found the problem, which would not be evident from what I posted. I was passing the --force-ocr option to ocrmypdf. I did that just to push through errors like “this doc is already OCRd”. But that option does much more than you would expect: it transcodes the doc. Looks like my fix is to pass --redo-ocr instead. It’s not yet obvious to me why --force-ocr impacted bilevel images more.
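In command form the change is small. This is a sketch with a made-up wrapper name; the only difference from the earlier run is the flag. My understanding of the OCRmyPDF docs is that --force-ocr rasterizes every page before OCR, while --redo-ocr replaces an existing OCR text layer and leaves the page images alone.

```shell
#!/bin/sh
# Hypothetical wrapper: redo OCR without rasterizing the pages.
# Usage: ocr_without_transcoding input.pdf output.pdf
ocr_without_transcoding() {
    # --redo-ocr instead of --force-ocr, so the embedded images are not re-encoded
    ocrmypdf --redo-ocr "$1" "$2"
}
```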