Sounds like someone needs to run their own test cases and report back on which solution does a better job...
hersko 1 hour ago
I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and is presumably far cheaper, since I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
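A minimal sketch of that text-first, image-fallback flow, assuming Node with pdf-parse; the extractDataFromText, extractDataFromImage, and renderFirstPageToPng helpers here are hypothetical stand-ins for whatever model calls and rasterizer are actually used:

    import { readFile } from "node:fs/promises";
    import pdfParse from "pdf-parse";

    // Hypothetical stand-ins for the model calls and the rasterizer.
    async function extractDataFromText(text: string): Promise<unknown> { return {}; }
    async function extractDataFromImage(pngPath: string): Promise<unknown> { return {}; }
    async function renderFirstPageToPng(pdfPath: string): Promise<string> { return "/tmp/page.png"; }

    async function extractFromPdf(path: string): Promise<unknown> {
      try {
        // Cheap path: pull the embedded text layer and send that to the model.
        const { text } = await pdfParse(await readFile(path));
        if (!text.trim()) throw new Error("empty text layer");
        return await extractDataFromText(text);
      } catch {
        // Fallback: rasterize to PNG and send the image instead (pricier per call).
        return await extractDataFromImage(await renderFirstPageToPng(path));
      }
    }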
I always render an image and OCR that, so I don't get odd problems from invisible text; it also avoids being affected by anything added for SEO.
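A minimal sketch of that render-then-OCR approach, assuming the Poppler (pdftoppm) and Tesseract CLIs are installed; because only the rendered pixels are OCRed, invisible or SEO-only text layers never enter the pipeline:

    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    async function ocrRenderedPage(pdfPath: string, page = 1): Promise<string> {
      // Rasterize one page to /tmp/render.png at 300 DPI (Poppler's pdftoppm).
      await run("pdftoppm", [
        "-png", "-r", "300",
        "-f", String(page), "-l", String(page), "-singlefile",
        pdfPath, "/tmp/render",
      ]);
      // "stdout" makes tesseract print the recognized text instead of writing a file.
      const { stdout } = await run("tesseract", ["/tmp/render.png", "stdout"]);
      return stdout;
    }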
mimim1mi 1 hour ago
By definition, OCR means optical character recognition. What kind of extraction methodology works depends on the contents of the PDF: often the available PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
sgc 1 hour ago
How does this compare to dots.ocr? I got fantastic results when I tested dots.
Discussion is here: https://news.ycombinator.com/item?id=45652952
https://github.com/rednote-hilab/dots.ocr