Pdf to odt/docx conversion has me weeping! (lemmy.world)

submitted 2 weeks ago by Maroon@lemmy.world to c/selfhosted@lemmy.world

22 comments

As you may have seen, I have posted about this before, asking for suggestions on software that can convert PDF to docx/odt.

I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text and modify it or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft I wrote was stored as a PDF. Converting them to text has become the bane of my life.

I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc does not convert PDF to any other format. For pandoc, PDF can only be the output format.

Is there a magic open source solution that I have missed?

[–] Treczoks@lemmy.world 9 points 1 week ago (1 children)

The problem lies in the PDFs themselves. Inside them are objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.

It cannot know any intentions behind it, though. Take a numbered list. The first line is two line objects: the number plus the . or the ), and the first line of text. The conversion tool can now guess. As the line blocks with the numbers are all left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF is giving any hints.
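
To make that guess concrete, here is a minimal sketch in Python. The `(x, y, text)` fragment shape, the sample data, and the function names are all assumptions made up for illustration; a real converter works on glyph runs with far messier geometry.

```python
import re

# Hypothetical fragments as a converter might see them: (x, y, text).
# PDF y coordinates grow upward, so larger y means higher on the page.
FRAGMENTS = [
    (72.0, 700.0, "1."), (90.0, 700.0, "Collect the samples"),
    (72.0, 686.0, "2."), (90.0, 686.0, "Label each vial"),
    (72.0, 672.0, "3."), (90.0, 672.0, "Store at 4 degrees"),
]

# A list marker: digits or a single letter followed by "." or ")".
MARKER = re.compile(r"^(\d+|[a-zA-Z])[.)]$")


def group_rows(fragments, tolerance=2.0):
    """Group fragments whose baselines lie within `tolerance` points."""
    rows = []
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    for row in rows:
        row["cells"].sort()  # left to right within each row
    return rows


def guess_structure(rows):
    """Guess what a block of rows is. This is a heuristic, not knowledge."""
    if not rows or any(len(r["cells"]) != 2 for r in rows):
        return "paragraph"
    if all(MARKER.match(r["cells"][0][1]) for r in rows):
        return "numbered list"
    return "two-column table"


print(guess_structure(group_rows(FRAGMENTS)))
```

A two-column table whose first column happens to contain "1.", "2.", "3." would get the same answer, which is exactly the ambiguity described above.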

And that is the easy part. This assumes that the document either uses default fonts, or keeps its embedded fonts untouched. If they use embedded fonts and a PDF optimizer that only embeds the used characters and renumbers them, any copy or conversion tool is bound to fail.
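
A toy model of that failure, in Python. It assumes an optimizer that hands out new glyph codes in order of first use and does not keep a code-to-Unicode mapping; the function names are invented for the sketch and do not correspond to any real tool.

```python
def subset_font(text):
    """Assign glyph codes 1, 2, 3, ... in order of first use,
    as a hypothetical optimizer might when embedding only used characters."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    return code_for


def encode(text, code_for):
    """What ends up in the page content: glyph codes, not characters."""
    return [code_for[ch] for ch in text]


def naive_extract(codes):
    """With no mapping back to Unicode, a tool can only treat codes as characters."""
    return "".join(chr(c) for c in codes)


def mapped_extract(codes, code_for):
    """With the mapping preserved, the original text is recoverable."""
    to_unicode = {code: ch for ch, code in code_for.items()}
    return "".join(to_unicode[c] for c in codes)


text = "grant application"
table = subset_font(text)
codes = encode(text, table)

print(repr(naive_extract(codes)))      # unreadable control characters
print(mapped_extract(codes, table))    # the original text
```

The page still renders correctly in both cases because the glyph shapes are intact; only the path from glyph code back to a character is lost.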

Same with protected PDFs where you simply cannot copy the text from the start.

And then there are PDFs that just consist of scanned pages. Here you would need OCR software to get something readable out of them.

PDF is an archival output format, the end of a process. Not something to work from.

Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.

[–] ChaoticNeutralCzech@feddit.org 2 points 1 week ago* (last edited 1 week ago) (1 children)

Renumbering characters during font minimization? I haven't encountered that; it would break searching and copying.

Anyway, PDFs don't even say whether a line of text is left-aligned, centered or justified: they usually store the coordinates of the first character and then the spacing to each subsequent one, unless the font defines it.
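
So a converter has to infer alignment from the geometry alone. A rough Python sketch, assuming each line of a paragraph has already been reduced to a hypothetical `(x_start, x_end)` pair in points; the tolerance value is arbitrary.

```python
def guess_alignment(lines, tolerance=1.5):
    """Guess paragraph alignment from the (x_start, x_end) of each line."""
    if len(lines) < 2:
        return "unknown"  # one line alone carries no alignment evidence

    starts = [s for s, _ in lines]
    ends = [e for _, e in lines]
    centers = [(s + e) / 2 for s, e in lines]

    def aligned(values):
        return max(values) - min(values) <= tolerance

    # The last line of a justified paragraph is usually short, so skip it
    # when there are enough other lines to compare.
    body_ends = ends[:-1] if len(ends) >= 3 else ends

    if aligned(starts) and aligned(body_ends):
        return "justified"
    if aligned(starts):
        return "left"
    if aligned(ends):
        return "right"
    if aligned(centers):
        return "centered"
    return "unknown"


print(guess_alignment([(72, 300), (72, 250), (72, 280)]))
print(guess_alignment([(100, 300), (150, 250), (120, 280)]))
```

A left-aligned paragraph whose lines happen to end at nearly the same position would be misread as justified, so the result is only ever a guess.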

And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it's up to the conversion tool to guess if it's part of the main text layer.

Sorry, it's really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word's "Graphic > Convert to Shapes" feature).

Really, every program that outputs PDF should treat it as an export process, hopefully making it clearer that "saving as PDF" is visually lossless but structurally lossy and messy.

[–] Treczoks@lemmy.world 1 points 1 week ago

The compressing and renumbering seems to be more common with embedded Chinese fonts; space-wise it makes a lot of sense. But yes, mark and copy text, paste it into Word or Writer, and you get gibberish. I can't verify the search, though. And, of course, Google Translate can't do anything with it, either.