I have been thinking about, and experimenting with, converting scientific papers written in TeX into a standard plain-text form that is searchable.
Can we tweak the TeX engine so that there is a mode that outputs text with font and subscript/superscript information retained?
Hi @mathyingzhou, thanks for asking!
One of the major motivations for my work on Tectonic is that I’d like to have a platform to enable exactly the kind of thing that you’re interested in.
In my opinion, the answer to your question depends on the details of what exactly you’re interested in. If you want a really good text-like representation of a TeX document that preserves details like sub- and super-scripts, tables, figures, and so on, I believe that the overwhelmingly preferable answer is to implement really good HTML output. This is something that I have dreamed of for a long time, but haven’t been able to get all of the pieces together just yet. I believe it’s possible, but there is a lot of work left to do to create results that hit the quality level I want to achieve.
Less ambitiously, the PDF files created by Tectonic (and all PDF-enabled TeX engines) contain information about their textual contents already. Tectonic and XeTeX (and maybe other engines) implement the ActualText PDF tag that aims to provide a better correspondence between the rendered PDF and the ideal Unicode representation of the underlying text.
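To make that a bit more concrete: in the PDF format, ActualText is attached to a marked-content span in a page's content stream, and its value is often a hex string in UTF-16BE with a byte-order mark. Here is a minimal Python sketch that pulls those values out of a stream. The sample bytes are hand-written for illustration (they are not real Tectonic output), the function name `actual_texts` is made up, and the sketch assumes the stream is already decompressed; real PDFs usually compress their streams, so in practice you would want a proper PDF library.

```python
import re

# Hypothetical, hand-written content-stream fragment (not real engine output).
# One ligature glyph is drawn, while ActualText records the two underlying
# characters "f" and "i" as a UTF-16BE hex string with a byte-order mark.
SAMPLE_STREAM = b"/Span << /ActualText <FEFF00660069> >> BDC <00A5> Tj EMC"

# Only hex-string values are handled here; literal "(...)" strings are ignored.
ACTUAL_TEXT_HEX = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]+)>")

def actual_texts(stream):
    """Return the hex-encoded ActualText values found in a content stream."""
    results = []
    for match in ACTUAL_TEXT_HEX.finditer(stream):
        raw = bytes.fromhex(match.group(1).decode("ascii"))
        if raw.startswith(b"\xfe\xff"):
            # Byte-order mark present: the rest is UTF-16BE.
            results.append(raw[2:].decode("utf-16-be"))
        else:
            # Rough stand-in for PDFDocEncoding, good enough for ASCII.
            results.append(raw.decode("latin-1"))
    return results

print(actual_texts(SAMPLE_STREAM))
```

The point is only that the Unicode text travels alongside the glyphs, so a search or extraction tool does not have to guess what a ligature or math glyph "means".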
In between, you could imagine an output mode that preserves this kind of text information, but doesn’t go all the way to detailed HTML (maybe even .txt?). But the challenge here is that there are lots of corner cases (e.g. tables) where you are going to lose a lot of information about the intended output, and sometimes that can be a big problem.
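Even sub- and superscripts show the problem on a small scale. The toy Python sketch below works on a string of TeX math source, not inside the engine, and the function name `flatten_scripts` is made up for this example. It maps scripts to Unicode characters where they exist and falls back to an ASCII marker where they don't. It only handles digits, `+`, `-`, and a superscript `n`, and it leaves anything it doesn't recognize (such as a control sequence after `^`) untouched.

```python
import re

# Unicode has dedicated script characters for only a small set of symbols.
SUPERSCRIPTS = str.maketrans("0123456789+-n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻ⁿ")
SUBSCRIPTS = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")

# Matches ^ or _ followed by a brace group (no nesting) or one letter/digit.
SCRIPT = re.compile(r"([_^])(?:\{([^{}]*)\}|([A-Za-z0-9]))")

def flatten_scripts(tex):
    """Render simple TeX sub/superscripts as plain Unicode text."""
    def replace(match):
        marker = match.group(1)
        body = match.group(2) if match.group(2) is not None else match.group(3)
        table = SUPERSCRIPTS if marker == "^" else SUBSCRIPTS
        if all(ord(ch) in table for ch in body):
            return body.translate(table)
        # No Unicode equivalent for every character: keep an ASCII marker
        # so the script information is not silently dropped.
        return f"{marker}({body})"
    return SCRIPT.sub(replace, tex)

print(flatten_scripts("x^2 + y_{10}"))
print(flatten_scripts("a_{ij}"))
```

The first call comes out as clean Unicode, while the second has to fall back to `a_(ij)` because there are no subscript forms for `i` and `j` in this mapping. Tables and figures are much worse in the same way, since there is nothing comparable to fall back on.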