A brief outline of the plan for HTML support

You might have noticed that the pace of activity on Tectonic has slowed down significantly. Part of this is a good thing: I use it for my everyday document compilations and everything just works! Hopefully that is the case for other folks as well.

But it is also true that there are many bugs that need fixing, and I have not been making progress on my goal of starting work on support for native HTML output. Unfortunately all of my bandwidth this summer has been sucked up by other things.

In response to a query from GitHub user @bfirsh I thought I’d at least document my plans for HTML, since I think that HTML support is the most important big-picture feature that Tectonic can offer in the near future.

My work on an earlier TeX/HTML project, Webtex, convinced me that in order to get good HTML output out of TeX, you need to fundamentally alter the engine to turn off the line-breaker and page-output algorithms. These steps transform the source text in fundamental ways (e.g. breaking equations into lines, inserting page headers and footers) that are actively undesirable for Web output, and I found that these transformations cannot be reversed in any tractable way after the fact.

The other thing I decided is that it is undesirable, if not impossible, to literally recreate every aspect of the formatting specified in the (La)TeX source that is being compiled. For instance, HTML output won’t try to use exactly the fonts that are specified in the document. It will use the same sensible default body font regardless of what the source text calls for. Anything else is simply impractical if you’re aiming for a highly polished Web reading experience.

So the vision for Web output is that you add some special control variables to the engine that disable linebreaking and page output. Essentially the output of the engine would be a never-ending vlist of unaligned boxes, rather than a series of page boxes. Meanwhile, you hack up the LaTeX support files to emit HTML tags as needed.
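
To make the difference concrete, here is a toy sketch in Python. It is not how the engine works internally: the Box type, the greedy paginate function, and the heights are all made up for illustration. It only shows what "a series of page boxes" versus "one never-ending vlist" means for the same vertical material.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One item of vertical material (hypothetical, for illustration only)."""
    label: str
    height: float  # in points


def paginate(boxes, page_height):
    """Toy page builder: greedily pack boxes into fixed-height pages."""
    pages, current, used = [], [], 0.0
    for box in boxes:
        # Start a new page when the next box would overflow the current one.
        if current and used + box.height > page_height:
            pages.append(current)
            current, used = [], 0.0
        current.append(box)
        used += box.height
    if current:
        pages.append(current)
    return pages


def as_vlist(boxes):
    """With page output disabled, the material stays one flat vertical list."""
    return list(boxes)


boxes = [Box("para 1", 300.0), Box("equation", 200.0), Box("para 2", 300.0)]
pages = paginate(boxes, page_height=600.0)  # material split across pages
vlist = as_vlist(boxes)                     # material left as a single list
```

In the paged form, a consumer has to guess where the page breaks fell and undo them; in the vlist form there is nothing to undo.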

The roadmap in my head goes as follows. Not everything needs to actually happen in the strict sequence given here:

  1. Document the format of the existing XDVI output files. I have not found a clear description of the detailed binary layout of this file format, and it needs to be well-understood for the following steps.
  2. Develop a slightly altered output file format that is like XDVI, but that conveys its output as a giant never-ending vlist, rather than a series of pages.
  3. Implement the “no linebreaking and no page layout” engine switch and teach the engine to respect it, with output in this new not-quite-XDVI format.
  4. Write code to process the not-quite-XDVI output and use it to create an HTML output file. I believe that latex2html and friends have defined \special commands for inserting HTML tags, but new \special logic might need to be implemented.
  5. Build the infrastructure to create a hacked bundle file that is like the basic texlive bundle but patches up the style files to include hacks as needed to get the HTML to work. For instance, for certain fancy interactive effects to work, the final output document will need supporting CSS and JavaScript code, and the HTML output will need to hook into these; the bundle will need to provide all of these resource files.
  6. Wire everything up in the frontend with a nice user experience. Since I think that all of the Web-output alterations require the document to be compiled with a special non-default bundle, I’d like there to be some logic that makes it so that if you just run tectonic -b path-to-web-bundle mydoc.tex, the UX layer figures out that this bundle is intended for HTML output, and mydoc.html is output automagically.
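
As a rough illustration of step 4, here is a minimal Python sketch of the kind of post-processing I have in mind. Everything in it is an assumption: the event stream stands in for whatever a decoder of the not-quite-XDVI format would produce, and the "html:" prefix on \special payloads is just one possible convention, not something the engine or any bundle defines today.

```python
from html import escape

# Hypothetical event stream decoded from the not-quite-XDVI file.
# ("text", s) is typeset text; ("special", s) is the payload of a \special.
events = [
    ("special", "html:<p>"),
    ("text", "Energy & mass: E = mc^2"),
    ("special", "html:</p>"),
]

# Assumed prefix marking specials that carry raw HTML.
HTML_PREFIX = "html:"


def events_to_html(events):
    """Turn the flat event stream into an HTML fragment."""
    parts = []
    for kind, payload in events:
        if kind == "text":
            # Typeset text must be escaped so it cannot be read as markup.
            parts.append(escape(payload))
        elif kind == "special" and payload.startswith(HTML_PREFIX):
            # Tags supplied by the patched style files pass through verbatim.
            parts.append(payload[len(HTML_PREFIX):])
        # Any other special is ignored in this sketch.
    return "".join(parts)


html_out = events_to_html(events)
```

The point of the split is that the patched style files decide which tags to emit, while the post-processor only has to escape text and pass the tags through.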

Once this basic framework comes together, it’s then a question of iterating and iterating and iterating to make everything work together smoothly, and probably a never-ending series of patches to the packages in the texlive bundle to polish the HTML experience.

A footnote: this post contains some information about the XDV format (there’s also a link to the original source in the comments).

Thanks for the link!