XML Workflow Summary
Overview and case study
As part of an ongoing project of the AAUP Design & Production Committee to research and outline a digital workflow that facilitates production of scholarly content in several delivery formats, Terri O'Prey (Princeton UP) has developed an XML Workflow Summary as a Word document. It outlines best practices for an XML workflow, based on Princeton's experience of transitioning to an XML-first workflow using the Scribe Well-Formed Document Workflow. The outline gives a brief history and description of XML and its value as a flexible, stable, editable format for storing and transporting data. It also offers specific tips on manuscript processing that, while described relative to Scribe's DTD, can be applied in other workflows.
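To illustrate what "well-formed" means in practice, here is a minimal sketch that checks whether a fragment parses as well-formed XML using Python's standard library. The element names (`chapter`, `title`, `p`) are hypothetical and are not taken from Scribe's DTD. The check covers well-formedness only, not validity against any DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical manuscript fragments; element names are illustrative only.
good = "<chapter><title>Introduction</title><p>First paragraph.</p></chapter>"
# The <p> element is never closed, so this fragment is not well-formed.
bad = "<chapter><title>Introduction</title><p>First paragraph.</chapter>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this only confirms that tags are properly nested and closed. Confirming that a manuscript follows a particular DTD would require a validating parser, which the standard-library module used here does not provide.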
- Open Typesetting Stack (formerly the PKP XML Parsing Service): (sourcecode, beta site) -- can be used to convert from Word documents
- eXtyles – a set of related products supporting composition in Word and subsequent conversion to XML. Can be used with Typefi Publish to import into InDesign. One implementation is P-Shift, a service from the University of Toronto Press that allows you to pay them to carry out production using these tools.
- Scribe Well-Formed Document Workflow – composition in Word, then conversion to XML, then import into InDesign. Scribe also offers a service where you pay them to carry out production using these tools.
- Typesetera – composition in Word, then automatic creation (using XML) of sample, student, and scholar editions in PDF and e-book formats
- Open Typesetting Stack (formerly the PKP XML Parsing Service): (sourcecode, beta site) -- can be used to convert from PDF documents
- GROBID – uses heuristics to convert a vector PDF into TEI. See overview, documentation, sourcecode and demo sites: scite-it.edu (not working) or science-miner.com. Is being integrated into Apache Tika.
- pdfx – uses heuristics to convert a vector PDF into “pdfx-xml” (see a paper about it)
- pdf2htmlEX – Liza Daly said (?) at AAUP 2013 that it creates HTML5 with SVG (nonreflowable output).
- Box View API (formerly "Crocodoc") – an API for converting PDF (and other formats) to HTML5. See Box Platform Developer Documentation for full documentation, quick start guide, and FAQ.
- CERMINE (Content ExtRactor and MINEr) – has been integrated into the PKP XML Parsing Service (sourcecode, beta site)
- PDFUnbound – converts to various structured output formats
Some of these, along with others, are listed with short annotations in the tools section of the "Jailbreaking the PDF" website. The convenor of that workshop reported on jats-list that Crocodoc became available after the workshop, and that he has found it to be more reliable than the others.
Many people feel that, for publishing, you can get nearly all of the benefits of XML by relying on clean HTML code.
- PressBooks – Can be installed and run locally, or you can pay them to carry out production using the platform they develop.
- O'Reilly Atlas – You can pay them to carry out production using the platform they develop.
- IGP:Digital Publisher – Includes an authoring environment but also allows importing from .doc and .odt files.
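As a rough illustration of the point about clean HTML, the sketch below treats a small, well-formed HTML fragment as XML and extracts its heading outline with Python's standard library. The sample markup is invented for this example, and the approach assumes XHTML-style markup in which every element is closed. Typical HTML5 with unclosed void elements such as `<br>` would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Invented sample: clean HTML in which every element is explicitly closed.
html = """<html><body>
<h1>Book Title</h1>
<h2>Chapter One</h2>
<p>Some text.</p>
<h2>Chapter Two</h2>
</body></html>"""


def outline(doc):
    """Return (tag, text) pairs for h1-h3 headings, in document order."""
    root = ET.fromstring(doc)
    headings = {"h1", "h2", "h3"}
    return [
        (el.tag, "".join(el.itertext()).strip())
        for el in root.iter()
        if el.tag in headings
    ]


for tag, text in outline(html):
    print(tag, text)
```

Because the markup is structurally clean, the same generic tooling used for XML can read it, which is the kind of benefit the clean-HTML approach relies on.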