Docutils: Multiple Input Files
- Date: 2007-08-21
- Revision: 5417
- Copyright: This document has been placed in the public domain.
1 Introduction
We would like to support documents whose source text comes from multiple files. For instance, the Docutils documentation could be considered a single large document; parsing all files into one single document tree would enable us to do cross-linking between parts of the documentation (our current way to cross-link between files is to link to HTML files and fragments, which is limited to HTML). Another example is a reference manual for a customized software system. The manual is built from a set of sub-documents that may differ from installation to installation.
Note that this issue is separate from that of output to multiple files; after implementing support for multiple input files, all we will be able to do is to generate a single huge output file.
This is a collection of notes and semi-random thoughts (many of which are credited to David, from IM conversations). Feel free to add yours and/or make changes as you see fit!
You can also discuss this proposal on the Docutils-develop mailing list, or reach us individually via email or Jabber/Google Talk at LeWiemann@gmail.com and dgoodger@gmail.com, respectively.
2 Terminology
Right now, we are using the following terminology: A document which includes other documents (using the subdocument directive described below) is called a super-document. The included documents are called sub-documents. Sub-documents can in turn include other documents and can thus be super-documents themselves. A top-most super-document in the hierarchy, i.e. one that is not itself included by any other document, is called a compound document.
The set of all documents that can be part of a compound document is the document set (or docset). The directory that is ancestor to all documents in the document set is the docset root.
3 The subdocument Directive
The "include" directive is not usable for this because we want to have independent parsing contexts (for instance, section title adornment should not have to be consistent across input files).
So create a "subdocument" directive (syntax: ".. subdocument:: file.txt"). This directive causes the referenced file to be parsed and its document tree to be inserted in place.
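For illustration, here is a minimal sketch of what such a directive might do, assuming the class-based directive API; the "SubDocument" class and its splicing behavior are hypothetical, not existing Docutils code:

    from docutils import io, utils
    from docutils.parsers.rst import Directive, Parser, directives

    class SubDocument(Directive):
        """Hypothetical directive: parse a file in an independent
        context and splice its tree in place (sketch only)."""
        required_arguments = 1

        def run(self):
            path = directives.path(self.arguments[0])
            text = io.FileInput(source_path=path).read()
            # A fresh parser and a fresh document give the sub-document
            # its own parsing context (e.g. independent section title
            # adornments).
            subdoc = utils.new_document(path, self.state.document.settings)
            Parser().parse(text, subdoc)
            # Splice the sub-document's children in place; a real
            # implementation also needs the section-level treatment
            # discussed below.
            return subdoc.children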
The sub-document must either consist solely of top-level sections and transitions, or it must have a document title. (The document title becomes a section title when the sub-document is inserted using the subdocument directive.) We only support per-document docinfos; if a sub-document contains multiple top-level sections, we don't touch its field lists at all.
This restriction may be lifted later; there is no theoretical reason that prevents sub-documents from containing arbitrary text (within the limits of the DTD, e.g. no body elements may end up after section elements) -- it is merely somewhat harder to implement.
As long as the above restrictions apply, the subdocument directive can be treated like a section at parse time; that is, no elements except more sections or transitions may follow it.
In order to facilitate assembling a large number of hierarchical files into a large document, the subdocument directive should allow specifying any number of files, like this:
    .. subdocument::

       chapter1.txt
          chapter1-section1.txt
          chapter1-section2.txt
       chapter2.txt
          chapter2-section1.txt
       ...
Specifying an indented file (like chapter1-section1.txt) is equivalent to inserting ".. subdocument:: chapter1-section1.txt" at the end of chapter1.txt.
Lists of files should be required to be directive content, not parameters, because file lists as parameters would be prone to uncaught user errors. In this example, the indentation of "chapter1-section1.txt" would be stripped by the directive parser, which is contrary to what the user expects:
    .. subdocument:: chapter1.txt
          chapter1-section1.txt
The sub-document files should each be processable stand-alone (without the other files), each forming a document on its own.
What to do with docinfos in sub-documents:

- Allow for section infos by generalizing the existing docinfo node.

- Add an option to either strip or leave docinfos, specifiable as an option to the "subdocument" directive.

  Uniform handling:

      .. subdocument::
         :leave-docinfo:

         chapter1.txt
         chapter2.txt
         chapter3.txt

  Non-uniform handling:

      .. subdocument:: chapter1.txt

      .. subdocument:: chapter2.txt
         :leave-docinfo:

      .. subdocument:: chapter3.txt
There is currently no way to have per-section "section infos" in reStructuredText files, as opposed to only file-wide docinfos. So the only way to get section infos in a document is to use sub-documents. This might be just fine though.
For a first implementation, just go the easy route and strip all docinfos in sub-documents.
In order to facilitate multi-file output that parallels the input file structure, add "source" attributes to section nodes for sections that come from different input files.
Restriction: Do not allow sub-documents without a top-level section, or with body elements in front of the first section. In other words, the sub-document may only contain PreBibliographic elements, sections, and transitions. Any PreBibliographic elements in front of the first section get moved into the section.
David says we shouldn't have this restriction -- for instance you might want to have an introductory paragraph in front of the first section of the first sub-document. Lea doesn't mind the restriction and says you could use the "include" directive. Since having the restriction makes the implementation somewhat easier, we agreed on having this restriction, waiting until the first user presents a good use case to remove it, and calling it a YAGNI until then.
Silently drop header and footer in sub-documents. (Document this in directives.txt though.)
To do: Explore alternatives besides "subdocument" for the directive name.
[DG] "subdoc" is a good alias. "Subdocument" implies a single item; perhaps "manifest" instead, for multiple items?
docutils.conf files in the sub-documents' respective directories are honored.
You may want to read some insightful remarks by Joaquim Baptista on how files should be expected to be part of different compound documents.
4 Cross-References
A major issue to think about is how to do cross-references (colloquially known as xrefs) between files. Things like local references or substitutions should not be shared between files (their definitions can simply be loaded using the "include" directive). However, sharing external targets and thereby allowing cross-references between files is one of the major features of an architecture that supports multiple input files.
(Implementation note: For this to work, we need to apply transforms to sub-documents; basically, all transforms but the one resolving external references should be applied. This means that a new reader instance must be created. r5266 makes applying transforms to sub-documents possible by pulling the responsibility for applying transforms out of the Publisher.)
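A hedged sketch of what applying these transforms could look like; the choice to skip exactly references.ExternalTargets is an assumption for illustration, not settled design:

    from docutils.readers.standalone import Reader
    from docutils.transforms import references

    def apply_subdocument_transforms(document):
        # Apply all of the standalone reader's transforms except the
        # one that resolves external targets.
        transforms = [t for t in Reader().get_transforms()
                      if t is not references.ExternalTargets]
        document.transformer.add_transforms(transforms)
        document.transformer.apply_transforms()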
Issues arise once we think about how to group target names into namespaces. Unfortunately, simply putting all targets into a global, document-wide namespace is bound to cause collisions; files that were processable stand-alone are no longer processable when used in conjunction with other files because they share common target names.
Since linking to targets outside the scope of the current sub-document has significant disadvantages, we will need some form of qualifiers.
4.1 Namespace Identifiers
This makes it necessary to add a notion of namespace identifiers.
It is possible to always name the namespace of the current file (as it is done in C++). For instance, ".. namespace:: frob" at the beginning of a file could declare that the namespace of the current file is called "frob". However, this is a little verbose as it adds a line at the top of each file (see the sidebar). Also, it removes the reader's ability to easily look up the referenced files (you might not know which file(s) declare the "frob" namespace).
On the other hand, namespace names could also be derived from paths and file names. (Note though that these two options need not be mutually exclusive.) Since using only the file name would cause ambiguity, it is necessary to include its path in the namespace name. For instance, the file docs/dev/todo.txt could be referenced by the implicit namespace identifier docs/dev/todo.txt; a reference would look like `<docs/dev/todo.txt> large documents`_. Using paths relative to the current file makes it hard to move files or document parts. Therefore, we need to establish the notion of a docset root to which path names are relative.
The docset root could be specified using a "docset-root" directive at the top of each sub-document that uses external named references. On the other hand, perhaps we do not need to know the docset root until we process the compound document (in which case it can be implicitly derived from the location of the master file). So let's defer implementing a "docset-root" directive until the need arises.
4.2 Namespace Aliases
A general disadvantage of using paths as namespace identifiers is that changes in the directory structure cause a massive number of changes in the reStructuredText files, because all the paths need to be updated. This is not any worse than the current situation. However, to improve maintainability it would be desirable to make the namespace of an often-referenced file known under a shorter name. (The shorter namespace identifier should only be valid within the file where it is declared, and possibly in its sub-documents.) For instance, one could make "docs/ref/restructuredtext.txt" known as "spec" using a "namespaces" directive:

    .. namespaces::
       :spec: docs/ref/restructuredtext.txt
Namespace aliases can also be used to make one namespace refer to different physical files depending on the super-document. Namespace definitions should therefore be inherited from super-documents by sub-documents. The "namespaces" directive overrides namespace definitions inherited from super-documents, unless the :inherit: option is specified. (The :inherit: option thus allows providing default paths for namespace aliases, which can still be overridden in super-documents; see the example below.)
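For illustration, here is one possible reading of these rules (the paths and option placement are hypothetical, since the syntax is not settled). A sub-document could provide an overridable default:

    .. namespaces::
       :inherit:
       :spec: docs/ref/restructuredtext.txt

and a super-document could override it for its whole document set:

    .. namespaces::
       :spec: local/customized-spec.txt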
4.3 Qualifier Syntax
Angled brackets:

    `<namespace> target`_

This is similar to the syntax for embedded URIs (`target <URI>`_). It fits well into the existing syntax.
5 Implementation
In order to parse sub-documents, we need to create new parser instances.
For now, we'll instantiate them by calling parser.__class__(); in the long run the reader, parser, and writer parameters of the Publisher should be turned into classes (or callbacks) instead of instances.
The Parser must know about the Reader (or about the Publisher) and call Reader.parse_[sub]document in order to parse sub-documents.
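A minimal sketch of that callback arrangement; the class shapes and the parse_subdocument name are illustrative only, taken from the note above:

    class Reader:
        def __init__(self, parser):
            self.parser = parser

        def parse_subdocument(self, text, document):
            # Create a new, independent parser instance for the
            # sub-document, as described above.
            subparser = self.parser.__class__()
            subparser.parse(text, document)
            return document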
6 Dumpster
You can stop reading now. This section is only to archive sections we're no longer interested in.
6.1 Rejected Proposal: Local and Global Namespace, no Qualifiers
An obvious solution would be to add a notion of a file-local and a global namespace. When trying to resolve a reference, first the target name is looked up in the local namespace of the current file; if no suitable target is found there, the target name is searched for document-wide, in the global namespace; if the target name exists and is unique within the compound document, the reference can be resolved.
If references to the global namespace are not marked up as such, however, the individual files are no longer processable stand-alone because they contain unresolvable references. While it may be acceptable that external named cross-references do not (fully) work any longer when a file is processed stand-alone, it would be nice to be able to handle unresolved external references somehow (at least by marking them as "unresolvable" in the output), rather than simply throwing an error (see the sidebar).
This can be solved by marking external references as such, like this:
    `local target name`_
    `-> global target name`_
where "local target name" must be a unique target name within the current file, and "global target name" must be a unique target name within the current compound document.
(We would need to explicitly establish a notion of "stand-alone" vs. "full document" processing in this case. But since this proposal is being rejected, I'm not going to explore this further.)
6.1.1 Drawbacks
This approach turns out to have a major drawback though: External named references depend on the context of the containing super-document. However, as Joaquim pointed out, files should be expected to be part of several super-documents. This means that once a file is put into the context of a new document, its external named references might point to non-existing or duplicate targets. This seems like a maintenance problem for complex (large) collections of documentation.
Another peculiarity of this system is that, as long as a file is processed stand-alone, external named references are not associated with the file that defines the target. This brings the advantage that renaming and moving files won't invalidate reference names. On the downside, it lacks clarity for the reader because the file containing a target is often not inferable from the target name (try to guess which file `-> html4css1`_ links to) -- this may be significant since reStructuredText should be readable in its source form.
6.2 Importing Namespaces
While namespaces should generally be available without explicitly importing them (in order to avoid lengthy headers), it would probably be handy to have a means of inserting all targets of another namespace into the current one (similar to Python's "from module import *"). The disadvantage is that it may cause confusion.
Contenders for the syntax:
    .. import:: namespace           (Pythonic)
    .. import-targets:: namespace   (more verbose)
    .. using:: namespace            (like C++)
Or, provided that we use ".. namespace:: short-name <- namespace", and "`namespace -> target`_" as reference syntax, this would be a logical fit:
    .. namespace:: <- namespace

    `-> target`_   (instead of `namespace -> target`_)
The advantage of this syntax is that we can prohibit importing more than one namespace, which might otherwise cause confusion. Importing only one namespace might be a handy shortcut, though.
6.3 Caching
In order to be able to regenerate the whole compound document in a timely manner after changing a single file, it is necessary to implement a caching system.
Processing a document is done in the following steps:

1. For each file in the docset, parse it and turn the target names into file-local IDs (this includes error handling for duplicate target names). Cache the parse tree, the name-to-ID mapping, and the list of all files included using the "include" directive. Skip this step for files whose cache entry's date-stamp is newer than the file's mtime and ctime, and newer than the mtimes and ctimes of all included files.

   (This means that the "subdocument" directive must be resolved at transform time, not at parse time, because otherwise we could not store the doctree before the sub-document has been inserted.)

2. For each file, run transforms, resolving external named references using the cached name-to-ID mappings of the other files.

3. Write out the resulting document (currently a single file). (The writer needs to turn namespace/ID pairs into output-file-local IDs.)
Processing a file stand-alone is done in the same way, except that steps 1 and 2 are only performed for the file being processed, not for each file in the docset. If other files' cached name-to-ID mappings turn out not to be up-to-date when they are accessed in step 2, they should be updated automatically.
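A minimal sketch of the freshness test from step 1, assuming a cache entry that records a date-stamp and the list of included files (the entry layout is an assumption for illustration):

    import os

    def entry_is_fresh(entry, source_path):
        """True if the cached parse of source_path is still valid."""
        for path in [source_path] + entry['includes']:
            st = os.stat(path)
            if max(st.st_mtime, st.st_ctime) >= entry['timestamp']:
                return False
        return True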
All cache entries should be stored in a single docset cache file, in order to avoid LaTeX-like creation of many junk files. Possible names include docutils.cache or docutils.aux, or either of these with a leading dot. The file is stored in the docset root and contains a header and a large pickle string (reading and writing even large strings of pickled data is reasonably fast). In the header of the cache file, store sys.version and docutils.__version__, and discard cache files that have the wrong version.
Potential security issue: Since unpickling is unsafe, an attacker could provide a carefully crafted cache file, which is then automatically picked up by Docutils. Remedies: Insert some unguessable system-specific key (generate randomly and store in ~/.docutils.cache.key), and automatically discard cache files that have the wrong key. Or simply place a big warning in the documentation not to accept cache files from strangers.
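A hedged sketch of these header checks combined, with the random key stored in ~/.docutils.cache.key as suggested above; the exact header layout is an assumption:

    import os, pickle, sys
    import docutils

    KEY_PATH = os.path.expanduser('~/.docutils.cache.key')

    def cache_key():
        # Create the unguessable per-user key on first use.
        if not os.path.exists(KEY_PATH):
            with open(KEY_PATH, 'wb') as f:
                f.write(os.urandom(16))
        with open(KEY_PATH, 'rb') as f:
            return f.read()

    def load_cache(path):
        with open(path, 'rb') as f:
            # Plain-text header line, checked *before* any unpickling.
            header = f.readline().strip()
            expected = ('%s|%s|%s' % (sys.version.split()[0],
                                      docutils.__version__,
                                      cache_key().hex())).encode()
            if header != expected:
                return {}              # wrong version or key: discard
            return pickle.loads(f.read())  # the docset's cache entries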
No caching is done if no docset-root is defined (which means that the file being processed is independent and not part of any compound document).
6.4 Implementation
As described in section Caching, when processing files stand-alone and resolving their external named references, it may be necessary to process (or re-process) referenced files. Since this happens at transform time, the parser instance is no longer available; it is therefore necessary to create a new one.
All requests for doctree and name-to-ID mappings should go through the caching system. In case of a miss, the caching system instantiates a parser and (re-)parses the requested file.
In fact, all calls by the standalone reader to the reStructuredText parser should go through the cache. In the case of independent files which are not part of a larger docset, the system always assumes a cache miss.
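Sketched access path (all names illustrative; parse_and_store is a hypothetical helper building on the cache sketches above):

    def get_doctree(cache, path, parser_class, independent=False):
        # Independent files (no docset root) always count as a miss.
        entry = None if independent else cache.get(path)
        if entry is None or not entry_is_fresh(entry, path):
            parser = parser_class()   # parser recreated at transform time
            entry = parse_and_store(cache, parser, path)
        return entry['doctree'], entry['name_to_id']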