PEP: | 258 |
---|---|
Title: | Docutils Design Specification |
Version: | 9906 |
Last-Modified: | 2024-08-15 10:43:38 +0200 (Thu, 15 Aug 2024) |
Author: | David Goodger <goodger at python.org> |
Discussions-To: | <doc-sig at python.org> |
Status: | Rejected |
Type: | Standards Track |
Content-Type: | text/x-rst |
Requires: | 256 257 |
Created: | 31-May-2001 |
Post-History: | 13-Jun-2001 |
While this may serve as an interesting design document for the now-independent docutils, it is no longer slated for inclusion in the standard library.
This PEP documents design issues and implementation details for Docutils, a Python Docstring Processing System (DPS). The rationale and high-level concepts of a DPS are documented in PEP 256, "Docstring Processing System Framework" [1]. Also see PEP 256 for a "Road Map to the Docstring PEPs".
Docutils is being designed modularly so that any of its components can be replaced easily. In addition, Docutils is not limited to the processing of Python docstrings; it processes standalone documents as well, in several contexts.
No changes to the core Python language are required by this PEP. Its deliverables consist of a package for the standard library and its documentation.
Project components and data flow:
                     +---------------------------+
                     |        Docutils:          |
                     | docutils.core.Publisher,  |
                     | docutils.core.publish_*() |
                     +---------------------------+
                      /            |            \
                     /             |             \
            1,3,5   /        6     |              \ 7
           +--------+       +-------------+       +--------+
           | READER | ----> | TRANSFORMER | ====> | WRITER |
           +--------+       +-------------+       +--------+
            /     \\                                  |
           /       \\                                 |
     2    /      4  \\                             8  |
    +-------+   +--------+                        +--------+
    | INPUT |   | PARSER |                        | OUTPUT |
    +-------+   +--------+                        +--------+
The numbers above each component indicate the path a document's data takes. Double-width lines between Reader & Parser and between Transformer & Writer indicate that data sent along these paths should be standard (pure & unextended) Docutils doc trees. Single-width lines signify that internal tree extensions or completely unrelated representations are possible, but they must be supported at both ends.
The docutils.core module contains a "Publisher" facade class and several convenience functions: "publish_cmdline()" (for command-line front ends), "publish_file()" (for programmatic use with file-like I/O), and "publish_string()" (for programmatic use with string I/O). The Publisher class encapsulates the high-level logic of a Docutils system. The Publisher class has overall responsibility for processing, controlled by the Publisher.publish() method:
Calling the "publish" function (or instantiating a "Publisher" object) with component names will result in default behavior. For custom behavior (customizing component settings), create custom component objects first, and pass them to the Publisher or publish_* convenience functions.
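The facade arrangement described above can be illustrated with a toy pipeline. This is a minimal sketch using only the standard library, not the Docutils implementation: every class and function name here (ToyReader, ToyParser, ToyWriter, ToyPublisher, toy_publish_string) is hypothetical, and the "document tree" is a plain dictionary chosen for brevity.

```python
class ToyParser:
    """Hypothetical parser: one paragraph node per blank-line-separated block."""

    def parse(self, text, document):
        for block in text.strip().split("\n\n"):
            # Collapse internal whitespace so each paragraph is one line.
            document["children"].append(("paragraph", " ".join(block.split())))


class ToyReader:
    """Hypothetical reader: knows which parser to use and builds the tree."""

    def __init__(self, parser):
        self.parser = parser

    def read(self, source):
        document = {"children": []}
        self.parser.parse(source, document)
        return document


class ToyWriter:
    """Hypothetical writer: translates the finished tree to an output format."""

    def write(self, document):
        return "\n".join(f"<p>{text}</p>" for _, text in document["children"])


class ToyPublisher:
    """Facade that owns the components and drives the processing."""

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def publish(self, source):
        document = self.reader.read(source)
        return self.writer.write(document)


def toy_publish_string(source):
    """Convenience function: default components, string in, string out."""
    return ToyPublisher(ToyReader(ToyParser()), ToyWriter()).publish(source)


html = toy_publish_string("Hello,\nworld.\n\nSecond paragraph.")
print(html)
```

Custom behavior in this sketch follows the same pattern the text describes: build the component objects yourself, configure them, and hand them to the facade instead of relying on the convenience function's defaults.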
Readers understand the input context (where the data is coming from), send the whole input or discrete "chunks" to the parser, and provide the context to bind the chunks together back into a cohesive whole.
Each reader is a module or package exporting a "Reader" class with a "read" method. The base "Reader" class can be found in the docutils/readers/__init__.py module.
Most Readers will have to be told what parser to use. So far (see the list of examples below), only the Python Source Reader ("PySource"; still incomplete) will be able to determine the parser on its own.
Responsibilities:
Examples:
Standalone (Raw/Plain): Just read a text file and process it. The reader needs to be told which parser to use.
The "Standalone Reader" has been implemented in module docutils.readers.standalone.
Python Source: See Python Source Reader below. This Reader is currently in development in the Docutils sandbox.
Email: RFC-822 headers, quoted excerpts, signatures, MIME parts.
PEP: RFC-822 headers, "PEP xxxx" and "RFC xxxx" conversion to URIs. The "PEP Reader" has been implemented in module docutils.readers.pep; see PEP 287 and PEP 12.
Wiki: Global reference lookups of "wiki links" incorporated into transforms. (CamelCase only or unrestricted?) Lazy indentation?
Web Page: As standalone, but recognize meta fields as meta tags. Support for templates of some sort? (After <body>, before </body>?)
FAQ: Structured "question & answer(s)" constructs.
Compound document: Merge chapters into a book. Master manifest file?
Parsers analyze their input and produce a Docutils document tree. They don't know or care anything about the source or destination of the data.
Each input parser is a module or package exporting a "Parser" class with a "parse" method. The base "Parser" class can be found in the docutils/parsers/__init__.py module.
Responsibilities: Given raw input text and a doctree root node, populate the doctree by parsing the input text.
Example: The only parser implemented so far is for the reStructuredText markup. It is implemented in the docutils/parsers/rst/ package.
The development and integration of other parsers is possible and encouraged.
The Transformer class, in docutils/transforms/__init__.py, stores transforms and applies them to documents. A transformer object is attached to every new document tree. The Publisher calls Transformer.apply_transforms() to apply all stored transforms to the document tree. Transforms change the document tree from one form to another, add to the tree, or prune it. Transforms resolve references and footnote numbers, process interpreted text, and do other context-sensitive processing.
Some transforms are specific to components (Readers, Parsers, Writers, Input, Output). Standard component-specific transforms are specified in the default_transforms attribute of component classes. After the Reader has finished processing, the Publisher calls Transformer.populate_from_components() with a list of components and all default transforms are stored.
Each transform is a class in a module in the docutils/transforms/ package, a subclass of docutils.transforms.Transform. Transform classes each have a default_priority attribute which is used by the Transformer to apply transforms in order (low to high). The default priority can be overridden when adding transforms to the Transformer object.
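The priority ordering described above might be sketched as follows. This is an illustration under stated assumptions, not the Docutils code: ToyTransform and ToyTransformer are hypothetical stand-ins, the two example transforms and their priority numbers (300 and 500) are invented for the demonstration, and a plain list plays the role of the document tree.

```python
class ToyTransform:
    """Hypothetical stand-in for a transform base class."""

    default_priority = None  # subclasses set a number; lower runs earlier

    def __init__(self, document):
        self.document = document

    def apply(self):
        raise NotImplementedError


class ToyTransformer:
    """Stores transform classes and applies them in priority order."""

    def __init__(self, document):
        self.document = document
        self.transforms = []  # entries: (priority, insertion order, class)

    def add_transform(self, transform_class, priority=None):
        # A priority given here overrides the class's default_priority.
        if priority is None:
            priority = transform_class.default_priority
        self.transforms.append((priority, len(self.transforms), transform_class))

    def apply_transforms(self):
        # Sort on (priority, insertion order) so ties keep insertion order.
        for _, _, transform_class in sorted(self.transforms, key=lambda t: t[:2]):
            transform_class(self.document).apply()


class ResolveReferences(ToyTransform):
    default_priority = 500  # arbitrary value for this sketch

    def apply(self):
        self.document.append("references resolved")


class NumberFootnotes(ToyTransform):
    default_priority = 300  # arbitrary value for this sketch

    def apply(self):
        self.document.append("footnotes numbered")


log = []  # a list stands in for the document tree
transformer = ToyTransformer(log)
transformer.add_transform(ResolveReferences)
transformer.add_transform(NumberFootnotes)
transformer.apply_transforms()
print(log)
```

Although ResolveReferences was added first, NumberFootnotes runs first because its priority number is lower. Passing an explicit priority to add_transform() changes that order without touching the class.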
Transformer responsibilities:
Transform responsibilities:
Examples of transforms (in the docutils/transforms/ package):
Writers produce the final output (HTML, XML, TeX, etc.). Writers translate the internal document tree structure into the final data format, possibly running Writer-specific transforms first.
By the time the document gets to the Writer, it should be in final form. The Writer's job is simply (and only) to translate from the Docutils doctree structure to the target format. Some small transforms may be required, but they should be local and format-specific.
Each writer is a module or package exporting a "Writer" class with a "write" method. The base "Writer" class can be found in the docutils/writers/__init__.py module.
Responsibilities:
Examples:
I/O classes provide a uniform API for low-level input and output. Subclasses will exist for a variety of input/output mechanisms. However, they can be considered an implementation detail. Most applications should be satisfied using one of the convenience functions associated with the Publisher.
I/O classes are currently in the preliminary stages; there's a lot of work yet to be done. Issues:
Responsibilities:
Examples of input sources:
Examples of output destinations:
Package "docutils".
Module "__init__.py" contains: class "Component", a base class for Docutils components; class "SettingsSpec", a base class for specifying runtime settings (used by docutils.frontend); and class "TransformSpec", a base class for specifying transforms.
Module "docutils.core" contains facade class "Publisher" and convenience functions. See Publisher above.
Module "docutils.frontend" provides runtime settings support, for programmatic use and front-end tools (including configuration file support, and command-line argument and option processing).
Module "docutils.io" provides a uniform API for low-level input and output. See Input/Output above.
Module "docutils.nodes" contains the Docutils document tree element class library plus tree-traversal Visitor pattern base classes. See Document Tree below.
Module "docutils.statemachine" contains a finite state machine specialized for regular-expression-based text filters and parsers. The reStructuredText parser implementation is based on this module.
Module "docutils.urischemes" contains a mapping of known URI schemes ("http", "ftp", "mail", etc.).
Module "docutils.utils" contains utility functions and classes, including a logger class ("Reporter"; see Error Handling below).
Package "docutils.parsers": markup parsers.
See Parsers above.
Package "docutils.readers": context-aware input readers.
See Readers above.
Package "docutils.writers": output format writers.
Subpackages of "docutils.writers" contain modules and data files (such as stylesheets) that support the individual writers.
See Writers above.
Package "docutils.transforms": tree transform classes.
See Transforms above.
Package "docutils.languages": Language modules contain language-dependent strings and mappings. They are named for their language identifier (as defined in Choice of Docstring Format below), converting dashes to underscores.
Third-party modules: "extras" directory. These modules are installed only if they're not already present in the Python installation.
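The naming rule for language modules given above (language identifier with dashes converted to underscores) can be shown in one small function. This is a sketch only: the function name is hypothetical, and lowercasing the identifier is an assumption made here because language identifiers are treated as case-insensitive elsewhere in this document.

```python
def language_module_name(language_id):
    """Map a language identifier such as 'pt-BR' to a module name."""
    # Assumption: normalize case first, then replace dashes, since a
    # dash is not valid in a Python module name.
    return language_id.lower().replace("-", "_")


print(language_module_name("pt-BR"))
```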
The tools/ directory contains several front ends for common Docutils processing. See Docutils Front-End Tools [4] for details.
A single intermediate data structure is used internally by Docutils, in the interfaces between components; it is defined in the docutils.nodes module. It is not required that this data structure be used internally by any of the components, just between components as outlined in the diagram in the Docutils Project Model above.
Custom node types are allowed, provided that either (a) a transform converts them to standard Docutils nodes before they reach the Writer proper, or (b) the custom node is explicitly supported by certain Writers, and is wrapped in a filtered "pending" node. An example of condition (a) is the Python Source Reader (see below), where a "stylist" transform converts custom nodes. The HTML <meta> tag is an example of condition (b); it is supported by the HTML Writer but not by others. The reStructuredText "meta" directive creates a "pending" node, which contains knowledge that the embedded "meta" node can only be handled by HTML-compatible writers. The "pending" node is resolved by the docutils.transforms.components.Filter transform, which checks that the calling writer supports HTML; if it doesn't, the "pending" node (and enclosed "meta" node) is removed from the document.
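Condition (b) can be pictured with a toy filter. The sketch below is not the docutils.transforms.components.Filter code; it assumes a hypothetical representation in which nodes are dictionaries and a "pending" node records the single format it requires. It only shows the decision the text describes: keep the wrapped node when the writer supports the format, otherwise drop both the "pending" node and its contents.

```python
def filter_pending(children, writer_formats):
    """Resolve toy 'pending' nodes against the formats a writer supports."""
    resolved = []
    for node in children:
        if node["type"] != "pending":
            resolved.append(node)  # ordinary nodes pass through untouched
        elif node["format"] in writer_formats:
            resolved.append(node["wrapped"])  # unwrap the supported node
        # Unsupported: the pending node and its wrapped node are dropped.
    return resolved


tree = [
    {"type": "paragraph", "text": "Body text."},
    {
        "type": "pending",
        "format": "html",
        "wrapped": {"type": "meta", "name": "keywords", "content": "docutils"},
    },
]

html_tree = filter_pending(tree, writer_formats={"html"})
latex_tree = filter_pending(tree, writer_formats={"latex"})
print([node["type"] for node in html_tree])
print([node["type"] for node in latex_tree])
```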
The document tree data structure is similar to a DOM tree, but with specific node names (classes) instead of DOM's generic nodes. The schema is documented in an XML DTD (eXtensible Markup Language Document Type Definition), which comes in two parts:
The DTD defines a rich set of elements, suitable for many input and output formats. The DTD retains all information necessary to reconstruct the original input text, or a reasonable facsimile thereof.
See The Docutils Document Tree [7] for details (incomplete).
When the parser encounters an error in markup, it inserts a system message (DTD element "system_message"). There are five levels of system messages:
Although the initial message levels were devised independently, they have a strong correspondence to VMS error condition severity levels [8]; the names in quotes for levels 1 through 4 were borrowed from VMS. Error handling has since been influenced by the log4j project [9].
The Python Source Reader ("PySource") is the Docutils component that reads Python source files, extracts docstrings in context, then parses, links, and assembles the docstrings into a cohesive whole. It is a major and non-trivial component, currently under experimental development in the Docutils sandbox. High-level design issues are presented here.
This model will evolve over time, incorporating experience and discoveries.
Abstract Syntax Tree mining code will be written (or adapted) that scans a parsed Python module, and returns an ordered tree containing the names, docstrings (including attribute and additional docstrings; see below), and additional info (in parentheses below) of all of the following objects:
(Extract comments too? For example, comments at the start of a module would be a good place for bibliographic field lists.)
In order to evaluate interpreted text cross-references, namespaces for each of the above will also be required.
See the python-dev/docstring-develop thread "AST mining", started on 2001-08-14.
What to examine:
Where:
Docstrings are string literal expressions, and are recognized in the following places within Python modules:
How:
Whenever possible, Python modules should be parsed by Docutils, not imported. There are several reasons:
Of course, standard Python parsing tools such as the "parser" library module should be used.
When the Python source code for a module is not available (i.e. only the .pyc file exists), or for C extension modules, docstrings can only be accessed by importing the module, and any resulting limitations must be accepted.
Since attribute docstrings and additional docstrings are ignored by the Python byte-code compiler, no namespace pollution or runtime bloat will result from their use. They are not assigned to __doc__ or to any other attribute. The initial parsing of a module may take a slight performance hit.
(This is a simplified version of PEP 224 [2].)
A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:
The assignment must be in one of the following contexts:
Since each of the above contexts is at the top level (i.e., in the outermost suite of a definition), it may be necessary to place dummy assignments for attributes assigned conditionally or in a loop.
The assignment must be to a single target, not to a list or a tuple of targets.
The form of the target:
Blank lines may be used after attribute docstrings to emphasize the connection between the assignment and the docstring.
Examples:
    g = 'module attribute (module-global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            """Method __init__'s docstring."""

            self.i = 'instance attribute'
            """This is self.i's docstring."""

    def f(x):
        """Function f's docstring."""
        return x**2

    f.a = 1
    """Function attribute f.a's docstring."""
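One way to find such attribute docstrings without importing a module is to walk its syntax tree. The sketch below is an assumption-laden illustration, not the PySource Reader: it uses the standard library's ast module (Python 3.9 or later for ast.unparse), the function name attribute_docstrings is hypothetical, and it checks only the "single target directly followed by a string literal" rule within one suite. It does not check which context the suite belongs to, and it ignores annotated assignments.

```python
import ast


def attribute_docstrings(statements):
    """Yield (target, docstring) pairs from one suite of statements."""
    for stmt, following in zip(statements, statements[1:]):
        # Only plain assignments with exactly one target qualify.
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1):
            continue
        target = stmt.targets[0]
        # A tuple or list of targets is excluded by the rules above.
        if isinstance(target, (ast.Tuple, ast.List)):
            continue
        # The very next statement must be a bare string literal.
        if (isinstance(following, ast.Expr)
                and isinstance(following.value, ast.Constant)
                and isinstance(following.value.value, str)):
            yield ast.unparse(target), following.value.value


SOURCE = '''
g = 'module attribute'
"""This is g's docstring."""

class AClass:
    c = 'class attribute'
    """This is AClass.c's docstring."""

a, b = 1, 2
"""Ignored: the target is a tuple."""
'''

module = ast.parse(SOURCE)
module_level = dict(attribute_docstrings(module.body))
class_node = next(n for n in module.body if isinstance(n, ast.ClassDef))
class_level = dict(attribute_docstrings(class_node.body))
print(module_level)
print(class_level)
```

A fuller extractor would call the same function on the body of each class and each __init__ method, and would qualify the reported names with their enclosing scope.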
(This idea was adapted from PEP 216 [3].)
Many programmers would like to make extensive use of docstrings for API documentation. However, docstrings do take up space in the running program, so some programmers are reluctant to "bloat up" their code. Also, not all API documentation is applicable to interactive environments, where __doc__ would be displayed.
Docutils' docstring extraction tools will concatenate all string literal expressions which appear at the beginning of a definition or after a simple assignment. Only the first strings in definitions will be available as __doc__, and can be used for brief usage text suitable for interactive sessions; subsequent string literals and all attribute docstrings are ignored by the Python byte-code compiler and may contain more extensive API information.
Example:
    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the byte-code
        compiler, but extracted by Docutils.
        """
        pass
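Collecting the leading string literals of a definition can be sketched with the standard library's ast module. This is an illustrative sketch, not the Docutils extraction tool: the function name leading_docstrings is hypothetical, and the use of inspect.cleandoc to strip indentation is a choice made here for readable output.

```python
import ast
import inspect


def leading_docstrings(definition):
    """Return every string literal at the very start of a definition."""
    strings = []
    for stmt in definition.body:
        if (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            strings.append(inspect.cleandoc(stmt.value.value))
        else:
            break  # the first non-string statement ends the run
    return strings


SOURCE = '''
def function(arg):
    """This is __doc__, function's docstring."""
    """
    This is an additional docstring, ignored by the byte-code
    compiler, but extracted by Docutils.
    """
    pass
'''

func = ast.parse(SOURCE).body[0]
docs = leading_docstrings(func)
for doc in docs:
    print(repr(doc))
```

Only the first entry corresponds to what Python stores in __doc__; the remaining entries exist solely in the source text, which is why they must be recovered by parsing rather than by importing.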
Issue: from __future__ import
Multiple module docstrings (a main docstring plus one or more additional docstrings) would break the "from __future__ import" statements introduced in Python 2.1. The Python Reference Manual specifies:
A future statement must appear near the top of the module. The only lines that can appear before a future statement are:
- the module docstring (if any),
- comments,
- blank lines, and
- other future statements.
Resolution?
Rather than force everyone to use a single docstring format, multiple input formats are allowed by the processing system. A special variable, __docformat__, may appear at the top level of a module before any function or class definitions. Over time or through decree, a standard format or set of formats should emerge.
A module's __docformat__ variable only applies to the objects defined in the module's file. In particular, the __docformat__ variable in a package's __init__.py file does not apply to objects defined in subpackages and submodules.
The __docformat__ variable is a string containing the name of the format being used, a case-insensitive string matching the input parser's module or package name (i.e., the same name as required to "import" the module or package), or a registered alias. If no __docformat__ is specified, the default format is "plaintext" for now; this may be changed to the standard format if one is ever established.
The __docformat__ string may contain an optional second field, separated from the format name (first field) by a single space: a case-insensitive language identifier as defined in RFC 1766. A typical language identifier consists of a 2-letter language code from ISO 639 [11] (3-letter codes used only if no 2-letter code exists; RFC 1766 is currently being revised to allow 3-letter codes). If no language identifier is specified, the default is "en" for English. The language identifier is passed to the parser and can be used for language-dependent markup features.
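The two-field layout of __docformat__ and its defaults can be summarized in a short function. This is a sketch, not part of Docutils: the name parse_docformat is hypothetical, and splitting on any run of whitespace is slightly more lenient than the single space specified above.

```python
def parse_docformat(value=None):
    """Split a __docformat__ string into (format name, language id)."""
    if value is None or not value.strip():
        # No __docformat__ given: fall back to the documented defaults.
        return ("plaintext", "en")
    fields = value.split()
    name = fields[0].lower()  # format names are case-insensitive
    language = fields[1].lower() if len(fields) > 1 else "en"
    return (name, language)


print(parse_docformat("reStructuredText en"))
print(parse_docformat("restructuredtext"))
print(parse_docformat())
```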
In Python docstrings, interpreted text is used to classify and mark up program identifiers, such as the names of variables, functions, classes, and modules. If the identifier alone is given, its role is inferred implicitly according to the Python namespace lookup rules. For functions and methods (even when dynamically assigned), parentheses ('()') may be included:
This function uses `another()` to do its work.
For class, instance and module attributes, dotted identifiers are used when necessary. For example (using reStructuredText markup):
    class Keeper(Storer):

        """
        Extend `Storer`.  Class attribute `instances` keeps track
        of the number of `Keeper` objects instantiated.
        """

        instances = 0
        """How many `Keeper` objects are there?"""

        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of instances.

            Keep count in `Keeper.instances`, data in `self.data`.
            """
            Storer.__init__(self)
            Keeper.instances += 1

            self.data = []
            """Store data in a list, most recent last."""

        def store_data(self, data):
            """
            Extend `Storer.store_data()`; append new `data` to a
            list (in `self.data`).
            """
            self.data.append(data)
Each of the identifiers quoted with backquotes ("`") will become references to the definitions of the identifiers themselves.
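As a rough illustration of the first step in that process, the backquoted identifiers in a docstring can be collected with a regular expression. This is a deliberately naive sketch, not the reStructuredText parser: the pattern and the function name referenced_identifiers are assumptions made here, and the pattern does not distinguish interpreted text from double-backquoted inline literals or from text carrying an explicit role.

```python
import re

# One backquote, a non-space first character, then anything up to the
# closing backquote.
INTERPRETED = re.compile(r"`([^`\s][^`]*)`")


def referenced_identifiers(docstring):
    """Return the identifiers quoted with backquotes, in order."""
    names = []
    for text in INTERPRETED.findall(docstring):
        # Optional trailing parentheses mark callables; drop them so the
        # bare dotted name can be looked up in a namespace.
        names.append(text[:-2] if text.endswith("()") else text)
    return names


DOC = """
Extend `Storer.__init__()` to keep track of instances.

Keep count in `Keeper.instances`, data in `self.data`.
"""

print(referenced_identifiers(DOC))
```

Resolving each name to its definition would then require the per-object namespaces gathered when the module was parsed.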
Stylist transforms are specialized transforms specific to the PySource Reader. The PySource Reader doesn't have to make any decisions as to style; it just produces a logically constructed document tree, parsed and linked, including custom node types. Stylist transforms understand the custom nodes created by the Reader and convert them into standard Docutils nodes.
Multiple Stylist transforms may be implemented and one can be chosen at runtime (through a "--style" or "--stylist" command-line option). Each Stylist transform implements a different layout or style; thus the name. They decouple the context-understanding part of the Reader from the layout-generating part of processing, resulting in a more flexible and robust system. This also serves to "separate style from content", the SGML/XML ideal.
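The separation of context from layout might look like the following in miniature. Everything in this sketch is hypothetical: the custom node is a plain dictionary, the two stylist functions and the style names "plain" and "sections" are invented, and the output tuples merely stand in for standard Docutils nodes.

```python
def plain_stylist(node):
    """Render a custom node as a single paragraph."""
    return ("paragraph", f"{node['kind']} {node['name']}: {node['doc']}")


def sections_stylist(node):
    """Render a custom node as a titled section containing a paragraph."""
    return ("section", node["name"], [("paragraph", node["doc"])])


# Registry from which one stylist is chosen at runtime, for example
# from a command-line option.
STYLISTS = {"plain": plain_stylist, "sections": sections_stylist}


def apply_stylist(custom_nodes, style="plain"):
    """Convert reader-specific nodes with the selected stylist."""
    stylist = STYLISTS[style]
    return [stylist(node) for node in custom_nodes]


custom_nodes = [{"kind": "class", "name": "Keeper", "doc": "Extend Storer."}]
print(apply_stylist(custom_nodes, "plain"))
print(apply_stylist(custom_nodes, "sections"))
```

The reader side never changes between the two calls; only the small stylist function differs, which is the property the text relies on to make custom styles cheap to write.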
By keeping the piece of code that does the styling small and modular, it becomes much easier for people to roll their own styles. The "barrier to entry" is too high with existing tools; extracting the stylist code will lower the barrier considerably.
[1] | PEP 256, Docstring Processing System Framework, Goodger (http://www.python.org/peps/pep-0256.html) |
[2] | PEP 224, Attribute Docstrings, Lemburg (http://www.python.org/peps/pep-0224.html) |
[3] | PEP 216, Docstring Format, Zadka (http://www.python.org/peps/pep-0216.html) |
[4] | https://docutils.sourceforge.io/docs/user/tools.html |
[5] | https://docutils.sourceforge.io/docs/ref/docutils.dtd |
[6] | https://docutils.sourceforge.io/docs/ref/soextblx.dtd |
[7] | https://docutils.sourceforge.io/docs/ref/doctree.html |
[8] | http://www.openvms.compaq.com:8000/73final/5841/841pro_027.html#error_cond_severity |
[9] | http://logging.apache.org/log4j/docs/index.html |
[10] | https://docutils.sourceforge.io/docs/dev/pysource.dtd |
[11] | http://www.loc.gov/standards/iso639-2/englangn.html |
[12] | http://www.python.org/sigs/doc-sig/ |
A SourceForge project has been set up for this work at https://docutils.sourceforge.io/.
This document has been placed in the public domain.
This document borrows ideas from the archives of the Python Doc-SIG [12]. Thanks to all members past & present.