Thoughts on pysource

Author:: Tibs
Date:: 19 Sep 2002

[This is the contents of an email sent from Tibs to me, to tie up -- or at least identify -- loose ends. When Tibs writes "you", he's referring to me. It is published with Tibs' permission. I mostly agree with what Tibs has written. I've added my own notes, indented in square brackets, like this one.

—David Goodger]

Well, having looked back at the pysource sources, I'm confirmed in my opinion that, if I were to start work on the project again, I would probably start a new version of pysource from scratch, using what I have learned (essentially, the content of visit.py), and taking advantage of all the new technology that is in the modern DOCUTILS.

Thus I would (at least):

produce a "frontend" using Optik
rewrite visit.py more neatly
construct DOCUTILS nodes to match pysource.dtd
possibly meld those into the classes in visit.py (but possibly not - some of the stuff in visit.py seems to me to be, of necessity, a bit messy as it is constructing stuff, and thus it may be premature to "DOCUTILS" it before it is necessary).
produce an example transform to amend the Python specific DOCUTILS tree into a generic DOCUTILS tree.

That's a fair amount of work, and I'm not convinced that I can find the sustained effort [1], especially if you might be willing to take the task on (and if I would have been refactoring anyway, a whole different head may be a significant benefit).

Some comments on pysource.txt and pysource.dtd

I believe that a <docstring> node is a Good Thing. It's a natural construct (if one believes that the other nodes are good things as well!), and it allows the "transform" of the Python specific tree to be done more sensibly.

[I think I finally understand what Tibs is getting at here. I thought he meant to have a <docstring> node in an otherwise standard Docutils doctree, which just contained the raw docstring. Instead, I believe he means that the Python-specific Docutils doctree elements like <class_section> and <module_section> (see pysource.dtd), should each have a <docstring> element which contains %structure.model;, instead of the current %structure.model; directly inside the element. In other words, a <docstring> element would be a simple container for all the parsed elements of the docstring. On second thought, each Python-specific section element would still have to have %structure.model;, in order to contain subsections. A <package_section> would contain <module_section>s, which would contain <class_section>s, and so on.

So a <docstring> element may be useful as a convenience to transforms. It would be a trivial change anyhow.

The initial (internal) data structure resulting from the parsing of Python modules & packages is a completely different beast. It should contain <docstring> nodes, along with <package>, <module>, <class>, etc., nodes. But these should not be subclasses of docutils.nodes.Node, and this tree should not be a "Docutils document tree". It should be a separate, parallel structure. Document tree nodes (docutils.nodes.Node objects) are not suited for this work.

—DG]

I recommend some initial "surgery" on the initial parse tree to make finding the docstrings for nodes easier.

I reckon that one produces a Python-specific doctree, and then chooses one of several transforms to produce generic doctrees.

By the way, I still vote for "static" introspection - that is, without actually importing the module. It does limit what one can do in some respects, but there are modules one might want to document that one does not wish to import (and indeed, at work we have some Python files that essentially cannot be introspected in this manner with any ease - they are "designed" to be imported by other modules after sys.path has been mangled suitably, and just don't work stand-alone).

I've tried to expand out the ideas you had on how the "pysource" tool should work below.

although it occurs to me that this is all terribly obvious, really. Oh well...

The pysource tool

The "input mode" should be the "Python Source Reader".

You can see where I'm starting from in pysource.txt.

This produces, as its output, a Python-specific doctree, containing extra nodes as defined in pysource.dtd (etc.).

One of the tasks undertaken by the Python Source Reader is to decide what to do with docstrings. For each Python file, it discovers (from __docformat__) if the input parser is "reStructuredText". If it is, then the contents of each docstring in that file will be parsed as such, including sorting out interpreted text (i.e., rendering it into links across the tree). If it is not, then each docstring will be treated as a literal block.

This admits the posibility of adding other parsers for docstrings at a later date - which I suspect is how HappyDoc does it. It is just necessary to publicise the API for the Docstring class, and then someone else can provide alternate plugins.

The output of the Python Source Reader is thus a Python-specific doctree, with all docstrings fully processed (as appropriate), and held inside <docstring> elements.

The next stage is handled by the Layout Transformer.

[What I call "stylist transforms". --DG]

This determines the layout of the document to be produced - the kind or style of output that the user wants (e.g., PyDoc style, LibRefMan style, generic docutils style, etc.).

Each layout is represented by a class that walks the Python-specific doctree, replacing the Python-specific nodes with appropriate generic nodes. The output of the Layout Transformer is thus a generic doctree.

The final stage is thus a normal DOCUTILS Writer - since it is taking input that is a generic doctree, this makes perfect sense. This also means that we get maximum leverage from existing Writers, which is essential (not just a Good Thing).

As you point out, some layouts will probably only be appropriate for some output formats. Well, that's OK. On the other hand, if the output of the Layout stage is a generic doctree, we're not likely to get fatal errors by putting it through the wrong Writer, so we needn't worry too much.

Thus one might envisage a command line something like:

pysource --layout:book --format:html this_module

Of course, other command line switches would be options for particular phases (e.g., to use frames or not for HTML output). I can see wanting a series of configuration files that one could reference, to save specifying lots of switches.

One specific thing to be decided, particularly for HTML, is whether one is outputting a "cluster" of files (e.g., as javadoc does). I reckon this can be left for later on, though (as can such issues as saying "other interesting sources are over there, so reference them if you can" - e.g., the Python library).