XML-Based Programming Systems
Greg Wilson
gvwilson@cs.utoronto.ca
September 2003

The realization that programs are just another kind of data is fundamental to computing. However, while data is stored in uniform, extensible, easy-to-process formats like XML, programs are stored as more-or-less arbitrary sequences of ASCII tokens. This primitive representation makes programs, and new programming tools, needlessly difficult to create. Switching to a richer storage format would facilitate better development practices, and allow new ideas to enter mainstream use more quickly.

Thesis

Mainstream programming languages are stuck in a rut

Information can be injected into the tool chain at only a limited number of points

Type of information that can be injected is not extensible

This is already changing

Unevenly, haphazardly, and un-self-consciously

Getting this right will revolutionize programming

And help research ideas move into the mainstream more quickly

Turing's Big Idea

Programs are data, data (can be) programs

1. Running programs are just bytes in memory

2. Source code can be---and should be---manipulated like any other text

3. Every moderately complex macro or configuration language is really a programming language

We're making some progress on #1

Java's biggest intellectual contribution is to bring reflection into the mainstream

But very little on #2

Most C/C++ programmers don't think of CPP as a text-to-text transformer

View code generators with suspicion and/or superstitious awe

CASE tools are the world's best-selling shelfware

And we keep trying to pretend that #3 isn't true

Every tool configuration syntax eventually needs conditionals, repetition, functions, etc.

But our initial designs repeatedly try to avoid including them

Most programming tools just don't get it

Have you seen a debugger that understands C++ templates?

Do you expect to get one for Java generics any time soon?

How easy is it to program your debugger?

Better question: why isn't it as easy as writing macros for your editor?

Theme and Variations

Java Server Pages (JSPs)

Embed code fragments and directives in web pages

Process JSP to create a Java servlet

Pro: reduce cognitive gap between programmers' mental model and textual artifact

Con:

JSP syntax is even harder to parse than its parents

When something goes wrong, programmers have to reverse engineer the translation in their heads

<HTML>
<BODY>
<TABLE BORDER=2>
<%
  for (int i = 0; i<n; i++) {
%>
    <TR>
    <TD>Number</TD>
    <TD><%= i+1 %></TD>
    </TR>
<%
  }
%>
</TABLE>
</BODY>
</HTML>
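
The translation that programmers have to reverse engineer can be sketched as a small text-to-text transformer. This is a toy, assuming only the <% ... %> and <%= ... %> forms used above; the names translate and java_string are invented for the sketch, and real JSP containers generate considerably more code than this.

```python
import re

# The table fragment from the page above.
JSP = """<TABLE BORDER=2>
<%
  for (int i = 0; i<n; i++) {
%>
    <TR>
    <TD>Number</TD>
    <TD><%= i+1 %></TD>
    </TR>
<%
  }
%>
</TABLE>
"""

# Matches <% code %> and <%= expression %>.
TOKEN = re.compile(r"<%(=?)(.*?)%>", re.DOTALL)

def java_string(text):
    """Quote literal page text as a Java string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'

def translate(jsp):
    """Turn a JSP-like page into the Java statements a servlet body might hold."""
    lines = []
    pos = 0
    for match in TOKEN.finditer(jsp):
        literal = jsp[pos:match.start()]
        if literal:
            lines.append("out.print(%s);" % java_string(literal))
        is_expr, code = match.group(1), match.group(2).strip()
        # Expressions are printed; other fragments are copied through verbatim.
        lines.append(("out.print(%s);" % code) if is_expr else code)
        pos = match.end()
    if jsp[pos:]:
        lines.append("out.print(%s);" % java_string(jsp[pos:]))
    return "\n".join(lines)

print(translate(JSP))
```

The loop header and its closing brace end up several generated statements apart, which is exactly the gap a programmer has to bridge mentally when a stack trace points into the generated servlet.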

JavaDoc

Embed HTML in Java source using specially-marked comments

Along with special shorthand directives that aren't HTML

(Yet another) processor pulls them out and formats them

Pro: the closer documentation is to code, the more likely it is to be up to date

Con:

Makes source code ugly

Human beings should not have to type, or see, <b> in the early 21st Century

No guarantee that it's actually correct

Inextensible

A pale shadow of Knuth's Literate Programming

Debugging requires still more mental reverse engineering

/**
 * This class prints <em>odd</em> numbers.
 * See the <a href="copyright.html">copyright</a>.
 * @author Greg Wilson
 * @version 1.2
 */

public class Odds {
  public static void main(String[] args) {
    for (int i=0; i<10; ++i) {
      if (i % 2 != 0) {
        System.out.println(i);
      }
    }
  }
}

Ant

A replacement for Make developed as part of Apache's Jakarta efforts

Build specification written in XML

Invokes plugins written in Java to perform actions

Pro:

Real platform independence

Real extensibility

Con:

XML has lower signal-to-noise ratio than Makefiles

Have to drop down one level of abstraction in order to debug

Many attributes require extra parsing ($ expansion, value lists, etc.)

Which makes their content inaccessible to generic tools


<project name="MyProject" default="dist" basedir=".">

  <property name="src" location="src"/>
  <property name="build" location="build"/>
  <property name="dist"  location="dist"/>

  <target name="init" description="setup">
    <tstamp/>
    <mkdir dir="${build}"/>
  </target>

  <target name="compile" depends="init">
    <javac srcdir="${src}" destdir="${build}"/>
  </target>

  <target name="dist" depends="compile">
    <mkdir dir="${dist}/lib"/>
    <jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}"/>
  </target>

  <target name="clean">
    <delete dir="${build}"/>
    <delete dir="${dist}"/>
  </target>
</project>
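
The complaint about attributes needing extra parsing can be made concrete. In this minimal sketch, Python's standard xml.etree.ElementTree stands in for a "generic tool"; the expand helper and its regular expression are assumptions made for illustration and are not how Ant itself resolves properties.

```python
import re
import xml.etree.ElementTree as ET

# A cut-down version of the build file shown above.
BUILD_XML = """
<project name="MyProject" default="dist" basedir=".">
  <property name="build" location="build"/>
  <target name="init">
    <mkdir dir="${build}"/>
  </target>
</project>
"""

# Hypothetical second parsing pass for the ${...} mini-language.
PROPERTY_REF = re.compile(r"\$\{([^}]+)\}")

def expand(value, properties):
    """Replace each ${name} with its value; leave unknown names untouched."""
    return PROPERTY_REF.sub(lambda m: properties.get(m.group(1), m.group(0)), value)

root = ET.fromstring(BUILD_XML)
properties = {p.get("name"): p.get("location") for p in root.iter("property")}

raw = root.find("target/mkdir").get("dir")
print(raw)                      # the opaque string a generic XML tool sees
print(expand(raw, properties))  # the value the build tool actually means
```

The XML parser delivers the attribute as an uninterpreted string, so every tool that wants the real value has to reimplement the second pass.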

XSLT

Language for specifying XML-to-text transformations

Output is usually XML or HTML, but flat text is possible

A declarative language (sort of)

Match/replace with forall and conditionals

Pro:

Better than the alternatives

And there are debuggers!

Con:

Did this wheel really need reinventing?

Most of the program isn't available via XML

<BODY bgcolor="{/FitnessCenter/Member/FavoriteColor}">
    Welcome <xsl:value-of select="/FitnessCenter/Member/Name"/>!
    <xsl:if test="/FitnessCenter/Member/@level='platinum'">
        Our special offer to platinum members today is now open.
    </xsl:if>
    Your phone numbers are:
    <TABLE border="1" width="25%">
        <TR><TH>Type</TH><TH>Number</TH></TR>
        <xsl:for-each select="/FitnessCenter/Member/Phone">
            <TR>
                <TD><xsl:value-of select="@type"/></TD>
                <TD><xsl:value-of select="."/></TD>
            </TR>
        </xsl:for-each>
    </TABLE>
</BODY>
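
For comparison, the same match, conditional, and for-each steps can be written in a general-purpose language. This is a rough sketch only: the input document is invented to fit the paths used in the stylesheet fragment above, and render is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented input, shaped to match the XPath expressions in the fragment above.
MEMBER_XML = """
<FitnessCenter>
  <Member level="platinum">
    <Name>Jeff</Name>
    <FavoriteColor>lightblue</FavoriteColor>
    <Phone type="home">555-1234</Phone>
    <Phone type="work">555-4321</Phone>
  </Member>
</FitnessCenter>
"""

def render(xml_text):
    """Produce roughly the HTML that the stylesheet fragment would produce."""
    member = ET.fromstring(xml_text).find("Member")
    color = escape(member.findtext("FavoriteColor", ""), {'"': "&quot;"})
    out = ['<BODY bgcolor="%s">' % color]
    out.append("Welcome %s!" % escape(member.findtext("Name", "")))
    if member.get("level") == "platinum":            # plays the role of xsl:if
        out.append("Our special offer to platinum members today is now open.")
    out.append("Your phone numbers are:")
    out.append('<TABLE border="1" width="25%">')
    out.append("<TR><TH>Type</TH><TH>Number</TH></TR>")
    for phone in member.findall("Phone"):            # plays the role of xsl:for-each
        out.append("<TR><TD>%s</TD><TD>%s</TD></TR>"
                   % (escape(phone.get("type", "")), escape(phone.text or "")))
    out.append("</TABLE>")
    out.append("</BODY>")
    return "\n".join(out)

print(render(MEMBER_XML))
```

The control flow is the same in both versions; what XSLT adds is that the output template stays visible as markup instead of being scattered across string literals.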

Many other relevant examples

JSR-175: Adding Metadata to Java

Allows programmers to insert small tags of the form @something into code

E.g. to specify that a class needs a remote stub, or that a field is a property

C++ Template Metaprogramming

C++ template expansion mechanism is Turing-complete

Recursion, conditionals, integer arithmetic

So cleverly-written template classes can be used to turn the compiler into a code generator

Todd likes his music Baroque, too

SUIF (Stanford University Intermediate Format)

Compiler loads a set of optimization modules

Each module reads and writes a uniform format

It's a pity they stopped with code generation...

Apache

Increasingly a framework for invoking dynamically-loaded plugins

"We'll take care of communication, you take care of content"

Configuration files are, well, challenging...

So what's really going on here?

Trying to squeeze new types of content into old formats

Blurring the distinction between code and data

Programming our programming tools

And Now For Something Completely the Same

Unix: an improvement on all its successors

The "lots of little tools" paradigm made it an extremely productive environment

But what made LOLT work?

Common data format: newline-separated strings

Common communications protocol: stdin, stdout, exit(0)

The world's first component-based programming system

Many important aspects of programs cannot be captured by the "stream of strings" paradigm

Proof: Java doesn't have a preprocessor, and Bjarne would like you to use C++'s less

There's a new universal data format in town

(Just about) everything either is, or can pretend to be, XML

Yes, there's some bandwagoneering going on

But there's a lot to be said for being able to process everything in a uniform way

Scheme: possibly the cleanest programming language ever invented

Represented data and code in a single format

Which in practice meant that it didn't really distinguish between them

And which made powerful programming tools easier to build

That uniformity also allowed Scheme to provide a workable syntax extension mechanism

Safely translate user-defined forms into established forms

As far ahead of its time as Smalltalk

Which may be one of the reasons it never became more than a boutique language
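
The idea of translating user-defined forms into established ones can be sketched with nested lists standing in for S-expressions. The expand function and the unless form are assumptions made for this example, and the sketch ignores the hygiene rules that make Scheme's real mechanism safe.

```python
def expand(form, macros):
    """Recursively rewrite user-defined forms into established ones."""
    if not isinstance(form, list) or not form:
        return form                      # atoms and empty lists are unchanged
    head = form[0]
    if isinstance(head, str) and head in macros:
        # Rewrite, then expand again in case the result uses other macros.
        return expand(macros[head](*form[1:]), macros)
    return [expand(item, macros) for item in form]

# Hypothetical user-defined form: (unless test body) => (if (not test) body)
MACROS = {
    "unless": lambda test, body: ["if", ["not", test], body],
}

program = ["define", "check",
           ["unless", ["<", "age", "threshold"], ["release", "record"]]]
print(expand(program, MACROS))
```

Because code is already a data structure, the expander is a dozen lines of ordinary list processing and needs no parser of its own.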

A Modest Proposal

Store programs as XML documents

That is, store all program structure explicitly in a single format

Create smart tools that:

Support WYSIWYG editing

Just like everyone else's editors these days

Real programmers don't read tags

Support extension via namespace-protected metadata

Which will often itself be programs to be executed by tools

What does this buy us?

Can view source however we want

Just as we customize views of web pages, CAD drawings, and other documents

More important: freely mix program source and other types of content

Documentation

Diagrams

My secretary can put a sketch of the new floor plan in email

Why can't I put a class diagram in source code?

Processing instructions

What was that last one?

Using a uniform storage format gives us an extensible way to embed arbitrary metadata in programs

Free from the (many) limitations of the examples given above

Each tool in the chain can inspect, inject, modify, or process whatever it wants

LOLT taken to the next level

This includes tools that run after the compiler is done

BLOB on disk includes instructions for debugger, profiler, logger, etc.

So what does this look like?

<program>
  <doc>
    ...something like XHTML...
  </doc>
  <codegen>
    ...stuff telling code generator how to generate extra code...
  </codegen>
  <staticcode>
    ...invariant stuff to be compiled...
    <doc>
      ...which may itself contain documentation blocks...
    </doc>
  </staticcode>
  <debugger>
    ...code to be executed by debugger, e.g. to customize display...
    <runtime-help>
      ...which may itself contain documentation blocks...
    </runtime-help>
  </debugger>
  <profiler>
    ...and so on...
  </profiler>
</program>
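
One way tools in the chain might pick out their own sections and leave the rest alone is sketched below. The namespace URIs, the element contents, and the function names are all placeholders; the talk does not define any of them.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, one per tool.
DOC_NS = "http://example.org/doc"
DBG_NS = "http://example.org/debugger"
PROF_NS = "http://example.org/profiler"

PROGRAM_XML = """
<program xmlns:doc="http://example.org/doc"
         xmlns:dbg="http://example.org/debugger">
  <doc:doc>Prints odd numbers.</doc:doc>
  <staticcode>
    <doc:doc>Main loop.</doc:doc>
  </staticcode>
  <dbg:debugger>show i as decimal</dbg:debugger>
</program>
"""

def sections_for(root, namespace):
    """Return every element, at any depth, that belongs to one tool's namespace."""
    prefix = "{%s}" % namespace
    return [el for el in root.iter() if el.tag.startswith(prefix)]

def doc_tool(root):
    """A documentation tool reads only the doc: elements."""
    return [el.text for el in sections_for(root, DOC_NS)]

def profiler_tool(root):
    """A later tool injects its own metadata without touching anyone else's."""
    ET.SubElement(root, "{%s}profiler" % PROF_NS).text = "count calls"
    return root

root = ET.fromstring(PROGRAM_XML)
print(doc_tool(root))
profiler_tool(root)
print(len(sections_for(root, DBG_NS)), len(sections_for(root, PROF_NS)))
```

Each tool needs only a generic XML library and its own namespace; it never has to parse, or even recognize, what the other tools have stored.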

Instructions themselves

<for-loop>
  <for-loop-head>
  </for-loop-head>
  <stmt-seq>
    <doc>Only replace below threshold</doc>
    <cond>
      <test>
        <compare-expr operator="less">
          <field-expr field="age"><evaluate>record</evaluate></field-expr>
          <evaluate>threshold</evaluate>
        </compare-expr>
      </test>
      <body>
        <invoke-expr method="release"><evaluate>record</evaluate></invoke-expr>
      </body>
    </cond>
  </stmt-seq>
</for-loop>

Good gracious, that's ugly!

But who cares? This is just a model

As computer scientists, we ought to understand (and be comfortable with) the difference between models and views

One possible view of the code block above:

// Only replace below threshold
for (record in candidates) {
    if (record.age < threshold) {
        record.release();
    }
}

Another view of the same model:

;;; Only replace below threshold
(foreach record candidates
  (if (< (field record 'age) threshold)
    (record 'release))
)
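
Both views can be generated mechanically from the one model. The sketch below is a minimal illustration under stated assumptions: the for-loop-head element is empty in the model above, so the var and source attributes used here are invented, as are the style tables and function names.

```python
import xml.etree.ElementTree as ET

# The model shown above; var and source on the loop head are assumptions.
MODEL = """
<for-loop>
  <for-loop-head var="record" source="candidates"/>
  <stmt-seq>
    <doc>Only replace below threshold</doc>
    <cond>
      <test>
        <compare-expr operator="less">
          <field-expr field="age"><evaluate>record</evaluate></field-expr>
          <evaluate>threshold</evaluate>
        </compare-expr>
      </test>
      <body>
        <invoke-expr method="release"><evaluate>record</evaluate></invoke-expr>
      </body>
    </cond>
  </stmt-seq>
</for-loop>
"""

OPERATORS = {"less": "<"}

# One formatting rule per expression element, per view.
C_STYLE = {"field-expr": "{0}.{field}",
           "invoke-expr": "{0}.{method}()",
           "compare-expr": "{0} {op} {1}"}
LISP_STYLE = {"field-expr": "(field {0} '{field})",
              "invoke-expr": "({0} '{method})",
              "compare-expr": "({op} {0} {1})"}

def expr(node, style):
    """Render one expression element, recursing into its children."""
    if node.tag == "evaluate":
        return node.text
    children = [expr(child, style) for child in node]
    attrs = dict(node.attrib)
    if "operator" in attrs:
        attrs["op"] = OPERATORS[attrs["operator"]]
    return style[node.tag].format(*children, **attrs)

def parts(loop, style):
    """Pull out the pieces that both views need."""
    head, seq = loop.find("for-loop-head"), loop.find("stmt-seq")
    cond = seq.find("cond")
    return (seq.findtext("doc"), head.get("var"), head.get("source"),
            expr(cond.find("test")[0], style), expr(cond.find("body")[0], style))

def view_c(loop):
    doc, var, source, test, body = parts(loop, C_STYLE)
    return "\n".join(["// " + doc,
                      "for (%s in %s) {" % (var, source),
                      "    if (%s) {" % test,
                      "        %s;" % body,
                      "    }",
                      "}"])

def view_lisp(loop):
    doc, var, source, test, body = parts(loop, LISP_STYLE)
    return "\n".join([";;; " + doc,
                      "(foreach %s %s" % (var, source),
                      "  (if %s" % test,
                      "    %s))" % body])

loop = ET.fromstring(MODEL)
print(view_c(loop))
print(view_lisp(loop))
```

Adding a third view means adding a third style table and layout function; the model and the tools that operate on it are untouched.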

Examples

Design by contract

Use of certain tags triggers a transformation module

Converts expressions tagged pre and post into legal Java

Injects records formatted for debugger API to allow transformation back to original source

Debugger customization

A standard way to ask the compiler to preserve information in the generated code

So that the debugger can display templated variables using the original semantics

Round-trip CASE

CASE tool can store model descriptions and link to them from code

Standard (XML) editors can be told what to show and hide

Model checker plugin for tool chain will come bundled with the syntax extensions, formats, etc.

We're already doing this

JAR (Java ARchive) and WAR (Web Application Archive) files already contain heterogeneous content

Executable code destined for final product may be only a fraction of content
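
The design-by-contract example above might work along these lines. This is a hedged sketch: the talk names only the pre and post tags, so the method, param, and body elements are invented, and to_java handles only a straight-line void method with no early returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the pre and post tag names come from the talk.
METHOD_XML = """
<method name="withdraw" returns="void">
  <param type="int" name="amount"/>
  <pre>amount &gt; 0</pre>
  <post>balance &gt;= 0</post>
  <body>balance -= amount;</body>
</method>
"""

def to_java(method):
    """Rewrite contract tags as Java assert statements around the body."""
    params = ", ".join("%s %s" % (p.get("type"), p.get("name"))
                       for p in method.findall("param"))
    lines = ["%s %s(%s) {" % (method.get("returns"), method.get("name"), params)]
    for pre in method.findall("pre"):
        # The trailing comment records where the generated line came from.
        lines.append("    assert %s;  // from <pre>" % pre.text.strip())
    lines.append("    " + method.findtext("body").strip())
    for post in method.findall("post"):
        lines.append("    assert %s;  // from <post>" % post.text.strip())
    lines.append("}")
    return "\n".join(lines)

print(to_java(ET.fromstring(METHOD_XML)))
```

The trailing comments are a crude stand-in for the records a real transformation module would hand to the debugger so that it can map generated code back to the original source.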

Objections

"I want to see my programs as they actually are!"

1. You can

Text representation of XML is the equivalent of assembly language

2. You never really have

Remember, it takes over 300,000 lines of C to make magnetized regions on disk appear as formatted text on your screen

/   *   *  \r  \n       *       T   h   i   s       c   l   a
s   s       p   r   i   n   t   s       <   e   m   >   o   d
d       n   u   m   b   e   r   s   <   /   e   m   >   .  \r
\n      *       S   e   e       t   h   e       <   a       h
r   e   f   =   "   {   @   d   o   c   R   o   o   t   }   /
c   o   p   y   r   i   g   h   t   .   h   t   m   l   "   >
C   o   p   y   r   i   g   h   t   <   /   a   >   .  \r  \n
    *       @   a   u   t   h   o   r       G   r   e   g    
W   i   l   s   o   n  \r  \n       *       @   v   e   r   s
i   o   n       1   .   2  \r  \n       *   /  \r  \n  \r  \n
p   u   b   l   i   c       c   l   a   s   s       O   d   d
s       {  \r  \n           p   u   b   l   i   c       s   t
a   t   i   c       v   o   i   d       m   a   i   n   (   S
t   r   i   n   g   [   ]       a   r   g   s   )       {  \r
\n                  f   o   r       (   i   n   t       i   =
0   ;       i   <   1   0   ;       +   +   i   )       {  \r
\n                          i   f       (   i       %       2
)       {  \r  \n                                   S   y   s
t   e   m   .   o   u   t   .   p   r   i   n   t   l   n   (
i   )   ;  \r  \n                           }  \r  \n        
        }  \r  \n           }  \r  \n   }  \r  \n

"Vee Eye or Die!"

Do you use lynx?

More importantly, do you think a generation that has grown up with Word and IE is going to put up with Emacs?

"This can all be done with existing tools"

Sure---and you can write a web server in Fortran-77

The real question is, is it feasible with existing tools?

Since almost no one does it, the answer by inference is "no"

"You don't need XML to do this"

S-expressions would work just as well

Difference is, people will actually use XML...

"This kind of extensibility will make programs harder to understand"

Harder than what?

Scattered directives in a variety of syntaxes?

Or the contortions programmers go through to squeeze metaprogramming, lazy execution, and unification into existing syntaxes?

Remember, every new idea in programming is an amplifier

Allows good programmers to be better

Allows bad programmers to be worse

"'Big Bang' changes never work"

Unless the whole tool chain is owned by a single vendor

MATLAB 7.0 could do this

VB.NET could have used XML as its storage format

If it had, we would be hailing Anders Hejlsberg as a visionary

And hundreds of companies would be building productivity tools right now

Summary

This is already happening

All the examples above

Superx++

Jelly

Many (like Water) that only pretend to be XML

Won't (can't) take off until general-purpose WYSIWYG XML editors appear

But they're on their way, because everyone else needs them

It is the nature of revolutions to begin where no-one is looking

And it would be a lot of fun


$Id: xmlprog.html,v 1.3 2003/09/19 17:54:08 gvwilson Exp $