Kudos to Kendall Clark for stating XML is Not Self-Describing, as I did last month.

He writes;

Well, I’ve read too much Wittgenstein (not to mention too much Aquinas, Meister Eckhart, and Julian of Norwich) to think that a name is necessarily a self-description

I haven’t read them at all (8-), but I think I have a pretty good understanding of self-description, one I developed “bottom up” during my study of Web architecture over the past few years. Since Kendall brought this up again, I thought I’d write a few more words about it.

As I see it, description is always with respect to some context. For example, “The sky is blue” is not a self-descriptive statement unless you know;

  • ASCII
  • English
  • Which sky I mean
  • Which colour blue I mean

For any bag-o-bits, it seems to me that there exists a finite amount of contextual knowledge which is necessary in order to be able to understand it. “Self-describing” then, should mean that the bag itself contains sufficient information to identify the required contextual knowledge.

Tim Berners-Lee likes to talk a lot about this. Last year in Honolulu at WWW2002, his keynote was Specs Count, and much of it was about the value of being able to perform successive applications of public specifications in order to understand a message. That’s contextual knowledge, and as you can see in his talk, it doesn’t begin with the HTTP message; it goes all the way down to the IP segment and Ethernet frame; even those bits must be considered (see an example of where this issue can show up in practice).
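To make the “successive application of public specifications” idea concrete, here’s a minimal sketch (not a real protocol stack; the dispatch table only covers one path): each layer carries a field that identifies which spec governs its payload, so a receiver can keep dispatching downward until it reaches the application data.

```python
# Each layer's discriminator field names the spec for the next layer:
# EtherType 0x0800 -> IPv4, IP protocol 6 -> TCP, port 80 -> HTTP,
# and HTTP's Content-Type header names the payload's spec.
LAYER_DISPATCH = {
    ("ethernet", 0x0800): "ipv4",
    ("ipv4", 6): "tcp",
    ("tcp", 80): "http",
    ("http", "text/html"): "html",
}

def interpret(spec, discriminators):
    """Follow each layer's discriminator to the spec governing the
    next layer, returning the chain of specs applied in order."""
    chain = [spec]
    for value in discriminators:
        spec = LAYER_DISPATCH[(spec, value)]
        chain.append(spec)
    return chain

print(interpret("ethernet", [0x0800, 6, 80, "text/html"]))
# ['ethernet', 'ipv4', 'tcp', 'http', 'html']
```

The point is that no single spec is enough; understanding the message means the receiver can name, and then apply, each spec in the chain.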

Where the Web fits in here is with its contribution of an enormously valuable piece of contextual knowledge; RFC 2396, aka URIs. With respect to the example above, I can use URIs instead of strings, and those URIs can provide the specifics of which blue I meant, say by relating it to other colours.
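A toy illustration of the difference (the example.org URIs are invented for the purpose): the statement is the same either way, but the URI version hands the receiver a hook for finding more contextual knowledge, which the bare string cannot.

```python
# Bare strings: the receiver must already share my context to know
# which sky and which blue I mean.
ambiguous = {"subject": "sky", "colour": "blue"}

# URIs as identifiers: each term now names a hook that can be
# dereferenced (or looked up in other data) for its specifics,
# e.g. how this blue relates to other colours.
unambiguous = {
    "subject": "http://example.org/places/earth/sky",
    "colour":  "http://example.org/colours/blue",
}
```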

There’s lots to be said about XML, RDF, and why SOA-based Web services can never be self-descriptive (hint; too many methods). But I’ll leave it at that for now.

While noticing one of the scaling problems with Web services (a topic worthy of its own blog entry, but I can’t muster the energy), Dave Orchard suggests that XML’s “self-describing nature” will save the day.

Producing self-descriptive documents is hard. XML has some tools that help, in particular XML namespaces, and the fact that it’s markup. But I see those as akin to having a hammer and nail in your hand as you attempt to build a house. XML’s “self-descriptive nature”, even if you accept that “XML” includes namespaces, is only slightly better than ASCII’s.

I’ve been doing a lot of investigation into self-describing data over the past few weeks, and the last thing I consider important is which syntax is used. What I’ve found most important is that identifiers be URIs, and that an identifier not be used where what it resolves to is what’s really needed.

Lancelot: He says they've already *got* one!
Arthur: (confused) Are you *sure* he's got one?
Soldier: Oh yes, it's ver' naahs.
  -- Monty Python and the Holy Grail

Kendall writes;

In principle I support WS-Choreography, even without understanding exactly what it is aiming at, if only because it is likely to be very RDF and REST friendly, and those are, all other things being equal, among my preferred ways of describing information and building information interfaces.

That's good to hear, but I really don't see choreography solutions being anywhere near REST friendly. REST has already got a ver' naahs choreography solution built-in; hypermedia. It's how a REST agent changes state. As Roy wrote;

The model application is therefore an engine that moves from one state to the next by examining and choosing from among the alternative state transitions in the current set of representations.
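Roy’s “engine” can be sketched in a few lines (the URIs, link relations, and hard-coded representations below are stand-ins for documents an agent would actually fetch over HTTP): the agent holds a current representation, examines the transitions it offers, and moves by choosing one.

```python
# Hard-coded stand-ins for representations retrieved over HTTP.
# Each offers a set of named state transitions (links).
REPRESENTATIONS = {
    "http://example.org/order": {
        "body": "Order form",
        "links": {"submit": "http://example.org/order/submitted"},
    },
    "http://example.org/order/submitted": {
        "body": "Order received",
        "links": {},
    },
}

def follow(current_uri, chosen_relation):
    """Move to the next application state by choosing one of the
    alternative transitions in the current representation."""
    links = REPRESENTATIONS[current_uri]["links"]
    return links[chosen_relation]

state = "http://example.org/order"
state = follow(state, "submit")      # the agent chooses a transition
print(REPRESENTATIONS[state]["body"])   # Order received
```

No out-of-band choreography description is consulted; the available next steps arrive inside the representations themselves.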

Norm responds to a post of mine about why I felt that better technology, and not necessarily new standards, was what was required to solve the problems that XML Catalogs were trying to solve.

He offers three things that he believes can’t be done with caches, but can be done with XML Catalogs;

Populate the cache. “Caching proxies rely on the fact that you can access the resource at least once from the web.” wwwoffle does this, but a better caching system need not. When I talked about the need for operating systems to be in on caching (and later with the Save-As idea), what I had in mind was treating the computer’s storage as a structured store (remember Bento?), such that any content would hit the disk “named” with its URI. This would permit the software that Norm installs to include with it a representation of this resource (schema or whatever), named with its one true URI, and available to any app on that machine. Again, no new standards required.
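A minimal sketch of that structured store (the on-disk location and hashing scheme are my assumptions, not a real system): installed software pre-populates it with a representation “named” with its one true URI, and any app on the machine can then look it up without touching the network.

```python
import hashlib
import pathlib

# Assumed location for the URI-keyed store; a real OS-level store
# would live somewhere blessed, not in /tmp.
STORE = pathlib.Path("/tmp/uri-store")

def path_for(uri):
    # Hash the URI to get a safe, fixed-length file name; the URI
    # itself remains the resource's one true name.
    return STORE / hashlib.sha256(uri.encode()).hexdigest()

def populate(uri, representation: bytes):
    """What an installer would do: ship a representation keyed by
    the resource's URI."""
    STORE.mkdir(parents=True, exist_ok=True)
    path_for(uri).write_bytes(representation)

def lookup(uri):
    """What any app on the machine can do: resolve the URI locally."""
    p = path_for(uri)
    return p.read_bytes() if p.exists() else None
```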

In that same section, Norm says that sometimes the URI may never be directly resolvable. That is definitely a possibility, but again, this same mechanism of tightly associating the URI-as-name with the data, makes that mostly moot; it doesn’t matter where the data comes from (modulo trust) if it’s self-describing.

Access Development Resources. Yah, what Mark said.

Devise Your Own Resolution Policies. I think your comment about public identifiers is relevant here; if they were used, this wouldn’t be an issue, and caching would be useful.

But while I maintain that better technology can do what Norm needs, I’m not saying that no standardization was necessary. Given that the technology isn’t there yet to do what is needed, and the extent to which it would need to be pervasively integrated into OSs, standardizing on XML Catalogs may very well have been the best option. But something tells me that the decision to standardize was made without knowledge that a technical solution existed. No biggie, just pointing that out 8-).

I’d heard about XML Catalogs before, but never in a context that piqued my interest enough that I’d want to go learn what they were. Thanks to Norm Walsh’s description of them today in his weblog, I now know.

The idea, it seems, is that you need different identifiers in different contexts. So, for example, an http URL for some document won’t be usable when you’re offline, so you need a way to pair that identifier with a local one on the file system.

My view is that while I agree this is a problem, I don’t think new standards are required to fix it. I suggest that better technology is what is required.

“http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd” is an identifier for a DocBook DTD, and independent of the online status of your notebook, it remains an identifier for that DocBook DTD. What’s needed are operating systems, browsers, and network libraries that, when offline and asked for a representation of the resource identified by that URI, return a cached representation.
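Here’s a sketch of what such a library might do (the cache location is assumed, and the network fetcher is passed in rather than being a real HTTP client): the URI stays the same whether you’re online or not; only where the bytes come from changes.

```python
import hashlib
import pathlib

# Assumed cache location; a real network library would share this
# with the OS rather than invent its own.
CACHE = pathlib.Path("/tmp/representation-cache")

def cached_path(uri):
    return CACHE / hashlib.sha256(uri.encode()).hexdigest()

def get(uri, fetch):
    """Return a representation of the resource identified by uri.
    fetch(uri) -> bytes over the network, raising OSError offline."""
    try:
        body = fetch(uri)
    except OSError:
        # Offline: fall back to the representation cached under the
        # same URI during an earlier, online retrieval.
        return cached_path(uri).read_bytes()
    CACHE.mkdir(parents=True, exist_ok=True)
    cached_path(uri).write_bytes(body)
    return body
```

The caller never learns or cares whether the representation came from the network or the disk; the identifier did its job either way.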

Another consequence of this is that “File->Save As” in a browser should be de-emphasized. I’d prefer it be just “Save” or “Store” or something like that where the user isn’t prompted for a file name. The implication being that the file already has an identifier, so why does it need a different name on my computer? Obviously you’d still want access to “File->Save As” in some cases, but I don’t believe it’s what most people need most of the time.

Simon St. Laurent reports on Norm Walsh’s XML is not Object Oriented essay.

Simon writes;

The only thing I can think to add is that XML is pretty explicitly a rejection of an aspect of OO practice that Norm touches on only briefly: encapsulation. Everything accessible all the time is pretty clearly a hallmark of XML work. You can hide things if you want to, but it takes a lot more effort.

I’m pretty sure that Simon meant to say “data hiding” instead of encapsulation there, as the last sentence suggests. Encapsulation refers to the binding of associated data and behaviour into an identifiable whole. Data hiding refers to, well, hiding that data by not exposing it via the interface. There are many OO fanatics, myself included, who believe that you don’t need data hiding to be OO.

FWIW, I consider the Web to be the epitome of the anti-data-hiding view; resources as objects, URIs as object identifiers, GET as “give me your data”, POST as “process this data”, etc.
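The distinction is easy to show in code. This is an illustrative sketch only: encapsulation (data and behaviour bound into one identifiable whole) without data hiding (nothing made private).

```python
class Resource:
    """Encapsulated: state and behaviour live together under one
    identity. Not data-hidden: every attribute is deliberately public."""

    def __init__(self, uri, data):
        self.uri = uri      # object identifier, public
        self.data = data    # state, public

    def get(self):
        # "give me your data"
        return self.data

    def post(self, new_data):
        # "process this data"
        self.data = new_data

r = Resource("http://example.org/doc", "hello")
r.post("goodbye")
print(r.get())    # goodbye
print(r.data)     # goodbye; directly accessible, nothing is hidden
```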

While noting that Roguewave has terminated its XML/tuple-space project, Ruple, Don Park wrote;

I am getting a dangerous itch to apply tuplespaces to web services workflow problems. TupleSpaces are extremely powerful as coordination infrastructures so tuplespaces and web services go very well together IMHO.

Don, do you realize that REST’s uniform interface (GET/POST, etc.) defines a coordination language very similar to a tuple space?

And for enabling workflow, there’s the additional REST constraint of using hypermedia as the engine of application state.
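Here’s the analogy I have in mind, as a rough sketch. A tuple space’s coordination primitives (write/read/take, often called out/rd/in) line up against REST’s uniform interface; the in-memory “space” here is keyed by URI, which is where the analogy is loosest, since real tuple spaces match on tuple patterns rather than names.

```python
# An in-memory stand-in for the space, keyed by URI.
space = {}

def put(uri, tuple_):
    """~ out/write: place a tuple into the space (cf. PUT/POST)."""
    space[uri] = tuple_

def read(uri):
    """~ rd/read: non-destructive read (cf. GET)."""
    return space.get(uri)

def take(uri):
    """~ in/take: read and remove (roughly GET + DELETE)."""
    return space.pop(uri, None)

put("http://example.org/job/1", ("resize", "photo.png"))
print(read("http://example.org/job/1"))   # ('resize', 'photo.png')
print(take("http://example.org/job/1"))   # ('resize', 'photo.png')
print(read("http://example.org/job/1"))   # None; the tuple was consumed
```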

Jeremy Allaire posts a transcript of a “conversation”(?) with Tim Berners-Lee on the Semantic Web at PC Forum.

Here’s a snippet which includes some of Tim’s words, plus Jeremy’s commentary;

TBL: business model for semantic web is the biz model of the web. it’s how apps interoperate, it’s how apps talk. short answer: dramatically reduce cost of enterprise app integration.

(My side conversation with Adam Bosworth, BEA chief architect and ex-Microsoft, Adam helped shape many of the XML standards. We both agree that this RDF thing is a big joke and TBL is on another planet. Adam helped drive the creation of XML Schema and XML Namespaces, as well as Web Services standards that use these, and these are the things that are actually driving the semantic web. Virtually no one uses RDF, but nearly everyone is moving to these other standards).

I’m a big believer in the technology behind the Semantic Web, but am skeptical that it will see widescale deployment anytime soon, due mainly to the (current) lack of a killer app. But that doesn’t reduce its value for application integration by very much. As we’ve seen, any form of exposure of a system in a machine-processable manner is an improvement over the alternative of having no access at all. It sounds to me like Jeremy and maybe Adam don’t even see the Semantic Web as a solution to the same problem they’re tackling in their Web services work. Well, it is, and it’s worth investigating further before so easily dismissing it.

I’d recommend reading an earlier blog entry about the value of the Semantic Web for integration.

A very nice piece from Tim on how Web services should look; RESTful.

Dave, Sam, and Don have all responded.

First, to Tim, right on man. It’s about time too. For quite a while, he’s seemed to be on the fence, but this seems to make it quite clear where he stands, and hopefully where he’ll be voting on Decision Day. Perhaps with a name as well respected as his firmly in the REST camp, more people will take notice.

To Sam, a couple of points. First, safety != idempotency. Safety means messages don’t change state (roughly). Idempotency means multiple identical messages have the same effect as one. All safe operations are idempotent, but not all idempotent operations are safe; PUT, for example, is idempotent but not safe. In addition, having the response to a GET change, even over a very short period of time, is perfectly RESTful. What isn’t RESTful is the result changing because GET was invoked (and the owner caring about the change).
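A sketch of the distinction, with toy handlers standing in for the server-side effect of each method on a resource store:

```python
store = {}
log = []

def PUT(uri, value):
    # Changes state, so not safe; but N identical PUTs leave the
    # same state as one, so idempotent.
    store[uri] = value

def GET(uri):
    # Changes nothing: safe (and therefore also idempotent).
    return store.get(uri)

def POST(entry):
    # Each invocation has an additional effect: neither safe
    # nor idempotent.
    log.append(entry)

PUT("/x", 1); PUT("/x", 1); PUT("/x", 1)
print(store)      # {'/x': 1}; three identical PUTs, same state as one

POST("hit"); POST("hit")
print(len(log))   # 2; each POST added something
```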

Also, the “take three parameters” analogue misses a major point, I believe. In the POST/SOAP/XML-RPC case, sure, you get the same info back, but you have no way to refer to that info, or to pass references around to other parties. When you marshal data into an http URI, you create a token which has associated with it a publicly specified method of dereferencing. That’s a vast improvement over the one-time, use-and-consume approach of POST for retrievals.
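Concretely (the base URI and parameter names below are invented for illustration), marshalling the “three parameters” into a URI might look like this; unlike a POSTed body, the resulting string is a token any party can hold, pass along, and dereference later with a plain GET.

```python
from urllib.parse import urlencode

# The three parameters, whatever they happen to be.
params = {"symbol": "IBM", "year": "2003", "quarter": "Q1"}

# Marshal them into the URI itself, rather than into a request body.
uri = "http://example.org/reports?" + urlencode(params)

print(uri)
# http://example.org/reports?symbol=IBM&year=2003&quarter=Q1
```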

Dan Brickley finds a wrapper for wget called wget.pl, which adds ETag support to wget, making for a much more network-friendly Blagg/Blosxom aggregator combo.
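What ETag support buys an aggregator is the conditional GET. Here’s a rough sketch (the feed URL and etag value are placeholders; a real client would remember the ETag from its previous response): the client replays the ETag, and a 304 reply means “unchanged”, with no body re-downloaded at all.

```python
import urllib.error
import urllib.request

def conditional_request(url, etag):
    """Build a GET carrying the previously seen ETag."""
    return urllib.request.Request(url, headers={"If-None-Match": etag})

def fetch_if_changed(url, etag):
    """Return the new body if the resource changed, or None on 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag)) as resp:
            return resp.read()      # changed: new body (and a new ETag)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None             # unchanged: nothing was re-sent
        raise
```

Polling a feed every half hour with this costs almost nothing between actual updates, which is what makes the combo network-friendly.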