An Investigation into the Opacity Properties of RFC 2396

One of Tim Berners-Lee's "axioms" that I find myself referring to quite often is the one of Opacity. More than any of his other axioms, I find this one truly fundamental; make the identifier itself opaque, thereby requiring publishers to be explicit about the relationships between their resources through any mechanism that allows them to state these relationships (RDF, etc.).

Now, I'm not a total opacity nut. I recognize that if true opacity cannot be (or has not been) achieved, this isn't the end of the world, and that what's important is that any information available from a URI be uniformly interpreted, and also that we do everything we can to not let opacity degrade in the future.

So, while giving RFC 2396 a thorough going-over, I thought I'd jot down what I learned. I've done this by separator ("?", ";", "/"), as that seems a decent taxonomy by which to classify URI for these purposes.

But first, some context.

Context

This investigation has been performed primarily from the point of view of a client using a URI that it did not publish itself. It also fails to highlight the different degrees of opacity required by the various actors used to resolve a URI. As an example, consider;

client configures a caching proxy
client attempts to resolve an HTTP URI
client's browser opens connection to proxy
client's browser asks proxy for representation of URI
proxy responds with cached version

To resolve this URI with HTTP 1.0, the client treats the URI as entirely opaque; just a string. It doesn't look at the host name, port, URI scheme, nothing. In HTTP 1.1, because of the mandatory Host header, a client MUST crack open the host and port. But it need not concern itself with any other part of the URI.

This could be generalized by saying that in any HTTP request/response chain with N>0 participating processors (not including the origin server), only the Nth processor actually treats the URI as a locator. Everybody else need only treat it as a name for resolution (modulo the host/port).
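As a rough illustration of how little of the URI an HTTP 1.1 client has to inspect when talking to a proxy, here is a minimal Python sketch. The function name build_proxy_request is my own invention, the request is stripped down to the bare minimum, and it assumes the URI carries no userinfo.

```python
from urllib.parse import urlsplit


def build_proxy_request(uri: str) -> str:
    """Build a bare-bones HTTP/1.1 GET request intended for a proxy.

    The URI goes onto the request line untouched, as an opaque string.
    The only part that gets cracked open is the host (and port, if any),
    which is needed for the mandatory Host header.
    """
    # Assumption: no "user@" prefix in the authority, so netloc is host[:port].
    host_and_port = urlsplit(uri).netloc
    return f"GET {uri} HTTP/1.1\r\nHost: {host_and_port}\r\n\r\n"


request = build_proxy_request("http://example.org:8080/foo/bar")
```

Nothing after the authority is examined; the path could be any string at all and the client would behave the same way.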

The "/" separator

"/" is used by hierarchical URI to separate hierarchical parts. This means that a URI such as;

http://example.org/foo/bar/baz

is known to have two hierarchical levels above "baz"; "foo" and "bar". The hierarchy is opaque, in that nobody but the publisher of the URI knows what it means, only that it exists. For example, if I published the URI;

http://example.org/desktop1/window123/frame1

http://example.org/desktop1/window123/frame2

all a third party would know is that there was some hierarchical relationship between the two (specifically that frame1 and frame2 had a common hierarchical "parent"), but not that the relationship was one of containment in a desktop GUI sense.
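To make concrete what a third party can and cannot infer, here is a small Python sketch that computes the common hierarchical parent of two URI purely from their "/" structure. The helper name common_parent is hypothetical, and the approach (comparing path segments after dropping the last one) is just one simple way to do it.

```python
from urllib.parse import urlsplit


def common_parent(uri_a: str, uri_b: str):
    """Return the deepest shared hierarchical parent of two URI, or None.

    Only the "/" structure is examined; the segment names themselves are
    treated as opaque strings that are either equal or not.
    """
    a, b = urlsplit(uri_a), urlsplit(uri_b)
    if (a.scheme, a.netloc) != (b.scheme, b.netloc):
        return None  # different scheme or authority: no shared hierarchy
    # Drop the final segment of each path, keeping only the "containers".
    segs_a = a.path.split("/")[:-1]
    segs_b = b.path.split("/")[:-1]
    shared = []
    for x, y in zip(segs_a, segs_b):
        if x != y:
            break
        shared.append(x)
    return f"{a.scheme}://{a.netloc}" + "/".join(shared) + "/"


parent = common_parent(
    "http://example.org/desktop1/window123/frame1",
    "http://example.org/desktop1/window123/frame2",
)
print(parent)  # http://example.org/desktop1/window123/
```

The code can report that a shared parent exists, but nothing in it (or in the URI) says that the parent is a window containing frames.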

The one standout thing about "/" when compared with the other separators, is that it has a special meaning in the context of relative and base URI processing.

Relative URI are only relative because they are hierarchical. That is, "wow/gee" is a relative URI, but when grounded to a base URI of;

http://example.org/foo/

yields an absolute URI of;

http://example.org/foo/wow/gee

This suggests that URI publishers should be concerned about whether their URI end in "/" or not. If the URI might ever be used as a container, it should end with "/". This also suggests that the relationship between a URI with a terminating slash, and the same URI without the terminating slash, is also nothing more than that same non-specific hierarchical relationship.
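The effect of the terminating slash is easy to check with Python's urllib.parse.urljoin. Note that urljoin follows the later RFC 3986 rules rather than RFC 2396 itself, but for these simple cases I believe the two agree. The variable names are mine.

```python
from urllib.parse import urljoin

# Base ends in "/": "foo" is treated as a container and is kept.
with_slash = urljoin("http://example.org/foo/", "wow/gee")

# Base does not end in "/": "foo" is the last segment and is replaced.
without_slash = urljoin("http://example.org/foo", "wow/gee")

# A reference starting with "/" replaces the whole base path,
# whether or not the base ends in "/".
absolute_path = urljoin("http://example.org/foo/", "/wow/gee")

print(with_slash)     # http://example.org/foo/wow/gee
print(without_slash)  # http://example.org/wow/gee
print(absolute_path)  # http://example.org/wow/gee
```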

The "?" separator

Section 5.2 of RFC 2396 reads "The base URI's query component is not used by the resolution algorithm and may be discarded". This suggests that a URI with a query component makes a poor canonical URI, as the query component can never be used to specify any hierarchical relationship.
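A short Python check of this behaviour, again using urllib.parse.urljoin (which implements the RFC 3986 algorithm, so treat it as an approximation of RFC 2396). The base URI and its "x=1" query are made up for the example.

```python
from urllib.parse import urljoin

# The base has a query component, but it plays no part in resolving
# a relative path against it and does not appear in the result.
resolved = urljoin("http://example.org/foo/bar?x=1", "baz")

print(resolved)  # http://example.org/foo/baz
```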

Also, the obvious opacity issue with queries is that they are not opaque to the client, only to proxies. This is because the client has to "populate" the query fields and therefore has to know what they mean in order to populate them. To a proxy, the URI simply identifies the resultant "completed" URI.

An open issue is whether the existence of a URI;

http://example.org/foo?bar=2

implies the existence of

http://example.org/foo

Logically, this would appear to be the case; the query term can be seen as passing parameters to the "main" URI, and the case of passing no parameters would leave us with this URI. But nowhere in RFC 2396 does it say that this is the case. Whether that omission is an oversight or deliberate, the behaviour appears to be "best current practice" as implemented in many web servers (http://foo.com/bar?a=b gets processed by the processor bound to http://foo.com/bar), so I have no problem accepting it as gospel.
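Under that assumption, deriving the "main" URI is a purely mechanical operation. A minimal Python sketch follows; the helper name without_query is my own, and nothing here proves that the resulting URI actually identifies a resource.

```python
from urllib.parse import urlsplit, urlunsplit


def without_query(uri: str) -> str:
    """Return the URI with its query component removed.

    This only reflects the common server convention described above;
    RFC 2396 does not promise that the result identifies anything.
    """
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(query=""))


main_uri = without_query("http://example.org/foo?bar=2")
print(main_uri)  # http://example.org/foo
```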

The ";" separator

This separator appears to have the fewest issues with respect to opacity. No information can be extracted from it. Parameters are not required to have any meaning by themselves to anybody except the URI publisher.
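For completeness, here is a small Python sketch that separates each path segment from its ";" parameters, following the RFC 2396 rule that any segment may carry them. The function name segment_params and the example URI are hypothetical. The split is done by hand on purpose, since urllib.parse.urlparse only recognizes parameters on the last segment.

```python
from urllib.parse import urlsplit


def segment_params(uri: str):
    """Return a list of (segment, [params]) pairs, one per path segment.

    The parameters are returned as raw strings; nothing here tries to
    interpret them, since only the publisher knows what they mean.
    """
    path = urlsplit(uri).path
    result = []
    # Skip the empty string that precedes the leading "/".
    for segment in path.split("/")[1:]:
        name, *params = segment.split(";")
        result.append((name, params))
    return result


pairs = segment_params("http://example.org/foo;v=1/bar;type=a")
print(pairs)  # [('foo', ['v=1']), ('bar', ['type=a'])]
```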