A little bit more...

Saturday, November 11, 2006

Comments on the W3C DOM and its implementations in different programming languages

First I have to confess that I'm quite unfamiliar with XML processing. I've only done it extensively once, in Delphi, for a project I was involved in.

These days I'm studying techniques for XML processing in Java, and in a previous post I wrote an overview of them. In particular, I mentioned the DOM approach, which is based on the Document Object Model (DOM), a standard object model for XML maintained by the W3C. Here I first give some simple details about DOM itself.

Take the simple XML document shown below:
<?xml version="1.0" encoding="UTF-8" ?>
<song genre="rock">
<name>My December</name>
<singer>Linkin Park</singer>
</song>

The DOM tree should look like this (E indicates an element node and T indicates a text node):
E:song
|--T:characters(whitespace)
|--E:name---T:characters(My December)
|--T:characters(whitespace)
|--E:singer---T:characters(Linkin Park)
|--T:characters(whitespace)
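
A quick way to check this tree is a small sketch like the one below. It is only an illustration under a few assumptions: the class name DomChildDump and the method describeChildren are made up for this example, the document is parsed from an in-memory string instead of a file, and the parser is the default non-validating one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of any library.
class DomChildDump {
    // The same document as above, kept in a string so the sketch is self-contained.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
        + "<song genre=\"rock\">\n"
        + "<name>My December</name>\n"
        + "<singer>Linkin Park</singer>\n"
        + "</song>\n";

    // Returns one label per child of the root element:
    // "E:<tag>" for an element node, "T" for any other node (here: text nodes).
    static String[] describeChildren(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        String[] labels = new String[children.getLength()];
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            labels[i] = n.getNodeType() == Node.ELEMENT_NODE ? "E:" + n.getNodeName() : "T";
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(String.join(" ", describeChildren(XML)));
    }
}
```

Run on the document above, this prints T E:name T E:singer T, which matches the five children in the tree.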

As depicted above, the root node song has five child nodes, two of which have child nodes of their own. I want to emphasize the text nodes here. Before I started digging into XML processing recently, I didn't even know that so-called text nodes existed, because in Delphi they're simply ignored. So there the DOM tree looks like this:
E:song---E:name
|--E:singer

In this model only an element counts as a node. I think this is quite intuitive, though the official DOM structure is definitely more complete in theory. But the whitespace and other text nodes complicate XML parsing. An example is worth a thousand words, so let's see how the same simple XML document is parsed differently in Java and Delphi:

In Java (exceptions are left unhandled):
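// <xml file name> is a placeholder for the path to the XML file
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(<xml file name>));
Element root = doc.getDocumentElement();
NodeList list = root.getChildNodes();
// printStr is a simple helper method that prints a string.
// Items 0, 2 and 4 are whitespace text nodes, so the elements sit at 1 and 3.
printStr("name: " + list.item(1).getFirstChild().getNodeValue());
printStr("singer: " + list.item(3).getFirstChild().getNodeValue());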

In Delphi:
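var
  XMLDoc: IXMLDocument;
  XMLNode, CtlNode: IXMLNode;
  i, index: integer;
  str: string;
begin
  str := '';
  XMLDoc := TXMLDocument.Create(nil);
  // '<xml file name>' is a placeholder for the path to the XML file
  XMLDoc.LoadFromFile('<xml file name>');
  XMLNode := XMLDoc.ChildNodes.Nodes['song'];
  for i := 0 to XMLNode.ChildNodes.Count - 1 do
  begin
    // NodeValue of a text-only element is its text
    str := str + XMLNode.ChildNodes.Nodes[i].NodeValue;
  end;
end;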

Clearly, the Java version is more awkward, and it only gets more complicated as the XML document grows. This is because the element nodes can't be accessed sequentially: the whitespace text nodes sit between them. In contrast, with text nodes ignored, the Delphi version is quite clear and adapts to documents of any size. As far as I know, Java is not alone here: many DOM implementations (at least JavaScript's) are aware of text nodes, including the whitespace-only ones.

So developers use various kinds of helper methods to improve this awkward situation.
Method 1:
private Node getNodeByName(final NodeList list, final String name) {
    for (int i = 0; i < list.getLength(); i++) {
        final Node node = list.item(i);
        // skips the whitespace text nodes, whose node name is "#text"
        if (name.equals(node.getNodeName())) {
            return node;
        }
    }
    return null; // not found
}

Method 2:

...
NodeList list = e.getChildNodes();
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    // skip the whitespace text nodes (and any other non-element node)
    if (!(n instanceof Element)) { continue; }
    nsFixup((Element) n, map, false);
}
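
The same instanceof filter can also be wrapped once so that the element children can be walked by index, much like the Delphi loop. The following is only a sketch: the names ElementChildren, childElements and parseRoot are made up for this example, and the document is again parsed from a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of any library.
class ElementChildren {
    // Returns only the element children of parent, in document order.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList list = parent.getChildNodes();
        for (int i = 0; i < list.getLength(); i++) {
            Node n = list.item(i);
            if (n instanceof Element) { // text nodes are left out here
                result.add((Element) n);
            }
        }
        return result;
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot(
            "<song genre=\"rock\">\n<name>My December</name>\n<singer>Linkin Park</singer>\n</song>");
        // Same shape as the Delphi loop: concatenate the text of each child element.
        StringBuilder str = new StringBuilder();
        for (Element e : childElements(root)) {
            str.append(e.getTextContent());
        }
        System.out.println(str);
    }
}
```

With this helper, childElements(root).get(0) is the name element and get(1) is the singer element, no matter how the document is indented.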

And I believe there must be more.
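
One more option, sketched here, is to skip the child list altogether and look elements up by tag name with the standard getElementsByTagName and getTextContent methods. The class TagLookup and its methods textOf and parseRoot are made up for this example. Note that getElementsByTagName searches all descendants, not only direct children, so this only behaves like the helpers above when the tag name is unique in the subtree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of any library.
class TagLookup {
    // Returns the text of the first descendant element named tag, or null if there is none.
    static String textOf(Element root, String tag) {
        NodeList matches = root.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot(
            "<song genre=\"rock\">\n<name>My December</name>\n<singer>Linkin Park</singer>\n</song>");
        System.out.println("name: " + textOf(root, "name"));
        System.out.println("singer: " + textOf(root, "singer"));
    }
}
```

The whitespace text nodes are still in the tree here; the lookup simply never touches them.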

So far I really don't see any benefit in keeping the text nodes visible. But if you know of one, please tell me.

