A little bit more...

Monday, November 20, 2006

Basics of JavaScript

Note:
My recent posts covering the basics or an overview of a topic
mostly quote selected material from the sources listed in the Resources section
of each post. They serve only for personal study and learning. If you like, you
are welcome to reuse any part or all of them. It would be my pleasure.

Overview

JavaScript is a scripting language most commonly embedded in HTML pages. The
official specification calls the language ECMAScript.

Built-in Features

Datatypes and Values

All numbers in JavaScript are represented as 64-bit floating-point values
(i.e., similar to double in Java and C++).

Conversion between strings and numbers can be done in several ways in both
directions. Numbers are automatically converted to strings when needed, and
strings are likewise converted to numbers.

Numbers to strings:

var n = 100;
var s = n + " bottles of beer.";

var n_as_string = n + "";

var string_value = String(n);

string_value = n.toString();

Strings to numbers:

var product = "21" * "2"; // product is the number 42

var number = string_value - 0;
// Note: adding zero to a string value results in string concatenation.

var number = Number(string_value);

// parseInt() and parseFloat() also convert strings to numbers.
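The conversions above can be tried side by side. This is only a small sketch; the variable names are made up for illustration and are not from the cited sources.

```javascript
// Number -> string
var n = 100;
var viaConcat = n + "";          // "100"
var viaString = String(n);       // "100"
var viaMethod = n.toString();    // "100"
var inBinary  = n.toString(2);   // "1100100" (toString accepts a radix)

// String -> number
var viaMinus  = "42" - 0;        // 42: subtraction forces a numeric conversion
var viaPlus   = "42" + 0;        // "420": + with a string concatenates
var viaNumber = Number("42");    // 42
var strict    = Number("42px");  // NaN: Number() needs the whole string to be numeric
var lenient   = parseInt("42px", 10);    // 42: parsing stops at the first non-digit
var fraction  = parseFloat("3.14abc");   // 3.14
```

Passing the radix to parseInt() explicitly avoids surprises with strings that start with "0" or "0x".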

In JavaScript, functions are values that can be manipulated
by JavaScript code. It means that functions can be stored in variables, arrays,
and objects, and it means that functions can be passed as arguments to other
functions.

Functions can be defined in three ways:

function square(x) { return x*x; }

var square = function(x) { return x*x; };
// The function name is optional here.

var square = new Function("x", "return x*x");
// Awkward, less useful, and less efficient.
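A short sketch of what "functions are values" means in practice. The helper applyTwice and the other names are hypothetical, chosen only for this example.

```javascript
function square(x) { return x * x; }

// A function stored in a variable, in an array, and in an object property.
var f = square;
var ops = [square, function (x) { return x + 1; }];
var math = { sq: square };

// A function passed as an argument to another function.
function applyTwice(fn, x) { return fn(fn(x)); }

var a = f(3);                    // 9
var b = ops[1](3);               // 4
var c = math.sq(4);              // 16
var d = applyTwice(square, 3);   // 81
```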

An object is a collection of named values. These named values are usually
referred to as properties of the object. Properties of objects are, in
many ways, just like JavaScript variables; they can contain any type of data,
including arrays, functions, and other objects. Objects in JavaScript can serve
as associative arrays (recall the same concept in Delphi/Pascal, if you
know that language); that is, they can associate arbitrary data values with
arbitrary strings.

image.width
image.height

image["width"]
image["height"]
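The two notations above reach the same property; the bracket form is what makes an object usable as an associative array. A minimal sketch with made-up values:

```javascript
var image = { width: 640, height: 480 };

// Dot and bracket notation read the same property.
var sameValue = image.width === image["width"];   // true

// Bracket notation accepts a key computed at run time.
var key = "hei" + "ght";
var h = image[key];                                // 480

// Any string can serve as a key, including ones that are not valid identifiers.
var stock = {};
stock["bottles of beer"] = 99;

// for...in enumerates the property names.
var keys = [];
for (var k in image) { keys.push(k); }
```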

Arrays may contain any type of JavaScript data, including references to other
arrays or to objects or functions. Also note that JavaScript does not support
multidimensional arrays, except as arrays of arrays. Finally, because JavaScript
is an untyped language, the elements of an array do not all need to be of the
same type, as they do in typed languages like Java.

A corresponding object class is defined for each of the three key
primitive datatypes. That is, besides supporting the number, string,
and boolean datatypes, JavaScript also supports Number, String, and Boolean
classes. JavaScript can flexibly convert values from one type to another. When
you use a string in an object context (i.e., when you try to access a property or
method of the string), JavaScript internally creates a String wrapper
object for the string value. Note that the String object created when
you use a string in an object context is a transient one.
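The transient wrapper can be observed indirectly. This is a sketch, not a description of how any particular engine implements it; the property name note is hypothetical.

```javascript
var s = "hello";
var len = s.length;            // 5: the length is read through a temporary String object
var upper = s.toUpperCase();   // "HELLO"

// A property set on a primitive string does not survive, because the wrapper
// is discarded. In non-strict code the assignment is silently ignored; in
// strict code it throws a TypeError, hence the try/catch.
try {
  s.note = "lost";
} catch (e) { /* strict mode */ }
var note = s.note;             // undefined either way

// An explicitly created String object is a real, lasting object.
var wrapped = new String("hello");
wrapped.note = "kept";

var primitiveType = typeof s;        // "string"
var wrapperType = typeof wrapped;    // "object"
```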

Primitive types are manipulated by value, and reference types, as the
name suggests, are manipulated by reference. Numbers and booleans are
primitive types: small, fixed-size values that are
easily manipulated at the low levels of the JavaScript interpreter. Objects, on
the other hand, are reference types. Arrays and functions, which are specialized
types of objects, are therefore also reference types.

Since strings (primitive type, not the wrapper) are immutable in JavaScript,
there is no way to tell whether strings are passed by value or by reference.
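The difference between the two kinds of types shows up when values are copied, compared, or passed to functions. A minimal sketch with hypothetical names:

```javascript
// Primitives are copied: changing the copy leaves the original alone.
var a = 1;
var b = a;
b = 2;                                   // a is still 1

// Objects, arrays and functions are shared through references.
var first = [1, 2, 3];
var second = first;
second.push(4);                          // first now has four elements too

// Comparison follows the same rule.
var sameContents = [1, 2] === [1, 2];    // false: two distinct arrays
var sameRef = first === second;          // true: one array, two references

// A function can change the object it receives, but rebinding its
// parameter does not affect the caller's variable.
function mutate(arr) { arr[0] = 99; }
function reassign(arr) { arr = []; }
mutate(first);                           // first[0] is now 99
reassign(first);                         // first is unchanged
```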

Variables

There's no fundamental difference in JavaScript between variables and
the properties of objects.

When the JavaScript interpreter starts up, one of the first things it
does, before executing any JavaScript code, is create a global
object. The properties of this object are the
global variables of JavaScript programs. When you declare a global JavaScript
variable, what you are actually doing is defining a property of the global
object.

The JavaScript interpreter initializes the global object with a number of
properties that refer to predefined values and functions. For example, the
Infinity, parseInt, and Math properties refer to the
number infinity, the predefined parseInt( ) function, and the
predefined Math object, respectively.

In top-level code (i.e., JavaScript code that is not part of a function), you
can use the JavaScript keyword this to refer to the global object.

In client-side JavaScript, the Window object serves as the global object
for all JavaScript code contained in the
browser window it represents. This global Window object has a self-referential
window property that can be used instead of this to refer to
the global object. The Window object defines the core global properties, such as
parseInt and Math, and also global client-side properties,
such as navigator and screen.
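The predefined globals really are properties of the global object, which a few comparisons can confirm. The sketch assumes an environment that provides either window (a browser) or globalThis (a later addition to the language than the material quoted here); answerToEverything is a made-up property name.

```javascript
// Pick the global object: window in a browser, globalThis elsewhere.
var globalObject = typeof window !== "undefined" ? window : globalThis;

var sameParseInt = globalObject.parseInt === parseInt;   // true
var sameMath     = globalObject.Math === Math;           // true
var sameInfinity = globalObject.Infinity === Infinity;   // true

// A property created on the global object can be read as a global variable.
globalObject.answerToEverything = 42;
var readBack = answerToEverything;                        // 42
```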

For local variables, while the body of a function is executing, the function
arguments and local variables are stored as properties of another special
object. This object is known as the call object.

Each time the JavaScript interpreter begins to execute a function, it creates
a new execution context for that function. JavaScript code that is not part of
any function, by contrast, runs in an execution context
that uses the global object for variable definitions. A JavaScript
implementation may allow multiple "global" execution contexts. The
obvious example is client-side JavaScript, in which each separate browser
window, or each frame within a window, defines a separate global execution
context.

Object Support

ECMAScript does not contain proper classes such as those in C++, Smalltalk,
or Java. An ECMAScript object is an unordered collection of properties each with
zero or more attributes.

It turns out that every JavaScript object includes an internal
reference to another object, known as its prototype object. All
functions have a prototype property that is automatically created and
initialized when the function is defined. The initial value of the
prototype property is an object with a single property. This property
is named constructor and refers back to the constructor function with
which the prototype is associated.

Property inheritance occurs only when you read property values, not
when you write them. If you set the property p in an
object o that inherits that property from its prototype, what
happens is that you create a new property p directly in
o. Now that o has its own property named
p, it no longer inherits the value of p from
its prototype.
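The read/write asymmetry can be seen with a small constructor. Circle and its unit property are hypothetical names used only for this sketch.

```javascript
function Circle(r) { this.r = r; }
Circle.prototype.unit = "cm";           // shared by all circles through the prototype

var c1 = new Circle(1);
var c2 = new Circle(2);

// The automatically created prototype object points back at its constructor.
var backRef = Circle.prototype.constructor === Circle;   // true

// Reading falls through to the prototype ...
var inherited = c1.unit;                // "cm"

// ... but writing creates an own property on c1 and leaves the prototype alone.
c1.unit = "mm";
var own = c1.unit;                      // "mm"
var other = c2.unit;                    // "cm"
var shared = Circle.prototype.unit;     // "cm"
var c1HasOwn = c1.hasOwnProperty("unit");   // true
var c2HasOwn = c2.hasOwnProperty("unit");   // false

// Deleting the own property makes the inherited value visible again.
delete c1.unit;
var afterDelete = c1.unit;              // "cm"
```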

Navigator Object

The JavaScript navigator object is the object representation of the client
internet browser or web navigator program that is being used. In client-side
JavaScript it is reached as a property of the global Window object.

DOM Object

Overview

The goal of the DOM group is to define a programmatic interface for XML and
HTML. It is a platform- and language-neutral interface. The DOM is separated
into three parts: Core, HTML, and XML. The Core DOM provides a low-level set of
objects that can represent any structured document.

DOM is being designed at several levels:

  • "Level 1. This concentrates on the actual core, HTML, and XML document
    models. It contains functionality for document navigation and manipulation.

  • Level 2. Includes a style sheet object model, and defines functionality for
    manipulating the style information attached to a document. It also enables
    traversals on the document, defines an event model and provides support for XML
    namespaces.

  • Level 3. Will address document loading and saving, as well as content models
    (such as DTDs and schemas) with document validation support. In addition, it
    will also address document views and formatting, key events and event groups.
    First public working drafts are available.

  • Further Levels. These may specify some interface with the possibly
    underlying window system, including some ways to prompt the user. They may also
    contain a query language interface, and address multithreading and
    synchronization, security, and repository."

Resources

  1. ECMAScript Language Specification 3rd edition

  2. Ajax in Action

  3. The CTDP JavaScript Manual Version 0.6.0, December 31, 2000

  4. W3C Document Object Model (DOM)

  5. DOM objects and methods

  6. JavaScript - The Definitive Guide, 5th Edition


This is a rough draft and published temporarily.

Wednesday, November 15, 2006

Study XSLT Tutorial

Overview

XSL = Extensible Stylesheet Language (the style sheet language for XML)

XSL consists of three parts:

  • XSLT - a language for transforming XML documents

  • XPath - a language for navigating in XML documents

  • XSL-FO - a language for formatting XML documents


The root element that declares the document to be an XSL style sheet is <xsl:stylesheet> or <xsl:transform>.

Note: <xsl:stylesheet> and <xsl:transform> are completely synonymous and either can be used!

More Color On The Overview
An XSLT style sheet consists of a set of template rules, each of which takes the form "if this condition is encountered in the input, then generate the following output." The order of the rules is immaterial, and there is a conflict-resolution algorithm applied when several rules match the same input. One respect in which XSLT differs from serial text processing languages, however, is that the input is not processed sequentially line by line. Rather, the input XML document is treated as a tree structure, and each template rule is applied to a node in the tree. The template rule itself can decide which nodes to process next, so the input is not necessarily scanned in its original document order. [via]

Use XSL To Transform an XML Document

First declare the XSL document, then define its templates:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h2>My CD Collection</h2>
<table border="1">
<tr bgcolor="#9acd32">
<th>Title</th>
<th>Artist</th>
</tr>
<xsl:for-each select="catalog/cd">
<tr>
<td><xsl:value-of select="title"/></td>
<td><xsl:value-of select="artist"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

Then reference the stylesheet from your XML source document, like this: <?xml-stylesheet type="text/xsl" href="cdcat.xsl"?>.

The match attribute associates a template with an XML element. Its value is an XPath expression; match="/" associates the template with the root of the XML source document, so the template applies to the whole document.

In the element <xsl:for-each select="catalog/cd">, "catalog/cd" matches the structure of the XML document. The match is case-sensitive (after all, an XSL stylesheet is itself an XML document), and the value of the select attribute, which is a little like "select" in SQL, is an XPath expression.

We can also filter the output from the XML file by adding a criterion to the select attribute of the <xsl:for-each> element.

<xsl:for-each select="catalog/cd[artist='Bob Dylan']">

Legal filter operators are:

  • = (equal)

  • != (not equal)

  • &lt; (less than; a literal < is not allowed inside an XML attribute value)

  • > (greater than; often written &gt;)

Note: 'Bob Dylan' must match exactly what appears between <artist> and </artist>, including white space and line breaks.

We can use an <xsl:sort> element inside the <xsl:for-each> to sort the output.

To add an if statement use the syntax below:
<xsl:if test="expression">
...
...some output if the expression is true...
...
</xsl:if>
For example:
<xsl:for-each select="catalog/cd">
<xsl:if test="price > 10">
<tr>
<td><xsl:value-of select="title"/></td>
<td><xsl:value-of select="artist"/></td>
</tr>
</xsl:if>
</xsl:for-each>

The <xsl:choose> and <xsl:when> elements provide further conditional tests for filtering the output.
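For readers more at home in JavaScript, the select predicate, <xsl:sort> and <xsl:if> steps above behave roughly like filtering and sorting an array. This is only an analogy, not XSLT; the catalog object is a hand-written stand-in that borrows two records from the CD catalog used elsewhere in these notes.

```javascript
var catalog = {
  cd: [
    { title: "Hide your heart", artist: "Bonnie Tyler", price: 9.90 },
    { title: "Empire Burlesque", artist: "Bob Dylan", price: 10.90 }
  ]
};

// select="catalog/cd[artist='Bob Dylan']": keep only the matching cd elements.
var dylan = catalog.cd.filter(function (cd) { return cd.artist === "Bob Dylan"; });

// The comparison is exact, so stray white space defeats the match.
var padded = catalog.cd.filter(function (cd) { return cd.artist === " Bob Dylan"; });

// <xsl:sort select="artist"/>: process the elements in sorted order.
var byArtist = catalog.cd.slice().sort(function (x, y) {
  return x.artist < y.artist ? -1 : (x.artist > y.artist ? 1 : 0);
});

// <xsl:if test="price > 10">: emit a row only when the test holds.
var rows = [];
catalog.cd.forEach(function (cd) {
  if (cd.price > 10) {
    rows.push("<tr><td>" + cd.title + "</td><td>" + cd.artist + "</td></tr>");
  }
});
```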

Without its select attribute specified, <xsl:apply-templates> is used to apply any relevant template(s) to the matched node(s)'s children. While using this element's select attribute, you can be pickier about exactly which children of a node should be processed and in what order.

Referring to the XSL file directly in an XML document requires an XSLT-aware browser. There are alternatives for the transformation, though. First, we can use JavaScript on the client side to invoke a standalone XML parser, such as the MS XML Parser, to do the transformation. Second, we can use a server-side scripting language (e.g., ASP, JSP, Python) to do the transformation, which meets cross-browser needs.

The xsl:attribute element can be used to add attributes to result elements whether created by literal result elements in the stylesheet or by instructions such as xsl:element.

Little Tricks

1. Use <...select="@width"> to identify the attribute of an element, in which case width is the attribute name. The XPath expression ../@title selects the title attribute of the element that is the parent of the current node.

2. Use curly braces ({}) around an expression to specify an attribute value template, e.g., <h1><a href="{../link}"><xsl:apply-templates/></a></h1> (here ".." selects the parent of the current node). See the following example for more details:
The following example creates an img result element from a photograph element in the source; the value of the src attribute of the img element is computed from the value of the image-dir variable and the string-value of the href child of the photograph element; the value of the width attribute of the img element is computed from the value of the width attribute of the size child of the photograph element:

<xsl:variable name="image-dir">/images</xsl:variable>

<xsl:template match="photograph">
<img src="{$image-dir}/{href}" width="{size/@width}"/>
</xsl:template>


With this source

<photograph>
<href>headquarters.jpg</href>
<size width="300"/>
</photograph>


the result would be

<img src="/images/headquarters.jpg" width="300"/>

3. The order in which template rules appear in the stylesheet means nothing to the XSLT processor.

4. The XSLT processor uses the most specific template it can find to process each node of the source tree. So a template such as <xsl:template match="*|@*|text()"> may never be used for a node that another template matches more specifically, since it simply matches any element, attribute, or text node. Likewise, when <xsl:template match="channel/title"> exists, <xsl:template match="title"> is not applied to title elements that are children of channel.

Conclusion

Conceptually (and in practice this is close to what happens most of the time), you can think of an XSLT transformation like this: the input XML source document is parsed into a source tree (much like a DOM tree), the style sheet is parsed into a tree as well, and the XSLT processor then builds the result tree from the source tree according to the stylesheet (mostly its template rules). Figure 1 illustrates the process.

Figure 1. Operation of an XSLT Processor
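The tree-walking model can be mimicked in a few lines of JavaScript. This is a toy model under heavy assumptions: rules are keyed by element name only, there are no priorities or XPath patterns, and the helpers el, text, rules and applyTemplates are all invented for the sketch. Real XSLT processors are far richer.

```javascript
// Tiny stand-ins for source tree nodes.
function el(name, children) { return { name: name, children: children || [] }; }
function text(value) { return { text: value }; }

// Hypothetical rule table: element name -> function that builds the output.
var rules = {
  catalog: function (node) { return "<table>" + applyTemplates(node) + "</table>"; },
  cd:      function (node) { return "<tr>" + applyTemplates(node) + "</tr>"; },
  title:   function (node) { return "<td>" + applyTemplates(node) + "</td>"; }
};

// Process the children of a node. Text is copied through; an element without
// a rule of its own simply has its children processed, which resembles the
// built-in template rules of XSLT.
function applyTemplates(node) {
  return node.children.map(function (child) {
    if (child.text !== undefined) { return child.text; }
    var rule = rules[child.name];
    return rule ? rule(child) : applyTemplates(child);
  }).join("");
}

var source = el("#document", [
  el("catalog", [
    el("cd", [
      el("title", [text("Empire Burlesque")]),
      el("artist", [text("Bob Dylan")])
    ])
  ])
]);

var result = applyTemplates(source);
// "<table><tr><td>Empire Burlesque</td>Bob Dylan</tr></table>"
```

Note that the order of the entries in rules plays no role: each node picks its rule by matching, just as the order of template rules is immaterial in a stylesheet.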

Resources:

1. XSLT Tutorial

2. XSL Transformations (XSLT) Version 1.0

3. Working with XML: XSLT 2.0 and XQuery compared (使用XML: XSLT 2.0和XQuery对比)

4. What kind of language is XSLT?

5. Book: XSLT Quickly

6. Saxon: Anatomy of an XSLT processor

Tuesday, November 14, 2006

A Little Trick: XML Data Embedded in HTML

You can embed XML containing the data you want to display in an HTML document. The line of code that does the embedding looks like this:
<xml id="cdcat" src="cd_catalog.xml"></xml>

But there's a little trick. It requires that the xml source document's name reflect the structure of the xml document. For example, below is a fragment of the source document:
<?xml ...?>
<CATALOG>
<CD>
<...


The XML document containing this fragment should be referred to as "cd_catalog.xml" in the embedding HTML document. The whole example follows.

The XML document containing the data:
<?xml version="1.0" encoding="ISO-8859-1"?>
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>

<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>

<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD> ... </CATALOG>

The HTML document embedding the XML data:
<html>
<body>

<xml id="cdcat" src="cd_catalog.xml"></xml>

<table border="3" datasrc="#cdcat">

<tr>
<td><span datafld="ARTIST"></span></td>
<td><span datafld="TITLE"></span></td>
</tr>

</table>

</body>
</html>

Click this link to see the live example. As the tutorial mentions, it seems to work only in IE 5.0 or later, and not in Firefox.

Resources:

XML Data Island

Monday, November 13, 2006

The Java SE 6 Platform Quiz

The following quiz answers are quoted from The Java SE 6 Platform Quiz:

1. What scripting language can you use in the Java SE 6 platform?

Answer (E): The Mozilla Rhino engine implements the JavaScript technology
scripting language and is available in the core Java Runtime Environment (JRE).
However, the scripting API allows you to use any scripting engine that conforms
with JSR 223.

2. What is the normalization of Unicode text?

Answer (C): The Java SE 6 platform provides the public java.text.Normalizer
class, which allows you to convert text data to common composed or decomposed
forms, allowing for accurate comparisons and searches on text. Before the Java
SE 6 platform release, the Normalizer class had been hidden in the
Java platform. The class is now a public API.

3. How do you launch your host’s default browser to view a specific
URL?

Answer (B): The Desktop API allows your program to launch applications
associated with certain file types on the host platform. The current
implementation can launch a web browser, text editor, and email application.

4. How can I sort JTable content?

Answer (D): A javax.swing.table.TableRowSorter wraps your existing
TableModel. You can configure it to filter or sort your
JTable contents.

5. What is the correct annotation to use to export a method as a web
service operation using Java API for XML Web Services (JAX-WS), version
2.0?

Answer (B): The @WebMethod annotation is used to
mark a method that is exposed as a web service operation. Note that the
@WebService annotation is used to specify that the class is a web
service or that the interface defines a web service. The programmer will likely
use the @WebService annotation in conjunction with the
@WebMethod annotation. See the article “Introducing JAX-WS 2.0 With the
Java SE 6 Platform, Part 1” for more information.

6. In JDK 6, the JMX Monitor API now uses a thread pool to increase
performance. What is the purpose of the JMX Monitor API?

Answer (D): The JMX Monitor API allows an application to sample an attribute
property of an MBean periodically and send a notification event if it passes a
given threshold.
It now uses a thread pool instead of creating a thread for each monitor. Another
improvement is the ability to monitor a value within a complex type.

7. JDK 6 incorporates an advanced version of the
SwingWorker class into core Java technology. What is the purpose of
the SwingWorker class?

Answer (D): Since the 1998 publication of SwingWorker in the article
“Threads and Swing,” developers have continuously requested that it be moved into
core. At the 2004 JavaOne conference, the Desktop team presented a new version
of SwingWorker that included generification, use of the concurrency
package, and PropertyChangeListener support. Much of this
functionality assists with interthread communication. The Java SE 6 platform
release incorporates a similar version of SwingWorker
that greatly assists developers in processing GUI-driven functionality off the
event-dispatching thread, indicating status and progress and aggregating the
results.

8. What is the best Java platform to use with the upcoming release of
the Microsoft Windows Vista operating system?

Answer (A): The Java
SE 6 platform release works best with the latest user interface (UI)
enhancements of Windows Vista. According to a recent blog entry by Chet Haase: “The
primary delivery of Java for Vista is Java SE 6; that release has received most
of our focus during the Vista beta release timeframe.” Go to the JDK 6 Project site to download the most
recent version. The release is pretty close to final, so it is working very well
at this point. All of the serious Windows Vista problems have been fixed in this
release for months, so it is a particularly good test vehicle for Java
technology on Vista.

9. In the Java SE 6 platform, what key tuning option(s) are needed to
achieve high performance?

Answer (D): See the blog entry “No Tuning Required: Java SE Out-of-Box Vs.
Tuned Performance” for a comparison of
out-of-box and hand-tuned performance.

10. The Java SE 6 platform delivers a technology that can greatly
improve performance by reducing unnecessary synchronization overhead. It allows
a thread to lock and unlock an object with minimal use of atomic operations.
What is this technology called?

Answer (B): The technique called
store-free biased locking eliminates all synchronization-related atomic
operations on uncontended object monitors. The technique supports the bulk
transfer of object ownership from one thread to another, and the selective
disabling of the optimization where unprofitable, using epoch-based bulk
rebiasing and revocation. It has been implemented in the production version of
the Java HotSpot virtual machine (VM) and has yielded significant performance
improvements on a range of benchmarks and applications.

Three ways of validating an XML document with Java

With the rollout of Java 5.0 last year, JAXP 1.3 became available. One of the new features provided by JAXP 1.3 is a brand new Schema Validation Framework.

The new framework decouples the validation of an instance document from parsing, making it an independent process. The validation APIs live in the new package javax.xml.validation and let developers obtain, from a compiled schema, a Validator and/or a ValidatorHandler that validates XML against the given schema. Alternatively, a compiled schema instance can be passed to a reader/parser to validate XML as it is parsed. So the new Schema Validation Framework provides roughly two ways of validating. Besides these two, setting the uncompiled schema source on the reader/parser remains available for backward compatibility. As the first article and the accompanying example code listed in the Resources section show, the new validation framework improves performance, efficiency, and flexibility.

Below are simple code snippets that illustrate each of these three ways of validating XML documents.

1. Set uncompiled schema (since JAXP 1.2):
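import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

private static void saxParseJAXP1_2(String xmlFile, DefaultHandler dh,
        String schemaFile) {
    try {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setValidating(true);
        SAXParser sp = spf.newSAXParser();
        // The schema language must be set before the schema source.
        sp.setProperty(
                "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // A String value is interpreted as a URI pointing to the schema.
        sp.setProperty(
                "http://java.sun.com/xml/jaxp/properties/schemaSource",
                schemaFile);

        // Validation errors are reported to dh.error(); DefaultHandler
        // ignores them unless that method is overridden.
        sp.parse(new File(xmlFile), dh);
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}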

2. Set compiled schema instance (since JAXP 1.3, FIX ME HERE)
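import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

private static void saxParseSetSchemaJAXP1_3(String xmlFile, DefaultHandler dh,
        String schemaFile) {
    try {
        // Compile the schema once; the Schema object can be reused.
        SchemaFactory sf = SchemaFactory.newInstance(
                XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new File(schemaFile));
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        // setValidating(true) is not needed here; it controls DTD validation.
        spf.setSchema(schema);
        SAXParser sp = spf.newSAXParser();
        sp.parse(new File(xmlFile), dh);
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}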

3. Validator (since JAXP1.3)
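import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;

private static void saxParseValidateJAXP1_3(String xmlFile,
        ErrorHandler dh, String schemaFile) {
    try {
        SchemaFactory sf = SchemaFactory.newInstance(
                XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Validator validator = sf.newSchema(
                new File(schemaFile)).newValidator();

        // Validation problems are reported to the ErrorHandler.
        validator.setErrorHandler(dh);
        validator.validate(new StreamSource(new File(xmlFile)));
    } catch (Exception e) {
        e.printStackTrace();
    }
}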

It's noteworthy that the first and second ways apply to both DOM and SAX parsing, while the third way validates a Source directly. Besides the StreamSource used above, a Validator also accepts a SAXSource or a DOMSource.

Update (20061113):

Basics of using Schema

Be aware of the concepts of the XML target namespace and "source namespaces". The names defined in a schema are said to belong to its target namespace. Definitions and declarations in a schema can refer to names that belong to other namespaces; the fourth article in the Resources section refers to those as "source namespaces". Here is a little colour on simple and complex types. An element that contains neither attributes nor other elements can be defined to be of a simple type, predefined or user-defined, such as string, integer, decimal, or time. Elements with attributes or embedded elements must have a complex type. There are a great many details of XML Schema definition that are not covered here; they can be found in the Resources section.

Simple example

An XML instance document:
<?xml version = "1.0" encoding = "utf-8"?>
<SONGS xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:noNamespaceSchemaLocation='mySong.xsd'>
<SONG genre = "pop">
<TITLE > Hot Cop </TITLE>
<COMPOSER > Jacques Morali
</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
</SONGS>

The corresponding schema definition:
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
<xsd:element name="SONGS">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="SONG" minOccurs='1' maxOccurs='unbounded' />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="SONG">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string" />
<xsd:element name="COMPOSER" type="xsd:string" maxOccurs='unbounded' />
<xsd:element name="PRODUCER" type="xsd:string" maxOccurs='unbounded' />
<xsd:element name="PUBLISHER" type="xsd:string" maxOccurs='unbounded' />
<xsd:element name="LENGTH" type="xsd:string" />
<xsd:element name="YEAR" type="xsd:gYear" />
<xsd:element name="ARTIST" type="xsd:string" maxOccurs='unbounded' />
</xsd:sequence>
<xsd:attribute name="genre" type="xsd:string" />
</xsd:complexType>
</xsd:element>
</xsd:schema>

Resources:

1. Easy and Efficient XML Processing: Upgrade to JAXP 1.3

2. Java 2 Platform Standard Edition 5.0 API Specification

3. Java 2 Platform Standard Edition 1.4.2 API Specification

4. The basics of using XML Schema to define elements

5. XML Schema Part 0: Primer Second Edition

Saturday, November 11, 2006

Comment on W3C DOM and various implementations in different programming languages

First I have to confess that I'm quite unfamiliar with XML processing. I've only done it extensively once, in Delphi, for a project I was involved in.

These days I'm studying techniques and technologies for XML processing with Java, and I wrote an overview of it in a previous post. In particular, I mentioned the DOM way, which is based on the DOM (Document Object Model), a standard object model for XML maintained by the W3C. Here I first give some simple details about the DOM itself.

For a simple xml document shown below:
<?xml version="1.0" encoding="UTF-8" ?>
<song genre="rock">
<name>My December</name>
<singer>Linkin Park</singer>
</song>

The DOM tree structure should look like this (E indicates an element node and T indicates a text node):
E:song
|--T:characters(whitespace)
|--E:name---T:characters(My December)
|--T:characters(whitespace)
|--E:singer---T:characters(Linkin Park)
|--T:characters(whitespace)

As depicted above, the root node song has five child nodes, two of which have child nodes of their own. I want to emphasize the text nodes here. Before I started digging into XML processing these days, I didn't even know that so-called text nodes existed, because in Delphi they're simply ignored. There the DOM tree structure looks like this:
E:song---E:name
|--E:singer

Only an element is called a node. I think this is quite intuitive, though the official DOM structure is definitely more complete in theory. But white space and other text nodes complicate XML parsing. An example is worth a thousand words. Let's see how the simple XML document is parsed differently in Java and Delphi:

In Java (exceptions are left unhandled):
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("<xml file name>"));
Element root = doc.getDocumentElement();
NodeList list = root.getChildNodes();
// A simple helper method
printStr("name: " + list.item(1).getFirstChild().getNodeValue());
printStr("singer: " + list.item(3).getFirstChild().getNodeValue());

In Delphi:
var
XMLDoc: IXMLDocument;
XMLNode, CtlNode: IXMLNode;
i, index: integer;
str: string;
begin
str := '';
XMLDoc := TXMLDocument.Create(nil);
XMLNode := XMLDoc.ChildNodes.Nodes['song'];
for i := 0 to XMLNode.ChildNodes.Count - 1 do
begin
str := str + XMLNode.ChildNodes.Nodes[i].NodeValue;
end;
end;

Apparently, the Java version is more awkward, and it gets more complicated as the XML document grows longer. This is because the element nodes can't be accessed at consecutive indexes, due to the white space text nodes between them. In contrast, with text nodes ignored, the Delphi version is quite clear and adapts to a document of any size. As far as I know, many DOM implementations besides Java's (JavaScript's at least) are aware of text nodes, especially white space text nodes.

So developers use various kinds of helper methods to improve this awkward situation.
Method 1:
private Node getNodeByName(final NodeList list, final String name) {
    for (int i = 0; i < list.getLength(); i++) {
        final Node node = list.item(i);
        // skips white space text nodes, whose node name is "#text"
        if (name.equals(node.getNodeName())) {
            return node;
        }
    }
    return null; // not found
}

Method 2:

...
NodeList list = e.getChildNodes();
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    // skips white space text nodes (and any other non-element node)
    if (!(n instanceof Element)) { continue; }
    nsFixup((Element) n, map, false);
}
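The same kind of helper can be sketched in JavaScript, whose DOM exposes white space text nodes as well. To keep the sketch runnable outside a browser, the nodes below are plain objects that model only nodeType, nodeName, nodeValue and childNodes; the helper names element, textNode and elementChildren are made up for this example.

```javascript
// Node type constants with the values defined by the DOM.
var ELEMENT_NODE = 1, TEXT_NODE = 3;

// Minimal stand-ins for DOM nodes (an assumption of this sketch).
function element(name, childNodes) {
  return { nodeType: ELEMENT_NODE, nodeName: name, nodeValue: null, childNodes: childNodes || [] };
}
function textNode(value) {
  return { nodeType: TEXT_NODE, nodeName: "#text", nodeValue: value, childNodes: [] };
}

// The song document from above, including its white space text nodes.
var song = element("song", [
  textNode("\n  "),
  element("name", [textNode("My December")]),
  textNode("\n  "),
  element("singer", [textNode("Linkin Park")]),
  textNode("\n")
]);

// Keep only the element children, skipping text nodes.
function elementChildren(node) {
  return Array.prototype.filter.call(node.childNodes, function (n) {
    return n.nodeType === ELEMENT_NODE;
  });
}

var kids = elementChildren(song);                  // 2 of the 5 child nodes
var songName = kids[0].childNodes[0].nodeValue;    // "My December"
var singer = kids[1].childNodes[0].nodeValue;      // "Linkin Park"
```

Array.prototype.filter.call is used because a real childNodes collection is a NodeList rather than an array. Browsers also offer a children property on elements that already leaves out text nodes.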

And I believe there must be more.

So far I really don't see any benefit in keeping text nodes visible. But if you know of one, please tell me.

Friday, November 10, 2006

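Notes on Unicode, UTF and Other Character Encodings

Two encoding standards that follow the same specification
Unicode 3.0 (the latest version is 5.0) and ISO 10646. Starting with Unicode 2.0, Unicode has used the same character repertoire and code points as ISO 10646-1. ISO 10646 is also known as UCS (Universal Character Set).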

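A few terms:
UTF: Unicode/UCS Transformation Format

UTF-16: a 16-bit encoding. It is basically the two-byte encoding of Unicode, with additional space for rarely used characters and future expansion (seldom needed in practice). Common characters lie in the range 0-0xFFFF; including the extension space the range is 0-0x10FFFF, so a code point needs at most 21 bits. ISO 10646 has a corresponding definition of the extension space. Because it is a variable-length code built from 16-bit units, it depends on CPU byte order. For example, the Unicode code of the character "汉" is 6C49. When it is written to a file, does 6C come first, or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian. It is space-efficient (two bytes for most Chinese characters), so it is often used as an external encoding for network transmission. UTF-16 is described as the preferred encoding of Unicode.

UTF-8: Because UTF-16 is the Unicode code written out directly, without transformation, its byte stream contains 0x00 bytes. These have a special meaning in operating systems (in C a zero byte terminates a string) and make UTF-16 incompatible with ASCII, which causes problems. So UTF-8 is sometimes needed to transform the direct Unicode encoding. UTF-8 leaves ASCII unchanged as single 8-bit bytes and gives other characters a variable-length encoding: 1-3 bytes per character in the range 0-0xFFFF, and 4 bytes for characters beyond it. It does not depend on CPU byte order, so text can be exchanged between platforms.

UCS-2: essentially the same as UTF-16, but without the extension mechanism, so it covers only 0-0xFFFF.

UCS-4: a 4-byte encoding; for now it amounts to UCS-2 with two all-zero bytes prepended.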

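Internal code: the character encoding used inside the operating system. In early operating systems the internal code depended on the language. Today Windows uses Unicode internally throughout and adapts to individual languages through code pages, so the notion of "internal code" has become rather fuzzy. Microsoft generally calls the encoding specified by the default code page the internal code, but on particular occasions it also says that its internal code is Unicode, for example when dealing with the GB18030 issue.

Character set: a collection of characters; Unicode, for example, is a character set.

Character encoding: the rules for interpreting binary data as characters. An encoding can represent only a limited number of characters, and an encoding is usually designed to represent one character set. UTF-8 and UTF-16, for example, are two character encodings, and both can represent every character in the Unicode character set.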
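The byte sequences mentioned in these notes can be checked with a few lines of JavaScript. The sketch assumes an environment that provides TextEncoder (modern browsers and Node.js do); the variable names are made up for illustration.

```javascript
var han = "\u6C49";                       // the character 汉, U+6C49

// UTF-8: TextEncoder always produces UTF-8.
var utf8 = Array.from(new TextEncoder().encode(han));    // [0xE6, 0xB1, 0x89]

// UTF-16: a single 16-bit code unit, written out in either byte order.
var unit = han.charCodeAt(0);             // 0x6C49
var view = new DataView(new ArrayBuffer(2));
view.setUint16(0, unit, false);           // big endian: 6C first
var bigEndian = [view.getUint8(0), view.getUint8(1)];    // [0x6C, 0x49]
view.setUint16(0, unit, true);            // little endian: 49 first
var littleEndian = [view.getUint8(0), view.getUint8(1)]; // [0x49, 0x6C]

// A character beyond 0xFFFF needs two UTF-16 code units and four UTF-8 bytes.
var clef = String.fromCodePoint(0x1D11E); // musical symbol G clef
var utf16Units = clef.length;                                  // 2
var clefUtf8Bytes = new TextEncoder().encode(clef).length;     // 4
```

The UTF-8 bytes come out the same on every machine, while the UTF-16 bytes depend on the chosen byte order, which is the point made about CPU byte order above.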

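Chinese national standard encodings:

GB 13000: fully equivalent to ISO 10646-1/Unicode 2.1, and it will be revised in step with future changes to the ISO 10646/Unicode standard.

GBK: an extension of GB2312 that accommodates the unified Han characters of Unicode 2.1 that lie outside the GB2312 character set, and that also adds some characters not in Unicode.

GB 18030-2000: based on GB 13000, an extended version of GBK for Unicode 3.0 that covers all Unicode codes. Its status is equivalent to that of UTF-8 and UTF-16: it is a Unicode encoding form. It is a variable-length encoding that uses one, two, or four bytes per character. GB18030 is backward compatible with GB2312/GBK. GB 18030 is a mandatory standard for all computer systems in China other than handheld/embedded ones.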
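The one/two/four-byte lengths and the backward compatibility can be sketched in Java. This assumes the runtime ships the GB2312, GBK, and GB18030 charsets (a full JDK normally does, a trimmed-down runtime may not); the class name Gb18030Lengths and its helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class Gb18030Lengths {

    // Assumption: this charset is available in the running JDK.
    private static final Charset GB18030 = Charset.forName("GB18030");

    // Returns how many bytes GB18030 needs for a single code point.
    static int encodedLength(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(GB18030).length;
    }

    // True if the text has identical bytes in GB2312, GBK, and GB18030.
    static boolean sameInAllThree(String text) {
        byte[] gb2312 = text.getBytes(Charset.forName("GB2312"));
        byte[] gbk = text.getBytes(Charset.forName("GBK"));
        byte[] gb18030 = text.getBytes(GB18030);
        return Arrays.equals(gb2312, gbk) && Arrays.equals(gbk, gb18030);
    }

    public static void main(String[] args) {
        System.out.println(encodedLength('A'));     // 1 byte: ASCII
        System.out.println(encodedLength(0x6C49));  // 2 bytes: 汉, already in GB2312
        System.out.println(encodedLength(0x20000)); // 4 bytes: outside GBK
        System.out.println(sameInAllThree("\u6C49\u5B57")); // true for 汉字
    }
}
```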

Update (20061114): ISO 8859-1:

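ISO/IEC 8859-1, also called Latin-1 or "Western European", is the first 8-bit character set of ISO/IEC 8859 from the International Organization for Standardization. It is based on ASCII and adds 96 letters and symbols in the otherwise unused range 0xA0-0xFF, for languages written in Latin letters with diacritics.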

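Other notes:

UCS only specifies how characters are assigned codes; it does not specify how those codes are transmitted or stored. For example, the UCS code of "汉" is 6C49. I can transmit and store this code as four ASCII characters (the hex digits 6C49), or I can use UTF-8 and represent it with three consecutive bytes, E6 B1 89. What matters is that both sides of the communication agree. UTF-8, UTF-7, and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with ASCII, the 7-bit subset of ISO-8859-1 (Latin-1 bytes from 0x80 upward are encoded differently in UTF-8). UTF stands for "UCS Transformation Format".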
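How 6C49 becomes E6 B1 89 can be worked out by hand. The sketch below implements only the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx, which covers codes from 0x0800 to 0xFFFF, and compares the result with Java's built-in encoder. The class name Utf8ByHand and its methods are hypothetical, and surrogate codes are not treated specially.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {

    // Encodes a code in 0x0800..0xFFFF as three UTF-8 bytes:
    // 4 high bits, then 6 middle bits, then 6 low bits.
    static byte[] encodeThreeByte(int code) {
        if (code < 0x0800 || code > 0xFFFF) {
            throw new IllegalArgumentException("outside the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (code >> 12)),         // 1110xxxx
            (byte) (0x80 | ((code >> 6) & 0x3F)), // 10xxxxxx
            (byte) (0x80 | (code & 0x3F))         // 10xxxxxx
        };
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] byHand = encodeThreeByte(0x6C49);
        byte[] builtIn = "\u6C49".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(byHand));                     // E6 B1 89
        System.out.println(Arrays.equals(byHand, builtIn));  // true
    }
}
```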

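A code page is a character encoding for the script of one language. For example, the code page of GBK is CP936, the code page of BIG5 is CP950, and the code page of GB2312 is CP20936.

Windows has the notion of a default code page, that is, the encoding used by default to interpret characters. For example, suppose Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?

Should it interpret the bytes as Unicode, as GBK, as BIG5, or as ISO8859-1? Interpreted as GBK, they give the two characters "汉字". Interpreted under another encoding, they may match no character at all, or they may match the wrong characters. "Wrong" here means different from what the author of the text intended, and that is how garbled text arises.

The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in the Control Panel. The ANSI item in Notepad's Save As dialog simply saves with the encoding of the default code page.

The internal code of Windows is Unicode, and technically it can support several code pages at the same time. As long as a file states which encoding it uses and the user has installed the corresponding code page, Windows can display it correctly; an HTML file, for example, can specify its charset.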
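The Notepad example can be reproduced in Java by decoding the same four bytes under different charsets. This is a sketch that assumes the GBK charset is available in the running JDK; the class name DecodeWithCodePage and its helper are hypothetical.

```java
import java.nio.charset.Charset;

class DecodeWithCodePage {

    // Interprets the bytes under the named charset.
    // Throws UnsupportedCharsetException if the JDK lacks that charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xBA, (byte) 0xBA, (byte) 0xD7, (byte) 0xD6};

        // As GBK the bytes are the two characters 汉字 (U+6C49 U+5B57).
        System.out.println(decode(bytes, "GBK"));

        // As ISO-8859-1 every byte is one Latin-1 character (U+00BA U+00BA U+00D7 U+00D6):
        // not what the author meant.
        System.out.println(decode(bytes, "ISO-8859-1"));

        // As UTF-8 the bytes are malformed, so the decoder substitutes U+FFFD.
        System.out.println(decode(bytes, "UTF-8"));
    }
}
```

What the console shows also depends on the console's own encoding, which is the same problem one level up.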

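Trivia:

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). It led to six rebellions, in which one emperor lost his life and another lost his crown.

In Chinese, endian is usually translated as 字节序 (byte order), and big endian and little endian are called 大尾 ("big tail") and 小尾 ("little tail").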

Resources:
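1. Understanding character encodings and Unicode, ISO 10646, UCS, UTF8, UTF16, GBK, GB2312 (USENIX.CN, internationalization support; a fairly detailed introduction)

2. XML Without the Fluff (无废话XML)

3. A brief explanation of UCS, UTF, BMP, BOM, and related terms

4. Chinese encoding handling (1): encodings and character sets

5. ISO 8859-1

Note: not all of the material referenced here is official or authoritative, so errors are inevitable. It is for study only. I take no responsibility for errors in the text, and corrections are welcome.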

XML Processing With Java Overview

There are basically two ways to process XML with Java. One is the DOM way, which is tree based; the other is the SAX way, which is event driven and stream based. Each has its pros and cons, but the good news is that we can pick whichever one suits the situation at hand.

The DOM way
DOM, the Document Object Model, is a standard specification released by the W3C. It is a tree structure that represents an XML document, and it is what we usually parse an XML document into before manipulating it. Most programmers find it intuitive: with it we can easily get what we want from an XML document, such as element names, attributes, and element values. The price is that before any manipulation we have to read the entire XML document and parse it into a DOM object, with everything held in memory. This is inefficient, and sometimes impossible, for extremely large documents. Besides DOM, there are also some unofficial object models in use, such as JDOM, XOM, and DOM4J.
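A minimal sketch of the DOM way with the JAXP classes that ship with the JDK. The class name DomSketch, the parse helper, and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomSketch {

    // Reads the whole document and builds the in-memory tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<book id=\"42\"><title>Basics of XML</title></book>");
        Element root = doc.getDocumentElement();

        System.out.println(root.getTagName());       // element name: book
        System.out.println(root.getAttribute("id")); // attribute value: 42

        Element title = (Element) root.getElementsByTagName("title").item(0);
        System.out.println(title.getTextContent());  // element value: Basics of XML

        // Because the whole tree is in memory, it can be changed in place.
        root.setAttribute("id", "43");
        System.out.println(root.getAttribute("id")); // 43
    }
}
```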

The SAX way

SAX is a stream-like, event-based way: we can process a document while we are reading it. It is more flexible than the DOM way but also more complicated. It is flexible because the SAX stream can be redirected to another process or document. It is complicated because an event handler (usually a DefaultHandler or ContentHandler) must first be written and then registered with the parser (or reader, or something similar). There are other disadvantages as well. Because the document is processed as a stream, it is impossible to modify it or to move backward in the data stream. It is, however, possible to make some simple changes to the structure (not to the data itself) with an XSL transformation. In general, the SAX way is faster than the DOM way and needs much less memory.
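A minimal sketch of the SAX way: a handler is written first and then handed to the parser, which calls it back as it reads. The class names SaxSketch and ElementCounter and the sample document are hypothetical; the JAXP and org.xml.sax calls are the standard ones.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {

    // Event handler: the parser calls startElement once per opening tag.
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            count++;
        }
    }

    // Streams through the document and counts its elements without building a tree.
    static int countElements(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>DOM</title></book><book><title>SAX</title></book></library>";
        System.out.println(countElements(xml)); // 5: library, two book, two title
    }
}
```

The handler only ever sees the current event, which is why nothing earlier in the stream can be revisited or changed.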

What makes up “XML Processing”

So-called XML processing, sometimes called parsing, consists of several aspects or procedures:
Validation
Data Modification and Retrieval
Transformation
Data Query
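As one illustration of the last item, data query, here is a small sketch using the XPath API that ships with the JDK (javax.xml.xpath). The class name XPathQuerySketch, the query helper, and the sample document are made up for illustration; the post itself does not say which query technology it has in mind.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathQuerySketch {

    // Evaluates an XPath expression against an XML string and returns the result as text.
    static String query(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath expression or XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library>"
                + "<book id=\"1\"><title>DOM</title></book>"
                + "<book id=\"2\"><title>SAX</title></book>"
                + "</library>";
        // Title of the book whose id attribute is 2.
        System.out.println(query(xml, "/library/book[@id='2']/title")); // SAX
        // Number of book elements.
        System.out.println(query(xml, "count(/library/book)"));
    }
}
```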

Examples

Higher Level Application
The aspects mentioned above are only the basics of XML processing. Seen from a broader perspective, there are many higher-level applications of XML and XML processing.

Published provisionally; still needs further refinement.

About Me

I'm finishing my master's degree in Software Engineering, Computer Science. I believe in, and have been following, what Forrest Gump's mama said: you have to do the best with what God gave you.