Sunday, November 19, 2006

XML Sucks?

I searched the Internet and found something funny about XML. As we all know, we've discussed XML before, and we said it's the ultimate solution to the problems brought by HTML. Well, some people don't think so.

Generally speaking, though XML brings us tremendous benefits over HTML, it inevitably has some defects. They are summed up in the list below:
  • XmlIsTooComplex for what it does.
  • It's too hard for programs to parse and too verbose and unreadable for humans to write.
  • The benefits of "everyone is using XML, so we should too" are usually outweighed by the costs of time, training and mistakes involved in understanding it.
  • Because it's increasingly used for data interchange, it is promoted as a data storage model. XML is only a data encoding format.
  • or just comments wrapped around data. Too many comments and symbols.
  • , when they could just be comments instead.
  • Encourages non-relational data structures
    • ie. Data is not even in 1st normal form let alone 5th.
  • Poor OnceAndOnlyOnce syntax factoring
  • It's a poor copy of EssExpressions
  • It is ExtremelyInterstrangled.
  • Perhaps worst of all too many programmers don't understand the need for data description languages with broad support.
  • Transformations, even identity transforms, result in changes to format (whitespace, attribute ordering, attribute quoting, whitespace around attributes, newlines). These problems can make "diff"ing the XML source very difficult.
I picked several of them to look at in detail (arguments in favor of XML are in italics; opposing arguments are in normal type):

XML is too hard for programs to parse and too verbose and unwritable for humans.

It's not too hard for programs to parse - XML is a subset of SGML, which is well understood and well implemented, and because it's more rigorous than HTML it's easier to parse than HTML, which is a solved problem. It's not too hard for humans, by a long shot; a well-written DTD is a cakewalk to write in.

Tedious rather than hard. It takes more time and code to extract the information you want from XML than it does to have the information formatted in flat files. Parsing flat files is easier than processing DOM unless tools are provided.

Well, this is certainly true. You get an old argument of the virtues of (new thingy) over (old thingy). People thought HTML was silly in the light of Gopher, which was flat text, easier to write, edit and parse, and faster to transmit; over time they were shown to be incorrect (correction: over time they were shown different means serve different purposes). XML provides a mechanism for us to provide a parsable definition of document structure, which means that unlike CommaSeparatedValues or similar setups, the software doesn't have to know the document's structure ahead of time (given an XML parser; magic? Fact: XML is a document format; the use of DOM and IPC is the key to the success of XML, see SOAP). File space requirements matter less every day (tell that to a CPU designer, and he will laugh loud), and though not trivial, XPath and XSLT are important features over and above what CSV provides. For many applications it's overkill. So is sending readme.1st files in RTF.


The benefits of "everyone is using XML, so we should too" are usually outweighed by the costs of time, training and mistakes involved in understanding it.

What are those costs? Many people said this about HTML, but frankly it's just not that hard - commands go in angle brackets, slash means off, i for italic, hit save, you're done. Technical workers can handle that, and XML is no worse (if they need to write their own DTDs, that's worse, but give that job to qualified staff). Training: everything takes training.

Some things more than others.

Most things more than XML.


Because XML is increasingly used for data interchange, it is too easily promoted as a data storage model. XML is only a data encoding format.

It's not designed as a data storage model, although models can be built on top of it. Compared to older ASN.1 (correction: ASN.1 is really only a language for defining protocols; actually the protocol defined in ASN.1 can use XML as its data transfer format) or GIOP, such XML models suck. Inherent limitations make them unsalvageable. But many folks confuse storage and exchange. XML must be concrete enough for light-weight programs to parse; the same data may be described in many ways, and different XML representations are suitable for different tasks, in opposition to the OnceAndOnlyOnce goal. In contrast the relational model and SQL use a canonical representation not favoring a particular task. In particular, many to many relationships are problematic in XML. We have gone back to the sequential text file model at the expense of the kind of abstraction we gained when moving from COBOL to SQL. If you really want to process data sequentially, COBOL is a far better tool than XSLT applied to XML - but sensible people use SQL. XML should just be used for transport, and there should be a canonical representation (schema) of the relational model. A simple subset of SQL could be implemented to operate on this representation to allow programmers to extract data. Imagine how much simpler life would be if instead of writing XML parsers, and editing enormous, complex and verbose text files by hand, we had a simple SQL-style interface. In fact - I think I'll write one! (that will be easier than XPath and XQuery?)
-- Tim Glover (ed SkipSailors)

Why do people insist on complaining that XML doesn't do this or XML doesn't do that, when XML is just supposed to be a data storage and transport mechanism? And now this comparison to COBOL? COBOL?!? Oy, vay!

XML isn't a database language per se. It is a means of expressing data in a tree structure. If you need flat storage of your data for relational reasons and you don't feel like parsing out an XML file full of relational data items, then how about using something other than XML? Any data can be stored in an XML format, though; it's just a matter of designing the translation in and out of storage. XML reliably transports the stored data for you.

XML is a means of storing data in a tree structure and can express relationships. The XML community try to push it far too far. XML databases are a silly idea. XSLT is a silly idea. When you start embedding Java in XML a la Cocoon you know you've gone completely bonkers. I have another problem, XML has to be processed by a computer program eventually, be it xslt, java, whatever. Because XML is very concrete and highly non-canonical it introduces a very strong coupling between the actual representation chosen and the processing program, to which I object. You cannot change your XML DTD to optimize a particular task without having to rewrite all your existing programs. I don't think this has really hit home yet - but it will. It is going to cause BIG problems. SQL solves this problem by providing an abstract interface to the data. My comparison with COBOL was with XSLT, a programming language written in XML for XML, not with XML itself. They are very similar - XML elements correspond to sequential file record types. XML attributes correspond to COBOL data division templates (conceptually at any rate. COBOL is very concrete in its layout of data attributes). COBOL has the great advantage over XSLT that it provides a very clean separation of program from data. In XSLT these are hopelessly confused, which causes much of the difficulty in reading and understanding it. Thanks for engaging in this with me - I find it a useful and constructive discussion.

-- Tim


XML is not a good basis for developing data models. It is not a shortcoming of XML, rather a problem that engineers pick the wrong tool for the job so often. Don't use a screwdriver as a crowbar.

_____________________________


To sum up, though XML is not a new technique, its usage is still controversial. Because of its flexibility and power, it has been applied in many areas for data transport, storage and manipulation. It shows great potential to replace techniques such as SQL and flat files. However, when we try to apply XML as a universal solution, it shows great defects; after all, XML wasn't designed for those jobs. It's so powerful that people almost forget what it was for in the first place. So when we consider applying XML in our own work, we should think clearly about how well it fits. If there is a more mature, more convenient technique, don't use XML.

Pitfalls of SEO

SEO is fairly popular now. What's easier than having your site at the top of the two famous search engines? It's well known that search engines notice your keywords and add value to your site, and generally, if your keywords appear many times on your site, the site will be deemed more important and relevant to those keywords. Well, not in all cases. If you put keywords in all your alt attributes and use only keywords for all your titles, Google or Yahoo may treat your behavior as spamming, which will result in a poor page ranking. So when you optimize your pages for particular search engines, remember to read their guidelines first. Yahoo and Google both have guidelines for programmers. Know what behaviors they dislike, and take care to avoid those pitfalls.

Part of Guidelines given by Google:

Quality guidelines - basic principles

  • Make pages for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."
  • Avoid tricks intended to improve search engine rankings. A good rule of thumb is whether you'd feel comfortable explaining what you've done to a website that competes with you. Another useful test is to ask, "Does this help my users? Would I do this if search engines didn't exist?"
  • Don't participate in link schemes designed to increase your site's ranking or PageRank. In particular, avoid links to web spammers or "bad neighborhoods" on the web, as your own ranking may be affected adversely by those links.
  • Don't use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service. Google does not recommend the use of products such as WebPosition Gold™ that send automatic or programmatic queries to Google.

Quality guidelines - specific guidelines

  • Avoid hidden text or hidden links.
  • Don't employ cloaking or sneaky redirects.
  • Don't send automated queries to Google.
  • Don't load pages with irrelevant words.
  • Don't create multiple pages, subdomains, or domains with substantially duplicate content.
  • Don't create pages that install viruses, trojans, or other badware.
  • Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
  • If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

If a site doesn't meet our quality guidelines, it may be blocked from the index. If you determine that your site doesn't meet these guidelines, you can modify your site so that it does and request reinclusion.


Part of the Guidelines given by Yahoo:

Yahoo! strives to provide the best search experience on the Web by directing searchers to high-quality and relevant web content in response to a search query.

Pages Yahoo! Wants Included in its Index

  • Original and unique content of genuine value
  • Pages designed primarily for humans, with search engine considerations secondary
  • Hyperlinks intended to help people find interesting, related content, when applicable
  • Metadata (including title and description) that accurately describes the contents of a web page
  • Good web design in general
Unfortunately, not all web pages contain information that is valuable to a user. Some pages are created deliberately to trick the search engine into offering inappropriate, redundant or poor-quality search results; this is often called "spam." Yahoo! does not want these pages in the index.

What Yahoo! Considers Unwanted
Some, but not all, examples of the more common types of content that Yahoo! does not want include:

  • Pages that harm accuracy, diversity or relevance of search results
  • Pages dedicated to directing the user to another page
  • Pages that have substantially the same content as other pages
  • Sites with numerous, unnecessary virtual hostnames
  • Pages in great quantity, automatically generated or of little value
  • Pages using methods to artificially inflate search engine ranking
  • The use of text that is hidden from the user
  • Pages that give the search engine different content than what the end-user sees
  • Excessively cross-linking sites to inflate a site's apparent popularity
  • Pages built primarily for the search engines
  • Misuse of competitor names
  • Multiple sites offering the same content
  • Sites that use excessive pop-ups, interfering with user navigation
  • Pages that seem deceptive, fraudulent or provide a poor user experience

YST's Content Quality Guidelines are designed to ensure that poor-quality pages do not degrade the user experience in any way. As with Yahoo!'s other guidelines, Yahoo! reserves the right, at its sole discretion, to take any and all action it deems appropriate to insure the quality of its index.


These guidelines are well hidden in Google's and Yahoo's pages. Search their sites carefully if you want them.

Reference:

DOM - Part Two - How to get a node easily?

Walk through in a more elegant way:

As we can see, walking through the DOM tree is actually very troublesome. If there are thousands of tags in your document and what you need is just one particular tag nested several levels deep, I guess you'll simply give up if the only method available is the one mentioned in the first part. However, for programmers' convenience, the DOM does have some support for fast access to particular nodes.

The first way is to get the element by its tag. So let's continue with our example. You want to access the element node B. The simplest way is to jump to it directly. With the method document.getElementsByTagName you can get an array-like list of all B elements in the document and then go to one of them. Let's assume that this B is the first one in the document; then you can simply do

var x = document.getElementsByTagName('B')[0];
Or we can do it more directly and get an element by its id. The usage is even more straightforward:

var x = document.getElementById('hereweare');

Here, if we've assigned the id "hereweare" to an element, the function will return a reference to that element. Then we can manipulate the element in whatever way we want.
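
To see what getElementById saves us, here is a minimal sketch of the same lookup done by hand, walking childNodes recursively. The helper name findById and the plain-object tree are made up for illustration (they are stand-ins so the sketch also runs outside a browser); in a real page you would simply call document.getElementById.

```javascript
// Hypothetical helper: depth-first search for the first element whose id matches.
// Relies only on the standard node properties nodeType, id and childNodes.
function findById(node, id) {
  if (node.nodeType === 1 && node.id === id) { // 1 = element node
    return node;
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var found = findById(children[i], id);
    if (found) {
      return found;
    }
  }
  return null; // nothing matched in this subtree
}

// Plain-object stand-in for <p>This is a <b id="hereweare">paragraph</b></p>
var b = { nodeType: 1, nodeName: 'B', id: 'hereweare', childNodes: [
  { nodeType: 3, nodeValue: 'paragraph', childNodes: [] } // 3 = text node
] };
var p = { nodeType: 1, nodeName: 'P', id: '', childNodes: [
  { nodeType: 3, nodeValue: 'This is a ', childNodes: [] },
  b
] };

var x = findById(p, 'hereweare');
console.log(x.nodeName); // prints "B"
```

Even for this tiny tree the manual search needs a loop and recursion, which is the bookkeeping the built-in lookup hides.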

Example: Why do we need the DOM?

Actually, even before the DOM we could already access elements quite easily in JavaScript. In IE, for example, we could just write:

function hide(x)
{
// x is expected to be the element itself
x.style.visibility="hidden";
// ......
// ......
}


then call the function with the element itself. In IE an element's id is also available as a global variable, so we can write

hide(hello);
Then the element with id "hello" will be manipulated by the statements in the function hide().

However, this easily causes problems. The shortcut is specific to IE, and passing the id as a string, as in hide("hello"), does not work, because a string has no style property. This sometimes confuses programmers. To make the code more readable and standardized, we should adopt the DOM instead of the method mentioned above. Actually, in Firefox we can only use the DOM to access elements.

To make the function standardized, we need to modify it a little bit:

function hide(name)
{
// look the element up by its id, then work on the element itself
var x=document.getElementById(name);
x.style.visibility="hidden";
// ......
// ......
}
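
One detail the function above leaves open is what happens when no element has the given id: getElementById then returns null and the style access fails. Below is a minimal sketch of a guarded version. The extra doc parameter and the fakeDocument object are assumptions added only so the sketch can run outside a browser; in a page you would pass document.

```javascript
// Sketch of hide() with a guard for ids that do not exist.
// `doc` is a hypothetical extra parameter (pass `document` in a browser).
function hide(id, doc) {
  var x = doc.getElementById(id);
  if (!x) {
    return false; // no element with that id, nothing to hide
  }
  x.style.visibility = 'hidden';
  return true;
}

// Minimal stand-in for a document holding one element with id "hello".
var hello = { id: 'hello', style: { visibility: 'visible' } };
var fakeDocument = {
  getElementById: function (id) {
    return id === hello.id ? hello : null;
  }
};

console.log(hide('hello', fakeDocument), hello.style.visibility); // true 'hidden'
console.log(hide('nosuchid', fakeDocument)); // false
```

Returning true or false lets the caller decide what to do about a missing element instead of failing with a script error.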




Reference: http://www.quirksmode.org/dom/intro.html

DOM - Part one

What's DOM?

The Document Object Model (DOM) is the model that describes how all elements in an HTML page, like input fields, images, paragraphs etc., are related to the topmost structure: the document itself. By calling the element by its proper DOM name, we can influence it.

The Level 1 DOM Recommendation has been developed by the W3C to provide any programming language with access to each part of an XML document. As long as you use the methods and properties that are part of the recommendation, it doesn't matter if you parse an XML document with VBScript, Perl or JavaScript. In each language you can read out whatever you like and make changes to the XML document itself.

As some of you might have guessed, this paragraph describes an ideal situation and differences (between browsers, for instance) do exist. Generally, however, they're far smaller than usual so that learning to use the W3C DOM in JavaScript will help you to learn using it in another programming language.

In a way HTML pages can be considered as XML documents. Therefore the Level 1 DOM will work fine on an HTML document, as long as the browser can handle the necessary scripts.

You can read out the text and attributes of every HTML tag in your document, you can delete tags and their content, you can even create new tags and insert them into the document so that you can really rewrite your pages on the fly, without a trip back to the server.
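
As a rough sketch of what "creating and deleting tags on the fly" looks like, the two helpers below use the standard DOM methods createElement, createTextNode, appendChild and removeChild. The helper names appendParagraph and removeNode, and the doc parameter, are made up for illustration.

```javascript
// Hypothetical helper: append <p>text</p> to `parent` and return the new element.
// `doc` is the document object (pass `document` in a browser).
function appendParagraph(doc, parent, text) {
  var p = doc.createElement('p');
  p.appendChild(doc.createTextNode(text)); // the text becomes a child text node
  parent.appendChild(p);
  return p;
}

// Hypothetical helper: delete a node together with its content.
function removeNode(node) {
  if (node.parentNode) {
    node.parentNode.removeChild(node);
  }
}

// In a browser this could be used as:
//   var p = appendParagraph(document, document.body, 'Added on the fly');
//   removeNode(p);
```

Both changes happen in the page that is already loaded, with no trip back to the server.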

Because it is developed to offer access to and change every aspect of XML documents, the DOM has many possibilities that the average web developer will never need. For instance, you can use it to edit the comments in your HTML document, but I don't see any reason why you would want to do so. Similarly, there are sections of the DOM that deal with the DTD/Doctype, with DocumentFragments (tiny bits of a document) or the enigmatic CDATA. You won't need these parts of the DOM in your HTML pages, so I ignore them and concentrate instead on the things that you'll need in your daily work.


How does the DOM work?

In the DOM, every element and its content is seen as a node in a "document tree". Say we have code like this:
<p>This is a paragraph</p>
You'll have a tree as below:

<P>   <-- element node
|
|
This is a paragraph   <-- text node



And if we have:

<p>This is a <b>paragraph</b></p>


We'll have a tree like this:


         <P>
          |
    --------------
    |            |
This is a       <B>
                 |
                 |
             paragraph



We can see that if a tag is nested inside another, it becomes a child node of its outer tag. So a DOM tree is formed by creating child nodes below their parents, following the way the tags are nested. This guarantees that a well-formed web page maps to one definite, well-formed DOM tree.
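
The nesting rule can be sketched in code. The helper below (the name outline is made up) prints one line per node, indented by depth, using only the standard properties nodeType, nodeName, nodeValue and childNodes. The tree is a plain-object stand-in for the second example, so the sketch also runs outside a browser.

```javascript
// Hypothetical helper: return one line per node, indented by nesting depth.
function outline(node, depth) {
  depth = depth || 0;
  var indent = new Array(depth + 1).join('  '); // two spaces per level
  var label = node.nodeType === 3               // 3 = text node, 1 = element node
    ? '"' + node.nodeValue + '" (text node)'
    : '<' + node.nodeName + '> (element node)';
  var lines = [indent + label];
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    lines = lines.concat(outline(children[i], depth + 1));
  }
  return lines;
}

// Plain-object stand-in for <p>This is a <b>paragraph</b></p>
var tree = { nodeType: 1, nodeName: 'P', childNodes: [
  { nodeType: 3, nodeValue: 'This is a ' },
  { nodeType: 1, nodeName: 'B', childNodes: [
    { nodeType: 3, nodeValue: 'paragraph' }
  ] }
] };

console.log(outline(tree).join('\n'));
// <P> (element node)
//   "This is a " (text node)
//   <B> (element node)
//     "paragraph" (text node)
```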


How to walk through a DOM tree?

Knowing the exact structure of the DOM tree, you can walk through it in search of the element you want to influence. For instance, assume the element node P has been stored in the variable x (later on I'll explain how you do this). Then if we want to access the BODY we do

x.parentNode
We take the parent node of x and do something with it. To reach the B node:

x.childNodes[1]

childNodes is an array-like list (a NodeList) that contains all children of the node x. Of course numbering starts at zero, so childNodes[0] is the text node 'This is a ' and childNodes[1] is the element node B.

Two special cases: x.firstChild accesses the first child of x (the text node), while x.lastChild accesses the last child of x (the element node B).

So supposing the P is the first child of the body, which in turn is the first child of the document, you can reach the element node B by either of these commands:



document.firstChild.firstChild.lastChild;
document.firstChild.childNodes[0].lastChild;
document.firstChild.childNodes[0].childNodes[1];
etc.


or even (though it's a bit silly)


document.firstChild.childNodes[0].parentNode.firstChild.childNodes[1];
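
To check that all of these paths really end at the same node, here is a small sketch that can run outside a browser. The makeNode helper and the doc variable are hypothetical stand-ins for real DOM nodes; they mirror the simplified layout supposed above (document, then BODY, then P). In a real browser, document.firstChild is typically the doctype or the HTML element rather than BODY, so the exact paths would differ.

```javascript
// Hypothetical stand-in for browser nodes. It wires up the same properties
// the text uses: parentNode, childNodes, firstChild and lastChild.
function makeNode(name, children) {
  var node = { nodeName: name, parentNode: null, childNodes: children || [] };
  for (var i = 0; i < node.childNodes.length; i++) {
    node.childNodes[i].parentNode = node;
  }
  node.firstChild = node.childNodes[0] || null;
  node.lastChild = node.childNodes[node.childNodes.length - 1] || null;
  return node;
}

// document -> BODY -> P -> [text node, B -> text node]
var b = makeNode('B', [makeNode('#text')]);
var p = makeNode('P', [makeNode('#text'), b]);
var body = makeNode('BODY', [p]);
var doc = makeNode('#document', [body]);

var paths = [
  doc.firstChild.firstChild.lastChild,
  doc.firstChild.childNodes[0].lastChild,
  doc.firstChild.childNodes[0].childNodes[1],
  doc.firstChild.childNodes[0].parentNode.firstChild.childNodes[1]
];

for (var j = 0; j < paths.length; j++) {
  console.log(paths[j] === b); // prints true for every path
}
console.log(p.parentNode === body); // x.parentNode from the text: true
```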