
Learn 10 good XML usage habits
Written by Bruno Grange
Monday, 26 January 2009 09:31
Improve your effectiveness and efficiency for working with XML
Level: Introductory
Martin Brown, Developer and writer, Freelance
13 May 2008
Make your XML work easier with the ten tips in this article; ultimately you'll be less prone to errors and more productive.
You love XML and the flexibility and interoperability that it offers, but you can do some things to make your interaction with XML and the tools that you use to work with it significantly easier. Picking up some basic good habits when you work with XML will ensure that you get the most efficient use out of your XML documents and applications.
Here are the top 10 good XML habits to adopt:
When you create an XML document quickly, it can be very tempting to create the basic structure and skip the usual XML document requirements: the XML declaration and the encoding of the data that the document contains. Consider the XML document in Listing 1.
Listing 1. XML document minus the XML declaration and data encoding type
As a human, you can look at that document and identify it as XML, but it is more difficult for a computer to achieve the same determination. You can make the process more explicit and identifiable by adding the XML declaration to the top of the file. This is a single line that specifies that the document is XML, and also describes a version number and the character encoding used in the XML data. For example:
The content of the encoding specification should be accurate, too. XML parsers use the encoding to ensure that each character is decoded correctly from the XML document. For example, continuing the phrase-based example in Listing 1, adding a Russian entry to your document would cause a problem, because the encoding you currently specify does not support the extended character set required by the Russian phrase for hello. Specifying the wrong encoding might mean that parsers process the document incorrectly; for example, reading a multibyte extended character as a sequence of individual bytes might lead to corrupt data and bad output.
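The effect of a wrong encoding declaration can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions rather than a reconstruction of Listing 1: the `phrase` element, its `lang` attribute, and the choice of ISO-8859-1 as the mismatched encoding are all hypothetical.

```python
import xml.etree.ElementTree as ET

PHRASE = "Привет"  # Russian for "hello"

# Correct: the bytes are UTF-8 and the declaration says so.
# (The <phrase> element is a hypothetical stand-in for the phrase example.)
good = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<phrase lang="ru">' + PHRASE + '</phrase>').encode("utf-8")

# Wrong: the same UTF-8 bytes, but the declaration now claims ISO-8859-1.
bad = good.replace(b'encoding="UTF-8"', b'encoding="ISO-8859-1"')

# The parser trusts the declaration when it decodes the bytes.
good_text = ET.fromstring(good).text
bad_text = ET.fromstring(bad).text

print(good_text == PHRASE)  # the declared encoding matches the bytes
print(bad_text == PHRASE)   # each multibyte character was read byte by byte
```

In the second case the parser reports no error at all; it simply hands back corrupted text, which is why the declaration needs to be accurate and not merely present.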
Once you have the XML declaration in place, you should then ensure that the valid structure of your XML file is defined with a DTD or an XSD. Either solution allows XML parsers to check and confirm that the contents of the XML file match the structure appropriate for the data that you are trying to model. For example, given a simple XML structure for a contact database, you want to define a structure that allows for the contact's name, address, and phone numbers to be specified. Using a DTD means that you can map out the structure and ensure that each of the contacts within the structure matches the layout. For example, Listing 2 shows a DTD for the contacts database.
Listing 2. A DTD for the contacts database
The DTD defines the elements, attributes (and the supported values of those attributes) required to describe a contact. You can see in Listing 2, for example, that a phone element has a type attribute, and that you also have attributes for the address and for components within the address. Use of a DTD helps to ensure that the structure is valid and, when used in combination with a validation process, can identify any problems. When used with an XML-capable editor, DTDs can also help with editing and automated completion of the content. XSDs, or schemas, perform many of the same functions as DTDs, but can be useful in different ways. For example, while some XML editors require a DTD for automated completion of content, schemas can provide more flexibility in the design of the actual hierarchy for the document. The tool you choose will depend on your own circumstances.
Looking at Listing 3, can you spot the problem?
Listing 3. A validation example
Finding the problem by hand is tedious. But run the file through xmllint, a free tool that verifies the content and structure of an XML file, and you get the output shown in Listing 4.
Listing 4. Output after running Listing 3 through xmllint
Although this looks very complicated compared to the original problem (one of the attributes wasn't closed), it does give you a place to start. Incidentally, xmllint supports a number of different command-line options to help select the diagnosis method and results. One of the most useful options is the … If you are using a DTD, then use the …
Listing 5. xmllint finds a different error
Using xmllint in this way is a quick, convenient way to confirm that the structure of a document is valid. xmllint is available as part of the libxml2 toolkit, which is bundled with Linux, UNIX®, and Mac OS X, but requires a separate download for Windows®. For more information on xmllint and libxml2, see Resources.
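When xmllint is not to hand, a similar well-formedness check can be sketched with Python's standard library. This is not a substitute for DTD validation (the standard library parser does not validate against a DTD), and the sample document and the function name `find_xml_error` are hypothetical.

```python
import xml.etree.ElementTree as ET

def find_xml_error(xml_text):
    """Return None if xml_text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return line, column, str(err)
    return None

# Hypothetical sample: the type attribute on line 4 is missing its closing quote.
BROKEN = """<contacts>
  <contact>
    <name>Example Person</name>
    <phone type="home>01234 567890</phone>
  </contact>
</contacts>"""

FIXED = BROKEN.replace('"home>', '"home">')

print(find_xml_error(BROKEN))
print(find_xml_error(FIXED))
```

As with xmllint, the reported position is where the parser gave up, which gives you a place to start looking rather than an exact diagnosis.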
Validation isn't always the answer
Using xmllint and similar tools, particularly if you use a DTD, is a great way to validate the structure of your XML files. The solution, however, does have its limitations. What about the content of the XML file, for instance? With a DTD or XSD, you can specify explicit contents for attributes: you can restrict an attribute to a string or ID drawn from a list of allowed values. With a DTD, however, the content of elements cannot be controlled or limited in the same way. For example, in the contacts example, the phone element contains numbers and spaces, but there's nothing to stop a user adding alphabetic characters to that element. Doing so won't bring up an error during validation using xmllint, and editors and other XML-aware solutions won't address or identify the problem. You might only learn about the problem when your application fails because it encountered data of an unexpected type. In short, this kind of validation only ensures that the structure is correct, not the data. The easiest way to address this is to write a parser that reads the XML file and actually validates the data content. Don't go overboard in verifying the content though; you only need to go as far as ensuring that the data meets the requirements of your application.
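A content check of this kind can be quite small. The following Python sketch assumes a contacts layout with `phone` elements, as described above, and treats "digits and spaces only" as the application's rule; the sample data and the function name are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the application accepts digits and spaces only in a phone number.
PHONE_PATTERN = re.compile(r"[0-9 ]+")

def invalid_phone_numbers(xml_text):
    """Return the text of every <phone> element that breaks the rule."""
    root = ET.fromstring(xml_text)
    problems = []
    for phone in root.iter("phone"):
        value = (phone.text or "").strip()
        if not PHONE_PATTERN.fullmatch(value):
            problems.append(value)
    return problems

# Hypothetical sample: well-formed XML, but one number contains letters.
SAMPLE = """<contacts>
  <contact>
    <name>Example Person</name>
    <phone type="home">01234 567890</phone>
    <phone type="mobile">07700 call-me</phone>
  </contact>
</contacts>"""

print(invalid_phone_numbers(SAMPLE))
```

The check goes only as far as the application needs: it does not try to decide whether a number is dialable, only whether it contains characters the application cannot handle.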
XML structure versus attributes
Opinion is divided on whether it is better to use attributes or elements to describe the information that you want to represent in the XML file. As a general rule, you should use elements (that is, the data between the tags) to define the information contained within a file. Attributes should be used to provide extended qualification of the data that you describe. Both elements and attributes have limitations. An attribute, for example, cannot be repeated within a tag, a classic case where elements have an advantage over attributes: elements can repeat, which makes them very practical for repeating information. On the other hand, using elements to qualify the data can sometimes be more complex to process. The phone numbers in the contacts example provide a good illustration of the benefits. In the example, shown here in Listing 6, attributes are used to qualify the type of phone number (such as work, home, or mobile).
Listing 6. Qualifying the type of phone numbers
With this structure, it is easy to pick out numbers as a whole (by ignoring attributes), or to pick out a specific phone number type (by using the attribute). Compare that structure to one designed using only elements in Listing 7.
Listing 7. Using only elements to qualify the phone number
Now it is difficult to see the wood for the trees. Although, in theory, any XML parser or a suitable XPath expression can pull out the information you want, you have gained very little while making the XML document harder to read.
When working with XML data, finding the information you want can be complex. You can, of course, write a parser to pick out the material that you need, but sometimes, you really just need to find a small fragment of the information in the file very quickly. For example, if you wanted to extract a list of all the countries in your contacts XML file so that you could see how widely spread your contacts were, you could use XPath to pick out the information. XPath enables you to pull out the data from an XML file by using the structure of the XML file as part of the query. You can, for example, extract the data for a specific element by giving the path to the element within the XML file:
You can dissect the content like this:
Note that in the example, you did not qualify the type of address to select the information from, so it will pick all addresses. You can see the result of the XPath query in Listing 8.
Listing 8. Result of the XPath query
If you want to pick out more specific data, you can specify the element contents, or attribute contents that you want to match. For example, to select only mobile phone numbers, you need to specify the attribute type and value. To do this, use an at sign (@), which specifies that you want to search an attribute, and then specify the value you want to match (see Listing 9).
Listing 9. Selecting only mobile phone numbers
Listings 8 and 9 use a command-line tool. Many XML toolkits provide native methods to work with XPath expressions, so you can use XPath to extract data directly in your applications, without having to write your own parser to get the information.
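As one example of such a toolkit, Python's `xml.etree.ElementTree` supports a limited subset of XPath. The sketch below assumes a contacts layout with `address`, `country`, and `phone` elements and a `type` attribute; the sample data is hypothetical and is not the file used in Listings 8 and 9.

```python
import xml.etree.ElementTree as ET

# Hypothetical contacts data, following the structure described in the text.
CONTACTS = """<contacts>
  <contact>
    <name>Example Person</name>
    <address type="home"><country>United Kingdom</country></address>
    <address type="work"><country>France</country></address>
    <phone type="home">01234 567890</phone>
    <phone type="mobile">07700 900123</phone>
  </contact>
</contacts>"""

root = ET.fromstring(CONTACTS)

# Path query: every country, whatever the type of the enclosing address.
countries = [c.text for c in root.findall("./contact/address/country")]

# Attribute predicate: only the phone elements whose type is "mobile".
mobiles = [p.text for p in root.findall(".//phone[@type='mobile']")]

print(countries)
print(mobiles)
```

Because the first query does not qualify the address type, it returns the country from every address, while the `@type` predicate in the second narrows the match to a single kind of phone number.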
You don't always need a parser to extract information
Although it seems counter-intuitive, you don't always need to use a full XML parser employing SAX, DOM, or other techniques like XPath or XQuery to pull out the information that you want from XML files. XML files contain data in a structured format, and sometimes you need that information in its structured form. More often than not, though, when you are quickly looking for a piece of information, a simpler solution will work. Often you can get away with just using grep, or Perl, or something similar to extract the data you want without actually parsing the structure or content of the document as an XML file. For example, you can pick out phone numbers using grep (see Listing 10).
Listing 10. Picking out phone numbers using grep
You've picked out the information you want, without worrying about the fact that it is XML, or indeed concerning yourself with the structure. When all you want is a quick piece of information, simplified processing techniques are just as capable of finding the information you want, without the overhead associated with a traditional parsing solution.
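The same quick-and-dirty approach can be sketched in Python with a regular expression and no XML parsing at all. It assumes, as a grep-based search also would, that each phone element sits on a single line; the pattern and function name are hypothetical.

```python
import re

# Matches <phone ...>text</phone> on one line; \b stops it matching <phonebook>.
PHONE_TAG = re.compile(r"<phone\b[^>]*>([^<]*)</phone>")

def grep_phone_numbers(xml_text):
    """Scan line by line, like grep, and return the text inside phone tags."""
    numbers = []
    for line in xml_text.splitlines():
        numbers.extend(found.strip() for found in PHONE_TAG.findall(line))
    return numbers

# Hypothetical sample data.
SAMPLE = """<contact>
  <name>Example Person</name>
  <phone type="home">01234 567890</phone>
  <phone type="mobile">07700 900123</phone>
</contact>"""

print(grep_phone_numbers(SAMPLE))
```

This trades robustness for speed of writing: it ignores entities, comments, CDATA sections, and elements split across lines, so it suits a quick lookup rather than production processing.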
When to use SAX over DOM parsing
When you build a parser for your documents to pull out the information that you want, it is often difficult to decide between a SAX-based processor and a DOM-based processor. The easiest way to make the decision is to consider both the complexity of the documents and what you want to do with the information. If you convert or translate documents, or the document is particularly large, then SAX is your best choice. SAX parses the document element by element, calling a method or function each time an element is identified. If you convert an XML document to another format, for example translating XML to HTML, then SAX is the most efficient way. You don't have to load the entire document into memory; you just react to the elements and structure as they are identified. The downside of SAX shows when you need to save or record the structure, or to understand the document as a whole and pick out individual elements from it (for example, selecting a single contact in its entirety). To do this you need to build more complex processes that load the XML, record the data into a structure of your own, and can then pick the elements you want out of that structure for the output target.
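A minimal SAX sketch in Python shows the event-driven style. It converts the name of each contact into an HTML list item as the elements stream past; the `name` element and the class and function names are assumptions made for illustration.

```python
import html
import xml.sax

class NamesToHtml(xml.sax.ContentHandler):
    """Emit an HTML list item for each <name> element as it streams past."""

    def __init__(self):
        super().__init__()
        self.parts = ["<ul>"]
        self.buffer = None  # collects text only while inside <name>

    def startElement(self, name, attrs):
        if name == "name":
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "name":
            text = "".join(self.buffer).strip()
            self.parts.append("<li>" + html.escape(text) + "</li>")
            self.buffer = None

    def endDocument(self):
        self.parts.append("</ul>")

def contacts_to_html(xml_text):
    handler = NamesToHtml()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.parts)

print(contacts_to_html(
    "<contacts><contact><name>Ann &amp; Bob</name></contact>"
    "<contact><name>Cy</name></contact></contacts>"))
```

The handler never holds more than the text of the current name, which is what makes this style suitable for very large documents. Selecting one complete contact would require extra state to rebuild the parts of the structure that SAX discards.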
When to use DOM over SAX parsing
DOM processing loads the entire document and its structure into memory and allows you to refer to and use the structure of your XML document within your application. For example, with the contacts example, you could read the entire contacts database into memory, and then select all the phone numbers by iterating over the contacts and, within each contact, iterating over each phone number. Because DOM retains the structure, and more importantly understands and works with the structure, you can easily work with the structure as a whole or on an individual basis. Staying with the contacts example, inserting a new contact with SAX would be complex. But with DOM, you can just insert a new XML element representing the new contact into the existing XML document.
The limitation of DOM is that processing the file as a stream (for example, translating to HTML) is made more complex, because you have to process the document by iterating over each element individually within the structure. Furthermore, because DOM loads the entire XML document into memory during parsing, DOM parsers can be slower and require more memory. Holding the document in memory does bring a related benefit: you can process an XML document parsed using DOM multiple times from a single parse. With SAX, you have to repeat the parse each time to achieve the same result. See Resources to find out more details and examples of using DOM and SAX.
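Both operations described above, inserting a contact and iterating over phone numbers, can be sketched with Python's `xml.dom.minidom`. The element names follow the contacts example as described in the text; the function names and sample values are hypothetical.

```python
from xml.dom import minidom

def add_contact(doc, name, phone, phone_type="home"):
    """Append a new <contact> element to the document's root element."""
    contact = doc.createElement("contact")

    name_el = doc.createElement("name")
    name_el.appendChild(doc.createTextNode(name))
    contact.appendChild(name_el)

    phone_el = doc.createElement("phone")
    phone_el.setAttribute("type", phone_type)
    phone_el.appendChild(doc.createTextNode(phone))
    contact.appendChild(phone_el)

    doc.documentElement.appendChild(contact)
    return contact

def all_phone_numbers(doc):
    """Iterate over each contact, then over each phone number within it."""
    numbers = []
    for contact in doc.getElementsByTagName("contact"):
        for phone in contact.getElementsByTagName("phone"):
            text_node = phone.firstChild
            numbers.append(text_node.data if text_node is not None else "")
    return numbers

# Hypothetical starting document with a single contact.
doc = minidom.parseString(
    '<contacts><contact><name>Ann</name>'
    '<phone type="home">01234 567890</phone></contact></contacts>')

add_contact(doc, "Bob", "07700 900123", phone_type="mobile")
print(all_phone_numbers(doc))
```

The document is parsed once and then both modified and queried in memory, which is the multiple-use benefit described above. The cost is that the whole tree has to fit in memory.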
If you regularly write and use XML, then a good XML editor is a must. XML editors differ from standard text editors because they understand the structure and layout of XML. They can offer a whole range of features that make it easier to work with XML, including:
Examples of good XML editors include Eclipse and oXygenXML, but plenty of other choices are out there.
Learning good habits in XML can make all the difference between taking advantage of the functionality offered by XML and struggling against the XML standard to get the basics of validation and parsing right. This article should help you to adopt 10 good habits that improve your effectiveness and efficiency as you work with XML documents and data.


