How To :: XSLT :: Using tokenize()

In creating webdlmon I use a variety of XML files as sources. These files began as orb status packets that were converted to Antelope parameter file (pf) format. The XML itself is created by pf2xml, an Antelope command-line tool.

One of the XML files is a large stash file that contains lines of pfarr and pfstring tags. The contents of each pfstring are columns separated by spaces. I needed to parse these long lines into variables and reformat the output into something palatable to PHP 5's DOMDocument class. This can be done easily using XSLT.

The fact that these are Antelope pf-based XML files is irrelevant to the following discussion, as this is all about parsing long strings in XML. Below is the content of the source XML file that originally came from an Antelope status ORB:

input.xml

  <?xml version="1.0" encoding="iso-8859-1"?>
  <pfarr>
  <pfstring name="commandline">q3302orb -v -S state/q3302orb_AG -pf q3302orb_AG</pfstring>
  <pfstring name="dlsite">
  q330  0000  345  1169760599.99999  TA_D03A  921  47  -123  0.0325  regular internet  hosted  1172293472.07035
  q330  0123  234  9999999999.99900  TA_HAST  1005  36  -121  0.5558  regular internet  hosted  1172293966.53652
  q330  0234  123  1157317200.00000  TA_U04C  718  36  -120  0.7886  vsat  spacenet  1172298386.07728
  </pfstring>
  </pfarr>

The fields that I was interested in were the datalogger name (column 5), the communications type (column 10) and the communications provider (column 11). I wanted the output to look like the following:

output.xml

  <?xml version="1.0" encoding="iso-8859-1"?>
  <dlsites>
    <site name="TA_D03A">
      <comt>regular internet</comt>
      <comp>hosted</comp>
    </site>
    <site name="TA_HAST">
      <comt>regular internet</comt>
      <comp>hosted</comp>
    </site>
    <site name="TA_U04C">
      <comt>vsat</comt>
      <comp>spacenet</comp>
    </site>
  </dlsites>

So, what to do? I originally thought to use a regular expression in my XSLT, something along the lines of:

  <?xml version="1.0" encoding="ISO-8859-1"?>

  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />

    <xsl:template match="/">
      <dlsites>
        <xsl:apply-templates select="/pfarr/pfstring" />
      </dlsites>
    </xsl:template>

    <xsl:template match="pfstring[@name = 'dlsite']">
      <xsl:variable name="elValue" select="." />

      <xsl:analyze-string select="$elValue" regex="\s*(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+\n">

        <xsl:matching-substring>
          <xsl:variable name="dlname" select="regex-group(5)" />
          <!-- $dlname is a variable, so it is referenced with $, not @ -->
          <site name="{$dlname}">
            <comt><xsl:value-of select="regex-group(10)"/></comt>
            <comp><xsl:value-of select="regex-group(11)"/></comp>
          </site>
        </xsl:matching-substring>

        <xsl:non-matching-substring>
          <unknown>
            <xsl:value-of select="$elValue"/>
          </unknown>
        </xsl:non-matching-substring>

      </xsl:analyze-string>

    </xsl:template>

  </xsl:stylesheet>

When I emailed the XSLT discussion group about this, I was told that a regular expression match like the one I had written would be very greedy. Resident discussion group guru Abel Braaksma told me to have a look at the function tokenize(). It is an XPath 2.0 function, so it is available in XSLT 2.0, and it is implemented in Michael Kay's Saxon XSLT processor, which is the processor I use. (Aside: there is a good article on using tokenize() at O'Reilly's xml.com.)
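To make the greediness concrete, here is a minimal sketch using the XPath 2.0 replace() function. The three-column sample line is a hypothetical stand-in for a real dlsite line, and the stylesheet ignores its input, so it can be run against any XML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <!-- Run against any XML document; the input itself is not read. -->
  <xsl:template match="/">
    <!-- Hypothetical three-column sample line (not taken from a real stash file) -->
    <xsl:variable name="line" select="'q330 0000 TA_D03A'" />

    <!-- Greedy: the first (.*) swallows two columns.
         Writes [q330 0000][TA_D03A] -->
    <xsl:value-of select="replace($line, '(.*)\s+(.*)', '[$1][$2]')" />
    <xsl:text>&#10;</xsl:text>

    <!-- \S+ cannot cross whitespace, so the first group is exactly one column.
         Writes [q330][0000 TA_D03A] -->
    <xsl:value-of select="replace($line, '(\S+)\s+(.*)', '[$1][$2]')" />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With eleven (.*) groups in a row, the processor has to backtrack through many possible ways of dividing the line between the groups, which is part of why splitting on the separator is the simpler approach.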

A quick look at some sample code Abel sent me showed me how powerful tokenize() can be. I decided to use tokenize() as follows:

  1. Define an XSL variable that holds the dlsite string from input.xml split into lines (tokenize on '\n'); let's call it $tokenizedLine.
  2. In an XSL for-each loop, take each line and tokenize its contents with a simple regular expression that matches two or more whitespace characters ('\s{2,}'). Assign the tokenized string to a new variable, $values.
  3. I can then treat $values like an array, i.e. reference its items by index (XPath sequences are indexed from 1).
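Steps 2 and 3 can be tried in isolation with a small sketch like the one below. The five-column sample line is a hypothetical, shortened version of a dlsite line, and the stylesheet ignores its input, so any XML document will do as the source.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <!-- Run against any XML document; the input itself is not read. -->
  <xsl:template match="/">
    <!-- Hypothetical sample line: columns separated by two spaces,
         with a single space inside the "regular internet" column -->
    <xsl:variable name="line"
                  select="'q330  0000  TA_D03A  regular internet  hosted'" />

    <!-- Split on runs of two or more whitespace characters -->
    <xsl:variable name="values" select="tokenize($line, '\s{2,}')" />

    <!-- Writes 5: the single space does not split a column -->
    <xsl:value-of select="count($values)" />
    <xsl:text>&#10;</xsl:text>

    <!-- Sequences are indexed from 1; writes "regular internet" -->
    <xsl:value-of select="$values[4]" />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Note that a line with leading whitespace would produce an empty first token and shift every index by one, so this only works as shown when each line starts in the first column.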

So, the final XSLT looks like this:

transform.xsl

  <?xml version="1.0" encoding="ISO-8859-1"?>

  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />

    <xsl:template match="/">
      <dlsites>
        <!-- Select only the dlsite string; the text of any other pfstring
             would otherwise be copied to the output by the built-in rules -->
        <xsl:apply-templates select="/pfarr/pfstring[@name = 'dlsite']" />
      </dlsites>
    </xsl:template>

    <xsl:template match="pfstring[@name = 'dlsite']">

      <xsl:variable name="elValue" select="." />
      <!-- Split into lines, dropping the blank ones at the start and end -->
      <xsl:variable name="tokenizedLine" select="tokenize($elValue, '\n')[normalize-space(.)]" />

      <xsl:for-each select="$tokenizedLine">
        <!-- Split each line on runs of two or more whitespace characters -->
        <xsl:variable name="values" select="tokenize(., '\s{2,}')" />
        <site name="{$values[5]}">
          <comt><xsl:value-of select="$values[10]"/></comt>
          <comp><xsl:value-of select="$values[11]"/></comp>
        </site>
      </xsl:for-each>

    </xsl:template>

  </xsl:stylesheet>

This is a much cleaner and more efficient way of parsing the XML. Assigning the output of tokenize() to a variable makes for very nice mark-up. This has made the processing of my various XML files much faster. Thanks to the discussion forum, and in particular Abel, for their helpful comments.

Final note

Since figuring this out, I decided to go a different route and do much of this processing on-the-fly with PHP 5's built-in DOMDocument class and many of the Antelope Datascope.so PHP commands. This is an even more efficient method, and it uses one fewer file to create webdlmon (the XSLT file in this discussion). Another reason for switching is that the column separator in the stash XML files is inconsistent: sometimes one space, sometimes two, sometimes three or more. With a single space between columns, a '\s{2,}' separator no longer splits the line, and this inconsistency makes it extremely difficult to write clean code that parses the source XML reliably.
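For anyone who wants to stay in XSLT, one possible workaround is to split on any run of whitespace and count columns from both ends of the line. The sketch below is not what webdlmon uses. It assumes that every line has nine single-word columns before the communications type and exactly two single-word columns after it (the provider and the trailing timestamp), so that only the communications type can contain spaces.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes" />

  <xsl:template match="/">
    <dlsites>
      <xsl:apply-templates select="/pfarr/pfstring[@name = 'dlsite']" />
    </dlsites>
  </xsl:template>

  <xsl:template match="pfstring[@name = 'dlsite']">
    <!-- Split into lines and skip the blank ones -->
    <xsl:for-each select="tokenize(., '\n')[normalize-space(.)]">
      <!-- normalize-space() trims the line and collapses every run of
           whitespace to one space, so a single space is the separator -->
      <xsl:variable name="t" select="tokenize(normalize-space(.), ' ')" />

      <!-- Assumption: 9 fixed columns before the type and 2 after it,
           so the type spans tokens 10 to count($t) - 2 -->
      <site name="{$t[5]}">
        <!-- value-of joins the items of a sequence with single spaces -->
        <comt><xsl:value-of select="subsequence($t, 10, count($t) - 11)" /></comt>
        <comp><xsl:value-of select="$t[last() - 1]" /></comp>
      </site>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

If a provider name could also contain a space, this approach breaks down, because there is then no way to tell from the whitespace alone where the type ends and the provider begins.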

Disclaimer

This information is freely provided as-is. Messing around with the command line and creating files is a serious business, and I accept no liability for errors created, systems corrupted, or hard-disk damage caused by you following these instructions. They worked for me but may not work for you. Remember to back up EVERYTHING before you try any of this stuff; it is not simple OR easy!!!

If you have any questions about this please email me at rlnewman@ucsd.edu and I will try my best to help you out.
