A key Role in the Initial Growth of the World Wide Web

Apache Web Server Journal

Subscribe to Apache Web Server Journal: eMailAlertsEmail Alerts newslettersWeekly Newsletters
Get Apache Web Server Journal: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Apache Web Server Authors: Liz McMillan, Elizabeth White, Pat Romanski, Janakiram MSV, Gil Allouche

Related Topics: Apache Web Server Journal, XML Magazine

Apache Web Server: Article

Storing XML in Relational Database Management Systems

Using the Edge table approach

When we need to store XML in a Relational Database Management System (RDBMS) we have to think of nonnative ways to store it, since the structure of XML does not fit the flat relations of the RDBMSs. Although some RDBMS vendors support XML storage and XQuery such as Oracle 10g (release 2), this support is not yet common. Some of the well-known RDBMSs are Oracle, DB2, Informix, Sybase IQ, SQL Server, MySQL, and PostgreSQL. If you have one of these products, or a similar product, you should be interested in reading this article.

There are several ways to store XML in an RDMS. One possibility is to store XML in big text columns such as CLOB (Character Large Object). Some advantages of storing XML as CLOB are: there is no preprocessing, and the originality of the XML document is preserved. Unfortunately, this method has some major disadvantages - neither XQuery nor SQL will work on the text columns. We would like to query the stored data, because reading the whole document and extracting needed parts manually is simply impractical. We would like to find ways to store XML in a database that would allow us to query the data easily. In this article I am going to show you a method that is known as shredding XML into the edge table. This is a very general method that will work on any XML document regardless of having a schema.

XML has a lot of bells and whistles, but it can be viewed simply as a tree. It's possible to represent the XML nodes as vertices of the tree and the relations between the XML nodes as the edges. Let's look at an example of how we can take an XML fragment and represent it as a tree. Figure 1 shows a simple XML document and Figure 2 shows its tree representation. The root of the XML document is the root of the tree too. Note that in XML order is very important, therefore when we create a tree of XML we have to capture the order, which is done in Figure 2. For example, the attribute isbn comes before the element title, which comes before the element authors.

Figure 2 shows the elements and attributes in ovals, and their corresponding values in rectangular boxes. The tree you see in Figure 2 is a nice representation of XML that makes it easier for us to gather the information presented in Figure 1 in a graphical way. The attribute isbn carries a flag of @, which indicates that it is an attribute. Unfortunately, storing Figure 2 as a graph into a database isn't of much use either. We will shred the XML tree into pieces, each of which can be stored in the rows of a database table. However, in order to do this we first have to number all of the vertices of the tree as shown in Figure 3. Numbering is essential for capturing the parent-child and ancestor-descendant relations, which will be used for answering XML queries.

Numbering an XML Tree
We will number the vertices in a top-down, left-to-right fashion using preorder traversal, which is numbering a vertex when we see it for the first time and then repeating this process until all vertices are numbered.

The red dashed line shows how to number the XML tree. It traverses the tree in a top-down and left-to-right fashion. Note that the root element book has the lowest number in the tree. The numbers of @isbn, title, and authors are 2, 4, and 6, which in fact capture the order of these nodes.

Edge Table
Now that we have numbered the XML tree, we can store it in the Edge table. An Edge table has id, parentid, name, value, and lastdesc attributes (columns) as shown in Figure 4.

id is the identification number of a vertex (element, attribute, text box, etc.). Identification numbers are shown in Figure 3. id is generally integer type.

parentid is the parent's identification number. For example, the parentid of the title element is 1 (same for @isbn and authors). The type integer is the usual choice.

name is the name of an element or an attribute. Text nodes (rectangular boxes) do not have names, but we name them as text for better performance. name can be char or varchar type. Data type char stands for character, and varchar stands for variable character. Element tag names and attribute names are generally short, usually not more than 50 characters.

value is the value of an element or an attribute. The content of this column is generally larger than the other columns. In most cases using varchar is the most efficient solution. For example in Figure 1, the value of the title element, <title>XML in Use</title>, has the largest character content: a total of 10 characters.

For simplicity, in this article I have been using a small XML sample (and I will continue using the same XML below). For interested readers, I have provided much larger XML samples at www.cs.umb.edu/~smimarog/xmlsample. For these examples I set the value type to varchar(550) in my database. varchar(550) is space saver compared to char(550). varchar(550) indicates that the maximum length of the characters will be no more than 550, but when it is less than 550, say 35, then the remaining 515 characters will not be wasted. However when you have the type char(550) and put content into it that has only 35 characters, unfortunately 515 characters will be wasted.

lastdesc is the number that holds the last descendant of a node. Figure 3 shows that lastdesc of book is 10, the lastdesc of title is 5, etc. We are going to use the last descendant information for answering XPath queries that contain // (double slash). Remember that // is the shortcut of descendant-or-self::node(). I will show how this is used in an example later in this article.

Parsing XML
Parsing XML for storage into an RDBMS is known as shredding, as I mentioned earlier. There are several dozens of XML parsers available, which can be categorized into two main groups: SAX (Simple API for XML) parsers, and DOM (Document Object Model) parsers. SAX is the gold standard of XML APIs. SAX implementations are fast and memory efficient. SAX is available for most of the popular programming languages such as Java, C++, Visual Basic, Python, and Perl. Java comes with a built-in SAX parser. Apache's Xerces (http://xerces.apache.org) provides a SAX parser for C++, Java, and Perl.

SAX is an event-driven API. It notifies the user of various events when parsing the XML document in top-down fashion, such as startDocument, endDocument, startElement, endElement, processingInstructions, etc. It's the programmer's responsibility to take action upon notification of these events. In Java this is done by implementing the ContentHandler interface or by extending the DefaultHandler class. SAX parsers check for well-formedness and optionally for validity. In Java the errors and the warnings are provided by the ErrorHandler interface. The parser will send a "Fatal Error" when it encounters a well-formedness error, and an "Error" for all other errors.

Let's look at one of the events closely:

public void startElement
      (String namespaceURI,
      String localName,
      String qName,
      Attributes atts)
throws SAXException

The SAX parser will invoke this event at the beginning of every element. The SAX parser provides the namespace URI, the local name, the qualified name (qName), and the list of attributes of this element. This event is accompanied with the endElement event when the parser reaches the end of this element, but before invoking the endElement the parser will provide the contents of the element in order.

More Stories By Selim Mimaroglu

Selim Mimaroglu is a PhD candidate in computer science at the University of Massachusetts in Boston. He holds an MS in computer science from that school and has a BS in electrical engineering.

Comments (1) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
SYS-CON India News Desk 03/06/06 05:14:58 PM EST

When we need to store XML in a Relational Database Management System (RDBMS) we have to think of nonnative ways to store it, since the structure of XML does not fit the flat relations of the RDBMSs. Although some RDBMS vendors support XML storage and XQuery such as Oracle 10g (release 2), this support is not yet common. Some of the well-known RDBMSs are Oracle, DB2, Informix, Sybase IQ, SQL Server, MySQL, and PostgreSQL. If you have one of these products, or a similar product, you should be interested in reading this article.