XML stands for **eXtensible Markup Language**. It is a markup language much like HTML, but designed to **store and transport data**, not to display it. XML is a W3C recommendation.
Key characteristics of XML:
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A young man wins a lottery, but it's a trap.</description>
</book>
</catalog>
While both XML and HTML are markup languages, they serve different purposes:
# HTML Example (display-oriented)
<h1>Book Catalog</h1>
<p>Here are some books:</p>
<ul>
<li>XML Developer's Guide by Matthew Gambardella</li>
</ul>
# XML Example (data-oriented)
<book>
<title>XML Developer's Guide</title>
<author>Matthew Gambardella</author>
</book>
XML documents must follow strict syntax rules to be considered **"well-formed"**.
# Correct:
<main><section>...</section></main>
# Incorrect (overlapping tags):
<main><section>...</main></section>
# Correct:
<element name="value">
# Incorrect:
<element name=value>
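One way to test well-formedness programmatically is to hand the text to a parser and see whether it raises an error. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags and quoted attribute values parse cleanly.
print(is_well_formed('<main><section>...</section></main>'))   # True
print(is_well_formed('<element name="value"></element>'))      # True

# Overlapping tags and unquoted attribute values are rejected.
print(is_well_formed('<main><section>...</main></section>'))   # False
print(is_well_formed('<element name=value></element>'))        # False
```

A parse error only says the document breaks a syntax rule; it says nothing about whether the content makes sense for a particular application.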
XML elements are the building blocks of XML. They represent data or containers for other data.
<name>John Doe</name>
<customers> <!-- Root element -->
<customer>...</customer>
</customers>
<order> <!-- Parent of <item> -->
<id>123</id>
<item>Laptop</item> <!-- Child of <order> -->
</order>
<br /> <!-- Self-closing tag -->
<image src="pic.jpg" />
<line_break></line_break> <!-- Also an empty element, but with separate tags -->
Attributes provide additional information about an element that is not considered part of its content.
<student id="s123" status="active">
<name>Alice</name>
<major type="undergraduate">Computer Science</major>
</student>
In the example above, `id` and `status` are attributes of the `<student>` element. `type` is an attribute of the `<major>` element.
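To show how attributes look from the consuming side, here is a small sketch that reads the `<student>` example with Python's standard `xml.etree.ElementTree`. The `email` attribute queried below does not exist in the document and is only there to show the fallback behaviour.

```python
import xml.etree.ElementTree as ET

student_xml = """
<student id="s123" status="active">
  <name>Alice</name>
  <major type="undergraduate">Computer Science</major>
</student>
"""

student = ET.fromstring(student_xml)

# .attrib is a plain dict mapping attribute names to string values.
print(student.attrib)                 # {'id': 's123', 'status': 'active'}

# .get() returns None (or a supplied default) when the attribute is absent.
print(student.get("status"))          # active
print(student.get("email", "n/a"))    # n/a

# Attributes belong to the element they are written on, not to its parent.
major = student.find("major")
print(major.get("type"), "-", major.text)   # undergraduate - Computer Science
```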
When to use elements and when to use attributes is often a design decision, but general guidelines exist:
# Data as Elements (preferred for structured data)
<product>
<id>P123</id>
<name>Laptop</name>
<price>1200</price>
<currency>USD</currency>
</product>
# Data as Attributes (less flexible)
<product id="P123" name="Laptop" price="1200" currency="USD"/>
Comments add notes or explanations to an XML document. Parsers do not treat them as part of the document's data, and a comment must not contain the sequence `--`.
<!-- This is a single-line comment -->
<price>10.99</price> <!-- Price in USD -->
<!--
This is a
multi-line
comment.
-->
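As a quick check of how a parser treats comments, the sketch below parses a fragment with Python's standard `xml.etree.ElementTree`, whose default tree builder drops comments. The wrapping `<item>` element is an assumption added so the fragment has a single root.

```python
import xml.etree.ElementTree as ET

doc = """<item>
  <!-- This is a single-line comment -->
  <price>10.99</price> <!-- Price in USD -->
</item>"""

root = ET.fromstring(doc)

# Only the <price> element survives; the two comments are not in the tree.
children = [child.tag for child in root]
print(children)                    # ['price']
print(root.find("price").text)     # 10.99
```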
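The XML declaration is optional, but when present it must be the very first thing in the document, with nothing (not even whitespace) before it. It specifies the XML version and may also specify the character encoding and whether the document is standalone.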
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
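When XML is generated by code, the serializer usually writes the declaration for you. A minimal sketch with Python's standard `xml.etree.ElementTree` follows, assuming Python 3.8 or later for the `xml_declaration` argument of `tostring`; the tiny `<catalog>` tree is only a placeholder.

```python
import xml.etree.ElementTree as ET

# Build a small placeholder tree to serialize.
root = ET.Element("catalog")
ET.SubElement(root, "book", id="bk101")

# xml_declaration=True asks the serializer to prepend the declaration.
xml_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("UTF-8"))
```

The output starts with the declaration, followed by the serialized `<catalog>` element. ElementTree writes the declaration with single quotes, which is equally legal XML.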
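Well-formedness and validity are two distinct concepts. A well-formed document obeys XML's syntax rules; a valid document is well-formed and also conforms to the rules of a DTD or schema.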
# Well-Formed Example:
<data><item>Value</item></data>
# Not Well-Formed Example (missing closing tag for <item>):
<data><item>Value</data>
# Assume a schema defines that <age> must be a number:
<person><name>Bob</name><age>30</age></person> <!-- Valid -->
<person><name>Alice</name><age>thirty</age></person> <!-- Well-formed, but NOT valid against schema -->
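The difference can be sketched in code. Python's standard library parses XML but does not include a schema validator, so the function below is only a hand-written stand-in for the "`<age>` must be a number" rule, not real schema validation; the name `age_is_numeric` is hypothetical.

```python
import xml.etree.ElementTree as ET


def age_is_numeric(person_xml: str) -> bool:
    """Hypothetical stand-in for a schema rule: <age> must hold an integer."""
    person = ET.fromstring(person_xml)      # raises ParseError if not well-formed
    age_text = person.findtext("age", default="")
    return age_text.strip().isdigit()


bob = "<person><name>Bob</name><age>30</age></person>"
alice = "<person><name>Alice</name><age>thirty</age></person>"

# Both documents parse without error, so both are well-formed,
# but only the first satisfies the "age is a number" rule.
print(age_is_numeric(bob))     # True
print(age_is_numeric(alice))   # False
```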
XML namespaces are used to avoid element name conflicts when combining XML documents from different applications or industries.
<root xmlns:prefix="URI">
<prefix:element>...</prefix:element>
</root>
<root>
<!-- Product from a "furniture" vocabulary -->
<product xmlns:f="http://www.example.com/furniture">
<f:name>Chair</f:name>
<f:material>Wood</f:material>
</product>
<!-- Product from an "electronics" vocabulary -->
<product xmlns:e="http://www.example.com/electronics">
<e:name>Laptop</e:name>
<e:model>XPS 15</e:model>
</product>
<!-- Default namespace (applies to current element and its children if no prefix) -->
<order xmlns="http://www.example.com/orders">
<id>1001</id>
</order>
</root>
In this example, `<f:name>` and `<e:name>` are distinct elements despite having the same local name, because they belong to different namespaces.
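The same distinction shows up when querying the document. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the example; the prefixes `furn`, `elec`, and `ord` are arbitrary names chosen for this illustration, since only the namespace URIs have to match.

```python
import xml.etree.ElementTree as ET

doc = """
<root>
  <product xmlns:f="http://www.example.com/furniture">
    <f:name>Chair</f:name>
  </product>
  <product xmlns:e="http://www.example.com/electronics">
    <e:name>Laptop</e:name>
  </product>
  <order xmlns="http://www.example.com/orders">
    <id>1001</id>
  </order>
</root>
"""

root = ET.fromstring(doc)

# Query prefixes are local to this mapping; they need not match the document's.
ns = {
    "furn": "http://www.example.com/furniture",
    "elec": "http://www.example.com/electronics",
    "ord": "http://www.example.com/orders",
}

chair = root.find("product/furn:name", ns)
laptop = root.find("product/elec:name", ns)
order_id = root.find("ord:order/ord:id", ns)

# Internally, a namespaced tag is stored as {URI}localname.
print(chair.tag)                                 # {http://www.example.com/furniture}name
print(chair.text, laptop.text, order_id.text)    # Chair Laptop 1001

# A query without a namespace does not match the namespaced <name> elements.
print(root.find("product/name"))                 # None
```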
CDATA sections are used to escape blocks of text that might contain characters that would otherwise be interpreted as XML markup (e.g., `<`, `&`).
<script_code>
<![CDATA[
function showMessage() {
if (a < b && c > d) { // < and & would normally cause errors
alert("Hello!");
}
}
]]>
</script_code>
Without CDATA, the `<` and `&` characters in the JavaScript code would be interpreted as the start of new XML tags or entities, leading to a well-formedness error.
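A short sketch with Python's standard `xml.etree.ElementTree` shows both halves of that statement, using a one-line version of the script above.

```python
import xml.etree.ElementTree as ET

doc = """<script_code><![CDATA[if (a < b && c > d) { alert("Hello!"); }]]></script_code>"""

root = ET.fromstring(doc)

# The CDATA wrapper is gone after parsing; only the raw text remains.
print(root.text)    # if (a < b && c > d) { alert("Hello!"); }

# The same content without CDATA is rejected by the parser.
try:
    ET.fromstring("<script_code>if (a < b && c > d) {}</script_code>")
except ET.ParseError as err:
    print("Not well-formed:", err)
```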
XML reserves a handful of characters for markup. To make them appear literally in content, replace them with the five predefined entity references: `&lt;` (`<`), `&gt;` (`>`), `&amp;` (`&`), `&quot;` (`"`), and `&apos;` (`'`).
<message>
The price is &lt; 100 &amp; has a &quot;special&quot; offer.
</message>
<attribute value="It's important" /> <!-- ' is allowed if quoted with " -->
<attribute value='It&apos;s important' /> <!-- Alternatively, use entity -->
Result when parsed: "The price is < 100 & has a "special" offer."
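In practice, escaping is rarely done by hand. As a minimal sketch, Python's standard `xml.sax.saxutils` module provides `escape` and `quoteattr` for this, and parsing the result turns the references back into literal characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = 'The price is < 100 & has a "special" offer.'

# escape() replaces &, < and > by default; extra mappings are optional.
escaped = escape(raw, {'"': "&quot;"})
print(escaped)
# The price is &lt; 100 &amp; has a &quot;special&quot; offer.

# quoteattr() picks a quoting style and escapes as needed for attribute values.
print(quoteattr("It's important"))   # "It's important"

# Parsing turns the entity references back into the literal characters.
message = ET.fromstring(f"<message>{escaped}</message>")
print(message.text == raw)           # True
```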
XML schemas define the legal building blocks of an XML document, ensuring that XML documents adhere to a specific structure and content model. This allows for validation of XML data.
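A DTD (Document Type Definition) is the original way to define the structure of an XML document. It uses its own, non-XML syntax to declare elements, attributes, and their relationships, and can be embedded in the document or kept in an external file.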
<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
# note.dtd content:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
XSD (XML Schema Definition) is the successor to DTD. Unlike a DTD, an XSD is itself written in XML, and it adds data types and namespace support, which makes it more expressive and extensible.
# Example of an XML document linked to an XSD:
<!-- book.xml -->
<catalog
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="book.xsd">
<book id="bk101">
<author>John Doe</author>
<title>My XML Book</title>
</book>
</catalog>
# book.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="catalog">
<xs:complexType>
<xs:sequence>
<xs:element name="book" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="author" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
An XML document that conforms to its DTD or XSD is considered **valid**.
To use data from an XML document, it needs to be parsed (read and interpreted) by an XML parser. Parsers convert the XML into a tree-like structure in memory, allowing programs to access elements and their data.
# Assuming you have a 'books.xml' file with the catalog example from Section 1
import xml.etree.ElementTree as ET
tree = ET.parse('books.xml') # Parse the XML file
root = tree.getroot() # Get the root element (<catalog>)
print(f"Root element: {root.tag}")
# Iterate over all <book> elements
for book in root.findall('book'):
    book_id = book.get('id') # Get attribute 'id'
    title = book.find('title').text # Get text content of <title> child element
    author = book.find('author').text # Get text content of <author> child element
    genre = book.find('genre').text
    print(f"\nBook ID: {book_id}")
    print(f" Title: {title}")
    print(f" Author: {author}")
    print(f" Genre: {genre}")
Expected Output (console):
Root element: catalog

Book ID: bk101
 Title: XML Developer's Guide
 Author: Gambardella, Matthew
 Genre: Computer

Book ID: bk102
 Title: Midnight Rain
 Author: Ralls, Kim
 Genre: Fantasy
XSLT (eXtensible Stylesheet Language Transformations) is a language for transforming XML documents into other XML documents, HTML, or plain text.
Assuming `books.xml` from Section 1.
# transform_books.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/catalog">
<html>
<head>
<title>My Book Catalog</title>
<style>
table { width: 100%; border-collapse: collapse; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
</style>
</head>
<body>
<h1>Our Books</h1>
<table>
<tr>
<th>Title</th>
<th>Author</th>
<th>Genre</th>
<th>Price</th>
</tr>
<xsl:for-each select="book">
<tr>
<td><xsl:value-of select="title"/></td>
<td><xsl:value-of select="author"/></td>
<td><xsl:value-of select="genre"/></td>
<td><xsl:value-of select="price"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
To perform the transformation, you need an XSLT processor (e.g., `xsltproc` on Linux, or libraries in programming languages).
# Using xsltproc (Linux/macOS)
xsltproc transform_books.xsl books.xml > output.html
# Open output.html in a browser to see the HTML table.
XPath is a language for selecting nodes in an XML document. XSLT uses it to navigate XML, and programming languages use it directly to select parts of a document.
# XPath Examples (on the 'catalog' XML from Section 1):
/catalog/book # Selects all <book> elements that are children of <catalog>.
/catalog/book/title # Selects all <title> elements that are children of <book> elements, which are children of <catalog>.
/catalog/book[1] # Selects the first <book> element.
/catalog/book[last()] # Selects the last <book> element.
/catalog/book[@id] # Selects all <book> elements that have an 'id' attribute.
/catalog/book[@id='bk101'] # Selects the <book> element with id='bk101'.
//title # Selects all <title> elements anywhere in the document.
//book[price > 10] # Selects all <book> elements where the child <price> is greater than 10.
/catalog/book/description/text() # Selects the text content of <description> element.
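Several of these expressions can be tried from Python, with one caveat: the standard `xml.etree.ElementTree` module implements only a subset of XPath, and paths are evaluated relative to the element they are called on. The sketch below uses a trimmed copy of the catalog and applies the price comparison in Python, because ElementTree does not evaluate predicates such as `book[price > 10]`.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
  <book id="bk102"><title>Midnight Rain</title><price>5.95</price></book>
</catalog>
""")

# Paths are relative to the element they are called on (here, <catalog>).
print([b.get("id") for b in catalog.findall("book")])    # ['bk101', 'bk102']
print(catalog.find("book[1]/title").text)                # XML Developer's Guide
print(catalog.find("book[last()]/title").text)           # Midnight Rain
print(catalog.find("book[@id='bk101']/title").text)      # XML Developer's Guide
print(len(catalog.findall(".//title")))                  # 2

# Numeric comparison is done in Python rather than in the path expression.
pricey = [b.get("id") for b in catalog.findall("book")
          if float(b.findtext("price")) > 10]
print(pricey)                                            # ['bk101']
```

A library with full XPath support would accept the expressions exactly as listed above.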
XQuery is a W3C language designed to query XML data. It builds on XPath and adds constructs for iterating, filtering, sorting, and building new XML, which allows more complex data manipulation and transformation.
# XQuery Example (on the 'catalog' XML from Section 1):
for $book in doc("books.xml")/catalog/book
where $book/price > 10
order by $book/title
return <expensive_book>
  {$book/title}
  {$book/author}
  {$book/price}
</expensive_book>
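This query returns a sequence of `<expensive_book>` elements, one for each book with a price greater than 10, ordered by title. Each contains copies of the book's `<title>`, `<author>`, and `<price>`. Note that XQuery keywords are case-sensitive and written in lowercase.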
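For readers without an XQuery processor at hand, the same filter, sort, and construct steps can be approximated in plain Python. This is a sketch with the standard `xml.etree.ElementTree` module, not XQuery itself, and it inlines a trimmed copy of the catalog instead of reading `books.xml`.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <price>5.95</price>
  </book>
</catalog>
""")

# Filter step: keep books whose price is greater than 10.
matches = [b for b in catalog.findall("book") if float(b.findtext("price")) > 10]

# Sort step: order the remaining books by title.
matches.sort(key=lambda b: b.findtext("title"))

# Construct step: build one <expensive_book> per match.
results = []
for book in matches:
    out = ET.Element("expensive_book")
    for tag in ("title", "author", "price"):
        ET.SubElement(out, tag).text = book.findtext(tag)
    results.append(out)

for element in results:
    print(ET.tostring(element, encoding="unicode"))
```

With the two-book catalog, only `bk101` passes the price filter, so a single `<expensive_book>` element is printed.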
While JSON has become more prevalent for web APIs due to its simplicity, XML still plays a significant role in many enterprise, legacy, and domain-specific applications.
XML is a powerful and versatile language for structuring, storing, and transporting data. While its use in new web APIs has diminished in favor of JSON, its role in enterprise systems, document formats, and web services remains significant. Mastering XML syntax, understanding its validation mechanisms (schemas), and learning how to parse and transform it will provide you with a fundamental skill set for many data-centric applications.