Web Data: XML
XML: XML stands for eXtensible Markup Language, developed by the W3C starting in 1996. XML 1.0
was officially adopted as a W3C Recommendation in 1998. XML was designed to carry data, not to
display data, and to be self-descriptive. XML is a subset of SGML in which you can define your
own tags. It is a meta language: the tags describe the content. XML works with CSS, XSL, and the DOM.
A "well-formed" XML document must follow these syntax rules:
- XML documents must have a root element
- XML elements must have a closing tag
- XML tags are case sensitive
- XML elements must be properly nested
- XML attribute values must be quoted
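The rules above can be checked mechanically, because any XML parser rejects a document that is not well formed. The following is a minimal sketch, assuming the standard JAXP DOM parser shipped with the JDK; the class name WellFormedCheck and the sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        // One hypothetical sample per rule in the list above.
        String[] samples = {
            "<note><to>A</to></note>", // well formed
            "<a/><b/>",                // more than one root element
            "<a><b></a>",              // missing closing tag
            "<a></A>",                 // tag names differ in case
            "<a><b></a></b>",          // improperly nested
            "<a x=1></a>"              // attribute value not quoted
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isWellFormed(sample));
        }
    }
}
```

Only the first sample should be accepted; each of the others breaks exactly one rule.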
XML Element
An XML element is everything from (including) the element's start tag to (including) the
element's end tag.
An element can contain:
- other elements
- text
- attributes
- Or a mix of all of the above...
XML vocabulary:
An XML vocabulary defines
- element and attribute names
- element content
- semantics of elements and attributes
Some XML vocabularies are XHTML, RSS, XSL, and XML Schema; a DTD is one way of defining a vocabulary.
XML DTD:
The purpose of a Document Type Definition (DTD) is to define the structure of an XML document. It
does so by listing the elements that may appear in the document and what each of them may contain.
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
Here PCDATA means parsed character data. In the XML document described above, the elements to, from,
heading, and body each carry text, so they are declared as #PCDATA in the DTD.
When the definitions are kept in a separate file, that file is stored with a .dtd extension.
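To see the DTD at work, a document can be parsed with validation switched on. The sketch below assumes the JDK's JAXP parser and embeds the note DTD from above as an internal subset; the class name DtdValidationSketch and the element contents A, B, C, D are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdValidationSketch {

    // The note DTD from the text, written as an internal subset.
    public static final String DTD =
          "<!DOCTYPE note [\n"
        + "<!ELEMENT note (to,from,heading,body)>\n"
        + "<!ELEMENT to (#PCDATA)>\n"
        + "<!ELEMENT from (#PCDATA)>\n"
        + "<!ELEMENT heading (#PCDATA)>\n"
        + "<!ELEMENT body (#PCDATA)>\n"
        + "]>\n";

    // Returns true if the document (DOCTYPE included) is valid against its DTD.
    public static boolean isValid(String xmlWithDoctype) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // ask for DTD validation, which is off by default
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity problems arrive as recoverable errors, so turn them into exceptions.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xmlWithDoctype)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed, or not valid against the DTD
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        String complete = "<note><to>A</to><from>B</from><heading>C</heading><body>D</body></note>";
        String missingBody = "<note><to>A</to><from>B</from><heading>C</heading></note>";
        System.out.println("complete note valid:    " + isValid(DTD + complete));
        System.out.println("note without body valid: " + isValid(DTD + missingBody));
    }
}
```

The second document is still well formed; it fails only because the DTD requires a body element after heading.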
XML Schema:
A schema has several advantages over a DTD. A DTD knows only two kinds of character data, namely
CDATA and PCDATA. CDATA is not parsed by the parser, whereas PCDATA is
parsed. In a schema you can use primitive data types and define custom data types, much as in a
programming language.
XML Parsers:
An XML parser converts an XML document into an XML DOM object, which can then be
manipulated from a program, for example with JavaScript.
Two types of XML parsers:
– Validating Parser
• It requires a document type declaration
• It generates an error if the document does not conform to the DTD or does not meet the XML validity constraints
– Non-validating Parser
• It checks the XML document for well-formedness
• It can ignore an external DTD
XML Namespaces
An XML namespace is a collection of element and attribute names associated with an XML vocabulary.
XML namespaces provide a method to avoid element name conflicts.
XML document 1
<table>
<tr>
<td>Apples</td>
<td>Bananas</td>
</tr>
</table>
XML document 2
<table>
<name> Coffee Table</name>
<width>80</width>
<length>120</length>
</table>
If these XML fragments were added together, there would be a name conflict. Both contain a
<table> element, but the elements have different content and meaning. Such name conflicts can be
avoided by using a name prefix, for example:
<f:table>
<f:name> Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
When using prefixes in XML, a so-called namespace for the prefix must be defined. The
namespace is defined by the xmlns attribute in the start tag of an element. The namespace declaration
has the following syntax.
xmlns:prefix="URI"
For example,
<h:table xmlns:h="http://www.w3.org/table">
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table xmlns:f="http://www.w3.org/furniture">
<f:name> Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
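A namespace-aware parser keeps the two kinds of table apart by their namespace URI rather than by their prefix. This is a small sketch, assuming the JDK's DOM parser; the wrapping root element and the class name NamespaceSketch are additions made only so that the two fragments form one document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceSketch {

    // The two fragments from the text, wrapped in a hypothetical <root> element.
    public static final String XML =
          "<root>"
        + "<h:table xmlns:h='http://www.w3.org/table'>"
        + "<h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>"
        + "</h:table>"
        + "<f:table xmlns:f='http://www.w3.org/furniture'>"
        + "<f:name>Coffee Table</f:name><f:width>80</f:width><f:length>120</f:length>"
        + "</f:table>"
        + "</root>";

    // Counts the elements with the given local name in the given namespace ("*" matches any).
    public static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("HTML tables:      " + count(XML, "http://www.w3.org/table", "table"));
        System.out.println("furniture tables: " + count(XML, "http://www.w3.org/furniture", "table"));
        System.out.println("any table:        " + count(XML, "*", "table"));
    }
}
```

The prefixes h and f are only shorthand; the same lookups would work if the document used different prefixes bound to the same URIs.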
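Default namespace
A default namespace is declared with xmlns="URI" (no prefix). It applies to the element on which it is
declared and to all of its unprefixed child elements.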
XML DOM
The Document Object Model (DOM) defines a standard for accessing and manipulating
XML documents. The XML DOM is used for
- loading the XML document
- accessing the XML document
- deleting elements of the XML document
- changing elements of the XML document
According to the DOM, everything in an XML document is a node:
- the entire document is a document node
- every XML element is an element node
- the text in the XML elements is held in text nodes
- every attribute is an attribute node
- comments are comment nodes
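The node kinds listed above can be observed by walking a parsed document. The sketch below assumes the JDK's DOM parser; the class name NodeTypeCount and the sample document are invented for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeTypeCount {

    // Parses the XML text and counts the nodes of each kind in the resulting tree.
    public static Map<String, Integer> countNodeTypes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, Integer> counts = new TreeMap<>();
            walk(doc, counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    private static void walk(Node node, Map<String, Integer> counts) {
        counts.merge(typeName(node), 1, Integer::sum);
        // Attributes are not children of their element, so they are counted separately.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes != null && attributes.getLength() > 0) {
            counts.merge("attribute", attributes.getLength(), Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    private static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE:  return "element";
            case Node.TEXT_NODE:     return "text";
            case Node.COMMENT_NODE:  return "comment";
            default:                 return "other";
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: one attribute, one comment, two elements, one text node.
        String xml = "<note id='1'><!-- reminder --><to>A</to></note>";
        System.out.println(countNodeTypes(xml));
    }
}
```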
DOM Levels
Level 1 Core: W3C Recommendation, October 1998
- It has features for primitive navigation and manipulation of XML trees
- Other Level 1 features: all HTML features
Level 2 Core: W3C Recommendation, November 2000
- It adds namespace support and minor new features
- Other Level 2 features: Events, Views, Style, Traversal and Range
Level 3 Core: W3C Working Draft, April 2002 (it became a W3C Recommendation in April 2004)
- It supports: Schemas, XPath, XSL, XSLT
The following Java packages are needed to process an XML document with the DOM:
javax.xml.parsers and org.w3c.dom
The following DOM Java classes are needed to process the XML document:
The DocumentBuilderFactory class creates an instance of DocumentBuilder. DocumentBuilder produces
a Document (a DOM) that conforms to the DOM specification.
The methods needed to process the XML document are illustrated by the following example.
Step 1: Create the XML file to be processed.
<Employee-Detail>
<Employee>
<Emp_Id> E-001 </Emp_Id>
<Emp_Name> Vinod </Emp_Name>
<Emp_E-mail> Vinod@yahoo.com </Emp_E-mail>
</Employee>
<Employee>
<Emp_Id> E-002 </Emp_Id>
<Emp_Name> Arun </Emp_Name>
<Emp_E-mail> Arun@yahoo.com </Emp_E-mail>
</Employee>
</Employee-Detail>
Step 2: Create a Java DOM program that counts the number of elements in the XML file.
import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.*;
Output:
Number of nodes: 2
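The body of the counting program is not shown above. The following is one possible sketch of it, not the original program: it assumes that the elements being counted are the Employee elements (which agrees with the output of 2), and it reads the XML from a string instead of a file so that it is self-contained. The class name DomElementCount is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomElementCount {

    // The employee document from Step 1, held in a string for this sketch.
    public static final String XML =
          "<Employee-Detail>"
        + "<Employee><Emp_Id> E-001 </Emp_Id><Emp_Name> Vinod </Emp_Name>"
        + "<Emp_E-mail> Vinod@yahoo.com </Emp_E-mail></Employee>"
        + "<Employee><Emp_Id> E-002 </Emp_Id><Emp_Name> Arun </Emp_Name>"
        + "<Emp_E-mail> Arun@yahoo.com </Emp_E-mail></Employee>"
        + "</Employee-Detail>";

    // Builds a DOM from the XML text and counts the elements with the given tag name.
    public static int countElements(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Number of nodes: " + countElements(XML, "Employee"));
    }
}
```

To read from a file as in the original steps, builder.parse(new java.io.File(...)) could be used in place of the InputSource.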
SAX (Simple API for XML) is an event-based parser for XML documents. Unlike a DOM parser, a
SAX parser creates no parse tree. It scans the XML file from start to end, and each event invokes a
corresponding callback method that the programmer writes. SAX is a streaming interface for XML, which
means that applications using SAX receive event notifications about the XML document being processed
one element and attribute at a time, in sequential order, starting at the top of the document and
ending with the closing of the root element.
A SAX parser:
- Reads an XML document from top to bottom, recognizing the tokens that make up a well-formed XML document.
- Processes the tokens in the same order in which they appear in the document.
- Reports to the application program the nature of the tokens it has encountered, as they occur.
The application program provides an "event" handler that must be registered with the parser.
As the tokens are identified, callback methods in the handler are invoked with the relevant information.
ContentHandler Interface
This interface specifies the callback methods that the SAX parser uses to notify an application program
of the components of the XML document that it has seen.
void startDocument() − Called at the beginning of a document.
void endDocument() − Called at the end of a document.
void startElement(String uri, String localName, String qName, Attributes atts) − Called at the beginning
of an element.
void endElement(String uri, String localName,String qName) − Called at the end of an element.
void characters(char[] ch, int start, int length) − Called when character data is encountered.
void ignorableWhitespace( char[] ch, int start, int length) − Called when a DTD is present and ignorable
whitespace is encountered.
void processingInstruction(String target, String data) − Called when a processing instruction is
recognized.
void setDocumentLocator(Locator locator) − Provides a Locator that can be used to identify positions
in the document.
void skippedEntity(String name) − Called when an unresolved entity is encountered.
void startPrefixMapping(String prefix, String uri) − Called when a new namespace mapping is defined.
void endPrefixMapping(String prefix) − Called when a namespace definition ends its scope.
Attributes Interface
This interface specifies methods for processing the attributes connected to an element.
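int getLength() − Returns the number of attributes.
String getQName(int index) − Returns the qualified name of the attribute at the given index.
String getValue(int index) − Returns the value of the attribute at the given index.
String getValue(String qname) − Returns the value of the attribute with the given qualified name.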
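The Attributes methods are normally used inside startElement(). The sketch below assumes the JDK's SAX parser and collects the attributes of every element with a given name; the class name AttributeDump, the student element and its section attribute are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeDump {

    // Returns name/value pairs for the attributes of all elements called elementName.
    public static Map<String, String> attributesOf(String xml, String elementName) {
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!qName.equals(elementName)) {
                    return;
                }
                // Index-based access: getLength(), getQName(i), getValue(i).
                for (int i = 0; i < atts.getLength(); i++) {
                    found.put(atts.getQName(i), atts.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<class><student rollno='393' section='A'/></class>";
        System.out.println(attributesOf(xml, "student"));
    }
}
```

When only one attribute is of interest, atts.getValue("rollno") inside the callback would look it up by name instead of by index.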
SAX packages
javax.xml.parsers: the main classes needed for parsing
org.xml.sax: the interfaces used for parsing
SAX classes
SAXParser: Defines the API that wraps an XMLReader implementation class.
SAXParserFactory: Defines a factory API that enables applications to configure and obtain a SAX-based parser to parse XML documents.
ContentHandler: Receives notification of the logical content of a document.
DTDHandler: Receives notification of basic DTD-related events.
EntityResolver: Basic interface for resolving entities.
ErrorHandler: Basic interface for SAX error handlers.
DefaultHandler: Default base class for SAX event handlers.
SAX parser methods
startDocument() and endDocument() – methods called at the start and end of an XML
document.
startElement() and endElement() – methods called at the start and end of a document element.
characters() – method called with the text contents in between the start and end tags of an XML
document element.
Understanding SAX Parser
First, create an instance of the SAXParserFactory class, which generates an instance
of the parser. This parser wraps an XMLReader object. When the parser's parse() method is invoked, the
reader invokes one of several callback methods implemented in the application. These callback
methods are defined by the interfaces ContentHandler, ErrorHandler, DTDHandler, and
EntityResolver.
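The sequence described above (factory, parser, parse(), callbacks) can be sketched as follows. This is an illustration under assumptions, not a reference implementation: it uses the JDK's SAX parser, extends DefaultHandler so that only the callbacks of interest are overridden, and records the events in a list. The class names SaxEventTrace and TraceHandler are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventTrace {

    // Handler that records document and element events and accumulates character data.
    public static class TraceHandler extends DefaultHandler {
        public final List<String> events = new ArrayList<>();
        public final StringBuilder text = new StringBuilder();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is appended rather than assigned.
            text.append(ch, start, length);
        }
    }

    public static TraceHandler parse(String xml) {
        TraceHandler handler = new TraceHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance(); // step 1: the factory
            SAXParser parser = factory.newSAXParser();                  // step 2: the parser
            parser.parse(new InputSource(new StringReader(xml)), handler); // step 3: callbacks fire
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TraceHandler result = parse("<note><to>A</to></note>");
        result.events.forEach(System.out::println);
        System.out.println("text: " + result.text);
    }
}
```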
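StAXParserDemo.java
The following program uses StAX (the javax.xml.stream API), in which the application pulls events
from the parser instead of receiving callbacks as in SAX.
package com.tutorialspoint.xml;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StAXParserDemo {
   public static void main(String[] args) {
      // Flags recording which element's text the next CHARACTERS event belongs to
      boolean bFirstName = false;
      boolean bLastName = false;
      boolean bNickName = false;
      boolean bMarks = false;
      boolean isRequestRollNo = false;
      // Roll number to look for; 393 is the value shown in the sample output
      String requestedRollNo = "393";

      try {
         XMLInputFactory factory = XMLInputFactory.newInstance();
         XMLEventReader eventReader =
            factory.createXMLEventReader(new FileReader("input.txt"));
         while (eventReader.hasNext()) {
            XMLEvent event = eventReader.nextEvent();
            switch (event.getEventType()) {
               case XMLStreamConstants.START_ELEMENT:
                  StartElement startElement = event.asStartElement();
                  String qName = startElement.getName().getLocalPart();
                  if (qName.equalsIgnoreCase("student")) {
                     // The first attribute of <student> is taken to be the roll number
                     Iterator<Attribute> attributes = startElement.getAttributes();
                     String rollNo = attributes.next().getValue();
                     if (rollNo.equalsIgnoreCase(requestedRollNo)) {
                        System.out.println("Start Element : student");
                        System.out.println("Roll No : " + rollNo);
                        isRequestRollNo = true;
                     }
                  } else if (qName.equalsIgnoreCase("firstname")) {
                     bFirstName = true;
                  } else if (qName.equalsIgnoreCase("lastname")) {
                     bLastName = true;
                  } else if (qName.equalsIgnoreCase("nickname")) {
                     bNickName = true;
                  } else if (qName.equalsIgnoreCase("marks")) {
                     bMarks = true;
                  }
                  break;
               case XMLStreamConstants.CHARACTERS:
                  // Print the text only for the requested student, then clear the flag
                  Characters characters = event.asCharacters();
                  if (bFirstName) {
                     if (isRequestRollNo) {
                        System.out.println("First Name: " + characters.getData());
                     }
                     bFirstName = false;
                  }
                  if (bLastName) {
                     if (isRequestRollNo) {
                        System.out.println("Last Name: " + characters.getData());
                     }
                     bLastName = false;
                  }
                  if (bNickName) {
                     if (isRequestRollNo) {
                        System.out.println("Nick Name: " + characters.getData());
                     }
                     bNickName = false;
                  }
                  if (bMarks) {
                     if (isRequestRollNo) {
                        System.out.println("Marks: " + characters.getData());
                     }
                     bMarks = false;
                  }
                  break;
               case XMLStreamConstants.END_ELEMENT:
                  EndElement endElement = event.asEndElement();
                  if (endElement.getName().getLocalPart().equalsIgnoreCase("student")
                        && isRequestRollNo) {
                     System.out.println("End Element : student");
                     System.out.println();
                     isRequestRollNo = false;
                  }
                  break;
            }
         }
      } catch (FileNotFoundException e) {
         e.printStackTrace();
      } catch (XMLStreamException e) {
         e.printStackTrace();
      }
   }
}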
This would produce the following result −
Start Element : student
Roll No : 393
First Name: dinkar
Last Name: kad
Nick Name: dinkar
Marks: 85
End Element : student
XSL
XSL stands for eXtensible Stylesheet Language, which is an XML vocabulary. It contains two
types of information:
- Template data: text that is copied to the output XML document, with or without change.
- XSL markup: instructions that control the transformation process.
It uses two namespaces:
- http://www.w3.org/1999/XSL/Transform is the namespace name for the xsl namespace
- http://www.w3.org/1999/xhtml is the XHTML namespace name
XSL consists of three parts:
1. XSLT
2. XPATH
3. XSL-FO
XSLT is used for transformation, that is, the process of extracting information from one XML document
and using that information to create another XML document.
XPATH is used to find information in an XML document. It navigates through elements and
attributes in XML documents.
XSL-FO (XSL Formatting Objects) is an XML vocabulary for defining the style properties of an XML document.
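A transformation of this kind can be run from Java with the javax.xml.transform API. The sketch below is an illustration under assumptions: the stylesheet, the output element names names and name, and the class name XsltSketch are invented, and the input is a trimmed version of the employee document used earlier.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Input document: employee names only, taken from the earlier example.
    public static final String XML =
          "<Employee-Detail>"
        + "<Employee><Emp_Name> Vinod </Emp_Name></Employee>"
        + "<Employee><Emp_Name> Arun </Emp_Name></Employee>"
        + "</Employee-Detail>";

    // Hypothetical stylesheet: copies each employee name into a new <names> document.
    public static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/Employee-Detail'>"
        + "<names><xsl:for-each select='Employee'>"
        + "<name><xsl:value-of select='normalize-space(Emp_Name)'/></name>"
        + "</xsl:for-each></names>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML text and returns the result as a string.
    public static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSL));
    }
}
```

Here names and name are template data copied to the output, while xsl:template, xsl:for-each and xsl:value-of are the XSL markup that controls the process.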
<Employee>
<Emp_Id> E-002 </Emp_Id>
<Emp_Name> Amit</Emp_Name>
<Emp_E-mail> Amit2@yahoo.com </Emp_E-mail>
</Employee>
</Employee-Detail>
Axis name specifies the direction in which to search for a node.
Node test specifies the element name selected for transformation.
Predicate:
XPath uses a predicate to refine the node test. A predicate can be either
child::<element name>
(Or)
attribute::<attribute name>
Absolute and Relative Location Path
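As an illustration of axes, predicates, and location paths, the sketch below evaluates a few XPath expressions with the javax.xml.xpath API. An absolute location path starts with / and is evaluated from the document root; a relative location path is evaluated from a context node. The id attribute on Employee and the class name XPathSketch are assumptions added for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical sample with an id attribute so that the attribute axis can be shown.
    public static final String XML =
          "<Employee-Detail>"
        + "<Employee id='E-001'><Emp_Name>Vinod</Emp_Name></Employee>"
        + "<Employee id='E-002'><Emp_Name>Amit</Emp_Name></Employee>"
        + "</Employee-Detail>";

    // Evaluates an absolute location path against the document and returns its string value.
    public static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad expression or document", e);
        }
    }

    // Selects a context node with contextPath, then evaluates relativePath from that node.
    public static String evaluateRelative(String xml, String contextPath, String relativePath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node context = (Node) xpath.evaluate(
                    contextPath, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return xpath.evaluate(relativePath, context);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad expression or document", e);
        }
    }

    public static void main(String[] args) {
        // Absolute path written with explicit axis names and a predicate on the attribute axis.
        System.out.println(evaluate(XML,
                "/child::Employee-Detail/child::Employee[attribute::id='E-002']/child::Emp_Name"));
        // The same path in abbreviated syntax (@ is short for attribute::).
        System.out.println(evaluate(XML, "/Employee-Detail/Employee[@id='E-002']/Emp_Name"));
        // Relative path, evaluated from the first Employee element.
        System.out.println(evaluateRelative(XML, "/Employee-Detail/Employee[1]", "child::Emp_Name"));
    }
}
```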