Essential Go XML

Pull (streaming) XML parsing

Parsing into a struct is convenient, but the whole decoded document has to be held in memory.

Some XML files are too large for that. For example, XML dumps of Wikipedia content are several gigabytes in size.

Pull parsing reads the document one token at a time, so it needs far less memory, but the API is harder to use.

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

var xmlStr = `
<people>
	<person age="34">
		<first-name>John</first-name>
		<address>
			<city>San Francisco</city>
			<state>CA</state>
		</address>
	</person>
	<!-- sample comment -->
	<person age="23">
		<first-name>Julia</first-name>
	</person>
</people>`

func main() {
	decoder := xml.NewDecoder(strings.NewReader(xmlStr))
	// inCityElement is true while we are between <city> and </city>
	inCityElement := false
	for {
		t, err := decoder.Token()
		if err == io.EOF {
			// io.EOF is a successful end
			break
		}
		if err != nil {
			fmt.Printf("decoder.Token() failed with '%s'\n", err)
			break
		}

		switch v := t.(type) {

		case xml.StartElement:
			if v.Name.Local == "person" {
				for _, attr := range v.Attr {
					if attr.Name.Local == "age" {
						fmt.Printf("Element: '<person>', attribute 'age' has value '%s'\n", attr.Value)
					}
				}
			} else if v.Name.Local == "city" {
				inCityElement = true
			}

		case xml.EndElement:
			if v.Name.Local == "city" {
				inCityElement = false
			}

		case xml.CharData:
			if inCityElement {
				fmt.Printf("City: %s\n", string(v))
			}

		case xml.Comment:
			fmt.Printf("Comment: %s\n", string(v))

		case xml.ProcInst:
			// handle XML processing instruction like <?target inst?>

		case xml.Directive:
			// handle XML directive like <!text>
		}
	}
}

Output:

Element: '<person>', attribute 'age' has value '34'
City: San Francisco
Comment:  sample comment 
Element: '<person>', attribute 'age' has value '23'

A pull parser repeatedly requests the next token from a stream of XML tokens by calling decoder.Token().

For a start tag like <person> we get an xml.StartElement token.

For an end tag like </person> we get an xml.EndElement token.

For character data inside an element we get an xml.CharData token.

When the decoder reaches the end of the input, Token() returns the error io.EOF.
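To make the token sequence concrete, here is a minimal sketch that lists the tokens the decoder produces for a small document. The helper name describeTokens and the sample input are made up for illustration; only the encoding/xml calls come from the standard library.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// describeTokens is a hypothetical helper: it pulls every token from doc
// and returns one short description per start tag, end tag and char data.
func describeTokens(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	for {
		t, err := decoder.Token()
		if err == io.EOF {
			// end of input, not a failure
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch v := t.(type) {
		case xml.StartElement:
			out = append(out, "StartElement "+v.Name.Local)
		case xml.EndElement:
			out = append(out, "EndElement "+v.Name.Local)
		case xml.CharData:
			out = append(out, fmt.Sprintf("CharData %q", string(v)))
		}
	}
}

func main() {
	// sample input, written without whitespace between tags
	tokens, err := describeTokens(`<person><city>Oslo</city></person>`)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, tok := range tokens {
		fmt.Println(tok)
	}
}
```

For this input the program prints five lines: the two start tags, the char data "Oslo", then the two end tags. In an indented document, the whitespace between tags also arrives as xml.CharData tokens, which is why the main example only prints char data while inside <city>.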

In the above example we print the age attribute of each <person> element and the character data inside each <city> element.

This is a very basic example. In real programs you might need to remember more state.

For example, if your XML is:

<foo>
  <bar>
    <foo></foo>
  </bar>
</foo>

If you look at just the xml.StartElement token, you can't tell whether foo is the top-level element or a child of the <bar> element.
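One common way to keep that extra state is a stack of the currently open element names: push on xml.StartElement, pop on xml.EndElement. The sketch below is one possible approach, not the only one; the helper name elementPaths is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// elementPaths is a hypothetical helper: it returns the full path
// (such as "foo/bar/foo") of every start tag found in doc.
func elementPaths(doc string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(doc))
	var stack []string // names of the currently open elements
	var paths []string
	for {
		t, err := decoder.Token()
		if err == io.EOF {
			return paths, nil
		}
		if err != nil {
			return paths, err
		}
		switch v := t.(type) {
		case xml.StartElement:
			stack = append(stack, v.Name.Local)
			paths = append(paths, strings.Join(stack, "/"))
		case xml.EndElement:
			// the decoder reports mismatched tags as errors,
			// so the stack is never empty here
			stack = stack[:len(stack)-1]
		}
	}
}

func main() {
	doc := `
<foo>
  <bar>
    <foo></foo>
  </bar>
</foo>`
	paths, err := elementPaths(doc)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

This prints foo, foo/bar and foo/bar/foo, one per line, so the two foo elements can be told apart by their path. The depth of an element is simply len(stack) at the time its start tag is seen.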
