Sometimes you need to extract data points from HTML using XPath (if you are new to XPath, read an introduction to it first). First, install the htmlquery package:
go get github.com/antchfx/htmlquery
Here is the code; the explanation follows below:
package main

import (
    "bytes"
    "fmt"
    "strings"

    "github.com/antchfx/htmlquery"
)
// ExtractXPath accepts a single XPath expression and returns the matched
// content as one string.
func ExtractXPath(htmlStr string, xpathExpr string) (string, error) {
    // Load the HTML document
    var buffer bytes.Buffer
    buffer.WriteString(htmlStr)
    doc, err := htmlquery.Parse(&buffer)
    if err != nil {
        return "", err
    }

    // Find the nodes matching the XPath expression
    nodes := htmlquery.Find(doc, xpathExpr)

    // Iterate over the nodes and extract their text content
    var content []string
    for _, node := range nodes {
        content = append(content, htmlquery.InnerText(node))
    }

    // Join the extracted content if multiple nodes were found
    result := strings.Join(content, " ")
    return result, nil
}
func main() {
    htmlStr := `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <div class="content">
      <p>Hello, World!</p>
      <p>This is a test.</p>
    </div>
  </body>
</html>`

    xpathExpr := "//div[@class='content']/p"
    content, err := ExtractXPath(htmlStr, xpathExpr)
    if err != nil {
        fmt.Println("Error:", err)
    } else {
        fmt.Println("Extracted content:", content)
    }
}
You will receive the output:
Extracted content: Hello, World! This is a test.
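The two paragraphs are joined with a space by strings.Join inside ExtractXPath. If you only need a single node, standard XPath indexing works too. A minimal sketch, reusing the same function with a different expression:

// Select only the first paragraph inside the div
first, err := ExtractXPath(htmlStr, "//div[@class='content']/p[1]")
if err != nil {
    fmt.Println("Error:", err)
}
fmt.Println(first) // prints: Hello, World!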
How it works in Go
All the XPath heavy lifting is done by the open-source htmlquery library we installed earlier. A basic query against a parsed document looks like this:
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
    panic(`not a valid XPath expression.`)
}
Note that htmlquery.Find, used in the code above, panics on an invalid expression, while htmlquery.QueryAll returns an error instead. See more examples in the htmlquery documentation.
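htmlquery also provides FindOne and SelectAttr for grabbing a single node or reading an attribute value. A minimal sketch, reusing the parsed doc from above:

// Take the first link in the document and read its href attribute
if a := htmlquery.FindOne(doc, "//a"); a != nil {
    href := htmlquery.SelectAttr(a, "href")
    fmt.Println(href)
}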
Extracting multiple XPath elements in Golang
package extract

import (
    "bytes"
    "strings"

    "github.com/antchfx/htmlquery"
)

// Rules maps a result key to an XPath expression.
type Rules = map[string]string

// Content maps a result key to the extracted text.
type Content = map[string]string

// XPath applies every rule in filter to the HTML document and returns
// the extracted content keyed by rule name.
func XPath(htmlStr string, filter Rules) (Content, error) {
    // Load the HTML document
    var buffer bytes.Buffer
    buffer.WriteString(htmlStr)
    doc, err := htmlquery.Parse(&buffer)
    if err != nil {
        return nil, err
    }

    result := make(Content)
    // Iterate over the filter to apply each XPath expression
    for key, xpathExpr := range filter {
        // Find the nodes matching the XPath expression
        nodes := htmlquery.Find(doc, xpathExpr)

        // Iterate over the nodes and extract their text content
        var content []string
        for _, node := range nodes {
            content = append(content, htmlquery.InnerText(node))
        }

        // Join the extracted content if multiple nodes were found
        result[key] = strings.Join(content, " ")
    }
    return result, nil
}
Extracting multiple XPath elements requires a bit more code. First, define two map types: Rules for the extraction rules and Content for the result. Each rule has its own key, which is reused as the key of the result map after extraction.
Then iterate over the filter rules, find the matching elements for each rule, extract their text, and put it into the result map under the rule's key.
Here is the usage example:
filter := Rules{
    "Title":          "//title/text()",
    "Header":         "//h1/text()",
    "link_more_info": "//a[contains(text(),'More information')]/@href",
    "link_fb":        "//a[contains(text(),'Another link fb')]/@href",
}

content, err := XPath(html, filter)
if err != nil {
    fmt.Printf("Error: %s\n", err)
}
fmt.Printf("Extracted content: %v\n", content)
The result map will be:
map[
    Header:Example Domain
    Title:Example Domain
    link_fb:https://fb.com/test
    link_more_info:https://www.iana.org/domains/example
]
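One caveat: htmlquery.Find panics if a rule contains an invalid XPath expression. If the rules come from configuration or user input, you may prefer a variant built on QueryAll that returns the error instead. Here is a sketch of that idea (XPathSafe is a hypothetical name, not part of the code above):

// XPathSafe works like XPath but returns an error for invalid
// expressions instead of panicking. (Hypothetical variant.)
func XPathSafe(htmlStr string, filter Rules) (Content, error) {
    doc, err := htmlquery.Parse(strings.NewReader(htmlStr))
    if err != nil {
        return nil, err
    }

    result := make(Content)
    for key, xpathExpr := range filter {
        // QueryAll reports an error if xpathExpr cannot be compiled
        nodes, err := htmlquery.QueryAll(doc, xpathExpr)
        if err != nil {
            return nil, err
        }

        var content []string
        for _, node := range nodes {
            content = append(content, htmlquery.InnerText(node))
        }
        result[key] = strings.Join(content, " ")
    }
    return result, nil
}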