Sometimes you need to extract data points from HTML using XPath (if you are new to XPath, read an introduction to it first). First, install the htmlquery package:
go get github.com/antchfx/htmlquery
Here is the code; the explanation follows below:
package main

import (
    "bytes"
    "fmt"
    "strings"

    "github.com/antchfx/htmlquery"
)
// ExtractXPath accepts a single XPath expression and returns the matched
// content as one string.
func ExtractXPath(htmlStr string, xpathExpr string) (string, error) {
    // Load the HTML document
    var buffer bytes.Buffer
    buffer.WriteString(htmlStr)
    doc, err := htmlquery.Parse(&buffer)
    if err != nil {
        return "", err
    }

    // Find the nodes matching the XPath expression
    nodes := htmlquery.Find(doc, xpathExpr)

    // Iterate over the nodes and extract their text content
    var content []string
    for _, node := range nodes {
        content = append(content, htmlquery.InnerText(node))
    }

    // Join the extracted content if multiple nodes were found
    result := strings.Join(content, " ")
    return result, nil
}
func main() {
    htmlStr := `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <div class="content">
      <p>Hello, World!</p>
      <p>This is a test.</p>
    </div>
  </body>
</html>`

    xpathExpr := "//div[@class='content']/p"
    content, err := ExtractXPath(htmlStr, xpathExpr)
    if err != nil {
        fmt.Println("Error:", err)
    } else {
        fmt.Println("Extracted content:", content)
    }
}
You will receive the output:
Extracted content: Hello, World! This is a test.
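The two paragraphs are joined with a space by strings.Join inside ExtractXPath. If you only need a single node, standard XPath indexing works too. A minimal sketch, reusing the same function with a different expression:

// Select only the first paragraph inside the div
first, err := ExtractXPath(htmlStr, "//div[@class='content']/p[1]")
if err != nil {
    fmt.Println("Error:", err)
}
fmt.Println(first) // prints: Hello, World!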
How it works in Go
All the XPath heavy lifting is done by the open-source htmlquery library we installed earlier. A basic query against a parsed document looks like this:
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
    panic(`not a valid XPath expression.`)
}
Note that htmlquery.Find, used in the code above, panics on an invalid expression, while htmlquery.QueryAll returns an error instead. See more examples in the htmlquery documentation.
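htmlquery also provides FindOne and SelectAttr for grabbing a single node or reading an attribute value. A minimal sketch, reusing the parsed doc from above:

// Take the first link in the document and read its href attribute
if a := htmlquery.FindOne(doc, "//a"); a != nil {
    href := htmlquery.SelectAttr(a, "href")
    fmt.Println(href)
}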
Extracting multiple XPath elements in Golang
package extract

import (
    "bytes"
    "strings"

    "github.com/antchfx/htmlquery"
)

// Rules maps a result key to an XPath expression.
type Rules = map[string]string

// Content maps a result key to the extracted text.
type Content = map[string]string

// XPath applies every rule in filter to the HTML document and returns
// the extracted content keyed by rule name.
func XPath(htmlStr string, filter Rules) (Content, error) {
    // Load the HTML document
    var buffer bytes.Buffer
    buffer.WriteString(htmlStr)
    doc, err := htmlquery.Parse(&buffer)
    if err != nil {
        return nil, err
    }

    result := make(Content)
    // Iterate over the filter to apply each XPath expression
    for key, xpathExpr := range filter {
        // Find the nodes matching the XPath expression
        nodes := htmlquery.Find(doc, xpathExpr)

        // Iterate over the nodes and extract their text content
        var content []string
        for _, node := range nodes {
            content = append(content, htmlquery.InnerText(node))
        }

        // Join the extracted content if multiple nodes were found
        result[key] = strings.Join(content, " ")
    }
    return result, nil
}
Extracting multiple XPath elements requires a bit more code. First, define two map types: Rules for the extraction rules and Content for the result. Each rule has its own key, which is reused as the key of the result map after extraction.
Then iterate over the filter rules, find the matching elements for each rule, extract their text, and put it into the result map under the rule's key.
Here is the usage example:
filter := Rules{
    "Title":          "//title/text()",
    "Header":         "//h1/text()",
    "link_more_info": "//a[contains(text(),'More information')]/@href",
    "link_fb":        "//a[contains(text(),'Another link fb')]/@href",
}

content, err := XPath(html, filter)
if err != nil {
    fmt.Printf("Error: %s\n", err)
}
fmt.Printf("Extracted content: %v\n", content)
The result map will be:
map[
    Header:Example Domain
    Title:Example Domain
    link_fb:https://fb.com/test
    link_more_info:https://www.iana.org/domains/example
]
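One caveat: htmlquery.Find panics if a rule contains an invalid XPath expression. If the rules come from configuration or user input, you may prefer a variant built on QueryAll that returns the error instead. Here is a sketch of that idea (XPathSafe is a hypothetical name, not part of the code above):

// XPathSafe works like XPath but returns an error for invalid
// expressions instead of panicking. (Hypothetical variant.)
func XPathSafe(htmlStr string, filter Rules) (Content, error) {
    doc, err := htmlquery.Parse(strings.NewReader(htmlStr))
    if err != nil {
        return nil, err
    }

    result := make(Content)
    for key, xpathExpr := range filter {
        // QueryAll reports an error if xpathExpr cannot be compiled
        nodes, err := htmlquery.QueryAll(doc, xpathExpr)
        if err != nil {
            return nil, err
        }

        var content []string
        for _, node := range nodes {
            content = append(content, htmlquery.InnerText(node))
        }
        result[key] = strings.Join(content, " ")
    }
    return result, nil
}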