March 6, 2025

Monitoring Server Incident Resolution

The incident lasted 9 hours and was not related to crawling. The root cause was a network issue affecting the monitoring server. Because the main job manager could not reach the monitoring server, each job report had to wait several minutes for its request to the monitoring server to time out.

As a result, the processing time for each job increased, and the job queue grew to several thousand jobs.
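
For illustration only, the sketch below shows one common way to keep job processing independent of a slow or unreachable monitoring endpoint: reports go into a bounded in-memory buffer and a background thread delivers them. This is not a description of the fix that was actually applied. The class name BackgroundReporter, the send callable, and the max_pending parameter are all hypothetical.

```python
import queue
import threading
import time


class BackgroundReporter:
    """Hands job reports to a worker thread so the job loop does not wait on monitoring."""

    def __init__(self, send, max_pending=1000):
        # `send` is a hypothetical callable that delivers one report to the
        # monitoring server; it may block or raise when the server is unreachable.
        self._send = send
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # reports discarded because the buffer was full
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, payload):
        """Enqueue a report without blocking; drop it if the buffer is full."""
        try:
            self._pending.put_nowait(payload)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Runs in the background thread: slow or failing sends only delay
        # other reports, never the jobs themselves.
        while True:
            payload = self._pending.get()
            try:
                self._send(payload)
            except Exception:
                pass  # a failed report must not affect job processing


if __name__ == "__main__":
    def slow_send(payload):
        time.sleep(2)  # stands in for a monitoring server that is timing out

    reporter = BackgroundReporter(slow_send, max_pending=100)
    start = time.monotonic()
    for job_id in range(10):
        reporter.report({"job": job_id, "status": "done"})
    print(f"queued 10 reports in {time.monotonic() - start:.3f}s")
```

The trade-off in this design is that reports can be lost when the buffer fills during a long outage, in exchange for job throughput that does not depend on the monitoring server's availability.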

The incident has now been resolved. We are continuing to improve our monitoring system to prevent similar issues in the future.