What services does SKYLINE provide?

We provide specialized divisions including: Data Centre Design & Build (Tier III/IV), Cybersecurity SOC/NOC Operations, SCADA & Automation, Fire Safety, Construction, HVAC, AI Solutions, Server Infrastructure with HPE & Dell, and Software Development.

Does SKYLINE design and build data centres?

Yes, we specialize in Tier III and Tier IV data centre design and construction including electrical infrastructure, precision cooling, raised flooring, fire suppression, and Schneider Electric UPS systems. All our projects are SACS-002 compliant and trusted by major clients like Saudi Aramco and SABIC.

Where is SKYLINE located?

Our headquarters is in Dammam, Eastern Province, Saudi Arabia. We serve all regions of Saudi Arabia and GCC countries including Riyadh, Jeddah, Khobar, Dhahran, and Jubail.

What compliance certifications does SKYLINE hold?

We comply with NCA-ECC (Essential Cybersecurity Controls), SACS-002 (Aramco Third Party Cybersecurity Standard), ISO 27001 (Information Security Management), and SAMA Cybersecurity Framework.

What experience does SKYLINE have in the Saudi market?

Founded in 2019, SKYLINE has 6+ years of experience in the Saudi market, has completed 15+ mega projects including the FIFA Club World Cup, and serves 200+ clients including Saudi Aramco, SABIC, Maaden, and STC.

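How to Monitor Server Uptime and Set Up Alerts

A practical, step-by-step guide to monitoring server uptime from the outside, observing health from the inside, and getting alerted within seconds when something fails, all with free, open-source tools on a Skyline Cloud VPS.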

SKYLINE Engineering @skyline

Published: Jun 9, 2026
Reading time: 4 min


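What "uptime monitoring" actually means

Uptime monitoring answers two different questions, and running a reliable service requires you to watch both:

Is my server reachable from the outside? (external, or black-box, monitoring): can a real client coming in over the internet get a healthy response?
Is my server healthy on the inside? (internal, or white-box, monitoring): CPU, memory, disk, and individual services such as Nginx, MySQL, or your application process.

External checks catch the outages your users can see. Internal checks catch the _root causes_ of those outages (a full disk, a runaway process, heavy swapping), often before they turn into an outage. This guide sets up both kinds of monitoring on a single Linux VPS, plus alerts that notify you within seconds. The commands assume Ubuntu 22.04/24.04 or Debian running on a Skyline Cloud server; for AlmaLinux/RHEL, adjust package names accordingly.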

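Step 1: External uptime checks

The simplest external check is an HTTP request sent from a machine _outside_ your server. Run the following command on another host (or your laptop) to confirm the site responds: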

curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://example.com/

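A healthy result looks like HTTP 200 in 0.184s. For continuous monitoring without writing your own code, use a hosted check service such as UptimeRobot or a self-hosted option such as Uptime Kuma. Self-hosting keeps your monitoring data in-country, which helps with PDPL and NCA compliance alignment.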
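If you do want to script the check yourself, the curl output above can be turned into a pass/fail decision. The following is a minimal sketch, not a replacement for a monitoring service: the function names `is_healthy` and `probe` and the 2-second default latency budget are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: classify one HTTP probe as healthy or unhealthy.
# is_healthy, probe, and the 2.0s default budget are hypothetical choices.

# is_healthy STATUS SECONDS [MAX_SECONDS]
# Prints "healthy" when STATUS is 2xx and SECONDS <= MAX_SECONDS,
# otherwise prints "unhealthy" and returns 1.
is_healthy() {
  local status="$1" seconds="$2" max="${3:-2.0}"
  if [[ "$status" =~ ^2[0-9][0-9]$ ]] &&
     awk -v t="$seconds" -v m="$max" 'BEGIN { exit !(t <= m) }'; then
    echo "healthy"
  else
    echo "unhealthy"
    return 1
  fi
}

# probe URL
# Runs a curl check like the one above and classifies the result.
# curl reports status 000 when the connection itself fails.
probe() {
  local url="$1" out status seconds
  out=$(curl -sS -o /dev/null --max-time 10 \
        -w '%{http_code} %{time_total}' "$url" 2>/dev/null) || true
  read -r status seconds <<<"${out:-000 0}"
  is_healthy "$status" "$seconds"
}
```

Run `probe https://example.com/` from a host outside the server; the exit status can then drive whatever alerting you attach to it.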

Run Uptime Kuma with Docker on a _separate, small_ VPS (never on the server it monitors):

docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma louislam/uptime-kuma:1

Open http://<monitor-ip>:3001, create your admin account, then click Add New Monitor:

Monitor Type: HTTP(s)
URL: https://example.com/health (use a lightweight health-check endpoint rather than the home page)
Heartbeat Interval: 60 seconds
Retries: 2 (avoids alerting on a single blip)
Accepted Status Codes: 200-299
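These settings determine how long an outage can go unnoticed. As a rough sketch, assuming retries are spaced at the same interval as regular checks and ignoring request timeouts, the worst case is one full interval until the first failed check plus one interval per retry. The function name `worst_case_detection` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: rough upper bound on seconds between outage start and alert.
# Assumes the outage begins right after a successful check and that
# each retry waits RETRY_INTERVAL seconds (defaults to INTERVAL).

# worst_case_detection INTERVAL RETRIES [RETRY_INTERVAL]
worst_case_detection() {
  local interval="$1" retries="$2" retry_interval="${3:-$1}"
  echo $(( interval + retries * retry_interval ))
}

worst_case_detection 60 2   # prints 180
```

With the values above, an alert can take up to about three minutes. If that is too slow for your service, a shorter retry interval reduces the delay without giving up the protection against single blips.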

Always monitor a dedicated /health endpoint that confirms your application _and_ its database are working, not merely that the web server can return a page.
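The logic behind such an endpoint can be sketched in a few lines: run every dependency check and report success only if all of them pass. This is an illustration under stated assumptions, not the application's actual endpoint. The names `health_status`, `check_app`, and `check_db` are hypothetical, as are the local port 8080 and the use of `mysqladmin ping` as the database probe.

```bash
#!/usr/bin/env bash
# Sketch: aggregate several dependency checks into one health verdict.

# health_status CHECK...
# Each CHECK is the name of a command or function. Prints "200 ok" if all
# succeed; prints "503 <check> failed" and returns 1 at the first failure.
health_status() {
  local check
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo "503 $check failed"
      return 1
    fi
  done
  echo "200 ok"
}

# Hypothetical checks; replace them with whatever proves your stack works.
check_app() { curl -fsS --max-time 3 http://127.0.0.1:8080/; }
check_db()  { mysqladmin ping; }

# Example: health_status check_app check_db
```

In a real application the same idea usually lives inside the web framework, with the handler returning HTTP 200 or 503 based on the combined result.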

Step 2: Internal health monitoring with node_exporter and a check script

For a single server you do not need a full Prometheus stack. A short script run by cron or a systemd timer covers the basics. Create /usr/local/bin/health-check.sh:

#!/usr/bin/env bash
set -euo pipefail

THRESH_DISK=90      # percent
THRESH_MEM=90       # percent
WEBHOOK="https://hooks.example.com/your-webhook"

alert() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"[$(hostname)] $1\"}" "$WEBHOOK" || true
}

# Disk usage on /
disk=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$disk" -ge "$THRESH_DISK" ] && alert "Disk at ${disk}% on /"

# Memory usage
mem=$(free | awk '/Mem:/ {printf "%d", $3/$2*100}')
[ "$mem" -ge "$THRESH_MEM" ] && alert "Memory at ${mem}%"

# Critical service must be active
for svc in nginx mysql; do
  systemctl is-active --quiet "$svc" || alert "Service $svc is DOWN"
done

Make it executable and test it:

sudo chmod +x /usr/local/bin/health-check.sh
sudo /usr/local/bin/health-check.sh
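
If you would rather base the memory threshold on the kernel's own estimate of how much memory is available to new workloads, the MemAvailable field in /proc/meminfo can be used in place of the `free` calculation. This is an optional sketch; the function name `mem_used_percent` is hypothetical, and the file argument exists only to make the function easy to test.

```bash
#!/usr/bin/env bash
# Sketch: memory pressure as a percentage, based on MemAvailable.

# mem_used_percent [MEMINFO_FILE]
# Prints the integer percentage of memory that is NOT available,
# computed as (MemTotal - MemAvailable) / MemTotal * 100.
mem_used_percent() {
  local file="${1:-/proc/meminfo}"
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) / total * 100 }' "$file"
}
```

In the script, `mem=$(mem_used_percent)` would then feed the same threshold comparison as before.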

Schedule it with a systemd timer, which is more dependable than cron when it comes to logging and handling missed runs. Create /etc/systemd/system/health-check.service:

[Unit]
Description=Server health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/health-check.sh

And /etc/systemd/system/health-check.timer:

[Unit]
Description=Run health check every 2 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=2min

[Install]
WantedBy=timers.target

Enable it:

sudo systemctl daemon-reload
sudo systemctl enable --now health-check.timer
systemctl list-timers health-check.timer

For richer metrics and history, install the Prometheus node_exporter, which exposes CPU, memory, disk, and network metrics on port 9100:

sudo apt update && sudo apt install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
curl -s http://localhost:9100/metrics | head

Bind it to localhost, or restrict port 9100 in the firewall, so that these metrics are never exposed publicly:

sudo ufw allow from <monitor-ip> to any port 9100 proto tcp
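
The metrics on port 9100 are plain text, so they can be inspected without a Prometheus server. As a small sketch, the function below reads that text on standard input and derives filesystem usage for one mount point from node_exporter's `node_filesystem_size_bytes` and `node_filesystem_avail_bytes` series. The function name `fs_used_percent` is hypothetical, and the result can differ slightly from `df` because of reserved blocks.

```bash
#!/usr/bin/env bash
# Sketch: compute filesystem usage from node_exporter text metrics.

# fs_used_percent [MOUNTPOINT]
# Reads metrics on stdin; prints the integer percentage of MOUNTPOINT
# (default "/") that is not available to unprivileged users.
fs_used_percent() {
  local mount="${1:-/}"
  awk -v m="mountpoint=\"$mount\"" '
    index($0, m) && /^node_filesystem_size_bytes[{]/  { size  = $NF }
    index($0, m) && /^node_filesystem_avail_bytes[{]/ { avail = $NF }
    END { if (size > 0) printf "%d\n", (size - avail) / size * 100 }'
}

# Example: curl -s http://localhost:9100/metrics | fs_used_percent /
```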

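Step 3: Alerts that actually reach you

An alert is only useful if it arrives quickly and through more than one channel. Configure at least two channels so that a single failing provider cannot leave you without any notification.

Channel	Best for	Latency	Notes
Email	Audit trail, non-urgent	Seconds to minutes	Use a real SMTP service, not the server itself
Webhook (Slack/Teams)	Team visibility	Seconds	Easy to wire into the script above
SMS / push	Real emergencies	Seconds	Reserve for "site down" situations

Always send alert email through a dedicated SMTP relay. Never rely on the monitored server's own mail service, because a server that is down cannot warn you. Business email hosting or any SMTP provider will do. A minimal email alert using msmtp: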

sudo apt install -y msmtp msmtp-mta
printf 'Subject: ALERT %s\n\n%s\n' "$(hostname)" "Disk high" \
  | msmtp -a default you@example.com

In Uptime Kuma, add notification methods (Email/SMTP, Slack, Telegram, or a generic webhook) under Settings → Notifications, and attach them to every monitor.
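
For the health-check script, the "at least two channels" rule can be sketched as a fan-out that tries every channel and treats the alert as delivered if any one of them succeeds. The names `notify_all`, `send_webhook`, and `send_email`, and the variables `WEBHOOK` and `ALERT_TO`, are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: send one alert through several channels.

# notify_all MESSAGE CHANNEL...
# Calls each CHANNEL function with MESSAGE. Returns 0 if at least one
# channel succeeded, 1 if every channel failed.
notify_all() {
  local message="$1"; shift
  local channel delivered=1
  for channel in "$@"; do
    if "$channel" "$message"; then
      delivered=0
    else
      echo "channel $channel failed" >&2
    fi
  done
  return "$delivered"
}

# Hypothetical channels; WEBHOOK and ALERT_TO must be set by the caller.
send_webhook() {
  curl -fsS --max-time 10 -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"$1\"}" "$WEBHOOK" >/dev/null
}
send_email() {
  printf 'Subject: ALERT %s\n\n%s\n' "$(hostname)" "$1" \
    | msmtp -a default "$ALERT_TO"
}

# Example: notify_all "Disk at 93% on /" send_webhook send_email
```

Trying every channel, instead of stopping at the first success, means a silently broken channel shows up in the logs long before you depend on it.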

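Step 4: Tune thresholds to avoid alert fatigue

Bad alerts are worse than no alerts, because people learn to ignore them. Follow these rules:

Require 2 or more consecutive failed checks before alerting (the Retries setting) to ignore transient blips.
Alert on symptoms users can feel, such as HTTP 5xx or high latency, rather than on every internal metric.
Set sensible thresholds: disk at 90%, memory sustained at 90%, latency above your usual p95.
Send recovery notifications so you know when a problem has cleared.
Review alerts monthly and remove or retune anything that cries wolf.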
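
Two of these rules, consecutive failures and recovery notifications, can also be applied to the health-check script with a small amount of state. The sketch below keeps one counter file per check; the function name `track_check` and the `STATE_DIR` location are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: alert only after N consecutive failures, and announce recovery.

STATE_DIR="${STATE_DIR:-/var/tmp/health-check-state}"

# track_check NAME RESULT [THRESHOLD]
# RESULT is "ok" or "fail". Prints "ALERT" once, when the THRESHOLD-th
# consecutive failure is recorded; prints "RECOVERED" on the first success
# after an alert; prints nothing otherwise.
track_check() {
  local name="$1" result="$2" threshold="${3:-2}"
  local file="$STATE_DIR/$name.count" count=0
  mkdir -p "$STATE_DIR"
  if [ -f "$file" ]; then count=$(cat "$file"); fi
  if [ "$result" = "fail" ]; then
    count=$(( count + 1 ))
    echo "$count" > "$file"
    if [ "$count" -eq "$threshold" ]; then echo "ALERT"; fi
  else
    echo 0 > "$file"
    if [ "$count" -ge "$threshold" ]; then echo "RECOVERED"; fi
  fi
}
```

Because the alert fires only when the counter reaches the threshold exactly, a long outage produces one alert instead of one every two minutes.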

Verify the whole chain

Before you rely on your alerts, test the alert path end to end. Temporarily lower a threshold, or stop a non-critical service:

sudo systemctl stop nginx        # triggers the service-down alert
# confirm the alert arrives, then:
sudo systemctl start nginx

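If the alert shows up in your inbox and chat tool within a few minutes, your monitoring is real. An untested alert chain is equivalent to having no monitoring at all.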

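Summary

You now have external uptime checks, internal health monitoring through a systemd timer and node_exporter, and multi-channel alerting that you have verified yourself. Keep your monitoring host separate from the servers it watches, keep data in-country to support PDPL and NCA compliance, and revisit your thresholds as traffic grows.

Need a reliable, in-country VPS to host your application and monitoring stack, with local Arabic support and transparent pricing? Create your Skyline Cloud account and deploy in minutes.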

SKYLINE Engineering

@skyline

The engineering team at SKYLINE Industrial Solutions. We publish field-tested guides drawn from real KSA and GCC deployments.


