<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>VPS on Code is cheap, let&#39;s talk</title>
    <link>https://blog.ferstar.org/series/vps/</link>
    <description>Code is cheap, let&#39;s talk</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>zh-CN</language>
    <copyright>Copyright 2026 ferstar</copyright>
    <lastBuildDate>Sat, 13 Jun 2026 15:00:00 +0800</lastBuildDate>
    <ttl>60</ttl><atom:link href="https://blog.ferstar.org/series/vps/index.xml" rel="self" type="application/rss+xml" /><image>
      <url>https://blog.ferstar.org/site-logo.png</url>
      <title>Code is cheap, let&#39;s talk</title>
      <link>https://blog.ferstar.org/</link>
    </image>
    
    <item>
      <title>Postmortem of an OOM incident on a 512MB VPS: SSH drops instantly, API returns 521, and the low-memory tuning that followed</title>
      <link>https://blog.ferstar.org/posts/vps-low-memory-oom-ssh-api-recovery/</link>
      <pubDate>Sat, 13 Jun 2026 15:00:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/posts/vps-low-memory-oom-ssh-api-recovery/</guid>
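      <description>A 512MB VPS suddenly started dropping SSH connections instantly and returning Cloudflare 521. The cause turned out to be page allocation failures under low memory; XanMod, swap, zswap, earlyoom, OOMScoreAdjust, and Docker limits brought it back to a stable state.</description><content:encoded><![CDATA[<p>This one is not about a migration. It is about a small-memory VPS getting a reality check after the migration.</p>
<p>The symptoms looked like a network problem: a direct connection through the local SSH alias dropped instantly, the cloud provider's Web Console hung at "connecting", and the API behind Cloudflare returned 521. What made it harder was that the proxy services on the machine were still working, so the whole machine was not dead, and the usual suspects (floating IP, TUN proxy, SSH key) could not explain it at a glance.</p>
<p>Only after a reboot restored SSH was there a chance to get onto the machine and review what happened. The conclusion was clear: 512MB of memory is too tight. The system hit a large number of page allocation failures on the network receive path, and the entry services (Nginx stream, SSH, the API) were dragged into a half-dead state.</p>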

<h2 class="relative group">Symptoms
    <div id="故障现象" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e6%95%85%e9%9a%9c%e7%8e%b0%e8%b1%a1" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>SSH failed before key authentication even started:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">kex_exchange_identification: Connection closed by remote host</span></span></code></pre></div></div>
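<p>Where this error occurs matters. It shows that the connection reached the remote end but was closed before the SSH version banner exchange completed, so it never entered the normal authentication flow. That rules out a stale key, wrong permissions, or a dirty known_hosts.</p>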
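<p>To pin down which stage a failed connection reached, one option is to capture the client's debug log and look for the usual milestones. The following is only a sketch: it assumes OpenSSH's <code>ssh -v -E LOGFILE</code> debug output, and the function name <code>ssh_failure_stage</code> and the log path in the usage note are hypothetical.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Report how far an SSH attempt got, based on an "ssh -v -E LOGFILE" debug log.
# Assumption: the log contains OpenSSH's usual debug1 milestone messages.
ssh_failure_stage() {
  log=$1
  if ! grep -q 'Connection established' "$log"; then
    echo tcp            # the TCP connection itself never came up
  elif ! grep -q 'Remote protocol version' "$log"; then
    echo banner         # connected, but the server never sent its SSH banner
  elif ! grep -q 'Authenticated to' "$log"; then
    echo kex-or-auth    # banner exchanged; key exchange or authentication failed
  else
    echo ok
  fi
}</code></pre></div>
<p>A possible use is <code>ssh -v -E /tmp/ssh-debug.log VPS_ALIAS true</code> followed by <code>ssh_failure_stage /tmp/ssh-debug.log</code>. Since <code>-E</code> appends, start from a fresh log file. For a <code>kex_exchange_identification</code> failure like the one in this incident, the function would presumably print <code>banner</code>.</p>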
<p>Architecturally, the entry path on this machine is a bit roundabout:</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-mermaid" data-lang="mermaid">flowchart LR
    Client["client"]
    Nginx["nginx stream :443"]
    SSH["sshd :22"]
    API["api backend :8000"]
    FM["filebrowser"]
    CF["Cloudflare"]

    Client -->|SSH over 443| Nginx
    CF -->|api.example.com| Nginx
    CF -->|fm.example.com| Nginx
    Nginx -->|default stream| SSH
    Nginx -->|SNI api| API
    Nginx -->|SNI fm| FM</code></pre></div>
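<p>So SSH, the API, and the file service all pass through Nginx stream. If Nginx, the kernel network stack, or the machine's memory state goes wrong, from the outside it looks very much like "SSH is down" and "the API is down" at the same time.</p>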
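<p>The post does not show the Nginx configuration itself. As a rough illustration of this kind of layout, here is a minimal sketch that assumes the stream module with <code>ssl_preread</code>. The variable name <code>$entry_backend</code> and the filebrowser port are hypothetical. With <code>ssl_preread</code> the TLS stream is passed through untouched, so whatever listens on the backend address would have to terminate TLS itself.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-nginx" data-lang="nginx">stream {
    # Route by TLS SNI. Plain SSH carries no SNI, so it falls through to default.
    map $ssl_preread_server_name $entry_backend {
        api.example.com 127.0.0.1:8000;   # API backend (assumed to accept TLS here)
        fm.example.com  127.0.0.1:8080;   # filebrowser, hypothetical port
        default         127.0.0.1:22;     # sshd
    }

    server {
        listen 443;
        ssl_preread on;
        proxy_pass $entry_backend;
    }
}</code></pre></div>
<p>The point of the sketch is the shared fate: every entry path goes through the same listener and the same worker processes, so memory pressure on that one hop shows up on all three services at once.</p>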

<h2 class="relative group">The real evidence
    <div id="真正的证据" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e7%9c%9f%e6%ad%a3%e7%9a%84%e8%af%81%e6%8d%ae" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>After the reboot I went through the kernel log. The keywords to search for were not <code>sshd</code>, but these:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl -k --since <span class="s2">"2026-06-13 00:00:00"</span> <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  grep -Ei <span class="s2">"page allocation failure|out of memory|oom|killed process"</span></span></span></code></pre></div></div>
<p>Before the reboot, <code>page allocation failure</code> had appeared many times. The processes involved included:</p>
<ul>
<li><code>containerd-shim</code></li>
<li><code>containerd</code></li>
<li><code>dockerd</code></li>
<li><code>hysteria</code></li>
<li><code>runc</code></li>
<li><code>kswapd0</code></li>
</ul>
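<p>The stack traces clustered on the network receive path, with symbols such as <code>virtio_net</code>, <code>tcp_gro_receive</code>, and <code>skb_page_frag_refill</code>. In other words, this was not as simple as one application process flooding the log: the low-memory state had already started to affect how the kernel handles network packets.</p>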
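<p>That also explains why the symptoms were so odd: TCP ports might still accept connections and the UDP proxy services might still be alive, while SSH handshakes, Nginx stream forwarding, and the API upstream could all stall or drop at random under memory pressure.</p>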
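<p>To see which processes hit the failures most often, the matching kernel lines can be grouped by process name. This is a minimal sketch that assumes journalctl's default short output, where each kernel line carries a <code>kernel: </code> prefix followed by the process name; the function name <code>count_alloc_failures</code> is hypothetical.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Count "page allocation failure" kernel messages per process name.
# Reads journalctl-style lines on stdin, e.g. from: journalctl -k --no-pager
count_alloc_failures() {
  sed -n 's/.*kernel: \(.*\): page allocation failure.*/\1/p' \
    | sort | uniq -c | sort -rn
}</code></pre></div>
<p>A possible use is <code>journalctl -k --no-pager | count_alloc_failures</code>, which prints one line per process with the most frequent one first.</p>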

<h2 class="relative group">Round one: clean up the system and kernel
    <div id="第一轮系统和内核收拾干净" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e7%ac%ac%e4%b8%80%e8%bd%ae%e7%b3%bb%e7%bb%9f%e5%92%8c%e5%86%85%e6%a0%b8%e6%94%b6%e6%8b%be%e5%b9%b2%e5%87%80" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>First, get the system packages and the kernel into a known, controlled state.</p>
<p>The XanMod source was still using the old format:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">deb <span class="o">[</span>signed-by<span class="o">=</span>/etc/apt/keyrings/xanmod-archive-keyring.gpg<span class="o">]</span> http://deb.xanmod.org releases main</span></span></code></pre></div></div>
<p>That source no longer suits the current Debian 13, so switch to the codename-based source:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">deb <span class="o">[</span>signed-by<span class="o">=</span>/etc/apt/keyrings/xanmod-archive-keyring.gpg<span class="o">]</span> http://deb.xanmod.org trixie main</span></span></code></pre></div></div>
<p>Then install the current x64v3 XanMod kernel:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">apt update
</span></span><span class="line"><span class="cl">apt install linux-xanmod-x64v3
</span></span><span class="line"><span class="cl">reboot</span></span></code></pre></div></div>
<p>Confirm after the reboot:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uname -r
</span></span><span class="line"><span class="cl"><span class="c1"># 7.0.12-x64v3-xanmod1</span></span></span></code></pre></div></div>
<p>Remove the old kernels as well and keep only the current XanMod packages, so that <code>/boot</code> and grub do not pile up stale entries:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">apt autoremove --purge
</span></span><span class="line"><span class="cl">apt clean
</span></span><span class="line"><span class="cl">update-grub</span></span></code></pre></div></div>
<p>While at it, clean up the journal and unused Docker objects:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl --vacuum-time<span class="o">=</span>7d
</span></span><span class="line"><span class="cl">docker system prune -af</span></span></code></pre></div></div>
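<p>This step does not touch Docker volumes: without <code>--volumes</code>, <code>docker system prune</code> leaves them alone, which avoids deleting application data by mistake. Note that <code>-a</code> does remove every unused image, not just dangling ones.</p>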

<h2 class="relative group">Round two: give 512MB of memory a way out
    <div id="第二轮给-512mb-内存留后路" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e7%ac%ac%e4%ba%8c%e8%bd%ae%e7%bb%99-512mb-%e5%86%85%e5%ad%98%e7%95%99%e5%90%8e%e8%b7%af" aria-label="锚点">#</a>
    </span>
    
</h2>
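<p>The machine actually has only about 454MiB of usable memory, so hoping that applications will not use too much is wishful thinking, not management. The system needs an explicit fallback when memory gets tight.</p>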

<h3 class="relative group">Grow swap
    <div id="扩-swap" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e6%89%a9-swap" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>Grow swap to 2GB:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">swapoff /swapfile
</span></span><span class="line"><span class="cl">fallocate -l 2G /swapfile
</span></span><span class="line"><span class="cl">chmod <span class="m">600</span> /swapfile
</span></span><span class="line"><span class="cl">mkswap /swapfile
</span></span><span class="line"><span class="cl">swapon /swapfile</span></span></code></pre></div></div>
<p><code>/etc/fstab</code> stays as:</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-fstab" data-lang="fstab">/swapfile none swap sw 0 0</code></pre></div>

<h3 class="relative group">Switch zswap to zstd
    <div id="zswap-改成-zstd" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#zswap-%e6%94%b9%e6%88%90-zstd" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>zswap was already enabled, but the default compressor was <code>lzo</code>. Confirm that the kernel supports zstd:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">grep -i zstd /proc/crypto
</span></span><span class="line"><span class="cl">grep CONFIG_CRYPTO_ZSTD /boot/config-<span class="k">$(</span>uname -r<span class="k">)</span></span></span></code></pre></div></div>
<p>Switch at runtime:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">echo</span> zstd > /sys/module/zswap/parameters/compressor
</span></span><span class="line"><span class="cl">cat /sys/module/zswap/parameters/compressor
</span></span><span class="line"><span class="cl"><span class="c1"># zstd</span></span></span></code></pre></div></div>
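<p>Then persist it by setting the kernel command line in <code>/etc/default/grub</code> and regenerating the grub config:</p>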
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">GRUB_CMDLINE_LINUX_DEFAULT</span><span class="o">=</span><span class="s2">"zswap.enabled=1 zswap.compressor=zstd net.ifnames=0 biosdevname=0"</span>
</span></span><span class="line"><span class="cl">update-grub</span></span></code></pre></div></div>

<h3 class="relative group">sysctl: leave some headroom for network and memory
    <div id="sysctl-留一点网络和内存余量" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#sysctl-%e7%95%99%e4%b8%80%e7%82%b9%e7%bd%91%e7%bb%9c%e5%92%8c%e5%86%85%e5%ad%98%e4%bd%99%e9%87%8f" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>Add <code>/etc/sysctl.d/99-lowmem-network-tuning.conf</code>:</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-conf" data-lang="conf">vm.min_free_kbytes = 16384
vm.swappiness = 30
vm.vfs_cache_pressure = 100
net.core.netdev_max_backlog = 2500</code></pre></div>
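<p>I first tried a higher <code>vm.min_free_kbytes</code>, but on a 512MB machine that was too aggressive and ended up squeezing the memory available to user space. I settled on about 16MB, which looks like a value this machine can afford.</p>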
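<p>A quick way to judge whether a <code>vm.min_free_kbytes</code> value is aggressive is to express it as a share of total RAM. The helper below is only a sketch and its name <code>min_free_pct</code> is hypothetical. It takes the reserve and the <code>MemTotal</code> value from <code>/proc/meminfo</code>, both in kB.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Share of total RAM reserved by vm.min_free_kbytes, as a percentage.
# $1: min_free_kbytes, $2: MemTotal in kB (the value shown in /proc/meminfo)
min_free_pct() {
  awk -v reserve="$1" -v total="$2" 'BEGIN { printf "%.1f\n", reserve * 100 / total }'
}</code></pre></div>
<p>For 454MiB of RAM (464896 kB), <code>min_free_pct 16384 464896</code> prints <code>3.5</code>. A hypothetical 65536, used here purely for illustration, would come to <code>14.1</code>, a noticeable slice of a machine this small.</p>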

<h2 class="relative group">Round three: decide who dies first
    <div id="第三轮决定谁该先死" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e7%ac%ac%e4%b8%89%e8%bd%ae%e5%86%b3%e5%ae%9a%e8%b0%81%e8%af%a5%e5%85%88%e6%ad%bb" aria-label="锚点">#</a>
    </span>
    
</h2>
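<p>The worst case on a small-memory machine is everyone fighting over memory until the kernel picks, seemingly at random, a critical entry service to kill. The priorities need to be spelled out.</p>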

<h3 class="relative group">Protect the entry services
    <div id="保护入口服务" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e4%bf%9d%e6%8a%a4%e5%85%a5%e5%8f%a3%e6%9c%8d%e5%8a%a1" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>Add a systemd drop-in for <code>ssh</code>, <code>nginx</code>, and <code>supervisor</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Service]</span>
</span></span><span class="line"><span class="cl"><span class="na">OOMScoreAdjust</span><span class="o">=</span><span class="s">-700</span></span></span></code></pre></div></div>
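<p>One way to create such a drop-in for several units is a small helper. This is a sketch under assumptions: the function name <code>write_oom_dropin</code> and the file name <code>10-oom.conf</code> are hypothetical, and the optional second argument exists only so the helper can be tried against a scratch directory instead of <code>/etc/systemd/system</code>.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Write a drop-in that sets OOMScoreAdjust=-700 for one systemd service.
# $1: unit name without the .service suffix
# $2: base directory (defaults to /etc/systemd/system)
write_oom_dropin() {
  unit=$1
  base=${2:-/etc/systemd/system}
  dir="$base/$unit.service.d"
  mkdir -p "$dir"
  printf '[Service]\nOOMScoreAdjust=-700\n' > "$dir/10-oom.conf"
}</code></pre></div>
<p>As root, <code>for unit in ssh nginx supervisor; do write_oom_dropin "$unit"; done</code> followed by <code>systemctl daemon-reload</code> would install the drop-ins. The value is applied when a service's processes are started, so it only takes effect after the next restart of each unit; <code>systemctl show -p OOMScoreAdjust nginx</code> shows what systemd has configured.</p>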
<p>On inspection, <code>sshd</code> itself was already at <code>-1000</code>, while Nginx, Supervisor, and gunicorn were all at <code>-700</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="k">for</span> pid in <span class="k">$(</span>pgrep -f <span class="s2">"sshd|nginx|supervisord|gunicorn"</span><span class="k">)</span><span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">  <span class="nb">printf</span> <span class="s2">"%s score=%s adj=%s cmd=%s\n"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    <span class="s2">"</span><span class="nv">$pid</span><span class="s2">"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    <span class="s2">"</span><span class="k">$(</span>cat /proc/<span class="nv">$pid</span>/oom_score<span class="k">)</span><span class="s2">"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    <span class="s2">"</span><span class="k">$(</span>cat /proc/<span class="nv">$pid</span>/oom_score_adj<span class="k">)</span><span class="s2">"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">    <span class="s2">"</span><span class="k">$(</span>tr <span class="s1">'\0'</span> <span class="s1">' '</span> &lt;/proc/<span class="nv">$pid</span>/cmdline<span class="k">)</span><span class="s2">"</span>
</span></span><span class="line"><span class="cl"><span class="k">done</span></span></span></code></pre></div></div>
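<p>Note that if SSH comes in through Nginx stream, restarting <code>nginx</code> drops the SSH session. When changing the configuration of an entry service like this, either reload only or be ready to reconnect.</p>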

<h3 class="relative group">Limit the containers
    <div id="限制容器" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e9%99%90%e5%88%b6%e5%ae%b9%e5%99%a8" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>The proxy and file service containers all get resource limits:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">hysteria</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">mem_limit</span><span class="p">:</span><span class="w"> </span><span class="l">96m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memswap_limit</span><span class="p">:</span><span class="w"> </span><span class="l">160m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pids_limit</span><span class="p">:</span><span class="w"> </span><span class="m">128</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">oom_score_adj</span><span class="p">:</span><span class="w"> </span><span class="m">500</span></span></span></code></pre></div></div>
<p>Tune per container as needed:</p>
<table>
  <thead>
      <tr>
          <th>Container</th>
          <th style="text-align: right">mem_limit</th>
          <th style="text-align: right">memswap_limit</th>
          <th style="text-align: right">pids_limit</th>
          <th style="text-align: right">oom_score_adj</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>hysteria</td>
          <td style="text-align: right">96m</td>
          <td style="text-align: right">160m</td>
          <td style="text-align: right">128</td>
          <td style="text-align: right">500</td>
      </tr>
      <tr>
          <td>hysteria2</td>
          <td style="text-align: right">96m</td>
          <td style="text-align: right">160m</td>
          <td style="text-align: right">128</td>
          <td style="text-align: right">500</td>
      </tr>
      <tr>
          <td>tuic-server</td>
          <td style="text-align: right">64m</td>
          <td style="text-align: right">128m</td>
          <td style="text-align: right">128</td>
          <td style="text-align: right">500</td>
      </tr>
      <tr>
          <td>filebrowser</td>
          <td style="text-align: right">96m</td>
          <td style="text-align: right">160m</td>
          <td style="text-align: right">128</td>
          <td style="text-align: right">500</td>
      </tr>
  </tbody>
</table>
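<p>One detail worth keeping in mind when reading the table: <code>memswap_limit</code> is the combined ceiling for memory plus swap, so the swap each container may use is the difference between the two columns. The sketch below recomputes that from the table's values; the function name <code>summarize_limits</code> is hypothetical.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Per-container swap allowance and totals for the limits in the table above.
# Input columns on stdin: name mem_limit(MiB) memswap_limit(MiB)
summarize_limits() {
  awk '{ swap = $3 - $2; mem += $2; swp += swap
         printf "%s swap=%dm\n", $1, swap }
       END { printf "total mem_limit=%dm swap=%dm\n", mem, swp }'
}

printf '%s\n' \
  'hysteria 96 160' \
  'hysteria2 96 160' \
  'tuic-server 64 128' \
  'filebrowser 96 160' | summarize_limits</code></pre></div>
<p>Each container ends up with 64m of swap allowance, and the memory limits add up to 352m against roughly 454MiB of RAM. The limits therefore cap individual containers but do not by themselves guarantee headroom for the host, which is why the other layers in this post still matter.</p>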
<p>One small pitfall here: the Docker packaged by Debian on this machine is <code>26.1.5+dfsg1</code>, and its <code>docker update</code> cannot change <code>--oom-score-adj</code> on a running container:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker update --oom-score-adj <span class="m">500</span> hysteria
</span></span><span class="line"><span class="cl"><span class="c1"># unknown flag: --oom-score-adj</span></span></span></code></pre></div></div>
<p>So <code>oom_score_adj</code> has to go into the compose file, followed by recreating the containers:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker compose up -d --force-recreate</span></span></code></pre></div></div>
<p>Verify:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker inspect hysteria hysteria2 tuic-server filebrowser <span class="se">\
</span></span></span><span class="line"><span class="cl">  --format <span class="s2">"{{.Name}} OOM={{.HostConfig.OomScoreAdj}} Mem={{.HostConfig.Memory}} Swap={{.HostConfig.MemorySwap}} Pids={{.HostConfig.PidsLimit}}"</span></span></span></code></pre></div></div>

<h3 class="relative group">earlyoom: act early
    <div id="earlyoom-提前处理" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#earlyoom-%e6%8f%90%e5%89%8d%e5%a4%84%e7%90%86" aria-label="锚点">#</a>
    </span>
    
</h3>
<p>Finally, add earlyoom so that it steps in before the kernel actually hits OOM:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">apt install earlyoom</span></span></code></pre></div></div>
<p><code>/etc/default/earlyoom</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">EARLYOOM_ARGS</span><span class="o">=</span><span class="s2">"-m 10,5 -s 20,10 -r 300 --prefer '(^|/)(hysteria|tuic-server|filebrowser)( |</span>$<span class="s2">)' --avoid '(^|/)(sshd|sshd-session|nginx|supervisord|gunicorn|systemd|dockerd|containerd)( |:|</span>$<span class="s2">)'"</span></span></span></code></pre></div></div>
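<p>This does not conflict with systemd's <code>OOMScoreAdjust</code>. <code>OOMScoreAdjust</code> affects <code>/proc/*/oom_score</code>, and earlyoom by default also takes that score into account; <code>--avoid</code> simply adds one more layer of insurance: do not actively kill these entry services.</p>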
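<p>Before trusting the two patterns, it can help to try them against concrete process names. The sketch below uses <code>grep -E</code> as a stand-in for earlyoom's regex matching, which is an assumption: which field earlyoom matches against (process name or command line) may differ between versions. The function only reports which pattern a string matches; <code>match_policy</code> is a hypothetical name.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash">PREFER='(^|/)(hysteria|tuic-server|filebrowser)( |$)'
AVOID='(^|/)(sshd|sshd-session|nginx|supervisord|gunicorn|systemd|dockerd|containerd)( |:|$)'

# Print which of the two patterns a process name matches, if any.
match_policy() {
  if printf '%s\n' "$1" | grep -Eq "$AVOID"; then
    echo avoid
  elif printf '%s\n' "$1" | grep -Eq "$PREFER"; then
    echo prefer
  else
    echo none
  fi
}</code></pre></div>
<p>With these patterns, <code>match_policy hysteria</code> prints <code>prefer</code> and <code>match_policy sshd-session</code> prints <code>avoid</code>, while <code>match_policy containerd-shim</code> and <code>match_policy hysteria2</code> both print <code>none</code>, because the trailing <code>( |$)</code> requires the name to end right after the listed word. Whether that matters depends on the actual process names on the machine.</p>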
<p>After startup, the policy shows up in the log:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Preferring to kill process names that match regex '(^|/)(hysteria|tuic-server|filebrowser)( |$)'
</span></span><span class="line"><span class="cl">Will avoid killing process names that match regex '(^|/)(sshd|sshd-session|nginx|supervisord|gunicorn|systemd|dockerd|containerd)( |:|$)'
</span></span><span class="line"><span class="cl">sending SIGTERM when mem avail &lt;= 10.00% and swap free &lt;= 20.00%,
</span></span><span class="line"><span class="cl">        SIGKILL when mem avail &lt;=  5.00% and swap free &lt;= 10.00%</span></span></code></pre></div></div>

<h2 class="relative group">Keep logs from filling the disk again
    <div id="日志别再把盘打满" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e6%97%a5%e5%bf%97%e5%88%ab%e5%86%8d%e6%8a%8a%e7%9b%98%e6%89%93%e6%bb%a1" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>The root partition on this machine is not large either, so both the journal and Docker logs need limits.</p>
<p><code>/etc/systemd/journald.conf.d/99-vps-limits.conf</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Journal]</span>
</span></span><span class="line"><span class="cl"><span class="na">SystemMaxUse</span><span class="o">=</span><span class="s">128M</span>
</span></span><span class="line"><span class="cl"><span class="na">SystemKeepFree</span><span class="o">=</span><span class="s">512M</span>
</span></span><span class="line"><span class="cl"><span class="na">RuntimeMaxUse</span><span class="o">=</span><span class="s">16M</span>
</span></span><span class="line"><span class="cl"><span class="na">MaxRetentionSec</span><span class="o">=</span><span class="s">7day</span>
</span></span><span class="line"><span class="cl"><span class="na">RateLimitIntervalSec</span><span class="o">=</span><span class="s">30s</span>
</span></span><span class="line"><span class="cl"><span class="na">RateLimitBurst</span><span class="o">=</span><span class="s">1000</span></span></span></code></pre></div></div>
<p>Default logging for the Docker daemon (in <code>daemon.json</code>, typically <code>/etc/docker/daemon.json</code>):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">"log-driver"</span><span class="p">:</span> <span class="s2">"json-file"</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">"log-opts"</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">"max-size"</span><span class="p">:</span> <span class="s2">"5m"</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">"max-file"</span><span class="p">:</span> <span class="s2">"2"</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div>
<p>Also turn off UFW logging:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ufw logging off</span></span></code></pre></div></div>
<p>A lot of scanner traffic had been triggering UFW BLOCK log entries, which does a small-disk, small-memory machine no good.</p>

<h2 class="relative group">Final verification
    <div id="最后验证" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e6%9c%80%e5%90%8e%e9%aa%8c%e8%af%81" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>A few sets of checks to wrap up:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uname -r
</span></span><span class="line"><span class="cl"><span class="c1"># 7.0.12-x64v3-xanmod1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">apt list --upgradable
</span></span><span class="line"><span class="cl"><span class="c1"># Listing...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">systemctl --failed --no-pager
</span></span><span class="line"><span class="cl"><span class="c1"># 0 loaded units listed.</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">systemctl is-active earlyoom ssh nginx supervisor docker containerd
</span></span><span class="line"><span class="cl"><span class="c1"># active</span>
</span></span><span class="line"><span class="cl"><span class="c1"># active</span>
</span></span><span class="line"><span class="cl"><span class="c1"># active</span>
</span></span><span class="line"><span class="cl"><span class="c1"># active</span>
</span></span><span class="line"><span class="cl"><span class="c1"># active</span>
</span></span><span class="line"><span class="cl"><span class="c1"># active</span></span></span></code></pre></div></div>
<p>Memory state:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Mem: 454Mi total, 235Mi available
</span></span><span class="line"><span class="cl">Swap: 2.0Gi total, about 271Mi used</span></span></code></pre></div></div>
<p>Entry point checks:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ssh VPS_ALIAS <span class="s1">'echo ok'</span>
</span></span><span class="line"><span class="cl">curl -sS -o /dev/null -w <span class="s1">'%{http_code}\n'</span> https://api.example.com/
</span></span><span class="line"><span class="cl">curl -sS -o /dev/null -w <span class="s1">'%{http_code}\n'</span> https://fm.example.com/</span></span></code></pre></div></div>
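<p>The result: SSH works, the API returns an application-level 404, and the file service returns 200. The API's 404 is a routing result from the application, not a Cloudflare 521 and not a dead upstream.</p>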
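<p>The distinction between "the origin answered" and "Cloudflare could not reach the origin" can be made mechanical. The following is a rough triage sketch, not a complete classifier: it treats 520 to 527 as Cloudflare's own origin-error codes and <code>000</code> as curl's <code>%{http_code}</code> value when no HTTP response arrived. The function name <code>triage_status</code> is hypothetical.</p>
<div class="highlight-wrapper"><pre tabindex="0"><code class="language-bash" data-lang="bash"># Rough triage of an HTTP status code seen through Cloudflare.
# $1: status code as printed by curl -w '%{http_code}'
triage_status() {
  case "$1" in
    000)     echo no-response ;;              # curl got no HTTP response at all
    52[0-7]) echo cloudflare-origin-error ;;  # Cloudflare could not get a valid answer from the origin
    *)       echo origin-answered ;;          # some HTTP response came back, even if it is a 404
  esac
}</code></pre></div>
<p>Here <code>triage_status 404</code> prints <code>origin-answered</code> and <code>triage_status 521</code> prints <code>cloudflare-origin-error</code>, which matches the reading above: a 404 means the request made it to the application.</p>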
<p>Then check the kernel log since the tuning:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">journalctl -k --since <span class="s2">"2026-06-13 06:28:32 UTC"</span> --no-pager <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  grep -Ei <span class="s2">"out of memory|oom-kill|page allocation failure|killed process"</span></span></span></code></pre></div></div>
<p>No new OOM or page allocation failure.</p>

<h2 class="relative group">Looking back 9 days later
    <div id="9-天后回看" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#9-%e5%a4%a9%e5%90%8e%e5%9b%9e%e7%9c%8b" aria-label="锚点">#</a>
    </span>
    
</h2>
<p>I took another look on June 22, 2026, and the tuning had held up better than expected.</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">uptime: 9 days, 6:59
</span></span><span class="line"><span class="cl">load average: 0.16, 0.14, 0.16
</span></span><span class="line"><span class="cl">Mem: 454Mi total, 227Mi available
</span></span><span class="line"><span class="cl">Swap: 2.0Gi total, 346Mi used
</span></span><span class="line"><span class="cl">rootfs: 56% used</span></span></code></pre></div></div>
<p>The key services are all still up:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">systemctl is-active earlyoom ssh nginx supervisor docker containerd
</span></span><span class="line"><span class="cl"><span class="c1"># active active active active active active</span></span></span></code></pre></div></div>
<p>The containers have not dropped either:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">tuic-server  Up 7 days
</span></span><span class="line"><span class="cl">hysteria2    Up 7 days
</span></span><span class="line"><span class="cl">hysteria     Up 7 days
</span></span><span class="line"><span class="cl">filebrowser  Up 9 days (healthy)</span></span></code></pre></div></div>
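<p>Most importantly, there has been no new <code>OOM</code> / <code>page allocation failure</code> in the past 7 days, and earlyoom never actually had to kill anything. The SSH, API, and file service entry points are all fine, and the API still returns an application-level 404 rather than a Cloudflare 521.</p>
<p>This shows that the combination really works: swap + zswap as the backstop, OOMScoreAdjust to protect the entry points, container resource limits to draw clear boundaries, and earlyoom as the last line of defense. 512MB is still 512MB, but at least it no longer runs on luck.</p>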

<h2 class="relative group">Summary
    <div id="小结" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#%e5%b0%8f%e7%bb%93" aria-label="锚点">#</a>
    </span>
    
</h2>
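<p>The biggest lesson this time: a 512MB VPS can run, but it cannot tough it out on default settings.</p>
<p>Especially when SSH goes through Nginx stream and the API shares the same 443 entry point, the entry services must be protected. What should yield first are low-priority processes such as proxy containers, the file service, and ad hoc jobs, not <code>sshd</code>, <code>nginx</code>, or the API supervisor.</p>
<p>Of course, all of these optimizations only push the 512MB boundary out a little. The real fix is still upgrading to 1GB of RAM. A low-end box can be tinkered with, but it should not be trusted blindly.</p>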
]]></content:encoded>
      
    </item>
    
  </channel>
</rss>
