An Avoidable P1 Incident: An Nginx Change Caused Every Gateway Request to Return 400
Background
Our project uses Spring Cloud Gateway as the gateway that receives request traffic from the various business lines over the public internet. In front of the gateway sit two Nginx machines acting as reverse proxies; the gateway performs a series of pre-processing steps and then forwards requests to the downstream business services. The network path, in brief:
Gateway domain (wmg.test.com) -> ... -> Nginx -> F5 (hardware load-balancer domain fp.wmg.test) -> Gateway -> Business systems
One day, the team that operates Nginx needed to add two new Nginx machines (the reason is a long story and beside the point) to replace the two existing machines reverse-proxying the gateway.
SRE Classified It as a P1
On a dark and windy night, the Nginx operations team carried out the production change: they deployed Nginx on the two new machines, then had the network team switch the gateway domain's traffic over to them. The moment the switch completed, people from the business-line teams reported that every API request going through the gateway was returning 400. The Nginx team had the network team switch the gateway domain's traffic back to the original two machines, and requests through the gateway recovered. The outage lasted a bit over two minutes, and SRE classified the incident as a P1.
The Nginx team said the configuration on the two new machines was identical to that on the original two and they could not see anything wrong, so they came to me and asked me to check the gateway for error logs.
That seemed unlikely to me: if the new configuration really matched the old one, requests would not all be failing with 400. Still, I checked the gateway's logs, and there were no errors in that time window.
I then looked at the new Nginx's logs. OPTIONS requests were returning 204 normally, while GET and POST requests were all returning 400. OPTIONS requests are CORS preflights, which are answered directly at the Nginx layer. A sample from the new Nginx's logs:
10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > 10.x.x.x:8099 OPTIONS /api/xxx HTTP/1.1 > 204 > 0 > //domain/ > Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 > - > [req_time:0.000 s] >[upstream_connect_time:- s]> [upstream_header_time:- s] > [upstream_resp_time:- s] [-]
10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > 10.x.x.x:8099 POST /api/xxx HTTP/1.1 > 400 > 0 > //domain/ > Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 > - > [req_time:0.001 s] >[upstream_connect_time:0.000 s]> [upstream_header_time:0.001 s] > [upstream_resp_time:0.001 s] [10.x.x.x:8082]
I went to the network team. Their traffic-replay appliance confirmed that the 400s were indeed returned by the gateway and never reached the downstream business systems. Since 400 means Bad Request, I suspected the request body and asked the network team to pull the packet captures for that window so I could analyze them. They declined and only gave me the business message payloads. Payloads going through the gateway are encrypted; I could decrypt them correctly with a local program, and I reported that back to the Nginx team.
The Nginx team spent more time trying to locate the problem, still got nowhere, and came back asking me to help investigate.
Getting Involved
I asked for the test-environment address so I could first try to reproduce the problem there. A member of the Nginx team told me there was no test-environment setup: another team member had made this change directly in production.
??
I got hold of the new and old Nginx configuration files and diffed them; they were not the same. The old Nginx's reverse-proxy configuration for the gateway:
server {
    listen 8080;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://fp.wmg.test:8090;
    }
}
The new Nginx's configuration:
upstream http_gateways {
    server fp.wmg.test:8090;
    keepalive 30;
}
server {
    listen 8080 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
The new configuration differs from the old one in two ways:
- An upstream block pointing at the gateway's F5 load-balancer address:
  upstream http_gateways { server fp.wmg.test:8090; keepalive 30; }
- The HTTP protocol set to 1.1 with keep-alive enabled toward the upstream:
  proxy_http_version 1.1; proxy_set_header Connection "";
I had the Nginx team mimic the production setup on a test-environment Nginx using the new configuration:
Nginx: 10.100.8.11, listening on port 9104
Gateway: 10.100.22.48, listening on port 8081
Nginx forwards port 9104 to the gateway's port 8081, configured as follows:
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name localhost;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
Reproducing the Problem
Requesting a backend service endpoint through the Nginx-fronted gateway reproduced the problem: the request returned 400:
curl -v -X GET http://10.100.8.11:9104/wechat-web/actuator/info
Removing the following two directives made the request return 200 normally:
proxy_http_version 1.1;
proxy_set_header Connection "";
Blame from Out of Nowhere
I reported this finding to the Nginx team. After investigating for quite a while, they concluded that the gateway did not support persistent connections and would have to be modified.
??
That could not be right. The gateway has always been deployed with rolling releases: traffic for one machine is drained on the F5, the gateway service on that machine is restarted, and then the F5 restores its traffic. While draining on the F5 there are long-lived connections present, and each drain takes around five minutes before a leg's traffic is fully gone.
Fine. I set my own work aside and spent some time proving that the gateway does support persistent connections.
From the Nginx machine, I requested a backend service endpoint through the gateway on the command line, explicitly using a persistent connection:
wget -d --header="Connection: keepalive" http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info
Hitting Enter produced the following output:
Setting --header (header) to Connection: keepalive
DEBUG output created by Wget 1.14 on linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Connecting to 10.100.22.48:8081... connected.
Created socket 3.
Releasing 0x0000000000c95a90 (new refcount 0).
Deleting unused 0x0000000000c95a90.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (7.75 MB/s) - ‘info’ saved [83]
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info.1’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (9.47 MB/s) - ‘info.1’ saved [83]
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info.2’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (11.1 MB/s) - ‘info.2’ saved [83]
FINISHED --2025-07-17 13:45:08--
Total wall clock time: 0.1s
Downloaded: 3 files, 249 in 0s (9.25 MB/s)
The output shows that the first request created socket 3 with Connection: keepalive and succeeded with HTTP status 200; the second request reused that connection (fd 3) and succeeded with 200; and the third request reused the same connection again, also returning 200.
So the gateway does support persistent connections. I reported this to the Nginx team; they investigated for another long while, then came back saying they would have to ask me to investigate and fix the problem after all.
Digging Deeper
On the test-environment Nginx machine 10.100.8.11, I used tcpdump to capture the traffic to and from the gateway:
tcpdump -vv -i ens192 host 10.100.22.48 and tcp port 8081 -w /tmp/ng400.cap
I located a request that got an HTTP 400; in the capture, the response to the wechat-web/actuator/info request was: HTTP/1.1 400 Bad Request
Examining the request itself, one request header caught my attention: the value of the Host header was http_gateways.
Consulting the references, the HTTP/1.1 specification requires that every HTTP/1.1 request carry a Host header:
- Both clients and servers MUST support the Host request-header.
- A client that sends an HTTP/1.1 request MUST send a Host header.
- Servers MUST report a 400 (Bad Request) error if an HTTP/1.1
request does not include a Host request-header.
- Servers MUST accept absolute URIs.
As for its format, a Host value may contain the special characters '.' and '-'; '_' is not supported.
The Nginx documentation shows that proxy_set_header has two default values:
proxy_set_header Host $proxy_host;
proxy_set_header Connection close;
So with HTTP/1.1 enabled, if no Host is explicitly configured, Nginx sends $proxy_host; and when proxying to an upstream, $proxy_host is the upstream's name. Here the upstream name contains '_', which is not a valid Host format.
One reason HTTP/1.1 mandates the Host header in the first place is virtual hosting: serving multiple domains from a single IP address, so that a backend can handle requests differently depending on the incoming Host.
Older HTTP/1.0 clients assumed a one-to-one relationship of IP addresses and servers; there was no other established mechanism for distinguishing the intended server of a request than the IP address to which that request was directed. The changes outlined above will allow the Internet, once older HTTP clients are no longer common, to support multiple Web sites from a single IP address, greatly simplifying large operational Web servers, where allocation of many IP addresses to a single host has created serious problems.
那么只要遵循了HTTP/1.1協議規范的框架(Tomcat、SpringCloudGateway、...)在解析Host時發現Host不是合法的格式時,就響應了400。
I set up a local environment and stepped through the gateway code with a debugger. In ReactorHttpHandlerAdapter, the class that adapts incoming HTTP requests for Spring Cloud Gateway, the apply method responds with 400 when parsing the request URI (including the Host) fails. Its logic:
public Mono<Void> apply(HttpServerRequest reactorRequest, HttpServerResponse reactorResponse) {
    NettyDataBufferFactory bufferFactory = new NettyDataBufferFactory(reactorResponse.alloc());
    try {
        ReactorServerHttpRequest request = new ReactorServerHttpRequest(reactorRequest, bufferFactory);
        ServerHttpResponse response = new ReactorServerHttpResponse(reactorResponse, bufferFactory);
        if (request.getMethod() == HttpMethod.HEAD) {
            response = new HttpHeadResponseDecorator(response);
        }
        return this.httpHandler.handle(request, response)
                .doOnError(ex -> logger.trace(request.getLogPrefix() + "Failed to complete: " + ex.getMessage()))
                .doOnSuccess(aVoid -> logger.trace(request.getLogPrefix() + "Handling completed"));
    }
    catch (URISyntaxException ex) {
        if (logger.isDebugEnabled()) {
            logger.debug("Failed to get request URI: " + ex.getMessage());
        }
        reactorResponse.status(HttpResponseStatus.BAD_REQUEST);
        return Mono.empty();
    }
}
Spring Cloud Gateway logs this kind of protocol violation only at debug level, and the production log level is info, so no such log line ever appeared.
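If you need to see these rejections in a pre-production environment, one option (assuming a standard Spring Boot setup; the package is the one ReactorHttpHandlerAdapter lives in) is to raise the log level for that package:

```yaml
logging:
  level:
    org.springframework.http.server.reactive: DEBUG
```

This is too noisy for production, but it surfaces the "Failed to get request URI" message while reproducing the issue.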
Solution
Since HTTP/1.1 requires a Host header, and Nginx falls back to a default when no Host is explicitly configured, the fix is to override that default in the Nginx configuration. Per the Nginx documentation, adding the following directive solves it:
proxy_set_header Host $host;
After adding this directive to the test-environment Nginx's port-9104 proxy configuration and running the request again, it returned 200 normally.

The full configuration:
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_set_header Host $host;
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
This is not the only possible fix:
- Rename the upstream to remove the unsupported '_', e.g. http-gateways or httpgateways.
- Or set the Host explicitly to the domain: proxy_set_header Host 'domain';
Takeaways
This problem reproduces every single time with even a basic check in a test environment; it is not a case of a missed test case. Take the testing process seriously: many processes look tedious, but every one of them was paid for in blood and tears.
This post is from 博客園 (cnblogs). Author: 杜勁松. When republishing, please credit the original link: //www.xtjzw.net/imadc/p/19002991
