记一次惨痛教训---vcenter踩坑

记一次惨痛教训---vcenter踩坑

一、起因

某数据中心备份系统,采用veeam backup and repliaction 来备份业务系统数据。业务系统均部署在vmware esxi中,并且使用vcenter来管理。某天突然看到最近几天的备份任务全部失败,经询问是由于该数据中心一次意外停电之后就出现的。Vcenter登录提示如下:

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x0000564ca69d6010] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

此时大概猜到了vcenter应该是挂了,vcenter服务无法正常启动,间接导致veeam backup所有备份作业都失败。

二、无用的折腾

上述现象遇到过不少,所以经验丰富的我处理起来毫不慌张,从容蛋定的ssh到服务器,然后system-control --status --all,假装看一下有没服务没正常启动==废话,vcenter登录页面打开都503了,服务肯定不正常==。一般通过system-control --start --all百分百解决,从没失过手。但是这一次却不一般了,一直卡在Successfully started service vmon

Successfully started service VMwareSTS

Performing start operation on service VMwareDNSService...

Successfully started service VMwareDNSService

Performing start operation on profile: ALL...

Successfully started service vmon

...

接下来就是各种查看日志,各种偿试,由于是回忆的,顺序不一定正确,将就看吧。

  • 看vpxd 日志如下:
root@photon-machine [ ~ ]# tail /var/log/vmware/vpxd/vpxd.log
2023-11-22T11:47:46.149+08:00 info vpxd[04178] [Originator@6876 sub=vpxdvpxdSignal] Signal 15 received, exiting
2023-11-22T11:47:46.149+08:00 info vpxd[04178] [Originator@6876 sub=Default] Initiating VMware VirtualCenter shutdown
2023-11-22T11:49:18.318+08:00 info vpxd[04113] [Originator@6876 sub=vpxCrypt] Failed to read X509 cert; err: 151441516
2023-11-22T11:50:58.432+08:00 info vpxd[04113] [Originator@6876 sub=vpxCrypt] Failed to read X509 cert; err: 151441516
2023-11-22T11:50:58.432+08:00 info vpxd[04113] [Originator@6876 sub=HostGateway] Created ComponentManagerGatewaySource!
2023-11-22T11:50:58.432+08:00 info vpxd[04113] [Originator@6876 sub=HostGateway] Created CmConnectionFSM
2023-11-22T11:50:58.432+08:00 info vpxd[04113] [Originator@6876 sub=HostGateway] Created ComponentManagerClient.
2023-11-22T11:50:58.432+08:00 info vpxd[04113] [Originator@6876 sub=HostGateway] CmConnectionFSM::RunFSM(ST_INIT)
2023-11-22T11:50:58.433+08:00 info vpxd[04113] [Originator@6876 sub=HostGateway] stsUrlFromConfig: https://172.16.0.190/sts/STSService/vsphere.local ssoAdminUrlFromConfig: https://172.16.0.190/sso-adminserver/sdk/vsphere.local
2023-11-22T11:52:58.586+08:00 info vpxd[04113] [Originator@6876 sub=vpxCrypt] Failed to read X509 cert; err: 151441516
  • 有说要重置密码

image-20231123112459296

下面这图其实没任何关联的。

image-20231123104354983

  • 有的说要看磁盘是否满了

    附一个,手动扩容,但df查看分区容量没变化。

root@photon-machine [ / ]# df -h
Filesystem                                Size  Used Avail Use% Mounted on
devtmpfs                                  4.9G     0  4.9G   0% /dev
tmpfs                                     4.9G  712K  4.9G   1% /dev/shm
tmpfs                                     4.9G  712K  4.9G   1% /run
tmpfs                                     4.9G     0  4.9G   0% /sys/fs/cgroup
/dev/sda3                                  11G  6.8G  3.3G  68% /
tmpfs                                     4.9G  2.1M  4.9G   1% /tmp
/dev/mapper/dblog_vg-dblog                 15G   86M   14G   1% /storage/dblog
/dev/sda1                                 120M   28M   84M  25% /boot
/dev/mapper/imagebuilder_vg-imagebuilder  9.8G   23M  9.2G   1% /storage/imagebuilder
/dev/mapper/autodeploy_vg-autodeploy      9.8G   23M  9.2G   1% /storage/autodeploy
/dev/mapper/core_vg-core                   25G  1.5G   22G   7% /storage/core
/dev/mapper/db_vg-db                      9.8G  417M  8.8G   5% /storage/db
/dev/mapper/netdump_vg-netdump            985M  1.3M  916M   1% /storage/netdump
/dev/mapper/updatemgr_vg-updatemgr         99G  131M   94G   1% /storage/updatemgr
/dev/mapper/log_vg-log                    9.8G  3.3G  6.0G  35% /storage/log
/dev/mapper/archive_vg-archive             50G   47G   16M 100% /storage/archive
/dev/mapper/seat_vg-seat                  9.8G  203M  9.0G   3% /storage/seat

官网上只是告诉你update,但我是直接扩容的,参考如下:

Solution: https://communities.vmware.com/t5/VMware-vSphere-Discussions/storage-archive-100-usage/td-p/1376108

  1. SSH into the vCSA and login as root and execute the “shell” command to get shell access.
  2. Run the “df -h” command and verify the “/storage/archive” mount is at 90+% of use.
  3. Access the vCenter Server that the vCSA instance is running on and increase Disk 13 of the vCSA VM Hardware by a significant amount.
  4. In the SSH session to the vCSA, run the autogrow script “/usr/lib/applmgmt/support/scripts/autogrow.sh” .
  5. Run the “df -h” command and verify the “/storage/archive” mount Use percentage has decreased.
  6. After some time has passed, verify the vSphere Client System Configuration Node Health is “Good” .
  7. Verify the vCenter Server Appliance VAMI Health Status for Storage is “Good” .

It's ok.

/usr/lib/applmgmt/support/scripts/autogrow.sh
df -h

root@photon-machine [ / ]# df -h
Filesystem                                Size  Used Avail Use% Mounted on
devtmpfs                                  4.9G     0  4.9G   0% /dev
tmpfs                                     4.9G  712K  4.9G   1% /dev/shm
tmpfs                                     4.9G  712K  4.9G   1% /run
tmpfs                                     4.9G     0  4.9G   0% /sys/fs/cgroup
/dev/sda3                                  11G  6.8G  3.3G  68% /
tmpfs                                     4.9G  2.1M  4.9G   1% /tmp
/dev/mapper/dblog_vg-dblog                 15G   86M   14G   1% /storage/dblog
/dev/sda1                                 120M   28M   84M  25% /boot
/dev/mapper/imagebuilder_vg-imagebuilder  9.8G   23M  9.2G   1% /storage/imagebuilder
/dev/mapper/autodeploy_vg-autodeploy      9.8G   23M  9.2G   1% /storage/autodeploy
/dev/mapper/core_vg-core                   25G  1.5G   22G   7% /storage/core
/dev/mapper/db_vg-db                      9.8G  417M  8.8G   5% /storage/db
/dev/mapper/netdump_vg-netdump            985M  1.3M  916M   1% /storage/netdump
/dev/mapper/updatemgr_vg-updatemgr         99G  131M   94G   1% /storage/updatemgr
/dev/mapper/log_vg-log                    9.8G  3.3G  6.0G  35% /storage/log
/dev/mapper/archive_vg-archive             59G   47G  9.5G  84% /storage/archive
/dev/mapper/seat_vg-seat                  9.8G  203M  9.0G   3% /storage/seat

下次再100%时,再试下手动删旧的文件,虽然官方有说明会自动删除。

As the VMware said here It's fixed in the VCSA 6.7 U1b

The /storage/archive partition can be full by design, as it is aimed at storing as much WAL history as possible, and is automatically cleaned up by the archiver service by removing automatically the oldest WAL segments.

However you can run ls -lrth in the /storage/archive/vpostgres and remove ( run rm) very old files


-- rhttpproxy.log
2023-11-23T23:55:45.772+08:00 info rhttpproxy[02104] [Originator@6876 sub=Libs] [msg.dictionary.load.openFailed] Cannot open file "/etc/vmware/config": No such file or directory.
2023-11-23T23:55:45.772+08:00 info rhttpproxy[02104] [Originator@6876 sub=Libs] [msg.dictionary.load.openFailed] Cannot open file "/usr/lib/vmware/settings": No such file or directory.
2023-11-23T23:55:45.772+08:00 info rhttpproxy[02104] [Originator@6876 sub=Libs] [msg.dictionary.load.openFailed] Cannot open file "/usr/lib/vmware/config": No such file or directory.
2023-11-23T23:55:45.772+08:00 info rhttpproxy[02104] [Originator@6876 sub=Libs] [msg.dictionary.load.openFailed] Cannot open file "~/.vmware/config": No such file or directory.
2023-11-23T23:55:45.772+08:00 info rhttpproxy[02104] [Originator@6876 sub=Libs] [msg.dictionary.load.openFailed] Cannot open file "~/.vmware/preferences": No such file or directory.
--cm.log
2023-11-24T11:11:33.282+08:00 [main [] INFO  com.vmware.cis.services.cm.service.SecurityAgent ()] SecurityAgent: System domain is 'vsphere.local', CM solution UPN is 'machine-061848d2-faff-4f5e-8027-f9d1ee2bc3bd@vsphere.local'
 --vpxd-svcs.log
 , maxIdleTime: 120000 ms, maxKeepAliveRequests: 100]
2023-11-24T11:18:02.689+08:00 [main  INFO  com.vmware.vim.dataservices.DataService  opId=] Inventory services Access logging enabled...
2023-11-24T11:18:02.689+08:00 [main  INFO  com.vmware.vim.dataservices.DataService  opId=] Inventory services server started.
2023-11-24T11:18:02.689+08:00 [main  INFO  com.vmware.vim.dataservices.DataService  opId=] Server startup time: 390 ms
2023-11-24T11:18:02.689+08:00 [main  INFO  com.vmware.vim.dataservices.VpxdSvcsMain  opId=] Start service completed....
2023-11-24T11:18:21.315+08:00 [Thread-9  WARN  com.vmware.cis.server.util.VpxdClient  opId=] Cannot handle exception during retry: com.vmware.vim.vmomi.client.exception.ConnectionException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8085 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-11-24T11:18:21.316+08:00 [Thread-9  WARN  com.vmware.cis.server.util.impl.InitPoolTask  opId=] Init pool encountered exception: com.vmware.cis.server.util.exception.VpxdClientException at attempt 1
2023-11-24T11:18:41.360+08:00 [Thread-9  WARN  com.vmware.cis.server.util.VpxdClient  opId=] Cannot handle exception during retry: com.vmware.vim.vmomi.client.exception.ConnectionException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8085 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-11-24T11:18:41.360+08:00 [Thread-9  WARN  com.vmware.cis.server.util.impl.InitPoolTask  opId=] Init pool encountered exception: com.vmware.cis.server.util.exception.VpxdClientException at attempt 2
2023-11-24T11:18:44.171+08:00 [refresh-lotus-locator-task  INFO  com.vmware.cis.lotus.LotusLocator  opId=] vmAfClient.getDomainControllerEx("") : 172.16.0.190
2023-11-24T11:18:44.172+08:00 [refresh-lotus-locator-task  INFO  com.vmware.cis.lotus.LotusLocator  opId=] Lotus hostname URL : 172.16.0.190
2023-11-24T11:18:44.174+08:00 [refresh-lotus-locator-task  INFO  com.vmware.cis.lotus.LotusLocator  opId=] vmAfClient.getDomainName() in baseDn format : dc=vsphere,dc=local
2023-11-24T11:18:44.180+08:00 [refresh-lotus-locator-task  INFO  com.vmware.cis.lotus.LotusLocator  opId=] Successfully refreshed machine account credentials
2023-11-24T11:19:01.400+08:00 [Thread-9  WARN  com.vmware.cis.server.util.VpxdClient  opId=] Cannot handle exception during retry: com.vmware.vim.vmomi.client.exception.ConnectionException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8085 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-11-24T11:19:01.400+08:00 [Thread-9  WARN  com.vmware.cis.server.util.impl.InitPoolTask  opId=] Init pool encountered exception: com.vmware.cis.server.util.exception.VpxdClientException at attempt 3
2023-11-24T11:19:21.469+08:00 [Thread-9  WARN  com.vmware.cis.server.util.VpxdClient  opId=] Cannot handle exception during retry: com.vmware.vim.vmomi.client.exception.ConnectionException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8085 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-11-24T11:19:21.469+08:00 [Thread-9  WARN  com.vmware.cis.server.util.impl.InitPoolTask  opId=] Init pool encountered exception: com.vmware.cis.server.util.exception.VpxdClientException at attempt 4

--vpxd.log
2023-11-24T11:25:04.014+08:00 info vpxd[19242] [Originator@6876 sub=ThreadPool] Entering worker thread loop
2023-11-24T11:25:04.036+08:00 error vpxd[18951] [Originator@6876 sub=Main opID=CheckCertificateExpiry-6d7d4b3] Unable to get certificate count for APPLMGMT_PASSWORD from VECS localhost, error: 0
2023-11-24T11:25:34.014+08:00 warning vpxd[18254] [Originator@6876 sub=VpxProfiler opID=EventManagerProcessJobs-46cdbe2] EventManagerProcessJobs [TotalTime] took 30010 ms
2023-11-24T11:25:44.349+08:00 info vpxd[18185] [Originator@6876 sub=vpxCrypt] Failed to read X509 cert; err: 151441516
  • 还有的说要清除软连接
-- 清空所有没用的软连接?,
# List all disabled services for removal. 

find /etc/systemd/system/ -lname '/dev/null' -exec ls {} \;  

# Automatically remove them (or rm each file)

find /etc/systemd/system/ -lname '/dev/null' -exec rm {} \; 

# Relaod systemctl daemon

systemctl daemon-reload 

# Start services or Reboot

service-control --start --all
  • 还有说证书过期的

    可是我查了证书信息,都正常的。

  • dns配置问题

    ping 主机名正常,查看/etc/hosts、/etc/resolv.conf配置也没发现什么问题,毕竟都没做过修改啊。

    image-20231129153353529

    image-20231129153419027

经过上述分析和各种尝试,无果。

三、 还原及重新安装

vcenter系统也有用veeam备份,因此可以直接还原,只是vcenter挂了,需要在veeam上添加对应esxi到inventory里,然后还原。我以为还原很快就能解决,谁知又是像上面那样无限循环,完全要么是cm服务没启动,要么是vpxd服务没启动,换了几个还原点,都一样报错,就是不能完整把服务启动起来,而且似乎每次报错以及能启动的服务都不太一样。

vcenter各主要服务依赖如下,可以用类似service-contorl --list-depencies vmware-cm的命令来获取依赖信息。

image-20231129233257852

image-20231129234227013

难道备份的文件有问题?于是决定直接重新安装,一开始想着用新一点的版本,找了个vcenter6.7 U3安装包,安装完后进入第二阶段配置,结果发现安装进度非常慢,最后卡在60%并报错,大概是这个提示The supplied System Name photon-machine is not valid,应该是dns出问题了,但是按上述方法检查,又没发现有什么异常,而且是全新安装的,此前这么多次安装也没有遇到过这问题。

经再次还是不行,怀疑是ISO镜像问题,又用certutil 工具获取了sha1,对比发现sha1值一致,感觉又进入了死循环。索性直接找到几个数据中心都在用的6.7U2,安装,还是一样的问题,主要服务根本启动不全,而且耗时非常久,这下人都傻眼了,不会这么邪门吧!

思来想去,想到了会不会是vcenter安装时必须要联接外网?不过当时已经夜深人静了,没继续搞下去。第二天早上,直接把vcenter外网打开,重新安装系统,一切都非常顺利,所有服务启动正常,竟然如此丝滑!但是我不想让vcenter连接外网,于是在防火墙上只开启源为vcenter 目的为DNS 服务器(114.114.114.114、8.8.8.8)的访问。再次重新安装vcenter,同样非常顺利地安装成功!

四、总结

自安装完vcenter后,防火墙上一直都禁止vcenter访问外网的,也没遇到过什么问题,这次异常断电后为什么不行呢?而且即使全新重装vcenter,也必须连外网或者至少需要dns能正常解析。如果dns连不到外网,即使vcenter所有服务正常,都无法登录==登录时会一直在登录界面转圈==。抓包也只看到vcenter向dns服务器发送请求,并没任何其他ip访问。我是用ip访问的,又不涉及dns解析,即使用本地域名访问,设置etc/host文件理论上也可以,为什么要搞这么复杂呢,在官网也没查到相关说明。

评论已关闭