2024.04.14

Word count: 1,951. Estimated reading time: about 3 minutes.


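Introduction: While cloud platforms pursue sales and market scale, basic operations and maintenance and data security work for customers should not be neglected in the name of "cutting costs and improving efficiency."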

**Author |** Lu Qian, First Finance

On the afternoon of April 8, 2024, Tencent Cloud experienced a service failure. API responses returned internal service errors, and web pages displayed a 504 error. A 504 error indicates a gateway timeout: the server, acting as a gateway or proxy, did not receive a timely response from the upstream server.
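To make the 504 behavior concrete, here is a minimal sketch of a gateway that waits a limited time for its upstream. It is an illustration only, not Tencent Cloud's implementation; `gateway`, `fast_upstream`, `slow_upstream`, and the timeout values are hypothetical names and numbers chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def gateway(upstream, timeout_s):
    """Forward a request to `upstream` and return (status_code, body).

    Hypothetical helper: returns 504 if the upstream does not answer
    within `timeout_s` seconds, 502 if the upstream raises an error.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(upstream)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        # The gateway itself is healthy; the upstream was too slow.
        return 504, "Gateway Timeout"
    except Exception:
        return 502, "Bad Gateway"
    finally:
        # Do not block the caller on a slow upstream call.
        pool.shutdown(wait=False)


def fast_upstream():
    return "ok"


def slow_upstream():
    time.sleep(0.5)  # simulates an overloaded or hung backend
    return "too late"


if __name__ == "__main__":
    print(gateway(fast_upstream, timeout_s=0.2))
    print(gateway(slow_upstream, timeout_s=0.2))
```

The point of the sketch is that a 504 tells the client the gateway was reachable but the service behind it was not answering in time.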

On the evening of April 8, Tencent Cloud announced that services related to the console on the Tencent Cloud official website were behaving abnormally and that engineers were carrying out emergency repairs. Service had recovered in some regions, and repairs were continuing in others.

On April 14, Tencent Cloud published an official explanation of the April 8 failure. At 15:23 on April 8, the Tencent Cloud team received an alert that the cloud API service was in an abnormal state. Immediately afterward, a large number of customer reports about being unable to log in to the Tencent Cloud console began to appear in channels such as Tencent Cloud support tickets, after-sales service groups, and Weibo.

After locating the fault, the team found that customers' inability to log in to the console was caused by an exception in the cloud API. The cloud API is a unified collection of open interfaces on the cloud, through which customers can programmatically manage and control cloud resources. The cloud console provides its interactive web functions by combining cloud API calls. According to Tencent Cloud, the new version of the cloud API service gave insufficient consideration to forward compatibility, and the gray release mechanism for configuration data was inadequate. During the version change, sandbox verification and contingency-plan drills were not effectively carried out, which exposed shortcomings in change management.
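A gray release for configuration data generally means exposing a change to a small share of customers first and widening it only if nothing breaks. The following is a minimal sketch of one common approach, deterministic percentage bucketing. It is not Tencent Cloud's mechanism; `bucket`, `pick_config`, the customer IDs, and the config labels are all hypothetical.

```python
import hashlib


def bucket(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_config(customer_id: str, rollout_percent: int,
                old_config: str, new_config: str) -> str:
    """Serve the new config only to customers inside the rollout share."""
    if bucket(customer_id) < rollout_percent:
        return new_config
    return old_config


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    for percent in (0, 5, 50, 100):
        on_new = sum(
            pick_config(c, percent, "v1", "v2") == "v2" for c in customers
        )
        print(f"rollout {percent:3d}%: {on_new} of {len(customers)} on v2")
```

Because the bucket is derived from the customer ID, a customer who receives the new configuration at 5% still receives it at 50%, and setting the rollout back to 0% returns everyone to the old configuration.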

Tencent Cloud stated that, after the failure occurred, some public cloud services that rely on cloud APIs to provide product capabilities also became unusable because of the cloud API exception. These included cloud functions, optical character recognition, the microservice platform, audio content security, and Captcha. The failure lasted nearly 87 minutes in total, during which 1,957 customers reported the failure.

From a customer's perspective, cloud services can be divided into a data plane and a control plane. The data plane carries the customer's own business, while the control plane is responsible for operating the various products on the cloud. For example, the most widely used IaaS services today are mostly oriented directly to the data plane, and the control plane is involved only when customers purchase resources or need to adjust resource levels. The console and cloud API that failed this time belong to the control plane. In layman's terms, if cloud services are compared to a hotel, the console is the hotel's front desk, a unified service entrance. If the front desk fails, management functions such as check-in and renewal become unavailable, but guests already checked in to their rooms are not affected.
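The hotel analogy can be expressed as a small model. This is a toy sketch under simplified assumptions, not a description of Tencent Cloud's architecture; the `Cloud` class, its methods, and `ControlPlaneUnavailable` are hypothetical names.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when a management operation is attempted during an outage."""


class Cloud:
    def __init__(self):
        self.control_plane_up = True
        self.servers = {}  # server name -> state

    # Control plane: the "front desk" (create, resize, renew).
    def create_server(self, name: str) -> None:
        if not self.control_plane_up:
            raise ControlPlaneUnavailable("console/API is down")
        self.servers[name] = "running"

    # Data plane: the "room" that is already checked in.
    def handle_request(self, name: str, payload: str) -> str:
        if self.servers.get(name) != "running":
            raise KeyError(f"no running server named {name!r}")
        return f"{name} handled {payload}"


if __name__ == "__main__":
    cloud = Cloud()
    cloud.create_server("web-1")
    cloud.control_plane_up = False  # simulate the control plane outage

    # Existing workloads keep serving traffic.
    print(cloud.handle_request("web-1", "GET /"))

    # New management operations fail until the control plane recovers.
    try:
        cloud.create_server("web-2")
    except ControlPlaneUnavailable as err:
        print("cannot create web-2:", err)
```

In this model, services that must call the management API on every request would fail along with the control plane, which matches the distinction the article draws between already-running IaaS resources and API-dependent products.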

Tencent Cloud said that IaaS resources such as servers that customers had already configured, including services that were already deployed and running, were not affected by the cloud API exception during this failure. Other PaaS and SaaS services that do not rely on cloud APIs to deliver their services also remained in normal operation. However, products delivered through APIs (those requiring "hotel front desk services") were affected to varying degrees. For example, calls to Tencent Cloud storage services dropped significantly that day. During this period, the after-sales team helped some customers carry out their business disaster recovery plans and rescheduled affected services to quickly restore customers' business services.

Tencent Cloud stated that it will make improvements in three aspects: improving system resilience, strengthening change management and protection measures, and enhancing fault response and communication capabilities.

In recent years, application "crashes" caused by cloud service problems have occurred frequently. On April 9 this year, an Alipay outage appeared on the hot search list. Users reported that a page reading "has stopped visiting" appeared when they used the Alipay app. Alipay then responded that a small number of users had experienced temporary access problems on some pages, that the situation had been quickly resolved, that users' funds and information security had not been affected, and that all functions could be used normally. However, it did not further explain the specific cause of the incident.

On the evening of December 3, 2023, Tencent Video "crashed" and appeared on the Weibo hot search list. Tencent Video responded that a temporary technical problem had occurred and was being repaired as quickly as possible, and that various functions were gradually being restored.

On the evening of November 27, 2023, the Didi app suffered a system failure that left the service unusable across large parts of the country. On November 29, Didi issued a statement saying that its services had been restored and that the cause of the incident had been preliminarily identified as a failure of the underlying system software.

At around 20:20 on March 5, 2023, during the peak period of user activity on Bilibili (Station B), many users found that the video details page could not be accessed from either mobile phones or computers. The Bilibili team resolved the problem 20 minutes after the failure occurred that night. Many people in the industry tend to attribute it to a "code failure in iterative updates", which was the official explanation given after Bilibili's large-scale server crash in July 2021.

Had Didi's prolonged nationwide outage not caused such widespread negative impact and discussion, people outside the industry would not have treated the temporary "crash" of a piece of software as a hot topic. Sun Qi, CTO of Wanbo Zhiyun, told First Finance that the Didi incident was only an individual case, but that it was a high-severity failure that did affect the lives of a considerable number of ordinary people. In fact, software failures that most users never see occur every day, and this is a relatively common problem in the industry.

After this large-scale Tencent Cloud failure, some industry insiders drew comparisons with Alibaba Cloud's major failure in November 2023. On the evening of November 12, 2023, Alibaba Cloud broke down, and topics such as "Alibaba Cloud Drive crashed", "Taobao crashed again", "Xianyu (Idle Fish) crashed", and "DingTalk crashed" appeared on the hot search list one after another. Many Alibaba products were affected. Alibaba Cloud


Author: Emma

An experienced news writer, focusing on in-depth reporting and analysis in the fields of economics, military, technology, and warfare. With over 20 years of rich experience in news reporting and editing, he has set foot in various global hotspots and witnessed many major events firsthand. His works have been widely acclaimed and have won numerous awards.
