
Data centers are miscalculating more and more often. (Image: Facebook)
Until now, CPUs have been considered largely reliable in their calculations, despite repeated arithmetic errors. That seems to be changing, as Google, among others, reports. Increasingly sophisticated CPUs apparently miscalculate more and more often, which is particularly evident in large data centers. Facebook also recently observed an increase in so-called “silent data corruption”.
Google and Facebook: CPUs corrupt data
Google engineer Peter Hochschild reported last week, as part of the Hot Topics in Operating Systems (HotOS) 2021 conference, that production teams at the search engine company were increasingly complaining about machines that corrupt data. The machines reportedly damaged various stable and otherwise error-free applications. Conventional investigations, however, found no errors, according to a corresponding report.
The Google engineers then turned their attention to the hardware. The result: hardware errors occurred more frequently than expected. In addition, the problems appeared sporadically, often long after installation, and typically affected individual CPU cores rather than entire chips. Google describes the phenomenon as silent corrupt execution errors (CEEs) and the misbehaving cores as unpredictable.
Google blames CPU designs
Back in February, Facebook published a report in which the social media group described silent data corruption as a phenomenon that now occurs in data centers more often than predicted. Facebook did not give a reason for this. For Google, meanwhile, it is clear that ever faster computing and ever smaller CPU designs are responsible, as The Register writes.
Calculation errors can have serious consequences. A CPU in a Google data center is said to have carried out what amounted to an accidental ransomware attack: the machine encrypted data incorrectly, in such a way that only it could decrypt the data again. The experts also see crashes and data loss as growing challenges. Google and Facebook now want to expand their tests to find solutions to the problem.
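The reports summarized here do not spell out what such tests look like. One common idea for catching silent corruption is to run the same deterministic workload several times and compare the results. The following Python sketch illustrates that idea only; it is not Google's or Facebook's actual test. The names `checksum_workload`, `find_disagreeing_workers` and `flaky_worker` are hypothetical, and the "defective core" is merely simulated by a function that flips one bit.

```python
import hashlib
from collections import Counter
from typing import Callable, List


def checksum_workload(data: bytes) -> str:
    """Deterministic reference workload: the same input must always yield the same digest."""
    return hashlib.sha256(data).hexdigest()


def find_disagreeing_workers(
    workers: List[Callable[[bytes], str]], data: bytes
) -> List[int]:
    """Run the same input through every worker and return the indices
    of workers whose result deviates from the majority result."""
    results = [worker(data) for worker in workers]
    majority_result, _ = Counter(results).most_common(1)[0]
    return [i for i, result in enumerate(results) if result != majority_result]


def flaky_worker(data: bytes) -> str:
    """Hypothetical stand-in for a defective core (assumption for illustration):
    flips one bit of the input before hashing, with no crash and no error message."""
    corrupted = bytes([data[0] ^ 0x01]) + data[1:]
    return checksum_workload(corrupted)


# Three healthy workers and one simulated faulty one.
workers = [checksum_workload, checksum_workload, flaky_worker, checksum_workload]
suspects = find_disagreeing_workers(workers, b"payload under test")
print(suspects)  # [2]
```

In a real data center, each worker would have to be pinned to a specific physical core, and a majority vote only helps as long as most cores compute correctly. Because the errors described above occur sporadically, a single clean run would not prove that a core is healthy.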