test only one update at a time, test only one update at a time, test only one update at a time...
Posted by: Anonymous
on September 15, 2007 05:04 PM
A few years ago, I had several dozen Sun SPARC servers on a private network, running the jobs that pay the bills under Solaris. The only time those servers' operating systems were updated was to solve a problem. The consequence was cumulative uptime in the five nines. The only significant events during that time were the WTC's demolition and a large regional power failure, and only the power failure stopped the financial processes that run on those computers.
In the last four years, those dozens of Sun servers have been replaced by no-name, high-end PCs with AMD processors, in arrays of 1U, 2U, 4U, etc. rack enclosures. The operating system is a Fedora release that I have vetted and, for servers that have special boards, sometimes modified with a different device driver. There are two consequences of that change: savings and reduced availability. Our budget dropped because, instead of paying Sun in the high six figures for a maintenance contract, we now spend about 100 grand on new hardware every year. The reduced availability is caused by the lower quality of the commodity PC hardware that we buy, as well as an expected shorter working lifetime for PC equipment than for our old Suns. As a matter of fact, we decommission our AMD servers when they become two years old because we don't want failures caused by low-grade printed circuit board manufacturing or by low-grade capacitors drying out.
The benefits of switching from Solaris to Linux are mixed. We have faster hardware, which allows us to try new, computationally intensive approaches at a much lower cost. Our availability decreased from five nines to four nines, but we address that by having many more servers to shove into the breach. We have noticed that the Linux TCP stack seems to be slower and of lower quality than the Solaris TCP stack.
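As a rough illustration of what dropping from five nines to four nines means, the Python sketch below converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example, not part of any measurement described above.

```python
# Yearly downtime budget implied by an availability of N nines.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (4, 5):
    print(f"{n} nines: about {downtime_minutes_per_year(n):.1f} minutes per year")
```

Under that assumption, five nines allows roughly five minutes of downtime per year and four nines roughly fifty, a tenfold difference.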
We keep those servers running by not changing their operating systems. If the jobs that pay the bills can run, there is no reason to change any computer's software. That does not apply to our applications, which change as often as necessary to make a profit. The point is that the computing base for the applications is always dependable and predictable.
In our environment, there are also desktops; some run Ubuntu, a few run Fedora, and the rest run Windows. We maintain local, private mirrors of the Fedora and Ubuntu repositories although automatic updates are disabled. If a user needs to change something, it's easy to make the requested upgrade, but we try not to fix what ain't broke.
When a change is significant, meaning that the processes that pay the bills could be negatively affected, we try the patch on a test computer and test and test and test. If the patch passes muster, it gets added slowly to the systems that need it, never all at once to all systems.
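The "slowly, never all at once" part can be sketched in Python as a staged rollout. This is only a minimal illustration under stated assumptions: the function names, the wave sizes, and the `apply_patch` and `healthy` callbacks are all hypothetical, and the real procedure may look quite different.

```python
from typing import Callable, List


def rollout_waves(hosts: List[str], first_wave: int = 1, growth: int = 2) -> List[List[str]]:
    """Split hosts into successive waves that grow in size (hypothetical policy)."""
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be at least 1")
    waves: List[List[str]] = []
    size, start = first_wave, 0
    while start < len(hosts):
        waves.append(hosts[start:start + size])
        start += size
        size *= growth
    return waves


def staged_rollout(
    hosts: List[str],
    apply_patch: Callable[[str], None],  # hypothetical: installs the patch on one host
    healthy: Callable[[str], bool],      # hypothetical: True if the host still does its job
) -> List[str]:
    """Patch one wave at a time; stop after the first wave with an unhealthy host.

    Returns the hosts that were patched before the rollout stopped.
    """
    patched: List[str] = []
    for wave in rollout_waves(hosts):
        for host in wave:
            apply_patch(host)
            patched.append(host)
        if not all(healthy(host) for host in wave):
            break  # leave the remaining hosts untouched
    return patched
```

Starting with a single host and growing each wave keeps the number of machines exposed to a bad patch small, at the cost of a slower rollout.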
Open source software is a great way to reduce operating expenditures drastically, but we learned the hard way to not change our computing platforms without testing things ourselves, for our needs, in our environment.