THIS COLUMN augments comments made by Mario Rodriguez at this page.
Part One: The Quick History of PC Benchmarking
To paraphrase Mark Twain, there are lies, damn lies, and benchmarks. Mario Rodriguez recently took to task the
problem of benchmarking CPUs, the incumbent battle between Intel and AMD. As I've got more than twenty-five years
performing such benchmarking, I have some background on the subject that can augment what Mario's frustrated with.
In the early days of microcomputing (a phrase that's not used much anymore), there were contests comparing microprocessors against other processors and processor arrays found in mini-computers and mainframes. It was acknowledged that in theory, CPU performance had several different characteristics that were important to test for comparison reasons; most of them were strictly related to math and memory management.
Microprocessor CPUs (and I'll use the shortened phrase CPU to narrow the category to microprocessors) do math and handle memory addressing and I/O addressing. The amount of system memory a CPU can address and the width of its data path are determined by the number of actual electrical lines emerging from the CPU. Each CPU has a clock, either attached externally or generated internally, that's used to frame the operations going on inside of the CPU. Faster, obviously, is better. "Fast" has traditionally been measured as the speed of the clock(s) controlling processor fetch and execution.
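The link between address lines and memory size is plain arithmetic: n address lines can select 2^n distinct locations. Here is a minimal sketch, assuming byte-addressable memory. The function and table names are illustrative only, and the line counts are the commonly documented figures for these chips.

```python
def addressable_bytes(address_lines: int) -> int:
    """Return how many byte locations can be selected with this many address lines."""
    if address_lines < 0:
        raise ValueError("address_lines must be non-negative")
    return 2 ** address_lines


# Commonly documented address-line counts (listed here for illustration).
ADDRESS_LINES = {
    "8080": 16,     # 64 KiB
    "8088": 20,     # 1 MiB
    "80286": 24,    # 16 MiB
    "80386DX": 32,  # 4 GiB
}

if __name__ == "__main__":
    for chip, lines in ADDRESS_LINES.items():
        kib = addressable_bytes(lines) // 1024
        print(f"{chip}: {lines} address lines, {kib} KiB addressable")
```

Each extra address line doubles the reachable memory, which is why the jump from 16 to 20 lines mattered so much in its day.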
It's possible, by using paging techniques, to extend both memory addressing and I/O addressing. This was done to get around the limitations imposed by the 8088 chip, which was at the heart of the original IBM PC. The 8088 was a hybrid chip with 16-bit internal operations but an 8-bit external data bus. Paging techniques take time to shuffle things, and so although they work, they add time to results.
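To see why paging costs time, consider a toy model in which a small window can see only one bank of a larger memory at a time, and every change of bank is an extra step. This is a sketch under my own assumptions, not a description of how the 8088 or any particular memory board worked; the class name and the switch counter are hypothetical.

```python
class BankedMemory:
    """Toy model: a large memory reached one bank at a time through a small window."""

    def __init__(self, bank_size: int, bank_count: int) -> None:
        self.bank_size = bank_size
        self.banks = [bytearray(bank_size) for _ in range(bank_count)]
        self.current_bank = 0
        self.bank_switches = 0  # counts the extra "shuffle" steps

    def _locate(self, address: int) -> tuple[int, int]:
        """Split a flat address into (bank, offset) and select that bank."""
        if not 0 <= address < self.bank_size * len(self.banks):
            raise IndexError("address out of range")
        bank, offset = divmod(address, self.bank_size)
        if bank != self.current_bank:
            self.current_bank = bank
            self.bank_switches += 1  # the overhead paging adds
        return bank, offset

    def read(self, address: int) -> int:
        bank, offset = self._locate(address)
        return self.banks[bank][offset]

    def write(self, address: int, value: int) -> None:
        bank, offset = self._locate(address)
        self.banks[bank][offset] = value


if __name__ == "__main__":
    mem = BankedMemory(bank_size=16 * 1024, bank_count=8)
    for i in range(1000):          # stays inside bank 0: no switches
        mem.write(i, i % 256)
    print("sequential switches:", mem.bank_switches)
    for i in range(1000):          # alternates between bank 0 and bank 1
        mem.read((i % 2) * mem.bank_size)
    print("alternating switches:", mem.bank_switches)
```

The same amount of data moves in both loops; only the access pattern changes, and with it the number of extra steps. A benchmark that happens to favour one pattern will flatter or punish a paged design accordingly.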
In the early days of microcomputing, we used a variety of methods to perform analysis of possible speed. Some methods revolved around computational efficiencies, such as the ability to calculate number series and so on. Eventually, CPUs emerged that had both integer and floating point math operations on them, thereby combining what had been two different chips into one housing. Until that point, two sets of benchmarks were often published about CPU performance, one centered on integer performance and the other on floating point math performance.
At the heart of every CPU is its instruction set, those minimalist commands that demand work. A debate arose about whether the number of instructions might have a bearing on performance, as a minimalist instruction set was felt to be optimal from an engineering and, potentially, a software development standpoint.
The Reduced Instruction Set Computer (RISC) became one family, while the Complex Instruction Set Computer (CISC) became another. This fork in CPU design thought caused endless conversation and argument. The *nix families, at the time mini-computer and mainframe makers largely, planted stakes in the RISC camps, while the microcomputer family tended to center around CISC.
The desire for 16-bit computing was strong because of its ability to have a larger memory address and larger data value component. RISC vs. CISC started showing itself in even the microcomputer families. Intel (and its "clone/derivative") families stayed CISC, while Motorola, IBM, and Apple tended to be RISC. Today, the lines are blurred.
So, to summarize so far, there were two major processor categories, "big" CPUs (think minicomputers/mainframes) and microprocessors (having many functions and capabilities merged onto one substrate that replaced many ancillary components used in "big" CPU architectures). Microprocessor CPUs then divided into categories based on their memory size and word fetch (e.g. 8-bit, 16-bit, 32-bit, and now 64-bit).
The next subcategory of CPUs related to RISC vs CISC, with divisions about whether or not floating point math was done onboard, or with the addition of a floating point processor.
Whew. We're not anywhere near done yet.
Part Two: CPU Performance is Relative and Transient
The real performance of processors is relative, and transient at best. This is because CPUs are in constant
product change and subject to Moore's Law. But there's another, more onerous problem about CPU comparisons: standalone
CPUs are useless, and are inevitably connected to a motherboard that contains components that glue the CPU to its
memory and its peripherals. The glue components and peripheral/peripheral connectivity are key to overall systems
performance.
The original IBM PC was inventive for 1981, but proved to be a design that didn't evolve well of its own accord. The hybrid processor (the 4.77 MHz Intel 8088) and its bus (the ISA, or Industry Standard Architecture, bus) had a slow clock speed and an even slower bus fetch time. Hardware peripherals or software apps that didn't synchronize perfectly didn't work well.
As time went on, many vendors of peripherals, such as graphics display adapters, learned how to offload work from the CPU to improve perceived visual performance, as did network card vendors. But the ISA bus was still a stumbling block. IBM introduced an interim solution called the "Micro Channel" bus, and Compaq countered with a less radical architectural bus evolution called the EISA or "Extended/Enhanced ISA" bus. Both were designed to integrate peripheral boards more easily into motherboard architecture. Both were eventual failures. Today, the evolving PCI bus has become a common denominator among microprocessor-based systems.
When viewed as a whole, the benchmarks for desktop computers were often based on simple measurements, such as how fast memory could be moved, math problems performed, and graphics displays manipulated. Disk benchmarks told of how fast information could be read from and written to hard disks (floppy disks were so slow as to be irrelevant). Network benchmarks told how fast information could be obtained from a downstream host, and so on. But these benchmarks didn't take a holistic or composite view of performance; they produced just a weighted comparison number from the discrete benchmark measurements melded together. This is still how many benchmarks are written and used today. A counterweight to this is the discrete measurement benchmark, such as the widely abused IOMeter, written by Intel.
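How much the "melding" matters is easy to show. The sketch below folds four discrete results into one number using a weighted geometric mean of ratios against a baseline machine. That formula is my assumption for illustration, not the method of any particular benchmark suite, and the systems, scores, and weights are invented. All scores are assumed to be "higher is better".

```python
import math


def normalize(scores: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Express each score as a ratio to the baseline machine's score."""
    return {name: scores[name] / baseline[name] for name in baseline}


def composite(ratios: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted geometric mean of the ratios; 1.0 means 'equal to baseline'."""
    total_weight = sum(weights.values())
    log_sum = sum(weights[name] * math.log(ratios[name]) for name in weights)
    return math.exp(log_sum / total_weight)


# Invented numbers, for illustration only.
BASELINE = {"memory": 100, "math": 100, "graphics": 100, "disk": 100}
SYSTEM_A = {"memory": 120, "math": 150, "graphics": 100, "disk": 80}
SYSTEM_B = {"memory": 110, "math": 100, "graphics": 100, "disk": 140}

MATH_HEAVY = {"memory": 1, "math": 4, "graphics": 1, "disk": 1}
DISK_HEAVY = {"memory": 1, "math": 1, "graphics": 1, "disk": 4}

if __name__ == "__main__":
    for label, weights in (("math-heavy", MATH_HEAVY), ("disk-heavy", DISK_HEAVY)):
        a = composite(normalize(SYSTEM_A, BASELINE), weights)
        b = composite(normalize(SYSTEM_B, BASELINE), weights)
        winner = "A" if a > b else "B"
        print(f"{label}: A={a:.3f} B={b:.3f} winner={winner}")
```

The same two machines trade places depending on the weights alone, which is why a single composite number means little unless you know what went into it.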
What's On Top
Today, there are numerous versions of Windows, MacOS, and *nix, not counting alternate operating systems such as QNX
and others. These operating systems and especially their settings have an enormous impact on perceived systems
performance. On top of the operating system chosen for a benchmark test are the applications or benchmarks that
will be used to discern performance. Consider these as exponents to the job of benchmarking products. Oddly, few people
understand or even take the time to optimize these platforms by tuning them; a very small, brave, and fanatical group
of individuals do in fact take performance optimizing seriously. Yet such optimizations, while laudable, seem to be
transient still, varying from each operating system/application combination to the next.
There are benchmark application matrices, such as those offered by BAPCo's Sysmark, that take snapshots of real-world apps and try to emulate and stratify their performance for comparison purposes. There's been concern, as Mario Rodriguez mentioned, that the optimizations made on Sysmark have favoured Intel rather than AMD instructions. Your mileage will vary, and results are essentially meaningless unless you understand what's underneath Sysmark in the first place.
Appliances are now in the marketplace from Spirent and Ixia that can be connected to servers to profile how server applications react. These can be very useful, and we use some of these appliances today to gauge server performance characteristics. Desktop performance measurements are more elusive; I wish there were a way for desktops, be they from HP, Dell, Apple, or Sun, to have performance profiles for comparison purposes. These don't exist, and there is no compelling reason for vendors to do this, unless the user community mandates it with their buying power.
The difficulty in such benchmarking is that there are as many ways to use a computer as there are users. Gamers have different needs than office users. Servers require different resources than either gamers or office users. So far, little research has been performed on usage patterns that would establish profiles of systems usage activities. Doing so is also a moving target, and is useful only until the next version of Office, AutoCAD, or Quake VII arrives.
In the meantime, CPU speed gets faster, new types of DRAM become available, and PCI bus/peripheral performance improves, while disk speeds and interfaces become more cached and efficient. It reminds me of the old Whack-a-Mole game where once something stabilizes, something else will pop up.
Certain recent CPU innovations/techniques, such as hyperthreading, are bound to have an eventual impact on processor throughput, but the range of applications that actually benefit from this is somewhat small. The advantage claimed (and I think rightly so) of AMD's floating point math performance over that of Intel's is one quality among many that's going to have to be considered in overall performance observation and comparison.
There are no perfect tests for CPUs, and for a while, people will have to take a look at a number of differing views of performance to gauge how the views will impact how they use computing. And those results will be good for a few months until the next iteration of something skews the numbers one way or another.
What's Left Out
CPU performance, while a strong indicator of possible performance, tells little or nothing of the vendor's tech
support, the number of motherboards/chipsets available, nor of the ability for one CPU to have a noticeable impact over
that of another. Few of us actually try to upgrade our system's CPU, as relatively few motherboards support doing this.
Civilians are wisest to keep their fingers outside of the box; the DIY crowd takes their own chances upgrading CPUs,
overclocking them, and squeezing micro-incremental performance capacities from their hardware. While specs are
important, they're too often used as an almighty, empirical way of viewing and summarizing expectations. The marketing
departments at AMD and Intel are dedicated to amplifying these characteristics because they need revenues for their
companies. They're also hoping that we'll continue to succumb to the perceived need to toss out the old stuff as soon
as possible; there is no such thing in computing as a speeding ticket, and they know that.
Background
Until 1995, we (at ExtremeLabs, Inc. and predecessor organizations) wrote our own benchmarks, eschewing those
from other sources. We did desire, however, to develop or cooperate in developing benchmarks that could be used to
establish a commonality of information. The desire was that we would be able to have people read different magazines,
see a benchmark number, then be able to understand what it meant, and even compare it with the benchmarks from other
publications.
We initially called this effort the Performance Test Alliance, but as the US-based Parent Teachers Association sent us a tort, we changed it to the Performance Testing Alliance for Networks, and incorporated in New York State. Several protagonists, from Scott Bradner at Harvard to Drew Major from Novell and Gary Gunnerson from Gannett, came together to find commonality among the benchmarkers in networking at the time. We failed to get the "PTAN" organization off the ground for a number of reasons, but the strongest of them is that trade magazine publishers aren't often interested in people getting information from publications not their own.
This meant that ZD, CMP, Penton, and other computer trade magazine publishers have largely developed their own measurement tools and benchmarks either in a vacuum, or in conjunction with third parties. Publishers also found that test labs were a business opportunity, because so many vendors live and die, economically speaking, by their perceived performance. These and other publishers invested in benchmark development both to help bring information to their perceived readership and to permit their test labs to underwrite the cost of doing the performance testing. PTAN failed after six months.
Magazines have often been market "king" makers. A good product review can mean millions in sales for a vendor. A poor comparison might conversely mean poorer sales, and so vendors were and are often the private customers of these publications' test facilities.
Tom Henderson is managing director and principal researcher for ExtremeLabs, Inc. of Indianapolis. You can contact him at this address. The views he expresses are his own.