Sifting through the benchmark dilemma

Measure this, measurer
Fri Feb 14 2003, 11:49
BEFORE WE BEGIN discussing benchmarking, and the problems related to it, let's define the term.

According to dictionary.com, a benchmark is: A standard by which something can be measured or judged. Benchmarks are heavily employed in the computing world and are the criteria by which computer performance is weighed and judged. Even the most basic measurement by which computers are rated — the megahertz — is a benchmark when applied to computing. A 2 GHz machine is generally considered faster than a 1 GHz machine, for example.

If you're an Apple fan then you might prefer measuring performance in FLOPS, or floating-point operations per second. If we wanted to start at the very beginning of machine performance, we might speak of the number of operations per second a machine can perform. In the very, very beginning of computing, performance was measured by the number of additions per second a machine could perform.

ENIAC, generally considered either the first computer built or one of the first, could perform 5,000 additions per second, a full two orders of magnitude faster than the fastest relay-based machines available from Bell Labs and other organizations.

Since the dawn of modern computing, researchers, marketing departments, and engineers have sought to achieve the following three goals:

#1: Define a standard by which computing performance can be reliably judged over time.
#2: Define a standard which gives an accurate indication of the computing potential of a particular machine or type of machines.
#3: Define a standard which makes a particular system or machine look good compared to its peers.

Unfortunately, these goals often conflict with each other. Benchmarks that can be accurately compared over time yield results that can be referenced back against themselves and allow for equal comparisons between machines of different generations. This gives corporations or governments a direct means to examine performance when considering an upgrade, which is a much-desired ability. The older a benchmark gets, however, the less likely it is to take full advantage of a new computer design. While a certain degree of this can be overcome with patches or updates, inevitably new code must be written and comparability with the previous generation broken. There will always be tension between point #1 and point #2.

This would still be mainly an engineering problem, save for point #3, which is mandated by the marketing/sales department whose job it is to move the computers the engineers have designed. In this particular segment of business #1 is acknowledged as somewhat important (as it helps smooth potential buyer qualms about purchasing a system), #2 is somewhat less important than #1, but #3 reigns supreme. Ideally, all three of these intersect, but when in doubt, pick the third door.

This is not meant to imply that marketing personnel or salespeople are inherently dishonest; their job is to sell a product, emphasizing its positive aspects, regardless of how relevant those positive aspects may be. (Good) marketing literature will always be carefully tuned to imply as close to a universal promise of high performance as possible without ever claiming such is the case, lest the company trip and fall into the eternal hell of "false advertising."

Another paradox of benchmarking is the relationship between simple and complex measurements. When developing a benchmark (or assigning one) a simple, universal measurement is the best choice.

Simple measurements are easy to understand, can be applied to a wide category of systems, and give a one-number indication of system performance which even the not-so-technically inclined can understand, while remaining relevant. Thus we have the MHz. The MIPS or FLOPS.

Unfortunately, simple results often don't tell the whole story. They don't describe complex relationships between CPU and motherboard, the impact of bus speeds, bottlenecks, SIMD instructions, cache speeds, cache amounts, disk rotation speeds, or RAM. Thus, complex, weighted, and carefully tuned benchmarks are the best choice. By using such tests, factors such as the above can be considered, performance can be measured in a variety of applications and tests, and a final result can be given... in the form of a simple number.
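As a rough illustration of how a weighted suite collapses into one number, the sketch below computes a weighted geometric mean of per-test results, each expressed as a ratio against a reference machine. Everything in it is an assumption made for illustration: the function name, the test names, the ratios, and the weights are invented and do not describe any real benchmark or processor.

```python
import math


def composite_score(ratios, weights):
    """Weighted geometric mean of per-test ratios (test result / reference result)."""
    total_weight = sum(weights[name] for name in ratios)
    log_sum = sum(weights[name] * math.log(ratios[name]) for name in ratios)
    return math.exp(log_sum / total_weight)


# Hypothetical ratios against a reference machine (1.0 = same speed as reference).
machine_a = {"office": 1.30, "encoding": 0.90, "games": 1.10}
machine_b = {"office": 1.00, "encoding": 1.40, "games": 1.05}

# Two hypothetical weightings of the same three tests.
office_heavy = {"office": 3, "encoding": 1, "games": 1}
media_heavy = {"office": 1, "encoding": 3, "games": 1}

for label, weights in (("office-heavy", office_heavy), ("media-heavy", media_heavy)):
    a = composite_score(machine_a, weights)
    b = composite_score(machine_b, weights)
    print(f"{label}: machine_a={a:.3f} machine_b={b:.3f}")
```

With these made-up inputs the same raw results put machine_a ahead under one weighting and machine_b ahead under the other, which is why the weighting itself becomes a target for criticism.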

Sounds simple enough, save that the entire process by which that simple number is arrived at can be, and often is, attacked from a nearly infinite number of angles. Are the tests relevant? Is performance properly measured? Were special instruction sets taken advantage of? Are the special instruction sets in question relevant to the market or the way such tests are normally performed? Does anyone use the software used to measure the performance in question? How were testbeds configured? Etc., etc., etc., ad infinitum.

It's a near-perfect Catch-22: We desire simple numbers that can be easily understood, yet refuse to accept such numbers as valid because they are either too simplistic or too arbitrarily generated.

The problems don't get any easier when you leave the high theory and return to earth. Witness the current battle over 3DMark 2003, with Nvidia, and several prominent hardware sites, decrying it as too artificial, arbitrary, and inapplicable. Some would argue that such problems are inherent to synthetic benchmarks themselves, and thus only "real-world" benchmarks and scenarios should be considered. In the 3D gaming world, however, this is no assurance. Drivers can be, and often are, optimized for specific game engines.

Coincidentally, many of the most popular games are also used as benchmarks. Are ATI and NVIDIA really optimizing to win a benchmark — or just to provide gamers with the best experience possible to give them the highest return for their hard-earned dollar? You can guess which one they'll tell you.

Synthetic 3D benchmarks are no answer. Not only can they be optimized for as well, but the criteria by which the benchmark score is weighted can be criticized, as can the tests used to gather the score in question, as is currently happening with 3DMark 2003. 3DMark 2003 has been heavily criticized for its choice of DirectX game tests: one DX7, two DX8.1, and one (sort-of, according to Nvidia) DX9 test make up its line-up. Those who aren't fans of 3DMark 2003 claim the tests aren't indicative of future video card performance and are based on too small a sample. Those standing behind the benchmark, on the other hand, could easily claim that the tests in question fairly reflect the current distribution of games on the market.

There are still some DX7-class games out there that are popular and played, a majority of new games are DX8.1, and as to DX9, software supporting it is a long way off. One could argue 3DMark 2003 fairly reflects video card performance in the here-and-now, as any benchmark should. (Whether or not 3DMark 2003 should've billed itself as a DX9 benchmark is an issue we won't address here; the purpose of this article is to lay out the situation, not take sides.)

Things only get nastier when we move over to the CPU side of things. One of the largest and longest-running arguments over Intel's Pentium 4 is how it should be benchmarked compared to AMD's AthlonXP. For those of you (very) late to the party or living under rocks, here's the deal. The P4 has two characteristics which make it difficult to benchmark: it is much less efficient per clock cycle than previous x86 processors (whether P3 or Athlon-based), and its performance is heavily dependent on the presence or absence of specialized SIMD (Single Instruction, Multiple Data) instructions from the SSE and SSE2 instruction set extensions (SSE/2 for short). When using SSE/2 the P4 often exhibits superior mathematical ability, but without it the chip finds itself heavily compromised and runs much more slowly than would be expected given its clock speed.

Of the issues above, the P4's efficiency was more directly a problem when the CPU first launched and was badly outperformed by the last-generation P3 and the Athlon running at a much slower clock. Since Intel launched the Northwood last year and the P4's clock speed rose (and its performance relative to Athlon rose as well) the problems of efficiency have become less important, though they still often occur when the question of SSE/2 compatibility is raised.

The biggest problem with the Pentium 4 is how the CPU should be benchmarked. Intel, if you ask, will stress the need to use the most modern versions of software, as these are the versions that businesses will be using going forward and the most advanced available. Coincidentally, these are also the versions that are most-heavily optimized for the P4.

AMD, on the other hand, will scoff at Intel's emphasis on only modern software (and hit the optimization angle) while recommending would-be benchmarkers focus on the software that's most common in the marketplace. This software, after all, is the software people are actually using, and thus is the software that should be most thoroughly tested. Businesses don't care about the performance of a CPU on software they don't own; they want to know how it performs on a product they can actually use. Right? Coincidentally (you guessed it) it's the older software that tends to perform best on AMD CPUs.

If the difference between the two were small, one could dismiss it readily, but unfortunately this is not the case. Fire up Sysmark 2000 and watch the AthlonXP pull ahead of the P4 by a wide margin. Swap to Sysmark 2001 and watch the P4 pull ahead in Internet Content Creation (ICC) while the AthlonXP continues to dominate Office Performance (OP). Then fire up Sysmark 2002 and all of a sudden the P4 smashes the AthlonXP in ICC, while the Athlon only manages to maintain a razor-thin lead in OP, where it comfortably led the P4 before.

Lest you think this trend is confined to BAPCo, you'll find an identical one in the Content Creation and Business Winstone tests. In the 2001 versions of the test you'll find the AthlonXP leading dramatically, 2002 will show the P4 by a hair, and in 2003 the P4 will dominate.

Which should count more? Tough question. Certainly there are more people in the marketplace using software from the 2001 and earlier era, given the recent recession and the cost of software, which would seem to favor the AthlonXP. On the other hand, companies often upgrade hardware with an eye towards (eventually) upgrading software, and want to guarantee high performance in the future, which would seem to favor the P4. As you might've guessed by now, there is no official right answer.

Unfortunately the problems of benchmarking (and benchmarketing) go far beyond the version of software selected. It's no secret that BAPCo chose to eliminate many of the tests the AthlonXP won in Sysmark 2001 from its Sysmark 2002 suite. Did BAPCo do so to give better and more accurate results (as it claims) or because Intel (who owns part of the company) unfairly influenced how results are scored? How much does performance in tests like SPEC matter? Itanium scores very well in the industry-standard benchmark, but people aren't exactly lining up in the streets to buy it. Does SPEC even measure what it claims to be measuring? Some will tell you yes, some will say no. The problem is, everyone answering the question has their own agenda.

Reviewers try to get around this problem by running as large a suite of benchmarks as possible and synthesizing a neutral result from a "biased" series of tests. Remember, bias does not necessarily mean a deliberate choice. Intel will claim older tests are biased towards AMD for not including modern results; AMD will claim modern tests are biased towards Intel for not reflecting what people use. Either way, you don't win. With a near infinite amount of software, however, and a decidedly finite amount of time (and patience), the ability of the reviewer to truly draw a complete picture is limited indeed. Factor in inherent personal preferences and bias, and what you have is a very, very muddled picture.

If you examine the P4 vs. Barton reviews run this past week in aggregate and note the benchmarks used, you'll find an interesting fact: It's completely possible, if one chooses, to select an entire suite of "industry standard" benchmarks that the P4 wins. It's equally possible to select a suite of "industry standard" benchmarks that the Athlon wins. And finally, of course, we can select tests that demonstrate them both tying.
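The suite-selection effect is easy to reproduce in miniature. The sketch below is only an illustration under stated assumptions: the chip names, test names, and scores are invented placeholders (higher is better), not measurements of any real processor, and the two helper functions are hypothetical.

```python
def pick_suite(results, favored):
    """Return the tests on which the favored chip has the top score."""
    return [
        test
        for test, scores in results.items()
        if scores[favored] == max(scores.values())
    ]


def tally_wins(results, suite):
    """Count how many tests in the suite each chip wins."""
    tally = {}
    for test in suite:
        scores = results[test]
        winner = max(scores, key=scores.get)
        tally[winner] = tally.get(winner, 0) + 1
    return tally


# Hypothetical scores, higher is better. None of these are real measurements.
results = {
    "test_1": {"chip_x": 100, "chip_y": 120},
    "test_2": {"chip_x": 140, "chip_y": 110},
    "test_3": {"chip_x": 95, "chip_y": 105},
    "test_4": {"chip_x": 130, "chip_y": 125},
}

suite_for_x = pick_suite(results, "chip_x")
suite_for_y = pick_suite(results, "chip_y")

print(suite_for_x, tally_wins(results, suite_for_x))
print(suite_for_y, tally_wins(results, suite_for_y))
print(tally_wins(results, list(results)))
```

Each hand-picked suite shows a clean sweep for its favored chip, while the full set of four invented tests splits evenly between the two.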

Ultimately there's no real answer to the benchmark muddle, and no way to untangle the knots. Every company chooses benchmarks that purport to give it an advantage, whether it's Apple and FLOPS, Intel and MHz (although Centrino will force Chipzilla to change its 'MHz-is-everything' strategy), or AMD and its model numbers. From the ground up, the system is, if not polluted, muddied with self-interest. Perhaps the best thing to do is to remember that a benchmark is ultimately only a measure of itself: nothing less, nothing more. The truth benchmarks offer is fleeting, conditional, and considerably less absolute than their proponents, whoever they might be, would like you to believe. We have to use them, since they're all we have, but they aren't nearly the deus ex machina some would have them be.

Want to know how fast an AthlonXP really is against a Pentium 4? Buy yourself a good slingshot and clock them both. Be sure to use an equal amount of tensile strength for both, however; we wouldn't want you cheating...

The author bets that the Pentium 4 will hit a higher mph due to its better aerodynamic shape. µ
