Benchmarks used to mislead customers

Opinion Worrying trends of our time
Monday, 7 July 2003, 13:11
LATELY, THERE has been a disturbing trend in the computing world surrounding benchmarking. It is not specific to one platform, test or even company; it is an endemic problem that is only getting worse.

First, a little definition: a benchmark is a program that will test a system or component, and return a statistic that can be used to judge its speed relative to another system or component.

This is a rather broad definition, and the benchmarks themselves can range from a simple 100K program that returns a MHz number for your processor to a huge suite that takes days to run, and requires a veritable fleet of hardware and software engineers. Quick to ages, free to millions of dollars, platform specific to cross-compilable, benchmarks come in all shapes and sizes.

These programs all have one thing in common: they have all been used to cheat and mislead consumers. Why? Think of the stakes involved. If you are selling computers, and you can show your machine is 5% faster than the next, you are in for a bigger slice of the grand revenue pie. PR people trumpet the victory, headlines are made, and ignored by the vast majority, and sales may climb 1%. Is it really worth it to jump up and down for 1%? If you are in a $10 billion market, then hell, yes it is. This is why people cheat on benchmarks.

There are other reasons of course, but they all boil down to money. Say for example your latest video card is woefully underpowered, and the competition is demonstrably ahead of you in all testing. You have no new product coming for 6 months, and even then, you are not sure how well the other company will do with their next refresh either. This is obviously a bad situation, so what do you do? Cheat.

This cheating started out subtly, and probably went unnoticed by most. It wasn't an arms race, or worse yet, a backstabbing contest. It was simply a way to look a small bit better than the next guy, and be able to shout a little louder in the next press release. Fairly innocuous stuff, but it happened. Occasionally, someone went above and beyond the call of duty in their covert optimizations, and got caught with their pants down. Things were quickly rectified, and all was fine and dandy, black eyes notwithstanding.

Lately however, the cheating has become endemic and blatant. From multi-million dollar computers, to new processors, to video cards, it has gotten to the point where cheating is almost expected.

Recently, it started out slow. ATI was caught red handed "optimizing" its drivers for the granddaddy of all game benchmarks, Quake III. A new version of the then-current ATI drivers showed a huge increase in Quake scores. Usually a good thing; that is what optimizations are for. But if you renamed the executable file to quack3.exe, you ended up with much lower scores. Also, as screen captures would later show, the image quality was reduced if the game was run as "quake3". ATI backpedalled, and tried to pretend that it was all a mistake, or a necessary optimization, or a needless attack on its integrity. Whatever the case, there was a new driver out very quickly that showed about the same score as the pre-quack driver, and the damage was done. Nowadays, if you say "quack3" to any moderately aware hardware junky, you know, the people whom others ask for advice when buying technology, you will get a sad response. Usually shaking of heads, and some unprintable words about a certain Canadian graphics company.

Before you get down on ATI too much, they were just one of many, and had the misfortune of having a catchy name to pin to the issue. Quack is easy. 179.art does not roll off the tongue all that well, but Sun used it anyway. 179.art is one part of the SpecFP suite used in benchmarking high performance computers for scientific uses. It is widely used and quoted almost everywhere. When Sun released new benchmark results a few months ago, they showed a double digit gain in performance, a jaw-dropping speed jump. A closer look at the numbers (see here) showed that one subtest, the aforementioned 179.art, was multiple times ahead of the last submission, and had a similar performance gain over other CPUs in its class (see here). Sun "optimized" that one benchmark to the point where it skewed the entire suite. As usual, pants were pulled down, fingers pointed, and eyes blackened.
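To see how a single subtest can drag a whole suite score upward, here is a rough sketch of the arithmetic. It assumes the composite is a geometric mean of per-subtest ratios, which is how SPEC CPU composites are calculated; the subtest count and the 8x figure are made up for illustration and are not Sun's actual numbers.

```python
from math import prod

def geometric_mean(ratios):
    """Geometric mean of per-subtest speed ratios."""
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical suite: 14 subtests, each scoring 1.0 relative
# to the previous submission (i.e. no change at all).
baseline = [1.0] * 14

# Same suite, but one subtest is "optimized" to run 8x faster
# (an illustrative figure, not a measured one).
skewed = baseline[:-1] + [8.0]

# Relative gain in the composite caused by that one subtest alone.
overall_gain = geometric_mean(skewed) / geometric_mean(baseline) - 1.0
print(f"overall gain from one subtest: {overall_gain:.1%}")
```

Thirteen of the fourteen numbers did not move, yet the headline figure still lands in double digits. That is why it pays to read the per-subtest breakdown and not just the composite.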

Picking up steam, we again return to the graphics card industry. Recently, the most widely used graphics benchmark suite on the PC, 3Dmark, came out with a new version, 3Dmark 2003. Immediately, controversy swirled, and NVidia basically said it was invalid. ATI grinned. Scores soon showed that NVidia was way behind in the benchmarks, and ATI wasn't. While some people may have been shocked, you could have read the results from the press releases months in advance. Soon after, NVidia released a new driver set that showed a remarkable speed jump in 3Dmark 2003, again in the double digits. This may not have aroused much suspicion; NVidia is the king of valid driver optimizations, and has a long history of steadily improving the quality and speed of its drivers. No immediate alarm bells went off.

Soon after, it was pointed out that if you changed the benchmark slightly, basically moving the camera off the predetermined path, the scores dropped precipitously. Futuremark, the makers of 3Dmark, came out with a developer version that allowed the user to move the camera around and see for themselves. They then fired off a press release saying NVidia cheated.

You can almost hear the baying of hounds, and the flurry of angry phone calls that followed that release. The next day, presumably on threat of lawyers and other Lovecraftian creatures, Futuremark released a statement saying that NVidia didn't really cheat, but everything else was basically true. Nice of them to come to this understanding, it cleared up a lot for me. Pants down, eyes black, finger in the ready position.

Staying in the graphics world, days later ATI was accused of similar "optimization, not cheating mind you". These were of a much lesser nature by all accounts, but still, it shows that you don't rest even while you are ahead. With every new driver release since then, there has been an article or four on a few of the major review sites detailing the misdeeds of many graphics companies. It has become routine. Scratch that, it has become boring.

Digit-Life just did a superb breakdown of some of the technical aspects of this (see here), and if you can stand the hardcore geekery, it is really worth it to read. Whatever happens, the die has been cast, and the "optimizations, not cheating" are here to stay. How do I know? Well, there is a dirty secret in the tech world. Most of the review sites, while excellent in a variety of ways, are ill equipped to ferret out this sort of shenanigans. How come they get the stories on a regular basis? Easy, the competition is well equipped to ferret that stuff out, and they already have contacts at the sites.

If you are not following me, let me spell out the process. Company A comes out with a new driver for their card. Company B busily reverse engineers it for any secrets they can find. When they see an "optimization, not cheating" in the code, and they do, they pick up the phone. The information, usually with specific step by step instructions, is fed to anyone who will listen. The only thing better than having your competition fall on its face is having you push them, and no one seeing you do it. If you want to make your boss happy in tech, you can either make sure you do better work than the competition, or you skewer the competition hard in the press. The latter generally gets more kudos than the former.

This scenario is so commonplace that it is expected, and boring. The fact that both companies expect the behavior, and have people prepared to deal with it, says it is now part of the standard toolbox. Even the fact that company B feeds the info to sites is no longer a secret. Cheating is status quo. It is starting to be a self-policing cycle now, and hopefully, but doubtfully, it will lead to honesty. Yeah right, I'm not holding my breath, are you?

So, what can be done about this situation? Well, as reviewers, we collectively need to get out of our rut. Go to any big hardware site, like Toms, or Anandtech, and read a review of the latest chip or graphics board. You will see that they all benchmark using a similar mix of software; there are 4 or 5 programs that have become almost mandatory. They use these programs across reviews, and over a period of months, or years in the case of QuakeIII. This gives hardware vendors an easy target to "optimize, not cheat" their drivers against.

The solution, and it has been debated before, is simple: make your own benchmark suite, and don't reveal what it is to the public. There is a certain distrust factor in telling people "here are a bunch of numbers, don't ask questions", but is that any less valid than using drivers that you know "optimize, not cheat" for the very tests you use, and not for everything else? Some sites do a great job of picking a wide variety of software to use, and they tend to show trends markedly different from the common benchmarks. Coincidence? Nope.

It may take a lot more work, but in addition to the tired old suite, pick up a few programs at retail and use them. Select the programs for their various strengths, like a DirectX game, an OpenGL game, and a clone of Photoshop, not the real thing. They may not be optimized, in the real sense of the word, for the latest and greatest stuff, but they are what people buy. Don't reveal what these programs are, just be generic, but consistently use the same programs across reviews so people can accurately compare. If the companies want to optimize, again legitimately, they can. Company A releasing a driver that boosts OpenGL scores 10% would be a win for us all. Boosting QuakeIII scores by as much using "optimizations, not cheating" doesn't do much for us all, in fact it harms us by not allowing us to make rational decisions in the marketplace.

Better yet, the hidden benchmark scheme should allow cheating to be detected much more easily by the general public. If company B releases a driver revision, and the UT2003 scores jump 17%, and the "hidden DirectX game" scores go down by 0.3%, you can guess what happened there. Better yet, use 2 things in each category to minimize freak anomalies that end up looking like cheating; the goal here is to be fair.
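For reviewers who keep their scores in a spreadsheet or script, the comparison could look something like the sketch below. The function names, the 5-point threshold and all the scores are hypothetical, picked only to mirror the 17% versus -0.3% example; a real review would need to settle on its own noise margin.

```python
from statistics import mean

def percent_change(old, new):
    """Percentage change from the old driver's score to the new one."""
    return (new - old) / old * 100.0

def mean_change(pairs):
    """Average percentage change over several (old, new) score pairs."""
    return mean(percent_change(old, new) for old, new in pairs)

def looks_targeted(public_pairs, hidden_pairs, threshold_pct=5.0):
    """Flag a driver whose gain on the public benchmark exceeds its
    gain on the hidden programs by more than threshold_pct points.
    The threshold is an assumed noise margin, not an established value."""
    gap = mean_change(public_pairs) - mean_change(hidden_pairs)
    return gap > threshold_pct

# Made-up scores: (old driver, new driver).
public_directx = [(100.0, 117.0)]               # well-known benchmark, +17%
hidden_directx = [(80.0, 79.76), (60.0, 60.3)]  # two hidden games, roughly flat

print(looks_targeted(public_directx, hidden_directx))
```

Averaging two hidden programs per category is the "use 2 things" idea from above: one odd result in a single game is less likely to get an honest driver flagged.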

This whole thing is a lot more work for the reviewers, but in light of the situation, with just about every company under the sun "optimizing, not cheating" on just about everything, what else can you do? I am not trying to pick on graphics card companies here, it has been done by everyone from system vendors to peripheral makers since time immemorial, graphics card manufacturers are just getting all the press lately.

Back to non-graphics benchmarks, you might ask how you would benchmark a huge FP heavy scientific program. You can't go to the local supermegaconglomerate retail shop and plunk down $250K to get the latest computational fluid dynamics program, and whip up a valid dataset in an afternoon. In fact, simulating programs like Spec or TPC-C is nearly impossible. So what is the poor hardware reviewer to do? Easy, look to education.

Most people, even those in the Midwestern US, have a university near them, even if it is not a big one. Go to the local computer science lab and ask who is running custom code and simulations of the sort you want to test. Offer to test their code out on a much wider variety of hardware and peripherals than they have access to, or have time to try. They get more information directly relevant to their work, and you get an "optimization, not cheat" proof benchmark. Win-win. You may be rebuffed, or laughed out, but from my experience, most college denizens are more than happy to help out, and discuss the arcane nature of their work.

Overall, the cheaters are out there, and are not going away. The solution to the problem is out there, and is not that hard. If readers of a site don't like the added secrecy, they can simply skip to the next page. This HTML stuff is really good at that. More work, but much more accurate results. It is high time that something is done, the industry needs it. µ