How to Protect High-Range Tests

Chris Cole

The suspension of the Mega and Titan tests as admissions vehicles for the Mega Society leaves the Society in a difficult position. The explosion of the Internet since 1995 has made it extremely hard to keep test answers secret. Half of the Mega and Titan test answers are easily available on the Internet today. Even if we were to have a new high-range test in hand right now, it would be compromised within a relatively short period of time, perhaps days. In fact, it’s unclear how a high-range test would even be normed without rendering it useless in the process. Is this the end of high-range testing, and potentially the Mega Society?

One possible solution would be to retain the secrecy of the test in the same way the College Entrance Examination Board does: formulate a very large number of questions and have each specific test consist of a small subset of this larger set. Thus the potential cheater is defeated by the need to memorize thousands of problems. In addition, the test is copyrighted and physically protected.
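
As a rough illustration of the subset idea, the sketch below draws a test at random from a large pool and computes how many items two independently drawn tests are expected to share. The pool size of 5000, the test length of 48, and the function names are assumptions made for the example, not figures from the College Board.

```python
import random

def draw_test(pool_size, test_length, rng):
    """Return a sorted list of item indices drawn without replacement from the pool."""
    return sorted(rng.sample(range(pool_size), test_length))

def expected_overlap(pool_size, test_length):
    """Expected number of items shared by two independently drawn tests.

    Each item of the first test appears in the second with probability
    test_length / pool_size, so the mean overlap is test_length**2 / pool_size.
    """
    return test_length * test_length / pool_size

# Hypothetical numbers: a pool of 5000 items and tests of 48 items each.
rng = random.Random(0)
test_a = draw_test(5000, 48, rng)
test_b = draw_test(5000, 48, rng)
shared = len(set(test_a) & set(test_b))
print("items shared by these two tests:", shared)
print("expected overlap:", expected_overlap(5000, 48))
```

With these assumed numbers, a leaked copy of one test exposes on average less than one item of the next, which is the sense in which the cheater has to memorize the whole pool.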

One problem with this solution is that the College Board has a large market for its tests, and therefore can afford to employ hundreds of test designers to write thousands of sample problems, and additional thousands of test takers to verify and norm the problems. Another problem is that it seems to be a lot harder to write a high-range problem than it is to write a mid-range problem.

One possible solution to the first problem would be to use the power of the Internet for good instead of evil, namely, to publish the test over the Internet and let thousands of interested test takers verify and norm the test for free. While this is a cheap way to get a test normed, it works at cross-purposes to the idea of keeping the test secret.

The second problem, thinking of the problems in the first place, might also be solved via the Internet. Perhaps the test problems themselves could be submitted over the Internet. A system could be set up where people who wanted to take the test would be able to, but they could not receive a "certified" test result until they had submitted some quality problems themselves.

However, experience with this kind of self-generating content over the Internet does not lead to optimism. Quality suffers. Various "political" agendas tend to crop up and mix in with the effort, contaminating the outcome. This has led to several failures, notably Internet dictionaries, encyclopedias, etc.

There is an art to good test design, and the market for high-range tests will support relatively few artists. How can we leverage their efforts?

Looking at many tests, one sees a certain pattern: the problems can be classified into groups. For example, Ron Hoeflin has a group of problems about cells formed by intersecting various solids such as spheres, cubes, etc. The solution to one member of this group (say, three cubes) does not help much in the solution of another (say, two cones and a sphere). Yet it might be the case that there is an underlying mathematics that yields the answers to all of the problems in the group. Then a very large number of problems could be generated, where the solution to one problem would not help in the solution of another. This would be ideal for creating an on-line test, because cheating would be impossible.

One difficulty would be in norming such a group of problems. It is usual practice to norm a problem by having a large number of people try exactly the same problem. If the problems were different, how could the test be normed? One problem in a group might be more difficult than another.

The answer to this is twofold: first, it is not true that a given problem has a specified difficulty. The difficulty of a problem is in the eyes of the beholder. What norming does is establish a distribution of difficulties over a sample population, which is an estimate of the distribution of difficulty over the entire population. Thus the real issue is to control the error bars around the estimated difficulty. A problem is rejected if the error bars are too large. Similarly, a group of problems would be rejected if its error bars are too large. The "art" is to select groups that have small error bars.
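
One way to make "error bars" concrete is to treat a problem's difficulty as the fraction of a norming sample that solves it and to put a confidence interval around that fraction. The sketch below uses the Wilson score interval, which is one common choice and not necessarily what any test designer uses; the width threshold of 0.2 and the function names are arbitrary assumptions.

```python
import math

def wilson_interval(solved, attempts, z=1.96):
    """Wilson score interval for the fraction of test takers who solve a problem."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = solved / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z * z / (4 * attempts * attempts)) / denom
    return center - half, center + half

def acceptable(solved, attempts, max_width=0.2):
    """Keep a problem (or group) only if its interval is narrow enough.

    max_width is a hypothetical cutoff chosen for illustration.
    """
    low, high = wilson_interval(solved, attempts)
    return (high - low) <= max_width

print(wilson_interval(5, 50))
print(acceptable(5, 50), acceptable(1, 5))
```

The same function could be applied to a group by pooling attempts across its variants, though pooling assumes the variants are about equally hard, which is the very thing the norming is meant to check.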

The second answer to this is to observe that an IQ is not estimated based upon one problem alone; there already is a group of problems involved, namely, the entire test itself. So what we are discussing here is the idea of estimating an IQ based upon a set of problems selected from a large normed set, versus estimating an IQ based upon a set of problems selected from a set of normed groups of problems. Either way, there is an inescapable statistical inference being performed; it’s all about propagation of errors.
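
The propagation of errors mentioned here can be sketched with the usual rule for independent errors: the uncertainty of a sum is the square root of the sum of the squared uncertainties. The per-problem error values below are invented for illustration, and independence between problems is an assumption.

```python
import math

def combined_standard_error(item_errors):
    """Standard error of a sum of independent estimates, each with its own error."""
    return math.sqrt(sum(e * e for e in item_errors))

# Hypothetical per-problem errors for a 10-problem test.
items_normed_individually = [0.05] * 10   # each problem normed on its own
items_from_normed_groups = [0.07] * 10    # each problem drawn from a normed group

print(combined_standard_error(items_normed_individually))
print(combined_standard_error(items_from_normed_groups))
```

The formula is the same in both cases; drawing from groups only changes the size of the per-problem errors that feed into it.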

Another objection to the idea of groups of problems with an underlying mathematical solution is that it might be possible to learn the underlying solution and thus learn how to answer all of the problems in the set. If the underlying mathematics is trivial, this is indeed a weakness. However, it might be that the underlying mathematics is sufficiently complicated that it is easy for a computer to work out, but difficult for a human to work out. Better yet, it might be a one-way or trapdoor function, such as occurs in many cryptographic systems. For example, the Allies during World War II had working copies of the Enigma cipher machine long before the war started, yet they were unable to crack the wartime coded correspondence without cribs, bombes, and a lot of espionage.

As a concrete example, consider problem 30 on the Mega Test. For those without the test at hand, this is the problem where three board positions in some game were given, and you had to figure out the fourth board position. Actually, the first half of the problem was to figure out that the figures shown were board positions in a game that was being played optimally. Even after figuring this out, however, it was a challenge to figure out what the underlying rules of the game were, and to deduce what the fourth position had to be. Now, this one problem could be expanded into a group of problems, by varying the underlying rules of the game and using standard alpha-beta pruned game tree search to find board positions that are unique and lead to simple answers. Even if a test taker knew this was what was going on, it would take a similar level of mental effort to deduce the rules from the board positions in each case. And the solution for one set of rules would be of little help in the solution for another set. The size of the board, the number of different pieces, even the movement rules could be varied without greatly affecting the difficulty. A large group of problems of similar difficulty could thus be created from a problem that, according to Grady Towers’ item analysis, is one of the best on the Mega Test.
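
The generation step can be sketched on a toy game. The example below is not the game from problem 30; it is a simple take-away game, chosen as a stand-in, in which players alternately remove stones from a pile and whoever cannot move loses. The "rules" are the set of allowed removals, and a negamax search with alpha-beta cutoffs plays the game optimally to produce a sequence of positions. All names, and the convention that a losing player removes the smallest allowed number, are assumptions.

```python
def negamax(pile, moves, alpha=-1, beta=1):
    """Value for the player to move: +1 for a forced win, -1 for a loss."""
    best = -1  # a player with no legal move loses
    for m in moves:
        if m > pile:
            continue
        value = -negamax(pile - m, moves, -beta, -alpha)
        best = max(best, value)
        alpha = max(alpha, best)
        if alpha >= beta:  # cutoff: a winning move has been found
            break
    return best

def best_move(pile, moves):
    """Smallest winning move if one exists, else the smallest legal move."""
    legal = [m for m in sorted(moves) if m <= pile]
    for m in legal:
        if -negamax(pile - m, moves) == 1:
            return m
    return legal[0] if legal else None

def optimal_line(pile, moves, length):
    """Successive pile sizes under optimal play, up to `length` positions."""
    line = [pile]
    while len(line) < length:
        m = best_move(line[-1], moves)
        if m is None:
            break
        line.append(line[-1] - m)
    return line

# Two hypothetical rule sets yield two different puzzles from the same start.
for rules in [(1, 2, 3), (1, 3, 4)]:
    print(rules, optimal_line(10, rules, 4))
```

A puzzle would show the first three positions of a line and ask for the fourth; changing the rule set changes the line, so an answer worked out under one set of rules gives little away about another.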