Example Hypotheses

    Here is an extensive rendering of hypotheses in two categories.  The first set consists of five "studies" presented at a symposium chaired by Tom Haladyna, Arizona State University (West).  They focused on the five major validity arguments in the research concerning the superiority of two contrasting test item formats used in large-scale, high-stakes assessments.  To test the validity of these arguments, the researchers formulated "working" hypotheses to guide their examination of existing research and to formulate conclusions about existing knowledge.  The short discussion pertaining to each should also be instructive with regard to how the scholarly mind works when thinking about the verifiability of the logic inherent in certain theoretical assertions.

    The second set is actually three subsets and, like the first, they are directional in format; that is, they are not stated in the null.  These hypotheses are straightforward relational or comparative propositions that can be empirically tested (given an adequate research design and good data).  Brian Griffin at Georgia Southern University wrote these hypotheses.
 


######Set One######
Constructed-Response Or Multiple-Choice
A Research Synthesis


 


    For nearly a century, the choice of item formats has been debated. Should one use multiple-choice (MC) or constructed-response (CR)? Is there a difference in student performance if we use CR instead of MC for a specific interpretation or use of resulting test scores? These questions cannot be answered simply. One has to examine the type of interpretation or use of the test scores. To address this issue about item format choice, participants in this symposium will examine and evaluate over 125 studies of item format differences dating from the 1920s to the present. Five validity arguments were identified to guide this research synthesis. Each study was classified according to the intended use of the test scores in the study. Some studies involved more than one validity argument and were multiply classified.
 

Study 1: Constructed Response or Multiple Choice: Does the Format Make a Difference for Prediction?
Steven M. Downing
John J. Norcini
American Board of Internal Medicine

    Scores from many standardized tests are used to predict some future performance or achievement so that these examinations can be used to make selection or placement decisions. College admissions tests such as the ACT Assessment (ACT) and the Scholastic Assessment Test (SAT) are good examples. Graduate-level examinations such as the Graduate Record Examination (GRE), the Law School Admission Test (LSAT), and the Medical College Admission Test (MCAT) are also good examples of tests requiring predictive evidence.

    The predictive argument involves the comparison of comparable or equivalent multiple-choice (MC) and constructed response (CR) tests as predictors of some criterion variable. Admissions tests typically use grade-point average or academic program completion as a criterion variable. The typical study in this category involves identifying a criterion, constructing MC and CR predictors, and comparing any differences in predictability of the two formats. Analyses in these studies usually evaluate simple product-moment or multiple correlations between the types of predictors and the criterion. Thus, the common evaluation metric is the squared correlation between predictor(s) and criterion for each format. Another evaluation metric is the increase in the squared correlation attributable to adding a CR format.
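
    The two evaluation metrics just described can be illustrated with a small calculation. The following is only a minimal sketch: the score lists (`mc`, `cr`, `gpa`) are invented for illustration and come from none of the reviewed studies, and the function names are hypothetical. It computes the squared correlation between each format and the criterion, and then the increase in the squared multiple correlation when a CR score is added to an MC score, using the standard two-predictor formula.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_two_predictors(x1, x2, y):
    """Squared multiple correlation of y with two predictors."""
    r_y1 = pearson_r(x1, y)
    r_y2 = pearson_r(x2, y)
    r_12 = pearson_r(x1, x2)
    if abs(r_12) >= 1.0:
        raise ValueError("predictors are perfectly collinear")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical data (assumed for illustration only).
mc = [52, 61, 70, 45, 66, 58, 74, 49]            # MC test scores
cr = [3, 4, 5, 2, 4, 4, 5, 3]                    # CR test scores
gpa = [2.8, 3.1, 3.6, 2.5, 3.2, 3.0, 3.7, 2.7]   # criterion variable

r2_mc = pearson_r(mc, gpa) ** 2                  # squared correlation, MC alone
r2_cr = pearson_r(cr, gpa) ** 2                  # squared correlation, CR alone
r2_both = r_squared_two_predictors(mc, cr, gpa)  # MC and CR together
increment = r2_both - r2_mc                      # gain from adding the CR format

print(f"r^2 (MC): {r2_mc:.3f}")
print(f"r^2 (CR): {r2_cr:.3f}")
print(f"R^2 (MC + CR): {r2_both:.3f}")
print(f"Increase from adding CR: {increment:.3f}")
```

    A small increment would be consistent with the working hypothesis for this study, although a sample this small could not support any real conclusion.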

    As with any interpretive argument, some methodological difficulties arise. Some researchers incorporate the cost of developing measures while others do not. CR measures are usually more expensive to administer and score and take examinees longer to complete than MC measures; it is also usually more costly to establish defensible standards for CR tests. Non-linear relationships are seldom if ever considered, yet many performance measures have non-linear distributions. Some predictive arguments are too contrived in that the relationships examined in the study do not occur in reality. When these or other methodological problems exist, results are noted in the context of these limitations.

Working hypothesis. To guide the review of these studies a working hypothesis was developed as suggested by Snow (1994): MC items lead to test scores that tend to be more effective predictors than test scores based on CR items regardless of what constructs are represented.
 

Study 2: The Construct Equivalence of Constructed-Response and Multiple-Choice Items
Michael Rodriguez
Michigan State University

    The most common type of study of item format differences involves whether multiple-choice (MC) and constructed-response (CR) formats measure the same construct. Interestingly, a variety of methods are used to study construct equivalence, ranging from simple product-moment correlations to factor analytic and causal model methods. Over forty studies were obtained which directly address the issue of construct-equivalence of MC and CR items.

    Some significant criticism has been leveled at these kinds of studies. Snow (1993) maintained that while psychometric criteria may be satisfied, psychological criteria may not be satisfied in these trait-equivalence studies. Traub (1993) criticized the limited scope of these studies and argued that correlations corrected for attenuation should be used to assert equivalence. Traub also pointed out that most equivalence studies use a limited variety of MC formats, while a wide variety of CR formats are used. Thus, results may be variable due to the many specific types of CR formats used. We might then ask "Are the inferences we make regarding student knowledge of content dependent on the format of the question?"
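
    A correlation "corrected for attenuation" is usually taken to mean the classical (Spearman) correction, in which the observed correlation between two measures is divided by the square root of the product of their reliabilities. A minimal sketch follows; the function name and the numeric values are assumptions chosen for illustration and are not drawn from any of the studies reviewed.

```python
from math import sqrt


def disattenuated_r(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation.

    Estimates the correlation between true scores from the observed
    correlation and the reliability of each measure.
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / sqrt(reliability_x * reliability_y)


# Hypothetical values: observed MC-CR correlation and the two reliabilities.
corrected = disattenuated_r(0.60, 0.90, 0.70)
print(f"Observed r = 0.60, corrected r = {corrected:.3f}")
```

    Because a CR test is often the less reliable of the two measures, the corrected value can be noticeably higher than the observed one, which is why the correction matters when the claim being tested is equivalence.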

Working hypothesis. When MC and CR items are designed to measure the same construct, they do; and, when they are designed to measure different constructs, they do. Additional characteristics will be examined in terms of their effects on equivalence results in a meta-analytic framework, including whether or not the items in the two formats are stem equivalent, which analytic method was used to establish equivalence, the subject matter of the items, the age group of the examinees, the type of CR item used, and others.

Study 3: Proximity to Criteria: Does Format Make a Difference?
Thomas M. Haladyna
Arizona State University West

    Most criteria in educational testing are performance oriented. For instance, reading and writing are performance-based constructs. Frederiksen and Collins (1989) argued that the problem with using MC tests is that they offer indirect rather than direct measurement of the criterion. During the revolutionary days of the 1920s and 1930s, test specialists promoted MC and criticized CR. But these specialists also recognized the salience of measuring directly versus indirectly (Eurich, 1931). In circumstances where the criterion is known to be a performance, MC might be considered a proxy for the more desirable direct measurement exemplified in performance. For instance, a driver's license examination could include a MC test instead of a driving test, because the former may be highly correlated with the latter. The issue here is the proximity of an indirect measure to a criterion. The argument is not that MC is more appropriate than CR for a specific test score interpretation or use, but that MC can be a viable proxy for the more expensive and difficult-to-obtain CR measure of a criterion.

    The most obvious and most frequently studied construct involving this argument is writing ability. Writing assessments are common in most state and national achievement testing programs. Another good example of the use of this argument is medical problem solving, which is a central concern to those involved in medical licensing and certification. The need for direct criterion measures extends to high-stakes licensing and certification in virtually every profession, where the licensing or certifying authority is interested in measuring a candidate's ability to address and solve a patient/client problem. While performance may be authentic and most desirable, it is typically unfeasible due to cost, logistical, and technical considerations.

    The research issue in this category is: To what extent does a MC test predict the more costly and more difficult-to-obtain criterion performance measure? The answer to this question is not resolved empirically or statistically. Instead, it involves values and social consequences, an aspect of construct validity that has gained recent attention (Messick, 1989). For example, Frederiksen and Collins (1991) and Shepard (1993) have argued that MC testing has negative effects on students. This hypothesis will be addressed in a subsequent interpretive argument in this section.

    Studies to be reviewed that fall in the category of the proximity argument usually administer both criterion and proxy measures and study the relationship between the two. If the relationship is high enough, then the question arises: can MC serve as a useful, less expensive proxy?

Working hypothesis. A working hypothesis for this research review is that MC tests correlate highly with criterion performance, but for many reasons existing outside the realm of measurement theory and practice, more proximal measures are desirable with the limit being resources available and legal and social consequences of using a less proximal measure.

Study 4: Gender X Format Interactionist Perspective
Joseph Ryan, Arizona State University West

    "Bias" is systematic error in measurement that affects a subgroup of test takers (Cole and Moss, 1989). Gender bias would involve the over- or under-estimation of the abilities or achievement of men and women who take what are thought to be comparable MC or CR versions of the same test. Several issues arise in this context. The most straightforward issue is easily examined and involves observing that males consistently outperform females on MC tests while females consistently outperform males on CR tests. This result varies somewhat based on subject/content matter, grade level, and some operational features of the particular study. This finding raises a second issue that is much more difficult to unravel: is the gender X format interaction reflecting the effect of construct-related or construct-irrelevant variation in the measurement of achievement for the two groups? If the gender X format interaction is construct-related, then the choice of format is actually a curriculum issue since it reflects a choice of one construct over another. If the interaction effect is construct-irrelevant, then the choice of one format over another reflects a measurement issue since the choice of format introduces bias.

    A useful example can be seen in measuring students' math ability using verbally presented mathematics problems to which students provide an extended response. In such situations, females typically outperform males. If the verbal component of the task and response are viewed as "irrelevant" to the construct of math ability, then this assessment is biased in favor of females. If verbal mediation is seen as part of the construct, however, the assessment is valid and the performance reflects a valid group difference.
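
    In its simplest form, the gender X format interaction discussed above is a difference of differences among four cell means. The sketch below is illustrative only: the cell means are invented and the function name is hypothetical, and an actual study would test the contrast against its sampling error rather than inspect it descriptively.

```python
def interaction_contrast(cell_means):
    """Difference of differences for a 2 x 2 gender X format design.

    Returns (male - female) on MC minus (male - female) on CR.
    A value of zero means the gender gap is the same in both formats.
    """
    gap_mc = cell_means[("male", "MC")] - cell_means[("female", "MC")]
    gap_cr = cell_means[("male", "CR")] - cell_means[("female", "CR")]
    return gap_mc - gap_cr


# Hypothetical mean scores (assumed for illustration only).
means = {
    ("male", "MC"): 52.0,
    ("female", "MC"): 50.0,
    ("male", "CR"): 48.0,
    ("female", "CR"): 51.0,
}

contrast = interaction_contrast(means)
print(f"Gender X format interaction contrast: {contrast}")
```

    A nonzero contrast shows only that the gender gap changes with format; it cannot say whether that change is construct-related or construct-irrelevant, which is the harder question raised above.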

    Studies that address gender X format interactions seem to fall into three general categories. First, a group of studies addresses gender differences that reflect differences in verbal ability and fluency. Another group of studies involves potential "bias" from reading, handwriting, or computer-based writing. A final set of studies is emerging that reflects rater bias in scoring constructed-response assessments.

Working Hypothesis: Gender X format interactions exist and are well documented, but the meaning of these interactions is unclear. Curriculum reform efforts in mathematics and science, for example, would likely view verbal mediation in mathematics and science assessment as part of the intended construct and thus conclude that CRs are valid measures. Other curriculum developers, however, might focus on "pure" or decontextualized math and science and, in these cases, CRs would be viewed as introducing construct-irrelevant variation.
 

Study 5: Cognition and the Question of Test Item Format
Michael E. Martinez
University of California, Irvine

    Ever since multiple-choice items were introduced on a large scale early this century, educators have expressed concern over the effects of multiple-choice testing on the quality of thinking in learners (Kinney and Eurich, 1932; Guilford, 1967). Although inquiry on this issue has spanned several decades, the influence of test item format on the nature and quality of cognition remains a severely under-researched aspect of test interpretation and test use. Yet the effects of testing on cognition are vital, bearing not only on the meaning of what is measured but also on the consequences of measurement, both during the testing episode and in its anticipation and aftermath (Snow & Lohman, 1993). In discussions of the effects of item format on elicited cognitions, a typical assertion is that MC tests lead to low-level information processing while CR tests are more often associated with complex thinking (Frederiksen and Collins, 1989).

    While there is evidence to support this assertion, there are also several qualifiers, including the caveat that differences in cognition elicited by differing item formats are less a reflection of the limitations of format than they are of typical use. This paper will review the state of the cumulative research as it bears on the issue of the differential cognitive effects of test item formats, and will synthesize the current understanding of the field while being clear about where there are still gaps in our knowledge.
 


######Set Two######


 




1. Children taught by the vocabulary method will learn significantly better than children taught by the experimental method.

2. The greater one's retention ability, the more likely one is to increase one's learning from related prose.

3. Given equal prior learning, corrective and non-corrective instruction are likely to produce different levels of achievement among fourth-grade students.

4. Programs offering stipends will be just as successful at retaining students as programs not offering stipends.

5. In a middle-class, suburban, public school district in which a child is expected to meet the standards of a set curriculum, a child who is under five years of age upon entrance to kindergarten is less likely to be ready for first grade in one year than a child who is five years of age or more at the time of entrance to kindergarten.
 

1. Fourth grade students who participate in computer assisted instruction (CAI) will have higher mathematics achievement scores than fourth grade students who do not participate in CAI.

2. Teachers who establish rapport with their students will be more effective in motivating students to study than teachers who do not establish rapport with their students.

3. Students with higher SAT scores will also have higher GRE scores; similarly, students with lower SAT scores will have lower GRE scores.

4. Under intangible reinforcement conditions, middle-class children will learn more than lower-class children.

5. The average achievement group and the low achievement group will show the same level in ratings of self-worth.

6. Classroom intellectual composition was expected to directly influence students' academic achievement; the higher the classroom intellectual composition, the greater the academic achievement.

7. There will be little, if any, difference on mathematics achievement between the computer and tutor group, the computer-only group, and the traditional instruction group.

8. Students' confidence in their academic ability and their intelligence are both related to achievement.

9. Test-taking experience affects test performance.

10. Students who receive individually guided instruction will demonstrate greater gains in reading achievement than students who receive group based instruction.

11. Students exposed to the read, visualize, and draw condition are expected to comprehend more of the biology text than are students in the read and visualize condition or the read only condition.

12. Science achievement is independent of academic self-efficacy.

13. Perceptions of the characteristics of the "good" or effective teacher are in part determined by the perceiver's attitudes toward education.
 

1. If differences are found between the sight reading method and the phonics reading method in increasing verbal comprehension, the differences will be small and trivial.

2. Among poor readers, those participating in computer-assisted instruction will out-gain those receiving conventional instruction on reading achievement. Additionally, the computer-assisted group is expected to perform better on reading achievement than those learning via the phonic method, but the phonic method is expected to produce better results than conventional instruction.

3. Women who plan to pursue careers in science are more aggressive, less conforming, more independent, and have a greater need for achievement than women who do not plan such careers.

4. More structured instructional procedures will provoke greater achievement among concrete students, whereas less structured approaches will provoke greater achievement among abstract students.
 


#########################