Phys. Rev. ST Phys. Educ. Res. 2, 020102 (2006)
Evaluating multiple-choice exams in large introductory physics courses
Michael Scott, Tim Stelzer, and Gary Gladding
-
D. K. Campbell, C. M. Elliot, and G. E. Gladding, Parallel parking an aircraft carrier: Revising the calculus-based introductory physics sequence at Illinois (Forum on Education of the American Physical Society, 1997).
-
R. Lukhele, D. Thissen, and H. Wainer, On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests, J. Educ. Meas. 31, 234 (1994).
-
E. F. Redish, Teaching Physics with the Physics Suite (John Wiley and Sons, New York, 2003).
-
S. Tobias and J. B. Raphael, In-class examinations in college-level science: New theory, new practice, J. Sci. Educ. Technol. 5, 311 (1996).
-
G. J. Aubrecht and J. D. Aubrecht, Constructing objective tests, Am. J. Phys. 51, 613 (1983).
-
Midterm exams are written to be 60 min exams, but students are allotted 90 min to complete them. Students are allotted 3 h to take the final exam. For most students, time is not an issue.
-
For clarification, to get each student’s even and odd scores, each of the four exams was first ordered by item difficulty. A student’s even score is then the sum of their scores on the even questions from exams 1 and 3 and the odd questions from exams 2 and 4. Likewise, a student’s odd score is the sum of their scores on the odd questions from exams 1 and 3 and the even questions from exams 2 and 4.
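The split described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the (difficulty, points) data layout, and the ascending sort direction are hypothetical, since the note does not say how difficulty was measured or in which direction the items were ordered.

```python
def split_half_scores(exams):
    """Return (even_score, odd_score) for one student.

    exams: four lists, one per exam in order 1..4; each entry is a
    (difficulty, points_earned) pair for one item (hypothetical layout).
    """
    even_score = 0.0
    odd_score = 0.0
    for exam_number, items in enumerate(exams, start=1):
        # Order the items by difficulty before numbering them (ascending is assumed).
        ranked = sorted(items, key=lambda item: item[0])
        for position, (_, points) in enumerate(ranked, start=1):
            position_is_even = position % 2 == 0
            # Exams 1 and 3 send even positions to the even score;
            # exams 2 and 4 send odd positions to the even score.
            if exam_number in (1, 3):
                goes_to_even = position_is_even
            else:
                goes_to_even = not position_is_even
            if goes_to_even:
                even_score += points
            else:
                odd_score += points
    return even_score, odd_score
```

Alternating the assignment between exams keeps either half from systematically collecting the harder item of each adjacent pair.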
-
This analysis considers only our A to C students because it is these students whose exam performance shows a strong linear correspondence to their assigned letter grade. That is, these students tend to receive 90% or more credit on the effort components of the course (e.g., homework, quizzes, and laboratories). Thus, their effort grade is not a distinguishing factor in the grade they receive in the course. This is not true, in general, for D and F students. Not only do these students do poorly on the exams, they also tend to do poorly on the effort components of the class. Therefore, the strong linear relationship between exam performance and assigned letter grade that is present for A to C students is not present for D to F students.
-
It should also be noted that over this same time span, more than 50 physics professors contributed to creating the exams used in the introductory courses.
-
In a second splitting method, the “even” test is literally the collection of the even-numbered questions from the first and third midterms and the odd-numbered questions from the second midterm and final. The reverse construction is made for the “odd” test. The uncertainty found using this method was 3.5%.
-
A third splitting method is simply an alteration of the second splitting method. Here, the “even” test is questions 1, 4, 5, 8, 9,…, from the first and third midterms and questions 2, 3, 6, 7,…, from the second midterm and final. The reverse construction is made for the odd test. The uncertainty from this splitting was 3.6%.
-
An offset to zero for each semester could be made so that all semesters had the same average percent difference in even and odd tests. This correction would account for the fact that students in different course semesters do not have the same even and odd tests. Adding this offset has the inherent effect of diminishing the standard deviations in the distributions to 3.2% for both the second and third methods of splitting the questions. This offset had little effect on the first splitting method.
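A minimal sketch of the per-semester offset, assuming the even-minus-odd percent differences are stored per semester in a dictionary (the data layout and function name are hypothetical). Subtracting each semester's own mean removes the between-semester spread, so the pooled standard deviation cannot increase.

```python
from statistics import mean, pstdev

def center_by_semester(differences_by_semester):
    """Shift each semester's (even - odd) percent differences to zero mean.

    differences_by_semester: dict mapping a semester label to a list of
    per-student differences (hypothetical layout).
    """
    centered = []
    for differences in differences_by_semester.values():
        offset = mean(differences)  # this semester's average even-odd gap
        centered.extend(d - offset for d in differences)
    return centered
```

Comparing `pstdev` of the pooled raw differences with `pstdev` of the centered list shows how much of the spread came from semester-to-semester differences in the tests themselves.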
-
J. R. Taylor, An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (University Science Books, Sausalito, CA, 1982).
-
A letter grade difference of 1.0 is equivalent to the difference between an A and a B or between a B and a C. A letter grade difference of 1/3 is equivalent to the difference between an A and an A− or between an A− and a B+.
-
H. Wainer and D. Thissen, in True Score Theory: The Traditional Method, edited by David Thissen and Howard Wainer (Lawrence Erlbaum Associates, Hillsdale, NJ, 2001), Chap. 2, pp. 23–72.
-
C. C. Peters and W. R. Van Voorhis, Statistical Procedures and Their Mathematical Bases (McGraw-Hill, New York, 1940).
-
Common convention is to desire reliability correlation coefficients greater than 0.80 to ensure that a student’s exam score uncertainty is less than half of the standard deviation in the class’ exam score distribution.
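The arithmetic behind this convention can be checked with the classical test theory expression for the standard error of measurement, SEM = SD × sqrt(1 − r). The note does not state which formula underlies the convention, so treating it as this one is an assumption; the function name is illustrative.

```python
import math

def sem_as_fraction_of_sd(reliability):
    """Standard error of measurement divided by the class standard deviation."""
    return math.sqrt(1.0 - reliability)
```

Under this expression, a reliability of 0.80 gives an uncertainty of about 0.45 standard deviations, and the one-half mark is reached exactly at a reliability of 0.75.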
-
L. J. Cronbach, Coefficient alpha and the internal structure of tests, Psychometrika 16, 297 (1951).
-
Because some of the exam items are grouped together under the same physical situation, splitting these items into separate split-half exams generally increases the correlation coefficient between the split-half exams and thus artificially increases the coefficient alpha. It may be more appropriate to treat those questions that are grouped together under the same prompt as testlets, and then to calculate alpha using testlet scores. To see what effect this might have on our alpha values, we examined four semester sets of exams: two from calculus-based mechanics and two from algebra-based mechanics. In each of the four semesters, the testlet alpha was indeed less than the item alpha, but never by more than 2% of the item alpha. This difference between the item and testlet alphas is less than the variation between semester item alphas.
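The item-versus-testlet comparison can be sketched with the standard coefficient alpha formula, alpha = k/(k − 1) × (1 − Σ component variances / variance of totals). The helper names and the column-per-item data layout are assumptions made for illustration, not the authors' code.

```python
from statistics import pvariance

def cronbach_alpha(score_columns):
    """Coefficient alpha for k components.

    score_columns: list of k lists, each holding one component's scores
    for every student, in the same student order.
    """
    k = len(score_columns)
    totals = [sum(student_scores) for student_scores in zip(*score_columns)]
    component_variance = sum(pvariance(column) for column in score_columns)
    return k / (k - 1) * (1 - component_variance / pvariance(totals))

def to_testlets(score_columns, groups):
    """Sum the item columns that share a prompt into one testlet column.

    groups: list of lists of column indices (hypothetical grouping).
    """
    return [
        [sum(values) for values in zip(*(score_columns[i] for i in group))]
        for group in groups
    ]
```

Calling `cronbach_alpha(to_testlets(columns, groups))` and comparing it with `cronbach_alpha(columns)` gives the testlet and item alphas for the same exam.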
-
T. P. Hogan, Relationship between free-response and choice-type tests of achievement: A review of the literature (ERIC Clearinghouse on Tests and Measurements, Princeton, NJ, 1981).
-
One justification for this selection process is that if only A and F students participated in the study, correlations between multiple-choice and constructed-response scores would be artificially high. We wanted to make sure there was an even distribution of students in the letter grade range from A to C. This is the range of most interest to us, since it is in this range that students’ course grades depend predominantly on exam performance. Students in the D to F range do poorly on all components of the course, not just the exams. To ensure that there was an equal number of students in each grade category, we chose to select only those students who had scored consistently on their three midterm exams. If a student receives an “A” on one midterm but then receives a “C” on another, one does not know whether this student is really an A, B, or C student.
-
This weighting system was instituted to allow for partial credit. The five-option items are intended to be more difficult than two- and three-option items. Students can receive partial credit on a five-option item in one of the following ways: six points if only one option is chosen and is correct, three points if only two options are chosen and one of the chosen options is correct, two points if only three options are chosen and one of the chosen options is correct, and zero points for all other markings.
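The partial-credit rule for a five-option item can be written out as a small function. The function name and the use of a set of option labels are illustrative choices; the point values come from the note above.

```python
def five_option_points(chosen, correct):
    """Points earned on a five-option item.

    chosen: set of option labels the student marked.
    correct: the single correct option label.
    """
    # 6, 3, or 2 points when one, two, or three options are marked
    # and the correct option is among them; zero otherwise.
    points_by_count = {1: 6, 2: 3, 3: 2}
    if correct in chosen:
        return points_by_count.get(len(chosen), 0)
    return 0
```

One consequence of these values is that blind guessing has the same expected return whichever way a student marks: 6 × 1/5, 3 × 2/5, and 2 × 3/5 all equal 1.2 points.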
-
To address any concerns that these raw correlations are large because of the selection of students who participated in the study, there is a correction that can be made to estimate what the raw correlations would be if the students were a pure random sampling of the entire class. This correction of heterogeneity had little effect on our raw correlations: for group 1, r=0.88 went to 0.90, and for group 2, r=0.92 went to 0.89. We were able to test the validity of this correction from our reliability data and found that it predicted on average at most a value that was only 0.62%±0.07% over the actual value.
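The note does not give the formula used for this correction. One common textbook form for correcting a correlation when the sample's spread differs from the full population's is R = r·u / sqrt(1 + r²(u² − 1)), where u is the ratio of the full-class standard deviation to the sample standard deviation. The sketch below assumes that form; the function and parameter names are hypothetical, and it should not be read as the authors' exact procedure.

```python
import math

def correct_for_heterogeneity(r, sd_class, sd_sample):
    """Estimate the full-class correlation from a sample correlation r.

    sd_class: standard deviation of the selection variable in the whole class.
    sd_sample: standard deviation of the same variable in the studied sample.
    Assumes the common range-restriction formula; not confirmed by the source.
    """
    u = sd_class / sd_sample
    return r * u / math.sqrt(1.0 + r * r * (u * u - 1.0))
```

With this form the correction raises the correlation when the sample is less variable than the class (u > 1) and lowers it when the sample is more variable (u < 1), which matches the directions reported for groups 1 and 2.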
-
Educational Measurement, edited by R. L. Thorndike (American Council on Education, Washington, D.C., 1971).
-
Using the heterogeneity correction, the raw correlation values between MC and CS went from 0.78 and 0.83 to 0.81 and 0.77 for groups 1 and 2, respectively.
-
S. Eidelman, K. G. Hayes, K. A. Olive, M. Aguilar-Benitez, C. Amsler, D. Asner, K. S. Babu, R. M. Barnett, J. Beringer, P. R. Burchat, C. D. Carone, C. Caso, G. Conforto, O. Dahl, G. D’Ambrosio, M. Doser, J. L. Feng, T. Gherghetta, L. Gibbons, M. Goodman, C. Grab, D. E. Groom, A. Gurtu, K. Hagiwara, J. J. Hernández-Rey, K. Hikasa, K. Honscheid, H. Jawahery, C. Kolda, Y. Kwon, M. L. Mangano, A. V. Manohar, J. March-Russell, A. Masoni, R. Miquel, K. Mönig, H. Murayama, K. Nakamura, S. Navas, L. Pape, C. Patrignani, A. Piepke, G. Raffelt, M. Roos, M. Tanabashi, J. Terning, N. A. Törnqvist, T. G. Trippe, P. Vogel, C. G. Wohl, R. L. Workman, W.-M. Yao, P. A. Zyla, B. Armstrong, P. S. Gee, G. Harper, K. S. Lugovsky, S. B. Lugovsky, V. S. Lugovsky, A. Rom, M. Artuso, E. Barberio, M. Battaglia, H. Bichsel, O. Biebel, P. Bloch, R. N. Cahn, D. Casper, A. Cattai, R. S. Chivukula, G. Cowan, T. Damour, K. Desler, M. A. Dobbs, M. Drees, A. Edwards, D. A. Edwards, V. D. Elvira, J. Erler, V. V. Ezhela, W. Fetscher, B. D. Fields, B. Foster, D. Froidevaux, M. Fukugita, T. K. Gaisser, L. Garren, H.-J. Gerber, G. Gerbier, F. J. Gilman, H. E. Haber, C. Hagmann, J. Hewett, I. Hinchliffe, C. J. Hogan, G. Höhler, P. Igo-Kemenes, J. D. Jackson, K. F. Johnson, D. Karlen, B. Kayser, D. Kirkby, S. R. Klein, K. Kleinknecht, I. G. Knowles, P. Kreitz, Yu. V. Kuyanov, O. Lahav, P. Langacker, A. Liddle, L. Littenberg, D. M. Manley, A. D. Martin, M. Narain, P. Nason, Y. Nir, J. A. Peacock, H. R. Quinn, S. Raby, B. N. Ratcliff, E. A. Razuvaev, B. Renk, G. Rolandi, M. T. Ronan, L. J. Rosenberg, C. T. Sachrajda, Y. Sakai, A. I. Sanda, S. Sarkar, M. Schmitt, O. Schneider, D. Scott, W. G. Seligman, M. H. Shaevitz, T. Sjöstrand, G. F. Smoot, S. Spanier, H. Spieler, N. J. C. Spooner, M. Srednicki, A. Stahl, T. Stanev, M. Suzuki, N. P. Tkachenko, G. H. Trilling, G. Valencia, K. van Bibber, M. G. Vincter, D. Ward, B. R. Webber, M. Whalley, L. Wolfenstein, J. Womersley, C. L. Woody, O. V. Zenin, and R.-Y. 
Zhu, Review of Particle Physics, Phys. Lett. B 592, 1 (2004).
-
P. Heller and M. Hollabaugh, Teaching problem solving through cooperative grouping. Part 1: Group versus individual problem solving, Am. J. Phys. 60, 627 (1992).
-
P. Heller and M. Hollabaugh, Teaching problem solving through cooperative grouping. Part 2: Designing problems and structuring groups, Am. J. Phys. 60, 637 (1992).
-
Full credit for a two-choice, three-choice, or five-choice question is two points, three points, or six points, respectively. See endnote in the subsection “The Study” of the Validity section for an explanation of the weighted grading system.
-
For more examples of questions used in our exams, visit the Illinois Physics Education Research Group’s website at http://www.physics.uiuc.edu/Research/PER/ and click on the “Resources” link. Researchers and teachers can gain free access to all of the midterm exams used in the introductory courses in recent years.