The Effects of Test Familiarity on Person-Fit and Aberrant Behavior
NCME Paper 2019
Hotaka Maeda, Ph.D. & Xiaolin Wang, Ph.D.
Abstract (50 words)
The person-fit to the Rasch model was evaluated for examinees taking multiple subject tests with a similar structure. The evaluation considered which test in the sequence (i.e., first, second) was taken. Compared to an examinee’s first test, person-fit improved for later tests. Test score reliability may improve with test familiarity.
Introduction
Aberrant behaviors are unusual test-taking behaviors that introduce noise to test data. They introduce nuisance constructs that are not intended to be measured and thus threaten measurement validity. One source of aberrant behavior is unfamiliarity with tests (Meijer & Sijtsma, 2001; Rupp, 2013). Examinees who take a new and unfamiliar test are likely to struggle to understand the test structure, gauge how much time they have for each item, navigate through a computer-based test, and handle their nerves. In contrast, examinees who are familiar with the test structure are likely to be less stressed, know how to prepare, and be able to complete the test efficiently. Compared to first-time takers’ results, scores for examinees who are familiar with the test structure may be less affected by the nuisance construct of test unfamiliarity and be more representative of their underlying ability. To the authors’ knowledge, this speculation has not been investigated and reported in the literature. Therefore, the purpose of this study is to examine the effects of test familiarity on person-fit and aberrant behavior using observed data.
Method
The instrument used in this study is a comprehensive medical achievement examination composed of eight clinical subject tests. Medical students typically take the test at the end of their clinical rotation in a given clinical subject. All clinical subject tests are structured identically:
- They are administered through the same platform.
- Item stems are worded similarly as they all target commonly encountered patient scenarios.
- All items in all tests are multiple-choice items with only one best answer.
Many examinees take all eight clinical subjects, but they do not take them in the same order. They can also choose to retake any clinical subject test. Therefore, the context of the instrument used in this study can be considered a quasi-experimental setting for assessing the effects of test familiarity on person-fit and aberrant behavior, where test familiarity can be defined by the number of clinical subject tests (including retakes) a candidate has taken.
Response data from all clinical subjects from July 2017 to June 2018 were used. Exploratory factor analysis with no rotation was conducted for each subject separately in order to identify high-quality items. Items were removed from the data if their factor loadings on the first dimension were less than 0.1. Then, the data were modeled using the Rasch model. For each subject, test forms were equated through concurrent calibration. Ability was estimated by maximum likelihood, and the estimates were standardized to N(0,1) and bounded within [-5, 5] so that the values could be compared across subjects.
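The ability-estimation step can be illustrated with a minimal sketch. It assumes the item difficulties are already known (in the study they come from concurrent calibration, which is not shown), uses Newton-Raphson as one common way to maximize the Rasch likelihood, and omits the later standardization to N(0,1). The function names and the example difficulties are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """P(correct) under the Rasch model for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, bound=5.0, max_iter=50, tol=1e-8):
    """Newton-Raphson ML estimate of theta, kept within [-bound, bound]."""
    total = sum(responses)
    # All-incorrect or all-correct patterns have no finite ML estimate,
    # so they are assigned the boundary value directly.
    if total == 0:
        return -bound
    if total == len(responses):
        return bound
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = total - sum(probs)                   # first derivative of log-likelihood
        information = sum(p * (1 - p) for p in probs)   # negative second derivative
        step = gradient / information
        theta = max(-bound, min(bound, theta + step))
        if abs(step) < tol:
            break
    return theta

# Hypothetical three-item test: two correct answers out of three.
example_difficulties = [-1.0, 0.0, 1.0]
example_theta = estimate_ability([1, 1, 0], example_difficulties)
```

At the estimate, the expected number of correct responses equals the observed raw score, which is the defining property of the Rasch ML estimate.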
Aberrant behavior was assessed using the lz* person-fit statistic (Snijders, 2001). The lz* is asymptotically distributed as N(0,1), where positive values represent good person-fit and negative values represent poor fit. If examinees respond to the items in a reasonable manner (e.g., without aberrance caused by unfamiliarity with the test), lz* should be high, indicating that their responses fit the model well. The lz* is uncorrelated with ability when aberrant behavior is not present. One typical cutoff for determining poor person-fit is -1.645, which corresponds to a one-tailed alpha level of .05.
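As a rough illustration of how a likelihood-based person-fit statistic behaves, the sketch below computes the uncorrected standardized log-likelihood statistic lz. This is not lz*: Snijders' (2001) version adds a correction for using an estimated ability, which is not implemented here. The item probabilities and response patterns are made up for the example.

```python
import math
from statistics import NormalDist

def lz_statistic(responses, probs):
    """Uncorrected standardized log-likelihood person-fit statistic (lz)."""
    # Observed log-likelihood of the response pattern.
    observed = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
                   for u, p in zip(responses, probs))
    # Expectation and variance of the log-likelihood under the model.
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (observed - expected) / math.sqrt(variance)

# Hypothetical model-implied probabilities of success, easiest item first.
probs = [0.9, 0.7, 0.5, 0.3, 0.1]
consistent = lz_statistic([1, 1, 1, 0, 0], probs)        # easy items right, hard items wrong
reversed_pattern = lz_statistic([0, 0, 1, 1, 1], probs)  # easy items wrong, hard items right

# One-tailed .05 cutoff under N(0,1); rounds to -1.645.
cutoff = NormalDist().inv_cdf(0.05)
```

The consistent pattern yields a positive value, while the reversed pattern falls well below the cutoff and would be flagged as poor fit.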
The degree of person-fit (i.e., lz*) was regressed on the sequence of tests using two separate two-level random intercept models. As examinees took multiple tests, the tests were modeled as nested within examinees. Model 1 included three exam-level predictors: 1) examinee age in years at the time of the exam, 2) standardized test score, and 3) whether the subject being taken is a retake. The only predictor at the examinee-level was the number of times the person had ever retaken any clinical subject test (0, 1, 2, and >2). The model could be written as:
Model 1: lz* ~ age + test.score + subject.retake + total.retake
Model 2 included all the predictors in Model 1 in addition to the test sequence as a categorical variable from 1 to 11 (i.e., the order in which the examinees took the test, such as first test, second test, etc.).
Model 2: lz* ~ age + test.score + subject.retake + total.retake + test.sequence
The test sequence for some students did not start with “first” if they had taken the tests prior to July 2017. The test sequence can extend longer for students who retake some clinical subject tests.
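For readers who prefer explicit notation, Model 2 can be sketched as the following two-level random-intercept regression. The symbols, the indexing (i for test administrations, j for examinees), and the dummy coding are notational assumptions made for illustration and are not taken from the original analysis.

```latex
% i indexes test administrations (level 1); j indexes examinees (level 2).
\[
\begin{aligned}
l^{*}_{z,ij} ={}& \beta_0 + \beta_1\,\mathrm{age}_{ij} + \beta_2\,\mathrm{score}_{ij}
                 + \beta_3\,\mathrm{retake}_{ij} \\
               &+ \sum_{k=1}^{3} \gamma_k\,\mathrm{totalretake}_{k,j}
                 + \sum_{s=2}^{11} \delta_s\,\mathrm{seq}_{s,ij} + u_j + e_{ij}, \\
               & u_j \sim N(0, \tau^2), \qquad e_{ij} \sim N(0, \sigma^2).
\end{aligned}
\]
```

Here the three total-retake indicators code the categories 1, 2, and >2 against the reference category 0, and the ten sequence indicators code positions 2 through 11 against the first test. Model 1 is the special case in which every \(\delta_s\) is zero.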
Residual plots were used to confirm that the residuals were approximately normally distributed with the same mean and standard deviation at every fitted value. Because Model 1 was nested within Model 2, they were compared using a likelihood-ratio test.
Results
For the purpose of this specific study, 1,422 out of 5,594 items were removed from analysis, many of which were pretest items. All subjects achieved unidimensionality after the removal of such items. In addition, response data from 55 tests were removed because of an abnormally high test sequence due to retakes (12 or more). The final sample size across all test subjects was 4,172 items on 42,903 test administrations given to 10,135 examinees (see Table 1). Each test contained an average of 96.7 items (SD = 9.3). A majority of examinees had no history of retaking any clinical subject test (68.4%). Only 6.7% of the tests were retakes.
Table 1. Number of Exams by Sequence and Clinical Subject
| Test sequence | Subject A | Subject B | Subject C | Subject D | Subject E | Subject F | Subject G | Subject H | Total |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 71 | 672 | 568 | 394 | 2,656 | 585 | 468 | 507 | 5,921 |
| 2 | 138 | 1,261 | 881 | 809 | 476 | 730 | 646 | 678 | 5,619 |
| 3 | 172 | 705 | 744 | 824 | 577 | 769 | 767 | 884 | 5,442 |
| 4 | 192 | 700 | 760 | 825 | 473 | 744 | 753 | 738 | 5,185 |
| 5 | 198 | 697 | 660 | 721 | 642 | 803 | 681 | 774 | 5,176 |
| 6 | 231 | 598 | 737 | 705 | 924 | 698 | 676 | 667 | 5,236 |
| 7 | 527 | 590 | 683 | 617 | 574 | 615 | 629 | 689 | 4,924 |
| 8 | 1,181 | 352 | 334 | 288 | 575 | 262 | 287 | 353 | 3,632 |
| 9 | 207 | 124 | 175 | 90 | 109 | 99 | 90 | 125 | 1,019 |
| 10 | 228 | 51 | 64 | 26 | 75 | 37 | 41 | 35 | 557 |
| 11 | 75 | 4 | 7 | 8 | 80 | 8 | 6 | 4 | 192 |
| Total | 3,220 | 5,754 | 5,613 | 5,307 | 7,161 | 5,350 | 5,044 | 5,454 | 42,903 |
Note. Many examinees take all eight clinical subjects, but they do not take them in the same order. Although there are only eight clinical subjects, the test sequence can extend beyond eight because of retakes.
Mean lz* was 0.04 (SD = 1.09), while the mean standardized test score was 0.02 (SD = 1.14). The mean SE of the standardized test scores was 0.51 (SD = 0.07). The mean standardized test score for those who had a history of retaking any clinical subject test was lower (M = -0.36, SD = 1.13) than for those who did not (M = 0.25, SD = 1.08). The percentage of test records exhibiting poor person-fit (i.e., lz* < -1.645) was 6.7%. Standardized test scores were positively correlated with lz* (r = .23).
A likelihood-ratio test showed that the addition of the test sequence predictor significantly improved the model fit, χ2(10) = 75.05, p < .001. Controlling for examinee age, total historical test retake count, whether the subject being taken is a retake, and standardized test score, person-fit was poorer on the first test than on every later test (p < .05). The coefficients from Model 2 are shown in Table 2. Compared to the first test, person-fit improved by 0.07 on the second test and by 0.27 on the 11th test.
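The reported p-value can be checked from the chi-square statistic alone. The sketch below uses the closed-form upper-tail probability that holds for even degrees of freedom, so it needs only the standard library. The 10 degrees of freedom correspond to the ten sequence indicators (positions 2 through 11) that Model 2 adds to Model 1. The function name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    # For df = 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

# Likelihood-ratio statistic and df as reported in the text.
p_value = chi2_sf_even_df(75.05, 10)
```

The resulting probability is far below .001, consistent with the reported result.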
Table 2. Model 2 Coefficients
| Predictor | Coef | SE | df | t | p |
|---|---|---|---|---|---|
| (Intercept) | 0.41 | 0.05 | 32,754 | 8.65 | <.001 |
| Examinee-level predictors | | | | | |
| Retake total = 0 (Reference) | – | – | – | – | – |
| Retake total = 1 | -0.04 | 0.02 | 10,131 | -2.62 | .009 |
| Retake total = 2 | -0.11 | 0.02 | 10,131 | -4.61 | <.001 |
| Retake total > 2 | -0.19 | 0.03 | 10,131 | -6.73 | <.001 |
| Test-level predictors | | | | | |
| Standardized score | 0.19 | 0.01 | 32,754 | 38.17 | <.001 |
| Examinee age in years | -0.02 | 0.00 | 32,754 | -9.30 | <.001 |
| Retaking the clinical subject | 0.10 | 0.02 | 32,754 | 4.44 | <.001 |
| Test sequence = 1 (Reference) | – | – | – | – | – |
| Test sequence = 2 | 0.07 | 0.02 | 32,754 | 3.63 | <.001 |
| Test sequence = 3 | 0.08 | 0.02 | 32,754 | 4.27 | <.001 |
| Test sequence = 4 | 0.10 | 0.02 | 32,754 | 5.03 | <.001 |
| Test sequence = 5 | 0.13 | 0.02 | 32,754 | 6.62 | <.001 |
| Test sequence = 6 | 0.13 | 0.02 | 32,754 | 6.57 | <.001 |
| Test sequence = 7 | 0.10 | 0.02 | 32,754 | 4.66 | <.001 |
| Test sequence = 8 | 0.13 | 0.02 | 32,754 | 5.86 | <.001 |
| Test sequence = 9 | 0.10 | 0.04 | 32,754 | 2.79 | .005 |
| Test sequence = 10 | 0.20 | 0.05 | 32,754 | 4.12 | <.001 |
| Test sequence = 11 | 0.27 | 0.08 | 32,754 | 3.27 | .001 |
Note. Person-fit was modeled using a two-level random-intercept model.
Model 2 also showed that those who had a history of retaking any clinical subject test tended to have lower person-fit than those who did not (p <.05). However, retaking the same clinical subject test was associated with an increase in person-fit by 0.10 (p <.001).
Discussion
This study shows that person-fit to the Rasch model improves as examinees gain experience in taking a series of tests with a similar structure. Improvements in person-fit were observed beyond the first and second tests. From the fourth test onward, lz* was higher than on the first test by 0.10 or more. For reference, an increase in lz* from 0 to 0.1 is equivalent to an increase in person-fit of 3.98 percentile points. The findings indicate that the reliability of the test scores may improve with test-taking experience, and they show the importance of examinee familiarity with the test structure. The improvement in person-fit with increased test familiarity supports the provision of practice materials in order to minimize the negative impact of test unfamiliarity and to promote measurement validity.
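The percentile equivalence quoted above follows from the asymptotic N(0,1) distribution of lz*. A minimal check, assuming that distribution holds exactly (the helper name is hypothetical):

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0,1), the asymptotic reference distribution of lz*

def percentile_gain(lz_from, lz_to):
    """Change in percentile rank when lz* moves from lz_from to lz_to."""
    return 100 * (std_normal.cdf(lz_to) - std_normal.cdf(lz_from))

gain = percentile_gain(0.0, 0.1)
```

Rounded to two decimals, `gain` is 3.98. Because the normal density peaks at zero, the same 0.1 increase starting from any other value of lz* corresponds to a smaller percentile change.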
When interpreting the data, retakes of the same clinical subject exams needed to be considered. The option to retake any test allowed the test sequence to go beyond the number of available clinical subjects (i.e., eight). Clearly, a person who has taken the same test multiple times (despite taking a different form every time) should be more familiar with the test than first-time takers. The examinees who have retaken any of the clinical subject exams tend to be lower achievers and have lower person-fit compared with non-retakers. However, their person-fit improved upon retaking the same clinical subject test. Also, the results suggest that poor person-fit arose more often from spuriously low-scoring aberrant behavior (i.e., poor performance), such as running out of time, than from spuriously high-scoring behavior, such as item pre-knowledge. This led many of the poor performers to retake the test. However, regardless of test-retaking behavior, familiarity with the test structure led to increases in person-fit.
The study is limited in that we did not directly investigate whether improvement in person-fit is in fact associated with an increase in the accuracy of the standardized test scores. This is rather difficult to show empirically, but it should be pursued in the future. Further, a quasi-experimental design was used, where some factors were uncontrolled, including allowing examinees to retake any test at their own will. These test-retaking patterns were not random as they were correlated with important variables such as the standardized test scores. The study should also be replicated using other psychometric models and test data.
References
Meijer, R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55, 3-38.
Snijders, T. (2001). Asymptotic null distribution of person-fit statistics with estimated person parameter. Psycho