Access Type

Open Access Dissertation

Date of Award


Degree Type


Degree Name



Education Evaluation and Research

First Advisor

Donald R. Marcotte, Ph.D.


Historically, test fairness has been an issue for stakeholders such as test makers, administrators, educators, and examinees. Biased testing carries "high stakes" consequences for all of these stakeholders. During the past 50 years, several cycles of dissatisfaction and reform have occurred in educational testing (McGhee, 1995). Large-scale testing has been used to make decisions regarding college entrance, employment opportunities, funding, school policies, and curriculum. The Michigan Educational Assessment Program (MEAP) was a large-scale testing program requiring students to pass several specific sections (reading, mathematics, and science) before receiving a state-endorsed diploma. Test measurement continues to come under the microscope because of these "high stakes" outcomes. Early on, classical statistics were used, which had several shortcomings. Item response theory was therefore developed in an attempt to explain differences in observed responses in terms of unobserved characteristics such as ability. Statistical techniques were also developed to detect differential item functioning (DIF) between two groups. This study measured DIF by gender and by race using the Mantel-Haenszel test statistic and the three-parameter logistic model. Both techniques detected DIF, but not for the same test items. The Mantel-Haenszel statistic was the easier method to use; it flagged 23 (approximately 20%) of the test items for gender and 19 (16.5%) for race. The three-parameter logistic model flagged 15 (13%) for gender and 18 (15.6%) for race. This model was difficult to use in its DOS extended version; however, it provided more information to the user. The difficulty with DIF was that once a test item had been flagged, the results needed to be carefully interpreted to determine whether item bias existed.
Although the two techniques did not flag the same test items, both tended to support the literature on gender and race differences for mathematics test items. Further examination of the differences between the statistical tests of significance is recommended in an effort to replicate the findings.
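
For readers unfamiliar with the two techniques named in the abstract, the following is a minimal illustrative sketch in their standard textbook forms; it is not the software or procedure used in the dissertation. The function names, the scaling constant of 1.7, the clamping of the continuity correction at zero, and the example counts are all assumptions made for illustration. The first function evaluates the three-parameter logistic item response function; the second computes the Mantel-Haenszel common odds ratio and continuity-corrected chi-square from 2x2 tables of examinees matched on total score.

```python
import math


def p_correct_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic model: probability of a correct response.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). scale=1.7 is a common
    convention and is an assumption here, not taken from the study.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def mantel_haenszel(strata):
    """Mantel-Haenszel common odds ratio and chi-square for one item.

    strata: iterable of (ref_right, ref_wrong, foc_right, foc_wrong)
    counts, one tuple per matched total-score level.
    Returns (alpha_mh, chi_square).
    """
    num = den = 0.0
    sum_obs = sum_exp = sum_var = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n < 2:
            continue  # a stratum this small carries no information
        n_ref = ref_right + ref_wrong
        n_foc = foc_right + foc_wrong
        m_right = ref_right + foc_right
        m_wrong = ref_wrong + foc_wrong
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
        sum_obs += ref_right
        sum_exp += n_ref * m_right / n
        sum_var += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
    if den == 0 or sum_var == 0:
        raise ValueError("no informative strata for this item")
    alpha_mh = num / den
    # Continuity correction of 0.5, clamped at zero (an implementation choice).
    chi_square = max(0.0, abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var
    return alpha_mh, chi_square


# Hypothetical counts for a single score level, for illustration only.
alpha, chi2 = mantel_haenszel([(30, 10, 20, 20)])
print(alpha, chi2)
```

An odds ratio near 1 suggests that matched examinees in the two groups have similar odds of answering the item correctly. The chi-square is conventionally compared with the critical value for one degree of freedom (3.84 at the .05 level). As the abstract notes, a flagged item still requires careful interpretation before it is judged biased.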