Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 12, 2011 Article Abstracts
Vol. 12, No. 1 Spring 2011
Using Adjusted GPA and Adjusted Course Difficulty Measures
to Evaluate Differential Grading Practices in College
Dina Bassiri and E. Matthew Schulz
Abstract
In this study, the Rasch rating scale model (Andrich, 1978) was applied to college grades of four freshman cohorts
from a large public university. After editing, the data represented approximately 34,000 students, 1,700 courses
and 119 departments. The rating scale model analysis yielded measures of student achievement and course difficulty.
Indices of the difficulty of academic departments were derived through secondary analyses of course
difficulty measures. Differences between rating scale model measures and simple grade averages were examined
for students, courses, and academic departments. The differences were provocative and suggest that the
rating scale model could be a useful tool in addressing a variety of issues that concern college administrators.
****
Optimizing the Compatibility between Rating Scales and Measures
of Productive Second Language Competence
Christopher Weaver
Abstract
This study presents a systematic investigation concerning the performance of different rating scales used in
the English section of a university entrance examination to assess 1,287 Japanese test takers’ ability to write a
third-person introduction speech. Although the rating scales did not conform to all of the expectations of the
Rasch model, they successfully defined a meaningful continuum of English communicative competence. In
some cases, the expectations of the Rasch model needed to be weighed against the specific assessment needs
of the university entrance examination. This investigation also found that the degree of compatibility between
the number of points allotted to the different rating scales and the various requirements of an introduction
speech played a considerable role in determining the extent to which the different rating scales conformed to
the expectations of the Rasch model. Compatibility thus becomes an important factor to consider for optimal
rating scale performance.
****
Developing a Domain Theory Defining and Exemplifying
a Learning Theory of Progressive Attainments
C. Victor Bunderson
Abstract
This article defines the concept of Domain Theory, or, when educational measurement is the goal, one might
call it a “Learning Theory of Progressive Attainments in X Domain”. The concept of Domain Theory is first
shown to be rooted in validity theory, then the concept of Domain Theory is expanded to amplify its necessary
but long neglected connection to design research disciplines. The development of a local learning theory of
progressive attainments in the domain of Fluent Oral Reading is presented as an illustration. Such a theory is
local to a defined domain of application, having well-delineated boundaries. It depends on measures having a
deep and valid connection to constructs, and on the constructs connecting back to the items or tasks at pertinent
levels of the measurement scale. Thus instrument development and theory development, which occur in tandem, depend on
establishing construct validity in a deep and thoroughgoing manner.
****
Bringing Human, Social, and Natural Capital to Life:
Practical Consequences and Opportunities
William P. Fisher, Jr.
Abstract
Capital is defined mathematically as the abstract meaning brought to life in the two phases of the development
of “transferable representations,” which are the legal, financial, and scientific instruments we take for granted
in almost every aspect of our daily routines. The first, conceptual and gestational, and the second, parturitional
and maturational, phases in the creation and development of capital are contrasted. Human, social, and natural
forms of capital should be brought to life with at least the same amounts of energy and efficiency as have been
invested in manufactured and liquid capital, and property. A mathematical law of living capital is stated. Two
examples of well-measured human capital are offered. The paper concludes with suggestions for the ways that
future research might best capitalize on the mathematical definition of capital.
****
Understanding Rasch Measurement:
Distractors with Information in Multiple Choice Items:
A Rationale Based on the Rasch Model
David Andrich and Irene Styles
Abstract
There is a substantial literature on attempts to obtain information on the proficiency of respondents from distractors
in multiple choice items. Information in a distractor implies that a person who chooses that distractor has
greater proficiency than if the person chose another distractor with no information. A further implication is that
the distractor deserves partial credit. However, it immediately follows from the Rasch model that if a distractor
deserves partial credit, then the response to that distractor and other distractors should not be pooled into a
single response with a single probability of an incorrect response. Using the partial credit parameterization of
the polytomous Rasch model, the paper shows how an hypothesis can be formed, and tested, regarding information
in a distractor. The hypothesis is formed by studying the shape of the distractor response curves across
the continuum, and the hypothesis is tested by scoring the correct response 2, the hypothesized distractor 1,
and other distractors 0, and then applying the polytomous Rasch model. Multiple pieces of evidence, including
fit of the responses at the two thresholds and the order of the two threshold estimates, are used in deciding if a
distractor has information. An example illustrating the theory and its application is provided.
****
Vol. 12, No. 2 Summer 2011
A Comparison between Robust z and 0.3-Logit Difference
Procedures in Assessing Stability of Linking Items for the Rasch Model
Huynh Huynh and Anita Rawls
Abstract
There are at least two procedures to assess item difficulty stability in the Rasch model: the robust z procedure and
the ".3 Logit Difference" procedure. The robust z procedure is a variation of the z statistic that reduces dependency
on outliers. The ".3 Logit Difference" procedure is based on experiences in Rasch linking for tests developed by
Harcourt. Both methods were applied to archival data from two large-scale South Carolina assessment programs:
HSEE 1986/1987 and PACT 2004/2005. The results of the analysis showed that the ".3 Logit Difference" procedure
identified slightly more items (2.6% more) as stable across all items under study. In addition, approximately 93% of all
items under consideration were identically classified as stable or unstable for both procedures. This very high
level of agreement between the two methods indicates that either procedure can be safely used to identify stable
items for use in a common-item linking design. The advantage of the robust z procedure lies in its foundation
of robust statistical inference. The procedure takes into account well-accepted models for identifying outliers
and permits critical values to be set at a specified Type I error rate.
****
Assessment of English Language Development: A Validity Study of a District Initiative
Juan D. Sanchez
Abstract
The San Francisco Unified School District (SFUSD) uses the Language and Literacy Assessment Rubric
(LALAR) as the secondary measurement required by the No Child Left Behind (NCLB) Act to measure English
proficiency of English language learners (ELLs). In this analysis, the Rasch model is used to identify whether
the LALAR is a valid measurement instrument and scale to measure the “English proficiency” of ELLs. This
analysis investigates the relationship between student ability (θ) and the probability that the student will respond
correctly to an item on the LALAR. Controlling for this relationship, the characteristics of each item, the
ability of each student, and the measurement error associated with each score were mathematically derived. This
allows validity and reliability tests to be conducted, which help determine whether the LALAR is a useful
accountability measure for ELLs.
****
Equating of Multi-Facet Tests Across Administrations
Mary Lunz and Surintorn Suanthong
Abstract
The desirability of test equating to maintain the same criterion standard from test administration to test
administration has long been accepted for multiple choice tests. The same consistency of expectations is desirable
for performance tests, especially if they are part of a licensure or certification process or used for other
high stakes decisions (e.g., graduation). Performance tests typically have three or more facets (e.g., examinees,
raters, items, and tasks), all of which must be accounted for in the test-equating process. The application of
the multi-facet Rasch model (Linacre, 2003a) is essential for equating performance tests because it provides
calibrations of the elements of each facet. It also accounts for the differences in the tests taken by each examinee
within a test administration. When multi-facet tests are equated across administrations, differences between the
benchmark scale and the current test must be accounted for in each facet. Examinee measures are then adjusted
for the differences between tests.
The examples presented in this article were selected because of their difference in size and complexity
of test design. Because they are different, they demonstrate how the same principles of common element test
equating can be used regardless of the number of facets included in the test. Performance tests with more than
two facets can be equated, as long as appropriate quality control methods are employed. First, use carefully
selected common elements for each facet that represent the content and properties of the test. The common
elements should be unaltered from their original use. Then, the most effective method is to initially anchor all
common elements in each facet, then iteratively unanchor those elements which do not meet the criteria for
displacement and fit. Strict criteria for displacement must be used consistently among facets. The suggested
criterion for displacement is equal to or less than 0.5 logits. Unanchoring inconsistent and/or misfitting facet
elements will improve the quality of the test equating.
****
Examining Student Rating of Teaching Effectiveness using FACETS
Nuraihan Mat Daud and Noor Lide Abu Kassim
Abstract
Students’ evaluations of teaching staff can be considered high-stakes, as they are often used to determine promotion,
reappointment, and merit pay for academics. Using Facets, the reliability and validity of one student rating
questionnaire is analysed. A total of 13,940 respondents of the Human Science Division of International Islamic
University Malaysia were involved in the study. The analysis shows that the student rating questionnaire used
was valid and reliable, and that it allows identification both of staff deserving the institution's prestigious teaching
excellence awards and of those needing in-service training. No significant differences were found by staff rank,
age, or gender. The study also shows that the majority of staff have problems keeping the
class interested and getting students to participate in class activities. Faculty also rarely discussed common
errors in assignments and tests.
****
Exploring Differential Item Functioning (DIF) with the Rasch Model: A Comparison of Gender Differences on Eighth Grade Science Items in the United States and Spain
Tasha Calvert Babiar
Abstract
Traditionally, women and minorities have not been fully represented in science and engineering. Numerous
studies have attributed these differences to gaps in science achievement as measured by various standardized tests.
Rather than describe mean group differences in science achievement across multiple cultures, this study focused
on an in-depth item-level analysis across two countries: Spain and the United States. This study investigated
eighth-grade gender differences on science items across the two countries. A secondary purpose of the study was
to explore the nature of gender differences using the many-faceted Rasch Model as a way to estimate gender DIF.
A secondary analysis of data from the Third International Mathematics and Science Study (TIMSS) was used
to address three questions: 1) Does gender DIF in science achievement exist? 2) Is there a relationship between
gender DIF and characteristics of the science items? 3) Do the relationships between item characteristics and
gender DIF in science items replicate across countries? Participants included 7,087 eighth-grade students from
the United States and 3,855 students from Spain who participated in TIMSS. The Facets program (Linacre and
Wright, 1992) was used to estimate gender DIF.
The results of the analysis indicate that the content of the item seemed to be related to gender DIF. The
analysis also suggests that there is a relationship between gender DIF and item format. No pattern of gender DIF
related to cognitive demand was found. The general pattern of gender DIF was similar across the two countries
used in the analysis.
The strength of item-level analysis as opposed to group mean difference analysis is that gender differences
can be detected at the item level, even when no mean differences can be detected at the group level.
****
Understanding Rasch Measurement: A Mapmark Method of Standard Setting as Implemented for the National Assessment Governing Board
E. Matthew Schulz and Howard C. Mitzel
Abstract
This article describes a Mapmark standard setting procedure, developed under contract with the National Assessment
Governing Board (NAGB). The procedure enhances the bookmark method with spatially representative
item maps, holistic feedback, and an emphasis on independent judgment. A rationale for these enhancements,
and the bookmark method, is presented, followed by a detailed description of the materials and procedures used
in a meeting to set standards for the 2005 National Assessment of Educational Progress (NAEP) in Grade 12
mathematics. The use of difficulty-ordered content domains to provide holistic feedback is a particularly novel
feature of the method. Process evaluation results comparing Mapmark to Angoff-based methods previously
used for NAEP standard setting are also presented.
****
Vol. 12, No. 3 Fall 2011
Diagnosing a Common Rater Halo Effect in the Polytomous Rasch Model
Ida Marais and David Andrich
Abstract
The ‘halo effect’ may be unique to different raters or common to all raters. When common to all raters, halo is
not detectable through standard fit indices of the three-facet Rasch model used to account for differences in rater
severities. Using a formulation of halo as a violation of local independence, a halo effect common to all raters is
simulated and shown to be diagnosable through contrasts between two-facet stack and rack Rasch analyses. In
the former, the thresholds are clustered and the distribution of persons is multimodal; in the latter, all thresholds
are close together and the distribution of persons is unimodal. In the former, the scale is stretched, and the person
separation inflated, relative to the latter.
****
A Comparison of Structural Equation and Multidimensional Rasch Modeling Approaches to Confirmatory Factor Analysis
Edward W. Wolfe and Kusum Singh
Abstract
This paper compares the results of applications of the Multidimensional Random Coefficients Multinomial Logit
Model (MRCMLM) to comparable Structural Equation Model (SEM) applications for the purpose of conducting
a Confirmatory Factor Analysis (CFA). We review SEM as it is applied to CFA, identify some parallels between
the MRCMLM approach to CFA and that utilized in a standard SEM CFA, and illustrate the comparability of
MRCMLM and SEM CFA results for three datasets. Results indicate that the two approaches tend to identify
similar dimensional models as exhibiting best fit and provide comparable depictions of latent variable correlations,
but the two procedures depict the reliability of measures differently.
****
The Rainbow Families Scale (RFS): A Measure of Experiences Among Individuals with Lesbian and Gay Parents
David J. Lick, Karen M. Schmidt, and Charlotte J. Patterson
Abstract
According to two decades of research, parental sexual orientation does not affect overall child development.
Researchers have not found significant differences between offspring of heterosexual parents and those of lesbian
and gay parents in terms of their cognitive, psychological, or emotional adjustment. Still, there are gaps in the
literature regarding social experiences specific to offspring of lesbian and gay parents. This study’s objective
was to construct a measure of those experiences. The Rainbow Families Scale (RFS) was created on the basis
of focus group discussions (N = 9 participants), and then piloted (N = 24) and retested with a new sample (N =
91) to examine its psychometric properties. Exploratory factor analyses uncovered secondary dimensions and
Rasch analytic procedures examined item fit, reliability, and category usage. Misfitting items were eliminated
where necessary, yielding a psychometrically sound measurement tool to aid in the study of individuals with
lesbian and gay parents.
****
Development of an Instrument for Measuring Self-Efficacy in Cell Biology
Suzanne Reeve, Elizabeth Kitchen, Richard R. Sudweeks, John D. Bell, and William S. Bradshaw
Abstract
This article describes the development of a ten-item scale to assess biology majors’ self-efficacy towards the
critical thinking and data analysis skills taught in an upper-division cell biology course. The original seven-item
scale was expanded to include three additional items based on the results of item analysis. Evidence of reliability
and validity was collected and reported for the revised scale. In addition, the effect of varying the number of
response categories presented with the items was empirically examined by administering different versions of
the instrument containing 6, 11, 21, and 101 response categories to randomly selected samples of students in
the course. Rasch scaling procedures were used to analyze the results.
Contrary to Bandura’s recommendation for using the 101-point scale (0-100), the results indicated that most
respondents used only a subset of the options in the 101-point scale and that the 6-point and 11-point scales
produced less threshold disordering for the purpose of assessing changes in students’ self-efficacy in the context
of a one-semester course.
****
Measuring Schools’ Efforts to Partner with Parents of Children Served Under IDEA:
Scaling and Standard Setting for Accountability Reporting
Batya Elbaum, William P. Fisher, Jr., and W. Alan Coulter
Abstract
Indicator 8 of the State Performance Plan (SPP), developed under the 2004 reauthorization of the Individuals
with Disabilities Education Act (IDEA 2004, Public Law 108-446) requires states to collect data and report
findings related to schools’ facilitation of parent involvement. The Schools’ Efforts to Partner with Parents
Scale (SEPPS) was developed to provide states with a means to address this new reporting requirement. Items
suggested by stakeholder groups were piloted with a nationally representative sample of 2,634 parents of students
with disabilities ages 5-21 in six states. Rasch scaling was used to calibrate a meaningful and invariant
item hierarchy. The 78 calibrated items had measurement reliabilities ranging from .94-.97. Using data from
the pilot study, stakeholders established a recommended performance standard set at a meaningful point in the
item hierarchy. Implications of the findings are discussed in relation to the need for rigorous metrics within
state accountability systems.
****
An ADL Measure for Spinal Cord Injury
Anne Bryden and Nikolaus Bezruczko
Abstract
Occupational therapists do not have a comprehensive, objective method for measuring how persons with tetraplegia
perform activities of daily living (ADL) in their homes and communities, because spinal cord injury (SCI) ADL performance is
usually determined in rehabilitation. The ADL Habits Survey (ADLHS) is designed specifically to address this
knowledge gap by surveying performance on relevant and meaningful activities in homes and communities.
After a comprehensive task analysis and pilot development, 30 activities were selected that emphasize a broad
range of hand and wrist, reaching, and grasping movements in compound activities. A sample of 49 persons with
cervical spinal cord injuries responded to items. The sample was predominantly male, median age was 41 years,
and ASIA motor classification levels ranged from C2 through C8/T1 with majority concentration in C4, C5, or
C6 (68%). Each participant report was rated by an occupational therapist using a seven category rating scale, and
the item-by-participant response matrix (30 × 49) was analyzed with a Rasch model for rating scales. Results
showed excellent participant separation (>4) and very high reliability (>.95), and both item and participant fit
values were adequate (standardized infit within ±3 SD units). With only two exceptions, all participants
fit the Rasch rating scale model, and only one item, "Light housekeeping," presented significant fit issues. Principal
Components Analysis of item residuals did not reveal serious threats to unidimensionality. A
between-group fit comparison of participants with more versus less movement found invariant item calibrations,
and ANOVA of participant measures found statistically significant differences across ASIA motor classification
levels. These ADLHS results offer occupational therapists a new method for measuring ADL that is potentially
more sensitive to functional changes in tetraplegia than most instruments in common use. Accommodation of
step disorder with a three category rating scale did not diminish measurement properties.
****
Understanding Rasch Measurement: Selecting Cut Scores with a Composite of Item Types:
The Construct Mapping Procedure
Karen Draney and Mark Wilson
Abstract
In this paper, we describe a new method we have developed for setting cut scores between levels of a test. We
outline the wide variety of potential methods that have been used for such a process, and emphasize the need
for a coherent conceptual framework under which the variety of methods could be understood. We then describe
our particular method, based on an item response modeling framework, which uses the Wright Map, a graphical
model of item and threshold difficulties, and a piece of computer software that provides probabilities of various
responses for scores under consideration as cut scores. Finally, we describe a study we conducted for the Golden
State Examination in Chemistry, in which we investigate the classification agreement for two groups using the
method, and also investigate the reactions of the committee members to the procedure and the software, and the
lessons we learned from this process.
****
Vol. 12, No. 4 Winter 2011
Reducing the Item Number to Obtain Same-Length Self-Assessment Scales: A Systematic
Approach using Result of Graphical Loglinear Rasch Modeling
Tine Nielsen and Svend Kreiner
Abstract
The Revised Danish Learning Styles Inventory (R-D-LSI; Nielsen, 2005), which is an adaptation of the Sternberg-
Wagner Thinking Styles Inventory (Sternberg, 1997), comprises 14 subscales, each measuring a separate learning
style. Of these 14 subscales, 9 are eight items long and 5 are seven items long. For self-assessment, self-scoring
and self-interpretational purposes it is deemed prudent that subscales measuring comparable constructs are of
the same item length. Consequently, in order to obtain a self-assessment version of the R-D-LSI with an equal
number of items in each subscale, a systematic approach to item reduction based on results of graphical loglinear
Rasch modeling (GLLRM) was designed. This approach was then used to reduce the number of items in the
subscales of the R-D-LSI which had an item-length of more than seven items, thereby obtaining the Danish
Self-Assessment Learning Styles Inventory (D-SA-LSI) comprising 14 subscales each with an item length of
seven. The systematic approach to item reduction based on results of GLLRM will be presented and exemplified
by its application to the R-D-LSI.
****
Using Rasch Modeling to Measure Acculturation in Youth
Melinda F. Davis, Mary Adam, Scott Carvajal, Lee Sechrest, and Valerie F. Reyna
Abstract
Ethnic differences in health outcomes are assumed to reflect levels of acculturation, among other factors. Health
surveys frequently include language and social interaction items taken from existing acculturation instruments.
This study evaluated the dimensionality of responses to typical bilinear items in Latino youth using Rasch
modeling. Two seven-item scales measuring Anglo-Hispanic orientation were adapted from Marín and Gamba
(1996) and Cuéllar, Arnold, and Maldonado (1995). Most of the items fit the Rasch model. However, there were
gaps in both the Hispanic and Anglo scales. The Anglo items were not well targeted for the sample because most
students reported they always spoke English. The lack of variability found in a heterogeneous sample of Latino
youth has negative implications for the common practice of relying on language as a measure of acculturation.
Acculturation instruments for youth probably need more sensitive items to discriminate linguistic differences,
or to measure other factors.
****
Measurement of Mothers’ Confidence to Care for Children Assisted with Tracheostomy Technology in Family Homes
Nikolaus Bezruczko, Shu-Pi C. Chen, Constance D. Hill, and Joyce M. Chesniak
Abstract
The purpose of this research was to develop an objective, linear measure of mothers’ confidence to care for children
assisted with tracheostomy medical technology in their homes. Caregiver confidence is addressed in this research
for three technologies, namely, a) tracheostomy, b) tracheostomy and ventilator, and c) BiPAP/CPAP, although
detailed measurement results are only reported for tracheostomy, and its co-calibration with tracheostomy and
ventilator caregiving items. The sample consisted of 53 mothers responding to several caregiver questionnaires
based on a caregiving task matrix after content and clinical validation. A major challenge was integrating this
construct with overarching principles already established by Functional Caregiving, a multi-level humanistic
caregiving model for children with intellectual disabilities. Empirical analyses included principal components
analysis, and then linear transformation of Tracheostomy item ratings to an objective, equal-interval scale with a
Rasch model. Results show caregiver separation on the Tracheostomy caregiving scale was 2.66 and reliability,
.88. In general, co-calibration improved measurement properties without affecting mothers’ caregiving confidence
measures. Although sample size was small, measuring mothers’ confidence to care for a child supported
by complex medical technologies appears very promising.
****
Comparability of Item Quality Indices from Sparse Data Matrices with Random and Non-Random Missing Data Patterns
Edward W. Wolfe and Michael T. McGill
Abstract
This article summarizes a simulation study of the performance of five item quality indicators (the weighted
and unweighted versions of the mean square and standardized mean square fit indices and the point-measure
correlation) under conditions of relatively high and low amounts of missing data under both random and conditional
patterns of missing data for testing contexts such as those encountered in operational administrations
of a computerized adaptive certification or licensure examination. The results suggest that weighted fit indices,
particularly the standardized mean square index, and the point-measure correlation provide the most consistent
information between random and conditional missing data patterns and that these indices perform more comparably
for items near the passing score than for items with extreme difficulty values.
****
The Influence of Labels Associated with Anchor Points of Likert-type Response Scales in Survey Questionnaires
Jean-Guy Blais and Julie Grondin
Abstract
Survey questionnaires are among the most widely used data gathering techniques in the social science researcher's
toolbox, and many factors can influence respondents' answers to items and affect data validity. Among these
factors, accumulated research demonstrates that the verbal and numeric labels associated with an item's
response categories may substantially influence the way respondents make their choices within the proposed
response format. In line with these findings, the focus of this article is to use
Andrich's rating scale model to illustrate what kind of influence the quantifier adverb "totally," used to label
or emphasize extreme categories, could have on respondents' answers.
****
Analysis of Letter Name Knowledge using Rasch Measurement
Ryan P. Bowles, Lori E. Skibbe, and Laura M. Justice
Abstract
Letter name knowledge (LNK) is a key predictor of later reading ability and has been emphasized strongly in
recent educational policy. Studies of LNK have implicitly treated it as a unidimensional construct with all letters
equally relevant to its measurement. However, some empirical research suggests that contextual factors can
affect the measurement of LNK. In this study, we analyze responses from 909 children on measures of LNK
using the Rasch model and its extensions, and consider two contextual factors: the format of assessment and the
own-name advantage, which states that children are more likely to know letters in their own first names. Results
indicate that both contextual factors have important impacts on measurement and that LNK does not meet the
requirements of Rasch measurement even when accounting for the contextual factors. These findings introduce
philosophical concerns for the measurement of constrained skills, which have limited content for assessment.
****
Understanding Rasch Measurement: Converging on the Tipping Point: A Diagnostic Methodology for Standard Setting
John A. Stahl and Kirk A. Becker
Abstract
This article discusses the strengths and weaknesses of the Angoff and Bookmark standard setting procedures.
An alternative approach that focuses on the strengths of these procedures and adds three diagnostic indices is
presented. This alternative approach is applied to three standard setting data sets and the results are discussed.