Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 

Volume 11, 2010 Article Abstracts

Vol. 11, No. 1 Spring 2010

Predicting Responses from Rasch Measures

John M. Linacre

Abstract

There is a growing family of Rasch models for polytomous observations. Selecting a suitable model for an existing dataset, estimating its parameters, and evaluating its fit are now routine. Problems arise when the model parameters are estimated from the current data but used to predict future data. In particular, ambiguities in the nature of the current data, or overfit of the model to the current dataset, may mean that better fit to the current data leads to worse fit to future data. The predictive power of several Rasch and Rasch-related models is discussed in the context of the Netflix Prize. Rasch-related models based on Singular Value Decomposition (SVD) and Boltzmann Machines are proposed.
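The Netflix-style prediction the abstract contrasts with Rasch modeling is typically a low-rank matrix factorization. A minimal sketch of that idea, with an invented person-by-item rating matrix and crude mean imputation of missing cells (none of this is taken from the article):

```python
import numpy as np

ratings = np.array([            # rows = raters, columns = items, 0 = missing
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

mask = ratings > 0
filled = ratings.copy()
filled[~mask] = ratings[mask].mean()     # fill missing cells with the observed mean

# Truncated SVD: keep the k largest singular values as a low-rank model
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Observed cells keep their ratings; missing cells take the low-rank estimate
predicted = np.where(mask, ratings, approx)
```

The low-rank reconstruction fits the observed data ever more closely as k grows, which is exactly the overfitting risk the abstract raises: better fit to current ratings need not mean better prediction of future ones.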

****

Concrete, Abstract, Formal, and Systematic Operations as Observed in a “Piagetian” Balance-Beam Task Series

Theo Linda Dawson, Eric Andrew Goodheart, Karen Draney, Mark Wilson, and Michael Lamport Commons

Abstract

We performed a Rasch analysis of cross-sectional developmental data gathered from children and adults who were presented with a task series derived from Inhelder and Piaget’s balance beam. The partial credit model situates both participants and items along a single hierarchically ordered dimension. As the Model of Hierarchical Complexity predicted, order of hierarchical complexity accurately predicted item difficulty, with notable exceptions at the formal and systematic levels. Gappiness between items was examined using the saltus model. A two-level saltus model, which examined the gap between the concrete/abstract and formal/systematic items, was a better predictor of performance than the Rasch analysis (χ² = 71.91, df = 4, p < .01).

****

Sources of Self-Efficacy Belief: Development and Validation of Two Scales

Ou Lydia Liu and Mark Wilson

Abstract

Self-efficacy belief has been an instrumental affective factor in predicting student behavior and achievement in academic settings. Although there is abundant literature on efficacy belief per se, the sources of efficacy belief have not been fully researched, and very few instruments exist to quantify them. To fill this void, we developed two scales for the two main sources of self-efficacy belief: past performance and social persuasion. Pilot test data were collected from 255 middle school students. A self-efficacy measure was also administered to the students as a criterion measure. The Rasch rating scale model was used to analyze the data. Information on item fit, item design, content validity, external validity, internal consistency, and person separation reliability was examined. The two scales displayed satisfactory psychometric properties. Applications and limitations of these two scales are also discussed.

****

Reducible or Irreducible? Mathematical Reasoning and the Ontological Method

William P. Fisher, Jr.

Abstract

Science is often described as nothing but the practice of measurement. This perspective follows from longstanding respect for the roles mathematics and quantification have played as media through which alternative hypotheses are evaluated and experience becomes better managed. Many figures in the history of science and psychology have contributed to what has been called the “quantitative imperative,” the demand that fields of study employ number and mathematics even when they do not constitute the language in which investigators think together. But what makes an area of study scientific is, of course, not the mere use of number, but communities of investigators who share common mathematical languages for exchanging qualitative and quantitative value. Such languages require rigorous theoretical underpinning, a basis in data sufficient to the task, and instruments traceable to reference standard quantitative metrics. The values shared and exchanged by such communities typically involve the application of mathematical models that specify the sufficient and invariant relationships necessary for rigorous theorizing and instrument equating. The mathematical metaphysics of science are explored with the aim of connecting principles of quantitative measurement with the structures of sufficient reason.

****

Children’s Understanding of Area Concepts: Development, Curriculum and Educational Achievement

Trevor G. Bond and Kellie Parkinson

Abstract

As one part of a series of studies undertaken to investigate the contribution of developmental attributes of learners to school learning, a representative sample of forty-two students (aged from 5 years 3 months to 13 years 1 month) was randomly selected from a total population of 142 students at a small private primary school in northern Australia. Those children’s understandings of area concepts taught during the primary school years were assessed by their performance in two testing situations. The first consisted of a written classroom test of the ability to solve ‘area’ problems, with items drawn directly from school texts, school examinations and other relevant curriculum documents. The second, which focused more directly on each child’s cognitive development, was an individual interview in which four “area” tasks, such as the Meadows and Farmhouse Experiment taken from Chapter 11 of The Child’s Conception of Geometry (Piaget, Inhelder and Szeminska, 1960, pp. 261-301), were administered. Analysis using the Rasch Partial Credit Model provided a finely detailed quantitative description of the developmental and learning progressions revealed in the data. It is evident that the school mathematics curriculum does not satisfactorily match the learners’ developmental sequence at some key points. Moreover, the children’s ability to conserve area on the Piagetian tasks, rather than other learner characteristics such as age and school grade, seems to be a precursor for complete success on the mathematical test of area. The discussion focuses on the assessment of developmental (and other) characteristics of school-aged learners and suggests how curriculum and school organization might better capitalize on such information in the design and sequencing of learning experiences for school children. Some features unique to the Rasch family of measurement models are held to have special significance in elucidating the development/attainment nexus.

****

Thinking about Thinking—Thinking about Measurement: A Rasch Analysis of Recursive Thinking

Ulrich Müller and Willis F. Overton

Abstract

Two studies were conducted to examine the dimensionality and hierarchical organization of a measure of recursive thinking. In Study 1, Rasch analysis supported the claim that the recursive thinking task measures a single underlying dimension. Item difficulty, however, appeared to be influenced not only by level of embeddedness but also by syntactic features. In Study 2, this hypothesis was tested by adding new items to the recursive thinking measure. Rasch analysis of the modified task produced evidence for unidimensionality and segmentation. However, Study 2 did not support the idea that syntactic features influence item difficulty.

****

Understanding Rasch Measurement: Psychometric Aspects of Item Mapping for Criterion-Referenced Interpretation and Bookmark Standard Setting

Huynh Huynh

Abstract

Locating an item on an achievement continuum (item mapping) is well-established in technical work for educational/psychological assessment. Applications of item mapping may be found in criterion-referenced (CR) testing (or scale anchoring; Beaton and Allen, 1992; Huynh, 1994, 1998a, 2000a, 2000b, 2006), computer-assisted testing, test form assembly, and in standard setting methods based on ordered test booklets. These methods include the bookmark standard setting originally used for the CTB/TerraNova tests (Lewis, Mitzel, Green, and Patz, 1999), the item descriptor process (Ferrara, Perie, and Johnson, 2002), and a similar process described by Wang (2003) for multiple-choice licensure and certification examinations. While item response theory (IRT) models such as the Rasch and two-parameter logistic (2PL) models traditionally place a binary item at its location, Huynh has argued in the cited papers that such mapping may not be appropriate in selecting items for CR interpretation and scale anchoring.

****

 

 

Vol. 11, No. 2 Summer 2010

Using Item Response Modeling Methods to Test Theory Related to Human Performance

Diane D. Allen

Abstract

Testing theories of human performance requires measurement of latent constructs like performer perception and motivation, intrinsic parts of performance, but nebulous to assess. Item response modeling (IRM) methods model latent constructs directly and thus can assist in the development and testing of theory. This article presents an application of IRM methods in initial testing of the Movement Continuum Theory. A construct was derived from the theory, and an instrument was generated to put the construct into operation. Over 300 people of varying movement abilities, aged 18 to 101, completed the 24-item Movement Ability Measure, a self-report questionnaire asking for participants’ perceptions of their current and preferred ability to move. Wright Maps derived from estimated locations of participants and item thresholds showed the strong relationship between the theorized construct and the empirical data. Data gathered with the instrument and analyzed with IRM methods provided mixed support for principles of the theory.

****

From Model to Measurement with Dichotomous Items

Don Burdick, A. Jackson Stenner, and Andrew Kyngdon

Abstract

Psychometric models typically represent encounters between persons and dichotomous items as a random variable with two possible outcomes, one of which can be labeled success. For a given item, the stipulation that each person has a probability of success defines a construct on persons. This model specification defines the construct, but measurement is not yet achieved. The path to measurement must involve replication; unlike coin-tossing, this cannot be attained by repeating the encounter between the same person and the same item. Such replication can only be achieved with more items whose features are included in the model specifications. That is, the model must incorporate multiple items. This chapter examines multi-item model specifications that support the goal of measurement. The objective is to select the model that best facilitates the development of reliable measuring instruments. From this perspective, the Rasch model has important features compared to other models.
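The dichotomous model specification the abstract describes, in which each person-item encounter has a probability of success, is given concrete form by the Rasch model: success probability is a logistic function of the difference between person ability and item difficulty, both in logits. A minimal sketch (the function name and the numbers are illustrative, not from the article):

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """P(success) under the dichotomous Rasch model, with both parameters in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the success probability is exactly 0.5;
# each additional logit of ability multiplies the odds of success by e.
p_equal = rasch_probability(1.0, 1.0)   # 0.5
p_above = rasch_probability(2.0, 1.0)   # about 0.73
```

The replication the authors call for corresponds to evaluating this function across many items: each additional item with a known difficulty adds information about the same person parameter.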

****

Using Guttman’s Mapping Sentences and Many Facet Rasch Measurement Theory to Develop an Instrument that Examines the Grading Philosophies of Teachers

Jennifer Randall and George Engelhard, Jr.

Abstract

This study presents an approach to questionnaire design within educational research based on Guttman’s mapping sentences (Guttman, 1977) and Many-Facet Rasch Measurement Theory (Linacre, 1994). The primary purpose of this study was to illustrate how Guttman’s mapping sentences can be used to develop an instrument that explores the grading philosophies of teachers. A secondary purpose was to clarify teacher grading philosophies (i.e., severity or leniency) as a measurement construct. We designed a 54-item questionnaire in which each item represented a unique combination of student characteristics, i.e., varying levels of classroom achievement, ability, behavior, and effort. The grades assigned by the teachers to the scenarios were analyzed using the FACETS (Linacre, 2007) computer program. The results of the analyses suggest that the grading philosophies of teachers represent a unidimensional construct which is influenced, to varying extents, by the classroom achievement (primarily), behavior, and effort of students; whereas the measurement value added by the inclusion of the ability facet is uncertain.

****

Development of Scales Relating to Professional Development of Community College Administrators

Edward W. Wolfe and Kim E. Van Der Linden

Abstract

This article reports the results of an application of the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to the measurement of professional development activities in which community college administrators participate. The analyses focus on confirmation of the factorial structure of the instrument, evaluation of the quality of the activity calibrations, examination of the internal structure of the instrument, and comparison of groups of administrators. The dimensionality analysis results suggest a five-dimensional model that is consistent with previous literature concerning career paths of community college administrators: education and specialized training, internal professional development and mentoring, external professional development, employer support, and seniority. The indicators of the quality of the activity calibrations suggest that measures of the five dimensions are adequately reliable, that the activities in each dimension are internally consistent, and that the observed responses to each activity are consistent with the expected values of the MRCMLM. The hierarchy of administrator measure means and of activity calibrations is consistent with substantive theory relating to professional development for community college administrators. For example, readily available activities that occur at the institution were those administrators were most likely to engage in, while participation in selective specialized training institutes was least likely. Finally, group differences with respect to age and title were consistent with substantive expectations: the greater the administrator’s age and the higher the rank of the administrator’s title, the greater the probability of having engaged in various types of professional development.

****

Comparing Décalage and Development with Cognitive Developmental Tests

Trevor Bond

Abstract

The use of Rasch measurement techniques with data from developmental psychology has provided important insights into human development (e.g., Bond, 1997, 2003; Dawson, 2002a, b). In particular, Rasch methods support investigations into what have been, until now, intractable theoretical and empirical problems. Research into the development of formal operational thinking using the Rasch model (Bond, 1995a, b; Bond and Bunting, 1995; Bond and Fox, 2001) substantiates important aspects of the original theorizing of Piaget (Inhelder and Piaget, 1955/1958), which was based wholly on qualitative structural analyses of children’s problem-solving responses. Common-person equating of student performances has been used across different formal operational thinking tasks to estimate the relative difficulties of tasks measuring the same underlying developmental construct (Bond, 1995b; Bond and Fox, 2001). Repeated person performance measures on the same task have been used to estimate cognitive development over time. Rasch measurement estimates of cognitive development do not exceed 0.5 logits per annum (Bond, 1996; Endler, 1998; Stanbridge, 2001), a result that has been estimated independently in two large research projects in the United Kingdom (Shayer, 1999) and in Papua New Guinea (Lake, 1996). Interestingly, difficulty differences (décalage) between tests of formal thought are as large as 2.0 logits (Bond, 1995a; Bond, 1996; Bond and Fox, 2001), confounding attempts to differentiate development from décalage. Given the problems and possibilities raised by the Rasch measurement quantification of cognitive development, this article canvasses the promise of using Rasch modelling techniques to investigate systematically these fundamental aspects of human cognitive performance.

****

Reliability of Performance Examinations: Revisited

Mary E. Lunz and John M. Linacre

Abstract

This article discusses the reliability for performance examinations with respect to the reproducibility of candidate pass-fail decisions. The multi-facet Rasch model accounts for the difficulty of test forms (examiners + tasks + items) taken by each candidate, so that all candidates are measured against the same criterion. The examiners provide analytic ratings of candidate performance using a defined rating scale. The examination is standardized so that sufficient information is collected about the candidates, and facets of the examination are linked. When the examination data are properly collected, there are high levels of confidence in the reproducibility of candidate outcomes for a high percentage of candidates.

****

Understanding Rasch Measurement: Equating Designs and Procedures Used in Rasch Scaling

Gary Skaggs and Edward W. Wolfe

Abstract

The development of alternate forms of tests requires a statistical score adjustment called equating that permits the interchanging of scores from different test forms. Equating makes possible several important measurement applications, including removing practice effects in pretest-posttest research designs, improving test security, comparing scores between new and old forms, and supporting item bank development for computerized adaptive testing. This article summarizes equating methods from a Rasch measurement perspective. The four sections of this article present an introduction and definition of equating and related linking methods, data collection designs, equating procedures, and evaluating equated measures. The methods are illustrated with worked examples.
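One standard Rasch equating procedure of the kind the article covers is common-item linking: because Rasch difficulties are on an interval (logit) scale, the shift between two forms can be estimated as the mean difficulty difference on their shared items. A minimal sketch, with invented item names and values:

```python
# Item difficulties (logits) estimated separately on two overlapping forms.
form_a = {"item1": -0.8, "item2": 0.2, "item3": 1.1}
form_b = {"item2": -0.3, "item3": 0.6, "item9": 1.5}

# Linking constant: mean difficulty difference across the common items.
common = sorted(form_a.keys() & form_b.keys())
link = sum(form_a[i] - form_b[i] for i in common) / len(common)

# Shift every Form B difficulty onto Form A's scale.
form_b_equated = {item: d + link for item, d in form_b.items()}
```

After the shift, the common items agree across forms and the unique Form B items (here, the hypothetical "item9") are expressed in Form A logits, so person measures from either form are directly comparable.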

****

 

 

Vol. 11, No. 3 Fall 2010

Foreword, Emergence of Efficiency in Health Outcome Measurement

Nikolaus Bezruczko, Guest Editor

Abstract

Psychosocial measurement in the 21st century is a dynamic field that is addressing challenges unthinkable even a generation ago. Sophisticated methods and modern technology have brought psychometrics to the cusp of scientific objectivity. This Foreword provides historical context and intellectual foundations for appreciating contemporary psychometric advancements, as well as a perspective on the issues that will determine future advances. Efficiency in outcome measurement is one of the forces driving those advances. Efficiency, however, can easily become conflated with expediency, and neither can substitute for effectiveness. Blind efficiency runs the risk of degrading measurement properties. Likewise, measurement advancement without accommodation to ordinary needs leads to practical rejection. Bouchard presents a biographical link between scientific physics and Rasch models that opened the door for fundamental psychosocial measurement. The symposium papers in this issue present a broad range of ideas about contemporary psychosocial measurement. Granger summarizes key ideas underlying the achievement of objective, fundamental measurement. Massof, then Stenner and Stone, present alternative perspectives on scientific knowledge systems, which are prominent landmarks on the psychometric horizon. Fisher and Burton describe fundamental measurement methodology in diagnosis and implementation of technology, which will consolidate isolated and redundant constructs in PROMIS. Hart presents an overview of computer adaptive testing, which is the vanguard of health outcome measurement. Kisala and Tulsky present a qualitative strategy that is improving the sensitivity and validity of new outcome measures. The diversity of these papers reflects an intense competition of ideas about solving measurement problems; their collection in this special issue is a milestone and a tribute to scientific ingenuity.

****

Measuring One Variable at a Time: The Wright Way

Ed Bouchard

Abstract

This article, based on research for Ben Wright’s biography, explores some influences leading to his profound commitment to useful and accurate measures. We briefly touch on his decision to leave a promising career in physics for education and psychometrics, stemming from his belief that understanding how children learn is even more important than understanding molecular structure. Along the way, we focus on a debate over measurement models that Wright began with Fred Lord at an ETS Invitational Conference in 1967.

****

Rasch-Derived Latent Trait Measurement of Outcomes: Insightful Use Leads to Precision Case Management and Evidence-Based Practices in Functional Healthcare

Carl V. Granger, Marsha Carlin, John M. Linacre, Ronald Mead, Paulette Niewczyk, A. Jackson Stenner, and Luigi Tesio

Abstract

Outcome measurement evolved from other fields, particularly education. Person-metrics is the measurement of how much chronic disease and disablement affect an individual’s daily activities physically, cognitively, and through vocational and social role participation. The Rasch model’s assumption that the probability of a given person/item interaction is governed by the difficulty of the item and the ability of the person is invaluable to disability measurement. The difference between raw scores and true measures is illustrated by the example of a patient whose physical difficulty is rated on rising from a wheelchair and walking 100 m (known to be more difficult), and then walking an additional 200 m. Though number ratings of 0-1-2 are assigned to these tasks, they are not equidistant, and only a true measure shows the actual levels of physical difficulty.

****

Generally Objective Measurement of Human Temperature and Reading Ability: Some Corollaries

A. Jackson Stenner and Mark Stone

Abstract

We argue that a goal of measurement is general objectivity: point estimates of a person’s measure (height, temperature, and reader ability) should be independent of the instrument and independent of the sample in which the person happens to find herself. In contrast, Rasch’s concept of specific objectivity requires only differences (i.e., comparisons) between person measures to be independent of the instrument. We present a canonical case in which there is no overlap between instruments and persons: each person is measured by a unique instrument. We then show what is required to estimate measures in this degenerate case. The canonical case encourages a simplification and reconceptualization of validity and reliability. Not surprisingly, this reconceptualization looks a lot like the way physicists and chemometricians think about validity and measurement error. We animate this presentation with a technology that blurs the distinction between instruction, assessment, and generally objective measurement of reader ability. We encourage adaptation of this model to health outcomes measurement.

****

A Clinically Meaningful Theory of Outcome Measures in Rehabilitation Medicine

Robert W. Massof

Abstract

Comparative effectiveness research in rehabilitation medicine requires the development and validation of clinically meaningful and scientifically rigorous measurements of patient states and theories that explain and predict outcomes of intervention. Patient traits are latent (unobservable) variables that can be measured only by inference from observations of surrogate manifest (observable) variables. In the behavioral sciences, latent variables are analogous to intensive physical variables such as temperature and manifest variables are analogous to extensive physical variables such as distance. Although only one variable at a time can be measured, the variable can have a multidimensional structure that must be understood in order to explain disagreements among different measures of the same variable. The use of Rasch theory to measure latent trait variables can be illustrated with a balance scale metaphor that has randomly added variability in the weights of the objects being measured. Knowledge of the distribution of the randomly added variability provides the theoretical structure for estimating measures from ordinal observation scores (e.g., performance measures or rating scales) using statistical inference. In rehabilitation medicine, the latent variable of primary interest is the patient’s functional ability. Functional ability can be estimated from observations of surrogate performance measures (e.g., speed and accuracy) or self-report of the difficulty the patient experiences performing specific activities. A theoretical framework borrowed from project management, called the Activity Breakdown Structure (ABS), guides the choice of activities for assessment, based on the patient’s value judgments, to make the observations clinically meaningful. In the case of low vision, the functional ability measure estimated from Rasch analysis of activity difficulty ratings was discovered to be a two-dimensional variable. 
The two visual function dimensions are independent of physical limitations and psychological state. To explain outcome measures (latent variable estimated from difficulty ratings), a theory must be developed that explicitly defines how latent variables are related to the observed manifest variables and to each other. The latent variables are categorized as primary variables, which in the case of low vision are the two visual function dimensions, and as effect modifiers, which in the case of low vision are other physical and psychological latent traits of the patients that can influence the outcome measures. Interventions give rise to latent intervention effect variables that can alter the latent primary variables or independently affect the outcome measures. The latent effect modifier variables, in turn, can alter the latent intervention effect variables. Once developed and validated, a theory of this form will predict the rehabilitation potential of individual patients, i.e., the probability of obtaining criterion outcome measures given the observed state of the patient and the choice of interventions.

****

Embedding Measurement within Existing Computerized Data Systems: Scaling Clinical Laboratory and Medical Records Heart Failure Data to Predict ICU Admission

William P. Fisher, Jr. and Elizabeth C. Burton

Abstract

This study employs existing data sources to develop a new measure of intensive care unit (ICU) admission risk for heart failure patients. Outcome measures were constructed from laboratory, accounting, and medical record data for 973 adult inpatients with primary or secondary heart failure. Several scoring interpretations of the laboratory indicators were evaluated relative to their measurement and predictive properties. Cases were restricted to those whose first lab draw included at least 15 indicators. After optimizing the original clinical observations, a satisfactory heart failure severity scale was calibrated on a 0-1000 continuum. Patients with unadjusted heart failure severity measures of 550 or less were 2.7 times more likely to be admitted to the ICU than those with higher measures; adjusted for demographic and diagnostic risk factors, patients with low severity measures (<550) were about six times more likely to be admitted than those with higher adjusted measures. A nomogram facilitates routine clinical application. Existing computerized data systems could be programmed to automatically structure clinical laboratory reports using the results of studies like this one to reduce data volume with no loss of information, make laboratory results more meaningful to clinical end users, improve the quality of care, reduce errors and unneeded tests, prevent unnecessary ICU admissions, lower costs, and improve patient satisfaction. Existing data, typically examined piecemeal, form a coherent scale measuring heart failure severity that is sensitive to increased likelihood of ICU admission. Marked improvements in ROC curves were found for the aggregate measures relative to individual clinical indicators.

****

Implementing Computerized Adaptive Tests in Routine Clinical Practice: Experience Implementing CATs

Dennis L. Hart, Daniel Deutscher, Mark W. Werneke, Judy Holder, and Ying-Chih Wang

Abstract

This paper traces the development, testing and use of CATs in outpatient rehabilitation from the perspective of one proprietary international medical rehabilitation database management company, Focus On Therapeutic Outcomes, Inc. (FOTO). Between the FOTO data in the United States and Maccabi Healthcare Services data in Israel, over 1.5 million CATs have been administered. Using findings from published studies and results of internal public relations surveys, we discuss (1) reasons for CAT development, (2) how the CATs were received by clinicians and patients in the United States and Israel, (3) results of psychometric property assessments of CAT estimated measures of functional status in routine clinical practice, (4) clinical interpretation of CAT functional status measures, and (5) future development directions. Results of scientific studies and business history provide confidence that CATs are pertinent and valuable to clinicians, patients and payers, and suggest CATs will be prominent in the development of future integrated computerized electronic medical record systems with electronic outcomes data collection.

****

The Use of PROMIS and Assessment Center to Deliver Patient-Reported Outcome Measures in Clinical Research

Richard C. Gershon, Nan Rothrock, Rachel Hanrahan, Michael Bass, and David Cella

Abstract

The Patient-Reported Outcomes Measurement Information System (PROMIS) was developed as one of the first projects funded by the NIH Roadmap for Medical Research Initiative to re-engineer the clinical research enterprise. The primary goal of PROMIS is to build item banks and short forms that measure key health outcome domains manifested in a variety of chronic diseases and that could be used as a “common currency” across research projects. To date, item banks, short forms and computerized adaptive tests (CAT) have been developed for 13 domains with relevance to pediatric and adult subjects. To enable easy delivery of these new instruments, PROMIS built a web-based resource (Assessment Center) for administering CATs and other self-report data, tracking item and instrument development, monitoring accrual, managing data, and storing statistical analysis results. Assessment Center can also be used to deliver custom, researcher-developed content, and has numerous features that support both simple and complicated accrual designs (branching, multiple arms, multiple time points, etc.). This paper provides an overview of the development of the PROMIS item banks and details Assessment Center functionality.

****

Opportunities for CAT Applications in Medical Rehabilitation: Development of Targeted Item Banks

Pamela A. Kisala and David S. Tulsky

Abstract

Researchers in the field of rehabilitation medicine have increasingly turned to qualitative data collection methods to better understand the experience of living with a disability. In rehabilitation psychology, these techniques are embodied by participatory action research (PAR; Hall 1981; White, Suchowierska, and Campbell 2004), whereby researchers garner qualitative feedback from key stakeholders such as patients and physicians. Glaser and Strauss (1967) and, later, Strauss and Corbin (1998) have outlined a systematic method of gathering and analyzing qualitative data to ensure that results are conceptually grounded to the population of interest. This type of analysis yields a set of interrelated concepts (“codes”) to describe the phenomenon of interest. Using this data, however, becomes somewhat of a methodological problem. While this data is often used to describe phenomena of interest, it is challenging to transform the knowledge gained into practical data to inform research and clinical practice. In the case of developing patient-reported outcomes (PRO) measures for use in a rehabilitation population, it is difficult to make sense of the qualitative analysis results. Qualitative feedback tends to be open-ended and free-flowing, not conforming to any traditional data analysis methodology. Researchers involved in measure development need a practical way to quantify the qualitative feedback. This manuscript will focus on a detailed methodology of empiricizing qualitative data for practical application, in the context of developing targeted, rehabilitation-specific PRO measures within a larger, more generic PRO measurement system.

****

Postscript, Emergence of Efficiency in Health Outcomes Measurement

Karon F. Cook

Abstract

The purpose of this postscript is to comment on psychometric issues raised by the collective articles of this special issue of the Journal of Applied Measurement. Topics discussed include the need to engage relevant literature in the psychometric and larger academic communities, the role of measurement models in psychometrics, and challenges to efficient measurement of patient reported outcomes. Finally, I argue that psychometrics should play “second fiddle” to larger scientific questions.

****

Vol. 11, No. 4 Winter 2010

Rasch Model’s Contribution to the Study of Items and Item Response Scales Formulation in Opinion/Perception Questionnaires

Jean-Guy Blais, Julie Grondin, Nathalie Loye, and Gilles Raîche

Abstract

Questionnaire-based inquiries make it possible to obtain data rather quickly and at relatively low cost, but a number of factors may influence respondents’ answers and affect the data’s validity. Some of these factors are related to the individuals and the environment, while others are directly related to the characteristics of the questionnaire and its items: the text introducing the questionnaire, the order in which the items are presented, the number of response categories and their labels on the proposed scale, and the wording of the items. The focus of this article is on this last point, and its goal is to show how the diagnostic tools developed around Rasch modelling can be used to study the impact of item wording in opinion/perception questionnaires on the responses obtained and on the location of anchor points of the item response scale.

****

Estimating Tests Including Subtests

Steffen Brandt

Abstract

Current assessment studies often face a dilemma that arises from the necessity of a simultaneous unidimensional and multidimensional interpretation of a test. When, in addition to the measurement of a single domain, subdomains of this domain are to be measured, one and the same data set has to be analyzed unidimensionally and multidimensionally at the same time, even though this contradicts the theoretical assumptions underlying the analysis. This article first describes the psychometric deficiencies typically associated with current analysis approaches. Subsequently, a subdimension model is proposed that explicitly allows for the existence of subdomains, or subdimensions, in a measured domain. The model thereby provides a means of obtaining calibration results that suffer less from the deficiencies described, and also allows for an item selection in the test development process that takes the multidimensional structure of the test into account.

****

Measure for Measure: Curriculum Requirements and Children’s Achievement in Music Education

Trevor Bond and Marie Bond

Abstract

Children in all public primary schools in Queensland, Australia have weekly music lessons designed to develop key musical concepts such as reading, writing, singing, and playing simple music notation. Their understanding of basic musical concepts is developed through a blend of kinaesthetic, visual, and auditory experiences. In keeping with the pedagogical principles outlined by the Hungarian composer Zoltán Kodály, early musical experiences are based on singing well-known children’s chants—usually restricted to notes of the pentatonic scale. In order to determine the extent to which primary school children’s musical understandings developed in response to these carefully structured developmental learning experiences, the Queensland Primary Music Curriculum was examined to yield a set of over 70 indicators of musical understanding in the areas of rhythm, melody, and part-work—the essential skills for choral singing. Data were collected from more than 400 children’s attempts at elicited musical performances. Quantitative data analysis procedures derived from the Rasch model for measurement were used to establish the sequence of children’s mastery of key musical concepts. Results suggested that while the music curriculum did reflect the general development of musical concepts, the grade allocation for a few concepts needed to be revised. Subsequently, children’s performances over several years were also analysed to track the musical achievements of students over time. The empirical evidence confirmed that children’s musical development was enhanced by school learning and that the indicators can be used to identify both outstanding and atypical development of musical understanding. It was concluded that modest adjustments to the music curriculum might enhance children’s learning opportunities in music.

****

On the Factor Structure of Standardized Educational Achievement Tests

Tim W. Gaffney, Robert Cudeck, Emilio Ferrer, and Keith F. Widaman

Abstract

This research analyzed the factor structure at both the item- and subtest-level of California’s norm- and criterion-referenced standardized educational achievement tests (SEAT) used in that state’s high-stakes educational accountability assessments. It was shown through full information factor analysis and multidimensional IRT models (e.g., TESTFACT and NOHARM) that, at the item-level, SEATs are invariably highly unidimensional (i.e., they appear to tap a unidimensional theta scale) even when items representing content areas as diverse as English, science, mathematics, and history are analyzed simultaneously as a single measure. These item-level factors also accounted for a relatively small proportion (1/4 to 1/3) of the variance. It was also shown that, when these tests are analyzed using more reliable indicators such as subtests, a much richer factor structure emerged that accounted for a larger portion (about 2/3) of the total common variance. As expected, these factor structure configurations (and underlying dimensionality) were preserved across the item- and subtest-levels. However, the factors emerging from both the item- and subtest-level analyses were highly correlated and produced strong second-order and general factors. The meaning underlying these results was examined, along with their implications with respect to the assumptions underlying modern approaches to test calibration, scaling, and score interpretation.

****

The Practical Application of Optimal Appropriateness Measurement on Empirical Data using Rasch Models

Iasonas Lamprianou

Abstract

Optimal Appropriateness Measurement (OAM) is a general statistical method for the identification of examinees whose test scores might not be a valid indicator of their true latent ability or trait. The method is statistically very powerful, and it points toward the direction of the suspected aberrance rather than simply identifying that a specific response pattern is, in some way, aberrant. The method has traditionally been used with multiparameter item response models for the identification of examinees with spuriously low and high scores. This article presents the practical application of the method, using Rasch models, in the context of a large-scale activity that aimed to provide secondary education schools with feedback about their students’ performance on a high-stakes university entrance science test. Although researchers in the past claimed that OAM was not ready to be routinely used in practical settings, this article maintains that the practical use of OAM to answer specific educationally meaningful questions is feasible.

****

Features of the Sampling Distribution of the Ability Estimate in Computerized Adaptive Testing According to Two Stopping Rules

Jean-Guy Blais and Gilles Raîche

Abstract

Whether paper-and-pencil or computerized adaptive, tests are usually described by a set of rules managing how they are administered: which item comes first, which should follow any given item, and when to administer the last one. This article focuses on the latter and looks at the effect of two stopping rules on the estimated sampling distribution of the ability estimate in a CAT: the number of items administered and the a priori determined size of the standard error of the ability estimate.

****

Understanding Rasch Measurement: Developing Examinations that use Equal Raw Scores for Cut Scores

Andrew Swanlund and Everett Smith

Abstract

This study describes and demonstrates a set of processes for developing new forms of examinations that are intended to have equivalent cut scores in the raw score metric. This approach goes beyond the traditional Rasch-based approach, which develops forms with cut scores equated in the logit metric. The methods described in this study can be used to create multiple forms of an assessment, all of which have the same raw score cut score (i.e., the number correct required to pass each examination form represents the same amount of the underlying construct). This paper provides an overview of equating standards, the research related specifically to pre-equating procedures, and three guidelines that can be used to achieve equal raw score cut scores. Three examples of how to use the guidelines as part of an iterative form-development process are provided using simulated data sets.

****
