AUTHOR: Knoch, Ute TITLE: Diagnostic Writing Assessment SUBTITLE: The Development and Validation of a Rating Scale SERIES TITLE: Language Testing and Evaluation 17 PUBLISHER: Peter Lang AG YEAR: 2010
Mark Brenchley, PhD Candidate, Graduate School of Education, University of Exeter
SUMMARY
'Diagnostic Writing Assessment' details a recent study aiming to ''develop a theoretically-based and empirically-developed rating scale'' suitable for diagnostic contexts (p. 5). To this end, Knoch has conducted an empirical comparison of two trait scales. The first, that of the Diagnostic English Language Needs Assessment (DELNA), pre-dates the study; the second was purpose-built and tested as part of the study. To address the overall research problem, three specific questions were posed and addressed: 1) What discourse analytic measures successfully distinguish between writing samples at different DELNA levels? 2) Along what axes do the ratings produced using the two rating scales differ? 3) How do raters perceive the two scales? The first chapter provides a brief overview of the study, in which Knoch draws attention to the lack of clarity regarding ''how direct diagnostic tests of writing should differ from proficiency or placement tests'' (p. 13), a situation particularly true of rating scale design. It is to this latter issue that the study is addressed. Knoch investigates whether an empirically developed scale would be ''more valid for diagnostic writing assessment'' than an intuitively developed scale of the kind typified by DELNA (p. 15). In chapter 2, Knoch situates ''diagnostic assessment within the literature on performance assessment of writing'' (p. 35). Diagnostic assessment is described and distinguished through both content and purpose, with reference to features identified by Alderson (2005). Diagnostic tests, for example, should be expected to ''identify strengths and weaknesses'' (p. 21) and to provide ''a detailed analysis and report of responses to items or tasks'' (p. 21). Chapter 3 provides a definition of rating scales and focuses on issues relating to their design, with Knoch discussing how such scales relate to diagnostic contexts.
She argues that many current rating scales are significantly flawed, often having been developed, for example, on the basis of a single theory of writing development. She considers this a particular problem, since our understanding of writing is ''not sufficiently developed to base a writing scale just on one theory'' (p. 68). In chapter 4, Knoch synthesises a taxonomy of linguistic constructs from the various theories and models of writing development discussed in the previous chapter, arguing that such a taxonomy provides ''the most comprehensive description of our current knowledge about writing development'' (p. 71). Eight potentially relevant writing constructs are identified, including Accuracy and Cohesion (p. 75), and used to evaluate the current DELNA scale. She then further analyses the available literature in order to determine which specific measures have successfully identified writing development within these constructs, and which measures might be suitably operationalised for the pilot study of phase one. Chapter 5 outlines phase one. For the pilot study, Knoch catalogued 15 writing scripts randomly selected from the University of Auckland's 2004 DELNA administration, coding them according to the specific linguistic measures identified in chapter 4. From these scripts, Knoch determined a subset of measures that successfully distinguished between different DELNA levels. This subset then served as the basis for the main study, in which 601 randomly selected DELNA scripts were analysed according to these measures. Chapter 6 presents the analysis of the main study from phase one. Of the 26 linguistic measures identified by Knoch through the pilot study, the main study identified 17 that successfully differentiated between the various levels of ability according to the original DELNA levels. These included the percentage of error-free t-units, the number of hedges, and the number of propositions (p. 168).
Chapter 7 summarises the results presented in chapter 6, which are used to devise a new rating scale for investigation during phase two. Knoch discusses the success of each specific measure in turn, subsequently outlining a fresh trait scale for that measure. She argues that the new trait scale offers more explicit descriptors than the original DELNA scale, often stipulating fairly precise quantitative measures (e.g. ''11-15 self-corrections'' (p. 172)). The new scale is, therefore, arguably more objective and less open to rater subjectivity. Chapter 8 presents the methodology for the empirical study comprising phase two. Ten current DELNA raters were asked to use both the DELNA scale and the new scale to rate 100 randomly selected DELNA scripts. Their ratings were subjected to a Rasch analysis and further analysed according to five hypotheses designed to evaluate the scales' relative merits. Finally, the raters were interviewed and asked to fill in a questionnaire about their experiences, affording direct feedback on the raters' own evaluations of the two scales. Chapter 9 presents the results from phase two and addresses the two research questions framing this phase. The individual trait ratings from both scales are directly compared so as to determine their relative merits according to the five hypotheses outlined in chapter 8; the scales are then further compared overall. In terms of the individual traits, Knoch finds the new rating scale to be generally superior, noting in particular that the new scale resulted in a reduced halo effect and greater success in identifying learners' strengths and weaknesses. She also notes, however, that, analysed as a whole, ''the existing scale resulted in a higher candidate discrimination'' (p. 229), and attributes this to the new scale ''assessing different information…not measured by the existing scale'' (p. 230).
Finally, Knoch presents and discusses the results from the rater interviews, demonstrating a general preference for the new scale. Chapter 10 discusses the results of phase two. The two scales are compared according to various aspects relevant to rating scale validity, with these aspects defined in terms of ten distinct warrants. She finds the new scale to be clearly more valid on those warrants covering a scale's Construct Validity (the extent to which the test actually assesses what it is meant to assess), Reliability (how consistently the same script is rated by different raters), and Authenticity (how well the test scores generalise to actual language use). The DELNA scale, on the other hand, is deemed to have greater validity on the two warrants covering scale Practicality (how easy a scale is to operationalise). In chapter 11, Knoch summarises the findings of the study and discusses its overall implications in both theoretical and practical terms. She argues that the new scale is ''more suitable in a diagnostic context'' (p. 298), presents a model of performance assessment, and argues for the need to distinguish between analytic scales that have been intuitively developed and those that have been developed empirically.
EVALUATION
A more extensive account of research recently published as Knoch (2009), 'Diagnostic Writing Assessment' represents a constructive contribution to the literature on diagnostic language assessment. The study as a whole is well-conceived, planned, and executed. Each stage is thoughtfully conducted so as to serve as the empirical foundation for the succeeding stage, and the results are carefully analysed and presented in a form which makes them easy to engage with. Knoch is, further, aware of the inherent difficulties of rating scale research and displays a clear understanding of the limits of her study. Finally, the study's conclusions, primarily that an empirically-developed scale with more explicit descriptors is more appropriate for diagnostic purposes since it more reliably isolates distinct aspects of learner proficiency, are measured, plausible and supported by the empirical evidence as presented. A particular virtue of Knoch's study is the explicitness of the construction process, allowing for a clearer understanding of the basis of the resultant scale. Given the fundamental role rating scales play in operationalising the relevant linguistic constructs and evaluating test-taker proficiency, such explicitness is vital. Yet, as Knoch herself notes, and despite this state of affairs having been noted as far back as Brindley (1998), ''there is surprisingly little information on how commonly used rating scales are constructed'' (p. 42). Hence, it is often difficult to evaluate the rationale and validity of assessment scales, substantially hampering an understanding of the nature of these scales in general. Not so here. Indeed, if productive research in this area is to take place, then assessment scales need to be presented and investigated in much the same manner as Knoch has done - openly, explicitly, and methodically. A further virtue is Knoch's comparative approach to rating scale validation.
Rather than evaluating a single scale in isolation, Knoch's study ascertains the respective worth of two scales, using each to illuminate the validity of the other. This is an approach that yields some interesting results. Thus, Knoch notes that while ''most individual trait scales on the new scale were more discriminating…as a whole, the existing scale was more discriminating'' (p. 222). This discrepancy prompted a Principal Factor Analysis of the two scales, which led her to conclude that the new scale accounts ''for not only more aspects of writing ability, but also for a larger amount of variation of the scores'' (p. 228). Significantly, this is a conclusion prompted by the contrastive nature of the study, a fact which marks such an approach out as a fruitful avenue for further research into assessment scales. A final desirable quality of Knoch's study is her synthetic approach to language assessment. Firstly, the study is firmly contextualised within the framework of the current literature. This is perhaps to be expected given that the study was undertaken for doctoral purposes. Nevertheless, it ensures the study is firmly and properly grounded, drawing on a broad basis of theoretical and empirical writing. Secondly, and more interestingly, Knoch synthesises her linguistic constructs from a range of available models of writing proficiency. This is significant since, as she herself rightly notes, ''no adequate model or theory of writing or writing proficiency is currently available'' (p. 104), a situation which calls into question the validity of any rating scale based on only one such model. Knoch's response circumvents this difficulty, resulting in a construct taxonomy which has broad theoretical support yet which is not tied to any one particular theory per se.
Consequently, her study can be pursued as a more open and empirically-driven investigation of rating scale construction and validity, one which cuts across the various models as well as having the potential to productively feed back into them. Her results demonstrate this to be a promising approach, one which would allow rating scale research to develop with its own measure of independence and integrity. 'Diagnostic Writing Assessment' is not without flaw, of course, and there are several features worth drawing attention to. Throughout the study, for example, Knoch is careful to control for possible confounding variables, something she makes clear herself (p. 186). This care did not, however, extend to rater selection: all of the raters were drawn from a pool of current DELNA raters (p. 185). Consequently, although the raters received training on the newly-devised scale, they would have been substantially more familiar with the DELNA version, a factor that could have had a significant effect on the rating outcomes. It would perhaps have been preferable, therefore, to have selected raters who were equally inexperienced on both scales, though this may have been unavoidable given the inevitable labour constraints of a PhD study. Further, Knoch makes clear that the goal of the study is to devise and investigate an ''empirically developed rating scale'' (p. 15). To her credit, she generally succeeds in this, with each component empirically constructed and researched before feeding into the next. It is regrettable, therefore, that, following the results of the pilot study of phase one, Knoch did not carry a greater number of measures forward into the main study. So, for example, although ''error-free t-units, error-free clauses and errors/clause'' (p. 113) were all found to distinguish successfully between different levels, only the ''percentage of error-free t-units was selected for the second phase of this study'' (p. 113).
Knoch's choices are, for the most part, not without reason; this particular measure was selected, for example, because it ''might be the easiest for the raters to apply and is unaffected by the length of the script'' (p. 113). Nevertheless, since the study is decidedly empirical in intent, it would have made more sense to take on all the measures empirically identified by phase one. These could then have been further investigated during the two main studies to see how they actually affected the raters and rating scores, rather than being eliminated a priori. Finally, though Knoch's analysis is generally sound, there are a couple of points worth raising regarding the statistical methodology of the study. The first is relatively minor: no breakdown is provided according to the native:non-native (47%:53%) profiles of the study cohort. These are groups likely to display different proficiency characteristics and needs, something particularly significant for a diagnostically-oriented assessment scale. Hence, it would have been relevant to explore the extent to which these groups were differently handled by the two scales. It is true, as Knoch notes, that ''it is very difficult to establish the language background of students'' (p. 295); nevertheless, even a brief exploration would have provided an interesting further dimension for comparing the two scales. The second point is more substantial and concerns the fact that the pilot study bases its conclusions on an analysis of only 15 writing scripts. This is quite a small sample, one for which ''no inferential statistics were calculated and the data was not double coded'' (p. 112). This sample size makes her use of means questionable, since it renders the mean vulnerable to outlier scores. It also often results in mean scores that are distinct but fairly close together (as in the case of 'grammatical complexity' (p. 115)) and in standard deviation scores that overlap (sometimes significantly, as in the case of 'number of words' (p. 115)). As a result, there is a residual uncertainty as to the accuracy of the selected measures, reinforcing the point made above about carrying all of the successful measures forward. That this may indeed have been a significant factor is suggested by the fact that the pilot study's identification of clauses per t-unit as a successful measure was not replicated in the main study (p. 173). Consequently, it would have been helpful either to have utilised a larger sample or to have included the individual scores alongside the means so as to present a more detailed picture of the data; both would have improved the general empirical rigour of the study. Nevertheless, it is to Knoch's credit that the above criticisms are only available precisely because of the study's explicitness. It is also worth remembering that 'Diagnostic Writing Assessment' is a PhD study and as such is inevitably bound by all the labour constraints such a study entails. Indeed, in this context, Knoch's work is particularly impressive, the end product being a mature piece of empirical research that extends our current understanding of diagnostic rating scale design, raises relevant and important issues, and serves as a useful staging post for future research in this area.

REFERENCES
Alderson, J. C. (2005) Diagnosing Foreign Language Proficiency: The Interface Between Learning and Assessment. London: Continuum.
Brindley, G. (1998) Describing Language Development? Rating Scales and Second Language Acquisition. In Bachman, L. F. and Cohen, A. D. (eds.), Interfaces Between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press. pp. 112-140.
Knoch, U. (2009) Diagnostic Assessment of Writing: A Comparison of Two Rating Scales. Language Testing 26(2), pp. 275-304.