Language Learning & Technology
June 2008, Volume 12, Number 2
BEYOND THE DESIGN OF AUTOMATED WRITING EVALUATION: PEDAGOGICAL PRACTICES AND PERCEIVED LEARNING EFFECTIVENESS IN EFL WRITING CLASSES
Chi-Fen Emily Chen and Wei-Yuan Eugene Cheng
National Kaohsiung First University of Science and Technology, Taiwan
Automated writing evaluation (AWE) software is designed to provide instant computer-generated scores for a submitted essay along with diagnostic feedback. Most studies on AWE have been psychometric evaluations of its validity; studies on how effectively AWE is used in writing classes as a pedagogical tool, however, remain limited. This study employs a naturalistic classroom-based approach to explore the interaction between how an AWE program, MY Access!, was implemented in three different ways in three EFL college writing classes in Taiwan and how students perceived its effectiveness in improving writing. The findings show that, although the implementation of AWE was in general not perceived very positively by the three classes, it was perceived comparatively more favorably when the program was used to facilitate students’ early drafting and revising process, followed by human feedback from both the teacher and peers during the later process. This study also reveals that the autonomous use of AWE as a surrogate writing coach, with minimal human facilitation, frustrated students and limited their learning of writing. In addition, teachers’ attitudes toward AWE use and their technology-use skills, as well as students’ learner characteristics and goals for learning to write, may also play vital roles in determining the effectiveness of AWE. Given the limitations inherent in the design of AWE technology, language teachers need to be critically aware that implementing AWE requires well-thought-out pedagogical designs and thorough consideration of its relevance to the objectives of learning to write.
Automated writing evaluation (AWE), also referred to as automated essay scoring (AES)1, is not a brand-new technology in the twenty-first century; rather, it has been under development since the 1960s. This technology was originally designed to reduce the heavy load of grading a large number of student essays and to save time in the grading process. Early AWE programs, such as Project Essay Grade, employed simple style analyses of surface linguistic features of a text to evaluate writing quality (Page, 2003). Since the mid-1990s, the design of AWE programs has been improving rapidly due to advances in artificial intelligence technology, in particular natural language processing and intelligent language tutoring systems. Newly developed AWE programs, such as Criterion with the essay scoring engine "e-rater" by Educational Testing Service and MY Access! with the essay scoring engine "IntelliMetric" by Vantage Learning, boast the ability to conduct more sophisticated analyses, including lexical complexity, syntactic variety, discourse structures, grammatical usage, word choice, and content development. They provide immediate scores along with diagnostic feedback on various aspects of writing and can be used for both formative and summative assessment purposes. In addition, a number of AWE programs are now web-based and equipped with a variety of online writing resources (e.g., thesauri and word banks) and editing features (e.g., grammar, spelling, and style checkers), which make them not only an essay assessment tool but also a writing assistance tool. Students can make use of both AWE’s assessment and assistance functions to help them write and revise their essays in a self-regulated learning environment.
Although AWE developers claim that their programs are able to assess and respond to student writing as human readers do (e.g., Attali & Burstein, 2006; Vantage Learning, 2007), critics of AWE express strong skepticism. Voices from the academic community presented in Ericsson and Haswell’s (2006) anthology, for example, question the truth of the industry’s publicity for AWE products and the consequences of the implementation of AWE in writing classes. They distrust the ability of computers to "read" texts and evaluate the quality of writing because computers are unable to understand meaning in the way humans do. They also doubt the value of writing to a machine rather than to a real audience, since no genuine, meaningful communication is likely to be carried out between the writer and the machine. Moreover, they worry that AWE will lead students to focus only on surface features and formulaic patterns without giving sufficient attention to meaning in writing their essays.
The development and the use of AWE, however, are not a simple black-and-white issue; rather, this issue involves a complex mix of factors concerning software design, pedagogical practices, and learning contexts. Given the fact that many AWE programs have already been in use and involve multiple stakeholders, a blanket rejection of these products may not be a viable, practical stance (Whithaus, 2006). What we need are more investigations and discussions of how AWE programs are used in various writing classes "in order to explicate the potential value for teaching and learning as well as the potential harm" (Williamson, 2004, p. 100). A more pressing question, accordingly, is probably not whether AWE should be used but how this new technology can be used to achieve more desirable learning outcomes while avoiding potential harms that may result from limitations inherent in the technology. The present study, therefore, employed a naturalistic classroom-based approach to explore how an AWE program was implemented in three EFL college writing classes and how student perceptions of the effectiveness of AWE use were affected by different pedagogical practices.
Assessment Validation of AWE: Theory-Based Validity and Context Validity
AWE programs have been promoted by their developers as a cost-effective option for replacing or enhancing human input in assessing and responding to student writing2. Due to AWE vendors’ relentless promotion coupled with an increasing demand for technology use in educational institutions, more and more teachers and students have used, are using, or are considering using this technology, thus making research on AWE urgently important (Ericsson & Haswell, 2006; Shermis & Burstein, 2003a; Warschauer & Ware, 2006). Most of the research on AWE, however, has been funded by AWE developers, serving to promote commercial vitality and support refinement of their products. Industry-funded AWE studies have been mostly concerned with psychometric issues with a focus on the instrument validity of AWE scoring systems. They generally report high agreement rates between AWE scoring systems and human raters. They also demonstrate that the scores given by AWE systems and those by other measures of the same writing construct are strongly correlated (see Dikli, 2006; Keith, 2003; Phillips, 2007). These findings aim to ensure AWE scoring systems’ construct validity and provide evidence that AWE can rate student writing as well as humans do.
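For concreteness, the agreement and correlation statistics that such psychometric studies report can be computed as in the following sketch. The score pairs here are invented for illustration only and are not drawn from any cited study.

```python
# Hypothetical machine and human holistic scores on a 6-point scale.
machine = [4, 5, 3, 6, 4, 5, 2, 4, 5, 3]
human   = [4, 4, 3, 6, 5, 5, 2, 4, 4, 3]

n = len(machine)
# Exact agreement: both raters give the identical score.
exact = sum(m == h for m, h in zip(machine, human)) / n
# Adjacent agreement: scores differ by at most one point,
# the criterion most often reported in machine-human comparisons.
adjacent = sum(abs(m - h) <= 1 for m, h in zip(machine, human)) / n

# Pearson correlation, computed from first principles.
mean_m = sum(machine) / n
mean_h = sum(human) / n
cov = sum((m - mean_m) * (h - mean_h) for m, h in zip(machine, human))
var_m = sum((m - mean_m) ** 2 for m in machine)
var_h = sum((h - mean_h) ** 2 for h in human)
r = cov / (var_m * var_h) ** 0.5

print(f"exact agreement:    {exact:.2f}")   # 0.70
print(f"adjacent agreement: {adjacent:.2f}")  # 1.00
print(f"Pearson r:          {r:.2f}")
```

As Chung and Baker's caution in the next paragraph suggests, high values on all three statistics establish reliability between raters, not the validity of what is being measured.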
Assessment validation, however, is more complex than simply comparing scores from different raters or measures. Chung and Baker (2003) caution that "high reliability or agreement between automated and human scoring is a necessary, but insufficient condition for validity" (p. 29). As Weir (2005) points out, construct validity should not be seen purely as "a matter of the a posteriori statistical validation," but it also needs to be viewed as an "a priori investigation of what should be elicited by the test before its actual administration" (p. 17). Weir stresses the importance of the non-statistical a priori validation and suggests that "theory-based validity" and "context validity" are crucial for language assessment. From a socio-cognitive approach, Weir notes that these two types of validity have "a symbiotic relationship" and are influenced by, and in turn influence, the criteria or construct used for marking as part of scoring validity (p. 20). He calls special attention to the role of context in language assessment, as context highlights the social dimension of language use and serves as an essential determinant of communicative language ability.
To examine the theory-based validity of AWE, we need to discuss what writing is from a theoretical perspective. A currently accepted view of writing employs a socio-cognitive model emphasizing writing as a communicative, meaning-making act. Writing requires not only linguistic ability for formal accuracy but, more importantly, meaning negotiation with readers for genuine communicative purposes. Writing thus needs to take into account both internal language processing and contextual factors that affect how texts are composed and read (Flower, 1994; Grabe & Kaplan, 1996; Hyland, 2003). Most AWE programs, however, are theoretically grounded in a cognitive information-processing model, which does not focus on the social and communicative dimensions of writing. They treat texts solely as "code" devoid of sociocultural contexts and "process them as meaningless ‘bits’ or tiny fragments of the mosaic of meaning" (Ericsson, 2006, p. 36). They "read" student essays against generic forms and preset information, but show no concern for human audiences in real-world contexts.
Even the Conference on College Composition and Communication (CCCC) in the U.S. has expressed disapproval of using AWE programs for any assessment purpose and issued a strong criticism: "While they [AWE programs] may promise consistency, they distort the very nature of writing as a complex and context-rich interaction between people. They simplify writing in ways that can mislead writers to focus more on structure and grammar than on what they are saying by using a given structure and style" (CCCC, 2006). CCCC’s criticism expresses its concern with not only the theory-based validity of AWE programs but also a washback effect, or a "consequential validity" (in Weir’s, 2005, terminology): AWE use may encourage students to write to gain high scores by giving more attention to the surface features that are more easily detected by AWE systems than to the construction of meaning for communicative purposes (Cheville, 2004).
With regard to context validity, information on how and why AWE is used to assess student writing in educational contexts is often lacking (Chung & Baker, 2003). When the context and the purpose of using AWE are unknown, it is difficult to truly judge the validity of AWE programs. In addition, Keith (2003) points out that most psychometric studies on AWE have been conducted on large-scale standardized tests rather than on classroom writing assessments; hence, the validity of AWE could differ in these two types of contexts. He speculates that the machine-human agreement rate may be lower for classroom assessments, since the content and meaning of student essays are likely to be more valued by classroom teachers.
Another important validation issue for AWE is the credibility of the scoring systems. A number of studies have found that writers can easily fool these systems. For instance, an essay that is lengthy or contains certain lexico-grammatical features preferred by the scoring systems can receive a good score, even though the content is less than adequate (Herrington & Moran, 2001; Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Ware, 2005). Students can thus devise means of beating such systems, rather than making a real effort to improve their writing. Moreover, since AWE systems process an essay as a series of codes, they fail to recognize inventive or illogical writing (Cheville, 2004) and cannot detect nuances such as sarcasm, idioms, and clichés in student essays (Herrington & Moran, 2001). When meaning and content are emphasized more than form, the fairness of AWE scoring is often called into question.
Pedagogical Foundation of AWE: Formative Learning and Learner Autonomy
To enhance their pedagogical value, several AWE programs, such as MY Access! and Criterion, have been developed for not only summative but also formative assessment purposes by providing scores and diagnostic feedback on various rhetorical and formal aspects of writing for every essay draft submitted to their scoring systems. Students can then use the computer-generated assessment results and diagnostic advice to help them revise their writing as many times as they need. The instructional efficacy of AWE, as Shermis and Burstein (2003b) suggest, increases when its use moves from that of summative evaluation to a more formative role. Though AWE scoring systems’ validity remains contested, their diagnostic feedback function seems pedagogically appealing for formative learning.
Formative assessment is used to facilitate learning rather than measure learning, as it focuses on the gap between present performance and the desired goal, thus helping students identify areas of strength and weakness and gain direction for improvement (Black & Wiliam, 1998). For second language (L2) writing, formative feedback, as Hyland (2003) suggests, is particularly crucial in improving and consolidating learners’ writing skills. It serves as an in-process support that helps learners develop strategies for revising their writing. Formative feedback can therefore support process-writing approaches that emphasize the need for multiple drafting through a scaffold of prompts, explanations, and suggestions. Although formative feedback is a central aspect of writing instruction, Hyland and Hyland (2006) point out that research has not been unequivocally positive about its role in L2 writing development since many pedagogical issues regarding feedback remain only partially addressed. The form, the focus, the quality, the means of delivery, the need for, and the purpose of feedback can all affect the usefulness of feedback in improving writing. These issues are vital not only for human feedback but for automated feedback as well.
Studies on AWE for formative learning, however, have not been able to demonstrate that automated feedback is of much help during students’ revising process. The most frequently reported reason is that automated feedback provides formulaic comments and generic suggestions for all the submitted revisions. Thus, students may find it of limited use. Moreover, since such feedback is predetermined and unable to provide context-sensitive responses involving rich negotiation of meaning, it is useful only for the revision of formal aspects of writing but not of content development (Cheville, 2004; Grimes & Warschauer, 2006; Yang, 2004; Yeh & Yu, 2004). Additionally, Yang’s (2004) study reveals that more advanced language learners appeared to show less favorable reactions toward the AWE feedback. Learners’ language proficiency may constitute another variable affecting the value of such feedback.
AWE programs, like many other CALL tutors, are designed to foster learner autonomy by performing error diagnosis of learner input, generating individualized feedback, and offering self-access resources such as dictionaries, thesauri, editing tools, and student portfolios. In theory, such programs can provide opportunities for students to direct their own learning, independent of a teacher, to improve their writing through constant feedback and assistance features in a self-regulated learning environment. However, whether students can develop more autonomy in revising their writing through computer-generated feedback and making use of the self-help writing and editing tools available to them is uncertain. This raises questions about students’ attitudes toward, and motivation for, using AWE (Ware, 2005). Additionally, Beatty (2003) cautions that CALL tutors often "follow a lock-step scope and sequence," thus giving learners "only limited opportunities to organize their own learning or tailor it to their special needs" (p. 10). Such a problem may also occur when AWE is used.
Vendor-funded studies on AWE programs have demonstrated significant improvement on standardized writing tests (e.g., Attali, 2004; Elliot, Darlington, & Mikulas, 2004; Vantage Learning, 2007). Although these results are encouraging, Warschauer and Ware (2006) criticize many of these studies for being methodologically unsound and outcome-based. Accordingly, a major problem of this type of research is that "it leaves the educational process involved as a black box" (p. 14). These studies seem to attribute the observed writing improvement to the AWE software itself but ignore the importance of learning and teaching processes. Warschauer and Ware thus suggest that research on AWE should investigate "the interaction between use and outcome" (p. 10), for it can provide a more contextualized understanding of the actual use of AWE and its effectiveness in improving writing.
One recent classroom-based AWE study on the interaction between use and outcome is particularly worth noting. Grimes and Warschauer (2006) investigated how MY Access! and Criterion were implemented in U.S. high school writing classes. They found two main benefits of using AWE: increased motivation to practice writing for students and easier classroom management for teachers. More importantly, their study revealed three paradoxes of using AWE. First, teachers’ positive views of AWE did not contribute to more frequent use of the programs in class, as teachers needed class time for grammar drills and preparation for state tests. Second, while teachers often disagreed with the automated scores, they viewed AWE positively because, for students, the speed of responses was a strong motivator to practice writing, and, for teachers, the automated scores allowed them to offload the blame for grades onto a machine. Third, teachers valued AWE for revision, but scheduled little time for it. Students thus made minimal use of automated feedback to revise their writing except to correct spelling, punctuation, and grammatical errors. Their revisions were generally superficial and showed little improvement in content. In addition, the use of these two programs did not significantly improve students’ scores on standardized writing tests. The authors caution that AWE can be misused to reinforce artificial, mechanistic, and formulaic writing disconnected from communication in real-world contexts.
Based on the studies reviewed here, it can be concluded that the validity of AWE scoring systems has not been thoroughly established and the usefulness of automated feedback remains uncertain in any generalized sense. AWE programs, even those designed for formative learning and emphasizing learner autonomy, do not seem to improve student writing significantly in either form or meaning. Therefore, AWE programs are often recommended as a supplement to writing instruction rather than as a replacement for writing teachers (Shermis & Burstein, 2003b; Ware, 2005; Warschauer & Ware, 2006). Yet, how AWE can be used as an effective supplement in the writing class and how different learning contexts and pedagogical designs might affect the effectiveness of AWE warrant further investigation. The present study addresses these issues by exploring the interaction between different pedagogical practices with an AWE program in three EFL writing classes and student perceptions of learning outcomes. The purpose is to reveal how different learning/teaching processes affect the perceived value of AWE in improving students’ writing.
This study is a naturalistic classroom-based inquiry that was conducted in three EFL college writing classroom contexts in a university in Taiwan. The three writing classes, for third-year English majors, were taught by three instructors who were all experienced EFL writing teachers. They shared some common features in their writing instruction: 1) the three writing courses were required of third-year Taiwanese college students majoring in English; 2) their course objectives all aimed to develop students’ academic writing skills; 3) the three instructors used the same textbook and taught similar content; 4) each class ran for 18 weeks and met three hours per week; and 5) they adopted a similar process-writing approach, including model essay reading activities followed by language exercises and pre-writing, drafting and revising activities.
An AWE program, MY Access! (Version 5.0) (Vantage Learning, 2005), was implemented in the three writing classes for one semester on a trial basis. The main purpose for the AWE implementation was to facilitate students’ writing development and to reduce the writing instructors’ workload. Before the writing courses started, the three instructors received a one-hour training workshop given by a MY Access! consultant. The workshop introduced how each feature of the program worked; however, it did not provide hands-on practice or instructional guidelines. Consequently, the three instructors had to spend extra time working with the program to familiarize themselves with its features and to develop their own pedagogical ideas for the AWE implementation. They had total autonomy to design writing activities with MY Access! as they saw fit for their respective classes. No predetermined decision on how to incorporate the program with their writing instruction was made by the institution or the researchers.
The three writing classes varied slightly in size: there were 26 students in Class A, 19 in Class B, and 23 in Class C. All the students were Taiwanese third-year college students majoring in English. They had formally studied English for eight years: six years in junior and senior high school and two years in college. Their English language proficiency was approximately at the upper-intermediate level. They were taking the required junior year EFL academic writing course and also had taken fundamental academic writing courses in their freshman and sophomore years. It was their first time using AWE software in their writing classes. As English majors, most of them were highly motivated to develop their English writing skills.
MY Access! is a web-based AWE program using the IntelliMetric automated essay scoring system developed by Vantage Learning. The scoring system has been calibrated with a large set of essays pre-scored by human raters. These essays are then used as a basis for the system to extract the scoring scale and the pooled judgment of the human raters (Elliot, 2003). It can provide holistic and analytic scores on a 4-point or 6-point scale along with diagnostic feedback on five dimensions of writing: focus and meaning, content and development, organization, language use and style, and mechanics and conventions. The program offers a wide range of writing prompts from informative, narrative, persuasive, literary, and expository genres for instructors to select for writing assignments. It can be used as a formative or summative assessment tool. When used for formative learning, the program allows for multiple revisions and editing. Students can revise their essays multiple times based on the analytic assessment results and diagnostic feedback given to each essay draft submitted to the program. When used for summative assessment, the system is configured to allow a single submission and provide an overall assessment result.
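The calibration idea described above can be sketched in miniature as follows. IntelliMetric’s actual model is proprietary and far more sophisticated; this toy version fits a single surface feature (word count) of hypothetical pre-scored essays to the human-assigned scores and then predicts a score for a new essay. All numbers are invented for illustration.

```python
# Hypothetical calibration set: (word_count, human score on a 6-point scale).
training = [
    (120, 2), (180, 3), (240, 3), (300, 4), (360, 5), (420, 5), (480, 6),
]

# Fit a simple least-squares line mapping the feature to human scores.
n = len(training)
mean_x = sum(x for x, _ in training) / n
mean_y = sum(y for _, y in training) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in training)
         / sum((x - mean_x) ** 2 for x, _ in training))
intercept = mean_y - slope * mean_x

def score(word_count):
    """Predict a score for a new essay, rounded and clamped to the 1-6 scale."""
    raw = intercept + slope * word_count
    return max(1, min(6, round(raw)))

print(score(330))  # predicted score for a hypothetical 330-word essay
```

The sketch also makes the critics' point from the earlier section concrete: a model calibrated on surface features will reward length and feature frequency regardless of content.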
In addition, the program provides a variety of writing assistance features, including My Editor, Thesaurus, Word Bank, My Portfolio, Writer’s Checklist, Writer’s Guide, Graphic Organizers, and Scoring Rubrics. The first four features were most commonly used in the three writing classes: 1) My Editor is a proofreading system that automatically detects errors in spelling, grammar, mechanics, and style, and then provides suggestions on how such problems can be corrected or improved; 2) Thesaurus is an online dictionary that offers a list of synonyms for the word being consulted; 3) Word Bank offers words and phrases for a number of writing genres, including comparison, persuasive, narrative, and cause-effect types of essays; 4) My Portfolio contains multiple versions of essays from a student along with the automated scores and feedback. It allows students to access their previous work and view their progress.