Enhancing Writing Assessment: AI Assisted Essay Marking

Table of Contents:

  1. Introduction
  2. Background Information on Writing Assessment
  3. Challenges in Developing Automated Scoring Models
  4. Process for Developing and Evaluating an Automated Evaluation System
  5. Training and Validating the Models
  6. Analyzing Model Performance
  7. Findings from Initial Model Testing
  8. Promising Potential of Automated Scoring Models
  9. Limitations of Automated Evaluation Systems
  10. Future Research and Conclusion

Introduction

In this article, we delve into automated writing scoring models and their implications for the field of writing assessment. We explore the considerations, procedures, and preliminary findings from the effort to move from a solely human-reader system to one in which machine scores assist human readers. Automated writing evaluation (AWE) systems, which draw on artificial intelligence, natural language processing, and statistical techniques, can enhance the efficiency, scalability, and overall quality of ratings. However, certain challenges and limitations must be addressed to ensure a reliable and fair writing assessment for all test takers. We therefore discuss the development, evaluation, and potential of automated scoring models, while also recognizing the need for human input and further refinement.

Background Information on Writing Assessment

Before diving into the intricacies of automated scoring models, let's first familiarize ourselves with the writing assessment system. The assessment typically revolves around two types of prompts: writing an email and responding to a survey question. Test takers read the instructions and prompts on screen and then write and edit their responses on the computer. Under the current human-only rating system, all writing tasks are evaluated by human readers using a standardized analytic rating rubric. Each writing sample is scored by two independent raters, with any significant disagreements resolved by a third rater.

Challenges in Developing Automated Scoring Models

Developing automated scoring models for the writing assessment poses several challenges. First, test takers span a wide range of proficiency levels, with 11 possible score points. In addition, the writing samples are relatively short, averaging about 118 words. Furthermore, the test taker population comprises individuals from diverse linguistic, cultural, and educational backgrounds. These factors make it challenging to build scoring models that accurately capture the nuances of writing ability across such a varied population.

Process for Developing and Evaluating an Automated Evaluation System

The development and evaluation of an automated evaluation system follow a systematic process. It begins with data preparation: collecting, selecting, reviewing, and pre-processing the data used to train the models. The data is split into a training set and a small initial testing set. To create accurate scoring models, data scientists and engineers employ techniques from artificial intelligence, natural language processing, and statistics. The models are trained on a diverse range of writing samples that cover different proficiency levels and accurately represent the test taker population.

To ensure the quality of the ratings used for training, only responses where the human readers reached a consensus in their primary judgment are selected. In addition, internal staff extensively review a subset of the writing samples to confirm that the ratings adequately reflect the writing performance level. High-quality rating samples used in rater training and certification are also included in the training set.
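
To make this selection and split concrete, here is a minimal Python sketch. It assumes the responses live in a pandas DataFrame with illustrative column names (response_text, task_type, rater1, rater2); none of these names, nor the 10% split size, come from the program itself.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def prepare_data(df: pd.DataFrame, test_size: float = 0.1, seed: int = 42):
        """Keep responses where the two independent raters agreed, then split
        them into a training set and a small held-out testing set."""
        consensus = df[df["rater1"] == df["rater2"]].copy()
        consensus["score"] = consensus["rater1"]

        # Stratify by score so every score point appears in both splits.
        train_df, test_df = train_test_split(
            consensus,
            test_size=test_size,
            random_state=seed,
            stratify=consensus["score"],
        )
        return train_df, test_df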

Training and Validating the Models

Two scoring models are trained, one for each writing task type (W1 and W2). Language assessment experts and psychometricians play a crucial role in ensuring the integrity of the models. Once the initial models are trained and validated on a small sample, they are applied to score an independent sample of 271 writings. This analysis of model performance on the small testing sample serves as the foundation for a more comprehensive evaluation and sheds light on areas where the models may need further refinement.
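
The article does not describe the model architecture itself, so the sketch below is only an illustration of the one-model-per-task-type setup: a simple TF-IDF feature extractor feeding a ridge regression, reusing the hypothetical column names from the earlier sketch.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    def train_task_models(train_df):
        """Fit one scoring model per writing task type (W1 and W2)."""
        models = {}
        for task in ("W1", "W2"):
            subset = train_df[train_df["task_type"] == task]
            pipeline = make_pipeline(
                TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                Ridge(alpha=1.0),
            )
            pipeline.fit(subset["response_text"], subset["score"])
            models[task] = pipeline
        return models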

Analyzing Model Performance

Agreement between human and machine scores is a critical aspect of evaluating model performance. Apart from agreement statistics, the correlation between machine and human ratings is also examined. In addition to these quantitative metrics, writing samples with significant disagreements between machine-assigned and human-assigned scores are reviewed. These analyses provide insight into the models' ability to accurately identify off-topic writings and grade templated responses.
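
As a rough sketch of these checks, the helper below computes exact and adjacent agreement, quadratic weighted kappa, and the Pearson correlation between human and machine scores, and flags responses whose scores differ by two or more points for manual review. The two-point threshold is an assumption for illustration, not the program's published rule.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    def agreement_report(human: np.ndarray, machine: np.ndarray) -> dict:
        """Summarize human-machine agreement on the held-out testing sample."""
        machine_rounded = np.rint(machine).astype(int)
        diff = np.abs(human - machine_rounded)
        return {
            "exact_agreement": float(np.mean(diff == 0)),
            "adjacent_agreement": float(np.mean(diff <= 1)),
            "quadratic_weighted_kappa": cohen_kappa_score(
                human, machine_rounded, weights="quadratic"
            ),
            "pearson_r": float(pearsonr(human, machine)[0]),
            # Indices of responses to route to a human reviewer.
            "flagged_for_review": np.where(diff >= 2)[0].tolist(),
        }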

Findings from Initial Model Testing

The results from the initial model testing show satisfactory overall agreement with human scoring. However, the machine scores show a central tendency, clustering around level 8, which results in lower agreement at the high end of the score distribution. At the lower end, the machine struggles to accurately identify off-topic writings or precisely grade templated responses. While these findings indicate promising potential for automated scoring models, they also highlight areas that require improvement and refinement.
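
One simple way to surface such a central-tendency pattern is to tabulate machine behaviour within each human score level, as in the illustrative helper below (again using the hypothetical variable names from the earlier sketches, not the study's own analysis code).

    import numpy as np
    import pandas as pd

    def agreement_by_score_level(human, machine) -> pd.DataFrame:
        """Summarize machine scores within each human score level to spot
        compression toward the middle of the scale."""
        df = pd.DataFrame({"human": human, "machine": np.rint(machine).astype(int)})
        df["exact"] = (df["human"] == df["machine"]).astype(int)
        summary = df.groupby("human").agg(
            n=("machine", "size"),
            mean_machine=("machine", "mean"),
            exact_agreement=("exact", "mean"),
        )
        return summary.reset_index()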

Promising Potential of Automated Scoring Models

Despite these limitations, automated scoring models hold immense promise as an addition to the writing assessment process. Well-trained scoring models can provide consistent, objective, and speedy ratings across a large number of essays. This enhances scalability and allows a larger volume of test takers to be assessed. However, it is crucial to acknowledge the limitations of automated evaluation systems in evaluating originality and creativity.

Limitations of Automated Evaluation Systems

Automated evaluation systems have their limitations and cannot fully replace human readers. While automated scoring models excel in efficiency and scalability, they cannot fully comprehend the meaning of essays or evaluate complex constructs such as creativity and critical thinking. These limitations necessitate the use of machine scores in conjunction with human readings to ensure a reliable and valid writing assessment.

Future Research and Conclusion

The initial findings from this study pave the way for further research and refinement of the automated scoring models. More comprehensive studies will be conducted to evaluate and improve the models, ensuring their validity and fairness for all test takers. By striking a balance between automated scoring and human judgment, we can harness the benefits of efficiency and scalability while maintaining the quality and reliability of the writing assessment.

Highlights:

  • Automated writing scoring models aim to incorporate machine scores into the human reader system for more efficient and scalable writing assessment.
  • Challenges in developing scoring models include a wide range of proficiency levels, short writing samples, and a diverse test taker population.
  • The process for developing and evaluating an automated evaluation system involves data preparation, training and validating the models, and analyzing their performance.
  • Initial findings from model testing show promising agreement with human scoring, but room for improvement in identifying off-topic writings and grading templated responses.
  • The potential of automated scoring models lies in providing consistent, objective, and speedy ratings, enhancing the scalability of the writing assessment.
  • Limitations of automated evaluation systems include the inability to evaluate originality and creativity, necessitating the use of machine scores in conjunction with human readings.
  • Further research and refinement of the models are necessary to ensure their validity and fairness for all test takers.

FAQ:

Q: Can automated scoring models accurately assess creativity?

A: Automated scoring models have limitations when it comes to evaluating creativity. While they excel in efficiency and objectivity, creativity is a complex construct that requires human judgment and understanding.

Q: How do automated scoring models handle a diverse test taker population?

A: Automated scoring models are trained using a diverse range of writing samples that accurately represent the test taker population. This helps ensure that the models can capture the nuances and variations in writing ability across different linguistic, cultural, and educational backgrounds.

Q: Are machine scores more consistent than human scores?

A: Machine scores have the advantage of providing consistent ratings across a large number of essays. However, human scores bring the ability to comprehend the meaning of essays and evaluate complex constructs. Combining machine scores with human readings ensures a reliable and valid writing assessment.

Q: Can automated scoring models handle large volumes of essays?

A: Yes, one of the key advantages of automated scoring models is their scalability. They can handle large volumes of essays, making the assessment process more efficient and manageable.

Q: What are the future implications of automated scoring models in writing assessment?

A: The initial findings suggest that automated scoring models have the potential to greatly enhance the writing assessment process. However, further research and refinement of the models are necessary to ensure their validity and fairness for all test takers.
