Revolutionizing Essay Marking with AI: Automated Writing Scoring Models

Table of Contents:

  1. Introduction
  2. Background Information on Writing Assessment
  3. Challenges in Developing Automated Scoring Models
  4. Process of Developing and Evaluating an Automated Evaluation System
  5. Training and Validating the Models
  6. Applying the Models to Score Independent Writing Samples
  7. Evaluating Model Performance
  8. Strengths and Limitations of Automated Scoring Models
  9. Using Machine Scores in Conjunction with Human Ratings
  10. Future Studies and Improvements

Introduction

Automated writing evaluation has become a popular and efficient method for conducting ratings in writing assessments. This paper explores the development and evaluation of automated scoring models in the context of a human reader-only system. By incorporating machine scores, the workload can be alleviated, efficiency improved, and high-quality ratings maintained.

Background Information on Writing Assessment

The writing assessment used in this study consists of two types of prompts: writing an email and responding to a survey question. Test takers read the instructions and prompts on the screen, then write and edit their responses on the computer. Currently, all writing tasks are rated by human readers using a standardized analytic rating rubric.

Challenges in Developing Automated Scoring Models

Developing automated scoring models for this particular writing assessment poses several challenges. These include a wide range of proficiency levels, relatively short writing samples, and a diverse test taker population with different language, cultural, and educational backgrounds.

Process of Developing and Evaluating an Automated Evaluation System

The process of developing and evaluating an automated evaluation system involves data preparation, training and validating the models, applying the models to score independent writing samples, and evaluating model performance. The data is collected, reviewed, and pre-processed to train the models. Multiple teams handle specific tasks, including data scientists, engineers, language assessment experts, and psychometricians.

Training and Validating the Models

Two models are trained, one for each writing test type (W1 and W2). The models are trained using artificial intelligence, natural language processing, and statistical techniques. Writing samples covering a broad range of proficiency levels are selected for training. Response samples on which human readers agreed in their primary judgment are chosen. Additionally, high-quality rating samples are included in the training set.
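
As a rough illustration of what task-specific models like these could look like, the sketch below trains one regression-based scorer per test type on rated samples. The actual feature set and learning algorithm are not described in the article, so the TF-IDF features, the ridge regression model, and the variable names are assumptions for illustration only.

    # Minimal sketch: one scoring model per task type (W1, W2).
    # TF-IDF + ridge regression are illustrative stand-ins for the
    # undisclosed NLP and statistical techniques used in the study.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    def train_task_model(texts, human_scores):
        """Fit a simple regression-based scorer on rated writing samples."""
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # surface lexical features
            Ridge(alpha=1.0),                               # simple statistical scorer
        )
        model.fit(texts, human_scores)
        return model

    # One model per writing test type, trained on its own rated samples, e.g.:
    #   models = {"W1": train_task_model(w1_texts, w1_scores),   # hypothetical variables
    #             "W2": train_task_model(w2_texts, w2_scores)}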

Applying the Models to Score Independent Writing Samples

After the initial models are trained and validated on a small sample, they are applied to score an independent sample of 271 writings. The agreement between human and machine scores is analyzed by proficiency level, and the correlation between machine and human ratings is investigated. The writing samples with the largest disagreement between machine and human ratings are reviewed.
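
The agreement and correlation analysis described above could be computed along the following lines; the score rounding, the one-level adjacency window, and the use of quadratic weighted kappa are assumptions, not the study's reported metrics.

    # Sketch of human-machine agreement statistics for a scored sample.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    def agreement_report(human, machine):
        """Exact/adjacent agreement, correlation, and quadratic weighted kappa."""
        human = np.asarray(human).astype(int)
        machine_rounded = np.rint(np.asarray(machine)).astype(int)
        return {
            "exact": float(np.mean(machine_rounded == human)),
            "adjacent": float(np.mean(np.abs(machine_rounded - human) <= 1)),
            "pearson_r": float(pearsonr(machine, human)[0]),
            "qwk": float(cohen_kappa_score(machine_rounded, human, weights="quadratic")),
        }

    # e.g. agreement_report(human_scores_271, machine_scores_271)  # hypothetical arrays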

Evaluating Model Performance

The initial model testing shows satisfactory overall agreement with human scoring. However, the machine scores tend to show central tendency at level 8, resulting in low agreement levels at the high end of the score distribution. The machine also struggles with accurately identifying off-topic writings and grading templated responses. These findings indicate the need for refinement and improvement of the models.

Strengths and Limitations of Automated Scoring Models

Automated scoring models offer consistent, objective, and speedy ratings across a large number of essays. They enhance the scalability of the assessment by handling large volumes of test takers. However, limitations include the inability to evaluate originality and creativity.

Using Machine Scores in Conjunction with Human Ratings

To ensure a reliable and valid writing assessment, machine scores can be used in conjunction with human ratings. This combination of automated and human evaluation can provide a more comprehensive and fair assessment of test takers' writing abilities.

Future Studies and Improvements

Further studies will be conducted to evaluate and improve the models, ensuring they are valid and fair to all test takers. Continuous refinement and input from multiple teams will contribute to enhancing the accuracy and reliability of the automated scoring models.

Article: Developing and Evaluating Automated Writing Scoring Models

Writing assessments play a crucial role in evaluating individuals' writing abilities. Traditional assessments involve human readers who analyze and rate written compositions based on established rubrics. However, the emergence of automated writing evaluation (AWE) systems has transformed the way writing assessments are conducted.

AWE systems utilize artificial intelligence, natural language processing, and statistical techniques to provide efficient and scalable ratings. They have been shown to generate scores that exhibit a high correlation with human judgments, making them an attractive option for conducting reliable and objective evaluations.

In the early stages of AWE system development, the focus was on surface-level language features. This approach raised concerns about their ability to fully represent the constructs of interest. Unlike human readers who comprehend the meaning of essays, consider multiple aspects in their judgments, and evaluate complex constructs like creativity and critical thinking, machine scores may lack this comprehensive understanding.

To address these concerns, this study explores the incorporation of machine scores into a human reader-only system. By introducing machine scores, the aim is to alleviate the workload, improve efficiency, and maintain high-quality ratings.

The writing assessment used in this study includes two types of prompts: writing an email and responding to a survey question. Test takers engage with these prompts by reading the instructions and writing their responses on a computer. Currently, the evaluation of writing tasks is carried out exclusively by human readers who apply a standardized analytic rating rubric.

Developing automated scoring models for this particular writing assessment presents several challenges. The assessment encompasses a wide range of proficiency levels, and the writing samples are relatively short, consisting of approximately 118 words. Additionally, the test taker population is diverse, with various language, cultural, and educational backgrounds.

The process of developing and evaluating an automated evaluation system involves data preparation, training and validating the models, applying the models to score independent writing samples, and evaluating the performance of the models.

Data preparation encompasses activities such as collecting, selecting, reviewing, and pre-processing the data to be used for training the models. The data is split into a training set and a small initial testing set. Separate scoring models are trained for each writing test type (W1 and W2), with different teams shouldering primary responsibilities at each step. Data scientists and engineers handle the main tasks of building the models, while language assessment experts and psychometricians contribute to supporting tasks and provide crucial input.
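
A minimal sketch of that split step, assuming the rated responses sit in a pandas DataFrame with task-type and human-score columns (hypothetical names), might look like this:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def split_task_data(task_df: pd.DataFrame, test_size: float = 0.1):
        """Split one task type's rated responses into a training set and a small initial test set."""
        return train_test_split(
            task_df,
            test_size=test_size,              # small held-out set for initial model testing
            stratify=task_df["human_score"],  # keep the score distribution comparable
            random_state=42,
        )

    # Applied separately per task type (column names are hypothetical):
    #   w1_train, w1_test = split_task_data(responses[responses["task_type"] == "W1"])
    #   w2_train, w2_test = split_task_data(responses[responses["task_type"] == "W2"])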

To ensure the quality of ratings provided by the models, responses with agreement among readers in the primary judgment serve as training samples. Internal staff also review many of the resulting rating samples to confirm that the ratings reflect the test takers' writing performance level. High-quality rating samples used in training and certification are also included in the training set.
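
A simple way to express that reader-agreement filter, assuming each response carries two human reader scores under hypothetical column names, is sketched below.

    import pandas as pd

    def select_agreed_samples(ratings: pd.DataFrame) -> pd.DataFrame:
        """Keep responses on which both readers gave the same primary score."""
        agreed = ratings[ratings["reader1_score"] == ratings["reader2_score"]].copy()
        agreed["train_score"] = agreed["reader1_score"]  # the agreed score becomes the training label
        return agreed

    # e.g. training_pool = select_agreed_samples(double_rated_responses)  # hypothetical table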

Once the initial models are trained and validated on a small sample, they are applied to score an independent sample of 271 writings. The performance of the models on this small testing sample lays the groundwork for more comprehensive evaluation. Additionally, it sheds light on areas in which the models need refinement.

The agreement between human and machine scores is analyzed, particularly in relation to proficiency levels. Correlation between machine and human ratings is also investigated. The writing samples that exhibit the largest disagreement between machine and human ratings are given particular attention during the review process.
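
Pulling the most discrepant responses for that review could be done with a short helper like the following; the DataFrame and its column names are assumptions.

    import pandas as pd

    def largest_disagreements(scored: pd.DataFrame, n: int = 20) -> pd.DataFrame:
        """Return the n responses with the largest gap between machine and human scores."""
        gap = (scored["machine_score"] - scored["human_score"]).abs()
        return scored.assign(gap=gap).sort_values("gap", ascending=False).head(n)

    # e.g. review_queue = largest_disagreements(scored_271_sample)  # hypothetical DataFrame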

The initial testing of the models shows satisfactory overall agreement with human scoring. However, the machine scores tend to cluster around level 8, leading to low agreement levels at the high end of the score distribution. Furthermore, the machine struggles to accurately identify off-topic writings and grade templated responses. These findings indicate the need for further refinement and improvement of the models to ensure their reliability and validity.
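
One way to surface that central-tendency pattern is to break exact agreement down by human score level, as in this sketch (column names assumed):

    import pandas as pd

    def agreement_by_level(scored: pd.DataFrame) -> pd.DataFrame:
        """Exact-agreement rate and sample count at each human score level."""
        exact = scored["machine_score"].round() == scored["human_score"]
        return (
            scored.assign(exact=exact)
            .groupby("human_score")["exact"]
            .agg(exact_agreement="mean", n="count")
        )

    # A drop in exact_agreement at the top levels would reflect the clustering around level 8.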

Despite the strengths of automated scoring models, such as providing consistent, objective, and speedy ratings, there are limitations. These models cannot evaluate originality and creativity, which are important aspects of writing. Therefore, a combination of machine scores and human ratings is proposed to ensure a comprehensive and fair assessment. This approach would leverage the efficiency and scalability of automated scoring while maintaining the expertise and judgment of human readers.
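
As one hypothetical way to combine the two sources of evidence, the sketch below confirms the human rating when the machine score falls within one level of it and otherwise flags the response for a second human read; the tolerance and routing rule are illustrative, not the operational design described here.

    def resolve_score(human_score: int, machine_score: float, tolerance: int = 1):
        """Confirm the human rating when the machine agrees within tolerance; otherwise flag for a second read."""
        if abs(round(machine_score) - human_score) <= tolerance:
            return human_score, False   # machine corroborates the human rating
        return None, True               # discrepancy: route to a second human reader

    # e.g. final, needs_second_reader = resolve_score(human_score=8, machine_score=5.6)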

To enhance the accuracy and reliability of the automated scoring models, more studies will be conducted. These studies will evaluate the models in terms of validity and fairness for all test takers. Continuous refinement and input from multiple teams, including data scientists, engineers, language assessment experts, and psychometricians, will contribute to improving the models and enhancing the overall writing assessment process.

In conclusion, the development and evaluation of automated writing scoring models present a promising addition to the writing assessment field. By incorporating machine scores alongside human ratings, reliable and efficient evaluations can be achieved on a large scale. However, it is important to acknowledge the limitations of automated systems and continuously work towards enhancing their capabilities. Through further studies and improvements, a more comprehensive and fair writing assessment system can be realized.
