Revolutionize Essay Scoring with Automated System

Table of Contents:

  1. Introduction
  2. Literature Review
  3. Feature Extraction and Regression Approach
  4. Grammar and Spelling Error Detection
  5. Readability and Part-of-Speech Features
  6. Argument and Topic-Related Features
  7. Regression Models for Essay Scores
  8. Recurrent Neural Network for Scoring Essays
  9. Network Structure of the RNN
  10. Training Process
  11. Results and Performance
  12. Discoveries and Insights
  13. Discussion
  14. Drawbacks and Future Improvements

Introduction

Hello, our group is working on an automated essay scoring system called Ozzy. My name is Armand, and I will give you a brief introduction to our final project. The aim of our project is to create an automated essay scoring system that simplifies the traditional grading process, reduces subjectivity, and minimizes human error. Our system is designed to analyze grammar errors, vocabulary coherence, and the relationship of the essay to the given topic or source.

Literature Review

Our project is inspired by prior literature and publications that have explored similar topics, and we have incorporated concepts from lectures, such as Bayesian methods, n-grams, and neural networks. Our dataset comes from a past Kaggle competition, the ASAP (Automated Student Assessment Prize). It contains two types of essays: persuasive essays and source-dependent essays. Each essay is scored based on different criteria, and there are eight subsets in total, with varying sizes and score ranges.

Feature Extraction and Regression Approach

One of the approaches we have taken in our project is feature extraction combined with regression. In this approach, our first task is to transform the essays into numerical representations. We have tried four different forms of word embedding: Doc2Vec, Bag of Words, TF-IDF, and word embeddings trained as a classification problem. We have experimented with various classifiers, such as SVM, random forests, and logistic regression, to determine the most accurate prediction for each subset.
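
As a concrete illustration, here is a minimal sketch of one variant of this pipeline: TF-IDF vectors fed to the three classifiers named above. `essays` and `scores` are hypothetical placeholders standing in for one ASAP subset, and the parameter choices are assumptions, not the settings the project actually used.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def compare_classifiers(essays, scores):
        # Turn raw essay text into TF-IDF vectors (one of the four
        # representations tried; Doc2Vec, bag of words, and trained
        # embeddings would follow the same pattern).
        X = TfidfVectorizer(max_features=5000).fit_transform(essays)
        models = {
            "svm": SVC(),
            "random_forest": RandomForestClassifier(),
            "logistic_regression": LogisticRegression(max_iter=1000),
        }
        # Mean 5-fold cross-validated accuracy per candidate model.
        return {name: cross_val_score(m, X, scores, cv=5).mean()
                for name, m in models.items()}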

Grammar and Spelling Error Detection

To assess grammar errors in the essays, we used the language-check package, which detects grammar errors in text. We tokenized each essay into sentences and used the percentage of sentences containing grammar errors as our error measure. For spelling errors, we initially used standard dictionaries but ran into problems with words like "Facebook" being flagged as misspellings. We therefore built a customized dictionary from high-scoring essays for additional filtering, which helped us identify spelling errors more accurately.
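
A minimal sketch of the grammar-error feature, assuming the language-check package (import name `language_check`) and NLTK's sentence tokenizer; the function name is illustrative, not from the project code.

    import language_check
    import nltk

    nltk.download("punkt", quiet=True)           # sentence tokenizer models
    tool = language_check.LanguageTool("en-US")

    def grammar_error_rate(essay):
        """Fraction of sentences flagged with at least one grammar error."""
        sentences = nltk.sent_tokenize(essay)
        flagged = sum(1 for s in sentences if tool.check(s))
        return flagged / len(sentences) if sentences else 0.0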

Readability and Part-of-Speech Features

To measure the readability of the essays, we used the textstat package to calculate the Flesch-Kincaid score and reading ease. These features provide insights into the essay's complexity and the grade level it corresponds to. Additionally, we extracted part-of-speech features by employing the natural language toolkit (NLTK) to tag words and determine the percentage of nouns, verbs, and other word types in the essay.
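
Both feature families can be computed in a few lines. This sketch uses the textstat and NLTK calls named above; the feature names are hypothetical.

    from collections import Counter
    import nltk
    import textstat

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def readability_and_pos_features(essay):
        features = {
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(essay),
            "flesch_reading_ease": textstat.flesch_reading_ease(essay),
        }
        tokens = nltk.word_tokenize(essay)
        tags = Counter(tag for _, tag in nltk.pos_tag(tokens))
        # Share of nouns, verbs, adjectives, and adverbs among all tokens
        # (Penn Treebank tag prefixes: NN*, VB*, JJ*, RB*).
        for name, prefix in [("noun", "NN"), ("verb", "VB"),
                             ("adj", "JJ"), ("adv", "RB")]:
            count = sum(n for tag, n in tags.items() if tag.startswith(prefix))
            features[name + "_pct"] = count / len(tokens) if tokens else 0.0
        return features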

Argument and Topic-Related Features

To assess argument-related features, we followed the guidelines provided in a research paper on scoring persuasive essays using LDA (Latent Dirichlet Allocation). We detected opinion sentences using the opinion lexicon from NLTK. For source-dependent essays, which include a large source text, we created a dictionary based on the essay prompt to detect target sentences. For persuasive essays, we found topic-related sentences by extracting keywords from the prompt and searching for sentences that directly mentioned those keywords or their synonyms and antonyms.
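
A hedged sketch of these two features: opinion sentences via NLTK's opinion lexicon, and topic-related sentences via prompt keywords expanded with WordNet synonyms and antonyms. `prompt_keywords` is a hypothetical input standing in for the keywords extracted from the prompt.

    import nltk
    from nltk.corpus import opinion_lexicon, wordnet as wn

    nltk.download("punkt", quiet=True)
    nltk.download("opinion_lexicon", quiet=True)
    nltk.download("wordnet", quiet=True)

    OPINION_WORDS = set(opinion_lexicon.words())

    def expand_keywords(prompt_keywords):
        """Add WordNet synonyms and antonyms of each prompt keyword."""
        expanded = set(prompt_keywords)
        for word in prompt_keywords:
            for synset in wn.synsets(word):
                for lemma in synset.lemmas():
                    expanded.add(lemma.name().lower())
                    expanded.update(a.name().lower() for a in lemma.antonyms())
        return expanded

    def opinion_and_topic_rates(essay, prompt_keywords):
        sentences = nltk.sent_tokenize(essay)
        keywords = expand_keywords(prompt_keywords)
        # Count sentences containing an opinion word / a topic keyword.
        opinion = sum(1 for s in sentences
                      if OPINION_WORDS & set(nltk.word_tokenize(s.lower())))
        topic = sum(1 for s in sentences
                    if keywords & set(nltk.word_tokenize(s.lower())))
        n = len(sentences) or 1
        return opinion / n, topic / n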

Regression Models for Essay Scores

We employed two regression models to predict essay scores. The input consisted of the essays and their extracted features, with each essay represented by a vector of 26 features. The models we used were linear regression and Bayesian ridge regression. The latter estimates a probabilistic model whose spherical Gaussian prior on the weights acts as L2 regularization. We found that the choice of model depended on the data distribution, with Bayesian ridge regression performing better for subsets with a Gaussian-like score distribution.
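
Both models are available in scikit-learn. A minimal sketch, where `X` is the hypothetical (n_essays × 26) feature matrix and `y` the human scores:

    from sklearn.linear_model import BayesianRidge, LinearRegression
    from sklearn.model_selection import train_test_split

    def fit_regressors(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        linear = LinearRegression().fit(X_tr, y_tr)
        # BayesianRidge places a Gaussian prior on the weights,
        # which acts as L2 regularization.
        bayes = BayesianRidge().fit(X_tr, y_tr)
        # Round continuous predictions to the nearest valid score.
        return linear.predict(X_te).round(), bayes.predict(X_te).round(), y_te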

Recurrent Neural Network for Scoring Essays

Another approach we explored was scoring essays with recurrent neural networks (RNNs). We tokenized each essay into words and punctuation and converted the tokens into word vectors using pre-trained GloVe (global vectors) embeddings. The RNN models we built employed either LSTM or GRU layers: the input passes through a linear layer, followed by a two-layer bi-directional LSTM or GRU. We take the mean of the hidden states, feed it through a batch normalization layer, and obtain the essay score from a final linear layer.

Network Structure of the RNN

Concretely, our RNN models consist of a linear layer that takes input of shape (batch size × sequence length × embedding dimension). Its output is fed into a two-layer bi-directional LSTM or GRU. The mean of the hidden states over all time steps is passed through a batch normalization layer, and a final linear layer produces the vector of scores.
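
The text does not name a framework, so as an assumption, here is how the described structure could look in PyTorch; the layer sizes are illustrative.

    import torch.nn as nn

    class EssayScorer(nn.Module):
        def __init__(self, embed_dim=300, hidden=128, rnn_type="lstm"):
            super().__init__()
            self.proj = nn.Linear(embed_dim, hidden)       # input linear layer
            rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
            self.rnn = rnn_cls(hidden, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
            self.norm = nn.BatchNorm1d(2 * hidden)         # over mean hidden states
            self.out = nn.Linear(2 * hidden, 1)            # final linear layer

        def forward(self, x):                  # x: (batch, seq_len, embed_dim)
            h, _ = self.rnn(self.proj(x))      # h: (batch, seq_len, 2 * hidden)
            pooled = h.mean(dim=1)             # mean of hidden states over time
            return self.out(self.norm(pooled)).squeeze(-1)  # (batch,) scores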

Training Process

During training, we used mean squared error (MSE) as the loss function and the Adam optimizer. We computed the validation loss at the end of each epoch to monitor the model's performance. Although we did not implement early stopping, we saved the model with the best validation performance for further evaluation.
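
A hypothetical training loop matching this description: MSE loss, Adam, a validation pass at the end of each epoch, and checkpointing the best model in lieu of early stopping.

    import copy
    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, epochs=30, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        best_val, best_state = float("inf"), None
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            model.eval()                       # validation pass per epoch
            with torch.no_grad():
                val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
            if val < best_val:                 # keep the best checkpoint
                best_val, best_state = val, copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)
        return model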

Results and Performance

The performance of our models was evaluated using quadratic weighted kappa (QWK), a measure of inter-rater agreement between raters who assign numeric scores; a higher QWK indicates better agreement. Overall, the RNN models outperformed the feature extraction models, with the PR subset performing better than the others. During evaluation we also found that GRU converges faster than LSTM, while LSTM performs better on some data subsets.
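
QWK is available off the shelf, for example as scikit-learn's Cohen's kappa with quadratic weights; predictions must first be rounded to integer score labels.

    from sklearn.metrics import cohen_kappa_score

    def qwk(human_scores, predicted_scores):
        # Quadratic weights penalize large disagreements more heavily.
        return cohen_kappa_score(human_scores, predicted_scores,
                                 weights="quadratic")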

Discoveries and Insights

Through our project, we made several discoveries about automated essay scoring. GRU models generally converge faster, but LSTM models perform better on some data subsets. Consistent with the regression results above, Bayesian ridge regression performs better on subsets with Gaussian-like score distributions, while plain linear regression works well elsewhere. We also found that processing each word of the essay recurrently provides a better evaluation of overall quality.

Discussion

After evaluating the results and analyzing our findings, we engaged in discussions about the strengths and limitations of our project. We learned about different data processing operations, tokenization techniques, and language models for numerical prediction. However, there are still areas for improvement. For instance, our feature extraction approach needs further exploration and fine-tuning. We also need to incorporate more text-specific features and evaluate the essay organization and logical flow. Furthermore, considering other state-of-the-art networks like DenseNet could enhance the overall performance of our automated scoring system.

Drawbacks and Future Improvements

During the project, we encountered a few drawbacks and identified areas for improvement. We found it challenging to extract argument-based features across varying styles and topics, and to build spelling and grammar checks that generalize to different semantics. To improve our system, we suggest refining the feature extraction process, exploring more abstract features, increasing the dimensions and number of layers of our model architecture, and experimenting with other state-of-the-art networks such as DenseNet.

Highlights:

  • Introduction to an automated scoring system for essays
  • Incorporation of literature works and concepts from lectures
  • Feature extraction and regression approach for essay scoring
  • Grammar and spelling error detection using language-check package
  • Readability features based on Flesch-Kincaid score and reading ease
  • Part of speech features using NLTK tagging
  • Argument-related features based on opinion lexicon and topic sentences
  • Regression models (linear and Bayesian ridge) for essay scores
  • Recurrent neural network (RNN) for scoring essays
  • Network structure and training process of RNN models
  • Evaluation of models based on quadratic weighted Kappa scores
  • Discoveries, insights, and discussions on model performance
  • Drawbacks and future improvements for the automated scoring system

FAQ

Q: How accurate are the essay scores generated by the automated scoring system?
A: The accuracy of the scores depends on the specific subset and the model used. Overall, our RNN models outperformed the feature extraction approach and achieved good agreement with human raters, as indicated by high quadratic weighted kappa (QWK) scores.

Q: Can the system detect and evaluate grammar and spelling errors in essays?
A: Yes, the system detects grammar errors using the language-check package. Spelling errors are detected using customized dictionaries that include words from high-scoring essays. These error detection features contribute to the holistic evaluation of essays.

Q: Are there any limitations to the system's performance?
A: The system has some limitations, such as the difficulty of extracting meaningful arguments and the need for grammar and spelling checks that adapt to different semantics. The computational cost of certain models also needs to be considered when working with larger datasets.

Q: How can the system be improved in the future?
A: Future improvements could include refining the feature extraction process, exploring more abstract features, and incorporating other state-of-the-art network architectures. Expanding the dimensions and layers of the models and training on larger datasets could also enhance performance.

Q: Is the system suitable for scoring essays on any topic?
A: The system can be customized and fine-tuned to score essays on a wide range of topics. However, further development of topic-specific features is needed to ensure accurate scoring across diverse essay subjects.
