From Engineering to Data Science: Mahmoud's Journey to Solving an NLP Problem

Table of Contents

  1. Introduction
  2. Background and Education
  3. Journey in Data Science
  4. Fellowship AI Experience
  5. NLP Problem Solution
    • Data Preprocessing
    • Exploratory Data Analysis
    • Cleaning the Texts
    • Building a Cleaning Function
    • Visualizing the Cleaned Reviews
    • Shuffle and Split the Data
    • Building Baseline Models
    • Using Stemming
    • Using TF-IDF Vectorizer
    • Implementing Deep Learning
    • Using Pre-Trained Embeddings
    • Training LSTM Networks
    • Evaluating Different Models
  6. Conclusion
  7. Highlights
  8. Frequently Asked Questions (FAQs)

Introduction

This article follows Mahmoud, a junior data scientist, through his path into data science and his solution to an NLP problem. It covers his background and education, his participation in the Fellowship AI program, and his approach to the problem: the steps he took, the models he built, and the results he obtained.

Background and Education

Before embarking on his career in data science, Mahmoud pursued a degree in Engineering in Egypt. However, he seized a unique opportunity to start a new life in Cherokee. Mahmoud joined Public Administration as his new Bachelor's major, although it was unrelated to data science. His passion for data science eventually led him to pursue exchange programs in Uncle University and Stuttgart University, where he focused on data science courses.

Journey in Data Science

During his time in Germany, Mahmoud immersed himself in the study of data science. He learned Python, honed his programming skills in German, and gained knowledge of JavaScript, HTML, CSS, SQL, and Tableau. He also went beyond traditional machine learning techniques and explored deep learning with TensorFlow. With these skills, he proudly identifies himself as a junior data scientist.

Fellowship AI Experience

Seeking to enhance his knowledge and experience, Mahmoud applied for the Fellowship AI program. He recognized the valuable opportunity it presented, providing access to senior data scientist mentors and real-life tasks. Mahmoud eagerly awaits the chance to gain additional expertise and secure a better job in the near future.

NLP Problem Solution

Mahmoud now proceeds to walk us through his solution to the NLP problem. The first step involves importing necessary libraries such as Pandas, NumPy, Beautiful Soup, and NLTK, followed by exploring the data. He discovered the presence of HTML tags, punctuation marks, and digits that needed to be removed. Additionally, he addressed the issue of duplicated values in the dataset.
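The exploration step described above can be sketched roughly as follows. This is a minimal illustration, not Mahmoud's actual notebook: the toy rows and the column names `review` and `sentiment` are assumptions made for the example.

```python
import pandas as pd

# Toy data standing in for the review dataset.
# The column names "review" and "sentiment" are hypothetical.
reviews = pd.DataFrame({
    "review": [
        "Great movie!<br />Loved it 10/10",
        "Terrible plot, 2 hours wasted.",
        "Great movie!<br />Loved it 10/10",  # exact duplicate of the first row
    ],
    "sentiment": ["positive", "negative", "positive"],
})

# Count rows whose text contains something that looks like an HTML tag.
html_rows = int(reviews["review"].str.contains(r"<[^>]+>", regex=True).sum())

# Count rows that contain at least one digit.
digit_rows = int(reviews["review"].str.contains(r"\d", regex=True).sum())

# Drop exact duplicate rows and renumber the index.
deduplicated = reviews.drop_duplicates().reset_index(drop=True)
```

Counting affected rows before cleaning gives a quick sense of how much of the dataset each cleaning step will touch.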

Data Preprocessing

To prepare the text data for analysis, Mahmoud visualized a sample of the reviews and then proceeded to clean them by removing punctuation and digits. He developed a function for text cleaning, which he tested and applied to the dataset. Mahmoud continued to visualize the reviews after cleaning and performed shuffling of the data.
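A cleaning function and a shuffle-and-split step along these lines might look like the sketch below. It is an assumption-laden illustration: the article mentions Beautiful Soup for HTML handling, whereas this version uses a simple regular expression to stay dependency-free, and the names `clean_review` and `shuffle_and_split`, the 20% test fraction, and the seed are all hypothetical.

```python
import random
import re
import string

# Matches anything that looks like an HTML tag, e.g. "<br />".
# (A regex stand-in for the Beautiful Soup step mentioned in the article.)
TAG_PATTERN = re.compile(r"<[^>]+>")

# Translation table mapping every ASCII punctuation mark and digit to a space.
STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def clean_review(text: str) -> str:
    """Remove HTML tags, punctuation, and digits, then collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    without_marks = without_tags.translate(STRIP_TABLE)
    return " ".join(without_marks.split())


def shuffle_and_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


cleaned = clean_review("Great movie!<br />Loved it 10/10")
train_rows, test_rows = shuffle_and_split(range(10))
```

Testing the function on a single review first, as the article describes, makes it easy to spot over-aggressive cleaning before applying it to the whole dataset.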

Building Baseline Models

Mahmoud began with a baseline model that combined scikit-learn's CountVectorizer with logistic regression. This initial model achieved 65% accuracy. He then introduced a stemmer class, which gave a reported accuracy of 64%, so on these figures stemming did not improve on the baseline.
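A baseline of this kind can be sketched with a scikit-learn pipeline. The toy texts and labels below are invented for illustration and will not reproduce the accuracy figures reported in the article; `max_iter=1000` is an assumed setting, not one taken from the source.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented dataset: 1 = positive review, 0 = negative review.
train_texts = [
    "loved the acting and the story",
    "a wonderful and moving film",
    "boring plot and terrible acting",
    "a dull and lifeless film",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a wonderful story", "terrible and dull"]
test_labels = [1, 0]

# Bag-of-words counts feed directly into a logistic regression classifier.
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(test_texts)
accuracy = baseline.score(test_texts, test_labels)  # fraction of correct predictions
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned only from the training texts, which keeps the test score honest.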

Using TF-IDF Vectorizer and Deep Learning

Next, Mahmoud switched to TF-IDF vectorization and achieved an accuracy of 86%. He then tried other classifiers, including random forests and decision trees, both of which overfit the training data.
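One common way to spot the overfitting mentioned here is to compare training accuracy with test accuracy for each classifier. The sketch below does that with TF-IDF features on invented toy data; the helper `train_and_test_accuracy`, the data, and the hyperparameters are assumptions for illustration and are not drawn from Mahmoud's notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: 1 = positive review, 0 = negative review.
train_texts = [
    "loved the acting and the story",
    "a wonderful and moving film",
    "brilliant performances all around",
    "boring plot and terrible acting",
    "a dull and lifeless film",
    "awful pacing and weak dialogue",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["a wonderful story", "terrible and dull"]
test_labels = [1, 0]


def train_and_test_accuracy(classifier):
    """Fit a TF-IDF pipeline and return (train accuracy, test accuracy)."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    return (
        model.score(train_texts, train_labels),
        model.score(test_texts, test_labels),
    )


results = {
    "logistic_regression": train_and_test_accuracy(LogisticRegression(max_iter=1000)),
    "decision_tree": train_and_test_accuracy(DecisionTreeClassifier(random_state=0)),
}

# A large positive gap (train accuracy far above test accuracy) suggests overfitting.
gaps = {name: train_acc - test_acc for name, (train_acc, test_acc) in results.items()}
```

On a real dataset, a tree-based model that scores near 100% on the training set but much lower on held-out reviews would show exactly the pattern the article describes.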

Mahmoud's curiosity about deep learning prompted him to try pre-trained embeddings, specifically the Universal Sentence Encoder from TensorFlow, which reached 60.2% accuracy. He continued his exploration by training LSTM networks, but the results did not meet his expectations.

Evaluating Different Models

After comparing the performance of various models, Mahmoud concluded that his best model was the conventional network, which incorporated sequential data. This model achieved a 68% accuracy rate. Mahmoud also visualized the data to gain further insights into the model's predictions, identifying both strong positive and negative sentiments.
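Comparing models comes down to scoring each one's predictions against the same held-out labels. The sketch below shows one way to compute accuracy and a breakdown of correct and incorrect predictions using only the standard library; the function names and the example labels are hypothetical and are not results from the article.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Invented labels: 1 = positive, 0 = negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

score = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
```

The pair counts show whether a model's errors lean toward false positives or false negatives, which a single accuracy number hides.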

Conclusion

In conclusion, Mahmoud's journey in data science showcases his drive and dedication to the field. His diverse educational background and experiences have shaped him into a junior data scientist with a wide range of skills. Through his participation in the Fellowship AI program, Mahmoud hopes to acquire more expertise and secure a rewarding job in the field of data science.

Highlights

  • Mahmoud's transition from studying Engineering to pursuing a career in data science
  • His educational experiences in Egyptian and German universities
  • The skills Mahmoud developed in Python, JavaScript, HTML, CSS, SQL, and Tableau
  • Mahmoud's application to the Fellowship AI program and his aspirations for professional growth
  • The step-by-step walkthrough of Mahmoud's solution to an NLP problem
  • Analysis of various models, including baseline models, TF-IDF vectorization, and deep learning techniques
  • The evaluation and comparison of different models to determine the best solution

Frequently Asked Questions (FAQs)

Q: What was Mahmoud's educational background before entering the field of data science? A: Mahmoud studied Engineering in Egypt and later pursued a degree in Public Administration.

Q: What motivated Mahmoud to pursue a career in data science? A: Mahmoud discovered his passion for data science during an exchange program in Germany, where he took courses related to the field and developed an interest in non-traditional machine learning techniques.

Q: What was the purpose of Mahmoud's application to the Fellowship AI program? A: Mahmoud applied to the Fellowship AI program to gain more hands-on experience, receive mentorship from senior data scientists, and enhance his chances of securing a better job in the future.

Q: What models did Mahmoud use to solve the NLP problem? A: Mahmoud utilized various models, including a baseline model with CountVectorizer and logistic regression, TF-IDF vectorization, pre-trained embeddings using the Universal Sentence Encoder, and LSTM networks.

Q: What was Mahmoud's best-performing model for the NLP problem? A: Mahmoud's best-performing model was the conventional network, which achieved a 68% accuracy rate in predicting sentiments in the text data.

Q: How does Mahmoud plan to improve upon his solution to the NLP problem? A: Mahmoud suggests improving the dataset's labeling system, incorporating three classes (positive, negative, and neutral) instead of just positive and negative, to enhance the accuracy and to better classify the sentiments expressed in the text.
