Automating Data Science Tasks with Large Language Models

Table of Contents

  1. Introduction
  2. The Need for Automated Data Science
  3. Natural Language Processing and Large Language Models
  4. Evaluating the Performance of Large Language Models
  5. Previous Work on Automating Data Science
  6. Goals of the Research Project
  7. Creating a Benchmark for the Model
  8. The Action Plan Generator
  9. Example of the Model in Action
  10. Results and Key Takeaways
  11. Future Plans and Improvements
  12. Acknowledgements

Introduction

In this article, we explore the ability of large language models (LLMs) to answer zero-shot natural language queries in the field of data science. Data scientists often struggle to execute tasks efficiently because much of their work is time-consuming and repetitive. Our goal is to automate data science tasks using LLMs, a branch of natural language processing (NLP) technology. We discuss the background and significance of this research project, the methodology used, and the results obtained, along with previous work on automating data science, the goals of our research project, and the creation of a benchmark for evaluating LLM performance. Finally, we highlight the key takeaways from this project and discuss future plans for further improvement.

The Need for Automated Data Science

Data scientists play a crucial role in extracting meaningful insights from datasets using tools like numpy and pandas. However, the speed at which tasks can be executed is a significant bottleneck, especially in a rapidly evolving industry, and it limits the applicability of data science in many research fields. By automating data science tasks, we aim to make them more practical and effective across different domains.

Natural Language Processing and Large Language Models

Natural language processing (NLP) technology has gained immense popularity in the world of artificial intelligence. NLP enables the development of chatbots and other AI models that can understand and generate human-like text. Our research focuses on utilizing large language models (LLMs), a subcategory of NLP, to automate data science tasks. We aim to evaluate the capabilities of LLMs in performing such tasks and, accordingly, improve the models using different variants of the GPT (Generative Pre-trained Transformer) architecture.

Evaluating the Performance of Large Language Models

Before delving into the evaluation of LLMs, it was crucial to create a benchmark dataset. This dataset consists of toy or mini datasets, each accompanied by a set of questions typical of data science tasks. The datasets range in size from small (less than 100 rows) to medium (100-200 rows) and large (over 200 rows). The questions vary in difficulty, providing a comprehensive test for the LLMs. We manually compared the answers generated by the model with the expected answers obtained using numpy and pandas.
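To make the comparison step concrete, here is a minimal sketch of how a model-generated answer can be checked against a reference value computed with pandas. The article describes this comparison as manual; the dataset, column name, question, and tolerance below are illustrative assumptions, not taken from the actual benchmark.

import pandas as pd

# Hypothetical toy dataset standing in for one of the benchmark files.
df = pd.DataFrame({"temperature": [21.5, 18.0, 25.3, 19.8, 23.1]})

# Expected answer computed directly with pandas, as was done when building the benchmark.
expected = df["temperature"].mean()

# Hypothetical answer string returned by the LLM for "What is the mean temperature?"
model_answer = "21.54"

# The comparison itself: parse the model's output and accept it if it matches
# the reference value within a small tolerance.
is_correct = abs(float(model_answer) - expected) < 1e-2
print(expected, model_answer, is_correct)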

Previous Work on Automating Data Science

Automated Machine Learning (AutoML) is an area of research that also focuses on automating data science tasks, although not specifically through natural language processing. AutoML has gained popularity in automating tedious tasks like data pre-processing. Our research builds upon and extends this previous work by utilizing LLMs and natural language prompts to automate data science tasks.

Goals of the Research Project

Our research project has two primary goals. First, we aim to investigate whether LLMs can function as data scientists by employing a GPT-3.5-based action plan generator and executor. In this setup, the LLM processes natural language questions commonly asked in data science and responds with code-based answers. Second, we seek to create a benchmark dataset comprising toy datasets and corresponding questions. By evaluating the LLMs' performance on this benchmark, we can assess whether they are suitable for answering data science questions and explore potential improvements using different GPT variants.

Creating a Benchmark for the Model

To create a benchmark dataset, we used GPT-3.5 to generate 50 toy datasets, each with its own set of questions. For instance, one toy dataset contained 100 rows and six columns representing weather data in different cities. The datasets ranged in size, and the questions varied in difficulty. We manually determined the expected answers to the questions using numpy and pandas, providing a reference for evaluating the model's performance.
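For illustration, the sketch below builds a stand-in for such a toy weather dataset and computes reference answers for two typical questions with pandas and numpy. The column names, city list, and questions are hypothetical; the actual GPT-3.5-generated datasets are not reproduced here.

import numpy as np
import pandas as pd

# A hypothetical stand-in for one of the GPT-generated toy datasets:
# 100 rows of weather observations across several cities (column names are illustrative).
rng = np.random.default_rng(0)
cities = ["Boston", "Austin", "Seattle", "Denver", "Miami"]
weather = pd.DataFrame({
    "city": rng.choice(cities, size=100),
    "date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "temperature_c": rng.normal(15, 8, size=100).round(1),
    "humidity_pct": rng.integers(20, 100, size=100),
    "wind_kph": rng.gamma(2.0, 6.0, size=100).round(1),
    "precip_mm": rng.exponential(2.0, size=100).round(1),
})

# Reference answers for a few typical benchmark questions, computed with pandas/numpy.
expected_answers = {
    "Which city has the highest average temperature?":
        weather.groupby("city")["temperature_c"].mean().idxmax(),
    "What is the 75th percentile of wind speed?":
        np.percentile(weather["wind_kph"], 75),
}
print(expected_answers)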

The Action Plan Generator

To process queries from the benchmark dataset, we employed the action plan generator. This mechanism converts the queries into a series of smaller steps, utilizing a technique called Chain of Thought prompting. The action plan generator leverages the GPT model's ability to understand the query and generate code in numpy and pandas to execute the steps. By breaking down the queries into smaller subtasks, we achieved higher accuracies and accounted for errors in query language.
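Below is a minimal sketch of how such an action plan generator and executor could be wired together, assuming access to the OpenAI chat completions API. The prompt wording, the CODE: marker convention, and the helper function are illustrative assumptions rather than the authors' actual implementation.

import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative chain-of-thought style prompt: ask for a step-by-step plan first, then code.
PLAN_PROMPT = """You are a data science assistant.
Dataset columns: {columns}
Question: {question}
First, write a short numbered plan of the steps needed to answer the question.
Then write Python code using pandas/numpy that follows the plan.
The DataFrame is available as `df`; store the final answer in a variable named `result`.
Separate the plan from the code with a line containing only CODE:"""

def answer_question(df: pd.DataFrame, question: str):
    prompt = PLAN_PROMPT.format(columns=list(df.columns), question=question)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.choices[0].message.content

    # Split the natural language plan from the generated code and execute the code.
    plan, _, code = reply.partition("CODE:")
    namespace = {"df": df}
    exec(code, namespace)  # executor step: run the generated pandas/numpy code
    return plan.strip(), namespace.get("result")

Asking the model for the plan before the code is what forces the query to be broken into smaller subtasks before any implementation is committed to, which is the behavior the chain of thought prompting is meant to encourage.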

Example of the Model in Action

One example of our model in action is the query "What is the 26th percentile of the date founded column?" on a small dataset called "zoos". The LLM, together with the action plan generator, produces a natural language plan and then writes Python code using pandas and numpy to solve the query. The code is executed in the LLM's programming environment, and the final output is obtained.
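The code the model produces for this query might resemble the snippet below. The column values and the exact numpy call are assumptions for illustration; the article does not reproduce the model's actual output.

import numpy as np
import pandas as pd

# Hypothetical slice of the "zoos" toy dataset (the real benchmark file is not shown here).
zoos = pd.DataFrame({"date founded": [1875, 1902, 1923, 1956, 1968, 1981, 1994, 2003]})

# One plausible way the generated code could compute the 26th percentile of the column.
result = np.percentile(zoos["date founded"], 26)
print(result)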

Results and Key Takeaways

Our evaluation yielded promising results. For small, medium, and large datasets, our LLM achieved accuracies of 88.82%, 87.5%, and 89.41%, respectively. The main errors observed were logical errors in which the LLM failed to grasp the question's context correctly and instances where only one output was provided instead of the expected two outputs. Additionally, formatting errors were identified when outputting decimals and percentages.

Key takeaways from this project include the ability of LLMs to automate low-level data analysis tasks, the effectiveness of GPT-3.5 in generating code for numpy and pandas operations, and the importance of chain of thought prompting in improving accuracy. The results also point to the need for further refinement: larger toy datasets, improved language models such as GPT-4, and support for generating data visualizations.

Future Plans and Improvements

Moving forward, we plan to refine our action plan generator by incorporating techniques such as reflection and by allowing the action plan to change mid-execution when an incorrect answer is obtained. We also aim to use larger toy datasets for evaluation and to assess the model's ability to generate data visualizations. Additionally, we plan to explore more advanced language models, such as GPT-4, to further enhance performance on automated data science tasks.

Acknowledgements

We would like to express our gratitude to all the mentors and researchers who provided valuable guidance and support throughout this research project. Their contributions were instrumental in the success of this endeavor.
