Sentiment Analysis of Amazon Reviews with Python


Project Overview

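This project shows how to run sentiment analysis on Amazon reviews using natural language processing. It first walks through a traditional approach with NLTK, Python's Natural Language Toolkit, then implements a RoBERTa model from Hugging Face and examines how the different models perform. Finally, it uses the Hugging Face pipeline to run sentiment analysis in a few lines of code.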

What Is Natural Language Processing?

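Natural language processing (NLP) is a set of techniques for converting natural language data, such as text and speech, into a form that computers can work with. NLP is applied to many tasks, including text classification, information extraction, machine translation, and sentiment analysis.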

Overview of Sentiment Analysis

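Sentiment analysis is a branch of natural language processing that extracts emotion and opinion from text. It is typically used to label text as positive, negative, or neutral. This makes it possible to summarize large volumes of text and understand people's attitudes and feelings.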
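
To make the positive/negative/neutral labeling concrete, here is a minimal sketch of a word-counting classifier. The word lists and the `toy_sentiment` function are hypothetical and far simpler than the models used later in this article; they only illustrate the idea.

```python
import re

# Hypothetical mini-lexicons, invented for this illustration only.
POSITIVE_WORDS = {"good", "great", "amazing", "love", "delicious"}
NEGATIVE_WORDS = {"bad", "awful", "terrible", "hate", "stale"}


def toy_sentiment(text: str) -> str:
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


print(toy_sentiment("I love this, it is amazing"))
print(toy_sentiment("The chips were stale and awful"))
print(toy_sentiment("It arrived on Tuesday"))
```

A counter like this ignores negation, intensity, and context, which is the gap that lexicon tools such as VADER and transformer models try to close.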

Python and Natural Language Processing

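Python is a widely used programming language for natural language processing. Toolkits such as NLTK (Natural Language Toolkit) and libraries such as those from Hugging Face make common NLP tasks straightforward to run. Python's flexibility and rich library ecosystem make it an efficient choice for this kind of work.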

Setup

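First, import the required libraries. The following libraries are used:

pandas: loading and processing data
numpy: numerical computation
matplotlib: plotting and visualization
seaborn: statistical data visualization
nltk: natural language processing toolkit
transformers: Hugging Face models and pipelines
scipy: the softmax function used to turn model outputs into probabilities
tqdm: progress bars
torch: machine learning framework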

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
from tqdm import tqdm
import torch

Getting the Data

First, load the Amazon food review dataset. It contains the text of each review and a rating on a scale of 1 to 5.

df = pd.read_csv('reviews.csv')

Let's check the shape of the dataset.

print(df.shape)

The shape is (500000, 10), so the dataset contains 500,000 reviews.

Cleaning the Data

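Next, put the dataset into the form we need. First, take a random sample and work with only that portion of the data to speed up processing. Then drop any rows with missing values in the text or score columns, the only two columns the analysis uses.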

df = df.sample(frac=0.1, random_state=42).reset_index(drop=True)
df = df.dropna(subset=['text', 'score']).reset_index(drop=True)

Next, pull the text and score columns out of the dataset.

text = df['text']
score = df['score']

The dataset is now ready.
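
The sampling and cleaning steps above can be tried on a small stand-in before running them on the full file. This is a minimal sketch: the `toy` DataFrame and its values are made up, and only the `text` and `score` column names come from the article.

```python
import pandas as pd

# Made-up stand-in for reviews.csv with the article's 'text' and 'score' columns.
toy = pd.DataFrame({
    "text": ["Great coffee", None, "Too salty", "Okay snack", "Loved it"],
    "score": [5, 3, 1, None, 4],
})

# Take a reproducible 80% sample, then drop rows missing either field.
sampled = toy.sample(frac=0.8, random_state=42).reset_index(drop=True)
cleaned = sampled.dropna(subset=["text", "score"]).reset_index(drop=True)

print(len(toy), len(sampled), len(cleaned))
```

Fixing `random_state` keeps the sample identical between runs, which matters when the scores from different models are compared later.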

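Bag-of-Words Approach

We start with the traditional approach: sentiment analysis with a bag-of-words model. NLTK's SentimentIntensityAnalyzer (VADER) tokenizes the text, looks up a sentiment score for each word, and combines the word scores into overall scores for the text.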

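from nltk.sentiment import SentimentIntensityAnalyzer

# VADER needs its lexicon downloaded once before first use
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Score every review; each result is a dict with 'neg', 'neu', 'pos', and 'compound'
polarity_scores = []
for t in tqdm(text):
    polarity_scores.append(sia.polarity_scores(t))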

Add the computed scores to the dataset.

df['vader_score_negative'] = [score['neg'] for score in polarity_scores]
df['vader_score_neutral'] = [score['neu'] for score in polarity_scores]
df['vader_score_positive'] = [score['pos'] for score in polarity_scores]
df['vader_score_compound'] = [score['compound'] for score in polarity_scores]

This completes the sentiment analysis with the VADER model.
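
The compound score is a single number between -1 and 1, so it is often convenient to turn it into a label. The sketch below uses the ±0.05 cutoffs commonly recommended for VADER; the `label_from_compound` helper is a hypothetical name, not part of NLTK.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores for illustration.
for value in (0.62, -0.41, 0.01):
    print(value, label_from_compound(value))
```

With the DataFrame from this article, the helper could be applied as `df['vader_score_compound'].apply(label_from_compound)`.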

Hugging Face's RoBERTa Model

Next, we run sentiment analysis with a more advanced model: RoBERTa from Hugging Face. First, load the RoBERTa tokenizer and model.

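# Assumption: a checkpoint fine-tuned for 3-class sentiment; plain 'roberta-base' has no trained classification head
MODEL = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)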

Next, encode each review in the dataset and run it through the RoBERTa model.

outputs = []
for t in tqdm(text):
    # Tokenize one review as a batch of size 1, truncated to the 512-token limit
    encoded = tokenizer(t, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        # Pass input_ids and attention_mask as keyword arguments
        output = model(**encoded)
    outputs.append(output.logits.numpy())

Add the results to the dataset.

df['roberta_score_negative'] = [softmax(score[0])[0] for score in outputs]
df['roberta_score_neutral'] = [softmax(score[0])[1] for score in outputs]
df['roberta_score_positive'] = [softmax(score[0])[2] for score in outputs]

This completes the sentiment analysis with the RoBERTa model.
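
The softmax step above converts the model's raw logits into probabilities that sum to 1. The following is a small standard-library sketch of that conversion, not the scipy implementation itself; the logit values are made up, and the negative/neutral/positive label order is an assumption that matches the column names used in this article.

```python
import math


def softmax_probs(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["negative", "neutral", "positive"]  # assumed label order

logits = [-1.2, 0.3, 2.5]  # made-up logits for one review
probs = softmax_probs(logits)
best = LABELS[probs.index(max(probs))]

print({label: round(p, 3) for label, p in zip(LABELS, probs)}, best)
```

Because softmax preserves the ordering of the logits, the label with the largest logit is always the label with the highest probability.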

Using the Pipeline

Finally, we run sentiment analysis with the Hugging Face pipeline. Creating a pipeline is all it takes to get predictions.

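from transformers import pipeline

nlp = pipeline('sentiment-analysis')

results = []
for t in text:
    # truncation=True keeps long reviews within the model's input limit
    result = nlp(t, truncation=True)[0]
    # 'score' is the confidence of the predicted label, so sign it by that label
    # (assumes the default model's POSITIVE/NEGATIVE labels)
    sign = 1 if result['label'] == 'POSITIVE' else -1
    results.append(sign * result['score'])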

Add the results to the dataset.

df['pipeline_score'] = results

This completes the sentiment analysis with the pipeline.

Comparing the Results

Now let's compare the sentiment scores from the three models against the star ratings.

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(x=score, y=df['vader_score_compound'], ax=axes[0])
axes[0].set_title('Vader Model')
sns.barplot(x=score, y=df['roberta_score_positive'], ax=axes[1])
axes[1].set_title('Roberta Model')
sns.barplot(x=score, y=results, ax=axes[2])
axes[2].set_title('Pipeline Model')

plt.tight_layout()
plt.show()

These plots show, for each star rating, the average score each of the three models assigns, which makes the models easy to compare.
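
The same comparison can be read off as numbers by averaging a model's score within each star rating. The sketch below uses only the standard library; `mean_by_rating` is a hypothetical helper, and the ratings and scores are made up rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_by_rating(ratings, scores):
    """Average the sentiment scores within each star rating."""
    buckets = defaultdict(list)
    for rating, value in zip(ratings, scores):
        buckets[rating].append(value)
    return {rating: mean(values) for rating, values in sorted(buckets.items())}


# Made-up star ratings and model scores for illustration.
ratings = [1, 1, 3, 5, 5]
scores = [-0.8, -0.4, 0.1, 0.7, 0.9]

by_rating = mean_by_rating(ratings, scores)
for rating, avg in by_rating.items():
    print(rating, round(avg, 2))
```

A model that tracks the ratings well should show averages that rise steadily from 1 star to 5 stars.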

Final Test

To finish, run a single sentence through all three models and compare the results.

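test_text = "This product is amazing!"
vader_result = sia.polarity_scores(test_text)
# Run the RoBERTa model directly and convert its logits to class probabilities
encoded = tokenizer(test_text, return_tensors='pt')
with torch.no_grad():
    roberta_result = softmax(model(**encoded).logits[0].numpy())
pipeline_result = nlp(test_text)[0]['score']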

print("Vader Model:", vader_result)
print("Roberta Model:", roberta_result)
print("Pipeline Model:", pipeline_result)

Check the output to see how each model scores the sentence.