Extract Texts and Contact Information from Web Pages with Open AI ChatGPT


Introduction
Retrieving Text from Web Pages
- Installing Beautiful Soup 4
- Reading the Current Working Directory
- Making an HTTP Request
- Parsing the Response with Beautiful Soup
- Getting the Text from the Response
Retrieving Contact Information
- Using Regular Expressions for Email IDs
- Using Regular Expressions for Phone Numbers
- Retrieving URLs
Conclusion

Introduction

In this tutorial, we will learn how to read text from web pages and extract contact information such as phone numbers, email IDs, and URLs. We will use the Beautiful Soup library to parse HTML content and regular expressions to extract specific patterns from the text.

Retrieving Text from Web Pages

Installing Beautiful Soup 4

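Before we begin, we need to install the Beautiful Soup 4 library, along with the requests library that is used later to fetch pages. From a terminal, run pip install beautifulsoup4 requests. In a Jupyter notebook, run %pip install beautifulsoup4 requests instead, so that the packages are installed into the environment the notebook is using.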

Reading the Current Working Directory

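To start, we import the os module and list the contents of the current working directory with the os.listdir() function. Called without arguments, it returns the names of the files and subdirectories in that directory, not the path of the directory itself.

import os

# Names of the files and subdirectories in the current working directory.
directory_contents = os.listdir()
print(directory_contents)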
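
If the path of the working directory is also needed, os.getcwd() returns it. The following is a minimal sketch using only the standard library; the variable names cwd and entries are chosen here for illustration.

```python
import os

# Absolute path of the directory the script or notebook is running in.
cwd = os.getcwd()

# Names of the entries (files and subdirectories) inside that directory.
entries = sorted(os.listdir(cwd))

print("Working directory:", cwd)
for name in entries:
    # os.listdir() returns bare names, so join them with the path before testing.
    kind = "dir" if os.path.isdir(os.path.join(cwd, name)) else "file"
    print(f"  [{kind}] {name}")
```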

Making an HTTP Request

Next, we make an HTTP request to the URL of the web page we want to scrape, using the requests library. Note that requests only downloads the HTML that the server returns and does not execute JavaScript, so content that a website renders in the browser with JavaScript may be missing from the response.

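import requests

url = "https://example.com"
# A timeout keeps the call from waiting indefinitely on an unresponsive server.
response = requests.get(url, timeout=10)
# Raise an exception for 4xx/5xx responses instead of parsing an error page.
response.raise_for_status()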
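
In practice a request can fail because of a malformed URL, a network problem, or an error status from the server. The sketch below wraps the request in a small helper that returns None on failure. The function name fetch_html, the 10-second timeout, and the User-Agent string are assumptions made for this example, not part of the original code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    # Hypothetical identifier for this script; requests does not require it.
    headers = {"User-Agent": "contact-extractor-tutorial/0.1"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Treat 4xx/5xx responses as failures.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for errors raised by requests.
        print(f"Request to {url!r} failed: {exc}")
        return None
    return response.text
```

Calling fetch_html("https://example.com") returns the HTML as a string when the request succeeds, which can then be passed to Beautiful Soup.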

Parsing the Response with Beautiful Soup

Now that we have the response from the HTTP request, we can use Beautiful Soup to parse the HTML content. We pass the response body (response.content) to the BeautifulSoup constructor along with the name of the parser to use, here Python's built-in html.parser.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")

Getting the Text from the Response

To retrieve the text from the web page, we use the get_text() method provided by Beautiful Soup. It returns the text of the document as a single string with the HTML tags removed.

text = soup.get_text()
print(text)
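
Depending on the Beautiful Soup version, the output of get_text() may include the contents of script and style elements, and text from adjacent tags can run together. The sketch below shows one way to get cleaner text: it removes those elements first and then uses the separator and strip arguments of get_text(). The HTML string is a hypothetical stand-in for response.content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.content.
html = """
<html>
  <head><title>Contact us</title><style>p { color: red; }</style></head>
  <body>
    <h1>Acme Support</h1>
    <script>var tracking = true;</script>
    <p>Email <b>support@example.com</b> for help.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove elements whose contents are code or styling rather than visible text.
for tag in soup(["script", "style"]):
    tag.decompose()

# separator keeps text from adjacent tags apart; strip trims each fragment
# and drops fragments that contain only whitespace.
clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)
```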

Retrieving Contact Information

Using Regular Expressions for Email IDs

Email IDs usually follow a specific format: a local part, the "@" character, and then a domain name. We can use a regular expression to search for this pattern in the text.

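import re

# Sample text containing one email ID, one phone number, and one URL.
sample_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Email: example@example.com, Phone: 555-123-4567, Website: https://example.com"

# Local part, "@", domain, then a top-level domain of at least two letters.
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
emails = re.findall(email_pattern, sample_text)

print(emails)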
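
On a real page the same address often appears several times, sometimes with different capitalization. The following sketch wraps the pattern in a helper that returns each address once, in order of first appearance. The function name extract_emails is hypothetical, and lowercasing the whole address is a simplifying assumption (the local part of an address can technically be case-sensitive).

```python
import re

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_emails(text):
    """Return unique email addresses in order of first appearance, lowercased."""
    seen = set()
    result = []
    for match in EMAIL_PATTERN.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


sample = "Write to Sales@Example.com or support@example.org. Again: sales@example.com"
print(extract_emails(sample))
# → ['sales@example.com', 'support@example.org']
```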

Using Regular Expressions for Phone Numbers

Similarly, phone numbers follow recognizable patterns. The pattern below matches ten-digit numbers written as three groups of digits (3-3-4), optionally separated by hyphens or dots. It does not cover country codes with a plus sign, parentheses, or spaces, so it needs to be extended for those formats.

phone_number_pattern = r"\b[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}\b"
phone_numbers = re.findall(phone_number_pattern, sample_text)

print(phone_numbers)
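
Phone numbers on real pages come in more formats than the pattern above handles. The sketch below extends it to accept an optional country code, parentheses around the area code, and spaces as separators, and returns the numbers in one normalized form. It assumes North American style numbers with country code 1; the name extract_phone_numbers and the output format are choices made for this example.

```python
import re

PHONE_PATTERN = re.compile(
    r"""
    (?<!\d)                 # not preceded by another digit
    (?:\+?1[-.\s]?)?        # optional country code (assumed to be 1)
    \(?(\d{3})\)?[-.\s]?    # area code, optionally in parentheses
    (\d{3})[-.\s]?          # next three digits
    (\d{4})                 # last four digits
    (?!\d)                  # not followed by another digit
    """,
    re.VERBOSE,
)


def extract_phone_numbers(text):
    """Return matched numbers normalized to the form 555-123-4567."""
    # findall returns one (area, prefix, line) tuple per match.
    return [f"{area}-{prefix}-{line}" for area, prefix, line in PHONE_PATTERN.findall(text)]


sample = "Call 555-123-4567, (555) 987-6543 or +1 555.222.3333. Order no. 12345678901234."
print(extract_phone_numbers(sample))
# → ['555-123-4567', '555-987-6543', '555-222-3333']
```

The lookarounds at both ends keep the pattern from matching ten digits inside a longer digit sequence, such as the order number in the sample.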

Retrieving URLs

URLs can be retrieved in a similar way using regular expressions. However, the pattern for matching URLs can be quite complex. Here is an example:

url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
urls = re.findall(url_pattern, sample_text)

print(urls)
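
URLs found in running text often pick up trailing punctuation, such as the period that ends a sentence. The sketch below uses a simpler, deliberately permissive pattern (an assumption for this example, not the pattern above), strips common trailing punctuation, and keeps only unique matches that urlparse recognizes as having a host. Stripping a closing parenthesis is a simplification that can cut off URLs that legitimately end in one.

```python
import re
from urllib.parse import urlparse

# Permissive pattern: scheme, then everything up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return unique URLs in order of first appearance, without trailing punctuation."""
    urls = []
    for raw in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = raw.rstrip(".,;:!?)")
        # Keep only results that have a host part and were not seen before.
        if urlparse(url).netloc and url not in urls:
            urls.append(url)
    return urls


sample = (
    "Docs: https://example.com/docs. "
    "Blog (https://blog.example.com/post?id=7), and again https://example.com/docs"
)
print(extract_urls(sample))
# → ['https://example.com/docs', 'https://blog.example.com/post?id=7']
```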

Conclusion

In this tutorial, we have learned how to read text from web pages and retrieve contact information using Beautiful Soup and regular expressions. This knowledge can be applied to various web scraping tasks. Feel free to customize the code to suit your specific needs. If you have any questions or issues, don't hesitate to seek assistance. Happy coding!
