Web Scraping Using Selenium Python Tutorial

In the first tutorial, we explored how to use requests and BeautifulSoup for web scraping.

Today we are going to take a look at Selenium (with Python) in a step-by-step tutorial. This is part 1, where we’ll scrape the YouTube trending page using Selenium and send the results via email using SMTP. In part 2, we’ll look at the AWS Lambda Python function, add Layers for Selenium and Chromium, and set up a recurring job using AWS CloudWatch. You can follow along with this tutorial by reading the code on GitHub.

What is Selenium?

Selenium was initially a tool created to test a website’s behavior, but it quickly became a general web browser automation tool used in web scraping and other automation tasks.

Selenium can automate different browsers such as Chrome, Firefox, and even IE through a middleware component called Selenium WebDriver. WebDriver is essentially a protocol service that sits between the client and the browser and translates client commands into browser actions.
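
To make the middleware idea more concrete, here is a rough sketch of the kind of HTTP request a client library sends to the driver for a few commands. The routes follow the W3C WebDriver endpoints for navigating, finding elements, and reading the page title. The helper name webdriver_request and the session id are made up for illustration, and nothing is actually sent over the network.

```python
import json

# A few W3C WebDriver endpoints: command name -> (HTTP method, URL template)
ROUTES = {
    'get': ('POST', '/session/{sid}/url'),
    'find_elements': ('POST', '/session/{sid}/elements'),
    'title': ('GET', '/session/{sid}/title'),
}

def webdriver_request(session_id, command, **params):
    """Build (method, path, body) for a command. Hypothetical helper, illustration only."""
    method, template = ROUTES[command]
    # POST commands carry their parameters as a JSON body; GET commands have no body
    body = json.dumps(params) if method == 'POST' else None
    return method, template.format(sid=session_id), body

method, path, body = webdriver_request(
    'abc123', 'get', url='https://www.youtube.com/feed/trending')
print(method, path, body)

# Looking up elements by tag name uses the 'tag name' location strategy
print(webdriver_request('abc123', 'find_elements',
                        using='tag name', value='ytd-video-renderer'))
```

In practice the Selenium client builds and sends these requests for us, so calls like driver.get() are all we write.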

Installing Selenium

The Selenium WebDriver bindings for Python can be installed with the pip command:

        $ pip install selenium
      

In this project, I’ve used ChromeDriver for Chrome. We’ll be scraping the YouTube trending movies page.

To start with our scraper code, let’s import the Selenium web driver.

        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.common.by import By

Now, let’s create a Selenium web driver object and launch a Chrome browser:

        from selenium import webdriver

        driver = webdriver.Chrome()
        trending_youtube_url = 'https://www.youtube.com/feed/trending?bp=4gIKGgh0cmFpbGVycw%3D%3D'
        driver.get(trending_youtube_url)

If we run this script, a browser window opens and navigates to the YouTube trending URL. When scraping, however, we often don’t want the screen taken up by GUI elements. For this we can use headless mode, which strips the browser of all GUI elements and lets it run silently in the background. In Selenium, we can enable it through the options keyword argument:

        def get_driver():
            chrome_options = Options()
            # Run Chrome without a visible window
            chrome_options.add_argument('--headless')
            # Flags commonly needed in containers or serverless environments
            chrome_options.add_argument('--no-sandbox')
            chrome_options.add_argument('--disable-dev-shm-usage')
            chrome_options.add_argument('--disable-setuid-sandbox')
            driver = webdriver.Chrome(options=chrome_options)
            return driver

        # Call the function only after it has been defined
        driver = get_driver()

Next, we need to extract details from the trending YouTube movies page. The details that we’ll extract are title, URL, thumbnail URL, channel name, number of views, uploaded date, and description.

1. Inspect the page to see that each video’s details sit under the ‘ytd-video-renderer’ element.

        def get_videos(driver):
            # 'ytd-video-renderer' is a custom tag name, so we search by tag name
            video_div_class = 'ytd-video-renderer'
            driver.get(trending_youtube_url)
            videos = driver.find_elements(By.TAG_NAME, video_div_class)
            return videos

        videos = get_videos(driver)
        print(f'Found {len(videos)} videos.')

We have all 30 trending videos from YouTube, but we’re only interested in the top 10 trending videos.

2. The title can be retrieved using ‘find_element’ by ID on ‘video-title’.

        # `video` is a single element from the `videos` list
        title_tag = video.find_element(By.ID, 'video-title')
        title = title_tag.text

3. Next, to get the video URL, right-click the video title, choose Inspect, and read the ‘href’ attribute with ‘get_attribute’.

        url = title_tag.get_attribute('href')
      

4. We get the channel name from the class ‘ytd-channel-name’.

        channel_div = video.find_element(By.CLASS_NAME, 'ytd-channel-name')
        channel = channel_div.text

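5. To get the number of views and the upload date, right-click each value under the video, choose Inspect, and locate it with an XPath. The leading ‘.’ keeps the search relative to the current video; an XPath that starts with ‘//’ searches the whole page and would return the first video’s values every time.

        views = video.find_element(By.XPATH,
            './/*[@id="metadata-line"]/span[1]').text
        uploaded = video.find_element(By.XPATH,
            './/*[@id="metadata-line"]/span[2]').text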

6. Finally, to get the video description, right-click the description, choose Inspect, and use ‘find_element’ by ID on ‘description-text’.

        description_tag = video.find_element(By.ID, 'description-text')
        description = description_tag.text

Putting it all together, we can define a reusable function.

        def parse_videos(video):
            # Extract title, url, thumbnail, channel name, views, uploaded date and description
            title_tag = video.find_element(By.ID, 'video-title')
            title = title_tag.text
            url = title_tag.get_attribute('href')
            thumbnail_tag = video.find_element(By.TAG_NAME, 'img')
            thumbnail_url = thumbnail_tag.get_attribute('src')
            channel_div = video.find_element(By.CLASS_NAME, 'ytd-channel-name')
            channel = channel_div.text
            # The leading '.' keeps each XPath relative to this video element
            views = video.find_element(By.XPATH,
                './/*[@id="metadata-line"]/span[1]').text
            uploaded = video.find_element(By.XPATH,
                './/*[@id="metadata-line"]/span[2]').text
            description_tag = video.find_element(By.ID, 'description-text')
            description = description_tag.text
            return {
                'title': title,
                'url': url,
                'thumbnail_url': thumbnail_url,
                'channel name': channel,
                'views': views,
                'uploaded': uploaded,
                'description': description
            }
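
The views value comes back as display text rather than a number. For later analysis it may help to convert it. The sketch below assumes strings shaped like '1.2M views' or '845K views' (the exact wording YouTube shows can differ by locale), and parse_view_count is a hypothetical helper, not part of Selenium.

```python
import re

# Multipliers for the abbreviated suffixes assumed in the view text
SUFFIXES = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

# A number, optionally followed by a K/M/B suffix that ends at a word boundary
VIEW_PATTERN = re.compile(r'\s*(\d+(?:[.,]\d+)*)\s*([KMB])?\b', re.IGNORECASE)

def parse_view_count(text):
    """Convert text such as '1.2M views' to an int; return None if it does not parse."""
    match = VIEW_PATTERN.match(text)
    if not match:
        return None
    try:
        # Treat commas as thousands separators (an assumption about the locale)
        number = float(match.group(1).replace(',', ''))
    except ValueError:
        return None
    suffix = (match.group(2) or '').upper()
    return int(round(number * SUFFIXES.get(suffix, 1)))

for sample in ['1.2M views', '845K views', '1,234 views', 'No views']:
    print(sample, '->', parse_view_count(sample))
```

Keeping the original text in the dictionary and adding the parsed number as a separate field avoids losing information when a string does not match the assumed format.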

We return the result in the form of a dictionary. I’ll go ahead and export the data into a .csv so that I can use the file for further analysis.

        videos_df.to_csv('YT_trending_movies.csv', index=None)
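
The step that turns the parsed dictionaries into rows is not shown above. As a minimal sketch, assuming a list of dictionaries like the ones parse_videos returns, the standard library’s csv module can write the same kind of file. The sample rows, the write_videos_csv helper, and the example.com URLs are placeholders for illustration.

```python
import csv
import io

# Sample rows standing in for the dictionaries returned by parse_videos()
videos_data = [
    {'title': 'Sample trailer A', 'url': 'https://example.com/a', 'views': '1.2M views'},
    {'title': 'Sample trailer B', 'url': 'https://example.com/b', 'views': '845K views'},
]

def write_videos_csv(rows, file_obj):
    """Write a list of same-keyed dictionaries as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer is used here; open('YT_trending_movies.csv', 'w', newline='')
# would write to disk instead
buffer = io.StringIO()
write_videos_csv(videos_data, buffer)
print(buffer.getvalue())
```

With pandas installed, pd.DataFrame(videos_data) builds a DataFrame from the same list, and its to_csv method produces the file in one call.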
      

The resulting .csv contains the title, URL, thumbnail URL, channel name, number of views, date uploaded, and description for each video.

Send results via email using SMTP

While there are many ways to send emails using Python, in this project we’ll be focusing on sending an email through a protocol like SMTP.

SMTP (Simple Mail Transfer Protocol) is an application-level protocol on top of TCP. It is a delivery-only protocol, used by external services, such as the email client on your phone, to communicate with mail servers.

Open the connection

Python conveniently comes with the smtplib module, which handles all of the different parts of the protocol, like connecting, authenticating, validating, and of course, sending emails.

        import smtplib

        try:
            server = smtplib.SMTP('smtp.gmail.com', 587)
            # Identify ourselves to the mail server
            server.ehlo()
        except Exception:
            print('Something went wrong...')

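This connection is not yet encrypted. Without an explicit port, smtplib.SMTP defaults to port 25; the example above connects to port 587, where the session is expected to be upgraded with STARTTLS before logging in.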

Use a secure connection

When an SMTP connection is secured via TLS/SSL from the start, it typically runs over port 465 and is called SMTPS. I’ve used smtplib.SMTP_SSL() to open a secure connection.

        import smtplib

        try:
            server_ssl = smtplib.SMTP_SSL('smtp.gmail.com', 465)
            # Use the same name the connection was assigned to
            server_ssl.ehlo()
        except Exception:
            print('Something went wrong...')

Create Email

Emails are nothing but text with “From”, “To”, and “Subject” headers and a body. For example, suppose the email is built like this:

        body = "Hey, what's up?\n\n- You"
        subject = 'YouTube Trending Movies'
        # A blank line separates the headers from the body
        email_text = 'Subject:' + subject + '\n\n' + body

All you have to do is pass the email_text string to smtplib.
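
As an alternative to concatenating strings by hand, the standard library’s email.message.EmailMessage can assemble the headers and body. This is a sketch, not the approach used in the rest of this post; build_message and the example.com addresses are placeholders.

```python
from email.message import EmailMessage

def build_message(sender, receiver, subject, body):
    """Assemble a plain-text email with From, To and Subject headers."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = receiver
    msg['Subject'] = subject
    # set_content stores the body and sets a text/plain content type
    msg.set_content(body)
    return msg

msg = build_message(
    'sender@example.com',
    'receiver@example.com',
    'YouTube Trending Movies',
    "Hey, what's up?\n\n- You",
)
print(msg.as_string())
```

A message built this way can be passed to the send_message() method of an smtplib connection instead of calling sendmail() with a raw string.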

Authenticate Gmail

There are a couple of configurations/setups we would need to follow before we can send an email through Gmail securely using SMTP.

Enable IMAP

In Gmail settings, open Forwarding and POP/IMAP and enable IMAP.

What changed?

I had working code earlier this year (before May 2022) to send emails via SMTP. When I re-ran the code this week, I got the error below.

        
        smtp error Error: Invalid login: 535-5.7.8 Username and Password not accepted.
        
      

Upon troubleshooting, I found that since May 30, 2022, Google no longer allows logging in via smtplib with just a username and password, because it flags this kind of login as “less secure”.

So, what do we do now?

The solution is to create an App Password.

  • First, log in to your Gmail account.
  • Go to the My Account section by visiting
    https://myaccount.google.com
  • Open the Security tab in the sidebar.

  • In the Signing in to Google section, make sure 2-Step Verification is turned on. If it is not, turn it on.

  • Once 2-Step Verification is on, the App Passwords option becomes visible.

  • Click App Passwords.
  • Select Mail as the app and select your corresponding device. Then click Generate to create the App Password.

  • The App Password is now created. Use it as the password in your SMTP login, and the error should be resolved.

If you are setting this up for the first time, you’re at an advantage: just follow the steps above.

Send email

With the email_text we’ve created, we can call the .sendmail() method. Putting it all together,

        import os
        import smtplib

        def send_email(body):
            SENDER = '[email protected]'
            RECEIVER = '[email protected]'
            # The App Password created above, read from an environment variable
            PASSWORD = os.environ['gmail_password']
            subject = 'YouTube Trending Movies'
            email_text = 'Subject:' + subject + '\n\n' + body
            try:
                # The with block closes the connection even if sending fails
                with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server_ssl:
                    server_ssl.ehlo()
                    server_ssl.login(SENDER, PASSWORD)
                    server_ssl.sendmail(SENDER, RECEIVER, email_text)
                print('Email sent!')
            except Exception:
                print('Email not sent, something went wrong...')

That’s it! Check the inbox to confirm that the email was received.
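
How the scraped videos become the body string is left open above. One possible approach, sketched here under the assumption that each video is a dictionary with the keys returned by parse_videos, is to format a short numbered list. format_email_body and the sample data are hypothetical.

```python
def format_email_body(videos):
    """Turn a list of video dictionaries into a plain-text, numbered summary."""
    lines = []
    for rank, video in enumerate(videos, start=1):
        # Strip stray whitespace from keys so 'title ' and 'title' both work
        video = {key.strip(): value for key, value in video.items()}
        lines.append(f"{rank}. {video['title']} ({video['channel name']})")
        lines.append(f"   {video['views']}, uploaded {video['uploaded']}")
        lines.append(f"   {video['url']}")
    return '\n'.join(lines)

sample_videos = [{
    'title': 'Sample trailer',
    'url': 'https://example.com/watch',
    'channel name': 'Sample Channel',
    'views': '1.2M views',
    'uploaded': '2 days ago',
}]
body = format_email_body(sample_videos)
print(body)
```

The resulting string can be passed straight to send_email(body).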

References

I’ve used many references to do the project as well as write this blog. Below are the major references I’ve used for this blog.

  1. Selenium: https://www.scrapingbee.com/blog/selenium-python/
  2. How to send an email with Python: https://stackabuse.com/how-to-send-emails-with-gmail-using-python/
  3. SMTP: https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol
  4. Gmail configurations needed to authenticate Gmail and send email via SMTP: https://help.warmupinbox.com/en/articles/4934806-configure-for-google-workplace-with-two-factor-authentication-2fa
  5. The inspiration for this project is Aakash N S (Jovian) and the workshop on web scraping using Selenium: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-with-selenium-aws

Source: https://blog.jovian.ai/web-scraping-using-selenium-2a3ffa1f03f4