I'm trying to learn how to use Selenium with Python. I want to crawl the news titles and dates from this website, but I have a problem that I don't know how to solve.
This is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import json

driver = webdriver.Chrome("./chromedriver")
driver.implicitly_wait(10)
driver.get("https://www.thestandnews.com/search/?q=%E6%96%B0%E5%86%A0%E8%82%BA%E7%82%8E")
soup = BeautifulSoup(driver.page_source, "lxml")

pages_remaining = True
page_num = 1
My_array = []

while pages_remaining:
    print("Page Number:", page_num)
    soup = BeautifulSoup(driver.page_source, "lxml")
    """ # not finished yet
    tags_lis = soup.find_all("li")
    for tag in tags_lis:
        tag_a = tag.find("a")
        tag_span = tag.find("span")
        title = tag_a.text
        date = tag_span.text
        temp = {"title": title, "date": date}
        print(temp)
        My_array.append(temp)
    """
    try:
        # press the button for the next page
        # next_link = driver.find_element_by_xpath()
        nextPg = '//*[@id="___gcse_1"]/div/div/div/div[5]/div[2]/div/div/div[2]/div/div[%d]' % (page_num + 1)
        print(nextPg)
        next_link = driver.find_element_by_xpath(nextPg)
        next_link.click()
        time.sleep(5)
        if page_num < 10:
            page_num = page_num + 1
        else:
            pages_remaining = False
    except Exception:
        pages_remaining = False

driver.close()
And this is the error message. Can anyone give me a hint? Thanks!
DevTools listening on ws://127.0.0.1:49952/devtools/browser/749fcb19-d13a-4f38-9d7c-3da58726e10a
[13744:13732:0517/214816.873:ERROR:browser_switcher_service.cc(238)] XXX Init()
Page Number: 1
//*[@id="___gcse_1"]/div/div/div/div[5]/div[2]/div/div/div[2]/div/div[2]
[13744:13732:0517/214824.321:ERROR:device_event_log_impl.cc(162)] [21:48:24.321] Bluetooth:
bluetooth_adapter_winrt.cc:1055 Getting Default Adapter failed.
Page Number: 2
//*[@id="___gcse_1"]/div/div/div/div[5]/div[2]/div/div/div[2]/div/div[3]
You are trying to scrape the title and date of each news item on the page.
To fix the error you are getting, you will need to modify your code as follows:
Instead of using the find_all method to collect every li tag, use find_all to collect the div tags with class gsc-webResult; these are the containers for the individual news items.
Inside the loop over those div tags, use the find method to pull out the title and date of each item: the title is in an a tag within the div, and the date is in a div tag with class gsc-webResult-date. (The results are rendered by Google's Custom Search widget, so double-check the exact class names in your browser's DevTools.)
After you have extracted the title and date for each news item, you can add them to the My_array list as a dictionary with keys 'title' and 'date'.
Here is some sample code that demonstrates how to do this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome("./chromedriver")
driver.implicitly_wait(10)
driver.get("https://www.thestandnews.com/search/?q=%E6%96%B0%E5%86%A0%E8%82%BA%E7%82%8E")
time.sleep(5)  # give the search widget time to render its results

My_array = []
soup = BeautifulSoup(driver.page_source, "lxml")
for result in soup.find_all("div", class_="gsc-webResult"):
    tag_a = result.find("a")
    tag_date = result.find("div", class_="gsc-webResult-date")
    if tag_a and tag_date:
        My_array.append({"title": tag_a.text, "date": tag_date.text})

print(My_array)
driver.close()
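Because the class names on the live page may not match what I described, you can verify the extraction logic offline on a small hand-written HTML snippet before running the full Selenium script. The snippet below is made-up sample data that mirrors the class names assumed above, not markup copied from the real site:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search-result block; the real page's
# class names may differ, so inspect it with DevTools and adjust.
sample_html = """
<div class="gsc-webResult">
  <a href="https://example.com/article">Example news title</a>
  <div class="gsc-webResult-date">2020-05-17</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results = []
for div in soup.find_all("div", class_="gsc-webResult"):
    title = div.find("a").text                              # title lives in the <a> tag
    date = div.find("div", class_="gsc-webResult-date").text  # date lives in the nested <div>
    results.append({"title": title, "date": date})

print(results)
```

Once this prints the expected dictionary for the sample snippet, swap sample_html for driver.page_source and the same loop should work against the live page (assuming the selectors are right).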