Sometimes you need data for your own purposes, or to extend your own functionality, from a website that presents that data in a different way. Data analysis, machine learning, or an app built on data aggregated from several sites: there are plenty of use cases. Before you start, make sure that scraping does not violate the site's license agreement, terms of use, or guidelines.

1. What you need

  • Python (any language works; I chose Python simply because of Beautiful Soup)
  • Beautiful Soup 4 – the tool that extracts the data
  • requests – handles the HTTP(S) requests
  • Selenium – a handy way to get into sites that sit behind a login page
  • Google ChromeDriver – required for Selenium to drive Chrome
  • Browser developer tools – read the page source to find the elements you want. Dedicated tools exist as well, but a simple search through the source works well.

2. Setup and Install

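python -m venv env_ws
# Activate the environment (on Windows: env_ws\Scripts\activate)
source env_ws/bin/activate
pip install beautifulsoup4
pip install requests
pip install selenium

ChromeDriver downloads (pick the version that matches your Chrome):
https://sites.google.com/chromium.org/driver/downloads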

3. Login

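import getpass

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Login_URL and Home_URL are assumed to be defined elsewhere in the script.

def login():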
    login_name = input("User ID: ")
    login_password = getpass.getpass("Password: ")

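    # Run Chrome without opening a visible window
    options = Options()
    options.add_argument('--headless')

    # For Chrome with a custom driver path (Selenium 4 takes it through a Service
    # object: from selenium.webdriver.chrome.service import Service):
    #driver = webdriver.Chrome(service=Service('path/to/webdriver'), options=options)
    driver = webdriver.Chrome(options=options)
    # For Firefox:
    #driver = webdriver.Firefox()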

    # Navigate to the login page
    driver.get(Login_URL)

    # Find the login form elements
    email_input = driver.find_element(By.ID, "login_email")
    password_input = driver.find_element(By.ID, "login_pass")
    login_button = driver.find_element(By.XPATH, "//button[contains(text(), 'ログイン')]")

    # Fill in the login credentials
    email_input.send_keys(login_name)
    password_input.send_keys(login_password)
    login_button.click()

    # Wait for the login to complete (change the timeout as needed)
    try:
        WebDriverWait(driver, 10).until(EC.url_contains(Home_URL))
    except Exception:
        # The wait timed out (or failed): close the browser before giving up
        print("Login failed")
        driver.quit()
        return None

    cookies = driver.get_cookies()
    driver.quit()
    return cookies

Use the getpass module so that the password is not echoed to the screen as you type it.

4. Create the soup object using the cookies

    # Needs: import requests / from bs4 import BeautifulSoup
    # Transfer the Selenium cookies to a requests session
    session = requests.Session()
    for cookie in cookies:
        session.cookies.set(cookie['name'], cookie['value'])

    response = session.get(Base_URL)
    soup = BeautifulSoup(response.content, "html.parser")
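
The loop above copies only each cookie's name and value. As a hedged sketch, the variant below also carries over the domain and path, which Selenium normally includes in each cookie dictionary, so that requests sends the cookies only to the matching site. The helper name session_from_cookies and the sample cookie are made up for illustration; they are not part of the original script.

```python
import requests


def session_from_cookies(cookies):
    """Build a requests session from Selenium-style cookie dictionaries."""
    session = requests.Session()
    for cookie in cookies:
        # Only pass the attributes that are actually present in the dictionary
        extra = {key: cookie[key] for key in ("domain", "path") if key in cookie}
        session.cookies.set(cookie["name"], cookie["value"], **extra)
    return session


# Hypothetical cookie in the shape returned by driver.get_cookies()
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
]
session = session_from_cookies(sample_cookies)
```

In the real script you would pass the list returned by login() instead of sample_cookies.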

5. Use soup to extract data

Parsing the HTML mostly comes down to calling soup.find_all() or soup.find():

  • find_all() for repeating items
  • find() for unique items

Important

  • Always keep in mind which tag you are searching within: calling find() on an element only searches inside that element.
  • Handle the case where the target item is not found: find() returns None and find_all() returns an empty list.

    # product_list, parse_product() and Product_Base_URL are defined elsewhere in the script
    product_items = soup.find_all("article", class_="product-item")
    for product_item in product_items:
        product = {}
        product_no = product_item['id'].rsplit("--")[1]
        product['product_no'] = product_no
        product_name = product_item.find("h3", class_="product-name").text.strip()
        product['product_name'] = product_name
        product_detail = product_item.find("div", class_="product-detail").text.strip()
        product['product_detail'] = product_detail
        # The item counts as unavailable when this <p> element is present
        product_available = product_item.find("p", class_="product-name") is None
        product["product_available"] = product_available
        product_price = product_item.find("dd", class_="product-price").text.strip()
        product['product_price'] = product_price
        parse_product(session, Product_Base_URL+product_no, product)
        product_list.append(product)
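
The loop above raises an exception as soon as one element is missing, because find() returns None in that case. Below is a minimal sketch of one way to handle the null case, assuming made-up markup that mirrors the class names used above. The helpers text_or_none and product_no_from_id are hypothetical names introduced here. Note that product_no_from_id takes the part after the last "--", which gives the same result as the original rsplit("--")[1] only when the id contains a single separator.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching child, or None."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


def product_no_from_id(element_id):
    """Return the part after the last "--" in an id, or None if there is none."""
    _, separator, product_no = element_id.rpartition("--")
    return product_no if separator else None


# Made-up markup: the second item has no price element
html = """
<article class="product-item" id="item--1001">
  <h3 class="product-name"> Widget </h3>
  <dd class="product-price">1,200</dd>
</article>
<article class="product-item" id="item--1002">
  <h3 class="product-name">Gadget</h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
products = []
for item in soup.find_all("article", class_="product-item"):
    products.append({
        "product_no": product_no_from_id(item.get("id", "")),
        "product_name": text_or_none(item, "h3", "product-name"),
        "product_price": text_or_none(item, "dd", "product-price"),
    })
```

With this approach a missing field shows up as None in the result, so you can decide later whether to skip the product or keep it with the gap.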

6. Tips on performance

  1. Use Selenium only for the login. Once logged in, pass the cookies to the requests session and close the browser window.
  2. Use multiple threads, but be careful not to overload the website.
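import threading

# Parse each page in its own thread; thread_parse_page() and number_of_pages
# are defined elsewhere in the script
threads = []
for i in range(number_of_pages):
    thread = threading.Thread(target=thread_parse_page, args=(i + 1, session))
    threads.append(thread)
    thread.start()

# Wait for every thread to finish
for thread in threads:
    thread.join()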
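
Starting one thread per page can send many requests at the same moment. As an alternative sketch (not the approach used above), the standard library's ThreadPoolExecutor caps the number of threads working at once. The parse_page function here is a stand-in that fakes its result, and MAX_WORKERS and DELAY_SECONDS are assumed values that you would tune for the site.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4        # assumed cap on simultaneous requests
DELAY_SECONDS = 0.1    # assumed pause before each request


def parse_page(page_no):
    """Stand-in for the real page parser; returns a fake result."""
    time.sleep(DELAY_SECONDS)  # be gentle with the server
    return {"page": page_no, "items": page_no * 10}


def parse_all_pages(number_of_pages):
    """Parse pages 1..number_of_pages with at most MAX_WORKERS threads."""
    pages = range(1, number_of_pages + 1)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # map() returns results in page order, regardless of finish order
        return list(executor.map(parse_page, pages))


results = parse_all_pages(6)
```

Because each worker returns its result instead of appending to a shared list, the results come back in page order and no locking is needed.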