
Scraping Indeed jobs with Beautiful Soup and Python


Hello friends, today we are going to learn some automation. As you are all aware, automating things can save you time, and it is a smarter way to get things done.

We have all gone through the struggle of job searching, and some of you may still be looking for a job. You have to stay updated daily and keep checking different job portals and websites like naukri.com, Indeed, and Monster jobs. So, knowing the capabilities of Python and its amazing libraries, why can't we automate this process?

This blog post is about building an automated process that delivers the latest job listings to our desktop without browsing the website. We have created a script to fetch updated jobs from Indeed across multiple pages. Let's explore more via the code.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

We are using the Selenium and Beautiful Soup libraries here. Selenium can automate the browser and fetch information from a website. Beautiful Soup can parse the scraped HTML and extract the useful information from it. We are using the pandas library to export the collected information in CSV format.

# download the Firefox geckodriver, place it in any directory, and use its path here
driver = webdriver.Firefox(
    executable_path='PATH/geckodriver')

data = pd.DataFrame(columns=['Title', 'Location',
                             'Company', 'Salary', 'Desc'])

Selenium requires a browser driver; for Firefox we need a suitable geckodriver. Download the driver and provide its path as shown above. We also create an empty DataFrame whose columns are the fields we want to collect.

driver.get(
        'https://www.indeed.co.in/jobs?q=python&sort=date&l=Ahmedabad&start='+str(i))

The Selenium driver object provides a get method that fetches the page at a URL. Here we use the Indeed link format, where we pass ?q={keyword} for the search term, sort=date to sort the results by date, l={location} for the location, and start={offset} for pagination.
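For context, this get call runs inside a loop over result pages, with i as the start offset. A minimal sketch, assuming we want the first three pages and using Indeed's step of 10 results per page:

# sketch of the page loop: Indeed's start parameter is a result offset
# that advances by 10 per results page (0, 10, 20, ...)
for i in range(0, 30, 10):
    driver.get(
        'https://www.indeed.co.in/jobs?q=python&sort=date&l=Ahmedabad&start='
        + str(i))
    # ... locate and parse the job cards on this page (see below) ...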

result_html = job.get_attribute('innerHTML')
soup = BeautifulSoup(result_html, 'html.parser')

We get the innerHTML of each job card as a string and pass it to Beautiful Soup with the html.parser backend. Basically, Beautiful Soup takes the HTML data and parses it into a traversable DOM tree.
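For completeness, here is a sketch of where the job element above comes from. The card class name jobsearch-SerpJobCard is an assumption based on Indeed's markup at the time of writing and may have changed since:

# locate each result card on the page; the class name is an assumption
# from Indeed's markup at the time and may need updating
jobs = driver.find_elements_by_class_name('jobsearch-SerpJobCard')
for job in jobs:
    result_html = job.get_attribute('innerHTML')
    soup = BeautifulSoup(result_html, 'html.parser')
    # ... extract title, location, company, salary, and description ...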

title = soup.find(name="a", attrs={
                "data-tn-element": "jobTitle"}).text.replace('\n', '').strip()

The soup object has methods like find and find_all, which we can use to locate nodes by tag name, class name, or attribute values. Here we grab the job title from the anchor tag that carries the data-tn-element="jobTitle" attribute.
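The same pattern covers the remaining fields. A hedged sketch follows; the class names (location, company, salaryText, summary) are assumptions from Indeed's markup at the time and may differ today:

# extract the remaining fields from the card; all class names here
# are assumptions and may need updating against the live markup
location = soup.find(attrs={'class': 'location'}).text.strip()
company = soup.find(name='span', attrs={'class': 'company'}).text.strip()
salary_tag = soup.find(name='span', attrs={'class': 'salaryText'})
salary = salary_tag.text.strip() if salary_tag else 'N/A'  # salary is often absent
job_desc = soup.find(name='div', attrs={'class': 'summary'}).text.strip()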

close_button = driver.find_elements_by_class_name(
    'popover-x-button-close')[0]
close_button.click()

Sometimes we get a popup when the page refreshes, and it would block our script from running. So we use this snippet to close that popup and go ahead with our scraping.
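Since the popup does not appear on every page load, find_elements_by_class_name can return an empty list, and the [0] lookup would then raise an IndexError. A sketch that makes this step safe:

# close the popup only if it actually appeared; an empty result list
# means there was no popup, so we simply carry on
try:
    close_button = driver.find_elements_by_class_name(
        'popover-x-button-close')[0]
    close_button.click()
except IndexError:
    pass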

data = data.append({'Title': title, 'Location': location,
                            'Company': company, 'Salary': salary, 'Desc': job_desc}, ignore_index=True)

data.to_csv('data.csv')

Here we are using a pandas DataFrame and appending the data from every page to it. Once we have scraped enough pages, we export the DataFrame to CSV. Let's look at the complete code so you can get a better idea of the script.
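A sketch of the complete script, assembled from the snippets above. It matches the Selenium 3 and pandas pre-2.0 APIs used throughout the post, and the Indeed class names remain assumptions from the time of writing:

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

# Selenium 3-style setup (Selenium 4 passes the driver path via a Service)
driver = webdriver.Firefox(executable_path='PATH/geckodriver')

data = pd.DataFrame(columns=['Title', 'Location',
                             'Company', 'Salary', 'Desc'])

for i in range(0, 30, 10):  # three pages, 10 results per page
    driver.get(
        'https://www.indeed.co.in/jobs?q=python&sort=date&l=Ahmedabad&start='
        + str(i))

    # close the popup if one appeared on this page load
    try:
        driver.find_elements_by_class_name('popover-x-button-close')[0].click()
    except IndexError:
        pass

    for job in driver.find_elements_by_class_name('jobsearch-SerpJobCard'):
        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        title = soup.find(name='a', attrs={
            'data-tn-element': 'jobTitle'}).text.replace('\n', '').strip()
        location = soup.find(attrs={'class': 'location'}).text.strip()
        company = soup.find(name='span',
                            attrs={'class': 'company'}).text.strip()
        salary_tag = soup.find(name='span', attrs={'class': 'salaryText'})
        salary = salary_tag.text.strip() if salary_tag else 'N/A'
        job_desc = soup.find(name='div',
                             attrs={'class': 'summary'}).text.strip()
        # DataFrame.append was removed in pandas 2.0; use pd.concat there
        data = data.append({'Title': title, 'Location': location,
                            'Company': company, 'Salary': salary,
                            'Desc': job_desc}, ignore_index=True)

driver.quit()
data.to_csv('data.csv')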
