Job Listing Data Extraction

JSoup is a Java library for parsing HTML and extracting data from it with CSS selectors, which makes it well suited to scraping job listings from websites. This section shows how to use JSoup to extract job listing data from an HTML page: job titles, company names, locations, and links to the full job descriptions.

Steps for Job Listing Data Extraction

1. Analyze the HTML Structure

Before writing any code, examine the HTML structure of the job listings page to identify the tags and classes used for job titles, company names, locations, and so on. Browser developer tools make this easy: right-click an element on the page and choose Inspect.
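Once you have candidate selectors from the developer tools, it can help to check them against a saved copy of the page before writing the full scraper. The sketch below is one way to do that. The class name SelectorCheck, the helper countMatches, and the class names in the sample HTML (job-item, job-title, company-name, salary) are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SelectorCheck {

    // Counts how many elements each candidate CSS selector matches in the given HTML.
    static Map<String, Integer> countMatches(String html, List<String> selectors) {
        Document doc = Jsoup.parse(html);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String selector : selectors) {
            counts.put(selector, doc.select(selector).size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical snippet standing in for a saved copy of the real page
        String html = "<div class=\"job-item\"><h2 class=\"job-title\">Software Engineer</h2>"
                + "<p class=\"company-name\">Tech Corp</p></div>"
                + "<div class=\"job-item\"><h2 class=\"job-title\">Data Scientist</h2></div>";

        Map<String, Integer> counts =
                countMatches(html, List.of(".job-item", ".job-title", ".company-name", ".salary"));
        counts.forEach((selector, count) -> System.out.println(selector + " -> " + count));
    }
}
```

A selector that reports zero matches is either mistyped or refers to something the page does not contain, and a count that differs from the number of job entries suggests the field is optional.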

2. Setup JSoup for Parsing the HTML

Once you know the HTML structure, you can use JSoup to parse the HTML page and extract the required data. Typically, the structure will consist of a list of job entries, where each job will have details like the title, company name, location, and a link to the full description.
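The list-of-entries structure described above can be tried out without any network access, because Jsoup.parse accepts an HTML string directly. The following is a minimal sketch under that assumption. The class name ParseFromString, the helper titles, and the sample markup are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParseFromString {

    // Parses an HTML string and, for each element matching itemSelector,
    // collects the text of the elements matching titleSelector inside it.
    static List<String> titles(String html, String itemSelector, String titleSelector) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element item : doc.select(itemSelector)) {
            result.add(item.select(titleSelector).text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup: a list of job entries, each with a title
        String html = "<div class=\"job-listing\">"
                + "<div class=\"job-item\"><h2 class=\"job-title\">Software Engineer</h2></div>"
                + "<div class=\"job-item\"><h2 class=\"job-title\">Data Scientist</h2></div>"
                + "</div>";

        System.out.println(titles(html, ".job-item", ".job-title"));
        // prints [Software Engineer, Data Scientist]
    }
}
```

Selecting the entry container first and then querying inside each entry keeps the fields of one job together, which matters once several fields are extracted per entry.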

Example: Extracting Job Listings Data

Suppose we have a job listings page with HTML similar to the following structure:

<html>
  <body>
    <div class="job-listing">
      <div class="job-item">
        <h2 class="job-title">Software Engineer</h2>
        <p class="company-name">Tech Corp</p>
        <p class="job-location">New York, NY</p>
        <a href="job-link1" class="job-link">View Details</a>
      </div>
      <div class="job-item">
        <h2 class="job-title">Data Scientist</h2>
        <p class="company-name">Data Solutions</p>
        <p class="job-location">San Francisco, CA</p>
        <a href="job-link2" class="job-link">View Details</a>
      </div>
    </div>
  </body>
</html>

3. Java Code for Data Extraction

The following JSoup code extracts the job title, company, location, and job link from each entry:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JobListingExtractor {

    public static void main(String[] args) {
        try {
            // URL of the job listing page (replace with the actual URL)
            String url = "https://example.com/job-listings";
            
            // Fetch the HTML from the URL
            Document doc = Jsoup.connect(url).get();
            
            // Select all job listing items
            Elements jobListings = doc.select(".job-item");
            
            // Loop through each job item and extract data
            for (Element job : jobListings) {
                String jobTitle = job.select(".job-title").text();
                String companyName = job.select(".company-name").text();
                String jobLocation = job.select(".job-location").text();
                String jobLink = job.select(".job-link").attr("href");
                
                // Print the extracted data
                System.out.println("Job Title: " + jobTitle);
                System.out.println("Company: " + companyName);
                System.out.println("Location: " + jobLocation);
                System.out.println("Job Link: " + jobLink);
                System.out.println("-----------------------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation:

Connecting to the URL: Jsoup.connect(url).get() fetches the page at the given job listings URL and parses it into a Document.
Selecting Job Listings: doc.select(".job-item") selects all elements with the class job-item, each of which represents one job listing.
Extracting Data: For each job listing, we extract:
- Job Title: .select(".job-title").text() extracts the text content of the job title element.
- Company Name: .select(".company-name").text() extracts the text content of the company name element.
- Job Location: .select(".job-location").text() extracts the text content of the job location element.
- Job Link: .select(".job-link").attr("href") extracts the href attribute, which contains the link to the job details page.
Displaying the Data: The extracted data for each job (title, company, location, and link) is printed to the console.
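Two details of the calls above are worth knowing. Called on a selection, text() joins the text of every matching element, and attr("href") returns the attribute exactly as written in the page, which may be a relative path. The sketch below shows one possible way to take only the first match and to resolve the link to an absolute URL with absUrl. The class name SafeFieldExtraction, the helpers textOrEmpty and absoluteLink, and the base URI are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeFieldExtraction {

    // Returns the text of the first match, or an empty string when nothing matches.
    static String textOrEmpty(Element parent, String selector) {
        Element match = parent.selectFirst(selector);
        return match == null ? "" : match.text();
    }

    // Returns the href of the first match resolved against the document's base URI,
    // or an empty string when nothing matches.
    static String absoluteLink(Element parent, String selector) {
        Element match = parent.selectFirst(selector);
        return match == null ? "" : match.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<div class=\"job-item\">"
                + "<h2 class=\"job-title\">Software Engineer</h2>"
                + "<a href=\"job-link1\" class=\"job-link\">View Details</a>"
                + "</div>";

        // The base URI here is an assumption; Jsoup.connect(url).get() sets it from the fetched URL
        Document doc = Jsoup.parse(html, "https://example.com/jobs/");
        Element job = doc.selectFirst(".job-item");

        System.out.println(textOrEmpty(job, ".job-title"));    // Software Engineer
        System.out.println(textOrEmpty(job, ".company-name")); // empty line, the element is missing
        System.out.println(absoluteLink(job, ".job-link"));    // https://example.com/jobs/job-link1
    }
}
```

Absolute links are usually what you want if the job details pages will be fetched in a later step.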

Handling Pagination

Many job listing websites spread their listings across multiple pages. To handle pagination, request each page in turn and apply the same extraction logic to every page.

Example for Pagination Handling:

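import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JobListingExtractor {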

    public static void main(String[] args) {
        try {
            // Starting page URL (replace with actual URL)
            String baseUrl = "https://example.com/job-listings?page=";
            
            // Loop through multiple pages (adjust the number of pages as needed)
            for (int page = 1; page <= 5; page++) {
                String url = baseUrl + page;
                Document doc = Jsoup.connect(url).get();
                Elements jobListings = doc.select(".job-item");

                for (Element job : jobListings) {
                    String jobTitle = job.select(".job-title").text();
                    String companyName = job.select(".company-name").text();
                    String jobLocation = job.select(".job-location").text();
                    String jobLink = job.select(".job-link").attr("href");

                    // Print the extracted data
                    System.out.println("Job Title: " + jobTitle);
                    System.out.println("Company: " + companyName);
                    System.out.println("Location: " + jobLocation);
                    System.out.println("Job Link: " + jobLink);
                    System.out.println("-----------------------------");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code loops through pages 1 to 5 by appending the page number to the URL, which assumes the site exposes its pages through a page query parameter. Adjust the loop range to the number of pages you want to scrape.
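When the number of pages is not known in advance, one common alternative to a fixed range is to stop at the first page that contains no job items. The sketch below illustrates that idea. It is not part of the original example: the class name PaginationUntilEmpty, the helper collectTitles, and the pageLoader parameter are hypothetical, and in-memory pages replace network requests so that it runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PaginationUntilEmpty {

    // Collects job titles page by page. Stops at the first page with no job items
    // or when maxPages is reached, whichever comes first.
    static List<String> collectTitles(IntFunction<Document> pageLoader, int maxPages) {
        List<String> titles = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            Document doc = pageLoader.apply(page);
            Elements items = doc.select(".job-item");
            if (items.isEmpty()) {
                break; // no listings on this page, assume we are past the last page
            }
            for (Element job : items) {
                titles.add(job.select(".job-title").text());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // In-memory pages stand in for network responses (hypothetical content).
        // A real loader would call Jsoup.connect(baseUrl + page).get() and handle IOException.
        String[] pages = {
            "<div class=\"job-item\"><h2 class=\"job-title\">Software Engineer</h2></div>",
            "<div class=\"job-item\"><h2 class=\"job-title\">Data Scientist</h2></div>",
            "<p>No more results</p>"
        };
        IntFunction<Document> loader = page -> Jsoup.parse(pages[page - 1]);

        System.out.println(collectTitles(loader, 5));
        // prints [Software Engineer, Data Scientist]
    }
}
```

The maxPages cap is kept as a safety net in case a site keeps returning listings for any page number.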

Handling Different HTML Structures

Each website structures its HTML differently, so the select() queries must be adjusted for every site you scrape. The key to successful data extraction lies in identifying the correct CSS selectors (classes, ids, tags) that correspond to the data you want to scrape.
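One way to keep the extraction loop unchanged across sites is to hold the selectors in a small per-site configuration object. The following is a sketch of that design, not an established JSoup feature. The class SiteSelectors, the helper extract, and the markup of both sample sites are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SiteSelectorsDemo {

    // Hypothetical per-site configuration: one CSS selector per field
    static final class SiteSelectors {
        final String item;
        final String title;
        final String company;

        SiteSelectors(String item, String title, String company) {
            this.item = item;
            this.title = title;
            this.company = company;
        }
    }

    // Runs the same extraction loop with whichever selectors the site needs.
    static List<String> extract(String html, SiteSelectors s) {
        List<String> rows = new ArrayList<>();
        for (Element job : Jsoup.parse(html).select(s.item)) {
            rows.add(job.select(s.title).text() + " @ " + job.select(s.company).text());
        }
        return rows;
    }

    public static void main(String[] args) {
        String siteA = "<div class=\"job-item\"><h2 class=\"job-title\">Software Engineer</h2>"
                + "<p class=\"company-name\">Tech Corp</p></div>";
        String siteB = "<ul><li class=\"vacancy\"><a class=\"role\">Data Scientist</a>"
                + "<span class=\"employer\">Data Solutions</span></li></ul>";

        System.out.println(extract(siteA, new SiteSelectors(".job-item", ".job-title", ".company-name")));
        // prints [Software Engineer @ Tech Corp]
        System.out.println(extract(siteB, new SiteSelectors("li.vacancy", "a.role", "span.employer")));
        // prints [Data Scientist @ Data Solutions]
    }
}
```

Supporting a new site then means adding one more SiteSelectors instance rather than copying the loop.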

Conclusion

JSoup is an effective and simple library for extracting job listing data from HTML pages. By understanding the structure of the page and using the correct selectors, you can easily extract job titles, company names, locations, and links. Additionally, handling pagination is crucial when dealing with job listings spread across multiple pages.

Content added By

Md Zahid Hasan
