Using BeautifulSoup with HTTPX
In this guide, you'll learn how to use the BeautifulSoup library with the HTTPX library in your Apify Actors.
Introduction
BeautifulSoup is a Python library for extracting data from HTML and XML documents. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a document's parse tree, enabling efficient data extraction.
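For example, the following standalone snippet (not part of the Actor below) parses a small HTML string and reads the page title, a heading, and a link:

from bs4 import BeautifulSoup

html = '<html><head><title>Example</title></head><body><h1>Hello</h1><a href="/docs">Docs</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)           # Example
print(soup.h1.text)                # Hello
print(soup.find('a').get('href'))  # /docs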
HTTPX is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
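As a quick illustration, here is a minimal standalone sketch of an asynchronous GET request with HTTPX; the fetch helper is only for demonstration and is not used by the Actor below:

import asyncio

import httpx


async def fetch(url: str) -> str:
    # The AsyncClient reuses connections for all requests made within this block.
    async with httpx.AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
        response.raise_for_status()
        return response.text


print(asyncio.run(fetch('https://apify.com'))[:100])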
To create an Actor that uses these libraries, start from the BeautifulSoup & Python Actor template. The template has the BeautifulSoup and HTTPX libraries preinstalled, so you can begin development immediately.
Example Actor
Below is a simple Actor that recursively scrapes page titles and headings from all linked pages, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses HTTPX to fetch the pages and BeautifulSoup to parse their content, extracting the data and discovering links to further pages.
import asyncio
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup

from apify import Actor, Request


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs with an initial crawl depth of 0.
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            new_request = Request.from_url(url, user_data={'depth': 0})
            await request_queue.add_request(new_request)

        # Create an HTTPX client to fetch the HTML content of the URLs.
        async with httpx.AsyncClient() as client:
            # Process the URLs from the request queue.
            while request := await request_queue.fetch_next_request():
                url = request.url

                if not isinstance(request.user_data['depth'], (str, int)):
                    raise TypeError('Request.depth is an unexpected type.')

                depth = int(request.user_data['depth'])
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fetch the HTTP response from the specified URL using HTTPX.
                    response = await client.get(url, follow_redirects=True)

                    # Parse the HTML content using Beautiful Soup.
                    soup = BeautifulSoup(response.content, 'html.parser')

                    # If the current depth is less than max_depth, find nested links
                    # and enqueue them.
                    if depth < max_depth:
                        for link in soup.find_all('a'):
                            link_href = link.get('href')
                            link_url = urljoin(url, link_href)

                            if link_url.startswith(('http://', 'https://')):
                                Actor.log.info(f'Enqueuing {link_url} ...')
                                new_request = Request.from_url(
                                    link_url,
                                    user_data={'depth': depth + 1},
                                )
                                await request_queue.add_request(new_request)

                    # Extract the desired data.
                    data = {
                        'url': url,
                        'title': soup.title.string if soup.title else None,
                        'h1s': [h1.text for h1 in soup.find_all('h1')],
                        'h2s': [h2.text for h2 in soup.find_all('h2')],
                        'h3s': [h3.text for h3 in soup.find_all('h3')],
                    }

                    # Store the extracted data to the default dataset.
                    await Actor.push_data(data)

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    # Mark the current request as handled so it is not processed again.
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
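For reference, the Actor reads two input fields, start_urls and max_depth, as seen at the top of main(). An input along the following lines (shown as JSON) would scrape https://apify.com and every page it links to, one level deep:

{
    "start_urls": [{ "url": "https://apify.com" }],
    "max_depth": 1
}

Each scraped page produces one dataset item containing the page URL, its title, and its h1–h3 headings.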
Conclusion
In this guide, you learned how to use BeautifulSoup with HTTPX in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML documents, making it easy to build web scrapers in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!