Asynchronous Scraping with asyncio and HTTPX
Scraping is dominated by waiting. Each request spends most of its time idle: resolving DNS, opening a connection, and waiting for the server to respond. A sequential requests loop leaves that time unused. Asynchronous I/O fixes this by keeping many requests in flight at once within a single thread, which can turn an hour-long crawl into minutes. This guide shows how to do it correctly and politely with asyncio and HTTPX. It builds on the request fundamentals in Understanding HTTP Requests and Responses and fits into the broader Scaling & Deploying Python Web Scrapers workflow.
Why Async Helps
Scraping is I/O-bound: the bottleneck is the network, not the CPU. asyncio runs an event loop that, while one request waits for a response, switches to start or continue others. The result is high concurrency with very low overhead: no thread-per-request memory cost, and task switches that are far cheaper than switching between OS threads. For CPU-bound work like heavy parsing, async does not help; reach for multiprocessing there instead.
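To make the event-loop switching concrete, here is a minimal sketch that uses asyncio.sleep as a stand-in for network latency. The fake_request helper and its delays are assumptions made for illustration; no real HTTP traffic is involved.

```python
import asyncio


async def fake_request(name: str, delay: float, log: list[str]) -> None:
    """Hypothetical request: asyncio.sleep stands in for waiting on the network."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)  # control returns to the event loop here
    log.append(f"done {name}")


async def main() -> list[str]:
    log: list[str] = []
    # All three coroutines share one thread; the loop interleaves them.
    await asyncio.gather(
        fake_request("A", 0.15, log),
        fake_request("B", 0.10, log),
        fake_request("C", 0.05, log),
    )
    return log


log = asyncio.run(main())
print(log)
```

All three requests start before any of them finishes, and they complete in order of their delay (C, B, A) rather than in the order they were launched. The whole run takes roughly as long as the slowest request, not the sum of all three.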
HTTPX is the natural client for this. It offers a requests-like API, native async/await support, HTTP/2, and connection pooling, making it a drop-in upgrade path from synchronous code.
A Basic Async Scraper
The pattern: create one AsyncClient, build a list of coroutines, and run them concurrently with asyncio.gather.
import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> str | None:
    try:
        response = await client.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except httpx.HTTPError as exc:
        print(f"Failed {url}: {exc}")
        return None

async def scrape(urls: list[str]) -> list[str | None]:
    async with httpx.AsyncClient(headers={"User-Agent": "Mozilla/5.0"}) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 21)]
results = asyncio.run(scrape(urls))
print(f"Fetched {sum(r is not None for r in results)} pages")
Reusing a single client matters: it pools connections, so you are not paying the TCP and TLS handshake cost on every request.
Limiting Concurrency: The Semaphore
The example above launches all requests at once. Against twenty URLs that is fine; against twenty thousand it amounts to a denial-of-service attack on the target and is a reliable way to get banned. The standard fix is an asyncio.Semaphore that caps how many requests run simultaneously.
import asyncio
import httpx

async def fetch(client, url, semaphore):
    async with semaphore:  # only N run concurrently
        response = await client.get(url, timeout=10)
        response.raise_for_status()
        await asyncio.sleep(0.2)  # small politeness delay
        return response.text

async def scrape(urls, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(headers={"User-Agent": "Mozilla/5.0"}) as client:
        tasks = [fetch(client, u, semaphore) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)
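return_exceptions=True keeps one failed request from aborting the whole batch. Without it, the first exception propagates out of gather and the caller never sees the results of the other requests. With it, failures come back as exception objects in the result list, in the same order as the input URLs, so you can filter and retry them. The semaphore plus a short sleep gives you fast throughput while staying within a server's tolerance. This is the same politeness principle enforced automatically by Scrapy's AutoThrottle.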
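Separating successes from failures after gather can be sketched as follows. To keep the example runnable offline, flaky_fetch is a hypothetical stand-in for the HTTPX fetch that fails for any URL containing "bad"; the URLs themselves are placeholders.

```python
import asyncio


async def flaky_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP fetch: URLs containing 'bad' fail."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ConnectionError(f"could not reach {url}")
    return f"<html>{url}</html>"


async def scrape_once(urls: list[str]) -> tuple[dict[str, str], dict[str, BaseException]]:
    results = await asyncio.gather(
        *(flaky_fetch(u) for u in urls), return_exceptions=True
    )
    pages: dict[str, str] = {}
    failures: dict[str, BaseException] = {}
    # gather returns results in input order, so zip pairs each URL with its outcome.
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            pages[url] = result
    return pages, failures


pages, failures = asyncio.run(
    scrape_once(["https://example.com/ok", "https://example.com/bad"])
)
print(f"{len(pages)} fetched, {len(failures)} to retry: {list(failures)}")
```

The keys of failures form the retry queue for the next pass, and the stored exception objects say why each URL failed.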
Adding Retries with Backoff
Transient failures (429, 503, timeouts) are routine at scale. Wrap fetches in a retry loop with exponential backoff so a brief hiccup does not drop data.
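import asyncio
import httpx

async def fetch_with_retry(client, url, semaphore, retries=3):
    for attempt in range(retries):
        try:
            async with semaphore:
                response = await client.get(url, timeout=10)
                response.raise_for_status()
                return response.text
        except (httpx.HTTPStatusError, httpx.TransportError):
            # TransportError covers timeouts and connection failures.
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Back off outside the semaphore so the slot is free for other requests.
            await asyncio.sleep(2 ** attempt)  # 1s, then 2s with retries=3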
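The delays such a loop sleeps through can be tabulated with a small helper. This is a sketch, not part of HTTPX: backoff_delays is a hypothetical function, and the cap and optional jitter are additions that the loop above does not have. Jitter is commonly added so that many failed requests do not all retry at the same instant.

```python
import random


def backoff_delays(
    retries: int, base: float = 1.0, cap: float = 30.0, jitter: bool = False
) -> list[float]:
    """Delays slept between attempts; there is one fewer delay than attempts."""
    delays = []
    for attempt in range(retries - 1):
        delay = min(cap, base * 2 ** attempt)  # exponential growth, bounded by cap
        if jitter:
            delay = random.uniform(0, delay)  # "full jitter": pick anywhere up to the bound
        delays.append(delay)
    return delays


print(backoff_delays(3))            # [1.0, 2.0]
print(backoff_delays(6, cap=10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

With retries=3 there are only two sleeps, because the third failed attempt re-raises instead of waiting again.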
Async vs Threads vs Multiprocessing
- asyncio: best for high-concurrency, I/O-bound scraping. Hundreds of in-flight requests in one process, minimal overhead. Requires async-compatible libraries.
- Threads (concurrent.futures.ThreadPoolExecutor): good for moderate concurrency and when you must use synchronous libraries like requests. Simpler mental model, higher per-thread overhead.
- Multiprocessing: for CPU-bound stages such as parsing huge documents or running heavy regex; it sidesteps the GIL by using separate processes.
A common production shape is async fetching feeding a process pool for CPU-heavy parsing.
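A minimal sketch of that shape, under stated assumptions: fetch_stub simulates the network instead of calling HTTPX, and count_words stands in for real CPU-heavy parsing. Both names are hypothetical. The executor is passed in so the same coroutine works with any concurrent.futures executor.

```python
import asyncio
import concurrent.futures as cf


def count_words(html: str) -> int:
    """Stand-in for CPU-heavy parsing. Defined at module level so a
    process pool can pickle a reference to it."""
    return len(html.split())


async def fetch_stub(url: str) -> str:
    """Hypothetical fetch: simulates network wait instead of calling HTTPX."""
    await asyncio.sleep(0.01)
    return f"<html>page body for {url}</html>"


async def fetch_then_parse(urls: list[str], executor: cf.Executor) -> list[int]:
    loop = asyncio.get_running_loop()
    # Stage 1: I/O-bound fetching runs concurrently on the event loop.
    pages = await asyncio.gather(*(fetch_stub(u) for u in urls))
    # Stage 2: CPU-bound parsing is handed to the executor's workers;
    # run_in_executor returns awaitables, so the loop stays responsive.
    futures = [loop.run_in_executor(executor, count_words, page) for page in pages]
    return await asyncio.gather(*futures)
```

In production the executor would be a cf.ProcessPoolExecutor, created under an `if __name__ == "__main__":` guard so that platforms which spawn worker processes can import the module safely. Passing a cf.ThreadPoolExecutor instead is convenient for quick tests, though it does not bypass the GIL.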
Common Mistakes to Avoid
- Unbounded gather: launching every request at once overwhelms the target and your own machine. Always gate with a semaphore.
- Calling blocking code in a coroutine: time.sleep() or synchronous requests inside async code freezes the event loop. Use asyncio.sleep and an async client.
- Creating a client per request: that discards connection pooling. Create one AsyncClient and reuse it.
- Letting one failure kill the batch: use return_exceptions=True or per-task try/except so a single error does not discard the rest of the batch.
- Assuming async speeds up parsing: async only helps with waiting. CPU-bound parsing needs multiprocessing.
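The blocking-code mistake is easy to observe. In the sketch below, a heartbeat task should tick every 20 ms; blocking_work uses time.sleep as a stand-in for any synchronous call, such as a requests.get. Both helper names are hypothetical. When a blocking library cannot be avoided, asyncio.to_thread (Python 3.9+) moves the call to a worker thread so the loop keeps running.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    time.sleep(seconds)  # stands in for synchronous I/O such as requests.get()
    return "done"


async def heartbeat(ticks: list[float], interval: float = 0.02) -> None:
    """Records a timestamp every `interval` seconds while the loop is free."""
    while True:
        await asyncio.sleep(interval)
        ticks.append(time.perf_counter())


async def count_ticks(use_thread: bool) -> int:
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    if use_thread:
        await asyncio.to_thread(blocking_work, 0.2)  # loop keeps running
    else:
        blocking_work(0.2)  # blocks the only thread: the loop is frozen
    beat.cancel()
    return len(ticks)


frozen = asyncio.run(count_ticks(use_thread=False))
threaded = asyncio.run(count_ticks(use_thread=True))
print(f"ticks while blocked: {frozen}, ticks with to_thread: {threaded}")
```

The direct call yields zero ticks because nothing else can run while time.sleep holds the thread. The to_thread version lets the heartbeat fire several times during the same 0.2 seconds.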
Frequently Asked Questions
Should I use HTTPX or aiohttp?
Both are excellent. HTTPX has a requests-like API and supports both sync and async, making migration easy; aiohttp is async-only and battle-tested for high-throughput clients. Either is a solid choice.
How many concurrent requests should I allow?
Start around 5–10 per domain, monitor for 429/503 responses, and increase only if the server tolerates it. The right number depends entirely on the target's capacity and rules.
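A per-domain cap can be sketched by keeping one semaphore per host. PerDomainLimiter and polite_fetch are hypothetical helpers, the limit of 2 is only for demonstration, and asyncio.sleep stands in for the real request. The in_flight and peak dictionaries exist purely to show that the cap holds.

```python
import asyncio
from urllib.parse import urlsplit


class PerDomainLimiter:
    """Hypothetical helper: one semaphore per host, created on first use."""

    def __init__(self, per_domain: int = 5) -> None:
        self._per_domain = per_domain
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        host = urlsplit(url).netloc
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._per_domain)
        return self._semaphores[host]


async def polite_fetch(url: str, limiter: PerDomainLimiter,
                       in_flight: dict[str, int], peak: dict[str, int]) -> None:
    host = urlsplit(url).netloc
    async with limiter.for_url(url):
        in_flight[host] = in_flight.get(host, 0) + 1
        peak[host] = max(peak.get(host, 0), in_flight[host])
        await asyncio.sleep(0.01)  # stand-in for the real request
        in_flight[host] -= 1


async def main() -> dict[str, int]:
    limiter = PerDomainLimiter(per_domain=2)
    in_flight: dict[str, int] = {}
    peak: dict[str, int] = {}
    urls = [f"https://example.com/{i}" for i in range(6)]
    urls += [f"https://example.org/{i}" for i in range(3)]
    await asyncio.gather(*(polite_fetch(u, limiter, in_flight, peak) for u in urls))
    return peak


peak = asyncio.run(main())
print(peak)  # {'example.com': 2, 'example.org': 2}
```

Each host never sees more than two requests at once, while requests to different hosts still overlap freely.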
Is async scraping always faster?
Only for I/O-bound work, which most scraping is. If your bottleneck is parsing or data processing rather than network waiting, async will not help; profile first.
Can I use async with Scrapy?
Scrapy is already asynchronous under the hood, so you typically do not manage asyncio yourself — you configure concurrency through its settings. See Web Scraping with Scrapy.
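For orientation, the concurrency-related settings in a Scrapy project's settings.py look roughly like this. The setting names are Scrapy's own; the values are illustrative assumptions rather than recommendations and should be tuned per target.

```python
# settings.py (illustrative values, not recommendations)

# Global and per-domain caps on simultaneous requests.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Fixed delay (seconds) between requests to the same site.
DOWNLOAD_DELAY = 0.25

# AutoThrottle adjusts the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Built-in retry middleware: how often and for which status codes.
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 503]
```

These settings play the same roles as the semaphore, politeness delay, and retry loop in the hand-written asyncio version.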