Web Scraping Best Practices: Rate Limiting, Pagination, and Error Handling
Web scraping sounds simple in theory. Send an HTTP request, parse the HTML, extract the data. In practice, any scraper that needs to run reliably at scale quickly turns into an engineering project — rate limiting, pagination, error recovery, proxy management, and constant maintenance when the target site changes its markup.
This guide covers the essential patterns that separate hobby scrapers from production-grade data collection. Whether you're building your own scraper or evaluating whether a managed API makes more sense, these fundamentals apply.
Rate Limiting: Don't Get Banned
The fastest way to get blocked is to hammer a website with requests. Every production scraper needs rate limiting, and the approach matters more than you'd think.
Fixed-Interval Rate Limiting
The simplest approach: wait a fixed amount of time between requests.
```typescript
async function fetchWithDelay(
  urls: string[],
  delayMs: number = 1000
): Promise<Response[]> {
  const results: Response[] = [];
  for (const url of urls) {
    const response = await fetch(url);
    results.push(response);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```
This works for small-scale scraping, but it's inefficient. You wait 1 second even when the server can handle faster requests, and you don't slow down when the server is struggling.
Adaptive Rate Limiting
A smarter approach adjusts the delay based on server response signals:
```typescript
class AdaptiveRateLimiter {
  private delay: number;
  private readonly minDelay: number;
  private readonly maxDelay: number;

  constructor(initialDelay = 1000, minDelay = 200, maxDelay = 30000) {
    this.delay = initialDelay;
    this.minDelay = minDelay;
    this.maxDelay = maxDelay;
  }

  async wait(): Promise<void> {
    await new Promise((resolve) => setTimeout(resolve, this.delay));
  }

  onSuccess(): void {
    // Gradually speed up on success
    this.delay = Math.max(this.minDelay, this.delay * 0.9);
  }

  onRateLimit(): void {
    // Double the delay on rate limit
    this.delay = Math.min(this.maxDelay, this.delay * 2);
  }

  onError(): void {
    // Increase delay on errors
    this.delay = Math.min(this.maxDelay, this.delay * 1.5);
  }
}
```
When you get a 200 response, gradually speed up. When you get a 429 (Too Many Requests) or server error, back off aggressively. This naturally finds the maximum throughput the server will tolerate.
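The adjustment rules can also be expressed as a pure function, which makes the policy easy to unit test in isolation. This is a sketch mirroring the multipliers in the class above; the `nextDelay` name and `RateSignal` type are illustrative, and the constants should be tuned per target site:

```typescript
type RateSignal = "success" | "rateLimit" | "error";

// Returns the next delay given the current one and the server's response signal.
// Multipliers mirror AdaptiveRateLimiter: speed up slowly, back off fast.
function nextDelay(
  current: number,
  signal: RateSignal,
  minDelay = 200,
  maxDelay = 30000
): number {
  switch (signal) {
    case "success":
      return Math.max(minDelay, current * 0.9);
    case "rateLimit":
      return Math.min(maxDelay, current * 2);
    case "error":
      return Math.min(maxDelay, current * 1.5);
  }
}
```

A run of 429s doubles the delay each time until it hits the 30-second ceiling; a run of successes shrinks it toward the floor.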
Respecting robots.txt
Before scraping any site, check its robots.txt. It specifies which paths are off-limits and often includes a Crawl-delay directive:
```
User-agent: *
Crawl-delay: 10
Disallow: /private/
```
Ignoring robots.txt won't just get you blocked — it can have legal implications. If a site says don't scrape it, don't scrape it.
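As a rough sketch of honoring the directive above (simplified: a real parser must scope directives to the matching User-agent group, not just grab the first one it sees), you can pull the crawl delay out of a robots.txt body like this:

```typescript
// Extracts Crawl-delay (in seconds) from a robots.txt body, or null if absent.
// Simplified sketch: does not scope the directive to a specific User-agent.
function parseCrawlDelay(robotsTxt: string): number | null {
  for (const line of robotsTxt.split("\n")) {
    const match = line.trim().match(/^crawl-delay:\s*(\d+(?:\.\d+)?)/i);
    if (match) {
      return Number.parseFloat(match[1]);
    }
  }
  return null;
}
```

Feed the parsed value (in seconds) into your rate limiter as a minimum delay.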
Exponential Backoff: Handling Transient Failures
Network requests fail. Servers return 500 errors. Connections time out. A resilient scraper retries with exponential backoff — waiting longer between each attempt.
```typescript
class NonRetryableError extends Error {}

async function fetchWithRetry(
  url: string,
  maxRetries: number = 5
): Promise<Response> {
  let lastError: Error | null = null;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url, {
        signal: AbortSignal.timeout(10000),
      });
      // Rate limited: honor Retry-After when present, else back off exponentially
      if (response.status === 429) {
        const retryAfter = response.headers.get("Retry-After");
        const waitTime = retryAfter
          ? Number.parseInt(retryAfter, 10) * 1000
          : Math.pow(2, attempt) * 1000;
        await new Promise((resolve) => setTimeout(resolve, waitTime));
        continue;
      }
      // Don't retry other client errors (4xx): a retry won't change the outcome
      if (response.status >= 400 && response.status < 500) {
        throw new NonRetryableError(`Client error: ${response.status}`);
      }
      if (!response.ok) {
        throw new Error(`Server error: ${response.status}`);
      }
      return response;
    } catch (error) {
      if (error instanceof NonRetryableError) {
        throw error; // Propagate immediately instead of retrying
      }
      lastError = error as Error;
      if (attempt < maxRetries - 1) {
        // Exponential backoff with jitter
        const baseDelay = Math.pow(2, attempt) * 1000;
        const jitter = Math.random() * 1000;
        await new Promise((resolve) => setTimeout(resolve, baseDelay + jitter));
      }
    }
  }
  throw new Error(
    `Failed after ${maxRetries} attempts: ${lastError?.message ?? "rate limited"}`
  );
}
```
Why Jitter Matters
Without jitter, all your retry attempts happen at the exact same intervals. If you have multiple scraper instances, they'll all retry simultaneously, creating thundering herd problems. Adding random jitter spreads the retries across time, reducing pressure on the target server.
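A common variant, often called "full jitter," draws the whole delay at random from zero up to the exponential ceiling rather than adding a small random offset on top. A minimal sketch (function name and defaults are illustrative):

```typescript
// Full-jitter backoff: pick a random delay in [0, min(cap, base * 2^attempt)].
// Spreads concurrent retries across the whole window instead of clustering them.
function fullJitterDelay(attempt: number, baseMs = 1000, capMs = 30000): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.random() * ceiling;
}
```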
Pagination: Getting All the Data
Most websites don't serve all their data on a single page. You need to handle pagination — and there are several patterns you'll encounter.
Cursor-Based Pagination
The most reliable pagination method. The API returns a cursor (or next_page_token) that you pass in the next request:
```typescript
async function fetchAllPages<T>(baseUrl: string): Promise<T[]> {
  const allItems: T[] = [];
  let cursor: string | null = null;
  do {
    // Assumes baseUrl already has a query string; always encode the cursor value
    const url = cursor
      ? `${baseUrl}&cursor=${encodeURIComponent(cursor)}`
      : baseUrl;
    const response = await fetchWithRetry(url);
    const data = await response.json();
    allItems.push(...data.items);
    cursor = data.next_cursor ?? null;
  } while (cursor);
  return allItems;
}
```
Offset-Based Pagination
Common in older APIs and HTML scraping. You specify a page number or offset:
```typescript
async function fetchWithOffset<T>(
  baseUrl: string,
  pageSize: number = 20
): Promise<T[]> {
  const allItems: T[] = [];
  let offset = 0;
  let hasMore = true;
  while (hasMore) {
    const url = `${baseUrl}?offset=${offset}&limit=${pageSize}`;
    const response = await fetchWithRetry(url);
    const data = await response.json();
    allItems.push(...data.items);
    offset += pageSize;
    hasMore = data.items.length === pageSize;
  }
  return allItems;
}
```
The gotcha with offset pagination: if new items are added while you're paginating, you can miss items or get duplicates. Cursor-based pagination doesn't have this problem.
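If you are stuck with offset pagination against a changing dataset, deduplicating by a stable identifier at least catches the duplicate case (this sketch assumes each item exposes an `id` field; missed items still require a second pass):

```typescript
// Merges paginated results, dropping items whose id has already been seen.
function dedupeById<T extends { id: string }>(pages: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const page of pages) {
    for (const item of page) {
      if (!seen.has(item.id)) {
        seen.add(item.id);
        merged.push(item);
      }
    }
  }
  return merged;
}
```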
Infinite Scroll Scraping
Many modern sites (including social media platforms) use infinite scroll instead of traditional pagination. Scraping these requires a headless browser that can scroll the page and wait for new content to load:
```typescript
// Pseudocode — requires Puppeteer or Playwright
async function scrapeInfiniteScroll(page: Page, maxScrolls: number = 10) {
  let previousHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
    await page.waitForTimeout(2000);
    const currentHeight = await page.evaluate("document.body.scrollHeight");
    if (currentHeight === previousHeight) {
      break; // No new content loaded
    }
    previousHeight = currentHeight;
  }
}
```
This is slow, resource-intensive, and fragile. It's one of the main reasons developers switch from DIY scraping to managed APIs — you trade browser automation complexity for a simple HTTP request.
Error Handling: Beyond Try-Catch
Production scrapers encounter a wide variety of errors. Handle them specifically:
Categorizing Errors
```typescript
enum ScrapingErrorType {
  RateLimit = "RATE_LIMIT",
  Blocked = "BLOCKED",
  NotFound = "NOT_FOUND",
  ParseError = "PARSE_ERROR",
  NetworkError = "NETWORK_ERROR",
  ServerError = "SERVER_ERROR",
  CaptchaRequired = "CAPTCHA_REQUIRED",
}

function categorizeError(
  response: Response | null,
  _error: Error | null
): ScrapingErrorType {
  // No response at all: the request itself failed (DNS, timeout, reset, ...)
  if (!response) {
    return ScrapingErrorType.NetworkError;
  }
  switch (response.status) {
    case 429:
      return ScrapingErrorType.RateLimit;
    case 403:
      return ScrapingErrorType.Blocked;
    case 404:
      return ScrapingErrorType.NotFound;
    default:
      if (response.status >= 500) {
        return ScrapingErrorType.ServerError;
      }
      // Got a usable response but extraction failed downstream
      return ScrapingErrorType.ParseError;
  }
}
```
Handling Each Error Type Differently
- Rate limit (429): Back off exponentially. Check the Retry-After header.
- Blocked (403): Rotate proxy/IP. If persistent, the site may have detected your scraping pattern.
- Not found (404): Log and skip. The page may have been removed.
- Parse error: The HTML structure likely changed. Alert immediately — this usually requires code changes.
- Network error: Retry with backoff. Could be a transient DNS or connectivity issue.
- Captcha: Stop and alert. Automated captcha solving is fragile and often against ToS.
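These rules can be centralized in one policy function so every call site reacts consistently. The enum is repeated here to keep the sketch self-contained, and the action flags are illustrative names, not a standard API:

```typescript
enum ScrapingErrorType {
  RateLimit = "RATE_LIMIT",
  Blocked = "BLOCKED",
  NotFound = "NOT_FOUND",
  ParseError = "PARSE_ERROR",
  NetworkError = "NETWORK_ERROR",
  ServerError = "SERVER_ERROR",
  CaptchaRequired = "CAPTCHA_REQUIRED",
}

interface ErrorPolicy {
  retry: boolean; // Try the request again?
  rotateProxy: boolean; // Switch to a different IP first?
  alert: boolean; // Page a human?
}

// Maps each error category to the handling described in the list above.
function policyFor(type: ScrapingErrorType): ErrorPolicy {
  switch (type) {
    case ScrapingErrorType.RateLimit:
    case ScrapingErrorType.NetworkError:
    case ScrapingErrorType.ServerError:
      return { retry: true, rotateProxy: false, alert: false };
    case ScrapingErrorType.Blocked:
      return { retry: true, rotateProxy: true, alert: false };
    case ScrapingErrorType.NotFound:
      return { retry: false, rotateProxy: false, alert: false };
    case ScrapingErrorType.ParseError:
    case ScrapingErrorType.CaptchaRequired:
      return { retry: false, rotateProxy: false, alert: true };
  }
}
```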
Data Validation: Don't Trust the Response
Just because you got a 200 response doesn't mean the data is correct. Validate what you receive:
```typescript
interface ScrapedProfile {
  username: string;
  follower_count: number;
  post_count: number;
  biography: string;
}

function validateProfile(data: unknown): ScrapedProfile {
  if (!data || typeof data !== "object") {
    throw new Error("Invalid profile data: not an object");
  }
  const profile = data as Record<string, unknown>;
  if (typeof profile.username !== "string" || profile.username.length === 0) {
    throw new Error("Invalid profile: missing username");
  }
  if (
    typeof profile.follower_count !== "number" ||
    profile.follower_count < 0
  ) {
    throw new Error("Invalid profile: bad follower count");
  }
  return profile as unknown as ScrapedProfile;
}
```
Common validation catches:
- Empty responses that returned 200 status (login walls, soft blocks)
- Stale data where the page served a cached version
- Partial data where some fields are missing due to rendering issues
- Completely wrong pages (redirects to login or error pages that still return 200)
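A cheap guard against the last two cases is to inspect the final URL and body before parsing. The marker strings below are purely illustrative, not any real platform's markup, and the function name is an assumption:

```typescript
// Heuristic check for 200 responses that are actually login walls or error pages.
// finalUrl is the URL after redirects (response.url when using fetch).
function looksLikeSoftBlock(finalUrl: string, html: string): boolean {
  const url = finalUrl.toLowerCase();
  if (url.includes("/login") || url.includes("/accounts/signin")) {
    return true;
  }
  const body = html.toLowerCase();
  // Illustrative markers; tune these for the specific site you scrape
  const markers = ["log in to continue", "please verify you are human"];
  return markers.some((m) => body.includes(m));
}
```

Run a check like this before handing the HTML to your parser, and count hits as blocks rather than parse errors.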
Proxy Management
For scraping at scale, you'll need proxies. A single IP making thousands of requests gets blocked fast.
Proxy Rotation Basics
```typescript
class ProxyRotator {
  private proxies: string[];
  private currentIndex: number = 0;
  private failCounts: Map<string, number> = new Map();

  constructor(proxies: string[]) {
    this.proxies = proxies;
  }

  getNext(): string {
    // Skip proxies that have failed too many times
    let attempts = 0;
    while (attempts < this.proxies.length) {
      const proxy = this.proxies[this.currentIndex];
      this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
      const fails = this.failCounts.get(proxy) ?? 0;
      if (fails < 5) {
        return proxy;
      }
      attempts++;
    }
    // Reset fail counts if all proxies are exhausted
    this.failCounts.clear();
    return this.proxies[0];
  }

  markFailed(proxy: string): void {
    const current = this.failCounts.get(proxy) ?? 0;
    this.failCounts.set(proxy, current + 1);
  }

  markSuccess(proxy: string): void {
    this.failCounts.delete(proxy);
  }
}
```
Proxy management alone is a significant operational burden. Residential proxies cost $5-15 per GB. Datacenter proxies are cheaper but more easily detected. Managing proxy pools, monitoring their health, and rotating them effectively is essentially running a small infrastructure service.
When to Use a Managed API Instead
After reading all of the above, you might be thinking: "This is a lot of engineering for data collection." You're right.
Building and maintaining a production scraper means handling:
- Rate limiting and adaptive backoff
- Proxy rotation and IP management
- Browser automation for JavaScript-rendered pages
- CAPTCHA detection and handling
- HTML parsing that breaks when the site updates
- Session management and cookie handling
- Data validation and error recovery
- Infrastructure monitoring and alerting
For social media platforms specifically, these challenges are amplified. Instagram and TikTok aggressively block scrapers, rotate their internal APIs, and change their page structure regularly.
A managed API like EternalSocial handles all of this behind a single endpoint. You send an HTTP request, you get clean JSON back:
```bash
# Instead of all the above complexity:
curl "https://api.eternalsocial.dev/v1/instagram/posts?username=brand&limit=20" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
No proxies. No browser automation. No HTML parsing. No maintenance when Instagram changes their frontend. The tradeoff is cost — you're paying for someone else to solve these problems. For most teams, that tradeoff is worth it.
Summary: The Decision Framework
| Factor      | DIY Scraping                      | Managed API                     |
| ----------- | --------------------------------- | ------------------------------- |
| Setup time  | Days to weeks                     | Minutes                         |
| Maintenance | Ongoing, unpredictable            | None                            |
| Reliability | Depends on your engineering       | Provider's SLA                  |
| Cost        | Infrastructure + engineering time | API pricing                     |
| Flexibility | Complete control                  | Limited to available endpoints  |
| Scale       | You manage infrastructure         | Provider manages infrastructure |
If you're scraping a niche site that no API covers, DIY scraping is your only option — and the patterns in this guide will serve you well.
If you're scraping major social media platforms, the engineering cost of doing it yourself almost always exceeds the cost of a managed API. The time you save goes into building the product features that actually differentiate your business.
Get Started
If you're collecting data from Instagram or TikTok, try EternalSocial's API — get a free API key and start making requests in minutes. The API documentation covers every available endpoint with request and response examples. You can evaluate it against your DIY scraper and decide which approach fits your use case.