Scrapling - Advanced Web Scraping API

A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).

Features

  • 🚀 REST API - FastAPI-based endpoints for programmatic access
  • 🤖 AI-Powered Extraction - Natural language queries for content extraction
  • 🔐 Session Management - Persistent sessions for efficient batch processing
  • 🌐 Multiple Scraping Modes:
    • Standard HTTP (fast; for sites with little bot protection)
    • Dynamic fetching (JavaScript support)
    • Stealthy browser (anti-bot bypass)
  • 📊 Structured Output - Returns data in JSON, Markdown, HTML, or Text formats
  • 🎨 Gradio UI - Interactive web interface for testing

API Endpoints

Base URL

https://grazieprego-scrapling.hf.space

Quick Reference

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Check API status |
| /api/scrape | POST | Stateless scrape request |
| /api/session | POST | Create persistent session |
| /api/session/{id}/scrape | POST | Scrape using session |
| /api/session/{id} | DELETE | Close session |
| /docs | GET | API documentation (HTML) |
| /api-docs | GET | API documentation (JSON) |

Usage Examples

1. Stateless Scrape (One-off requests)

curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
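The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library. The helper names `build_scrape_request` and `scrape_once` are illustrative and not part of the API, and the 60-second timeout is an assumed value rather than a documented limit.

```python
import json
import urllib.request

BASE_URL = "https://grazieprego-scrapling.hf.space"


def build_scrape_request(url, query, model_name="alias-fast", base_url=BASE_URL):
    """Build (but do not send) the POST request for /api/scrape."""
    body = json.dumps({"url": url, "query": query, "model_name": model_name})
    return urllib.request.Request(
        f"{base_url}/api/scrape",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape_once(url, query, timeout=60):
    """Send the request and decode the JSON body (timeout is an assumed value)."""
    request = build_scrape_request(url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `scrape_once("https://example.com", "Extract all product prices")` would then return the decoded JSON response as a dictionary.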

2. Session-Based Scraping (Multiple requests)

import requests

BASE_URL = 'https://grazieprego-scrapling.hf.space'

# Create session; fail fast if the API rejects the request.
# Timeout values below are illustrative, not documented limits.
session = requests.post(
    f'{BASE_URL}/api/session',
    json={'model_name': 'alias-fast'},
    timeout=30
)
session.raise_for_status()
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    for url in urls:
        result = requests.post(
            f'{BASE_URL}/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'},
            timeout=120
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session, even if a scrape above raised
    requests.delete(f'{BASE_URL}/api/session/{session_id}', timeout=30)
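The create/scrape/close pattern above can also be wrapped in a context manager so the session is closed automatically. This is a minimal sketch: `scraping_session` is a hypothetical helper, and `client` is assumed to be any object with requests-style `post()` and `delete()` methods (for example the `requests` module itself).

```python
from contextlib import contextmanager

BASE_URL = "https://grazieprego-scrapling.hf.space"


@contextmanager
def scraping_session(client, model_name="alias-fast", base_url=BASE_URL):
    """Create a session, yield its id, and close it on exit.

    `client` is any object with requests-style post() and delete()
    methods, e.g. the `requests` module or a `requests.Session()`.
    """
    response = client.post(f"{base_url}/api/session", json={"model_name": model_name})
    response.raise_for_status()
    session_id = response.json()["session_id"]
    try:
        yield session_id
    finally:
        # Runs even if a scrape inside the `with` block raises.
        client.delete(f"{base_url}/api/session/{session_id}")
```

With `requests` installed, `with scraping_session(requests) as session_id:` gives a session id for the body of the block and issues the DELETE call on the way out.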

3. Using the Gradio UI

Visit the space URL and use the interactive interface:

  • Fetch (HTTP) tab: For standard HTTP scraping
  • Stealthy Fetch (Browser) tab: For sites with bot protection

API Documentation

Request Parameters

/api/scrape & /api/session/{id}/scrape

{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}

Parameters:

  • url (string, required): The URL to scrape
  • query (string, required): Natural language extraction instruction
  • model_name (string, optional): AI model to use (default: "alias-fast")
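The parameter rules above can be checked on the client before a request is sent. The sketch below is one way to do that; `ScrapeRequest` is a hypothetical helper class, and the restriction to absolute http(s) URLs is an assumption about what the API accepts, not documented behavior.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ScrapeRequest:
    url: str                          # required
    query: str                        # required
    model_name: str = "alias-fast"    # optional, documented default

    def __post_init__(self):
        parsed = urlparse(self.url)
        # Assumption: only absolute http(s) URLs are meaningful targets.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"url must be an absolute http(s) URL: {self.url!r}")
        if not self.query.strip():
            raise ValueError("query must not be empty")

    def to_payload(self):
        """Return the JSON-serializable request body."""
        return asdict(self)
```

`ScrapeRequest("https://example.com", "Extract all headings and prices").to_payload()` yields a dictionary with the same three keys as the request body shown above.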

/api/session

{
  "model_name": "alias-fast"
}

Response Format

{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
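A response of this shape can be unpacked with a small helper. The sketch below assumes that the nested `status` field is the HTTP status of the fetched page and that anything other than 200 should be treated as a failure; both are assumptions, and `extract_content` is an illustrative name.

```python
def extract_content(payload):
    """Return the list of extracted strings from a scrape response."""
    inner = payload.get("response") or {}
    status = inner.get("status")
    # Assumption: a non-200 nested status means the page was not scraped.
    if status != 200:
        raise ValueError(f"target page returned status {status!r}")
    content = inner.get("content", [])
    # The documented example shows a list; a bare string is wrapped defensively.
    return [content] if isinstance(content, str) else list(content)
```

Applied to the example response above, the helper returns the two product strings as a list.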

Best Practices

  1. Use stateless endpoints for one-off requests
  2. Use sessions for batch processing multiple URLs
  3. Always close sessions when finished to free resources
  4. Implement error handling - 500 errors may occur on complex sites
  5. Add retry logic for production use
  6. Respect rate limits - use responsibly
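The retry advice above could be implemented with exponential backoff. This is a minimal sketch, not a recommendation from the API authors: `with_retries` is a hypothetical helper, the attempt count and delays are assumed values, and production code would normally catch only network and server errors rather than every exception.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on an exception wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Any zero-argument callable that performs one scrape can be passed as `call`, and the spacing between attempts also helps keep request volume low on a shared demonstration space.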

Error Handling

  • 404: Session not found
  • 500: Internal server error (check detail field for specifics)
  • Common issues:
    • URL unreachable or timeout
    • JavaScript-heavy sites may need the stealthy browser mode (stealthy_fetch)
    • Bot protection may block requests
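The status codes above can be turned into explicit exceptions on the client side. In the sketch below the exception classes and `check_api_response` are hypothetical names; only the 404 and 500 meanings and the `detail` field come from this document, and treating every other 4xx/5xx code as a failure is an assumption.

```python
class SessionNotFound(Exception):
    """Raised when the API answers 404 for a session id."""


class ScrapeFailed(Exception):
    """Raised when the API reports a server-side failure."""


def check_api_response(status_code, body):
    """Return body unchanged on success, raise a descriptive error otherwise."""
    if status_code == 404:
        raise SessionNotFound("session not found (it may already be closed)")
    if status_code >= 400:
        # 500 responses carry specifics in the `detail` field.
        if isinstance(body, dict):
            detail = body.get("detail", "no detail provided")
        else:
            detail = str(body)
        raise ScrapeFailed(f"HTTP {status_code}: {detail}")
    return body
```

A caller would pass the HTTP status code and the decoded JSON body of each API response through this function before using the result.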

Deployment

This space uses Docker with:

  • Python 3.11
  • FastAPI + Uvicorn
  • Gradio 5.x
  • Playwright for browser automation
  • Scrapling for advanced scraping

License

MIT License - See LICENSE file for details

Credits

Built with Scrapling - Advanced web scraping library


Note: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.
