web-scraper — quality + safety report

Item: web-scraper
Rating: 92
Author: Skillproof

In the Skillier index (antigravity__web-scraper) · scanned 2026-06-03 · engine: builtin+triage

Quality

92/100

Safety

1 heuristic flag to review

Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict; re-scan with SkillSpector for the authoritative result.

Quality notes

Skill is large (~7109 tokens)

medium · quality · body

→ Tighten to the essential procedure; move long reference material to linked files.

About this skill

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

📄 Read the SKILL.md

---
name: web-scraper
description: Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.
risk: safe
source: community
date_added: '2026-03-06'
author: renat
tags:
- scraping
- data-extraction
- automation
- csv
tools:
- claude-code
- antigravity
- cursor
- gemini-cli
- codex-cli
---

# Web Scraper

## Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

## When to Use This Skill

- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" or related topics
- When the user mentions "coletar dados site" or related topics

## Do Not Use This Skill When

- The task is unrelated to web scraping
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise

## How It Works

Execute phases in strict order. Each phase feeds the next.

```
1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT
```

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

**Fast path**: If the user provides a URL and a clear data target, and the
request is simple (single page, one data type), compress Phases 1-3 into a
single action: fetch, classify, and extract in one WebFetch call. Still
validate and format.

---

## Capabilities

- **Multi-strategy**: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- **Extraction modes**: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- **Output formats**: Markdown tables (default), JSON, CSV
- **Pagination**: auto-detect and follow (page numbers, infinite scroll, load-more)
- **Multi-URL**: extract same structure across sources with comparison and diff
- **Validation**: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- **Auto-escalation**: WebFetch fails silently -> automatic Browser fallback
- **Data transforms**: cleaning, normalization, deduplication, enrichment
- **Differential mode**: detect changes between scraping runs
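
As an illustration of the differential mode listed above, here is a minimal sketch that compares two runs of extracted rows. The function name `diff_runs` and the idea of matching rows on a single key field are assumptions made for this example; the skill itself does not specify how changes are detected.

```python
from typing import Any, Dict, List


def diff_runs(previous: List[Dict[str, Any]],
              current: List[Dict[str, Any]],
              key: str) -> Dict[str, list]:
    """Compare two scraping runs, matching rows on a (hypothetical) key field."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    # Rows whose key only exists in the new run
    added = [curr[k] for k in curr if k not in prev]
    # Rows whose key disappeared in the new run
    removed = [prev[k] for k in prev if k not in curr]
    # Rows present in both runs but with at least one differing value
    changed = [
        {"key": k, "before": prev[k], "after": curr[k]}
        for k in curr
        if k in prev and prev[k] != curr[k]
    ]
    return {"added": added, "removed": removed, "changed": changed}
```

Matching on a stable field such as a product URL or SKU tends to be more robust than comparing row positions, since source order can change between runs.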

## Web Scraper

Multi-strategy web data extraction with intelligent approach selection,
automatic fallback escalation, data transformation, and structured output.

## Phase 1: Clarify

Establish extraction parameters before touching any URL.

## Required Parameters

| Parameter     | Resolve                              | Default        |
|:--------------|:-------------------------------------|:---------------|
| Target URL(s) | Which page(s) to scrape?             | *(required)*   |
| Data Target   | What specific data to extract?       | *(required)*   |
| Output Format | Markdown table, JSON, CSV, or text?  | Markdown table |
| Scope         | Single page, paginated, or multi-URL?| Single page    |

## Optional Parameters

| Parameter     | Resolve                                | Default      |
|:--------------|:---------------------------------------|:-------------|
| Pagination    | Follow pagination? Max pages?          | No, 1 page   |
| Max Items     | Maximum number of items to collect?    | Unlimited    |
| Filters       | Data to exclude or include?            | None         |
| Sort Order    | How to sort results?                   | Source order |
| Save Path     | Save to file? Which path?              | Display only |
| Language      | Respond in which language?             | User's lang  |
| Diff Mode     | Compare with previous run?             | No           |
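
The required and optional parameters above could be held in one small structure so that defaults are applied consistently. This is only a sketch: the class name `ExtractionRequest`, its field names, and the `missing()` helper are hypothetical and not part of the skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionRequest:
    """Hypothetical container for the Phase 1 parameters and their defaults."""
    target_urls: List[str]                 # required
    data_target: str                       # required
    output_format: str = "markdown"        # Markdown table by default
    scope: str = "single"                  # single, paginated, or multi-url
    follow_pagination: bool = False        # default: no, 1 page
    max_pages: int = 1
    max_items: Optional[int] = None        # None means unlimited
    filters: List[str] = field(default_factory=list)
    sort_order: str = "source"             # keep source order
    save_path: Optional[str] = None        # None means display only
    language: Optional[str] = None         # None means the user's language
    diff_mode: bool = False

    def missing(self) -> List[str]:
        """Return the names of required parameters that are still empty."""
        gaps = []
        if not self.target_urls:
            gaps.append("target_urls")
        if not self.data_target.strip():
            gaps.append("data_target")
        return gaps
```

If `missing()` returns only `data_target`, that corresponds to the ambiguous "scrape this site" case, where a single clarifying question is enough.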

## Clarification Rules

- If user provides a URL and clear data target, proceed directly to Phase 2.
  Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY:
  "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present
  what's available and let user choose.

## Discovery Mode

When user has a topic but no specific URL:
1. Use WebSearch to find the most relevant pages
2. Present top 3-5 URLs with descriptions
3. Let user choose which to scrape, or scrape all
4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools"
-> WebSearch("CRM tools pricing comparison 2026")
-> Present top results -> User selects -> Extract

---

## Phase 2: Reconnaissance

Analyze the target page before extraction.

## Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

```
WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
```

## Step 2.2: Evaluate Fetch Quality

| Signal                                      | Interpretation                    | Action                    |
|:--------------------------------------------|:----------------------------------|:--------------------------|
| Rich content with data clearly visible      | Static page                       | Strategy A (WebFetch)     |
| Empty containers, "loading...", minimal text | JS-rendered                       | Strategy B (Browser)      |
| Login wall, CAPTCHA, 403/401 response       | Blocked                           | Report to user            |
| Content present but poorly structured       | Needs precision                   | Strategy B (Browser)      |
| JSON or XML response body                   | API endpoint                      | Strategy C (Bash/curl)    |
| Download links for CSV/Excel available      | Direct data file                  | Strategy C (download)     |

## Step 2.3: Content Classification

Classify into an extraction mode:

| Mode       | Indicators                                 | Examples                          |
|:-----------|:-------------------------------------------|:----------------------------------|
| `table`    | HTML `<table>`, grid layout with headers   | Price comparison, statistics, specs|
| `list`     | Repeated similar elements, card grids      | Search results, product listings  |
| `article`  | Long-form text with headings/paragraphs    | Blog post, news article, docs     |
| `product`  | Product name, price, specs, images, rating | E-commerce product page           |
| `contact`  | Names, emails, phones, addresses, roles    | Team page, staff directory        |
| `faq`      | Question-answer pairs, accordions          | FAQ page, help center             |
| `pricing`  | Plan names, prices, features, tiers        | SaaS pricing page                 |
| `events`   | Dates, locations, titles, descriptions     | Event listings, conferences       |
| `jobs`     | Titles, companies, locations, salaries     | Job boards, career pages          |
| `custom`   | User-specified CSS selectors or fields     | Anything not matching above       |

Record: **page type**, **extraction mode**, **JS rendering needed (yes/no)**,
**available fields**, **structured data present (JSON-LD etc.)**.

If user asked for "everything", present the available fields and let them choose.
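
To check whether structured data is present, JSON-LD blocks can be pulled out of the raw HTML with the Python standard library alone. This is a sketch under the assumption that the raw HTML is available as a string; the class and function names are hypothetical, and microdata and OpenGraph tags are not covered.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.blocks: List[Any] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html: str) -> List[Any]:
    """Return every JSON-LD block found in the page, already parsed."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

If the returned blocks already contain the requested fields, the extraction can work from them directly rather than from the rendered markup.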

---

## Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

## Decision Tree

```
Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
```
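
The decision tree above maps directly onto a chain of early returns. In this sketch the `Recon` class and its boolean field names are hypothetical stand-ins for the facts recorded during reconnaissance.

```python
from dataclasses import dataclass


@dataclass
class Recon:
    """Hypothetical summary of the Phase 2 findings."""
    structured_data_sufficient: bool = False  # JSON-LD/microdata has what we need
    content_visible: bool = False             # content fully visible in WebFetch
    needs_precise_targeting: bool = False
    js_rendering: bool = False
    api_or_download: bool = False             # API/JSON/XML endpoint or download link


def select_strategy(r: Recon) -> str:
    """Walk the decision tree top to bottom and return a strategy label."""
    if r.structured_data_sufficient:
        return "E"
    if r.content_visible:
        return "B" if r.needs_precise_targeting else "A"
    if r.js_rendering:
        return "B"
    if r.api_or_download:
        return "C"
    return "REPORT"  # no viable strategy: report the access issue to the user
```

Because the checks run in the same order as the tree, a page that has both usable JSON-LD and visible content still resolves to Strategy E.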

## Strategy A: WebFetch with AI Extraction

**Best for**: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

```
WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)
```

**Auto-escalation**: If WebFetch returns suspiciously few items (fewer than
50% of the count expected from recon), or mostly empty fields, automatically
escalate to Strategy B without asking the user. Log the escalation in notes.
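
The escalation rule can be expressed as a small check over the extracted rows. The 50% item threshold comes from the rule itself; treating "mostly empty" as more than half of all field values being blank or 'N/A' is an assumption, as are the function names.

```python
from typing import Dict, List, Optional, Sequence

EMPTY_VALUES = {"", "n/a"}


def is_empty(value: Optional[str]) -> bool:
    """Treat None, blank strings, and 'N/A' (any case) as empty."""
    return value is None or str(value).strip().lower() in EMPTY_VALUES


def should_escalate(items: List[Dict[str, Optional[str]]],
                    expected: int,
                    fields: Sequence[str]) -> bool:
    """Return True when a Strategy A result looks too thin to trust."""
    # Fewer than 50% of the items counted during recon
    if expected > 0 and len(items) < 0.5 * expected:
        return True
    total = len(items) * len(fields)
    if total == 0:
        return False
    # Assumed reading of "mostly empty": more than half of all values are blank
    empty = sum(is_empty(item.get(f)) for item in items for f in fields)
    return empty > total / 2
```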

## Strategy B: Browser Automation

**Best for**: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:
1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId
2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`
3. Wait for content to load: `computer(action="wait", duration=3, tabId=TAB)`
4. Check for cookie/consent banners: `find(query="cookie consent or accept button", tabId=TAB)`
   - If found, dismiss it (prefer privacy-preserving option)
5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`
6. Locate target elements: `find(query="[DESCRIPTION]", tabId=TAB)`
7. Extract with JavaScript for precise data via `javascript_tool`

```javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
```

```javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
```

8. For lazy-loaded content, scroll and re-extract:
   `computer(action="scroll", scroll_direction="down", tabId=TAB)`
   then `computer(action="wait", duration=2, tabId=TAB)`

## Strategy C: Bash (curl + jq)

**Best for**: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.

```bash
# JSON API
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

# CSV download
curl -s "CSV_URL" -o /tmp/scraped_data.csv

# XML parsing
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
```
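
The XML parsing step above leaves the parsing body elided. One possible shape for it, assuming an RSS-like feed where each record is an `<item>` element with simple child tags, is sketched below. The function name and the `item_tag` default are assumptions; feeds that use XML namespaces would need namespace-qualified tag names.

```python
import json
import xml.etree.ElementTree as ET


def xml_items_to_json(xml_text: str, item_tag: str = "item") -> str:
    """Turn each <item_tag> element into a flat dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    records = [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter(item_tag)
    ]
    return json.dumps(records, ensure_ascii=False)
```

Nested children are flattened to their direct text only, which is usually enough for feeds but loses structure in deeper documents.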

## Strategy D: Hybrid

When a single strategy is insufficient, combine:
1. WebSearch to discover re

… (truncated)

Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.