liteparse — quality + safety report

Item: liteparse
Rating: 92
Author: Skillproof

In the Skillier index (kdense-scientific__liteparse) · scanned 2026-06-03 · engine: builtin+triage

Quality

92/100

Safety

1 heuristic flag to review

Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict; re-scan with SkillSpector for the authoritative result.


Quality notes

Skill is large (~2142 tokens)

medium · quality · body

→ Tighten to the essential procedure; move long reference material to linked files.

About this skill

Local document and PDF parsing with spatial text and bounding boxes. Use for extracting text from PDFs, DOCX, Office files, and images; OCR on scans; layout-preserved JSON for RAG; batch-ingesting paper folders; or page screenshots for multimodal agents — even when the user does not name liteparse.…

📄 Read the SKILL.md

---
name: liteparse
description: Local document and PDF parsing with spatial text and bounding boxes. Use for extracting text from PDFs, DOCX, Office files, and images; OCR on scans; layout-preserved JSON for RAG; batch-ingesting paper folders; or page screenshots for multimodal agents, even when the user does not name liteparse. Prefer over MarkItDown when you need bboxes, fast local parsing, or PNG page renders; prefer the pdf skill for merge/split/forms.
license: Apache-2.0
allowed-tools: Read Write Edit Bash
compatibility: Python 3.10+. Optional LibreOffice (Office formats) and ImageMagick (images). Bundled Tesseract for OCR. All processing is local — no cloud API required.
metadata:
  version: "1.0"
  skill-author: K-Dense Inc.
---

# LiteParse — Local Document Parsing

## Overview

LiteParse is a fast, open-source document parser (Rust core, Python/Node bindings) focused on **local, layout-aware text extraction** with bounding boxes. It does not produce Markdown and does not call cloud LLMs. Outputs are **plain text** (layout-preserved) or **structured JSON** with per-page `text_items` (position, font metadata, optional confidence).

**Version note:** Examples target **liteparse 2.0.0** (PyPI, May 2026). The upstream V1 branch is legacy; this skill documents **V2 / main** only.

For parser selection vs MarkItDown, the `pdf` skill, or LlamaParse, see `references/choosing_a_parser.md`.

## When to Use This Skill

Use LiteParse when you need:

- **Fast local parsing** of PDFs or converted Office/image files without cloud dependencies
- **Spatial text** with bounding boxes for layout-aware RAG, citation grounding, or figure/table region logic
- **OCR** on scanned PDFs or images (bundled Tesseract, or a user-run HTTP OCR server)
- **Page screenshots** (PNG) for multimodal agents that must see charts, figures, or handwriting
- **Batch ingestion** of literature folders, supplementary PDFs, or protocol libraries
- **Page subsets** or **password-protected** PDFs

## When Not to Use

| Task | Use instead |
|------|-------------|
| Markdown for LLM ingestion (EPUB, audio, YouTube, HTML) | `markitdown` skill |
| Merge/split PDFs, forms, watermarks, rotation | `pdf` skill |
| Dense tables, handwriting, production cloud pipelines | [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/overview) (cloud; sign up separately) |

## Installation

```bash
uv pip install "liteparse==2.0.0"
```

This installs the Python bindings and the **`lit`** CLI. Verify:

```bash
lit --help
python -c "import liteparse; print(liteparse.__version__)"
```

**Optional system tools** (for non-PDF inputs):

- **LibreOffice** — Word, Excel, PowerPoint, OpenDocument, CSV/TSV
- **ImageMagick** — PNG, JPEG, TIFF, WebP, SVG, etc.

Install commands are in `references/ocr_and_formats.md`.

**Node.js / TypeScript** (optional): `npm i @llamaindex/liteparse` — see `references/api_reference.md`.

---

## Quick Start

### Python

```python
from liteparse import LiteParse

parser = LiteParse(quiet=True)
result = parser.parse("paper.pdf")
print(result.text)

for page in result.pages:
    print(f"Page {page.page_num}: {len(page.text_items)} items")
```

### CLI

```bash
# Layout-preserved text (default)
lit parse paper.pdf

# Structured JSON with bounding boxes
lit parse paper.pdf --format json -o paper.json

# Disable OCR on text-native PDFs (faster)
lit parse paper.pdf --no-ocr
```

---

## Core Workflows

### 1. Parse to layout-preserved text

Best for quick full-document text or feeding chunkers that do not need coordinates.

```python
from liteparse import LiteParse

parser = LiteParse(ocr_enabled=True, quiet=True)
result = parser.parse("document.pdf")
full_text = result.text
```

```bash
lit parse document.pdf -o output.txt
```

### 2. Parse to structured JSON (bounding boxes)

Use when building layout-aware RAG, highlighting source regions, or joining text with screenshots.

```python
import json
from liteparse import LiteParse

parser = LiteParse(output_format="json", quiet=True)
result = parser.parse("document.pdf")

# Programmatic access
for page in result.pages:
    for item in page.text_items:
        bbox = (item.x, item.y, item.width, item.height)
        # item.text, item.confidence, item.font_name, item.font_size
```

```bash
lit parse document.pdf --format json -o document.json
```

JSON field layout: `references/output_formats.md`.
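
Once items carry `x`, `y`, `width`, and `height`, a common next step is to turn them into corner coordinates and group them into visual lines. The sketch below is a minimal illustration that uses a hypothetical `TextItem` dataclass as a stand-in for the parsed items (only the field names come from the example above); the grouping tolerance and the assumption that items on one line share nearly the same `y` are illustrative choices, not documented LiteParse behavior.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical stand-in that mirrors the fields used in the example above."""
    text: str
    x: float
    y: float
    width: float
    height: float


def to_corners(item):
    """Convert (x, y, width, height) into (x0, y0, x1, y1)."""
    return (item.x, item.y, item.x + item.width, item.y + item.height)


def group_into_lines(items, y_tolerance=2.0):
    """Group items whose y is within y_tolerance of the first item on the line.

    Assumption: items on the same visual line have nearly equal y values.
    Returns one space-joined string per line, each sorted left to right.
    """
    lines = []
    for item in sorted(items, key=lambda it: (it.y, it.x)):
        if lines and abs(item.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [
        " ".join(it.text for it in sorted(line, key=lambda it: it.x))
        for line in lines
    ]
```

With real parse results, the same functions should work on any objects that expose these five attributes.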

### 3. Parse specific pages

```python
from liteparse import LiteParse

parser = LiteParse(target_pages="1-5,10,15-20", quiet=True)
result = parser.parse("long_paper.pdf")
```

```bash
lit parse long_paper.pdf --target-pages "1-5,10"
```
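
To sanity-check a page specification before a long run, it can help to expand it into explicit page numbers. The helper below is a hypothetical sketch (`expand_page_spec` is not part of liteparse) and assumes the spec uses 1-based page numbers with inclusive ranges, as the examples above suggest; the library's own parsing rules may differ.

```python
def expand_page_spec(spec):
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers.

    Assumptions (not confirmed by the liteparse docs): pages are 1-based
    and ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers must be >= 1")
    return sorted(pages)
```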

### 4. Parse from bytes or stdin

Useful for uploads, S3 downloads, or piping remote PDFs.

```python
from liteparse import LiteParse

parser = LiteParse(quiet=True)
with open("document.pdf", "rb") as f:
    result = parser.parse(f.read())
```

```bash
curl -sL https://example.com/report.pdf | lit parse -
```

### 5. Page screenshots for multimodal agents

Screenshots capture visual content that text extraction alone misses (figures, complex tables, handwriting).

```python
from pathlib import Path

from liteparse import LiteParse

parser = LiteParse(dpi=150, quiet=True)
shots = parser.screenshot("document.pdf", page_numbers=[1, 2, 3])
out = Path("screenshots")
out.mkdir(exist_ok=True)
for s in shots:
    (out / f"page_{s.page_num}.png").write_bytes(s.image_bytes)
```

```bash
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 -o ./screenshots
```

Combine **JSON parse + screenshots** when an agent needs both coordinates and pixels for the same pages.
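
To overlay a text item's box on a screenshot, its coordinates have to be scaled to the render resolution. The following is a minimal sketch that assumes bounding boxes are expressed in PDF points (72 per inch) with the same origin as the rendered image; both points are assumptions to verify against `references/output_formats.md`. `bbox_to_pixels` is a hypothetical helper, not a liteparse function.

```python
def bbox_to_pixels(x, y, width, height, dpi=150, units_per_inch=72.0):
    """Scale an (x, y, width, height) box to pixel corners at the given DPI.

    Assumption: input coordinates are in PDF points (72 per inch) and share
    the image's origin. Returns (left, top, right, bottom) as integers.
    """
    left = round(x * dpi / units_per_inch)
    top = round(y * dpi / units_per_inch)
    right = round((x + width) * dpi / units_per_inch)
    bottom = round((y + height) * dpi / units_per_inch)
    return left, top, right, bottom
```

The resulting tuple matches the `(left, upper, right, lower)` box convention used by common imaging libraries for cropping.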

### 6. Batch-parse a directory

For large corpora, prefer the CLI (parallel OCR workers) or the bundled script.

```bash
lit batch-parse ./papers ./parsed --format json --recursive
lit batch-parse ./papers ./parsed --extension .pdf --no-ocr
```

```bash
python scripts/batch_parse_dir.py ./papers ./parsed --format json --recursive
```

See `scripts/batch_parse_dir.py` for a Python batch wrapper without network calls.
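
For a custom batch loop, the file-discovery half can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `scripts/batch_parse_dir.py`: `plan_batch` and its parameters are hypothetical names, and the choice to mirror the input directory layout under the output directory is an assumption.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, extension=".pdf", recursive=False,
               out_suffix=".json"):
    """Return (source, destination) path pairs for a batch run.

    Destinations mirror the input layout under output_dir, with the file
    suffix swapped for out_suffix. No parsing happens here.
    """
    in_root, out_root = Path(input_dir), Path(output_dir)
    candidates = in_root.rglob("*") if recursive else in_root.glob("*")
    pairs = []
    for src in sorted(candidates):
        # Match the extension case-insensitively and skip directories
        if src.is_file() and src.suffix.lower() == extension.lower():
            rel = src.relative_to(in_root)
            pairs.append((src, (out_root / rel).with_suffix(out_suffix)))
    return pairs
```

Each pair can then be handed to whichever parse call fits the job, creating `destination.parent` first.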

### 7. OCR configuration

OCR is **on by default**. Tesseract is bundled; no extra install for basic English OCR.

```python
parser = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",       # Tesseract codes: fra, deu, etc.
    num_workers=4,            # parallel OCR (default: CPU cores - 1)
    dpi=150,                  # higher DPI → better OCR, slower
)
```

```bash
lit parse scan.pdf --ocr-language fra
lit parse scan.pdf --no-ocr
lit parse scan.pdf --ocr-server-url http://localhost:8080/ocr
```

**Offline / air-gapped:** set `TESSDATA_PREFIX` to a directory of `.traineddata` files, or pass `--tessdata-path`. Details: `references/ocr_and_formats.md`.
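
Before an offline run, it is worth confirming that the language files are actually present. The sketch below assumes the standard Tesseract naming scheme, `<lang>.traineddata`, directly inside the tessdata directory; `missing_traineddata` is a hypothetical helper, not part of liteparse.

```python
import os
from pathlib import Path


def missing_traineddata(languages, tessdata_dir=None):
    """Return the language codes that have no <lang>.traineddata file.

    Falls back to the TESSDATA_PREFIX environment variable when no
    directory is passed explicitly.
    """
    base = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not base:
        raise ValueError("no tessdata directory given and TESSDATA_PREFIX is unset")
    root = Path(base)
    return [
        lang for lang in languages
        if not (root / f"{lang}.traineddata").is_file()
    ]
```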

### 8. Encrypted PDFs

```python
from liteparse import LiteParse

parser = LiteParse(password="secret", quiet=True)
result = parser.parse("protected.pdf")
```

```bash
lit parse protected.pdf --password secret
```

### 9. Search text items by phrase

Find a phrase that spans adjacent text items and get back their combined bounding boxes (e.g. to locate section titles).

```python
from liteparse import LiteParse, search_items

result = LiteParse(quiet=True).parse("document.pdf")
page = result.get_page(1)
matches = search_items(page.text_items, "Materials and Methods", case_sensitive=False)
```
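
The idea of merging adjacent items into one box can be illustrated without the library. The sketch below is not the implementation of `search_items`; it is a minimal stand-alone version that takes hypothetical `(text, (x, y, width, height))` tuples in reading order, matches a phrase across consecutive items, and returns the union box for each hit.

```python
def union_bbox(boxes):
    """Smallest (x, y, width, height) box that covers every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def find_phrase(items, phrase, case_sensitive=False):
    """Return one merged box per run of consecutive items spelling the phrase.

    items: list of (text, (x, y, width, height)) tuples in reading order.
    Item texts are joined with single spaces before comparison.
    """
    def norm(s):
        return s if case_sensitive else s.lower()

    target = norm(" ".join(phrase.split()))
    hits = []
    for start in range(len(items)):
        for end in range(start + 1, len(items) + 1):
            joined = norm(" ".join(text for text, _ in items[start:end]))
            if joined == target:
                hits.append(union_bbox(box for _, box in items[start:end]))
                break
            if not target.startswith(joined):
                break  # this run can no longer match
    return hits
```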

---

## Multi-Format Inputs

| Category | Extensions (examples) | Requirement |
|----------|----------------------|-------------|
| PDF | `.pdf` | Native |
| Office | `.docx`, `.xlsx`, `.pptx`, `.doc`, `.odt`, … | LibreOffice |
| Images | `.png`, `.jpg`, `.tiff`, `.webp`, `.svg`, … | ImageMagick |

Non-PDF inputs are converted to PDF internally, then parsed. If a required conversion tool is missing, parsing fails with an actionable error; install the dependency and retry.
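
A pre-flight check can report a missing converter before any parsing starts. The sketch below is a hypothetical helper, not a liteparse API: the extension sets cover only the examples listed in the table above (the real lists are longer), and the executable names `soffice`, `magick`, and `convert` are taken from the Troubleshooting section.

```python
import shutil
from pathlib import Path

# Partial lists: only the example extensions named in the table above
OFFICE_EXTS = {".docx", ".xlsx", ".pptx", ".doc", ".odt"}
IMAGE_EXTS = {".png", ".jpg", ".tiff", ".webp", ".svg"}

# Executables to look for on PATH for each converter
TOOL_BINARIES = {
    "libreoffice": ["soffice"],
    "imagemagick": ["magick", "convert"],
}


def required_tool(path):
    """Return the converter a file needs, or None for native PDF input."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return None
    if ext in OFFICE_EXTS:
        return "libreoffice"
    if ext in IMAGE_EXTS:
        return "imagemagick"
    raise ValueError(f"extension not covered by this sketch: {ext!r}")


def tool_available(tool):
    """True if any known executable for the converter is on PATH."""
    return any(shutil.which(name) for name in TOOL_BINARIES[tool])
```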

---

## Performance Tips

- **`--no-ocr`** on born-digital PDFs — largest speedup
- **`target_pages`** — parse only methods/supplement sections
- **`num_workers`** — scale OCR across CPU cores
- **`max_pages`** — cap very large files (default 1000)
- **`lit batch-parse`** — directory-scale jobs with `--recursive` and `--extension`
- Lower **`dpi`** (e.g. 100) when OCR quality is already sufficient

---

## Reference Files

| File | Read when |
|------|-----------|
| `references/choosing_a_parser.md` | Unsure whether to use LiteParse, MarkItDown, pdf, or LlamaParse |
| `references/api_reference.md` | Python/TypeScript API, types, `search_items` |
| `references/cli_reference.md` | Full `lit` command flags |
| `references/output_formats.md` | JSON schema, bboxes, confidence scores |
| `references/ocr_and_formats.md` | Tesseract, HTTP OCR, LibreOffice, ImageMagick |

---

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Office file fails | Install LibreOffice; ensure `soffice` is on PATH (Windows: add LibreOffice `program` dir) |
| Image fails | Install ImageMagick; verify `convert` or `magick` works |
| OCR poor quality | Increase `--dpi`; try `--ocr-language`; or HTTP OCR server |
| OCR slow | `--no-ocr` if not needed; reduce pages; increase `num_workers` |
| Air-gapped OCR | `export TESSDATA_PREFIX=/path/to/tessdata` or `--tessdata-path` |
| `ParseError` on bytes | Ensure input is valid PDF bytes (Office bytes need a file path + conversion) |
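
For the bytes case in the last row, a quick signature check can tell PDF data from an Office container before calling the parser. This is a minimal sketch with hypothetical names; it relies on two file-format facts: PDF files carry a `%PDF-` header near the start, and `.docx`/`.xlsx`/`.pptx` files are ZIP archives that begin with `PK\x03\x04`.

```python
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # container format used by .docx/.xlsx/.pptx


def sniff_bytes(data):
    """Classify raw bytes as 'pdf', 'zip-container', or 'unknown'."""
    # Tolerate a little leading junk by searching the first 1024 bytes
    if PDF_MAGIC in data[:1024]:
        return "pdf"
    if data.startswith(ZIP_MAGIC):
        return "zip-container"
    return "unknown"
```

A `zip-container` result suggests writing the bytes to a file with the right extension and parsing by path instead.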

---

## Resources

- **GitHub**: https://github.com/run-llama/liteparse
- **Docs**: https://developers.llamaindex.ai/liteparse/
- **PyPI**: https://pypi.org/project/liteparse/2.0.0/
- **npm**: https://www.npmjs.com/package/@llamaindex/liteparse
- **OCR API spec**: https://github.com/run-llama/liteparse/blob/main/OCR_API_SPEC.md


Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.