scanpy — quality + safety report
In the Skillier index (kdense-scientific__scanpy) · scanned 2026-06-03 · engine: builtin+triage
1 heuristic flag to review
Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict — re-scan with SkillSpector for the authoritative result. Run the authoritative scan →
📇 This skill is in the Skillier index (curated · deduped · quality-filtered). Install Skillier to route & load it into your AI client.
Quality notes
About this skill
Standard single-cell RNA-seq analysis pipeline. Use for QC, normalization, dimensionality reduction PCA/UMAP/t-SNE , clustering, differential expression, and visualization. Best for exploratory scRNA-seq analysis with established workflows. For deep learning models use scvi-tools; for data format…
📄 Read the SKILL.md
---
name: scanpy
description: Standard single-cell RNA-seq analysis pipeline. Use for QC, normalization, dimensionality reduction (PCA/UMAP/t-SNE), clustering, differential expression, and visualization. Best for exploratory scRNA-seq analysis with established workflows. For deep learning models use scvi-tools; for data format questions use anndata.
license: BSD-3-Clause
metadata:
version: "1.1"
skill-author: K-Dense Inc.
---
# Scanpy: Single-Cell Analysis
## Overview
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Apply this skill for complete single-cell workflows including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis. Current stable release: **scanpy 1.12.x** (January 2026).
## Installation
Requires Python **3.12+** (scanpy 1.12 dropped Python ≤3.11) and anndata **≥0.10**.
```bash
uv pip install "scanpy[leiden]"
```
The `[leiden]` extra installs `python-igraph` and `leidenalg`, required for Leiden clustering. For reproducible environments, pin a version: `uv pip install "scanpy[leiden]==1.12.1"`.
For large or out-of-core datasets, many functions support [Dask](https://docs.dask.org/) arrays (experimental):
```bash
uv pip install "scanpy[leiden]" dask
```
See the [Using dask with Scanpy](https://scanpy.scverse.org/en/stable/tutorials/experimental/dask.html) tutorial. For GPU-accelerated scanpy-like operations, use [rapids-singlecell](https://rapids-singlecell.readthedocs.io/) as a separate package.
For AnnData structure and I/O details, use the **anndata** skill. For probabilistic models and batch correction, use **scvi-tools**.
## When to Use This Skill
This skill should be used when:
- Analyzing single-cell RNA-seq data (.h5ad, 10X, CSV formats)
- Performing quality control on scRNA-seq datasets
- Creating UMAP, t-SNE, or PCA visualizations
- Identifying cell clusters and finding marker genes
- Annotating cell types based on gene expression
- Conducting trajectory inference or pseudotime analysis
- Generating publication-quality single-cell plots
## Quick Start
### Basic Import and Setup
```python
import scanpy as sc
import pandas as pd
import numpy as np
# Configure settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = './figures/'
sc.settings.autosave = True # Preferred over per-plot save= (deprecated in scanpy 1.12)
```
### Loading Data
```python
# From 10X Genomics
adata = sc.read_10x_mtx('path/to/data/')
adata = sc.read_10x_h5('path/to/data.h5')
# From h5ad (AnnData format)
adata = sc.read_h5ad('path/to/data.h5ad')
# From CSV
adata = sc.read_csv('path/to/data.csv')
```
### Understanding AnnData Structure
The AnnData object is the core data structure in scanpy:
```python
adata.X # Expression matrix (cells × genes)
adata.obs # Cell metadata (DataFrame)
adata.var # Gene metadata (DataFrame)
adata.uns # Unstructured annotations (dict)
adata.obsm # Multi-dimensional cell data (PCA, UMAP)
adata.raw # Raw data backup
# Access cell and gene names
adata.obs_names # Cell barcodes
adata.var_names # Gene names
```
## Standard Analysis Workflow
### 1. Quality Control
Identify and filter low-quality cells and genes:
```python
# Identify mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('MT-')
# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# Visualize QC metrics
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True)
# Filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.pct_counts_mt < 5, :] # Remove high MT% cells
```
**Doublet detection (optional, on raw counts before normalization):**
```python
sc.pp.scrublet(adata) # Core API since scanpy 1.10 (was scanpy.external.pp)
adata = adata[~adata.obs['predicted_doublet'], :].copy()
```
**Use the QC script for automated analysis** (run from the skill directory or pass the full path):
```bash
python skills/scanpy/scripts/qc_analysis.py input_file.h5ad --output filtered.h5ad
```
### 2. Normalization and Preprocessing
```python
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform
sc.pp.log1p(adata)
# Save raw counts for later
adata.raw = adata
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)
# Subset to highly variable genes
adata = adata[:, adata.var.highly_variable]
# Regress out unwanted variation
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
# Scale data
sc.pp.scale(adata, max_value=10)
```
### 3. Dimensionality Reduction
```python
# PCA
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True) # Check elbow plot
# Compute neighborhood graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# UMAP for visualization
sc.tl.umap(adata)
sc.pl.umap(adata, color='leiden')
# Alternative: t-SNE
sc.tl.tsne(adata)
```
### 4. Clustering
```python
# Leiden clustering (recommended)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden', legend_loc='on data')
# Try multiple resolutions to find optimal granularity
for res in [0.3, 0.5, 0.8, 1.0]:
sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')
```
### 5. Marker Gene Identification
Use `rank_genes_groups` for **exploratory cluster markers** only. Per-cell statistical tests inflate p-values because cells are not independent observations. For rigorous differential expression between conditions or samples, pseudobulk first (see below) and use **pydeseq2** or similar tools.
```python
# Find marker genes for each cluster (exploratory)
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
# Visualize results
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)
# Get results as DataFrame
markers = sc.get.rank_genes_groups_df(adata, group='0')
```
### 6. Cell Type Annotation
```python
# Define marker genes for known cell types
marker_genes = ['CD3D', 'CD14', 'MS4A1', 'NKG7', 'FCGR3A']
# Visualize markers
sc.pl.umap(adata, color=marker_genes, use_raw=True)
sc.pl.dotplot(adata, var_names=marker_genes, groupby='leiden')
# Manual annotation
cluster_to_celltype = {
'0': 'CD4 T cells',
'1': 'CD14+ Monocytes',
'2': 'B cells',
'3': 'CD8 T cells',
}
adata.obs['cell_type'] = adata.obs['leiden'].map(cluster_to_celltype)
# Visualize annotated types
sc.pl.umap(adata, color='cell_type', legend_loc='on data')
```
### 7. Save Results
```python
# Save processed data
adata.write('results/processed_data.h5ad')
# Export metadata
adata.obs.to_csv('results/cell_metadata.csv')
adata.var.to_csv('results/gene_metadata.csv')
```
## Common Tasks
### Creating Publication-Quality Plots
Prefer `sc.settings.autosave` and `sc.settings.figdir` for saving figures. The per-plot `save=` parameter is deprecated in scanpy 1.12.
```python
# Set high-quality defaults
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))
sc.settings.file_format_figs = 'pdf'
sc.settings.figdir = './figures/'
sc.settings.autosave = True
# UMAP with custom styling (saved as figures/umap.pdf via autosave)
sc.pl.umap(adata, color='cell_type',
palette='Set2',
legend_loc='on data',
legend_fontsize=12,
legend_fontoutline=2,
frameon=False)
# Heatmap of marker genes
sc.pl.heatmap(adata, var_names=genes, groupby='cell_type',
swap_axes=True, show_gene_labels=True)
# Dot plot
sc.pl.dotplot(adata, var_names=genes, groupby='cell_type')
```
Refer to `references/plotting_guide.md` for comprehensive visualization examples.
### Trajectory Inference
```python
# PAGA (Partition-based graph abstraction)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color='leiden')
# Diffusion pseudotime
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden'] == '0')[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime')
```
### Pseudobulk and Differential Expression Between Conditions
Pseudobulk by sample and cell type, then run proper DE (e.g., pydeseq2) rather than per-cell `rank_genes_groups`:
```python
# Aggregate counts by sample and cell type (dask-compatible in scanpy 1.12)
pb = sc.get.aggregate(
adata,
by=['sample', 'cell_type'],
func='sum',
layer='counts', # Use raw counts layer if available
)
# Downstream: export pb and use pydeseq2 for condition comparisons
```
For quick exploratory comparisons within a cluster, `rank_genes_groups` is acceptable but interpret p-values cautiously:
```python
adata_subset = adata[adata.obs['cell_type'] == 'T cells']
sc.tl.rank_genes_groups(adata_subset, groupby='condition',
groups=['treated'], reference='control')
sc.pl.rank_genes_groups(adata_subset, groups=['treated'])
```
### Gene Set Scoring
```python
# Score cells for gene set expression
gene_set = ['CD3D', 'CD3E', 'CD3G']
sc.tl.score_genes(adata, gene_set, score_name='T_cell_score')
sc.pl.umap(adata, color='T_cell_score')
```
### Batch Correction
```python
# ComBat batch correction
sc.pp.combat(adata, key='batch')
# Alternative: use Harmony or scVI (separate packages)
```
## Key Parameters to Adjust
### Quality Control
- `min_genes`: Minimum genes per cell (typically 200-500)
- `min_cells`: Minimum cells per gene (typically 3-10)
- `pct_counts_mt`: Mitochondrial threshold (typically 5-20%)
### Normalization
- `target_sum`: Target counts per cell (default 1e4)
### Feature Selection
- `n_top_genes`: Number of HVGs (typically 2000-3000)
- `min_mean`, `max_mean`, `min_disp`: HVG selection parameters
### Dimensionality Reduction
- `n_pcs`: Number of principal components (check variance ratio plot)
- `n_neighbors`: Number of neighbors (typically 10-30)
### Clustering
- `resolution`: Clustering granularity (0.4-1.2, higher = more clusters)
## Common Pitfalls and Best Practices
1. **Always save raw counts**: `adata.raw = adata` before filtering genes
2. **Check QC plots carefully**: Adjust thresholds based on dataset quality
3. **Use Leiden clustering**: `sc.tl.louvain` is deprecated in scanpy 1.12
4. **Try multiple clustering resolutions**: Find optimal granularity
5. **Validate cell type annotations**: Use multiple marker genes
6. **Use `use_raw=True` for gene expression plots**: Shows normalized counts from `.raw`
7. **Check PCA variance ratio**: Determine optimal number of PCs
8. **Save intermediate results**: Long workflows can fail partway through
9. **Pseudobulk for DE**: Do not treat `rank_genes_groups` p-values as rigorous DE between conditions
10. **Save plots via settings**: Use `sc.settings.autosave` instead of deprecated `save=` on plot functions
## Bundled Resources
### scripts/qc_analysis.py
Automated quality control script that calculates metrics, generates plots, and filters data:
```bash
python skills/scanpy/scripts/qc_analysis.py input.h5ad --output filtered.h5ad \
--mt-threshold 5 --min-genes 200 --min-cells 3
```
### references/standard_workflow.md
Complete step-by-step workflow with detailed explanations and code examples for:
- Data loading and setup
- Quality control with visualization
- Normalization and scaling
- Feature selection
- Dimensionality reduction (PCA, UMAP, t-SNE)
- Clustering (Leiden)
- Doublet detection (scrublet) and pseudobulk aggregation
- Marker gene identification
- Cell type annotation
- Trajectory inference
- Differential expression
Read this reference when performing a complete analysis from scratch.
### references/api_reference.md
Quick reference guide for scanpy functions organized by module:
- Reading/writing data (`sc.read_*`, `adata.write_*`)
- Preprocessing (`sc.pp.*`)
-
… (truncated)Want a live grade + an embeddable README badge? Run your skill through the free scanner.
Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.