zarr-python — quality + safety report
In the Skillier index (kdense-scientific__zarr-python) · scanned 2026-06-03 · engine: builtin+triage
3 heuristic flags to review
Heuristic flags from the builtin scanner, which is known to over-flag (it trips on legitimate env-reading integrations, security skills, and library .eval calls). This is NOT an authoritative malicious verdict; re-scan with SkillSpector for the authoritative result.
About this skill
Chunked N-D arrays for cloud storage (Zarr-Python 3). Compressed arrays, parallel I/O, S3/GCS via fsspec, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.
📄 Read the SKILL.md
---
name: zarr-python
description: Chunked N-D arrays for cloud storage (Zarr-Python 3). Compressed arrays, parallel I/O, S3/GCS via fsspec, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.
allowed-tools: Read Write Edit Bash
license: MIT license
compatibility: Requires Python 3.12+ and zarr 3.x. Cloud I/O needs zarr[remote] plus s3fs or gcsfs. Legacy Zarr v2 workflows use zarr==2.* on older Python.
metadata:
version: "1.0"
skill-author: K-Dense Inc.
---
# Zarr Python
## Overview
Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.
**Current upstream:** zarr **3.2.1** (PyPI, May 2026). Docs: [zarr.readthedocs.io](https://zarr.readthedocs.io/en/stable/). New arrays default to **Zarr format 3**; set `zarr_format=2` for legacy interop. This skill is a **community guide** maintained by K-Dense Inc., not an official zarr-developers package.
## Quick Start
### Installation
```bash
uv pip install "zarr>=3.2,<4"
```
Requires **Python 3.12+** (per PyPI metadata for zarr 3.2.x). For remote stores (S3, GCS, HTTP):
```bash
uv pip install "zarr[remote]"
uv pip install s3fs # AWS S3
uv pip install gcsfs # Google Cloud Storage
```
Pin `zarr>=3,<4` in application dependencies. Use `uv pip install "zarr==2.*"` only when you must stay on Zarr-Python 2 / Python 3.10–3.11.
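The version requirements above can be checked before running a pipeline. The following is a minimal sketch using only the standard library; the helper names `major_version` and `check_environment` are hypothetical and not part of Zarr, and the Python 3.12 floor is taken from the requirement stated above.

```python
import sys
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.2.1'."""
    return int(version_string.split(".")[0])


def check_environment(min_python=(3, 12), package="zarr", expected_major=3):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed")
    else:
        if major_version(installed) != expected_major:
            problems.append(
                f"{package} {installed} found, expected {expected_major}.x"
            )
    return problems


# Report anything that does not match the stated requirements
for problem in check_environment():
    print("WARNING:", problem)
```

Returning a list of problems instead of raising lets a caller decide whether a mismatch should be fatal.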
### Basic Array Creation
```python
import zarr
import numpy as np
# Create a 2D array with chunking and compression
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4"
)
# Write data using NumPy-style indexing
z[:, :] = np.random.random((10000, 10000))
# Read data
data = z[0:100, 0:100] # Returns NumPy array
```
## Core Operations
### Creating Arrays
Zarr provides multiple convenience functions for array creation:
```python
# Create a zero-filled array backed by a local store
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4',
               store='zeros.zarr')
# Create filled arrays (held in memory when no store is given)
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))
# Create from existing data (separate path so it does not collide with the store above)
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='from_numpy.zarr')
# Create like another array
z2 = zarr.zeros_like(z) # Matches shape, chunks, dtype of z
```
### Opening Existing Arrays
```python
# Open an existing array for reading and writing ('r+' fails if it does not exist)
z = zarr.open_array('data.zarr', mode='r+')
# Read-only mode
z = zarr.open_array('data.zarr', mode='r')
# The open() function auto-detects arrays vs groups
z = zarr.open('data.zarr') # Returns Array or Group
```
### Reading and Writing Data
Zarr arrays support NumPy-like indexing:
```python
# Write entire array
z[:] = 42
# Write slices
z[0, :] = np.arange(100)
z[10:20, 50:60] = np.random.random((10, 10))
# Read data (returns NumPy array)
data = z[0:100, 0:100]
row = z[5, :]
# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]] # Coordinate indexing
z.oindex[0:10, [5, 10, 15]] # Orthogonal indexing
z.blocks[0, 0] # Block/chunk indexing
```
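The difference between coordinate (`vindex`) and orthogonal (`oindex`) indexing is easiest to see on a tiny grid. The sketch below mimics the two selection rules on a plain nested list; `coordinate_select` and `orthogonal_select` are hypothetical stand-ins written for illustration, not Zarr functions.

```python
# A 4x4 grid where each value encodes its position: value = 10 * row + col
grid = [[10 * r + c for c in range(4)] for r in range(4)]


def coordinate_select(array, rows, cols):
    """Pick one element per (row, col) pair, in the spirit of z.vindex[rows, cols]."""
    return [array[r][c] for r, c in zip(rows, cols)]


def orthogonal_select(array, rows, cols):
    """Pick every combination of rows and cols, in the spirit of z.oindex[rows, cols]."""
    return [[array[r][c] for c in cols] for r in rows]


rows, cols = [0, 1, 3], [1, 2, 0]
points = coordinate_select(grid, rows, cols)  # 3 values: (0,1), (1,2), (3,0)
block = orthogonal_select(grid, rows, cols)   # 3x3 values: rows x cols

print(points)  # [1, 12, 30]
print(block)   # [[1, 2, 0], [11, 12, 10], [31, 32, 30]]
```

With three row indices and three column indices, coordinate selection yields three points while orthogonal selection yields a 3×3 block.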
### Resizing and Appending
```python
# Resize array (v3: pass shape as a tuple)
z.resize((15000, 15000))
# Append data along an axis; the other dimensions must match the current shape
z.append(np.random.random((1000, 15000)), axis=0) # Adds 1000 rows -> (16000, 15000)
```
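Appending grows only the chosen axis, so every other dimension of the incoming data has to match the array's current shape. A minimal sketch of that rule, using a hypothetical helper `shape_after_append` (not part of Zarr) to predict the result before touching any data:

```python
def shape_after_append(shape, data_shape, axis=0):
    """Return the array shape after appending data_shape along axis.

    Raises ValueError when the non-append dimensions do not line up.
    """
    if len(shape) != len(data_shape):
        raise ValueError("array and data must have the same number of dimensions")
    for dim, (current, incoming) in enumerate(zip(shape, data_shape)):
        if dim != axis and current != incoming:
            raise ValueError(
                f"dimension {dim} mismatch: array has {current}, data has {incoming}"
            )
    new_shape = list(shape)
    new_shape[axis] += data_shape[axis]
    return tuple(new_shape)


# After resizing to (15000, 15000), 1000 new rows must be 15000 wide
print(shape_after_append((15000, 15000), (1000, 15000), axis=0))  # (16000, 15000)
```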
## Chunking Strategies
Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.
### Chunk Size Guidelines
- **Minimum chunk size**: 1 MB recommended for optimal performance
- **Balance**: Larger chunks mean fewer stored objects and read requests; smaller chunks allow finer-grained parallel access
- **Memory consideration**: Entire chunks must fit in memory during compression
```python
# Configure chunk size (aim for ~1 MB per chunk)
# For float32 data: 1 MiB = 262,144 elements = a 512×512 chunk
z = zarr.zeros(
    shape=(10000, 10000),
    chunks=(512, 512),  # 512 * 512 * 4 bytes = 1 MiB per chunk
    dtype='f4'
)
```
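The arithmetic behind the 512×512 figure generalizes to other dtypes and dimensionalities. The sketch below is a standard-library calculation only; `chunk_nbytes` and `square_chunk_edge` are hypothetical helper names, and sizes are uncompressed bytes.

```python
import math


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunks) * itemsize


def square_chunk_edge(target_bytes, itemsize, ndim=2):
    """Largest edge length whose ndim-cube of elements fits in target_bytes."""
    n_elements = target_bytes // itemsize
    edge = max(1, round(n_elements ** (1.0 / ndim)))
    # Correct any floating-point rounding in either direction
    while edge > 1 and edge ** ndim > n_elements:
        edge -= 1
    while (edge + 1) ** ndim <= n_elements:
        edge += 1
    return edge


ONE_MIB = 2 ** 20
print(square_chunk_edge(ONE_MIB, itemsize=4))          # 512 for float32 in 2-D
print(square_chunk_edge(ONE_MIB, itemsize=4, ndim=3))  # 64 for float32 in 3-D
print(chunk_nbytes((1000, 1000), itemsize=4))          # 4000000 bytes
```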
### Aligning Chunks with Access Patterns
**Critical**: Chunk shape dramatically affects performance based on how data is accessed.
```python
# If accessing rows frequently (first dimension)
z = zarr.zeros((10000, 10000), chunks=(10, 10000)) # Chunk spans columns
# If accessing columns frequently (second dimension)
z = zarr.zeros((10000, 10000), chunks=(10000, 10)) # Chunk spans rows
# For mixed access patterns (balanced approach)
z = zarr.zeros((10000, 10000), chunks=(1000, 1000)) # Square chunks
```
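**Performance example**: For a (200, 200, 200) array, reading a one-dimensional slice along the first dimension:
- Using chunks (1, 200, 200): ~107 ms (the slice crosses all 200 chunks)
- Using chunks (200, 200, 1): ~1.65 ms, about 65× faster (the slice sits inside a single chunk)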
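One way to reason about these timings is to count how many chunks a read has to fetch and decompress. The sketch below does that for a rectangular selection; `chunks_touched` is a hypothetical helper, and the selection is given as `(start, stop)` pairs per dimension, not as Zarr indexing syntax.

```python
def chunks_touched(chunks, selection):
    """Count the chunks intersected by a rectangular selection.

    chunks:    chunk length per dimension, e.g. (1, 200, 200)
    selection: (start, stop) per dimension, stop exclusive
    """
    total = 1
    for chunk_len, (start, stop) in zip(chunks, selection):
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        total *= last - first + 1
    return total


# A 1-D slice along the first dimension of a (200, 200, 200) array
along_first = ((0, 200), (0, 1), (0, 1))
print(chunks_touched((1, 200, 200), along_first))  # 200
print(chunks_touched((200, 200, 1), along_first))  # 1

# One full row of a (10000, 10000) array
one_row = ((5, 6), (0, 10000))
print(chunks_touched((10, 10000), one_row))  # 1
print(chunks_touched((10000, 10), one_row))  # 1000
```

Chunk count is only a proxy: the cost of each fetch also depends on chunk size and the storage backend.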
### Sharding for Large-Scale Storage
When arrays have millions of small chunks, use sharding to group chunks into larger storage objects:
```python
import zarr
# Create array with sharding (the `shards` argument sets up the sharding codec)
z = zarr.create_array(
    store='data.zarr',
    shape=(100000, 100000),
    chunks=(100, 100),    # Small chunks for access
    shards=(1000, 1000),  # 10 × 10 = 100 chunks per shard
    dtype='f4'
)
```
**Benefits**:
- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Prevents filesystem block size waste
**Important**: Entire shards must fit in memory before writing.
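The object counts behind the sharding example can be worked out directly. This is a standard-library sketch with a hypothetical helper `grid_counts`; it assumes the shard shape is an exact multiple of the chunk shape and reports uncompressed sizes.

```python
import math


def grid_counts(shape, chunks, shards):
    """Return (chunks per shard, total chunks, total shards)."""
    chunks_per_shard = math.prod(s // c for s, c in zip(shards, chunks))
    n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))
    n_shards = math.prod(math.ceil(n / s) for n, s in zip(shape, shards))
    return chunks_per_shard, n_chunks, n_shards


shape, chunks, shards = (100000, 100000), (100, 100), (1000, 1000)
per_shard, n_chunks, n_shards = grid_counts(shape, chunks, shards)
shard_nbytes = math.prod(shards) * 4  # float32, uncompressed

print(per_shard)     # 100 chunks grouped into each shard
print(n_chunks)      # 1000000 objects without sharding
print(n_shards)      # 10000 objects with sharding
print(shard_nbytes)  # 4000000 bytes held in memory per shard write
```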
## Compression
Zarr applies compression per chunk to reduce storage while maintaining fast access.
### Configuring Compression
```python
import zarr
from zarr.codecs import BloscCodec, GzipCodec
# Default: Zstandard is applied when no compressor is specified
z = zarr.zeros((1000, 1000), chunks=(100, 100)) # Uses default compression
# Configure Blosc codec (create_array takes `compressors`, not `codecs`)
z = zarr.create_array(
    store='blosc.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')
)
# Available Blosc compressors: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'
# Use Gzip compression
z = zarr.create_array(
    store='gzip.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=GzipCodec(level=6)
)
# Disable compression
z = zarr.create_array(
    store='raw.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=None # No compression
)
```
### Compression Performance Tips
- **Blosc**: Fast meta-compressor with built-in shuffle, good for interactive workloads
- **Zstandard** (the default in Zarr-Python 3): Better compression ratios, slightly slower than LZ4
- **Gzip**: Widely supported and compresses well at high levels, but slower
- **LZ4**: Fastest compression, lower ratios
- **Shuffle**: Enable shuffle filter for better compression on numeric data
```python
# Good general choice for numeric scientific data
compressors=BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')
# Tuned for speed
compressors=BloscCodec(cname='lz4', clevel=1)
# Tuned for compression ratio
compressors=GzipCodec(level=9)
```
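To see why a shuffle filter helps on numeric data, the sketch below applies a hand-written byte shuffle and compares compressed sizes. It uses the standard library's `zlib` as a stand-in compressor, not Blosc's actual implementation, and the helper names are hypothetical; the input is a synthetic run of slowly varying 32-bit integers.

```python
import struct
import zlib


def byte_shuffle(raw, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def byte_unshuffle(shuffled, itemsize):
    """Invert byte_shuffle, restoring the original element layout."""
    n_items = len(shuffled) // itemsize
    planes = [shuffled[i * n_items:(i + 1) * n_items] for i in range(itemsize)]
    return bytes(b for element in zip(*planes) for b in element)


# Synthetic data: 100,000 consecutive little-endian uint32 values
n = 100_000
raw = struct.pack(f"<{n}I", *range(n))

plain_size = len(zlib.compress(raw, 6))
shuffled_size = len(zlib.compress(byte_shuffle(raw, 4), 6))

print("unshuffled:", plain_size, "bytes")
print("shuffled:  ", shuffled_size, "bytes")
```

Shuffling places the rarely changing high-order bytes next to each other, which gives the compressor long repetitive runs. Real data will show smaller or larger gains depending on how smooth it is.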
## Storage Backends
Zarr supports multiple storage backends through a flexible storage interface.
### Local Filesystem (Default)
```python
from zarr.storage import LocalStore
# Explicit store creation
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
# Or use string path (creates LocalStore automatically)
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000),
                    chunks=(100, 100))
```
### In-Memory Storage
```python
from zarr.storage import MemoryStore
# Create in-memory store
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
# Data exists only in memory, not persisted
```
### ZIP File Storage
```python
from zarr.storage import ZipStore
# Write to ZIP file
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close() # IMPORTANT: Must close ZipStore
# Read from ZIP file
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```
### Cloud Storage (S3, GCS)
Zarr 3 uses **fsspec** backends via URI strings or `FsspecStore` (preferred over legacy `S3Map`/`GCSMap`).
```python
import numpy as np
import zarr
# S3: credentials come from the standard AWS env vars (scope reads to these keys only)
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
z = zarr.create_array(
    store="s3://my-bucket/path/to/array.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    storage_options={"anon": False},
)
z[:] = np.random.random((1000, 1000))  # Placeholder data
# GCS: GOOGLE_APPLICATION_CREDENTIALS or gcloud default credentials
z = zarr.open_array(
    "gs://my-bucket/path/to/array.zarr",
    mode="r",
    storage_options={"project": "my-project"},
)
# Explicit store (any fsspec filesystem)
from zarr.storage import FsspecStore
store = FsspecStore.from_url("s3://my-bucket/data.zarr", storage_options={"anon": False})
root = zarr.open_group(store=store, mode="r+")
```
Cloud backends read credentials from provider environment variables locally via fsspec; they are not sent to third-party endpoints outside your configured bucket/project.
**Cloud Storage Best Practices**:
- Use consolidated metadata to reduce latency: `zarr.consolidate_metadata(store)`
- Align chunk sizes with cloud object sizing (typically 5-100 MB optimal)
- Enable parallel writes using Dask for large-scale data
- Consider sharding to reduce number of objects
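The 5-100 MB guidance can be turned into a chunk shape for time-series data by grouping several time steps into one chunk. The sketch below is a rough calculation only: `steps_per_chunk` is a hypothetical helper, MB is taken as 10^6 bytes, sizes are uncompressed, and the 50 MB target is an arbitrary point inside the stated range.

```python
import math


def steps_per_chunk(slab_shape, itemsize, target_bytes):
    """How many leading-axis steps fit in one chunk of about target_bytes."""
    slab_nbytes = math.prod(slab_shape) * itemsize
    return max(1, target_bytes // slab_nbytes)


# Daily float32 fields on a 720 x 1440 grid, 365 time steps in total
slab_shape, itemsize, n_steps = (720, 1440), 4, 365

steps = steps_per_chunk(slab_shape, itemsize, target_bytes=50_000_000)
chunk_shape = (steps, *slab_shape)
chunk_nbytes = math.prod(chunk_shape) * itemsize
n_objects = math.ceil(n_steps / steps)

print(chunk_shape)   # (12, 720, 1440)
print(chunk_nbytes)  # 49766400 bytes, inside the 5-100 MB range
print(n_objects)     # 31 objects instead of 365 one-step chunks
```

Larger time chunks cut the object count but make single-day reads fetch more data, so the access pattern still decides the final shape.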
## Groups and Hierarchies
Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.
### Creating and Using Groups
```python
# Create root group
root = zarr.group(store='data/hierarchy.zarr')
# Create sub-groups
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')
# Create arrays within groups
temp_array = temperature.create_array(
    name='t2m',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)
precip_array = precipitation.create_array(
    name='prcp',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)
# Access using paths
array = root['temperature/t2m']
# Visualize hierarchy (tree() relies on the optional `rich` package)
print(root.tree())
# Output:
# /
# ├── temperature
# │ └── t2m (365, 720, 1440) f4
# └── precipitation
# └── prcp (365, 720, 1440) f4
```
### Group API (v3)
Use `create_array` / `require_array` (h5py-style `create_dataset` / `require_dataset` were removed in v3):
```python
root = zarr.group('data.zarr')
arr = root.create_array('my_data', shape=(1000, 1000), chunks=(100, 100), dtype='f4')
grp = root.require_group('subgroup')
arr2 = grp.require_array('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
```
## Attributes and Metadata
Attach custom metadata to arrays and groups using attributes:
```python
# Add attributes to array
z = zarr.zeros((1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1
# Attributes are stored as JSON
print(z.attrs['units']) # Output: K
# Add attributes to groups
root = zarr.group('data.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'
# Attributes persist with the array/group: reopen the store and read them back
root2 = zarr.open('data.zarr')
print(root2.attrs['project']) # Output: Climate Analysis
```
**Important**: Attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).
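A quick pre-check can catch values that will not serialize before they are assigned to `attrs`. The sketch below uses the standard library's `json` module as an approximation of that constraint; `is_json_serializable` is a hypothetical helper, and Zarr's own encoder may accept or reject slightly different types.

```python
import json
from datetime import date


def is_json_serializable(value):
    """Return True if the standard json module can encode the value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


created = date(2024, 1, 15)

print(is_json_serializable({"units": "K", "processing_version": 2.1}))  # True
print(is_json_serializable(created))              # False: date objects are not JSON
print(is_json_serializable(created.isoformat()))  # True: store it as '2024-01-15'
print(is_json_serializable({1, 2, 3}))            # False: convert sets to lists first
```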
## Integration with NumPy, Dask, and Xarray
### NumPy Integration
Zarr arrays implement the NumPy array interface:
```python
import numpy as np
import zarr
z = zarr.zeros((1000, 1000), chunks=(100, 100))
# NumPy functions accept Zarr arrays, but convert them to in-memory NumPy arrays first
result = np.sum(z, axis=0) # Reads the whole array into memory
mean = np.mean(z[:100, :100])
# Convert to NumPy array
numpy_array = z[:] # Loads entire array into memory
```
### Dask Integration
Dask provides lazy, parallel computation on Za
… (truncated)
Graded independently by Skillproof — nothing to sell the author. Quality is mechanical + corpus-grounded; safety flags are heuristic (builtin+triage), not a malicious verdict.