Data Storage
Overview
What is DataStorage?
The DataStorage module provides a unified, abstract interface for persisting and retrieving data in HOOPS AI. It supports multiple storage backends (Zarr, JSON, in-memory) while maintaining a consistent API. The system integrates with SchemaBuilder to enable schema-driven validation, metadata routing, and organized data merging.
Tip
Prerequisites: This guide assumes familiarity with:
CAD encoding basics: Understanding what data BrepEncoder produces → See CAD Data Encoding
SchemaBuilder: Defining data organization schemas → See Datasets - ML-Ready Inputs
Key Architecture
DataStorage implementations follow a plugin pattern where:
The DataStorage abstract base class defines the interface
Concrete implementations (OptStorage, MemoryStorage, JsonStorageHandler) handle specific formats
Schema dictionaries from SchemaBuilder configure storage behavior
Metadata routing automatically organizes information into .infoset and .attribset files
The system follows a push-based storage pattern:
Data Producer (Encoder) → DataStorage.save_data() → Backend-Specific Persistence
Schema Dictionary → DataStorage.set_schema() → Validation & Routing Logic
Three Storage Implementations
| Implementation | Format | Use Case |
|---|---|---|
| OptStorage | Zarr (compressed) | Production datasets, large arrays, cloud storage compatibility |
| MemoryStorage | RAM (dictionaries) | Unit testing, prototyping, small datasets |
| JsonStorageHandler | JSON files | Debugging, human inspection, interoperability |
See also
CAD Data Encoding - How BrepEncoder uses DataStorage
Datasets - ML-Ready Inputs - SchemaBuilder and dataset organization
Data Flow Customisation - Integrating storage into automated workflows
Basic Usage
Here’s a minimal example showing DataStorage with CAD encoding:
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.cadencoder import BrepEncoder
from hoops_ai.storage import MemoryStorage
# Load CAD file
loader = HOOPSLoader()
model = loader.create_from_file("part.step")
brep = model.get_brep()
# Create storage handler
storage = MemoryStorage() # In-memory for testing
# Create encoder and extract features
encoder = BrepEncoder(brep, storage)
encoder.push_face_attributes()
encoder.push_edge_attributes()
# Access stored data
face_areas = storage.load_data("face_areas")
print(f"Stored keys: {storage.get_keys()}")
For production use, replace MemoryStorage with OptStorage:
from hoops_ai.storage import OptStorage
storage = OptStorage("output/part001.zarr")
encoder = BrepEncoder(brep, storage)
# ... encoding operations ...
storage.close()
The DataStorage Abstract Base Class
Understanding the Interface
DataStorage is an abstract base class (ABC) that defines the contract all storage backends must implement. It specifies the minimum set of operations required for saving and retrieving data.
from hoops_ai.storage.datastorage import DataStorage
Core Abstract Methods
All DataStorage implementations must provide these methods:
save_data(data_key: str, data: Any) → None
Purpose: Persists data under a unique key.
storage.save_data("face_areas", face_areas_array)
storage.save_data("metadata/description", "CAD part analysis")
Parameters:
data_key (str): Unique identifier for the data (can be hierarchical using '/')
data (Any): Data to store (numpy arrays, lists, dicts, scalars, strings)
Behavior:
Overwrites existing data if key already exists
May validate data against schema if schema is set
Automatically calculates and stores data size in metadata
load_data(data_key: str) → Any
Purpose: Retrieves data associated with a specific key.
face_areas = storage.load_data("face_areas")
description = storage.load_data("metadata/description")
Parameters:
data_key (str): The key of the data to load
Returns:
Any: The loaded data in its original format
Raises:
KeyError: If the data_key does not exist
save_metadata(key: str, value: Any) → None
Purpose: Stores metadata as key-value pairs, supporting nested structures.
storage.save_metadata("size_cadfile", 1024000)
storage.save_metadata("file_sizes_KB/face_areas", 45.2)
storage.save_metadata("processing/duration", 12.5)
Parameters:
key (str): Metadata key (supports nesting with '/' separator)
value (Any): Metadata value (bool, int, float, string, list, or array)
Behavior:
Creates nested dictionary structure based on ‘/’ separators
Merges with existing metadata (doesn’t overwrite entire structure)
When schema is set, routes to .infoset or .attribset files
load_metadata(key: str) → Any
Purpose: Loads metadata by key, supporting nested access.
file_size = storage.load_metadata("size_cadfile")
face_size = storage.load_metadata("file_sizes_KB/face_areas")
Parameters:
key (str): Metadata key (supports nested keys with '/' separator)
Returns:
Any: The metadata value
Raises:
KeyError: If the key does not exist
get_keys() → list
Purpose: Returns a list of all top-level data keys in storage.
keys = storage.get_keys()
# Returns: ['face_indices', 'face_areas', 'edge_indices', 'graph', ...]
Returns:
list: All top-level keys (arrays and groups)
get_file_path(data_key: str) → str
Purpose: Gets the file system path for a specific data key.
path = storage.get_file_path("face_areas")
# OptStorage: "./encoded_data/my_part.zarr/face_areas"
# JsonStorage: "./json_data/face_areas.json"
# MemoryStorage: "In-memory storage: No file path for key 'face_areas'"
Parameters:
data_key (str): The data key
Returns:
str: File path or descriptive message for in-memory storage
close() → None
Purpose: Cleanup and resource deallocation.
storage.close()
Behavior:
OptStorage: Copies visualization files, exports metadata, deletes temporary directory
MemoryStorage: Clears all data from memory
JsonStorage: No-op (JSON operations are stateless)
format() → str
Purpose: Returns the storage format identifier.
fmt = storage.format()
# OptStorage: "zarr"
# MemoryStorage: "memory"
# JsonStorage: "json"
Returns:
str: Format identifier string
compress_store() → int
Purpose: Compresses the storage (if applicable).
compressed_size = storage.compress_store()
# OptStorage: Creates .data zip file, returns size in bytes
# MemoryStorage/JsonStorage: Returns 0 (no compression)
Returns:
int: Size of compressed file in bytes, or 0 if not applicable
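Because the interface is just the methods above, adding a new backend amounts to implementing them. The following is a minimal, illustrative sketch of a hypothetical dictionary-backed subclass (the exact abstract signatures in the base class may differ, and nested '/' metadata keys, size tracking, and schema routing are omitted for brevity):
from typing import Any
from hoops_ai.storage.datastorage import DataStorage
class DictStorage(DataStorage):
    """Toy backend that keeps everything in plain dictionaries."""
    def __init__(self):
        self._data = {}
        self._metadata = {}
    def save_data(self, data_key: str, data: Any) -> None:
        self._data[data_key] = data  # overwrites if the key already exists
    def load_data(self, data_key: str) -> Any:
        return self._data[data_key]  # raises KeyError if the key is missing
    def save_metadata(self, key: str, value: Any) -> None:
        self._metadata[key] = value
    def load_metadata(self, key: str) -> Any:
        return self._metadata[key]
    def get_keys(self) -> list:
        return list(self._data.keys())
    def get_file_path(self, data_key: str) -> str:
        return f"In-memory storage: no file path for key '{data_key}'"
    def close(self) -> None:
        self._data.clear()
        self._metadata.clear()
    def format(self) -> str:
        return "dict"
    def compress_store(self) -> int:
        return 0  # nothing to compress for an in-memory backend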
Schema Support Methods
These methods integrate with SchemaBuilder for validation and routing:
set_schema(schema: dict) → None
Purpose: Configures the storage with a schema definition from SchemaBuilder.
from hoops_ai.storage.datasetstorage import SchemaBuilder
builder = SchemaBuilder(domain="CAD_analysis")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
schema = builder.build()
storage.set_schema(schema)  # Schema dictionary applied here
Parameters:
schema (dict): Schema dictionary from SchemaBuilder.build()
Behavior:
Default implementation saves schema as metadata under key "_storage_schema"
Subclasses can override for more efficient schema storage
Enables validation and metadata routing
get_schema() → dict
Purpose: Retrieves the currently configured schema.
schema = storage.get_schema()
# Returns: Schema dictionary or {} if no schema is set
Returns:
dict: The schema definition, or empty dict if no schema
get_group_for_array(array_name: str) → str
Purpose: Determines which group an array belongs to based on schema.
group = storage.get_group_for_array("face_areas")
# Returns: "faces" (based on schema definition)
group = storage.get_group_for_array("edge_lengths")
# Returns: "edges"
Parameters:
array_name (str): Name of the array
Returns:
str: Group name for the array, or None if not found in schema
Use Case: Dataset merging uses this to group arrays correctly
validate_data_against_schema(data_key: str, data: Any) → bool
Purpose: Validates data against the stored schema if present.
import numpy as np
# Assuming schema defines face_areas as ["face"] dimension, float32
valid_data = np.array([1.5, 2.3, 4.1], dtype=np.float32)
is_valid = storage.validate_data_against_schema("face_areas", valid_data)
# Returns: True
invalid_data = np.array([[1.5, 2.3], [4.1, 3.2]])  # Wrong dimensions
is_valid = storage.validate_data_against_schema("face_areas", invalid_data)
# Returns: False
Parameters:
data_key (str): The key under which data will be stored
data (Any): The data to validate
Returns:
bool: True if valid or no schema present, False if validation fails
Validation Checks:
Dimension count matches schema specification
Data type matches or is convertible to specified dtype
Arrays not in schema are allowed (extensible schema)
Why Separate Data and Metadata?
Data and metadata serve different purposes:
Data: Large arrays, features, graph structures (stored as Zarr arrays, JSON objects)
Metadata: Small descriptive information (file size, timestamps, labels)
Separating them allows:
Efficient querying of metadata without loading large arrays
Different storage formats (arrays vs. key-value pairs)
Automatic routing to .infoset (file-level) or .attribset (categorical) files
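As a quick illustration of that split, file statistics can be inspected without pulling any arrays into memory. This sketch reuses the store path from the Basic Usage example and assumes size tracking and earlier save_metadata calls have populated the metadata:
from hoops_ai.storage import OptStorage
storage = OptStorage("output/part001.zarr")
# Metadata lookups are lightweight key-value reads; no array chunks are loaded
size_kb = storage.load_metadata("file_sizes_KB/face_areas")
cad_size = storage.load_metadata("size_cadfile")
print(f"face_areas on disk: {size_kb} KB, source CAD file: {cad_size} bytes")
# The large array is only read when explicitly requested
face_areas = storage.load_data("face_areas")
storage.close()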
Implementation 1: OptStorage (Production)
OptStorage is the primary storage implementation using Zarr format for efficient, chunked, compressed array storage.
Initialization
from hoops_ai.storage import OptStorage
storage = OptStorage(
store_path="./flow_output/flows/my_flow/encoded/part_001.zarr",
compress_extension=".data"
)
Parameters:
store_path(str): Path to the Zarr directory store
compress_extension(str): Extension for compressed archive (default: “.data”)
Initialization Behavior:
If the .zarr.data file exists and the directory doesn't: Opens in read-only mode
Otherwise: Creates directory structure and initializes writable store
Creates metadata.json file for metadata storage
Uses DirectoryStore for writing, ZipStore for reading compressed archives
Data Operations
Saving Data
OptStorage recursively handles nested data structures:
import numpy as np
# Scalars
storage.save_data("num_faces", 42)
# 1D Arrays
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
# Multi-dimensional Arrays
storage.save_data("face_normals", np.random.randn(100, 3).astype(np.float32))
# Nested Dictionaries
storage.save_data("graph", {
"edges_source": np.array([0, 1, 2]),
"edges_destination": np.array([1, 2, 3]),
"num_nodes": 4
})
# Strings
storage.save_data("description", "High-complexity CAD part")
Data Type Handling:
NumPy arrays: Stored with compression, chunking, and dimension names
Lists: Converted to NumPy arrays
Dicts: Become Zarr groups with nested structure
Scalars (int, float, bool): Stored as 0-dimensional arrays
Strings: Stored as object arrays with MsgPack codec
Automatic Features:
NaN Detection: Raises error if NaNs found in floating-point arrays
Compression: Zstd level 12 compression applied
Chunking: Automatic chunk sizing (~1M elements per chunk)
Filters: Delta filter for integer arrays
Size Tracking: Data size automatically recorded in metadata
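The NaN check is easy to exercise. The sketch below assumes the storage handler created in the Initialization section above, and uses a broad except because the exact exception type raised is not specified here:
import numpy as np
bad = np.array([1.5, np.nan, 4.1], dtype=np.float32)
try:
    storage.save_data("face_areas", bad)  # NaN detection rejects this array
except Exception as exc:  # exact exception type depends on the implementation
    print(f"Rejected: {exc}")
# Replacing the NaNs first lets the save proceed
storage.save_data("face_areas", np.nan_to_num(bad, nan=0.0))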
Loading Data
# Load arrays
face_areas = storage.load_data("face_areas")
# Returns: numpy array
# Load nested structures
graph = storage.load_data("graph")
# Returns: {'edges_source': array([...]), 'edges_destination': array([...]), 'num_nodes': 4}
# Load scalars
num_faces = storage.load_data("num_faces")
# Returns: 42
# Load strings
description = storage.load_data("description")
# Returns: "High-complexity CAD part"
Compression
OptStorage supports compression into a single .data file:
# After all data is saved
compressed_size = storage.compress_store()
# Returns: Size of compressed .data file in bytes
# Result: Creates part_001.zarr.data (ZipStore format)
# Original directory remains until close() is called
Compression Process:
Validates no NaNs exist in arrays (safety check)
Copies all data from DirectoryStore to ZipStore
Preserves all array attributes (including dimension names)
Includes metadata.json in the archive
Returns compressed file size
Benefits:
Single-file distribution
Reduced disk space (Zstd compression)
Atomic operations (write-then-rename pattern)
Read-only access to prevent accidental modification
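Per the initialization rules above, once only the .data archive remains (after compress_store() and close() have run), constructing OptStorage on the same path opens the archive read-only. A sketch:
from hoops_ai.storage import OptStorage
# part_001.zarr.data exists and the part_001.zarr directory has been removed,
# so this handle is opened read-only from the ZipStore archive
readonly = OptStorage(store_path="./flow_output/flows/my_flow/encoded/part_001.zarr")
face_areas = readonly.load_data("face_areas")
readonly.close()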
Dimension Naming for xarray
OptStorage sets the _ARRAY_DIMENSIONS attribute on all arrays to enable xarray compatibility:
# When saving "face_areas" array
# OptStorage automatically sets:
# arr.attrs["_ARRAY_DIMENSIONS"] = ["face_areas_dim_0"]
# For nested data "faceface/a3_distance"
# Dimensions become: ["faceface_a3_distance_dim_0", "faceface_a3_distance_dim_1", ...]
Why This Matters:
Enables direct loading with xarray.open_zarr()
Preserves dimension semantics across save/load cycles
Supports multi-dimensional indexing and slicing
Facilitates interoperability with other Zarr tools
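A sketch of the xarray round trip (the path and keys are illustrative; group-nested arrays may require the group= argument of open_zarr):
import xarray as xr
# The uncompressed Zarr directory can be opened directly because every array
# carries an _ARRAY_DIMENSIONS attribute
ds = xr.open_zarr("./flow_output/flows/my_flow/encoded/part_001.zarr", consolidated=False)
print(ds["face_areas"].dims)    # e.g. ('face_areas_dim_0',)
print(ds["face_areas"].values)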
Cleanup Behavior
storage.close()
Close Operations:
Copy visualization files (visu*) to the stream_cache/ directory
Export metadata to files_summary/{filename}.json
Delete temporary directory (if compression was performed)
Thread-safe: Handles concurrent close() calls gracefully
When to Use OptStorage
Ideal For:
Production datasets (> 1GB of encoded CAD data)
Large arrays (NumPy arrays > 100MB)
Compression needed (reduce storage costs by 3-5x)
Local or network filesystem storage
Append operations (adding data incrementally)
Not Suitable For:
Small datasets (< 100MB - overhead not justified)
Human inspection (binary format - use JsonStorageHandler)
Rapid prototyping (MemoryStorage is faster for iteration)
Implementation 2: MemoryStorage (Testing)
MemoryStorage stores all data in RAM using Python dictionaries, ideal for testing and small datasets.
Initialization
from hoops_ai.storage.datastorage import MemoryStorage
storage = MemoryStorage()
No Parameters: Creates empty in-memory storage
How It Works
Internally maintains two dictionaries:
self._data = {} # Stores data arrays
self._metadata = {} # Stores metadata
All save/load operations are dictionary lookups (O(1) complexity).
Data Operations
import numpy as np
storage = MemoryStorage()
# Save data (stored in internal dict)
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1]))
# Load data (retrieved from dict)
face_areas = storage.load_data("face_areas")
# Metadata (separate internal dict)
storage.save_metadata("size_cadfile", 1024000)
size = storage.load_metadata("size_cadfile")
# Get keys
keys = storage.get_keys() # ['face_areas']
# Close (clears all data)
storage.close()
Features:
Instant operations: No disk I/O overhead
Size tracking: Approximates memory usage with sys.getsizeof()
Nested metadata: Supports hierarchical keys like OptStorage
No compression: compress_store() returns 0
When to Use MemoryStorage
Ideal For:
Unit testing (fast, isolated tests without filesystem I/O)
Prototyping (quick iterations without managing files)
Small datasets (when entire dataset fits in RAM < 1GB)
Temporary storage (data needed only during script execution)
Not Suitable For:
Large datasets (RAM limitations > 10GB data)
Persistence required (data lost when process terminates)
Distributed computing (cannot share memory across processes)
Example: Testing with MemoryStorage
from hoops_ai.storage.datastorage import MemoryStorage
import numpy as np
def test_encoder():
# Use MemoryStorage for fast testing
storage = MemoryStorage()
# Mock encoder operations
storage.save_data("face_indices", np.array([0, 1, 2, 3]))
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1, 3.7]))
storage.save_metadata("num_faces", 4)
# Assertions
assert len(storage.load_data("face_indices")) == 4
assert storage.load_metadata("num_faces") == 4
# Fast cleanup (no disk operations)
storage.close()
test_encoder()
Implementation 3: JsonStorageHandler (Debugging)
JsonStorageHandler stores each data key as a separate JSON file on disk, suitable for human-readable storage.
Initialization
from hoops_ai.storage.datastorage import JsonStorageHandler
storage = JsonStorageHandler(json_dir_path="./json_output")
Parameters:
json_dir_path(str): Directory where JSON files will be stored
Initialization:
Creates directory if it doesn’t exist
Creates metadata.json for metadata storage
Each data key becomes a separate .json file
File Structure
Creates directory structure:
./json_output/
├── face_areas.json      # Data arrays as JSON lists
├── face_types.json
├── edge_lengths.json
└── metadata.json        # Metadata key-value pairs
JSON Serialization
JsonStorageHandler handles NumPy types automatically:
import numpy as np
storage = JsonStorageHandler("./json_data")
# NumPy arrays → JSON lists
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
# Saved as: face_areas.json → [1.5, 2.3, 4.1]
# Complex numbers → JSON objects
storage.save_data("complex_data", np.array([1+2j, 3+4j]))
# Saved as: [{"real": 1.0, "imag": 2.0, "_numpy_complex": true}, ...]
# Dictionaries → JSON objects
storage.save_data("metadata", {"version": "1.0", "author": "HOOPS AI"})
# Load (automatic deserialization)
face_areas = storage.load_data("face_areas")
# Returns: numpy array([1.5, 2.3, 4.1], dtype=float32)
Serialization Rules:
NumPy arrays → JSON lists (with auto-conversion on load)
NumPy scalars → Python primitives
Complex numbers → JSON objects with {"real", "imag", "_numpy_complex"}
Nested structures → Recursive serialization
File Naming:
Keys are sanitized: "face/areas" → face_areas.json
Only alphanumeric characters, -, and _ are allowed in filenames
When to Use JsonStorageHandler
Ideal For:
Debugging (human-readable output for inspection)
Small datasets (< 1000 entries with simple data types)
Interoperability (data consumed by non-Python tools)
Version control (diff-friendly format for tracking changes)
Not Suitable For:
Large arrays (JSON doesn’t efficiently store NumPy arrays - converts to nested lists)
Binary data (byte data gets base64-encoded, increasing size)
Performance-critical applications (slower than binary formats like Zarr, HDF5)
High file counts (creates one file per key, causing filesystem overhead)
Example: JSON Export for Visualization
from hoops_ai.storage.datastorage import JsonStorageHandler
import numpy as np
# Store as human-readable JSON for external tools
storage = JsonStorageHandler("./json_export")
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
storage.save_data("metadata", {
"part_name": "Bracket_V2",
"complexity": "Medium",
"num_features": 42
})
storage.save_metadata("export_timestamp", "2025-10-30T10:30:00")
# Result:
# ./json_export/face_areas.json → [1.5, 2.3, 4.1]
# ./json_export/metadata.json → {"part_name": "Bracket_V2", ...}
# ./json_export/metadata.json (meta) → {"export_timestamp": "..."}
Schema Integration
The DataStorage base class provides schema integration that works across all implementations.
Setting Schemas
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
# Build schema
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
faces_group.create_array("face_normals", ["face", "coordinate"], "float32")
edges_group = builder.create_group("edges", "edge", "Edge data")
edges_group.create_array("edge_lengths", ["edge"], "float32")
schema = builder.build()
# Apply to storage
storage = OptStorage(store_path="./data.zarr")
storage.set_schema(schema)
What set_schema() Does:
Stores schema dictionary as metadata (key: "_storage_schema")
Enables validate_data_against_schema() checks
Enables get_group_for_array() lookups
Configures metadata routing (if schema includes routing rules)
Schema-Driven Validation
When schema is set, DataStorage can validate data before saving:
import numpy as np
# Schema defines face_areas as ["face"] dimension, float32
schema = {...} # From SchemaBuilder
storage.set_schema(schema)
# Valid data
valid_data = np.array([1.5, 2.3, 4.1], dtype=np.float32)
is_valid = storage.validate_data_against_schema("face_areas", valid_data)
# Returns: True
# Invalid: wrong dimensions (2D instead of 1D)
invalid_data = np.array([[1.5, 2.3], [4.1, 3.2]])
is_valid = storage.validate_data_against_schema("face_areas", invalid_data)
# Returns: False
# Invalid: wrong dtype (int instead of float)
invalid_data = np.array([1, 2, 3], dtype=np.int32)
is_valid = storage.validate_data_against_schema("face_areas", invalid_data)
# Returns: False (dtype mismatch)
Validation Logic:
Extracts array specification from schema
Checks number of dimensions matches
Checks dtype matches or is convertible
Returns True if no schema or array not in schema (extensible)
Group Membership
Schema enables storage to determine group membership for arrays:
schema = {...} # Schema with "faces" and "edges" groups
storage.set_schema(schema)
# Lookup group for array
group = storage.get_group_for_array("face_areas")
# Returns: "faces"
group = storage.get_group_for_array("edge_lengths")
# Returns: "edges"
group = storage.get_group_for_array("unknown_array")
# Returns: None (not in schema)
Use Case: Dataset Merging
During merging, the merger uses get_group_for_array() to:
Group arrays from multiple files by their logical group
Concatenate arrays along the correct dimension
Apply special processing (e.g., matrix flattening for “faceface” group)
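A simplified sketch of that grouping step (the real merger is more involved; stores is assumed to be a list of already-opened handlers sharing the same schema):
import numpy as np
from collections import defaultdict
def merge_by_group(stores, array_names):
    merged = defaultdict(dict)
    for name in array_names:
        group = stores[0].get_group_for_array(name)  # e.g. "faces" or "edges"
        if group is None:
            continue  # array not described by the schema; skipped in this sketch
        parts = [s.load_data(name) for s in stores]
        merged[group][name] = np.concatenate(parts, axis=0)
    return merged
# merged = merge_by_group(stores, ["face_areas", "edge_lengths"])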
Metadata Management
File-Level vs Categorical Metadata
DataStorage distinguishes between two types of metadata:
File-Level Metadata (.infoset files):
Information about each individual data file
Examples: file size, processing time, file path, timestamps
One row per file in merged datasets
Routing patterns: "size_*", "duration_*", "processing_*", "flow_name"
Categorical Metadata (.attribset files):
Classification and labeling information
Examples: part category, material type, complexity rating
Used for grouping and filtering datasets
Routing patterns: "*_label", "category", "type", "material_*"
Routing Configuration:
Schemas can define routing rules:
builder.set_metadata_routing_rules(
file_level_patterns=["size_*", "duration_*", "processing_*", "flow_name"],
categorical_patterns=["*_label", "category", "type"],
default_numeric="file_level",
default_categorical="categorical"
)
When schema is set, save_metadata() automatically routes based on:
Explicit definitions in schema
Pattern matching
Default rules based on data type
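For example, with the routing rules above applied through storage.set_schema(schema), metadata calls are routed by pattern; the destinations in the comments follow the patterns defined in set_metadata_routing_rules():
storage.save_metadata("size_cadfile", 1024000)   # matches "size_*"       → .infoset
storage.save_metadata("processing_time", 12.5)   # matches "processing_*" → .infoset
storage.save_metadata("file_label", 2)           # matches "*_label"      → .attribset
storage.save_metadata("category", "bracket")     # matches "category"     → .attribset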
Nested Metadata Keys
All DataStorage implementations support nested metadata using ‘/’ as separator:
# Top-level metadata
storage.save_metadata("size_cadfile", 1024000)
# Nested metadata
storage.save_metadata("file_sizes_KB/face_areas", 45.2)
storage.save_metadata("file_sizes_KB/edge_lengths", 12.3)
storage.save_metadata("processing/duration", 12.5)
storage.save_metadata("processing/timestamp", "2025-10-30T10:30:00")
# Load nested metadata
face_size = storage.load_metadata("file_sizes_KB/face_areas")
# Returns: 45.2
duration = storage.load_metadata("processing/duration")
# Returns: 12.5
# Load entire nested section
file_sizes = storage.load_metadata("file_sizes_KB")
# Returns: {'face_areas': 45.2, 'edge_lengths': 12.3}
Metadata Structure:
{
"size_cadfile": 1024000,
"file_sizes_KB": {
"face_areas": 45.2,
"edge_lengths": 12.3
},
"processing": {
"duration": 12.5,
"timestamp": "2025-10-30T10:30:00"
}
}
Automatic Size Tracking
All DataStorage implementations automatically track data sizes:
storage.save_data("face_areas", large_array)
# Automatically stores size in metadata["file_sizes_KB"]["face_areas"]
# Retrieve size
size_kb = storage.load_metadata("file_sizes_KB/face_areas")
# Returns: Size in kilobytes
Size Calculation:
OptStorage: Actual disk usage (sum of Zarr chunk files)
MemoryStorage: Approximate memory usage (sys.getsizeof())
JsonStorageHandler: File size of the .json file
Complete Usage Examples
CAD Encoding Workflow
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.cadencoder import BrepEncoder
import time
# 1. Build schema
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_indices", ["face"], "int32")
faces_group.create_array("face_areas", ["face"], "float32")
faces_group.create_array("face_types", ["face"], "int32")
edges_group = builder.create_group("edges", "edge", "Edge data")
edges_group.create_array("edge_indices", ["edge"], "int32")
edges_group.create_array("edge_lengths", ["edge"], "float32")
builder.define_file_metadata("size_cadfile", "int64", "File size")
builder.define_file_metadata("processing_time", "float32", "Processing time")
builder.define_categorical_metadata("file_label", "int32", "Classification")
builder.set_metadata_routing_rules(
file_level_patterns=["size_*", "processing_*"],
categorical_patterns=["*_label"]
)
schema = builder.build()
# 2. Initialize storage with schema
storage = OptStorage(store_path="./encoded/part_001.zarr")
storage.set_schema(schema)
# 3. Encode CAD file
loader = HOOPSLoader()
model = loader.create_from_file("part_001.step")
brep = model.get_brep()
encoder = BrepEncoder(brep_access=brep, storage_handler=storage)
start_time = time.time()
# Push geometric features (validated by schema)
encoder.push_face_indices()
encoder.push_face_attributes()
encoder.push_edge_indices()
encoder.push_edge_attributes()
processing_time = time.time() - start_time
# 4. Save metadata (automatically routed)
import os
storage.save_metadata("size_cadfile", os.path.getsize("part_001.step")) # → .infoset
storage.save_metadata("processing_time", processing_time) # → .infoset
storage.save_metadata("file_label", 2) # → .attribset
# 5. Compress and close
compressed_size = storage.compress_store()
print(f"Compressed to {compressed_size / 1024:.2f} KB")
storage.close()
Schema Validation in Practice
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
import numpy as np
# Build strict schema
builder = SchemaBuilder(domain="production_data")
group = builder.create_group("measurements", "sample")
group.create_array("temperature", ["sample"], "float32")
group.create_array("pressure", ["sample"], "float32")
schema = builder.build()
storage = OptStorage("./validated_data.zarr")
storage.set_schema(schema)
# Valid data passes
valid_temps = np.array([20.5, 25.3, 22.1], dtype=np.float32)
if storage.validate_data_against_schema("temperature", valid_temps):
storage.save_data("temperature", valid_temps)
print("✓ Data validated and saved")
# Invalid data is caught
invalid_temps = np.array([[20.5, 25.3], [22.1, 23.4]]) # Wrong dimensions
if not storage.validate_data_against_schema("temperature", invalid_temps):
print("✗ Validation failed: wrong dimensions")
# Handle error appropriately
Implementation Comparison
| Feature | OptStorage (Zarr) | MemoryStorage | JsonStorageHandler |
|---|---|---|---|
| Persistence | Disk (Zarr format) | RAM (volatile) | Disk (JSON files) |
| Compression | Yes (Zstd) | No | No |
| Size Limit | Disk capacity | RAM capacity | Disk capacity |
| Speed | Fast (chunked) | Fastest (in-memory) | Slow (JSON parsing) |
| Human-Readable | No | N/A | Yes |
| Multi-File Output | Single directory | N/A | One file per key |
| xarray Support | Yes (dimension names) | No | No |
| Chunking | Automatic | N/A | N/A |
| Compression Ratio | ~10:1 typical | N/A | Minimal |
| Concurrent Access | Read-only after compress | No | Read-only |
| Best For | Production, large data | Testing, small data | Export, inspection |
| Schema Support | Full | Full | Full |
| Metadata Files | metadata.json | Internal dict | metadata.json |
| NaN Detection | Yes (automatic) | No | No |
Choosing the Right Implementation
Decision Guide
| Scenario | Recommended Implementation |
|---|---|
| Unit testing | MemoryStorage |
| Debugging single files | JsonStorageHandler |
| Production datasets > 1GB | OptStorage |
| Large arrays > 100MB | OptStorage |
| Prototyping < 100 files | MemoryStorage |
| Data export to external tools | JsonStorageHandler |
| Cloud storage compatibility | OptStorage |
| Version control tracking | JsonStorageHandler |
Best Practices
Storage Backend Selection
Development Phase:
Use MemoryStorage for unit tests (fast, isolated)
Use JsonStorageHandler for debugging single files
Keep datasets small (< 100 files) during prototyping
Production Phase:
Use OptStorage for all encoded data
Enable compression for large arrays (UV grids, point clouds)
Monitor storage costs (compression typically reduces size by 3-5x)
Error Handling
Always Use close() or Context Managers:
# Recommended: Explicit close
storage = OptStorage("output.zarr")
try:
storage.save_data("face_areas", face_areas)
storage.save_metadata("processing_time", 12.5)
finally:
storage.close()
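If the storage class does not implement the context-manager protocol itself, contextlib.closing provides the same guarantee for any object that exposes close():
from contextlib import closing
from hoops_ai.storage import OptStorage
with closing(OptStorage("output.zarr")) as storage:
    storage.save_data("face_areas", face_areas)
    storage.save_metadata("processing_time", 12.5)
# storage.close() is called automatically here, even if an exception was raised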
Check for Missing Keys:
# Check before loading
if "face_areas" in storage.get_keys():
face_areas = storage.load_data("face_areas")
else:
print("Warning: face_areas not found")
# Or use try/except
try:
face_areas = storage.load_data("face_areas")
except KeyError:
print("face_areas not present in storage")
Performance Optimization
Batch Operations:
# Good: Batch save operations
storage = OptStorage("output.zarr")
for key, data in encoded_data.items():
storage.save_data(key, data)
# Single compress at the end
storage.compress_store()
storage.close()
# Avoid: Open/close for each operation
# This is inefficient - don't do this
for key, data in encoded_data.items():
storage = OptStorage("output.zarr") # Bad: repeated opens
storage.save_data(key, data)
storage.close()
Compression Settings:
from hoops_ai.storage import OptStorage
# OptStorage uses Zstd compression (level 12) by default
# This provides excellent compression ratio with reasonable speed
storage = OptStorage("output.zarr", compress_extension=".data")
# After saving all data
compressed_size = storage.compress_store()
print(f"Compressed to {compressed_size / (1024**2):.2f} MB")
Summary
The DataStorage system provides HOOPS AI with:
Unified Interface: Consistent API across Zarr, JSON, and in-memory storage
Schema Integration: Validates data and routes metadata using SchemaBuilder dictionaries
Flexible Backends: Choose storage based on use case (production, testing, export)
Automatic Features: Size tracking, compression, dimension naming, NaN detection
Metadata Organization: Separates file-level (.infoset) and categorical (.attribset) metadata
Integration with SchemaBuilder:
SchemaBuilder.build() → Schema Dictionary
↓
DataStorage.set_schema(schema)
↓
┌──────────────────┴──────────────────┐
↓ ↓
save_data() with validation save_metadata() with routing
↓ ↓
Group-organized storage .infoset / .attribset files
The schema dictionary serves as the configuration contract between data producers and storage, ensuring:
Consistent data organization
Validated data types and dimensions
Predictable metadata routing
Schema-guided dataset merging
Choose the appropriate DataStorage implementation based on your workflow:
OptStorage: Production pipelines, large-scale data, compression needed
MemoryStorage: Unit tests, prototyping, temporary data
JsonStorageHandler: Data export, human inspection, external tool integration
See also
CAD Data Encoding - Using BrepEncoder with DataStorage
Datasets - ML-Ready Inputs - SchemaBuilder for data organization
hoops_ai.storage - Complete API reference