Datasets - ML-Ready Inputs
Introduction
Now that you’ve seen how to access and preprocess CAD data, it’s time to organize the extracted features into datasets suitable for machine learning workflows.
The SchemaBuilder provides a user-friendly, explicit API for defining data storage schemas in HOOPS AI. By defining schemas that specify array dimensions, data types, and logical groupings, you ensure that CAD features are consistently organized across multiple files. This consistency is essential for creating ML-ready inputs: schemas guarantee that face attributes, edge features, and graph data from different CAD files have compatible shapes and types that can be merged into batched tensors for training. The schema validation catches dimension mismatches early, before they cause runtime errors in your ML pipeline.
Schemas define how data should be organized into logical groups and arrays, enabling predictable data merging, validation, and metadata routing. The SchemaBuilder creates Python dictionaries that serve as configuration blueprints for DataStorage implementations.
Key Concept: The SchemaBuilder produces a schema dictionary that tells DataStorage implementations:
How to organize arrays into logical groups
What dimensions each array should have
How to validate incoming data
Where to route metadata (file-level vs. categorical)
The module follows a declarative pattern:
SchemaBuilder → Schema Dictionary → DataStorage.set_schema() → Validated Storage Operations
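As a preview, here is a minimal end-to-end sketch of that pattern, condensed from the detailed examples later on this page (the store path and array values are illustrative):
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
import numpy as np

# 1. SchemaBuilder: declare one group with one array
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_areas", ["face"], "float32", "Surface area of each face")

# 2. Schema dictionary
schema = builder.build()

# 3. DataStorage.set_schema()
storage = OptStorage(store_path="./encoded_data/example_part.data")
storage.set_schema(schema)

# 4. Validated storage operation
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
storage.close()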
SchemaBuilder Overview
Purpose
The SchemaBuilder class provides a standard, object-oriented API for creating data storage schemas without requiring method chaining.
Initialization
from hoops_ai.storage.datasetstorage import SchemaBuilder

builder = SchemaBuilder(
    domain="CAD_analysis",
    version="1.0",
    description="Schema for CAD geometric feature extraction"
)
Parameters
domain (str): Domain name for this schema (e.g., 'CAD_analysis', 'manufacturing_data')
version (str): Schema version for compatibility tracking (default: '1.0')
description (str, optional): Human-readable description of the schema's purpose
Understanding Schema Components
The SchemaBuilder organizes data through three core components:
Schema
├── Groups (logical containers)
│   ├── faces group
│   │   ├── face_areas array [face] → float32
│   │   ├── face_types array [face] → int32
│   │   └── face_normals array [face, coordinate] → float32
│   ├── edges group
│   │   ├── edge_lengths array [edge] → float32
│   │   └── edge_types array [edge] → int32
│   └── graph group
│       └── edges_source array [edge] → int32
└── Metadata
    ├── File-level (.infoset)
    └── Categorical (.attribset)
Groups
Groups are logical containers that organize related arrays. Each group has:
Name: Unique identifier (e.g., 'faces', 'edges')
Primary Dimension: Main indexing dimension (e.g., 'face', 'edge', 'batch')
Description: What data this group contains
Special Processing: Optional processing hint (e.g., 'matrix_flattening', 'nested_edges')
Creating a Group:
faces_group = builder.create_group(
name="faces",
primary_dimension="face",
description="Face geometric data",
special_processing=None # Optional
)
The method returns a Group object used to define arrays within that group.
Arrays
Arrays are the actual data containers within groups. Each array specifies:
Name: Unique identifier within the group
Dimensions: List of dimension names defining the array’s shape
Dtype: Data type ('float32', 'float64', 'int32', 'int64', 'bool', 'str')
Description: What this array represents
Validation Rules: Optional constraints (min_value, max_value, etc.)
Basic Array Definition:
# 1D array: face areas (N faces)
faces_group.create_array(
name="face_areas",
dimensions=["face"],
dtype="float32",
description="Surface area of each face"
)
Multi-Dimensional Arrays:
# 2D array: face normals (N_faces × 3 coordinates)
faces_group.create_array(
name="face_normals",
dimensions=["face", "coordinate"],
dtype="float32",
description="Normal vectors for each face (N x 3)"
)
# 4D array: UV grid samples (N_faces × U × V × components)
faces_group.create_array(
name="face_uv_grids",
dimensions=["face", "uv_x", "uv_y", "component"],
dtype="float32",
description="Sampled points on face surfaces"
)
Arrays with Validation Rules:
faces_group.create_array(
name="face_areas",
dimensions=["face"],
dtype="float32",
description="Surface area of each face",
min_value=0.0, # Validation: areas must be positive
max_value=1e6 # Validation: reasonable upper bound
)
Managing Arrays:
# Remove an array
success = faces_group.remove_array("face_areas")
# Returns: True if removed, False if not found
# Get array specification
array_spec = faces_group.get_array("face_areas")
# Returns: {'dims': ['face'], 'dtype': 'float32', 'description': '...'}
# List all arrays in group
array_names = faces_group.list_arrays()
# Returns: ['face_areas', 'face_types', 'face_normals', ...]
Metadata
Metadata is divided into two categories based on storage location:
File-level Metadata: Stored in .infoset files; information about each data file (file size, processing time, file path)
Categorical Metadata: Stored in .attribset files; categorical classifications (labels, categories, complexity ratings)
Working with SchemaBuilder
The SchemaBuilder provides methods to manage groups, define metadata, and configure routing rules. This section covers the essential operations for building complete schemas.
Managing Groups
Once you have a SchemaBuilder instance, you can create, retrieve, remove, and list groups.
Creating Groups:
# Create a new group for edge data
edges_group = builder.create_group(
    name="edges",
    primary_dimension="edge",
    description="Edge-related geometric properties",
    special_processing=None
)
Retrieving Existing Groups:
# Get a previously created group
faces_group = builder.get_group("faces")
# Returns: Group object or None if not found
Removing Groups:
# Remove a group from the schema
success = builder.remove_group("edges")
# Returns: True if removed, False if not found
Listing All Groups:
# Get names of all groups in the schema
group_names = builder.list_groups()
# Returns: ['faces', 'edges', 'graph', 'metadata']
Defining Metadata
Metadata definitions tell DataStorage where to route metadata and how to validate it. You can define both file-level and categorical metadata with optional validation rules.
File-Level Metadata
File-level metadata is stored in .infoset Parquet files and represents information about each data file.
# Define numeric metadata with validation
builder.define_file_metadata(
name="size_cadfile",
dtype="int64",
description="File size in bytes",
required=False,
min_value=0
)
# Define timing information
builder.define_file_metadata(
name="processing_time",
dtype="float32",
description="Processing time in seconds",
required=False
)
# Define required string metadata
builder.define_file_metadata(
name="flow_name",
dtype="str",
description="Name of the flow that processed this file",
required=True
)
Parameters:
name (str): Metadata field name
dtype (str): Data type ('str', 'int32', 'int64', 'float32', 'float64', 'bool')
description (str, optional): Field description
required (bool): Whether this field must be present (default: False)
**validation_rules: Additional constraints (min_value, max_value, etc.)
Categorical Metadata
Categorical metadata is stored in .attribset Parquet files and represents categorical classifications.
# Define categorical metadata with labeled values
builder.define_categorical_metadata(
name="machining_category",
dtype="int32",
description="Machining complexity classification",
values=[1, 2, 3, 4, 5],
labels=["Simple", "Easy", "Medium", "Hard", "Complex"],
required=False
)
# Define material classification
builder.define_categorical_metadata(
name="material_type",
dtype="str",
description="Material classification",
values=["steel", "aluminum", "plastic", "composite"],
required=True
)
Parameters:
name (str): Metadata field name
dtype (str): Data type
description (str, optional): Field description
values (List, optional): List of allowed values
labels (List[str], optional): Human-readable labels corresponding to values
required (bool): Whether this field must be present (default: False)
**validation_rules: Additional constraints
Metadata Routing
The SchemaBuilder provides flexible metadata routing using pattern matching and default rules. This determines whether metadata goes to .infoset or .attribset files.
Setting Routing Rules
builder.set_metadata_routing_rules(
file_level_patterns=[
"description",
"flow_name",
"stream *", # Wildcard: matches 'stream .scs', 'stream .prc', etc.
"Item",
"size_*", # Wildcard: matches 'size_cadfile', 'size_compressed', etc.
"duration_*", # Wildcard: matches all duration fields
"processing_*"
],
categorical_patterns=[
"category",
"type",
"*_label", # Wildcard: matches 'file_label', 'part_label', etc.
"material_*",
"complexity"
],
default_numeric="file_level", # Where numeric metadata goes by default
default_categorical="categorical", # Where categorical metadata goes by default
default_string="categorical" # Where string metadata goes by default
)
Pattern Matching Rules:
The * wildcard matches any characters
Patterns are case-insensitive
Explicit definitions override pattern matching
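As an illustration only, and not the library's internal implementation, the resolution order described above can be sketched with Python's fnmatch (the resolve_routing helper and its arguments are hypothetical):
from fnmatch import fnmatch

def resolve_routing(field_name, explicit, file_patterns, categorical_patterns, default):
    """Illustrative resolution order: explicit definition, then patterns, then default."""
    # Explicit definitions override pattern matching
    if field_name in explicit:
        return explicit[field_name]
    name = field_name.lower()
    # Patterns are case-insensitive; '*' matches any characters
    if any(fnmatch(name, p.lower()) for p in file_patterns):
        return "file_level"
    if any(fnmatch(name, p.lower()) for p in categorical_patterns):
        return "categorical"
    return default

# 'size_compressed' matches 'size_*', so it resolves to the file-level store
print(resolve_routing("size_compressed", {}, ["size_*", "duration_*"], ["*_label"], "file_level"))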
Querying Routing Destinations
# Get routing destination for a specific field
destination = builder.get_metadata_routing("file_label")
# Returns: "categorical" or "file_level"
# List all metadata fields by category
fields = builder.list_metadata_fields()
# Returns: {'file_level': ['size_cadfile', 'processing_time', ...],
# 'categorical': ['file_label', 'material_type', ...]}
Validating Metadata
# Validate a specific field's value
is_valid = builder.validate_metadata_field("machining_category", 3)
# Returns: True (3 is in allowed values [1,2,3,4,5])
is_valid = builder.validate_metadata_field("machining_category", 10)
# Returns: False (10 not in allowed values)
# Validate entire schema
errors = builder.validate_metadata_schema()
# Returns: List of error messages, empty if valid
Using Schema Templates
The SchemaBuilder supports predefined templates for common use cases, reducing boilerplate code and providing a quick start for standard data organization patterns.
Predefined Templates
Templates provide complete, ready-to-use schemas for common domains. You can load a template directly or use convenience functions.
Loading Templates
# Start with a complete CAD analysis template
builder = SchemaBuilder().from_template('cad_basic')
# Or use convenience functions
from hoops_ai.storage.datasetstorage import create_cad_schema
builder = create_cad_schema()
Available Templates
The following templates are available out of the box:
cad_basic - Basic CAD analysis with faces, edges, and graph data
Groups: faces, edges, graph, metadata
Arrays: face_areas, face_indices, edge_lengths, etc.
cad_advanced - Advanced CAD with surface properties and relationships
Groups: faces, edges, faceface, graph, performance
Arrays: face_uv_grids, edge_dihedral_angles, extended_adjacency, etc.
manufacturing_basic - Manufacturing data with quality metrics
Groups: production, sensors, materials
Arrays: quality_score, temperature, pressure, composition, etc.
sensor_basic - Sensor data with timestamps and readings
Groups: timeseries, sensors, events
Arrays: timestamp, value, sensor_type, event_type, etc.
Discovering Templates
from hoops_ai.storage.datasetstorage.schema_templates import SchemaTemplates
# List all available templates
templates = SchemaTemplates.list_templates()
# Returns: ['cad_basic', 'cad_advanced', 'manufacturing_basic', 'sensor_basic']
# Get description of a specific template
description = SchemaTemplates.get_template_description('cad_advanced')
# Returns: "Advanced CAD analysis including surface properties and relationships"
Extending Templates
Templates can be extended to add custom groups and arrays while preserving the base template structure. This is useful when you need standard CAD data plus custom application-specific fields.
# Start with CAD basic template and add custom data
builder = SchemaBuilder().extend_template('cad_basic')
# Add custom group for ML predictions
predictions_group = builder.create_group(
"predictions",
"face",
"ML model predictions for faces"
)
predictions_group.create_array("predicted_class", ["face"], "int32")
predictions_group.create_array("confidence_score", ["face"], "float32")
# Add custom metadata
builder.define_categorical_metadata(
"model_version",
"str",
"ML model version used for predictions"
)
# Build the extended schema
schema = builder.build()
Building and Exporting Schemas
Once you’ve defined your schema using the SchemaBuilder, you can build it into a Python dictionary and export it for reuse or documentation.
Schema Dictionary Structure
The build() method produces a Python dictionary that serves as the configuration blueprint for DataStorage implementations. This dictionary contains all the information needed to organize, validate, and route data.
schema = builder.build()
The resulting schema dictionary has the following structure:
{
"domain": "CAD_analysis",
"version": "1.0",
"description": "Schema for CAD geometric feature extraction",
"groups": {
"faces": {
"primary_dimension": "face",
"description": "Face geometric data",
"arrays": {
"face_areas": {
"dims": ["face"],
"dtype": "float32",
"description": "Surface area of each face"
},
"face_normals": {
"dims": ["face", "coordinate"],
"dtype": "float32",
"description": "Normal vectors for each face (N x 3)"
}
# ... more arrays
}
},
"edges": {
"primary_dimension": "edge"
# ... edge arrays
}
# ... more groups
},
"metadata": {
"file_level": {
"size_cadfile": {
"dtype": "int64",
"description": "File size in bytes",
"required": False
}
# ... more file-level metadata
},
"categorical": {
"file_label": {
"dtype": "int32",
"description": "Classification label",
"values": [0, 1, 2, 3, 4],
"required": False
}
# ... more categorical metadata
},
"routing_rules": {
"file_level_patterns": ["description", "flow_name", "size_*"],
"categorical_patterns": ["*_label", "category"],
"default_numeric": "file_level",
"default_categorical": "categorical",
"default_string": "categorical"
}
}
}
Exporting and Loading Schemas
Schemas can be exported to JSON files for version control, documentation, or sharing across projects.
# Export to JSON string
json_string = builder.to_json(indent=2)
# Save to file
builder.save_to_file("my_schema.json")
# Load from file
loaded_builder = SchemaBuilder.load_from_file("my_schema.json")
Integration with DataStorage
The schema dictionary produced by SchemaBuilder is consumed by DataStorage implementations via the set_schema() method. This integration enables validated storage operations and automatic metadata routing.
Schema Flow
The following diagram illustrates how schemas flow from definition to validated operations:
┌─────────────────┐
│  SchemaBuilder  │
│    .build()     │
└────────┬────────┘
         │
         ▼
 Schema Dictionary
   (Python dict)
         │
         ▼
┌─────────────────────┐
│     DataStorage     │
│  .set_schema(dict)  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────────────────────────┐
│   Storage Operations with Validation    │
│ • save_data() → validates against dims  │
│ • save_metadata() → routes correctly    │
│ • get_group_for_array() → uses schema   │
└─────────────────────────────────────────┘
Applying Schema to Storage
To apply a schema to a storage instance, use the set_schema() method:
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
# Build schema
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
schema = builder.build()
# Apply schema to storage
storage = OptStorage(store_path="./encoded_data/my_part.data")
storage.set_schema(schema) # ← Schema dictionary passed here
# Now storage knows:
# 1. "face_areas" belongs to "faces" group
# 2. It should have dimensions ["face"]
# 3. It should be float32 type
Schema-Driven Operations
Once a schema is set, DataStorage can perform several intelligent operations based on the schema definition.
Validating Data Dimensions
import numpy as np
# This will be validated against schema
face_areas = np.array([1.5, 2.3, 4.1, 3.7], dtype=np.float32)
storage.save_data("face_areas", face_areas)
# ✓ Validates: correct dtype, 1D array as expected
Routing Metadata Correctly
# Metadata routing based on schema rules
storage.save_metadata("size_cadfile", 1024000) # → .infoset (file-level)
storage.save_metadata("file_label", 3) # → .attribset (categorical)
storage.save_metadata("flow_name", "my_flow") # → .infoset (file-level pattern match)
Determining Group Membership
group_name = storage.get_group_for_array("face_areas")
# Returns: "faces"
group_name = storage.get_group_for_array("edge_lengths")
# Returns: "edges"
Schema-Aware Merging
During dataset merging, the schema guides:
Which arrays belong to the same group
What dimensions to concatenate along
How to handle special processing (e.g., matrix flattening)
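As a conceptual sketch rather than the actual merge code, arrays that share a group and primary dimension can simply be concatenated along that dimension, which is what makes schema-guided merging deterministic (the per-file dictionaries here are illustrative):
import numpy as np

# Hypothetical per-file extractions that follow the same schema
file_a = {"face_areas": np.array([1.5, 2.3], dtype=np.float32)}
file_b = {"face_areas": np.array([4.1, 3.7, 0.9], dtype=np.float32)}

# The schema says 'face_areas' lives in the 'faces' group with primary dimension 'face',
# so merging concatenates along that axis (axis 0 here)
merged_face_areas = np.concatenate([file_a["face_areas"], file_b["face_areas"]], axis=0)
print(merged_face_areas.shape)  # (5,): two faces from file A plus three from file B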
Practical Examples
This section demonstrates complete workflows using SchemaBuilder with real-world CAD encoding scenarios.
Complete CAD Encoding Workflow
from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.cadencoder import BrepEncoder
# 1. Define schema for CAD encoding
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
# Define faces group
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_indices", ["face"], "int32", "Face IDs")
faces_group.create_array("face_areas", ["face"], "float32", "Face surface areas")
faces_group.create_array("face_types", ["face"], "int32", "Surface type classification")
faces_group.create_array("face_uv_grids", ["face", "uv_x", "uv_y", "component"],
"float32", "UV-sampled points and normals")
# Define edges group
edges_group = builder.create_group("edges", "edge", "Edge geometric data")
edges_group.create_array("edge_indices", ["edge"], "int32", "Edge IDs")
edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths")
edges_group.create_array("edge_types", ["edge"], "int32", "Curve type classification")
# Define graph group
graph_group = builder.create_group("graph", "graphitem", "Topology graph")
graph_group.create_array("edges_source", ["edge"], "int32", "Source face indices")
graph_group.create_array("edges_destination", ["edge"], "int32", "Dest face indices")
graph_group.create_array("num_nodes", ["graphitem"], "int32", "Number of nodes")
# Define metadata
builder.define_file_metadata("size_cadfile", "int64", "CAD file size in bytes")
builder.define_file_metadata("processing_time", "float32", "Encoding time in seconds")
builder.define_categorical_metadata("file_label", "int32", "Part classification label")
# Set routing rules
builder.set_metadata_routing_rules(
file_level_patterns=["size_*", "processing_*", "duration_*"],
categorical_patterns=["*_label", "category", "type"]
)
# Build schema
schema = builder.build()
# 2. Apply schema to storage
storage = OptStorage(store_path="./encoded/part_001.zarr")
storage.set_schema(schema)
# 3. Encode CAD data with schema-validated storage
loader = HOOPSLoader()
model = loader.create_from_file("part_001.step")
brep = model.get_brep()
encoder = BrepEncoder(brep_access=brep, storage_handler=storage)
# These operations are now schema-validated
encoder.push_face_indices() # → "faces" group
encoder.push_face_attributes() # → "faces" group
encoder.push_facegrid(ugrid=5, vgrid=5) # → "faces" group
encoder.push_edge_indices() # → "edges" group
encoder.push_edge_attributes() # → "edges" group
encoder.push_face_adjacency_graph() # → "graph" group
# Metadata is automatically routed
import os
import time
start_time = time.time()
# ... encoding happens ...
storage.save_metadata("size_cadfile", os.path.getsize("part_001.step")) # → .infoset
storage.save_metadata("processing_time", time.time() - start_time) # → .infoset
storage.save_metadata("file_label", 2) # → .attribset
storage.close()
Quick Setup with Templates
from hoops_ai.storage.datasetstorage import create_cad_schema
from hoops_ai.storage import OptStorage
# Quick setup with template
builder = create_cad_schema() # Loads 'cad_basic' template
# Customize as needed
predictions = builder.create_group("predictions", "face", "ML predictions")
predictions.create_array("predicted_label", ["face"], "int32")
predictions.create_array("confidence", ["face"], "float32")
schema = builder.build()
# Apply to storage
storage = OptStorage(store_path="./output/part.data")
storage.set_schema(schema)
Schema Validation in Practice
import numpy as np
# Create schema with validation rules
builder = SchemaBuilder(domain="validated_data")
group = builder.create_group("measurements", "sample")
group.create_array("temperature", ["sample"], "float32",
min_value=-273.15, # Absolute zero
max_value=5000.0) # Reasonable max
schema = builder.build()
storage = OptStorage("./data.zarr")
storage.set_schema(schema)
# Valid data
valid_temps = np.array([20.5, 25.3, 22.1], dtype=np.float32)
storage.save_data("temperature", valid_temps) # ✓ Success
# Invalid data (contains value below min)
invalid_temps = np.array([20.5, -300.0, 22.1], dtype=np.float32)
try:
storage.save_data("temperature", invalid_temps) # ✗ Validation fails
except ValueError as e:
print(f"Validation error: {e}")
Performance and Best Practices
Understanding Schema Performance
Schema Impact on Performance
Minimal Runtime Overhead:
Schema validation is optional (controlled by DataStorage implementation)
Schema lookup is dictionary-based (O(1) operations)
Schema is set once per storage instance
Benefits for Large-Scale Data:
Predictable Merging: Schema-guided dataset merging is deterministic
Type Safety: Prevents type mismatches that cause downstream errors
Memory Efficiency: Dimension information enables efficient chunk sizing
Parallelization: Schema enables safe parallel writes to different groups
Best Practices
Follow these recommendations when working with SchemaBuilder:
Define Schema Early: Set schema before any data operations
Use Templates: Start with templates for common patterns
Validate Once: Run schema validation during development; consider disabling it in production
Document Dimensions: Clear dimension names improve code readability
Version Schemas: Increment the version when making breaking changes, as in the sketch below
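For example, adding a dimension to an existing array breaks shape compatibility with previously encoded files, so it warrants a version bump (a minimal sketch; the array involved is illustrative):
from hoops_ai.storage.datasetstorage import SchemaBuilder

# Version 1.0: UV grids sampled without a component dimension
builder_v1 = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_v1 = builder_v1.create_group("faces", "face", "Face geometric data")
faces_v1.create_array("face_uv_grids", ["face", "uv_x", "uv_y"], "float32")

# Version 2.0: the array gains a 'component' dimension, so files encoded with the
# old schema are no longer shape-compatible; bump the version to signal this
builder_v2 = SchemaBuilder(domain="CAD_analysis", version="2.0")
faces_v2 = builder_v2.create_group("faces", "face", "Face geometric data")
faces_v2.create_array("face_uv_grids", ["face", "uv_x", "uv_y", "component"], "float32")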
Summary
The SchemaBuilder is HOOPS AI’s declarative interface for defining data organization. It provides:
Schema dictionaries that configure DataStorage behavior
Logical groups to organize related arrays
Array dimension specifications for validation and merging
Metadata routing to appropriate storage locations (.infoset vs. .attribset)
Templates for common use cases (CAD, manufacturing, sensors)
Validation to catch data issues early
The schema dictionary serves as the contract between data producers (encoders) and data consumers (storage, merging, ML pipelines), ensuring consistent, validated, and well-organized data throughout the HOOPS AI system.
See Also
For quick integration examples showing how to use SchemaBuilder with Flow pipelines, see:
Flow module - Quick Reference - Quick reference for Flow-based data pipelines with schema integration examples