Datasets - ML-Ready Inputs

Introduction

Now that you’ve seen how to access and preprocess CAD data, it’s time to organize the extracted features into datasets suitable for machine learning workflows.

The SchemaBuilder provides a user-friendly, explicit API for defining data storage schemas in HOOPS AI. By defining schemas that specify array dimensions, data types, and logical groupings, you ensure that CAD features are consistently organized across multiple files. This consistency is essential for creating ML-ready inputs: schemas guarantee that face attributes, edge features, and graph data from different CAD files have compatible shapes and types that can be merged into batched tensors for training. Schema validation catches dimension mismatches early, before they cause runtime errors in your ML pipeline.

Schemas define how data should be organized into logical groups and arrays, enabling predictable data merging, validation, and metadata routing. The SchemaBuilder creates Python dictionaries that serve as configuration blueprints for DataStorage implementations.

Key Concept: The SchemaBuilder produces a schema dictionary that tells DataStorage implementations:

  • How to organize arrays into logical groups

  • What dimensions each array should have

  • How to validate incoming data

  • Where to route metadata (file-level vs. categorical)

The module follows a declarative pattern:

SchemaBuilder → Schema Dictionary → DataStorage.set_schema() → Validated Storage Operations
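
A condensed sketch of that flow, using the APIs covered in the rest of this section (the file path and field values are illustrative):

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
import numpy as np

# 1. SchemaBuilder: declare groups, arrays, and metadata
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_areas", ["face"], "float32", "Surface area of each face")

# 2. Schema Dictionary: a plain Python dict describing the layout
schema = builder.build()

# 3. DataStorage.set_schema(): the storage now knows groups, dimensions, and dtypes
storage = OptStorage(store_path="./encoded_data/example_part.data")
storage.set_schema(schema)

# 4. Validated Storage Operations: data is checked and routed according to the schema
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
storage.close()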

SchemaBuilder Overview

Purpose

The SchemaBuilder class provides a standard, object-oriented API for creating data storage schemas without requiring method chaining.

Initialization

from hoops_ai.storage.datasetstorage import SchemaBuilder

builder = SchemaBuilder(
    domain="CAD_analysis",
    version="1.0",
    description="Schema for CAD geometric feature extraction"
)

Parameters

  • domain (str): Domain name for this schema (e.g., ‘CAD_analysis’, ‘manufacturing_data’)

  • version (str): Schema version for compatibility tracking (default: ‘1.0’)

  • description (str, optional): Human-readable description of the schema’s purpose

Understanding Schema Components

The SchemaBuilder organizes data through three core components:

Schema
├── Groups (logical containers)
│   ├── faces group
│   │   ├── face_areas array [face] → float32
│   │   ├── face_types array [face] → int32
│   │   └── face_normals array [face, coordinate] → float32
│   ├── edges group
│   │   ├── edge_lengths array [edge] → float32
│   │   └── edge_types array [edge] → int32
│   └── graph group
│       └── edges_source array [edge] → int32
└── Metadata
    ├── File-level (.infoset)
    └── Categorical (.attribset)

Groups

Groups are logical containers that organize related arrays. Each group has:

  • Name: Unique identifier (e.g., 'faces', 'edges')

  • Primary Dimension: Main indexing dimension (e.g., 'face', 'edge', 'batch')

  • Description: What data this group contains

  • Special Processing: Optional processing hint (e.g., 'matrix_flattening', 'nested_edges')

Creating a Group:

faces_group = builder.create_group(
    name="faces",
    primary_dimension="face",
    description="Face geometric data",
    special_processing=None  # Optional
)

The method returns a Group object used to define arrays within that group.

Arrays

Arrays are the actual data containers within groups. Each array specifies:

  • Name: Unique identifier within the group

  • Dimensions: List of dimension names defining the array’s shape

  • Dtype: Data type ('float32', 'float64', 'int32', 'int64', 'bool', 'str')

  • Description: What this array represents

  • Validation Rules: Optional constraints (min_value, max_value, etc.)

Basic Array Definition:

# 1D array: face areas (N faces)
faces_group.create_array(
    name="face_areas",
    dimensions=["face"],
    dtype="float32",
    description="Surface area of each face"
)

Multi-Dimensional Arrays:

# 2D array: face normals (N_faces × 3 coordinates)
faces_group.create_array(
    name="face_normals",
    dimensions=["face", "coordinate"],
    dtype="float32",
    description="Normal vectors for each face (N x 3)"
)

# 4D array: UV grid samples (N_faces × U × V × components)
faces_group.create_array(
    name="face_uv_grids",
    dimensions=["face", "uv_x", "uv_y", "component"],
    dtype="float32",
    description="Sampled points on face surfaces"
)

Arrays with Validation Rules:

faces_group.create_array(
    name="face_areas",
    dimensions=["face"],
    dtype="float32",
    description="Surface area of each face",
    min_value=0.0,  # Validation: areas must be positive
    max_value=1e6   # Validation: reasonable upper bound
)

Managing Arrays:

# Remove an array
success = faces_group.remove_array("face_areas")
# Returns: True if removed, False if not found

# Get array specification
array_spec = faces_group.get_array("face_areas")
# Returns: {'dims': ['face'], 'dtype': 'float32', 'description': '...'}

# List all arrays in group
array_names = faces_group.list_arrays()
# Returns: ['face_areas', 'face_types', 'face_normals', ...]

Metadata

Metadata is divided into two categories based on storage location:

  • File-level Metadata: Stored in .infoset files

    • Information about each data file (file size, processing time, file path)

  • Categorical Metadata: Stored in .attribset files

    • Categorical classifications (labels, categories, complexity ratings)
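
As a quick preview of the distinction (the full parameter lists follow in the next section), each category is declared with its own method, and list_metadata_fields() reports where every field will be stored:

# File-level metadata → stored in .infoset
builder.define_file_metadata("size_cadfile", "int64", "File size in bytes")

# Categorical metadata → stored in .attribset
builder.define_categorical_metadata(
    "material_type", "str", "Material classification",
    values=["steel", "aluminum", "plastic", "composite"]
)

fields = builder.list_metadata_fields()
# Returns: {'file_level': ['size_cadfile', ...], 'categorical': ['material_type', ...]}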

Working with SchemaBuilder

The SchemaBuilder provides methods to manage groups, define metadata, and configure routing rules. This section covers the essential operations for building complete schemas.

Managing Groups

Once you have a SchemaBuilder instance, you can create, retrieve, remove, and list groups.

Creating Groups:

# Create a new group for edge data
edges_group = builder.create_group(
    name="edges",
    primary_dimension="edge",
    description="Edge-related geometric properties",
    special_processing=None
)

Retrieving Existing Groups:

# Get a previously created group
faces_group = builder.get_group("faces")
# Returns: Group object or None if not found

Removing Groups:

# Remove a group from the schema
success = builder.remove_group("edges")
# Returns: True if removed, False if not found

Listing All Groups:

# Get names of all groups in the schema
group_names = builder.list_groups()
# Returns: ['faces', 'edges', 'graph', 'metadata']

Defining Metadata

Metadata definitions tell DataStorage where to route metadata and how to validate it. You can define both file-level and categorical metadata with optional validation rules.

File-Level Metadata

File-level metadata is stored in .infoset Parquet files and represents information about each data file.

# Define numeric metadata with validation
builder.define_file_metadata(
    name="size_cadfile",
    dtype="int64",
    description="File size in bytes",
    required=False,
    min_value=0
)

# Define timing information
builder.define_file_metadata(
    name="processing_time",
    dtype="float32",
    description="Processing time in seconds",
    required=False
)

# Define required string metadata
builder.define_file_metadata(
    name="flow_name",
    dtype="str",
    description="Name of the flow that processed this file",
    required=True
)

Parameters:

  • name (str): Metadata field name

  • dtype (str): Data type ('str', 'int32', 'int64', 'float32', 'float64', 'bool')

  • description (str, optional): Field description

  • required (bool): Whether this field must be present (default: False)

  • **validation_rules: Additional constraints (min_value, max_value, etc.)

Categorical Metadata

Categorical metadata is stored in .attribset Parquet files and represents categorical classifications.

# Define categorical metadata with labeled values
builder.define_categorical_metadata(
    name="machining_category",
    dtype="int32",
    description="Machining complexity classification",
    values=[1, 2, 3, 4, 5],
    labels=["Simple", "Easy", "Medium", "Hard", "Complex"],
    required=False
)

# Define material classification
builder.define_categorical_metadata(
    name="material_type",
    dtype="str",
    description="Material classification",
    values=["steel", "aluminum", "plastic", "composite"],
    required=True
)

Parameters:

  • name (str): Metadata field name

  • dtype (str): Data type

  • description (str, optional): Field description

  • values (List, optional): List of allowed values

  • labels (List[str], optional): Human-readable labels corresponding to values

  • required (bool): Whether this field must be present (default: False)

  • **validation_rules: Additional constraints

Metadata Routing

The SchemaBuilder provides flexible metadata routing using pattern matching and default rules. This determines whether metadata goes to .infoset or .attribset files.

Setting Routing Rules

builder.set_metadata_routing_rules(
    file_level_patterns=[
        "description",
        "flow_name",
        "stream *",      # Wildcard: matches 'stream .scs', 'stream .prc', etc.
        "Item",
        "size_*",        # Wildcard: matches 'size_cadfile', 'size_compressed', etc.
        "duration_*",    # Wildcard: matches all duration fields
        "processing_*"
    ],
    categorical_patterns=[
        "category",
        "type",
        "*_label",       # Wildcard: matches 'file_label', 'part_label', etc.
        "material_*",
        "complexity"
    ],
    default_numeric="file_level",      # Where numeric metadata goes by default
    default_categorical="categorical", # Where categorical metadata goes by default
    default_string="categorical"       # Where string metadata goes by default
)

Pattern Matching Rules:

  • * wildcard matches any characters

  • Patterns are case-insensitive

  • Explicit definitions override pattern matching
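
The matcher itself is internal to HOOPS AI; as a rough mental model only, the behavior described above can be pictured with Python's fnmatch module, lower-casing both sides to mimic the case-insensitive matching:

from fnmatch import fnmatch

file_level_patterns = ["size_*", "duration_*", "processing_*"]

def matches_file_level(field_name: str) -> bool:
    # Wildcard match against the file-level patterns, case-insensitively
    return any(fnmatch(field_name.lower(), pattern.lower())
               for pattern in file_level_patterns)

matches_file_level("size_cadfile")    # True  → would be routed to .infoset
matches_file_level("machining_type")  # False → checked against categorical patterns / defaults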

Querying Routing Destinations

# Get routing destination for a specific field
destination = builder.get_metadata_routing("file_label")
# Returns: "categorical" or "file_level"

# List all metadata fields by category
fields = builder.list_metadata_fields()
# Returns: {'file_level': ['size_cadfile', 'processing_time', ...],
#           'categorical': ['file_label', 'material_type', ...]}

Validating Metadata

# Validate a specific field's value
is_valid = builder.validate_metadata_field("machining_category", 3)
# Returns: True (3 is in allowed values [1,2,3,4,5])

is_valid = builder.validate_metadata_field("machining_category", 10)
# Returns: False (10 not in allowed values)

# Validate entire schema
errors = builder.validate_metadata_schema()
# Returns: List of error messages, empty if valid

Using Schema Templates

The SchemaBuilder supports predefined templates for common use cases, reducing boilerplate code and providing a quick start for standard data organization patterns.

Predefined Templates

Templates provide complete, ready-to-use schemas for common domains. You can load a template directly or use convenience functions.

Loading Templates

# Start with a complete CAD analysis template
builder = SchemaBuilder().from_template('cad_basic')

# Or use convenience functions
from hoops_ai.storage.datasetstorage import create_cad_schema
builder = create_cad_schema()

Available Templates

The following templates are available out of the box:

  1. cad_basic - Basic CAD analysis with faces, edges, and graph data

    • Groups: faces, edges, graph, metadata

    • Arrays: face_areas, face_indices, edge_lengths, etc.

  2. cad_advanced - Advanced CAD with surface properties and relationships

    • Groups: faces, edges, faceface, graph, performance

    • Arrays: face_uv_grids, edge_dihedral_angles, extended_adjacency, etc.

  3. manufacturing_basic - Manufacturing data with quality metrics

    • Groups: production, sensors, materials

    • Arrays: quality_score, temperature, pressure, composition, etc.

  4. sensor_basic - Sensor data with timestamps and readings

    • Groups: timeseries, sensors, events

    • Arrays: timestamp, value, sensor_type, event_type, etc.

Discovering Templates

from hoops_ai.storage.datasetstorage.schema_templates import SchemaTemplates

# List all available templates
templates = SchemaTemplates.list_templates()
# Returns: ['cad_basic', 'cad_advanced', 'manufacturing_basic', 'sensor_basic']

# Get description of a specific template
description = SchemaTemplates.get_template_description('cad_advanced')
# Returns: "Advanced CAD analysis including surface properties and relationships"

Extending Templates

Templates can be extended to add custom groups and arrays while preserving the base template structure. This is useful when you need standard CAD data plus custom application-specific fields.

# Start with CAD basic template and add custom data
builder = SchemaBuilder().extend_template('cad_basic')

# Add custom group for ML predictions
predictions_group = builder.create_group(
    "predictions",
    "face",
    "ML model predictions for faces"
)
predictions_group.create_array("predicted_class", ["face"], "int32")
predictions_group.create_array("confidence_score", ["face"], "float32")

# Add custom metadata
builder.define_categorical_metadata(
    "model_version",
    "str",
    "ML model version used for predictions"
)

# Build the extended schema
schema = builder.build()

Building and Exporting Schemas

Once you’ve defined your schema using the SchemaBuilder, you can build it into a Python dictionary and export it for reuse or documentation.

Schema Dictionary Structure

The build() method produces a Python dictionary that serves as the configuration blueprint for DataStorage implementations. This dictionary contains all the information needed to organize, validate, and route data.

schema = builder.build()

The resulting schema dictionary has the following structure:

{
    "domain": "CAD_analysis",
    "version": "1.0",
    "description": "Schema for CAD geometric feature extraction",
    "groups": {
        "faces": {
            "primary_dimension": "face",
            "description": "Face geometric data",
            "arrays": {
                "face_areas": {
                    "dims": ["face"],
                    "dtype": "float32",
                    "description": "Surface area of each face"
                },
                "face_normals": {
                    "dims": ["face", "coordinate"],
                    "dtype": "float32",
                    "description": "Normal vectors for each face (N x 3)"
                }
                # ... more arrays
            }
        },
        "edges": {
            "primary_dimension": "edge"
            # ... edge arrays
        }
        # ... more groups
    },
    "metadata": {
        "file_level": {
            "size_cadfile": {
                "dtype": "int64",
                "description": "File size in bytes",
                "required": False
            }
            # ... more file-level metadata
        },
        "categorical": {
            "file_label": {
                "dtype": "int32",
                "description": "Classification label",
                "values": [0, 1, 2, 3, 4],
                "required": False
            }
            # ... more categorical metadata
        },
        "routing_rules": {
            "file_level_patterns": ["description", "flow_name", "size_*"],
            "categorical_patterns": ["*_label", "category"],
            "default_numeric": "file_level",
            "default_categorical": "categorical",
            "default_string": "categorical"
        }
    }
}

Exporting and Loading Schemas

Schemas can be exported to JSON files for version control, documentation, or sharing across projects.

# Export to JSON string
json_string = builder.to_json(indent=2)

# Save to file
builder.save_to_file("my_schema.json")

# Load from file
loaded_builder = SchemaBuilder.load_from_file("my_schema.json")

Integration with DataStorage

The schema dictionary produced by SchemaBuilder is consumed by DataStorage implementations via the set_schema() method. This integration enables validated storage operations and automatic metadata routing.

Schema Flow

The following diagram illustrates how schemas flow from definition to validated operations:

        ┌─────────────────┐
        │  SchemaBuilder  │
        │   .build()      │
        └────────┬────────┘
                 │
                 ▼
         Schema Dictionary
         (Python dict)
                 │
                 ▼
        ┌─────────────────────┐
        │   DataStorage       │
        │   .set_schema(dict) │
        └────────┬────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Storage Operations with Validation     │
│  • save_data() → validates against dims │
│  • save_metadata() → routes correctly   │
│  • get_group_for_array() → uses schema  │
└─────────────────────────────────────────┘

Applying Schema to Storage

To apply a schema to a storage instance, use the set_schema() method:

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder

# Build schema
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
schema = builder.build()

# Apply schema to storage
storage = OptStorage(store_path="./encoded_data/my_part.data")
storage.set_schema(schema)  # ← Schema dictionary passed here

# Now storage knows:
# 1. "face_areas" belongs to "faces" group
# 2. It should have dimensions ["face"]
# 3. It should be float32 type

Schema-Driven Operations

Once a schema is set, DataStorage can perform several intelligent operations based on the schema definition.

Validating Data Dimensions

import numpy as np

# This will be validated against schema
face_areas = np.array([1.5, 2.3, 4.1, 3.7], dtype=np.float32)
storage.save_data("face_areas", face_areas)
# ✓ Validates: correct dtype, 1D array as expected
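
The complementary failure case, as a sketch: the schema declares face_areas as a 1D float32 array over the "face" dimension, so data with a different shape should be rejected (the exact exception type depends on the DataStorage implementation).

# A 2D array where the schema expects a 1D "face" array
bad_face_areas = np.array([[1.5, 2.3], [4.1, 3.7]], dtype=np.float32)
try:
    storage.save_data("face_areas", bad_face_areas)
except Exception as exc:  # assumed to be ValueError or similar
    print(f"Rejected by schema validation: {exc}")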

Routing Metadata Correctly

# Metadata routing based on schema rules
storage.save_metadata("size_cadfile", 1024000)     # → .infoset (file-level)
storage.save_metadata("file_label", 3)              # → .attribset (categorical)
storage.save_metadata("flow_name", "my_flow")       # → .infoset (file-level pattern match)

Determining Group Membership

group_name = storage.get_group_for_array("face_areas")
# Returns: "faces"

group_name = storage.get_group_for_array("edge_lengths")
# Returns: "edges"

Schema-Aware Merging

During dataset merging, the schema guides:

  • Which arrays belong to the same group

  • What dimensions to concatenate along

  • How to handle special processing (e.g., matrix flattening)
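
A minimal NumPy sketch of what concatenating along the primary dimension means for two encoded parts. This only illustrates the idea; the actual merging is performed by the dataset-merging utilities, not by user code:

import numpy as np

# face_areas from two encoded parts: per-part face counts differ, but dtype and
# dimensionality agree because both parts follow the same schema
face_areas_part1 = np.array([1.5, 2.3, 4.1], dtype=np.float32)       # 3 faces
face_areas_part2 = np.array([0.8, 3.7, 2.2, 5.0], dtype=np.float32)  # 4 faces

# The "faces" group has primary dimension "face", so arrays are concatenated along it
merged_face_areas = np.concatenate([face_areas_part1, face_areas_part2])  # shape (7,)

# A merger typically also records per-part offsets so slices can be recovered later
face_offsets = np.array([0, 3, 7])  # part i owns faces face_offsets[i]:face_offsets[i+1]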

Practical Examples

This section demonstrates complete workflows using SchemaBuilder with real-world CAD encoding scenarios.

Complete CAD Encoding Workflow

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.cadencoder import BrepEncoder

# 1. Define schema for CAD encoding
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")

# Define faces group
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_indices", ["face"], "int32", "Face IDs")
faces_group.create_array("face_areas", ["face"], "float32", "Face surface areas")
faces_group.create_array("face_types", ["face"], "int32", "Surface type classification")
faces_group.create_array("face_uv_grids", ["face", "uv_x", "uv_y", "component"],
                        "float32", "UV-sampled points and normals")

# Define edges group
edges_group = builder.create_group("edges", "edge", "Edge geometric data")
edges_group.create_array("edge_indices", ["edge"], "int32", "Edge IDs")
edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths")
edges_group.create_array("edge_types", ["edge"], "int32", "Curve type classification")

# Define graph group
graph_group = builder.create_group("graph", "graphitem", "Topology graph")
graph_group.create_array("edges_source", ["edge"], "int32", "Source face indices")
graph_group.create_array("edges_destination", ["edge"], "int32", "Dest face indices")
graph_group.create_array("num_nodes", ["graphitem"], "int32", "Number of nodes")

# Define metadata
builder.define_file_metadata("size_cadfile", "int64", "CAD file size in bytes")
builder.define_file_metadata("processing_time", "float32", "Encoding time in seconds")
builder.define_categorical_metadata("file_label", "int32", "Part classification label")

# Set routing rules
builder.set_metadata_routing_rules(
    file_level_patterns=["size_*", "processing_*", "duration_*"],
    categorical_patterns=["*_label", "category", "type"]
)

# Build schema
schema = builder.build()

# 2. Apply schema to storage
storage = OptStorage(store_path="./encoded/part_001.zarr")
storage.set_schema(schema)

# 3. Encode CAD data with schema-validated storage
loader = HOOPSLoader()
model = loader.create_from_file("part_001.step")
brep = model.get_brep()

encoder = BrepEncoder(brep_access=brep, storage_handler=storage)

# These operations are now schema-validated
encoder.push_face_indices()        # → "faces" group
encoder.push_face_attributes()     # → "faces" group
encoder.push_facegrid(ugrid=5, vgrid=5)  # → "faces" group

encoder.push_edge_indices()        # → "edges" group
encoder.push_edge_attributes()     # → "edges" group

encoder.push_face_adjacency_graph()  # → "graph" group

# Metadata is automatically routed
import os
import time
start_time = time.time()
# ... encoding happens ...
storage.save_metadata("size_cadfile", os.path.getsize("part_001.step"))  # → .infoset
storage.save_metadata("processing_time", time.time() - start_time)       # → .infoset
storage.save_metadata("file_label", 2)                                    # → .attribset

storage.close()

Quick Setup with Templates

from hoops_ai.storage.datasetstorage import create_cad_schema
from hoops_ai.storage import OptStorage

# Quick setup with template
builder = create_cad_schema()  # Loads 'cad_basic' template

# Customize as needed
predictions = builder.create_group("predictions", "face", "ML predictions")
predictions.create_array("predicted_label", ["face"], "int32")
predictions.create_array("confidence", ["face"], "float32")

schema = builder.build()

# Apply to storage
storage = OptStorage(store_path="./output/part.data")
storage.set_schema(schema)

Schema Validation in Practice

import numpy as np

# Create schema with validation rules
builder = SchemaBuilder(domain="validated_data")
group = builder.create_group("measurements", "sample")
group.create_array("temperature", ["sample"], "float32",
                  min_value=-273.15,  # Absolute zero
                  max_value=5000.0)   # Reasonable max

schema = builder.build()
storage = OptStorage("./data.zarr")
storage.set_schema(schema)

# Valid data
valid_temps = np.array([20.5, 25.3, 22.1], dtype=np.float32)
storage.save_data("temperature", valid_temps)  # ✓ Success

# Invalid data (contains value below min)
invalid_temps = np.array([20.5, -300.0, 22.1], dtype=np.float32)
try:
    storage.save_data("temperature", invalid_temps)  # ✗ Validation fails
except ValueError as e:
    print(f"Validation error: {e}")

Performance and Best Practices

Understanding Schema Performance

Schema Impact on Performance

Minimal Runtime Overhead:

  • Schema validation is optional (controlled by the DataStorage implementation)

  • Schema lookup is dictionary-based (O(1) operations)

  • Schema is set once per storage instance

Benefits for Large-Scale Data:

  • Predictable Merging: Schema-guided dataset merging is deterministic

  • Type Safety: Prevents type mismatches that cause downstream errors

  • Memory Efficiency: Dimension information enables efficient chunk sizing

  • Parallelization: Schema enables safe parallel writes to different groups

Best Practices

Follow these recommendations when working with SchemaBuilder:

  1. Define Schema Early: Set schema before any data operations

  2. Use Templates: Start with templates for common patterns

  3. Validate Once: Enable schema validation during development and disable it in production

  4. Document Dimensions: Clear dimension names improve code readability

  5. Version Schemas: Increment the schema version when making breaking changes (see the sketch below)
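
For the versioning recommendation, one lightweight pattern (a sketch, not a built-in compatibility check) is to compare the stored schema's version with the version your code expects before processing data:

EXPECTED_SCHEMA_VERSION = "1.0"

loaded_builder = SchemaBuilder.load_from_file("my_schema.json")
stored_version = loaded_builder.build().get("version")

if stored_version != EXPECTED_SCHEMA_VERSION:
    raise RuntimeError(
        f"Schema version mismatch: expected {EXPECTED_SCHEMA_VERSION}, found {stored_version}"
    )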

Summary

The SchemaBuilder is HOOPS AI’s declarative interface for defining data organization. It provides:

  • Schema dictionaries that configure DataStorage behavior

  • Logical groups to organize related arrays

  • Array dimensions specifications for validation and merging

  • Metadata routing to appropriate storage locations (.infoset vs .attribset)

  • Templates for common use cases (CAD, manufacturing, sensors)

  • Validation to catch data issues early

The schema dictionary serves as the contract between data producers (encoders) and data consumers (storage, merging, ML pipelines), ensuring consistent, validated, and well-organized data throughout the HOOPS AI system.

See Also

For quick integration examples showing how to use SchemaBuilder with Flow pipelines, see: