Datasets - ML-Ready Inputs

Introduction

Now that you’ve seen how to access and preprocess CAD data, it’s time to organize the extracted features into datasets suitable for machine learning workflows.

The SchemaBuilder provides a user-friendly, explicit API for defining data storage schemas in HOOPS AI. By defining schemas that specify array dimensions, data types, and logical groupings, you ensure that CAD features are consistently organized across multiple files. This consistency is essential for creating ML-ready inputs: schemas guarantee that face attributes, edge features, and graph data from different CAD files have compatible shapes and types that can be merged into batched tensors for training. Schema validation catches dimension mismatches early, before they cause runtime errors in your ML pipeline.

Schemas define how data should be organized into logical groups and arrays, enabling predictable data merging, validation, and metadata routing. The SchemaBuilder creates Python dictionaries that serve as configuration blueprints for DataStorage implementations.

Key Concept: The SchemaBuilder produces a schema dictionary that tells DataStorage implementations:

  • How to organize arrays into logical groups

  • What dimensions each array should have

  • How to validate incoming data

  • Where to route metadata (file-level vs. categorical)

The module follows a declarative pattern:

SchemaBuilder → Schema Dictionary → DataStorage.set_schema() → Validated Storage Operations
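
A condensed sketch of that flow, using the APIs covered in the rest of this section (the file path and field values are illustrative):

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
import numpy as np

# 1. SchemaBuilder: declare groups, arrays, and metadata
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_areas", ["face"], "float32", "Surface area of each face")

# 2. Schema Dictionary: a plain Python dict describing the layout
schema = builder.build()

# 3. DataStorage.set_schema(): the storage now knows groups, dimensions, and dtypes
storage = OptStorage(store_path="./encoded_data/example_part.data")
storage.set_schema(schema)

# 4. Validated Storage Operations: data is checked and routed according to the schema
storage.save_data("face_areas", np.array([1.5, 2.3, 4.1], dtype=np.float32))
storage.close()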

SchemaBuilder Overview

Purpose

The SchemaBuilder class provides a standard, object-oriented API for creating data storage schemas without requiring method chaining.

Initialization

from hoops_ai.storage.datasetstorage import SchemaBuilder

builder = SchemaBuilder(
    domain="CAD_analysis",
    version="1.0",
    description="Schema for CAD geometric feature extraction"
)

Parameters

  • domain (str): Domain name for this schema (e.g., ‘CAD_analysis’, ‘manufacturing_data’)

  • version (str): Schema version for compatibility tracking (default: ‘1.0’)

  • description (str, optional): Human-readable description of the schema’s purpose

Understanding Schema Components

The SchemaBuilder organizes data through three core components:

Schema
├── Groups (logical containers)
│   ├── faces group
│   │   ├── face_areas array [face] → float32
│   │   ├── face_types array [face] → int32
│   │   └── face_normals array [face, coordinate] → float32
│   ├── edges group
│   │   ├── edge_lengths array [edge] → float32
│   │   └── edge_types array [edge] → int32
│   └── graph group
│       └── edges_source array [edge] → int32
└── Metadata
    ├── File-level (.infoset)
    └── Categorical (.attribset)

Groups

Groups are logical containers that organize related arrays. Each group has:

  • Name: Unique identifier (e.g., 'faces', 'edges')

  • Primary Dimension: Main indexing dimension (e.g., 'face', 'edge', 'batch')

  • Description: What data this group contains

  • Special Processing: Optional processing hint (e.g., 'matrix_flattening', 'nested_edges')

Creating a Group:

faces_group = builder.create_group(
    name="faces",
    primary_dimension="face",
    description="Face geometric data",
    special_processing=None  # Optional
)

The method returns a Group object used to define arrays within that group.

Arrays

Arrays are the actual data containers within groups. Each array specifies:

  • Name: Unique identifier within the group

  • Dimensions: List of dimension names defining the array’s shape

  • Dtype: Data type ('float32', 'float64', 'int32', 'int64', 'bool', 'str')

  • Description: What this array represents

  • Validation Rules: Optional constraints (min_value, max_value, etc.)

Basic Array Definition:

# 1D array: face areas (N faces)
faces_group.create_array(
    name="face_areas",
    dimensions=["face"],
    dtype="float32",
    description="Surface area of each face"
)

Multi-Dimensional Arrays:

# 2D array: face normals (N_faces × 3 coordinates)
faces_group.create_array(
    name="face_normals",
    dimensions=["face", "coordinate"],
    dtype="float32",
    description="Normal vectors for each face (N x 3)"
)

# 4D array: UV grid samples (N_faces × U × V × components)
faces_group.create_array(
    name="face_uv_grids",
    dimensions=["face", "uv_x", "uv_y", "component"],
    dtype="float32",
    description="Sampled points on face surfaces"
)

Arrays with Validation Rules:

faces_group.create_array(
    name="face_areas",
    dimensions=["face"],
    dtype="float32",
    description="Surface area of each face",
    min_value=0.0,  # Validation: areas must be positive
    max_value=1e6   # Validation: reasonable upper bound
)

Managing Arrays:

# Remove an array
success = faces_group.remove_array("face_areas")
# Returns: True if removed, False if not found

# Get array specification
array_spec = faces_group.get_array("face_areas")
# Returns: {'dims': ['face'], 'dtype': 'float32', 'description': '...'}

# List all arrays in group
array_names = faces_group.list_arrays()
# Returns: ['face_areas', 'face_types', 'face_normals', ...]

Metadata

Metadata is divided into two categories based on storage location:

  • File-level Metadata: Stored in .infoset files

    • Information about each data file (file size, processing time, file path)

  • Categorical Metadata: Stored in .attribset files

    • Categorical classifications (labels, categories, complexity ratings)
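
As a quick preview of the distinction (the full parameter lists follow in the next section), each category is declared with its own method, and list_metadata_fields() reports where every field will be stored:

# File-level metadata → stored in .infoset
builder.define_file_metadata("size_cadfile", "int64", "File size in bytes")

# Categorical metadata → stored in .attribset
builder.define_categorical_metadata(
    "material_type", "str", "Material classification",
    values=["steel", "aluminum", "plastic", "composite"]
)

fields = builder.list_metadata_fields()
# Returns: {'file_level': ['size_cadfile', ...], 'categorical': ['material_type', ...]}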

Working with SchemaBuilder

The SchemaBuilder provides methods to manage groups, define metadata, and configure routing rules. This section covers the essential operations for building complete schemas.

Managing Groups

Once you have a SchemaBuilder instance, you can create, retrieve, remove, and list groups.

Creating Groups:

# Create a new group for edge data
edges_group = builder.create_group(
    name="edges",
    primary_dimension="edge",
    description="Edge-related geometric properties",
    special_processing=None
)

Retrieving Existing Groups:

# Get a previously created group
faces_group = builder.get_group("faces")
# Returns: Group object or None if not found

Removing Groups:

# Remove a group from the schema
success = builder.remove_group("edges")
# Returns: True if removed, False if not found

Listing All Groups:

# Get names of all groups in the schema
group_names = builder.list_groups()
# Returns: ['faces', 'edges', 'graph', 'metadata']

Defining Metadata

Metadata definitions tell DataStorage where to route metadata and how to validate it. You can define both file-level and categorical metadata with optional validation rules.

File-Level Metadata

File-level metadata is stored in .infoset Parquet files and represents information about each data file.

# Define numeric metadata with validation
builder.define_file_metadata(
    name="size_cadfile",
    dtype="int64",
    description="File size in bytes",
    required=False,
    min_value=0
)

# Define timing information
builder.define_file_metadata(
    name="processing_time",
    dtype="float32",
    description="Processing time in seconds",
    required=False
)

# Define required string metadata
builder.define_file_metadata(
    name="flow_name",
    dtype="str",
    description="Name of the flow that processed this file",
    required=True
)

Parameters:

  • name (str): Metadata field name

  • dtype (str): Data type ('str', 'int32', 'int64', 'float32', 'float64', 'bool')

  • description (str, optional): Field description

  • required (bool): Whether this field must be present (default: False)

  • **validation_rules: Additional constraints (min_value, max_value, etc.)

Categorical Metadata

Categorical metadata is stored in .attribset Parquet files and represents categorical classifications.

# Define categorical metadata with labeled values
builder.define_categorical_metadata(
    name="machining_category",
    dtype="int32",
    description="Machining complexity classification",
    values=[1, 2, 3, 4, 5],
    labels=["Simple", "Easy", "Medium", "Hard", "Complex"],
    required=False
)

# Define material classification
builder.define_categorical_metadata(
    name="material_type",
    dtype="str",
    description="Material classification",
    values=["steel", "aluminum", "plastic", "composite"],
    required=True
)

Parameters:

  • name (str): Metadata field name

  • dtype (str): Data type

  • description (str, optional): Field description

  • values (List, optional): List of allowed values

  • labels (List[str], optional): Human-readable labels corresponding to values

  • required (bool): Whether this field must be present (default: False)

  • **validation_rules: Additional constraints

Metadata Routing

The SchemaBuilder provides flexible metadata routing using pattern matching and default rules. This determines whether metadata goes to .infoset or .attribset files.

Setting Routing Rules

builder.set_metadata_routing_rules(
    file_level_patterns=[
        "description",
        "flow_name",
        "stream *",      # Wildcard: matches 'stream .scs', 'stream .prc', etc.
        "Item",
        "size_*",        # Wildcard: matches 'size_cadfile', 'size_compressed', etc.
        "duration_*",    # Wildcard: matches all duration fields
        "processing_*"
    ],
    categorical_patterns=[
        "category",
        "type",
        "*_label",       # Wildcard: matches 'file_label', 'part_label', etc.
        "material_*",
        "complexity"
    ],
    default_numeric="file_level",      # Where numeric metadata goes by default
    default_categorical="categorical", # Where categorical metadata goes by default
    default_string="categorical"       # Where string metadata goes by default
)

Pattern Matching Rules:

  • * wildcard matches any characters

  • Patterns are case-insensitive

  • Explicit definitions override pattern matching
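
The matcher itself is internal to HOOPS AI; as a rough mental model only, the behavior described above can be pictured with Python's fnmatch module, lower-casing both sides to mimic the case-insensitive matching:

from fnmatch import fnmatch

file_level_patterns = ["size_*", "duration_*", "processing_*"]

def matches_file_level(field_name: str) -> bool:
    # Wildcard match against the file-level patterns, case-insensitively
    return any(fnmatch(field_name.lower(), pattern.lower())
               for pattern in file_level_patterns)

matches_file_level("size_cadfile")    # True  → would be routed to .infoset
matches_file_level("machining_type")  # False → checked against categorical patterns / defaults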

Querying Routing Destinations

# Get routing destination for a specific field
destination = builder.get_metadata_routing("file_label")
# Returns: "categorical" or "file_level"

# List all metadata fields by category
fields = builder.list_metadata_fields()
# Returns: {'file_level': ['size_cadfile', 'processing_time', ...],
#           'categorical': ['file_label', 'material_type', ...]}

Validating Metadata

# Validate a specific field's value
is_valid = builder.validate_metadata_field("machining_category", 3)
# Returns: True (3 is in allowed values [1,2,3,4,5])

is_valid = builder.validate_metadata_field("machining_category", 10)
# Returns: False (10 not in allowed values)

# Validate entire schema
errors = builder.validate_metadata_schema()
# Returns: List of error messages, empty if valid

Using Schema Templates

The SchemaBuilder supports predefined templates for common use cases, reducing boilerplate code and providing a quick start for standard data organization patterns.

Predefined Templates

Templates provide complete, ready-to-use schemas for common domains. You can load a template directly or use convenience functions.

Loading Templates

# Start with a complete CAD analysis template
builder = SchemaBuilder().from_template('cad_basic')

# Or use convenience functions
from hoops_ai.storage.datasetstorage import create_cad_schema
builder = create_cad_schema()

Available Templates

The following templates are available out of the box:

  1. cad_basic - Basic CAD analysis with faces, edges, and graph data

    • Groups: faces, edges, graph, metadata

    • Arrays: face_areas, face_indices, edge_lengths, etc.

  2. cad_advanced - Advanced CAD with surface properties and relationships

    • Groups: faces, edges, faceface, graph, performance

    • Arrays: face_uv_grids, edge_dihedral_angles, extended_adjacency, etc.

  3. manufacturing_basic - Manufacturing data with quality metrics

    • Groups: production, sensors, materials

    • Arrays: quality_score, temperature, pressure, composition, etc.

  4. sensor_basic - Sensor data with timestamps and readings

    • Groups: timeseries, sensors, events

    • Arrays: timestamp, value, sensor_type, event_type, etc.

Discovering Templates

from hoops_ai.storage.datasetstorage.schema_templates import SchemaTemplates

# List all available templates
templates = SchemaTemplates.list_templates()
# Returns: ['cad_basic', 'cad_advanced', 'manufacturing_basic', 'sensor_basic']

# Get description of a specific template
description = SchemaTemplates.get_template_description('cad_advanced')
# Returns: "Advanced CAD analysis including surface properties and relationships"

Extending Templates

Templates can be extended to add custom groups and arrays while preserving the base template structure. This is useful when you need standard CAD data plus custom application-specific fields.

# Start with CAD basic template and add custom data
builder = SchemaBuilder().extend_template('cad_basic')

# Add custom group for ML predictions
predictions_group = builder.create_group(
    "predictions",
    "face",
    "ML model predictions for faces"
)
predictions_group.create_array("predicted_class", ["face"], "int32")
predictions_group.create_array("confidence_score", ["face"], "float32")

# Add custom metadata
builder.define_categorical_metadata(
    "model_version",
    "str",
    "ML model version used for predictions"
)

# Build the extended schema
schema = builder.build()

Building and Exporting Schemas

Once you’ve defined your schema using the SchemaBuilder, you can build it into a Python dictionary and export it for reuse or documentation.

Schema Dictionary Structure

The build() method produces a Python dictionary that serves as the configuration blueprint for DataStorage implementations. This dictionary contains all the information needed to organize, validate, and route data.

schema = builder.build()

The resulting schema dictionary has the following structure:

{
    "domain": "CAD_analysis",
    "version": "1.0",
    "description": "Schema for CAD geometric feature extraction",
    "groups": {
        "faces": {
            "primary_dimension": "face",
            "description": "Face geometric data",
            "arrays": {
                "face_areas": {
                    "dims": ["face"],
                    "dtype": "float32",
                    "description": "Surface area of each face"
                },
                "face_normals": {
                    "dims": ["face", "coordinate"],
                    "dtype": "float32",
                    "description": "Normal vectors for each face (N x 3)"
                }
                # ... more arrays
            }
        },
        "edges": {
            "primary_dimension": "edge"
            # ... edge arrays
        }
        # ... more groups
    },
    "metadata": {
        "file_level": {
            "size_cadfile": {
                "dtype": "int64",
                "description": "File size in bytes",
                "required": False
            }
            # ... more file-level metadata
        },
        "categorical": {
            "file_label": {
                "dtype": "int32",
                "description": "Classification label",
                "values": [0, 1, 2, 3, 4],
                "required": False
            }
            # ... more categorical metadata
        },
        "routing_rules": {
            "file_level_patterns": ["description", "flow_name", "size_*"],
            "categorical_patterns": ["*_label", "category"],
            "default_numeric": "file_level",
            "default_categorical": "categorical",
            "default_string": "categorical"
        }
    }
}

Exporting and Loading Schemas

Schemas can be exported to JSON files for version control, documentation, or sharing across projects.

# Export to JSON string
json_string = builder.to_json(indent=2)

# Save to file
builder.save_to_file("my_schema.json")

# Load from file
loaded_builder = SchemaBuilder.load_from_file("my_schema.json")

Integration with DataStorage

The schema dictionary produced by SchemaBuilder is consumed by DataStorage implementations via the set_schema() method. This integration enables validated storage operations and automatic metadata routing.

Schema Flow

The following diagram illustrates how schemas flow from definition to validated operations:

        ┌─────────────────┐
        │  SchemaBuilder  │
        │   .build()      │
        └────────┬────────┘
                 │
                 ▼
         Schema Dictionary
         (Python dict)
                 │
                 ▼
        ┌─────────────────────┐
        │   DataStorage       │
        │   .set_schema(dict) │
        └────────┬────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Storage Operations with Validation     │
│  • save_data() → validates against dims │
│  • save_metadata() → routes correctly   │
│  • get_group_for_array() → uses schema  │
└─────────────────────────────────────────┘

Applying Schema to Storage

To apply a schema to a storage instance, use the set_schema() method:

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder

# Build schema
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
schema = builder.build()

# Apply schema to storage
storage = OptStorage(store_path="./encoded_data/my_part.data")
storage.set_schema(schema)  # ← Schema dictionary passed here

# Now storage knows:
# 1. "face_areas" belongs to "faces" group
# 2. It should have dimensions ["face"]
# 3. It should be float32 type

Schema-Driven Operations

Once a schema is set, DataStorage can perform several intelligent operations based on the schema definition.

Validating Data Dimensions

import numpy as np

# This will be validated against schema
face_areas = np.array([1.5, 2.3, 4.1, 3.7], dtype=np.float32)
storage.save_data("face_areas", face_areas)
# ✓ Validates: correct dtype, 1D array as expected
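
The complementary failure case, as a sketch: the schema declares face_areas as a 1D float32 array over the "face" dimension, so data with a different shape should be rejected (the exact exception type depends on the DataStorage implementation).

# A 2D array where the schema expects a 1D "face" array
bad_face_areas = np.array([[1.5, 2.3], [4.1, 3.7]], dtype=np.float32)
try:
    storage.save_data("face_areas", bad_face_areas)
except Exception as exc:  # assumed to be ValueError or similar
    print(f"Rejected by schema validation: {exc}")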

Routing Metadata Correctly

# Metadata routing based on schema rules
storage.save_metadata("size_cadfile", 1024000)     # → .infoset (file-level)
storage.save_metadata("file_label", 3)              # → .attribset (categorical)
storage.save_metadata("flow_name", "my_flow")       # → .infoset (file-level pattern match)

Determining Group Membership

group_name = storage.get_group_for_array("face_areas")
# Returns: "faces"

group_name = storage.get_group_for_array("edge_lengths")
# Returns: "edges"

Schema-Aware Merging

During dataset merging, the schema guides:

  • Which arrays belong to the same group

  • What dimensions to concatenate along

  • How to handle special processing (e.g., matrix flattening)
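
A minimal NumPy sketch of what concatenating along the primary dimension means for two encoded parts. This only illustrates the idea; the actual merging is performed by the dataset-merging utilities, not by user code:

import numpy as np

# face_areas from two encoded parts: per-part face counts differ, but dtype and
# dimensionality agree because both parts follow the same schema
face_areas_part1 = np.array([1.5, 2.3, 4.1], dtype=np.float32)       # 3 faces
face_areas_part2 = np.array([0.8, 3.7, 2.2, 5.0], dtype=np.float32)  # 4 faces

# The "faces" group has primary dimension "face", so arrays are concatenated along it
merged_face_areas = np.concatenate([face_areas_part1, face_areas_part2])  # shape (7,)

# A merger typically also records per-part offsets so slices can be recovered later
face_offsets = np.array([0, 3, 7])  # part i owns faces face_offsets[i]:face_offsets[i+1]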

Practical Examples

This section demonstrates complete workflows using SchemaBuilder with real-world CAD encoding scenarios.

Complete CAD Encoding Workflow

from hoops_ai.storage import OptStorage
from hoops_ai.storage.datasetstorage import SchemaBuilder
from hoops_ai.cadaccess import HOOPSLoader
from hoops_ai.cadencoder import BrepEncoder

# 1. Define schema for CAD encoding
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")

# Define faces group
faces_group = builder.create_group("faces", "face", "Face geometric data")
faces_group.create_array("face_indices", ["face"], "int32", "Face IDs")
faces_group.create_array("face_areas", ["face"], "float32", "Face surface areas")
faces_group.create_array("face_types", ["face"], "int32", "Surface type classification")
faces_group.create_array("face_uv_grids", ["face", "uv_x", "uv_y", "component"],
                        "float32", "UV-sampled points and normals")

# Define edges group
edges_group = builder.create_group("edges", "edge", "Edge geometric data")
edges_group.create_array("edge_indices", ["edge"], "int32", "Edge IDs")
edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths")
edges_group.create_array("edge_types", ["edge"], "int32", "Curve type classification")

# Define graph group
graph_group = builder.create_group("graph", "graphitem", "Topology graph")
graph_group.create_array("edges_source", ["edge"], "int32", "Source face indices")
graph_group.create_array("edges_destination", ["edge"], "int32", "Dest face indices")
graph_group.create_array("num_nodes", ["graphitem"], "int32", "Number of nodes")

# Define metadata
builder.define_file_metadata("size_cadfile", "int64", "CAD file size in bytes")
builder.define_file_metadata("processing_time", "float32", "Encoding time in seconds")
builder.define_categorical_metadata("file_label", "int32", "Part classification label")

# Set routing rules
builder.set_metadata_routing_rules(
    file_level_patterns=["size_*", "processing_*", "duration_*"],
    categorical_patterns=["*_label", "category", "type"]
)

# Build schema
schema = builder.build()

# 2. Apply schema to storage
storage = OptStorage(store_path="./encoded/part_001.zarr")
storage.set_schema(schema)

# 3. Encode CAD data with schema-validated storage
loader = HOOPSLoader()
model = loader.create_from_file("part_001.step")
brep = model.get_brep()

encoder = BrepEncoder(brep_access=brep, storage_handler=storage)

# These operations are now schema-validated
encoder.push_face_indices()        # → "faces" group
encoder.push_face_attributes()     # → "faces" group
encoder.push_facegrid(ugrid=5, vgrid=5)  # → "faces" group

encoder.push_edge_indices()        # → "edges" group
encoder.push_edge_attributes()     # → "edges" group

encoder.push_face_adjacency_graph()  # → "graph" group

# Metadata is automatically routed
import os
import time
start_time = time.time()
# ... encoding happens ...
storage.save_metadata("size_cadfile", os.path.getsize("part_001.step"))  # → .infoset
storage.save_metadata("processing_time", time.time() - start_time)       # → .infoset
storage.save_metadata("file_label", 2)                                    # → .attribset

storage.close()

Quick Setup with Templates

from hoops_ai.storage.datasetstorage import create_cad_schema
from hoops_ai.storage import OptStorage

# Quick setup with template
builder = create_cad_schema()  # Loads 'cad_basic' template

# Customize as needed
predictions = builder.create_group("predictions", "face", "ML predictions")
predictions.create_array("predicted_label", ["face"], "int32")
predictions.create_array("confidence", ["face"], "float32")

schema = builder.build()

# Apply to storage
storage = OptStorage(store_path="./output/part.data")
storage.set_schema(schema)

Schema Validation in Practice

import numpy as np

# Create schema with validation rules
builder = SchemaBuilder(domain="validated_data")
group = builder.create_group("measurements", "sample")
group.create_array("temperature", ["sample"], "float32",
                  min_value=-273.15,  # Absolute zero
                  max_value=5000.0)   # Reasonable max

schema = builder.build()
storage = OptStorage("./data.zarr")
storage.set_schema(schema)

# Valid data
valid_temps = np.array([20.5, 25.3, 22.1], dtype=np.float32)
storage.save_data("temperature", valid_temps)  # ✓ Success

# Invalid data (contains value below min)
invalid_temps = np.array([20.5, -300.0, 22.1], dtype=np.float32)
try:
    storage.save_data("temperature", invalid_temps)  # ✗ Validation fails
except ValueError as e:
    print(f"Validation error: {e}")

Performance and Best Practices

Understanding Schema Performance

Schema Impact on Performance

Minimal Runtime Overhead:

  • Schema validation is optional (controlled by the DataStorage implementation)

  • Schema lookup is dictionary-based (O(1) operations)

  • Schema is set once per storage instance

Benefits for Large-Scale Data:

  • Predictable Merging: Schema-guided dataset merging is deterministic

  • Type Safety: Prevents type mismatches that cause downstream errors

  • Memory Efficiency: Dimension information enables efficient chunk sizing

  • Parallelization: Schema enables safe parallel writes to different groups

Best Practices

Follow these recommendations when working with SchemaBuilder:

  1. Define Schema Early: Set schema before any data operations

  2. Use Templates: Start with templates for common patterns

  3. Validate Once: Enable schema validation during development and disable it in production

  4. Document Dimensions: Clear dimension names improve code readability

  5. Version Schemas: Increment the schema version when making breaking changes (see the sketch below)
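
For the versioning recommendation, one lightweight pattern (a sketch, not a built-in compatibility check) is to compare the stored schema's version with the version your code expects before processing data:

EXPECTED_SCHEMA_VERSION = "1.0"

loaded_builder = SchemaBuilder.load_from_file("my_schema.json")
stored_version = loaded_builder.build().get("version")

if stored_version != EXPECTED_SCHEMA_VERSION:
    raise RuntimeError(
        f"Schema version mismatch: expected {EXPECTED_SCHEMA_VERSION}, found {stored_version}"
    )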

Summary

The SchemaBuilder is HOOPS AI’s declarative interface for defining data organization. It provides:

  • Schema dictionaries that configure DataStorage behavior

  • Logical groups to organize related arrays

  • Array dimensions specifications for validation and merging

  • Metadata routing to appropriate storage locations (.infoset vs .attribset)

  • Templates for common use cases (CAD, manufacturing, sensors)

  • Validation to catch data issues early

The schema dictionary serves as the contract between data producers (encoders) and data consumers (storage, merging, ML pipelines), ensuring consistent, validated, and well-organized data throughout the HOOPS AI system.

See Also

For quick integration examples showing how to use SchemaBuilder with Flow pipelines, see: