Data Flow Management

Introduction

This section of the programming guide covers the concepts and best practices for managing data flow within HOOPS AI applications: CAD data access, feature encoding, dataset creation and merging, storage and persistence, and data flow orchestration and customisation.

HOOPS AI is a flow-based data processing framework that transforms CAD files into machine learning-ready datasets. The Data Flow Management layer provides the foundation for this transformation, handling everything from loading CAD files to organizing encoded data into structured, queryable datasets.

Architecture Overview

The Data Flow Management system consists of five integrated modules that work together to process CAD data:

┌──────────────────────────────────────────────────────────────┐
│            HOOPS AI Data Flow Architecture                   │
└──────────────────────────────────────────────────────────────┘

MODULE 1: CAD ACCESS
┌────────────────────────────────────────────────────────────┐
│  CAD Files → HOOPSLoader → HOOPSModel → HOOPSBrep          │
│  • Load CAD models with HOOPS Exchange                     │
│  • Access B-Rep geometry and topology                      │
│  • Query faces, edges, vertices, shells                    │
│  • Extract geometric properties                            │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 2: CAD ENCODING
┌────────────────────────────────────────────────────────────┐
│  HOOPSBrep → BrepEncoder → Feature Arrays                  │
│  • Extract geometric features (areas, normals, UV grids)   │
│  • Extract topological features (adjacency, connectivity)  │
│  • Compute shape descriptors (histograms)                  │
│  • Push features to storage system                         │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 3: STORAGE
┌────────────────────────────────────────────────────────────┐
│  DataStorage → Schema Validation → Zarr Compression        │
│  • Schema-driven data organization                         │
│  • Compressed Zarr format storage                          │
│  • Metadata routing (file-level vs categorical)            │
│  • Automatic dimension naming for xarray                   │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 4: FLOW ORCHESTRATION
┌────────────────────────────────────────────────────────────┐
│  @flowtask Decorators → Flow Manager → Parallel Execution  │
│  • Define tasks declaratively                              │
│  • Automatic parallel execution                            │
│  • Progress tracking and error handling                    │
│  • Generate visualization assets                           │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 5: DATASET MANAGEMENT
┌────────────────────────────────────────────────────────────┐
│  DatasetMerger → Unified Dataset → (.dataset, .infoset)    │
│  • Merge thousands of .data files                          │
│  • Provenance tracking with file IDs                       │
│  • Schema-driven consolidation                             │
│  • Parquet metadata for efficient queries                  │
└────────────────────────────────────────────────────────────┘

Key Design Principles

The Data Flow Management system is built on several core principles that ensure efficient, maintainable, and scalable CAD data processing:

Declarative Over Imperative

Use @flowtask decorators to define what to process, not how to parallelize it. The framework handles threading, process pools, and error management automatically.

# You write this (declarative)
@flowtask.transform(
    name="encode_cad",
    inputs=["cad_file", "cad_loader", "storage"],
    outputs=["face_count", "edge_count"]
)
def encode_cad(cad_file, cad_loader, storage):
    # Just process one file
    encoder = BrepEncoder(cad_loader, storage)
    return encoder.encode(cad_file)

# Framework handles this (imperative)
# - Process pool creation
# - Task distribution across workers
# - Error handling and retries
# - Progress tracking
# - Result aggregation

Schema-Driven Data Organization

Define your data structure once using SchemaBuilder, and it propagates through storage, validation, merging, and querying. No manual bookkeeping of array dimensions or metadata routing.

# Define schema once
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
faces_group.create_array("face_types", ["face"], "int32")
schema = builder.build()

# Schema governs:
# - Storage validation (correct dimensions, data types)
# - Metadata routing (file-level vs categorical)
# - Dataset merging (group-based concatenation)
# - Query operations (array discovery, filtering)

Flow-Based Processing

All operations are organized into Flows – pipelines of tasks that transform data step-by-step. Flows handle dependency resolution, logging, and output management.

# Define tasks
@flowtask.extract(...)
def gather_files(source): ...

@flowtask.transform(...)
def encode_cad(cad_file, ...): ...

# Create flow (automatic dependency resolution)
flow = hoops_ai.create_flow(
    name="my_pipeline",
    tasks=[gather_files, encode_cad],
    auto_dataset_export=True  # Automatic merging
)

# Execute (parallel, tracked, logged)
flow_output, summary, flow_file = flow.process(inputs={...})

Modular Separation of Concerns

Each module has a clear, single responsibility; the sketch after this list shows how the boundaries compose:

CAD Access Module

Load CAD files and provide low-level geometry/topology access. No feature extraction or storage concerns.

CAD Encoding Module

Extract features from B-Rep structures. No file I/O or storage management.

Storage Module

Persist data with schema validation. No encoding logic or CAD file handling.

Flow Module

Orchestrate tasks and manage execution. No knowledge of CAD-specific operations.

Dataset Module

Merge and query datasets. No encoding or flow orchestration logic.
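
To make the boundaries concrete, here is a minimal sketch of the modules composing end to end. The import paths are assumptions, set_schema is a hypothetical stand-in, and the schema variable is assumed to come from SchemaBuilder as shown above; the constructor and encode call mirror the snippets earlier on this page.

# Sketch only: import paths and the call marked "hypothetical" are
# assumptions; the guides below document the real API.
from hoops_ai.cadaccess import HOOPSLoader    # assumed import path (CAD Access)
from hoops_ai.cadencoder import BrepEncoder   # assumed import path (Encoding)
from hoops_ai.storage import OptStorage       # assumed import path (Storage)

loader = HOOPSLoader()                  # loads files; knows nothing about features
storage = OptStorage("part.data")       # persists arrays; knows nothing about CAD (argument assumed)
storage.set_schema(schema)              # hypothetical: attach a SchemaBuilder schema for validation

encoder = BrepEncoder(loader, storage)  # extracts features; no file I/O of its own
encoder.encode("part.step")             # one file in, validated arrays out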

Documentation Structure

This section is organized into five focused guides covering the complete data flow pipeline:

CAD Data Access

CAD Data Access - Unified interface for loading CAD files and extracting geometric/topological data (sketched briefly at the end of this subsection).

What you’ll learn:

  • HOOPSLoader singleton for CAD file loading with HOOPS Exchange

  • Loading 100+ CAD file formats (STEP, IGES, CATIA, SolidWorks, Parasolid, etc.)

  • HOOPSModel interface for accessing loaded CAD model properties

  • HOOPSBrep interface for B-Rep geometry and topology queries

  • Querying faces, edges, vertices, shells, and topological relationships

  • Extracting geometric properties (areas, lengths, bounding boxes, normals)

  • Understanding the B-Rep data structure and component hierarchy

  • Configuring loading options for feature extraction and solid loading

  • Resource management and lifecycle patterns

When to use this guide:

  • You need to load CAD files programmatically from multiple formats

  • You want to extract geometric or topological properties directly

  • You’re implementing custom feature extraction logic

  • You need to understand the B-Rep structure and CAD Access architecture

  • You’re debugging CAD file loading issues
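
A sketch of the loading pattern this guide documents. This is illustrative only: the import path and every accessor marked hypothetical are placeholders, not the confirmed API.

# Sketch: import path and accessor names are assumptions; see the guide.
from hoops_ai.cadaccess import HOOPSLoader  # assumed import path

loader = HOOPSLoader()                           # singleton: reuse across files
model = loader.create_from_file("bracket.step")  # hypothetical loading method
brep = model.get_brep()                          # hypothetical B-Rep accessor
for face in brep.get_faces():                    # hypothetical topology query
    area = face.get_area()                       # hypothetical geometric property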

CAD Data Encoding

CAD Data Encoding - Transforming CAD geometry into numeric feature vectors for machine learning (sketched briefly at the end of this subsection).

What you’ll learn:

  • BrepEncoder push-based architecture for feature extraction

  • Geometric features (face areas, face attributes, edge lengths, edge attributes)

  • Surface sampling (UV grids for faces, U grids for edges, face discretization)

  • Topological features (face adjacency graphs, extended adjacency, face neighbors count, face-pair edge paths)

  • Shape descriptors (D2 distance histograms, A3 angle histograms between face pairs)

  • Mathematical formulations for all encoding methods

  • Integration with DataStorage for persisting extracted features

  • Complete encoding workflow examples

  • Performance considerations for large-scale encoding

When to use this guide:

  • You need to convert CAD models into numerical features for ML training

  • You’re building custom encoders for specific ML tasks or domains

  • You want to understand the mathematical formulation of feature extraction

  • You’re optimizing encoding performance for large datasets

  • You need to debug feature extraction issues or validate encoded data
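
The push-based pattern in miniature: the constructor matches the snippet earlier on this page, while the push_* method names are hypothetical placeholders for the extraction calls the guide documents.

# Each push_* call computes one feature family and writes it to storage.
encoder = BrepEncoder(cad_loader, storage)

encoder.push_face_areas()        # hypothetical: geometric features
encoder.push_face_adjacency()    # hypothetical: topological features
encoder.push_d2_histogram()      # hypothetical: D2 shape descriptor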

Datasets - ML-Ready Inputs

Datasets - ML-Ready Inputs - Defining data organization schemas with SchemaBuilder (sketched briefly at the end of this subsection).

What you’ll learn:

  • SchemaBuilder declarative API for defining data storage schemas

  • Understanding schema components (groups, arrays, metadata)

  • Creating logical groups with primary dimensions

  • Defining arrays with explicit dimensions and data types

  • Managing file-level vs categorical metadata

  • Metadata routing with pattern matching rules

  • Using predefined schema templates for common use cases

  • Extending templates with custom groups and arrays

  • Exporting and loading schemas for version control

  • Integration with DataStorage for schema-driven validation

When to use this guide:

  • You need to define custom data organization structures

  • You want consistent, validated data across multiple CAD files

  • You’re setting up ML pipelines requiring predictable data shapes

  • You need to understand how schemas enable dataset merging

  • You’re debugging dimension mismatches or validation errors
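
Building on the schema snippet earlier on this page, here is a sketch of a two-group schema. The add_metadata_rule and export_schema calls are hypothetical names for the routing and version-control features listed above.

# Sketch: the calls marked "hypothetical" are assumptions; see the guide.
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")

faces = builder.create_group("faces", "face", "Per-face arrays")
faces.create_array("face_areas", ["face"], "float32")

edges = builder.create_group("edges", "edge", "Per-edge arrays")
edges.create_array("edge_lengths", ["edge"], "float32")

builder.add_metadata_rule("label_*", route="categorical")  # hypothetical routing rule
schema = builder.build()
builder.export_schema("cad_analysis_v1.json")              # hypothetical: version-control the schema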

Storage and Persistence

Data Storage - Unified interface for persisting and retrieving data with multiple backends (sketched briefly at the end of this subsection).

What you’ll learn:

  • DataStorage abstract interface and plugin architecture

  • OptStorage (Zarr-based) for production use with compression

  • MemoryStorage for testing and prototyping workflows

  • JsonStorageHandler for human-readable debugging output

  • Schema integration for validation and metadata routing

  • Compression strategies and performance tuning

  • Dimension naming for xarray/Dask compatibility

  • File-level vs categorical metadata management

  • Complete usage examples with CAD encoding workflows

  • Choosing the right storage implementation for your use case

When to use this guide:

  • You need to persist encoded CAD features to disk

  • You’re implementing custom storage backends

  • You want to understand how data is organized and compressed

  • You need to optimize storage performance or reduce file sizes

  • You’re debugging storage issues or metadata routing problems
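
The guide's backend-selection advice in one sketch. The import path, constructor arguments, and the attach/write calls are assumptions; schema and areas stand in for values produced earlier in a pipeline.

# Sketch: pick a backend per use case; names below are assumed, not confirmed.
from hoops_ai.storage import MemoryStorage, OptStorage  # assumed import path

storage = MemoryStorage()            # tests and prototyping: nothing written to disk
# storage = OptStorage("part.data")  # production: compressed Zarr store (argument assumed)

storage.set_schema(schema)                    # hypothetical: enables validation and routing
storage.save_data("faces/face_areas", areas)  # hypothetical write call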

Data Flow Customisation

Data Flow Customisation - Building and executing modular, parallel CAD processing pipelines (sketched briefly at the end of this subsection).

What you’ll learn:

  • @flowtask decorators for defining processing steps declaratively

  • Task types (extract, transform, custom) and when to use each

  • Automatic parallel execution with ProcessPoolExecutor

  • Flow creation with hoops_ai.create_flow() and configuration options

  • HOOPSLoader lifecycle management per worker process

  • Comprehensive error handling, logging, and progress tracking

  • Automatic dataset merging with auto_dataset_export=True

  • Windows-specific requirements for multiprocessing

  • Complete workflow examples from CAD files to merged datasets

  • Performance monitoring and optimization strategies

When to use this guide:

  • You’re building end-to-end CAD data processing pipelines

  • You need to process thousands of CAD files efficiently in parallel

  • You want to customize data extraction, transformation, or validation logic

  • You need to integrate HOOPS AI into existing workflows or systems

  • You’re debugging parallel execution, error handling, or performance issues
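
One requirement worth previewing here: flows execute tasks in a ProcessPoolExecutor, and on Windows each worker process re-imports the launching module, so flow construction and execution must sit behind the standard __main__ guard. The sketch reuses the tasks defined earlier on this page; the inputs key is illustrative.

import hoops_ai

if __name__ == "__main__":  # required on Windows for multiprocessing
    flow = hoops_ai.create_flow(
        name="my_pipeline",
        tasks=[gather_files, encode_cad],
        auto_dataset_export=True,
    )
    flow_output, summary, flow_file = flow.process(inputs={"source": "parts/"})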

Quick Start

New to Data Flow Management?

Start with Data Flow Customisation to understand the orchestration layer that ties everything together. This will give you context for how the other modules fit into the pipeline.

Building Custom Encoders?

Start here:

CAD Data Access - Learn how to load CAD files and access geometry

Then:

CAD Data Encoding - Learn how to extract features from B-Rep structures

Finally:

Data Storage - Learn how to persist encoded data with schema validation

Working with Existing Datasets?

Start here:

Data Merging in HOOPS AI - Learn how Flow-generated .data files are merged into unified datasets

Then:

See Dataset Exploration and Mining in the ML section for querying and analysis

Building Production Pipelines?

Study all guides in order:

  1. CAD Data Access - Understand CAD file loading

  2. CAD Data Encoding - Understand feature extraction

  3. Datasets - ML-Ready Inputs - Define schemas for data organization

  4. Data Storage - Configure storage backends and persistence

  5. Data Flow Customisation - Build orchestrated pipelines with automatic merging

  6. Data Merging in HOOPS AI - Understand how datasets are consolidated

Additional Resources

Machine Learning Integration

See Machine Learning Model for guides on training and deploying ML models using the datasets created by this pipeline.

Dataset Exploration

See Dataset Exploration and Mining for comprehensive tools to query, analyze, and prepare merged datasets for ML training.

Visualization

See Data Visualization Experience for tools to visualize CAD data and ML predictions throughout the pipeline.

API Reference

See hoops_ai for complete API documentation of all modules and classes.