Data Flow Management

Introduction

This section of the programming guide covers the concepts and best practices for managing data flow within HOOPS AI applications: CAD data access, feature encoding, dataset creation and merging, storage and persistence, and data flow orchestration and customisation.

HOOPS AI is a flow-based data processing framework that transforms CAD files into machine learning-ready datasets. The Data Flow Management layer provides the foundation for this transformation, handling everything from loading CAD files to organizing encoded data into structured, queryable datasets.

Architecture Overview

The Data Flow Management system consists of five integrated modules that work together to process CAD data:

┌──────────────────────────────────────────────────────────────┐
│            HOOPS AI Data Flow Architecture                   │
└──────────────────────────────────────────────────────────────┘

MODULE 1: CAD ACCESS
┌────────────────────────────────────────────────────────────┐
│  CAD Files → HOOPSLoader → HOOPSModel → HOOPSBrep          │
│  • Load CAD models with HOOPS Exchange                     │
│  • Access B-Rep geometry and topology                      │
│  • Query faces, edges, vertices, shells                    │
│  • Extract geometric properties                            │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 2: CAD ENCODING
┌────────────────────────────────────────────────────────────┐
│  HOOPSBrep → BrepEncoder → Feature Arrays                  │
│  • Extract geometric features (areas, normals, UV grids)   │
│  • Extract topological features (adjacency, connectivity)  │
│  • Compute shape descriptors (histograms)                  │
│  • Push features to storage system                         │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 3: STORAGE
┌────────────────────────────────────────────────────────────┐
│  DataStorage → Schema Validation → Zarr Compression        │
│  • Schema-driven data organization                         │
│  • Compressed Zarr format storage                          │
│  • Metadata routing (file-level vs categorical)            │
│  • Automatic dimension naming for xarray                   │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 4: FLOW ORCHESTRATION
┌────────────────────────────────────────────────────────────┐
│  @flowtask Decorators → Flow Manager → Parallel Execution  │
│  • Define tasks declaratively                              │
│  • Automatic parallel execution                            │
│  • Progress tracking and error handling                    │
│  • Generate visualization assets                           │
└────────────────────────────────────────────────────────────┘
                         ↓
MODULE 5: DATASET MANAGEMENT
┌────────────────────────────────────────────────────────────┐
│  DatasetMerger → Unified Dataset → (.dataset, .infoset)    │
│  • Merge thousands of .data files                          │
│  • Provenance tracking with file IDs                       │
│  • Schema-driven consolidation                             │
│  • Parquet metadata for efficient queries                  │
└────────────────────────────────────────────────────────────┘

Key Design Principles

The Data Flow Management system is built on several core principles that ensure efficient, maintainable, and scalable CAD data processing:

Declarative Over Imperative

Use @flowtask decorators to define what to process, not how to parallelize it. The framework handles threading, process pools, and error management automatically.

# You write this (declarative)
@flowtask.transform(
    name="encode_cad",
    inputs=["cad_file", "cad_loader", "storage"],
    outputs=["face_count", "edge_count"]
)
def encode_cad(cad_file, cad_loader, storage):
    # Just process one file
    encoder = BrepEncoder(cad_loader, storage)
    return encoder.encode(cad_file)

# Framework handles this (imperative)
# - Process pool creation
# - Task distribution across workers
# - Error handling and retries
# - Progress tracking
# - Result aggregation

Schema-Driven Data Organization

Define your data structure once using SchemaBuilder, and it propagates through storage, validation, merging, and querying. No manual bookkeeping of array dimensions or metadata routing.

# Define schema once
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
faces_group.create_array("face_types", ["face"], "int32")
schema = builder.build()

# Schema governs:
# - Storage validation (correct dimensions, data types)
# - Metadata routing (file-level vs categorical)
# - Dataset merging (group-based concatenation)
# - Query operations (array discovery, filtering)

Flow-Based Processing

All operations are organized into Flows – pipelines of tasks that transform data step-by-step. Flows handle dependency resolution, logging, and output management.

# Define tasks
@flowtask.extract(...)
def gather_files(source): ...

@flowtask.transform(...)
def encode_cad(cad_file, ...): ...

# Create flow (automatic dependency resolution)
flow = hoops_ai.create_flow(
    name="my_pipeline",
    tasks=[gather_files, encode_cad],
    auto_dataset_export=True  # Automatic merging
)

# Execute (parallel, tracked, logged)
flow_output, summary, flow_file = flow.process(inputs={...})

Modular Separation of Concerns

Each module has a clear, single responsibility; the sketch after this list shows how the boundaries compose:

CAD Access Module

Load CAD files and provide low-level geometry/topology access. No feature extraction or storage concerns.

CAD Encoding Module

Extract features from B-Rep structures. No file I/O or storage management.

Storage Module

Persist data with schema validation. No encoding logic or CAD file handling.

Flow Module

Orchestrate tasks and manage execution. No knowledge of CAD-specific operations.

Dataset Module

Merge and query datasets. No encoding or flow orchestration logic.
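
To make the boundaries concrete, here is a minimal sketch of the modules composing end to end. The import paths are assumptions, set_schema is a hypothetical stand-in, and the schema variable is assumed to come from SchemaBuilder as shown above; the constructor and encode call mirror the snippets earlier on this page.

# Sketch only: import paths and the call marked "hypothetical" are
# assumptions; the guides below document the real API.
from hoops_ai.cadaccess import HOOPSLoader    # assumed import path (CAD Access)
from hoops_ai.cadencoder import BrepEncoder   # assumed import path (Encoding)
from hoops_ai.storage import OptStorage       # assumed import path (Storage)

loader = HOOPSLoader()                  # loads files; knows nothing about features
storage = OptStorage("part.data")       # persists arrays; knows nothing about CAD (argument assumed)
storage.set_schema(schema)              # hypothetical: attach a SchemaBuilder schema for validation

encoder = BrepEncoder(loader, storage)  # extracts features; no file I/O of its own
encoder.encode("part.step")             # one file in, validated arrays out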

Documentation Structure

This section is organized into five focused guides covering the complete data flow pipeline:

CAD Data Access

CAD Data Access - Unified interface for loading CAD files and extracting geometric/topological data (sketched briefly at the end of this subsection).

What you’ll learn:

  • HOOPSLoader singleton for CAD file loading with HOOPS Exchange

  • Loading 100+ CAD file formats (STEP, IGES, CATIA, SolidWorks, Parasolid, etc.)

  • HOOPSModel interface for accessing loaded CAD model properties

  • HOOPSBrep interface for B-Rep geometry and topology queries

  • Querying faces, edges, vertices, shells, and topological relationships

  • Extracting geometric properties (areas, lengths, bounding boxes, normals)

  • Understanding the B-Rep data structure and component hierarchy

  • Configuring loading options for feature extraction and solid loading

  • Resource management and lifecycle patterns

When to use this guide:

  • You need to load CAD files programmatically from multiple formats

  • You want to extract geometric or topological properties directly

  • You’re implementing custom feature extraction logic

  • You need to understand the B-Rep structure and CAD Access architecture

  • You’re debugging CAD file loading issues
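
A sketch of the loading pattern this guide documents. This is illustrative only: the import path and every accessor marked hypothetical are placeholders, not the confirmed API.

# Sketch: import path and accessor names are assumptions; see the guide.
from hoops_ai.cadaccess import HOOPSLoader  # assumed import path

loader = HOOPSLoader()                           # singleton: reuse across files
model = loader.create_from_file("bracket.step")  # hypothetical loading method
brep = model.get_brep()                          # hypothetical B-Rep accessor
for face in brep.get_faces():                    # hypothetical topology query
    area = face.get_area()                       # hypothetical geometric property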

CAD Data Encoding

CAD Data Encoding - Transforming CAD geometry into numeric feature vectors for machine learning (sketched briefly at the end of this subsection).

What you’ll learn:

  • BrepEncoder push-based architecture for feature extraction

  • Geometric features (face areas, face attributes, edge lengths, edge attributes)

  • Surface sampling (UV grids for faces, U grids for edges, face discretization)

  • Topological features (face adjacency graphs, extended adjacency, face neighbors count, face-pair edge paths)

  • Shape descriptors (D2 distance histograms, A3 angle histograms between face pairs)

  • Mathematical formulations for all encoding methods

  • Integration with DataStorage for persisting extracted features

  • Complete encoding workflow examples

  • Performance considerations for large-scale encoding

When to use this guide:

  • You need to convert CAD models into numerical features for ML training

  • You’re building custom encoders for specific ML tasks or domains

  • You want to understand the mathematical formulation of feature extraction

  • You’re optimizing encoding performance for large datasets

  • You need to debug feature extraction issues or validate encoded data
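
The push-based pattern in miniature: the constructor matches the snippet earlier on this page, while the push_* method names are hypothetical placeholders for the extraction calls the guide documents.

# Each push_* call computes one feature family and writes it to storage.
encoder = BrepEncoder(cad_loader, storage)

encoder.push_face_areas()        # hypothetical: geometric features
encoder.push_face_adjacency()    # hypothetical: topological features
encoder.push_d2_histogram()      # hypothetical: D2 shape descriptor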

Datasets - ML-Ready Inputs

Datasets - ML-Ready Inputs - Defining data organization schemas with SchemaBuilder (sketched briefly at the end of this subsection).

What you’ll learn:

  • SchemaBuilder declarative API for defining data storage schemas

  • Understanding schema components (groups, arrays, metadata)

  • Creating logical groups with primary dimensions

  • Defining arrays with explicit dimensions and data types

  • Managing file-level vs categorical metadata

  • Metadata routing with pattern matching rules

  • Using predefined schema templates for common use cases

  • Extending templates with custom groups and arrays

  • Exporting and loading schemas for version control

  • Integration with DataStorage for schema-driven validation

When to use this guide:

  • You need to define custom data organization structures

  • You want consistent, validated data across multiple CAD files

  • You’re setting up ML pipelines requiring predictable data shapes

  • You need to understand how schemas enable dataset merging

  • You’re debugging dimension mismatches or validation errors
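
Building on the schema snippet earlier on this page, here is a sketch of a two-group schema. The add_metadata_rule and export_schema calls are hypothetical names for the routing and version-control features listed above.

# Sketch: the calls marked "hypothetical" are assumptions; see the guide.
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")

faces = builder.create_group("faces", "face", "Per-face arrays")
faces.create_array("face_areas", ["face"], "float32")

edges = builder.create_group("edges", "edge", "Per-edge arrays")
edges.create_array("edge_lengths", ["edge"], "float32")

builder.add_metadata_rule("label_*", route="categorical")  # hypothetical routing rule
schema = builder.build()
builder.export_schema("cad_analysis_v1.json")              # hypothetical: version-control the schema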

Storage and Persistence

Data Storage - Unified interface for persisting and retrieving data with multiple backends (sketched briefly at the end of this subsection).

What you’ll learn:

  • DataStorage abstract interface and plugin architecture

  • OptStorage (Zarr-based) for production use with compression

  • MemoryStorage for testing and prototyping workflows

  • JsonStorageHandler for human-readable debugging output

  • Schema integration for validation and metadata routing

  • Compression strategies and performance tuning

  • Dimension naming for xarray/Dask compatibility

  • File-level vs categorical metadata management

  • Complete usage examples with CAD encoding workflows

  • Choosing the right storage implementation for your use case

When to use this guide:

  • You need to persist encoded CAD features to disk

  • You’re implementing custom storage backends

  • You want to understand how data is organized and compressed

  • You need to optimize storage performance or reduce file sizes

  • You’re debugging storage issues or metadata routing problems
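
The guide's backend-selection advice in one sketch. The import path, constructor arguments, and the attach/write calls are assumptions; schema and areas stand in for values produced earlier in a pipeline.

# Sketch: pick a backend per use case; names below are assumed, not confirmed.
from hoops_ai.storage import MemoryStorage, OptStorage  # assumed import path

storage = MemoryStorage()            # tests and prototyping: nothing written to disk
# storage = OptStorage("part.data")  # production: compressed Zarr store (argument assumed)

storage.set_schema(schema)                    # hypothetical: enables validation and routing
storage.save_data("faces/face_areas", areas)  # hypothetical write call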

Data Flow Customisation

Data Flow Customisation - Building and executing modular, parallel CAD processing pipelines (sketched briefly at the end of this subsection).

What you’ll learn:

  • @flowtask decorators for defining processing steps declaratively

  • Task types (extract, transform, custom) and when to use each

  • Automatic parallel execution with ProcessPoolExecutor

  • Flow creation with hoops_ai.create_flow() and configuration options

  • HOOPSLoader lifecycle management per worker process

  • Comprehensive error handling, logging, and progress tracking

  • Automatic dataset merging with auto_dataset_export=True

  • Windows-specific requirements for multiprocessing

  • Complete workflow examples from CAD files to merged datasets

  • Performance monitoring and optimization strategies

When to use this guide:

  • You’re building end-to-end CAD data processing pipelines

  • You need to process thousands of CAD files efficiently in parallel

  • You want to customize data extraction, transformation, or validation logic

  • You need to integrate HOOPS AI into existing workflows or systems

  • You’re debugging parallel execution, error handling, or performance issues
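
One requirement worth previewing here: flows execute tasks in a ProcessPoolExecutor, and on Windows each worker process re-imports the launching module, so flow construction and execution must sit behind the standard __main__ guard. The sketch reuses the tasks defined earlier on this page; the inputs key is illustrative.

import hoops_ai

if __name__ == "__main__":  # required on Windows for multiprocessing
    flow = hoops_ai.create_flow(
        name="my_pipeline",
        tasks=[gather_files, encode_cad],
        auto_dataset_export=True,
    )
    flow_output, summary, flow_file = flow.process(inputs={"source": "parts/"})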

Quick Start

New to Data Flow Management?

Start with Data Flow Customisation to understand the orchestration layer that ties everything together. This will give you context for how the other modules fit into the pipeline.

Building Custom Encoders?

Start here:

CAD Data Access - Learn how to load CAD files and access geometry

Then:

CAD Data Encoding - Learn how to extract features from B-Rep structures

Finally:

Data Storage - Learn how to persist encoded data with schema validation

Working with Existing Datasets?

Start here:

Data Merging in HOOPS AI - Learn how Flow-generated .data files are merged into unified datasets

Then:

See Dataset Exploration and Mining in the ML section for querying and analysis

Building Production Pipelines?

Study all guides in order:

  1. CAD Data Access - Understand CAD file loading

  2. CAD Data Encoding - Understand feature extraction

  3. Datasets - ML-Ready Inputs - Define schemas for data organization

  4. Data Storage - Configure storage backends and persistence

  5. Data Flow Customisation - Build orchestrated pipelines with automatic merging

  6. Data Merging in HOOPS AI - Understand how datasets are consolidated

Additional Resources

Machine Learning Integration

See Machine Learning Model for guides on training and deploying ML models using the datasets created by this pipeline.

Dataset Exploration

See Dataset Exploration and Mining for comprehensive tools to query, analyze, and prepare merged datasets for ML training.

Visualization

See Data Visualization Experience for tools to visualize CAD data and ML predictions throughout the pipeline.

API Reference

See hoops_ai for complete API documentation of all modules and classes.