Data Flow Management
Introduction
This section of the programming guide covers the concepts and best practices for managing data flow within HOOPS AI applications: CAD data access, feature encoding, storage and persistence, dataset creation and merging, and data flow orchestration and customization.
HOOPS AI is a flow-based data processing framework that transforms CAD files into machine learning-ready datasets. The Data Flow Management layer provides the foundation for this transformation, handling everything from loading CAD files to organizing encoded data into structured, queryable datasets.
Architecture Overview
The Data Flow Management system consists of five integrated modules that work together to process CAD data:
┌────────────────────────────────────────────────────────────┐
│              HOOPS AI Data Flow Architecture               │
└────────────────────────────────────────────────────────────┘
MODULE 1: CAD ACCESS
┌────────────────────────────────────────────────────────────┐
│ CAD Files → HOOPSLoader → HOOPSModel → HOOPSBrep           │
│ • Load CAD models with HOOPS Exchange                      │
│ • Access B-Rep geometry and topology                       │
│ • Query faces, edges, vertices, shells                     │
│ • Extract geometric properties                             │
└────────────────────────────────────────────────────────────┘
                              ↓
MODULE 2: CAD ENCODING
┌────────────────────────────────────────────────────────────┐
│ HOOPSBrep → BrepEncoder → Feature Arrays                   │
│ • Extract geometric features (areas, normals, UV grids)   │
│ • Extract topological features (adjacency, connectivity)  │
│ • Compute shape descriptors (histograms)                   │
│ • Push features to storage system                          │
└────────────────────────────────────────────────────────────┘
                              ↓
MODULE 3: STORAGE
┌────────────────────────────────────────────────────────────┐
│ DataStorage → Schema Validation → Zarr Compression         │
│ • Schema-driven data organization                          │
│ • Compressed Zarr format storage                           │
│ • Metadata routing (file-level vs categorical)             │
│ • Automatic dimension naming for xarray                    │
└────────────────────────────────────────────────────────────┘
                              ↓
MODULE 4: FLOW ORCHESTRATION
┌────────────────────────────────────────────────────────────┐
│ @flowtask Decorators → Flow Manager → Parallel Execution   │
│ • Define tasks declaratively                               │
│ • Automatic parallel execution                             │
│ • Progress tracking and error handling                     │
│ • Generate visualization assets                            │
└────────────────────────────────────────────────────────────┘
                              ↓
MODULE 5: DATASET MANAGEMENT
┌────────────────────────────────────────────────────────────┐
│ DatasetMerger → Unified Dataset → (.dataset, .infoset)     │
│ • Merge thousands of .data files                           │
│ • Provenance tracking with file IDs                        │
│ • Schema-driven consolidation                              │
│ • Parquet metadata for efficient queries                   │
└────────────────────────────────────────────────────────────┘
Key Design Principles
The Data Flow Management system is built on several core principles that ensure efficient, maintainable, and scalable CAD data processing:
Declarative Over Imperative
Use @flowtask decorators to define what to process, not how to parallelize it. The framework handles threading, process pools, and error management automatically.
# You write this (declarative)
@flowtask.transform(
    name="encode_cad",
    inputs=["cad_file", "cad_loader", "storage"],
    outputs=["face_count", "edge_count"]
)
def encode_cad(cad_file, cad_loader, storage):
    # Just process one file
    encoder = BrepEncoder(cad_loader, storage)
    return encoder.encode(cad_file)

# Framework handles this (imperative)
# - Process pool creation
# - Task distribution across workers
# - Error handling and retries
# - Progress tracking
# - Result aggregation
Schema-Driven Data Organization
Define your data structure once using SchemaBuilder; the schema then propagates through storage, validation, merging, and querying, with no manual bookkeeping of array dimensions or metadata routing.
# Define schema once
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
faces_group.create_array("face_types", ["face"], "int32")
schema = builder.build()
# Schema governs:
# - Storage validation (correct dimensions, data types)
# - Metadata routing (file-level vs categorical)
# - Dataset merging (group-based concatenation)
# - Query operations (array discovery, filtering)
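To make that propagation concrete, here is a hedged sketch of how the schema might be handed to a storage backend. The OptStorage constructor argument and the store_array call are illustrative assumptions, not the exact API:

import numpy as np

# Assumption: OptStorage accepts the schema at construction and validates
# writes against it (parameter and method names here are illustrative).
storage = OptStorage("parts/part_001.data", schema=schema)

# The schema declares "face_areas" as a 1-D float32 array over the "face"
# dimension, so a write with the wrong shape or dtype would be rejected.
storage.store_array("face_areas", np.zeros(128, dtype="float32"))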
Flow-Based Processing
All operations are organized into Flows – pipelines of tasks that transform data step-by-step. Flows handle dependency resolution, logging, and output management.
# Define tasks
@flowtask.extract(...)
def gather_files(source): ...

@flowtask.transform(...)
def encode_cad(cad_file, ...): ...

# Create flow (automatic dependency resolution)
flow = hoops_ai.create_flow(
    name="my_pipeline",
    tasks=[gather_files, encode_cad],
    auto_dataset_export=True  # Automatic merging
)

# Execute (parallel, tracked, logged)
flow_output, summary, flow_file = flow.process(inputs={...})
Modular Separation of Concerns
Each module has a clear, single responsibility:
- CAD Access Module
Load CAD files and provide low-level geometry/topology access. No feature extraction or storage concerns.
- CAD Encoding Module
Extract features from B-Rep structures. No file I/O or storage management.
- Storage Module
Persist data with schema validation. No encoding logic or CAD file handling.
- Flow Module
Orchestrate tasks and manage execution. No knowledge of CAD-specific operations.
- Dataset Module
Merge and query datasets. No encoding or flow orchestration logic.
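These boundaries are easiest to see in a single-file walkthrough. In the hedged sketch below, only BrepEncoder(cad_loader, storage) and encoder.encode(...) come from the examples on this page; the loader and storage constructors are assumptions:

loader = HOOPSLoader()                   # Module 1: CAD access (assumed ctor)
storage = OptStorage("part_001.data")    # Module 3: storage (assumed ctor)
encoder = BrepEncoder(loader, storage)   # Module 2: feature encoding

encoder.encode("bracket.step")           # extract features, push to storage

# Module 4 (Flows) runs this per-file work in parallel across workers, and
# Module 5 (DatasetMerger) consolidates the resulting .data files.

Note how no module reaches into another's responsibility: the encoder never touches the file system, and the storage backend never inspects CAD geometry.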
Documentation Structure
This section is organized into five focused guides covering the complete data flow pipeline:
CAD Data Access
CAD Data Access - Unified interface for loading CAD files and extracting geometric/topological data.
What you’ll learn:
HOOPSLoader singleton for CAD file loading with HOOPS Exchange
Loading 100+ CAD file formats (STEP, IGES, CATIA, SolidWorks, Parasolid, etc.)
HOOPSModel interface for accessing loaded CAD model properties
HOOPSBrep interface for B-Rep geometry and topology queries
Querying faces, edges, vertices, shells, and topological relationships
Extracting geometric properties (areas, lengths, bounding boxes, normals)
Understanding the B-Rep data structure and component hierarchy
Configuring loading options for feature extraction and solid loading
Resource management and lifecycle patterns
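As a flavor of the API covered there, a minimal sketch: the class names come from this page, but every method name below is a hypothetical stand-in to be checked against the guide:

loader = HOOPSLoader()                            # singleton over HOOPS Exchange
model = loader.create_from_file("bracket.step")   # hypothetical loader call
brep = model.get_brep()                           # hypothetical B-Rep handle

for face in brep.get_faces():                     # hypothetical topology query
    print(face.area())                            # hypothetical property access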
When to use this guide:
You need to load CAD files programmatically from multiple formats
You want to extract geometric or topological properties directly
You’re implementing custom feature extraction logic
You need to understand the B-Rep structure and CAD Access architecture
You’re debugging CAD file loading issues
CAD Data Encoding
CAD Data Encoding - Transforming CAD geometry into numeric feature vectors for machine learning.
What you’ll learn:
BrepEncoder push-based architecture for feature extraction
Geometric features (face areas, face attributes, edge lengths, edge attributes)
Surface sampling (UV grids for faces, U grids for edges, face discretization)
Topological features (face adjacency graphs, extended adjacency, face neighbors count, face-pair edge paths)
Shape descriptors (D2 distance histograms, A3 angle histograms between face pairs)
Mathematical formulations for all encoding methods
Integration with DataStorage for persisting extracted features
Complete encoding workflow examples
Performance considerations for large-scale encoding
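The push-based idea is that the encoder computes each feature and immediately pushes it into the storage backend rather than returning it. A hedged sketch: the constructor matches the earlier example, while the push_* method names are hypothetical:

encoder = BrepEncoder(cad_loader, storage)

encoder.push_face_areas()        # per-face scalars       (hypothetical name)
encoder.push_face_uv_grids()     # sampled surface grids  (hypothetical name)
encoder.push_face_adjacency()    # topological graph      (hypothetical name)
# Each call writes directly into `storage`; nothing is returned to the caller.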
When to use this guide:
You need to convert CAD models into numerical features for ML training
You’re building custom encoders for specific ML tasks or domains
You want to understand the mathematical formulation of feature extraction
You’re optimizing encoding performance for large datasets
You need to debug feature extraction issues or validate encoded data
Datasets - ML-Ready Inputs
Datasets - ML-Ready Inputs - Defining data organization schemas with SchemaBuilder.
What you’ll learn:
SchemaBuilder declarative API for defining data storage schemas
Understanding schema components (groups, arrays, metadata)
Creating logical groups with primary dimensions
Defining arrays with explicit dimensions and data types
Managing file-level vs categorical metadata
Metadata routing with pattern matching rules
Using predefined schema templates for common use cases
Extending templates with custom groups and arrays
Exporting and loading schemas for version control
Integration with DataStorage for schema-driven validation
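For version control, a schema built once can be written to disk and re-loaded later. A hedged sketch reusing the SchemaBuilder calls shown on this page; the save/load helpers and the Schema class name are hypothetical:

builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
faces_group = builder.create_group("faces", "face", "Face data")
faces_group.create_array("face_areas", ["face"], "float32")
schema = builder.build()

schema.save("schemas/cad_analysis_v1.json")           # hypothetical export
schema = Schema.load("schemas/cad_analysis_v1.json")  # hypothetical import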
When to use this guide:
You need to define custom data organization structures
You want consistent, validated data across multiple CAD files
You’re setting up ML pipelines requiring predictable data shapes
You need to understand how schemas enable dataset merging
You’re debugging dimension mismatches or validation errors
Storage and Persistence
Data Storage - Unified interface for persisting and retrieving data with multiple backends.
What you’ll learn:
DataStorage abstract interface and plugin architecture
OptStorage (Zarr-based) for production use with compression
MemoryStorage for testing and prototyping workflows
JsonStorageHandler for human-readable debugging output
Schema integration for validation and metadata routing
Compression strategies and performance tuning
Dimension naming for xarray/Dask compatibility
File-level vs categorical metadata management
Complete usage examples with CAD encoding workflows
Choosing the right storage implementation for your use case
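Because the implementations share the DataStorage interface, swapping backends is a one-line change. A hedged sketch; the constructor arguments are assumptions:

storage = MemoryStorage()                  # in-memory: tests and prototyping
# storage = JsonStorageHandler("debug/")   # human-readable: debugging
# storage = OptStorage("part_001.data")    # compressed Zarr: production

encoder = BrepEncoder(cad_loader, storage) # encoding code is unchanged
encoder.encode("bracket.step")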
When to use this guide:
You need to persist encoded CAD features to disk
You’re implementing custom storage backends
You want to understand how data is organized and compressed
You need to optimize storage performance or reduce file sizes
You’re debugging storage issues or metadata routing problems
Data Flow Customisation
Data Flow Customisation - Building and executing modular, parallel CAD processing pipelines.
What you’ll learn:
@flowtask decorators for defining processing steps declaratively
Task types (extract, transform, custom) and when to use each
Automatic parallel execution with ProcessPoolExecutor
Flow creation with hoops_ai.create_flow() and configuration options
HOOPSLoader lifecycle management per worker process
Comprehensive error handling, logging, and progress tracking
Automatic dataset merging with auto_dataset_export=True
Windows-specific requirements for multiprocessing
Complete workflow examples from CAD files to merged datasets
Performance monitoring and optimization strategies
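One item worth flagging here: on Windows, multiprocessing spawns fresh interpreter processes, so flow execution must be guarded in the main module. The guard itself is standard Python; the flow calls mirror the example earlier on this page:

import hoops_ai

if __name__ == "__main__":   # required on Windows for process-pool execution
    flow = hoops_ai.create_flow(
        name="my_pipeline",
        tasks=[gather_files, encode_cad],
        auto_dataset_export=True
    )
    flow_output, summary, flow_file = flow.process(inputs={...})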
When to use this guide:
You’re building end-to-end CAD data processing pipelines
You need to process thousands of CAD files efficiently in parallel
You want to customize data extraction, transformation, or validation logic
You need to integrate HOOPS AI into existing workflows or systems
You’re debugging parallel execution, error handling, or performance issues
Quick Start
New to Data Flow Management?
Start with Data Flow Customisation to understand the orchestration layer that ties everything together. This will give you context for how the other modules fit into the pipeline.
Building Custom Encoders?
- Start here: → CAD Data Access - Learn how to load CAD files and access geometry
- Then: → CAD Data Encoding - Learn how to extract features from B-Rep structures
- Finally: → Data Storage - Learn how to persist encoded data with schema validation
Working with Existing Datasets?
- Start here: → Data Merging in HOOPS AI - Learn how Flow-generated .data files are merged into unified datasets
- Then: → See Dataset Exploration and Mining in the ML section for querying and analysis
Building Production Pipelines?
Study all guides in order:
1. CAD Data Access - Understand CAD file loading
2. CAD Data Encoding - Understand feature extraction
3. Datasets - ML-Ready Inputs - Define schemas for data organization
4. Data Storage - Configure storage backends and persistence
5. Data Flow Customisation - Build orchestrated pipelines with automatic merging
6. Data Merging in HOOPS AI - Understand how datasets are consolidated
Additional Resources
- Machine Learning Integration
See Machine Learning Model for guides on training and deploying ML models using the datasets created by this pipeline.
- Dataset Exploration
See Dataset Exploration and Mining for comprehensive tools to query, analyze, and prepare merged datasets for ML training.
- Visualization
See Data Visualization Experience for tools to visualize CAD data and ML predictions throughout the pipeline.
- API Reference
See hoops_ai for complete API documentation of all modules and classes.