Data Merging in HOOPS AI
Overview
The DatasetMerger is a critical component in HOOPS AI’s data pipeline that consolidates multiple individual encoded CAD files (.data files) into a single, unified, compressed dataset (.dataset file). It performs parallel array concatenation, metadata aggregation, and schema-driven organization to create ML-ready datasets.
Key Purpose: The DatasetMerger transforms a collection of per-file encoded data into a unified, queryable, analysis-ready dataset suitable for machine learning training and exploration.
The system follows an automated pipeline pattern:
Individual .data files → DatasetMerger → Unified .dataset → DatasetExplorer/DatasetLoader → ML Pipeline
Important
This document focuses on the merging process (consolidation of individual files). For information on using the merged datasets (analysis, querying, ML preparation), see:
Dataset Exploration and Mining - Comprehensive guide for dataset exploration and ML training preparation
Architecture & Role
Position in HOOPS AI Pipeline
The DatasetMerger operates in the ETL (Extract-Transform-Load) phase of the HOOPS AI pipeline:
┌──────────────────────────────────────────────────────────────────────┐
│ HOOPS AI Data Pipeline │
└──────────────────────────────────────────────────────────────────────┘
1. ENCODING PHASE (Per-File)
┌─────────────────────────────────────────────────────┐
│ CAD File → Encoder → DataStorage → .data file │
│ (Repeated for each file in parallel) │
└─────────────────────────────────────────────────────┘
↓
2. MERGING PHASE (DatasetMerger) ← YOU ARE HERE
┌─────────────────────────────────────────────────────┐
│ Multiple .data files → DatasetMerger → .dataset │
│ Multiple .json files → DatasetInfo → .infoset │
│ → .attribset │
└─────────────────────────────────────────────────────┘
↓
3. ANALYSIS/ML PHASE
┌─────────────────────────────────────────────────────┐
│ .dataset → DatasetExplorer → Query/Filter/Analyze │
│ .dataset → DatasetLoader → Train/Val/Test Splits │
│ → ML Model Training │
└─────────────────────────────────────────────────────┘
Why Merging is Essential:
Unified Access: ML models need a single dataset, not thousands of individual files
Efficient Queries: Merged data enables fast filtering and statistical analysis
Memory Efficiency: Dask-based chunked operations handle datasets larger than RAM
Metadata Correlation: File-level and categorical metadata linked to data arrays
Core Concepts
Data Organization
Individual encoded files have a group-based structure:
part_001.data (ZipStore Zarr)
├── faces/
│ ├── face_indices (array)
│ ├── face_areas (array)
│ ├── face_types (array)
│ └── face_uv_grids (array)
├── edges/
│ ├── edge_indices (array)
│ ├── edge_lengths (array)
│ └── edge_types (array)
├── graph/
│ ├── edges_source (array)
│ ├── edges_destination (array)
│ └── num_nodes (array)
└── metadata.json
After merging, the dataset consolidates all files:
my_flow.dataset (ZipStore Zarr)
├── faces/
│ ├── face_indices (concatenated from all files)
│ ├── face_areas (concatenated from all files)
│ └── file_id (provenance tracking: which face came from which file)
├── edges/
│ └── ... (similarly concatenated)
└── graph/
└── ... (similarly concatenated)
Group-Based Merging
The DatasetMerger organizes arrays into logical groups for concatenation:
Mathematical Representation:
For a group \(G\) with primary dimension \(d\) (e.g., “face”), given \(N\) files with arrays:
\[A_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}], \quad i = 1, \ldots, N\]
where \(a_{i,j}\) is the \(j\)-th element of array \(A_i\) from file \(i\), and \(n_i\) is the count of dimension \(d\) in file \(i\).
The merged array is:
\[A_{\text{merged}} = A_1 \oplus A_2 \oplus \cdots \oplus A_N\]
where \(\oplus\) denotes concatenation along dimension \(d\):
\[A_{\text{merged}} = [a_{1,1}, \ldots, a_{1,n_1}, a_{2,1}, \ldots, a_{2,n_2}, \ldots, a_{N,1}, \ldots, a_{N,n_N}]\]
Example:
File 1: face_areas = [1.5, 2.3, 4.1] (3 faces)
File 2: face_areas = [3.7, 5.2] (2 faces)
File 3: face_areas = [2.1, 1.8, 6.3] (3 faces)
Merged: face_areas = [1.5, 2.3, 4.1, 3.7, 5.2, 2.1, 1.8, 6.3] (8 faces total)
Provenance Tracking:
A file_id array is added to track origin:
file_id = [0, 0, 0, 1, 1, 2, 2, 2]
└─File 1─┘ └File 2┘ └─File 3─┘
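The same pattern can be reproduced outside HOOPS AI. The following minimal NumPy sketch (not the merger's internal code) shows concatenation along the face dimension plus construction of the file_id provenance array for the three hypothetical files above:
import numpy as np

# Hypothetical per-file face_areas arrays (one entry per source file)
per_file_face_areas = [
    np.array([1.5, 2.3, 4.1], dtype=np.float32),  # File 1: 3 faces
    np.array([3.7, 5.2], dtype=np.float32),       # File 2: 2 faces
    np.array([2.1, 1.8, 6.3], dtype=np.float32),  # File 3: 3 faces
]

# Concatenate along the "face" dimension
merged_face_areas = np.concatenate(per_file_face_areas)

# Provenance: repeat each file's integer ID once per face it contributed
file_id = np.concatenate([
    np.full(len(arr), i, dtype=np.int32)
    for i, arr in enumerate(per_file_face_areas)
])

print(merged_face_areas)  # [1.5 2.3 4.1 3.7 5.2 2.1 1.8 6.3]
print(file_id)            # [0 0 0 1 1 2 2 2]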
Schema-Driven vs Heuristic Discovery
The DatasetMerger supports two modes for discovering data structure:
1. Schema-Driven Discovery (Preferred)
When a schema is available (from SchemaBuilder.set_schema()), the merger uses explicit group definitions:
schema = {
    "groups": {
        "faces": {
            "primary_dimension": "face",
            "arrays": {
                "face_indices": {"dims": ["face"], "dtype": "int32"},
                "face_areas": {"dims": ["face"], "dtype": "float32"}
            }
        },
        "edges": {
            "primary_dimension": "edge",
            "arrays": {
                "edge_lengths": {"dims": ["edge"], "dtype": "float32"}
            }
        }
    }
}
merger.set_schema(schema) # Use explicit structure
Benefits:
Predictable: Structure is explicit and documented
Validated: Arrays are checked against schema dimensions
Extensible: New groups/arrays follow defined patterns
2. Heuristic Discovery (Fallback)
Without a schema, the merger scans files and uses naming patterns to infer structure:
# Heuristic rules:
# - Arrays with "face" in name → "faces" group
# - Arrays with "edge" in name → "edges" group
# - Arrays with "graph" in name → "graph" group
# - Arrays in "graph/edges/" subgroup → flattened to "graph/edges_source"
discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5)
Pattern Matching:
if "face" in array_name.lower():
group = "faces"
elif "edge" in array_name.lower():
group = "edges"
elif "duration" in array_name.lower() or "size" in array_name.lower():
group = "file" # File-level metadata
Using the DatasetMerger Class
The DatasetMerger class provides a high-level interface for consolidating individual encoded CAD files into unified datasets. This section demonstrates how to initialize the merger, discover data structures, and execute the merging process with various configurations.
Initialization
Creating a DatasetMerger instance requires specifying the source files to merge, the output location, and configuration parameters for parallel processing. The merger uses Dask for distributed computation, enabling efficient processing of large datasets that may not fit in memory.
from hoops_ai.storage.datasetstorage import DatasetMerger
# Create merger instance with configuration
merger = DatasetMerger(
zip_files=["path/to/part_001.data", "path/to/part_002.data", ...],
merged_store_path="merged_dataset.dataset",
file_id_codes={"part_001": 0, "part_002": 1, ...}, # From DatasetInfo
dask_client_params={'n_workers': 4, 'threads_per_worker': 4},
delete_source_files=True # Clean up after merging
)
The initialization parameters control both the data sources and the merge behavior:
Parameters:
zip_files (List[str]): List of paths to individual .data files to merge. These are typically the output from parallel CAD encoding tasks.
merged_store_path (str): Path where the merged .dataset file will be created. This becomes the single unified dataset used by DatasetExplorer and DatasetLoader.
file_id_codes (Dict[str, int]): Mapping from file stem (filename without extension) to integer ID. This dictionary provides provenance tracking, allowing you to identify which source file contributed each piece of data in the merged dataset. Typically provided by DatasetInfo.
dask_client_params (Dict): Configuration for Dask’s parallel processing engine. Controls the number of worker processes (n_workers) and threads per worker (threads_per_worker). Adjust based on available CPU cores and memory.
delete_source_files (bool): If True, automatically deletes individual .data files after successful merging. This saves disk space but should only be enabled if the source files are no longer needed.
The file_id_codes dictionary is crucial for maintaining data provenance. During merging, a file_id array is automatically added to each group, tracking which source file each element came from. This enables filtering and analysis by source file using DatasetExplorer.
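As a small illustration of how this provenance works (the mapping values here are hypothetical), the file_id_codes dictionary can be inverted to resolve any file_id value found in the merged arrays back to its source file stem:
# Hypothetical mapping, as supplied by DatasetInfo.build_code_mappings()
file_id_codes = {"part_001": 0, "part_002": 1, "part_003": 2}

# Invert the mapping: integer ID -> file stem
id_to_name = {code: stem for stem, code in file_id_codes.items()}

# Example: a face in the merged dataset carries file_id == 1
print(id_to_name[1])  # "part_002"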
Group Discovery
Before merging data, the DatasetMerger must understand the structure of the encoded files. This discovery process can be automatic (scanning files and inferring structure) or explicit (using a predefined schema).
Automatic Discovery
The merger can automatically discover data groups and arrays by scanning a sample of input files and analyzing their structure:
# Discover groups by scanning input files
discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5)
# Print the discovered structure for verification
merger.print_discovered_structure()
Example output:
Discovered Data Structure:
==================================================
Group: faces
Arrays: ['face_indices', 'face_areas', 'face_types', 'face_uv_grids']
Group: edges
Arrays: ['edge_indices', 'edge_lengths', 'edge_types']
Group: graph
Arrays: ['edges_source', 'edges_destination', 'num_nodes']
The max_files_to_scan parameter controls how many files are examined during discovery. Scanning more files ensures a complete picture of the data structure, but increases discovery time. For large datasets with consistent structure, scanning 5-10 files is typically sufficient.
Automatic discovery uses heuristic rules to group arrays (e.g., arrays with “face” in the name go to the “faces” group). While convenient, this approach may misclassify arrays with unconventional naming or fail to capture semantic relationships between arrays.
Manual Schema Setting
For production pipelines, explicitly setting a schema is recommended. This provides precise control over data organization and validates that all encoded files conform to the expected structure:
from hoops_ai.storage.datasetstorage import SchemaBuilder
# Build schema definition
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
# Define faces group
faces_group = builder.create_group("faces", "face", "Face-level geometric data")
faces_group.create_array("face_areas", ["face"], "float32", "Surface area per face")
faces_group.create_array("face_types", ["face"], "int32", "Face type codes")
# Define edges group
edges_group = builder.create_group("edges", "edge", "Edge-level data")
edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths")
# Build and apply schema
schema = builder.build()
merger.set_schema(schema)
Using a schema provides several benefits:
Validation: The merger verifies that all arrays in the source files match the schema definitions, catching encoding errors early
Documentation: Schema serves as explicit documentation of the dataset structure
Consistency: Guarantees that all source files have the same structure, preventing merge failures
Type Safety: Ensures data types match expectations, avoiding silent type coercion issues
The schema-based approach is particularly important when merging files from different encoding runs or when integrating data from multiple sources.
Merging Execution
Once the DatasetMerger is initialized and the data structure is discovered or defined, you can execute the merge operation. The merger supports two execution modes: single-pass merging for smaller datasets and batch merging for large datasets that exceed available memory.
Single-Pass Merge
For datasets where all source files can be loaded into memory simultaneously, single-pass merging provides the fastest execution:
merger.merge_data(
face_chunk=500_000, # Chunk size for face arrays
edge_chunk=500_000, # Chunk size for edge arrays
faceface_flat_chunk=100_000, # Chunk size for face-face relationships
batch_size=None, # None = merge all files at once
consolidate_metadata=True, # Consolidate Zarr metadata after merge
force_compression=True # Output as compressed .dataset file
)
The chunk size parameters control how Dask partitions arrays during processing. Larger chunks reduce overhead but increase memory usage. The optimal chunk size depends on your dataset characteristics:
face_chunk/edge_chunk: Set based on the typical number of faces/edges per file. For CAD models with 1000-10000 faces, 500,000 is a reasonable default.
faceface_flat_chunk: For face adjacency matrices, use smaller chunks (100,000) to manage the quadratic memory growth.
Process Overview:
The single-pass merge follows these steps:
Load: All .data files are loaded into xarray Datasets, organized by group (faces, edges, graph, etc.)
Concatenate: Arrays within each group are concatenated along their primary dimension (face, edge, etc.)
Provenance: A file_id array is added to each group, tracking which source file contributed each element
Write: The merged data is written to the .dataset file with optional compression
This approach is memory-intensive but fast, making it ideal for datasets with hundreds or thousands of files where the total data size fits comfortably in available RAM.
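The sketch below illustrates steps 2 and 3 with plain xarray (an assumption about the internal representation, not the merger's actual code): each per-file group is treated as an xarray Dataset, concatenated along its primary dimension, and tagged with a file_id variable.
import numpy as np
import xarray as xr

# Two hypothetical per-file "faces" groups as xarray Datasets
ds1 = xr.Dataset({"face_areas": ("face", np.array([1.5, 2.3, 4.1], dtype=np.float32))})
ds2 = xr.Dataset({"face_areas": ("face", np.array([3.7, 5.2], dtype=np.float32))})

# Step 2 (Concatenate): join along the group's primary dimension
merged = xr.concat([ds1, ds2], dim="face")

# Step 3 (Provenance): attach a file_id variable over the same dimension
merged["file_id"] = ("face", np.concatenate([
    np.full(ds.sizes["face"], i, dtype=np.int32)
    for i, ds in enumerate([ds1, ds2])
]))

print(merged.sizes["face"])      # 5
print(merged["file_id"].values)  # [0 0 0 1 1]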
Batch Merge
For large datasets that exceed available memory, batch merging processes files in smaller groups and progressively builds the final dataset:
merger.merge_data(
face_chunk=500_000,
edge_chunk=500_000,
faceface_flat_chunk=100_000,
batch_size=200, # Process 200 files at a time
consolidate_metadata=True,
force_compression=True
)
Setting batch_size to a positive integer activates batch mode. The merger processes files in batches, creating partial .dataset files, then merges these partial datasets into the final result.
Process Overview:
Batch merging follows a two-stage approach:
Partial Merges: The source files are divided into batches of batch_size files. Each batch is merged into a partial .dataset file. If delete_source_files=True, source files are deleted after each batch is successfully merged, progressively freeing disk space.
Final Merge: All partial .dataset files are merged into the final unified .dataset file. The partial datasets are then deleted (if delete_source_files=True).
Mathematical Formulation:
For \(N\) total files with batch size \(B\), the number of batches is:
\[K = \left\lceil \frac{N}{B} \right\rceil\]
Batch \(i\) (for \(i = 0, 1, \ldots, K-1\)) contains files:
\[\{f_j : iB \le j < \min((i+1)B,\, N)\}\]
Each batch is merged into a partial dataset \(D_{\text{batch}_i}\). The final merge concatenates these partial results:
\[D_{\text{final}} = D_{\text{batch}_0} \oplus D_{\text{batch}_1} \oplus \cdots \oplus D_{\text{batch}_{K-1}}\]
where \(\oplus\) denotes the concatenation operation along the primary dimension of each group.
Benefits of Batch Merging:
Memory Efficiency: Only one batch needs to be in memory at a time, enabling processing of arbitrarily large datasets
Progress Tracking: Partial merges provide natural checkpoints, making it easier to resume if the process is interrupted
Disk Space Management: Source files can be deleted incrementally, reducing peak disk space requirements
Choosing Batch Size:
The optimal batch size balances memory usage against merge overhead:
Too Small (e.g., 10-50 files): Many partial merges increase overhead and total processing time
Too Large (e.g., 1000+ files): May exceed available memory, causing slow disk swapping
Recommended: 100-300 files per batch for typical CAD datasets with 1000-100,000 faces per file
Monitor memory usage during the first batch and adjust batch_size if you see excessive swapping or out-of-memory errors.
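For a quick sanity check before launching a long merge, the batch count implied by the formula above can be computed directly (the file count here is illustrative):
import math

n_files = 10_000   # total number of .data files (illustrative)
batch_size = 200   # files processed per batch

num_batches = math.ceil(n_files / batch_size)
print(num_batches)  # 50 partial merges before the final merge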
Special Processing: Matrix Flattening
Some encoded data includes face-to-face relationship matrices, such as distance distributions or interaction features. These matrices have dimensions \([\text{face} \times \text{face} \times \text{bins}]\), where the number of faces varies per file. The DatasetMerger applies special flattening logic to enable efficient concatenation.
Per-File Structure:
Each file contains a matrix representing relationships between all pairs of faces:
File 1: a3_distance [3 faces × 3 faces × 64 bins]
File 2: a3_distance [2 faces × 2 faces × 64 bins]
Direct concatenation is impossible because the first two dimensions differ between files.
Flattening Process:
For each file \(i\) with \(n_i\) faces and relationship matrix \(A_i \in \mathbb{R}^{n_i \times n_i \times B}\), where \(B\) is the number of bins, the flattening operation reshapes the matrix:
\[A_i \in \mathbb{R}^{n_i \times n_i \times B} \longrightarrow A_i^{\text{flat}} \in \mathbb{R}^{(n_i \cdot n_i) \times B}\]
This converts the 3D matrix into a 2D array where each row represents one face-pair relationship and columns represent bins.
Merged Structure:
After flattening all files, the merger concatenates along the face-pair dimension:
Merged: a3_distance [13 face-pairs × 64 bins]
└─ (3×3=9) + (2×2=4) = 13 pairs
The total number of rows is:
\[N_{\text{face-pairs}} = \sum_{i=1}^{N} n_i^2\]
Provenance Tracking:
A file_id_faceface array tracks which file each face-pair came from:
file_id_faceface = [0,0,0,0,0,0,0,0,0, 1,1,1,1]
└────File 1 (9)────┘ └File 2 (4)┘
This enables filtering face-pair relationships by source file during analysis.
Benefits of Flattening:
Concatenation: Flattened matrices have consistent dimensions, enabling straightforward concatenation
Storage Efficiency: Avoids sparse 3D storage where many file-pair combinations would be empty
Query Performance: 2D structure supports efficient slicing and filtering operations
The flattening process is automatic and transparent—you don’t need to modify encoded data. The merger detects face-face relationship arrays by naming convention and applies flattening automatically.
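A minimal NumPy sketch of the flattening and concatenation described above (illustration only; the merger performs this internally):
import numpy as np

BINS = 64

# Hypothetical per-file face-face matrices: [n_i x n_i x BINS]
matrices = [
    np.random.rand(3, 3, BINS).astype(np.float32),  # File 1: 3 faces -> 9 pairs
    np.random.rand(2, 2, BINS).astype(np.float32),  # File 2: 2 faces -> 4 pairs
]

# Flatten each matrix to [n_i * n_i, BINS], then concatenate along the pair axis
flattened = [m.reshape(-1, BINS) for m in matrices]
merged = np.concatenate(flattened, axis=0)

# Provenance: one file ID per face-pair row
file_id_faceface = np.concatenate([
    np.full(f.shape[0], i, dtype=np.int32) for i, f in enumerate(flattened)
])

print(merged.shape)      # (13, 64)
print(file_id_faceface)  # [0 0 0 0 0 0 0 0 0 1 1 1 1]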
Using the DatasetInfo Class
The DatasetInfo class complements the DatasetMerger by handling metadata extraction, routing, and storage. While DatasetMerger consolidates numerical array data into the .dataset file, DatasetInfo processes JSON metadata files and organizes them into structured Parquet files for efficient querying and analysis.
Initialization
Creating a DatasetInfo instance requires specifying the source metadata files, output locations for different metadata types, and an optional schema that defines routing rules:
from hoops_ai.storage.datasetstorage import DatasetInfo
ds_info = DatasetInfo(
info_files=["path/to/part_001.json", "path/to/part_002.json", ...],
merged_store_path="merged_dataset.infoset", # Output for file-level metadata
attribute_file_path="merged_dataset.attribset", # Output for categorical metadata
schema=schema_dict # Optional: schema for metadata routing
)
The initialization parameters control metadata sources and destinations:
Parameters:
info_files (List[str]): List of paths to JSON metadata files, typically one per encoded CAD file. These JSON files contain metadata captured during the encoding process, such as file sizes, processing times, labels, and custom attributes.
merged_store_path (str): Path where the .infoset Parquet file will be created. This file contains file-level metadata with one row per source file, enabling efficient queries by file characteristics.
attribute_file_path (str): Path where the .attribset Parquet file will be created. This file contains categorical metadata and label descriptions, organized in a normalized table structure.
schema (Dict): Optional schema dictionary defining metadata routing rules. When provided, the schema determines which metadata fields belong in the .infoset file versus the .attribset file. Without a schema, DatasetInfo uses type-based heuristics for routing.
The separation of metadata into two files serves different purposes: .infoset provides per-file information for filtering and stratification, while .attribset stores shared categorical definitions for labels and enumerations.
Metadata Processing
After initialization, the DatasetInfo processes metadata files through a series of steps that parse, route, and store the information in optimized formats.
Parse and Route Metadata
The core metadata processing workflow involves parsing JSON files, building provenance mappings, and storing results in Parquet format:
# Parse all JSON metadata files
ds_info.parse_info_files()
# Build file ID mappings for provenance
file_id_codes = ds_info.build_code_mappings(zip_files)
# Returns: {"part_001": 0, "part_002": 1, ...}
# Store metadata to Parquet files
ds_info.store_info_to_parquet(table_name="file_info")
Each method in this workflow serves a specific purpose:
parse_info_files() Process:
The parse_info_files() method orchestrates the complete metadata extraction and routing pipeline:
Load JSON files: Reads all metadata files specified during initialization
Route metadata: Classifies each field as file-level or categorical based on schema rules or type heuristics
Extract descriptions: Merges nested description structures (e.g., label definitions, enum values)
Validate: Checks metadata against schema requirements if a schema was provided
Aggregate: Combines metadata from all files into unified data structures
Store: Writes file-level metadata to .infoset and categorical metadata to .attribset
Cleanup: Optionally deletes processed JSON files to save disk space
build_code_mappings() Purpose:
The build_code_mappings() method creates the integer ID mapping that DatasetMerger uses for provenance tracking. It extracts file stems (filenames without extensions) from the zip_files list and assigns sequential integer IDs. These IDs appear in the file_id arrays within the merged dataset, enabling you to trace each data element back to its source file.
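The result is equivalent in shape to the sketch below, which assigns sequential IDs keyed by file stem (the exact ordering rule used internally is an assumption here):
import pathlib

# Hypothetical list of encoded .data files
zip_files = ["out/part_001.data", "out/part_002.data", "out/part_003.data"]

# Sequential integer IDs keyed by file stem (filename without extension)
file_id_codes = {pathlib.Path(f).stem: i for i, f in enumerate(zip_files)}
print(file_id_codes)  # {'part_001': 0, 'part_002': 1, 'part_003': 2}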
store_info_to_parquet() Execution:
The store_info_to_parquet() method writes the parsed and routed metadata to Parquet files. The table_name parameter identifies the table within the .infoset file, allowing multiple metadata tables to coexist if needed.
Schema-Driven Routing
When a schema is provided during initialization, DatasetInfo uses sophisticated routing rules to determine where each metadata field should be stored. This schema-driven approach provides precise control over metadata organization and enables validation of metadata structure.
Schema Definition
A complete schema defines metadata categories, field specifications, and routing patterns:
# Schema defines routing rules
schema = {
    "metadata": {
        "file_level": {
            "size_cadfile": {"dtype": "int64", "required": False},
            "processing_time": {"dtype": "float32", "required": False}
        },
        "categorical": {
            "file_label": {"dtype": "int32", "values": [0, 1, 2, 3, 4]}
        },
        "routing_rules": {
            "file_level_patterns": ["size_*", "duration_*", "processing_*"],
            "categorical_patterns": ["*_label", "category", "type"],
            "default_numeric": "file_level",
            "default_categorical": "categorical"
        }
    }
}
ds_info = DatasetInfo(..., schema=schema)
The schema structure includes:
file_level: Dictionary of metadata fields that should appear in the .infoset file. Each field specifies a data type and whether it’s required.
categorical: Dictionary of categorical metadata fields for the .attribset file. Fields can specify allowed values for validation.
routing_rules: Patterns and defaults that handle fields not explicitly defined in the file_level or categorical sections:
file_level_patterns: Wildcard patterns (using *) that match field names for file-level routing
categorical_patterns: Wildcard patterns for categorical routing
default_numeric: Where to route numeric fields not matched by other rules
default_categorical: Where to route string/boolean fields not matched by other rules
Routing Logic
DatasetInfo applies a hierarchical decision process to determine the destination for each metadata field:
Explicit Check: Is the field name explicitly defined in the file_level or categorical sections?
Pattern Match: Does the field name match any wildcard patterns in routing_rules?
Type-Based Default: Based on the field’s data type, apply the default routing (numeric → file_level, string/boolean → categorical)
Routing Examples:
Consider these metadata fields being processed:
# Field: "size_cadfile" with value 1024000
# 1. Explicit check: Found in file_level definitions ✓
# 2. Routing decision: → .infoset file
# Field: "file_label" with value 3
# 1. Explicit check: Found in categorical definitions ✓
# 2. Routing decision: → .attribset file
# Field: "custom_metric" with value 42.5 (not in schema)
# 1. Explicit check: Not found
# 2. Pattern match: No match to any routing patterns
# 3. Type-based default: Numeric value → apply default_numeric rule
# 4. Routing decision: → .infoset file
This hierarchical approach provides flexibility: you can explicitly control routing for critical fields while relying on patterns and defaults for auxiliary metadata.
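The decision process can be pictured as the following sketch. The route_field helper is hypothetical (not part of the DatasetInfo API) and reuses the routing rules from the schema example above:
import fnmatch

schema = {"metadata": {
    "file_level": {"size_cadfile": {"dtype": "int64"}},
    "categorical": {"file_label": {"dtype": "int32"}},
    "routing_rules": {
        "file_level_patterns": ["size_*", "duration_*", "processing_*"],
        "categorical_patterns": ["*_label", "category", "type"],
        "default_numeric": "file_level",
        "default_categorical": "categorical",
    },
}}

def route_field(name, value):
    """Hypothetical sketch: explicit check -> pattern match -> type-based default."""
    meta = schema["metadata"]
    if name in meta["file_level"]:
        return ".infoset"
    if name in meta["categorical"]:
        return ".attribset"
    rules = meta["routing_rules"]
    if any(fnmatch.fnmatch(name, p) for p in rules["file_level_patterns"]):
        return ".infoset"
    if any(fnmatch.fnmatch(name, p) for p in rules["categorical_patterns"]):
        return ".attribset"
    is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    default = rules["default_numeric"] if is_numeric else rules["default_categorical"]
    return ".infoset" if default == "file_level" else ".attribset"

print(route_field("size_cadfile", 1024000))  # .infoset (explicit)
print(route_field("file_label", 3))          # .attribset (explicit)
print(route_field("custom_metric", 42.5))    # .infoset (numeric default)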
Understanding Output Files
The DatasetMerger and DatasetInfo produce three primary output files that together comprise a complete, queryable dataset. Understanding the structure and purpose of each file enables effective use of the merged data in analysis and machine learning workflows.
.dataset Files
The .dataset file contains all numerical array data from the merged CAD files, organized into logical groups and compressed for efficient storage.
Format: Compressed Zarr (ZipStore) with .dataset extension
Contents: All numerical array data organized by groups (faces, edges, graph, etc.)
Structure:
my_flow.dataset (ZipStore)
├── faces/
│ ├── face_indices [N_faces]
│ ├── face_areas [N_faces]
│ ├── face_types [N_faces]
│ ├── face_uv_grids [N_faces × U × V × 7]
│ └── file_id [N_faces] ← Provenance tracking
├── edges/
│ ├── edge_indices [N_edges]
│ ├── edge_lengths [N_edges]
│ ├── edge_types [N_edges]
│ └── file_id [N_edges] ← Provenance tracking
├── graph/
│ ├── edges_source [N_graph_edges]
│ ├── edges_destination [N_graph_edges]
│ ├── num_nodes [N_files]
│ └── file_id [N_graph_edges] ← Provenance tracking
└── faceface/
├── a3_distance [N_face_pairs × bins]
├── d2_distance [N_face_pairs × bins]
├── extended_adjacency [N_face_pairs]
└── file_id [N_face_pairs] ← Provenance tracking
Each group contains arrays concatenated from all source files. The dimensions (N_faces, N_edges, etc.) represent the total count across all merged files. The file_id arrays enable tracing each element back to its source file.
Key Features:
Compression: ZipStore format provides transparent compression, reducing disk space while maintaining fast access
Chunking: Large arrays are divided into chunks for efficient partial loading
Lazy Loading: Data is loaded on-demand rather than reading the entire file into memory
Group Organization: Logical separation enables loading only relevant data for specific analyses
Access Pattern:
Use the DatasetExplorer or DatasetLoader classes to query and analyze data from .dataset files. These classes provide high-level interfaces for filtering, statistical analysis, and ML dataset preparation without requiring direct Zarr manipulation.
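For illustration only, and assuming the zarr v2 API, the sketch below shows that the .dataset file is a standard Zarr ZipStore underneath; the path and group names follow the example structure above:
import xarray as xr
import zarr

# Prefer DatasetExplorer/DatasetLoader in real workflows; this is a low-level peek
store = zarr.ZipStore("my_flow.dataset", mode="r")  # hypothetical path

# Lazily open one group; chunks are read on demand
faces = xr.open_zarr(store, group="faces")

# Example: areas of all faces contributed by source file 0
areas_file0 = faces["face_areas"].where(faces["file_id"] == 0, drop=True)
print(areas_file0.values)

store.close()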
.infoset Files
The .infoset file contains file-level metadata organized in a tabular structure with one row per source CAD file. This format enables efficient filtering and querying based on file characteristics.
Format: Parquet (columnar storage format)
Contents: File-level metadata (one row per file)
Example Schema:
┌────┬──────────────┬─────────────┬───────────────┬──────────────────┬────────┐
│ id │ name │ description │ size_cadfile │ processing_time │ subset │
├────┼──────────────┼─────────────┼───────────────┼──────────────────┼────────┤
│ 0 │ part_001 │ Bracket │ 1024000 │ 12.5 │ train │
│ 1 │ part_002 │ Housing │ 2048000 │ 18.3 │ train │
│ 2 │ part_003 │ Cover │ 512000 │ 8.1 │ test │
└────┴──────────────┴─────────────┴───────────────┴──────────────────┴────────┘
The id column corresponds to the file IDs used in file_id arrays within the .dataset file, enabling cross-referencing between metadata and array data.
Standard Columns:
id: Integer file ID (matches file_id in dataset arrays)
name: File stem (filename without extension)
description: Human-readable description of the file or part
size_cadfile: Original CAD file size in bytes
processing_time: Encoding duration in seconds
subset: Dataset split assignment (train, validation, test)
Additional Columns (schema-dependent):
flow_name: Name of the flow that processed this file
stream_cache_png: Path to PNG visualization
stream_cache_3d: Path to 3D model (SCS/STL format)
Custom fields: Any additional metadata defined in your schema
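Because .infoset is plain Parquet, it can also be inspected directly with pandas (the path and columns below follow the example schema above):
import pandas as pd

info = pd.read_parquet("my_flow.infoset")  # hypothetical path

# Select training files larger than 1 MB
train_files = info[(info["subset"] == "train") & (info["size_cadfile"] > 1_000_000)]

# The 'id' column matches the file_id arrays in the .dataset file,
# so these IDs can be used to filter array data by source file
print(train_files[["id", "name", "size_cadfile"]])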
.attribset Files
The .attribset file contains categorical metadata and label descriptions in a normalized table structure. This file provides human-readable names and descriptions for categorical values used throughout the dataset.
Format: Parquet (columnar storage format)
Contents: Categorical metadata and label descriptions
Example Schema:
┌──────────────┬────┬──────────────┬─────────────────────┐
│ table_name │ id │ name │ description │
├──────────────┼────┼──────────────┼─────────────────────┤
│ file_label │ 0 │ Simple │ Basic geometry │
│ file_label │ 1 │ Medium │ Moderate complexity │
│ file_label │ 2 │ Complex │ High complexity │
│ face_types │ 0 │ Plane │ Planar surface │
│ face_types │ 1 │ Cylinder │ Cylindrical surface │
│ face_types │ 2 │ Sphere │ Spherical surface │
│ edge_types │ 0 │ Line │ Linear edge │
│ edge_types │ 1 │ Arc │ Circular arc │
└──────────────┴────┴──────────────┴─────────────────────┘
The table uses a normalized structure where each row defines one categorical value. The table_name column groups related definitions together.
Column Definitions:
table_name: Name of the categorical attribute (e.g., “file_label”, “face_types”)
id: Integer code used in the dataset arrays
name: Short, human-readable name for this category
description: Detailed description of what this category represents
Purpose:
The .attribset file enables interpretation of integer codes in the dataset. For example, when you see face_types = 1 in the dataset arrays, you can look up the .attribset file to find that code 1 means “Cylinder” with description “Cylindrical surface.”
Usage:
DatasetExplorer automatically loads .attribset data and uses it to provide human-readable labels in visualizations and summaries. When creating distribution plots or categorical analyses, the descriptions from this file appear in legends and axis labels.
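Since .attribset is also plain Parquet, a code-to-name lookup can be built directly with pandas when working outside DatasetExplorer (path and table names follow the example above):
import pandas as pd

attrs = pd.read_parquet("my_flow.attribset")  # hypothetical path

# Build an id -> name lookup for one categorical table
face_type_names = (
    attrs[attrs["table_name"] == "face_types"]
    .set_index("id")["name"]
    .to_dict()
)

print(face_type_names.get(1))  # "Cylinder" in the example table above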
Integration with DatasetExplorer and DatasetLoader
The merged outputs (.dataset, .infoset, .attribset) serve as input for downstream analysis and machine learning workflows. The DatasetExplorer class provides tools for querying and analyzing merged datasets, while DatasetLoader prepares data for ML training with stratified splitting and framework integration.
Quick Start
The following example demonstrates basic usage of both DatasetExplorer and DatasetLoader with merged datasets:
from hoops_ai.dataset import DatasetExplorer, DatasetLoader
# Analysis with DatasetExplorer
explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow")
explorer.print_table_of_contents()
# Query and analyze
face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20)
file_codes = explorer.get_file_list(
group="faces",
where=lambda ds: ds['complexity_level'] >= 4
)
explorer.close()
# ML Preparation with DatasetLoader
loader = DatasetLoader(
merged_store_path="cad_pipeline.dataset",
parquet_file_path="cad_pipeline.infoset"
)
# Stratified split
loader.split(key="complexity_level", train=0.7, validation=0.15, test=0.15)
# Get datasets
train_dataset = loader.get_dataset("train")
val_dataset = loader.get_dataset("validation")
# PyTorch integration
train_torch = train_dataset.to_torch()
This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader.
For Comprehensive Documentation:
For detailed information on using these tools, see:
Dataset Exploration and Mining - Complete guide covering query operations, distribution analysis, metadata filtering, stratified splitting for ML training, PyTorch integration, and complete workflow examples with performance optimization techniques
The exploration and loading documentation provides in-depth coverage of all available operations, including advanced filtering, custom item loaders, and framework-specific adapters.
Custom Merging Tasks
While the Flow module’s automatic merging (via auto_dataset_export=True) handles most use cases, you may need custom merging behavior for specialized requirements. Creating a custom merging task using the @flowtask.custom decorator provides complete control over merge parameters and processing logic.
Advanced: Custom Merging Task
The following example demonstrates a custom merging task that specifies non-standard chunk sizes, batch configurations, and cleanup behavior:
from hoops_ai.flowmanager import flowtask
from typing import List
@flowtask.custom(
    name="custom_dataset_merge",
    inputs=["cad_files_encoded"],
    outputs=["merged_dataset_path"],
    parallel_execution=False
)
def custom_merge_task(encoded_files: List[str]) -> str:
    """Custom merging logic with specific parameters"""
    from hoops_ai.storage.datasetstorage import DatasetMerger, DatasetInfo
    import pathlib

    # Find all .data and .json files
    data_files = [f"{f}.data" for f in encoded_files]
    json_files = [f"{f}.json" for f in encoded_files]
    output_dir = pathlib.Path("./custom_merged")
    output_dir.mkdir(exist_ok=True)

    # Process metadata
    ds_info = DatasetInfo(
        info_files=json_files,
        merged_store_path=str(output_dir / "custom.infoset"),
        attribute_file_path=str(output_dir / "custom.attribset"),
        schema=cad_schema
    )
    ds_info.parse_info_files()
    file_id_codes = ds_info.build_code_mappings(data_files)
    ds_info.store_info_to_parquet()

    # Merge data arrays with custom configuration
    merger = DatasetMerger(
        zip_files=data_files,
        merged_store_path=str(output_dir / "custom.dataset"),
        file_id_codes=file_id_codes,
        dask_client_params={'n_workers': 16, 'threads_per_worker': 2},
        delete_source_files=False
    )
    merger.set_schema(cad_schema)
    merger.merge_data(
        face_chunk=1_000_000,  # Custom chunk sizes
        edge_chunk=1_000_000,
        batch_size=500         # Process in larger batches
    )

    return str(output_dir / "custom.dataset")

# Use in flow with custom merging
import hoops_ai
custom_flow = hoops_ai.create_flow(
    name="custom_pipeline",
    tasks=[gather_cad_files, encode_cad_geometry, custom_merge_task],
    auto_dataset_export=False  # Disable automatic merging
)
This approach is useful when you need to:
Use non-standard chunk sizes optimized for your specific data characteristics
Preserve source files (delete_source_files=False) for archival or debugging
Apply custom Dask configurations for specific hardware environments
Integrate additional processing steps between encoding and merging
The custom task has access to the full DatasetMerger and DatasetInfo APIs, enabling any merging configuration supported by these classes.
Performance Considerations
Optimizing merge performance requires understanding memory constraints, parallelization strategies, and compression tradeoffs. This section provides guidelines for tuning DatasetMerger for different dataset sizes and hardware configurations.
Memory Management
Large datasets can exhaust available memory during merging. Batch merging provides a solution by processing files in manageable groups and progressively building the final dataset.
Batch Merging for Large Datasets
For datasets with thousands of files or very large individual files, batch merging prevents out-of-memory errors:
# For 10,000+ files, use batch merging
merger.merge_data(
batch_size=500, # Process 500 files at a time
face_chunk=500_000,
edge_chunk=500_000
)
Batch merging offers several benefits:
Benefits:
Prevents out-of-memory errors
Enables progress tracking
Allows incremental cleanup of source files
The optimal batch_size depends on your system’s available memory and the size of individual files. Monitor memory usage during the first batch and adjust accordingly.
Parallel Processing
DatasetMerger uses Dask for parallel processing, distributing merge operations across multiple workers. Proper Dask configuration significantly impacts merge performance.
Dask Configuration
The dask_client_params parameter controls parallel execution behavior:
dask_client_params = {
    'n_workers': 8,               # Number of parallel workers
    'threads_per_worker': 4,      # Threads per worker
    'processes': True,            # Use separate processes (not threads)
    'memory_limit': '8GB',        # Per-worker memory limit
    'dashboard_address': ':8787'  # Monitoring dashboard URL
}
merger = DatasetMerger(
    zip_files=files,
    merged_store_path="output.dataset",
    dask_client_params=dask_client_params
)
Each parameter affects performance differently:
n_workers: Total number of parallel processes. Set based on CPU cores available. More workers enable greater parallelism but increase memory requirements.
threads_per_worker: Number of threads within each worker process. Increase for CPU-intensive operations, decrease for I/O-intensive tasks.
processes: When True, workers run as separate processes with isolated memory. This is recommended for memory-intensive operations to prevent interference.
memory_limit: Maximum memory per worker. Set conservatively to prevent system-wide memory exhaustion. If workers exceed this limit, Dask spills data to disk.
dashboard_address: Enables the Dask dashboard for real-time monitoring of task execution, memory usage, and worker status.
Tuning Guidelines
Different workload characteristics require different Dask configurations:
Many Small Files (I/O bound):
More workers, fewer threads per worker
Example: n_workers=16, threads_per_worker=2
Rationale: File I/O is the bottleneck; maximize concurrent file reads
Few Large Files (CPU bound):
Fewer workers, more threads per worker
Example: n_workers=4, threads_per_worker=8
Rationale: Computation dominates; leverage multi-threading for array operations
Memory Constrained Systems:
Reduce memory_limit per worker
Decrease batch_size to process fewer files at once
Example: n_workers=4, memory_limit='4GB', batch_size=100
Rationale: Prevent memory swapping, which severely degrades performance
Optimal Configuration Discovery:
Start with conservative settings (few workers, moderate memory limits) and monitor the Dask dashboard during a test merge. Increase parallelism if CPU utilization is low, or decrease it if you observe memory pressure or disk swapping.
Compression
The DatasetMerger uses Zarr’s compression capabilities to reduce disk space requirements for merged datasets. Understanding the compression settings helps predict storage needs and access performance.
Zarr Compression Settings
DatasetMerger applies the following compression configuration automatically:
Codec: Zstd level 12 (high compression ratio with reasonable decompression speed)
Chunks: Automatic sizing based on array dimensions (typically ~1M elements per chunk)
Filters: Delta encoding for integer arrays (stores differences rather than absolute values, improving compression)
These settings balance compression ratio against decompression speed for typical CAD data patterns.
Typical Compression Ratios
Compression effectiveness varies by data type:
- Raw array data (face areas, edge lengths): 10-20x compression
Floating-point arrays with spatial locality compress very well
- Face-face histograms (distance distributions): 5-10x compression
Histogram bins often contain many zeros, enabling good compression
- Graph structures (edge indices, adjacency): 3-5x compression
Integer sequences with predictable patterns compress moderately well
These ratios mean a dataset that would occupy 100 GB uncompressed typically requires 5-10 GB as a compressed .dataset file.
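For readers who want to reproduce comparable settings outside the merger, the sketch below uses the zarr v2 / numcodecs API with the codec and filter named above; it is an illustration of equivalent settings, not the merger's internal code:
import numpy as np
import zarr
from numcodecs import Zstd, Delta

store = zarr.ZipStore("example.dataset", mode="w")  # hypothetical output path
root = zarr.group(store)
faces = root.create_group("faces")

face_indices = np.arange(1_000_000, dtype=np.int32)
faces.create_dataset(
    "face_indices",
    data=face_indices,
    chunks=(1_000_000,),          # ~1M elements per chunk
    compressor=Zstd(level=12),    # high-ratio codec
    filters=[Delta(dtype="i4")],  # delta encoding for integer arrays
)

store.close()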
Summary
The DatasetMerger serves as the critical bridge between individual encoded CAD files and ML-ready unified datasets, providing the consolidation layer that enables efficient analysis and training.
Key Capabilities
Automatic Integration: Seamlessly invoked by the Flow module with auto_dataset_export=True, requiring no manual setup for standard workflows
Schema-Driven Organization: Uses SchemaBuilder definitions for predictable, validated merging with explicit control over data structure
Parallel Processing: Leverages Dask’s distributed computation engine for efficient large-scale data consolidation across multiple CPU cores
Provenance Tracking: Maintains file_id arrays in every group, enabling traceability from merged data back to source files
Multiple Output Formats: Produces three complementary files serving different purposes:
.dataset: Numerical array data in compressed Zarr format
.infoset: File-level metadata in queryable Parquet format
.attribset: Categorical metadata and label descriptions in Parquet format
Downstream Integration
The merged output files serve as inputs for analysis and machine learning tools:
DatasetExplorer: Query, filter, and analyze merged datasets with support for statistical analysis, distribution creation, and metadata-based filtering
DatasetLoader: Prepare stratified train/validation/test splits for ML training with support for PyTorch, TensorFlow, and custom frameworks
Complete System Architecture
The following diagram illustrates how DatasetMerger fits within the complete HOOPS AI data pipeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HOOPS AI Data Pipeline │
└─────────────────────────────────────────────────────────────────────────────┘
STEP 1: SCHEMA DEFINITION (Optional but Recommended)
┌──────────────────────────────────────────────────────────────┐
│ SchemaBuilder │
│ • Define groups (faces, edges, graph) │
│ • Define arrays and dimensions │
│ • Set metadata routing rules │
│ • Build schema dictionary │
└────────────────────────┬─────────────────────────────────────┘
│ schema.json
↓
STEP 2: ENCODING (Per-File Parallel Processing)
┌──────────────────────────────────────────────────────────────┐
│ Flow → CADEncodingTask (via ParallelTask) │
│ │
│ CAD File 1 → Encoder → DataStorage → part_001.data ────┐ │
│ CAD File 2 → Encoder → DataStorage → part_002.data ────┤ │
│ ... │ │
│ CAD File N → Encoder → DataStorage → part_N.data ──────┤ │
│ │ │
│ Each .data file (Zarr format) contains: │ │
│ • faces/face_areas, face_types, face_uv_grids │ │
│ • edges/edge_lengths, edge_types │ │
│ • graph/edges_source, edges_destination │ │
│ │ │
│ Metadata files (JSON): │ │
│ part_001.json, part_002.json, ..., part_N.json │ │
└──────────────────────────────────────────────────────────┼───┘
│
┌─────────────────────────────────┘
│ All .data and .json files
↓
STEP 3: AUTOMATIC MERGING (AutoDatasetExportTask)
┌──────────────────────────────────────────────────────────────┐
│ DatasetInfo │
│ • Load all .json metadata files │
│ • Route metadata using schema rules │
│ • Create file_id mappings │
│ • Output: {flow_name}.infoset (file-level metadata) │
│ {flow_name}.attribset (categorical metadata) │
└────────────────────────┬─────────────────────────────────────┘
│
┌────────────────────────┴─────────────────────────────────────┐
│ DatasetMerger │
│ • Discover groups from schemas or heuristics │
│ • Load all .data files as xarray Datasets │
│ • Concatenate arrays by group (faces, edges, graph) │
│ • Add file_id provenance tracking │
│ • Handle special processing (matrix flattening) │
│ • Output: {flow_name}.dataset (compressed Zarr) │
└────────────────────────┬─────────────────────────────────────┘
│
│ Three output files:
│ • {flow_name}.dataset
│ • {flow_name}.infoset
│ • {flow_name}.attribset
↓
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT FILES │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.dataset (Zarr/ZipStore) │ │
│ │ ───────────────────────────────────── │ │
│ │ • Numerical array data organized by groups │ │
│ │ • faces/, edges/, graph/, faceface/ │ │
│ │ • Each group has file_id for provenance │ │
│ │ • Compressed for efficient storage │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.infoset (Parquet) │ │
│ │ ───────────────────────────────── │ │
│ │ • File-level metadata (one row per file) │ │
│ │ • Columns: id, name, description, │ │
│ │ size_cadfile, processing_time, etc. │ │
│ │ • Queryable with pandas │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.attribset (Parquet) │ │
│ │ ───────────────────────────────────── │ │
│ │ • Categorical metadata & descriptions │ │
│ │ • Label mappings (id → name → description) │ │
│ │ • Face types, edge types, etc. │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────┬───────────────────────────────────────┘
│
│ Consumed by analysis & ML tools
↓
STEP 4: ANALYSIS & ML TRAINING
┌──────────────────────────────────────────────────────────────┐
│ DatasetExplorer │
│ • Query merged datasets (get_array_data) │
│ • Statistical analysis (compute_statistics) │
│ • Distribution creation (create_distribution) │
│ • Metadata filtering (filter_files_by_metadata) │
│ • Visualization support │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ DatasetLoader │
│ • Load merged datasets for ML training │
│ • Stratified train/val/test splitting │
│ • PyTorch/TensorFlow adapter support │
│ • Custom item loader functions │
│ • Batch loading for training loops │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ ML Training Pipeline │
│ • Load train/val/test datasets │
│ • Create DataLoaders with batching │
│ • Train neural networks (GNNs, CNNs, etc.) │
│ • Model evaluation and inference │
└──────────────────────────────────────────────────────────────┘
Key Integration Points
The DatasetMerger participates in several critical integration points within the HOOPS AI ecosystem:
SchemaBuilder → DataStorage: Schema defines how individual files are organized during encoding
DataStorage → DatasetMerger: Individual .data files are merged using the schema structure to determine grouping and concatenation logic
SchemaBuilder → DatasetInfo: Schema routing rules determine which metadata fields go to .infoset versus .attribset
DatasetMerger Output → DatasetExplorer: The merged .dataset file enables efficient exploration, querying, and statistical analysis
DatasetMerger Output → DatasetLoader: The merged .dataset and .infoset files provide the foundation for ML training data preparation
This integrated system enables seamless progression from raw CAD files to production ML models with minimal manual intervention. The schema-driven approach ensures consistency across all pipeline stages, while automatic merging eliminates the need for custom data consolidation code in most workflows.