Data Merging in HOOPS AI
Overview
The DatasetMerger is a critical component in HOOPS AI’s data pipeline that consolidates multiple individual encoded CAD files (.data files) into a single, unified, compressed dataset (.dataset file). It performs parallel array concatenation, metadata aggregation, and schema-driven organization to create ML-ready datasets.
Key Purpose: The DatasetMerger transforms a collection of per-file encoded data into a unified, queryable, analysis-ready dataset suitable for machine learning training and exploration.
The system follows an automated pipeline pattern:
Individual .data files → DatasetMerger → Unified .dataset → DatasetExplorer/DatasetLoader → ML Pipeline
Important
This document focuses on the merging process (consolidation of individual files). For information on using the merged datasets (analysis, querying, ML preparation), see:
Dataset Exploration and Mining - Comprehensive guide for dataset exploration and ML training preparation
Architecture & Role
Position in HOOPS AI Pipeline
The DatasetMerger operates in the ETL (Extract-Transform-Load) phase of the HOOPS AI pipeline:
┌──────────────────────────────────────────────────────────────────────┐
│ HOOPS AI Data Pipeline │
└──────────────────────────────────────────────────────────────────────┘
1. ENCODING PHASE (Per-File)
┌─────────────────────────────────────────────────────┐
│ CAD File → Encoder → DataStorage → .data file │
│ (Repeated for each file in parallel) │
└─────────────────────────────────────────────────────┘
↓
2. MERGING PHASE (DatasetMerger) ← YOU ARE HERE
┌─────────────────────────────────────────────────────┐
│ Multiple .data files → DatasetMerger → .dataset │
│ Multiple .json files → DatasetInfo → .infoset │
│ → .attribset │
└─────────────────────────────────────────────────────┘
↓
3. ANALYSIS/ML PHASE
┌─────────────────────────────────────────────────────┐
│ .dataset → DatasetExplorer → Query/Filter/Analyze │
│ .dataset → DatasetLoader → Train/Val/Test Splits │
│ → ML Model Training │
└─────────────────────────────────────────────────────┘
Why Merging is Essential:
Unified Access: ML models need a single dataset, not thousands of individual files
Efficient Queries: Merged data enables fast filtering and statistical analysis
Memory Efficiency: Dask-based chunked operations handle datasets larger than RAM
Metadata Correlation: File-level and categorical metadata linked to data arrays
Core Concepts
Data Organization
Individual encoded files have a group-based structure:
part_001.data (ZipStore Zarr)
├── faces/
│ ├── face_indices (array)
│ ├── face_areas (array)
│ ├── face_types (array)
│ └── face_uv_grids (array)
├── edges/
│ ├── edge_indices (array)
│ ├── edge_lengths (array)
│ └── edge_types (array)
├── graph/
│ ├── edges_source (array)
│ ├── edges_destination (array)
│ └── num_nodes (array)
└── metadata.json
After merging, the dataset consolidates all files:
my_flow.dataset (ZipStore Zarr)
├── faces/
│ ├── face_indices (concatenated from all files)
│ ├── face_areas (concatenated from all files)
│ └── file_id (provenance tracking: which face came from which file)
├── edges/
│ └── ... (similarly concatenated)
└── graph/
└── ... (similarly concatenated)
Group-Based Merging
The DatasetMerger organizes arrays into logical groups for concatenation:
Mathematical Representation:
For a group \(G\) with primary dimension \(d\) (e.g., “face”), given \(N\) files with arrays:
\[A_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}], \quad i = 1, \ldots, N\]
where \(a_{i,j}\) is the \(j\)-th element of array \(A_i\) from file \(i\), and \(n_i\) is the count of dimension \(d\) in file \(i\).
The merged array is:
\[A_{\text{merged}} = A_1 \oplus A_2 \oplus \cdots \oplus A_N\]
where \(\oplus\) denotes concatenation along dimension \(d\):
\[A_{\text{merged}} = [a_{1,1}, \ldots, a_{1,n_1}, a_{2,1}, \ldots, a_{2,n_2}, \ldots, a_{N,1}, \ldots, a_{N,n_N}]\]
Example:
File 1: face_areas = [1.5, 2.3, 4.1] (3 faces)
File 2: face_areas = [3.7, 5.2] (2 faces)
File 3: face_areas = [2.1, 1.8, 6.3] (3 faces)
Merged: face_areas = [1.5, 2.3, 4.1, 3.7, 5.2, 2.1, 1.8, 6.3] (8 faces total)
Provenance Tracking:
A file_id array is added to track origin:
file_id = [0, 0, 0, 1, 1, 2, 2, 2]
└─File 1─┘ └File 2┘ └─File 3─┘
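The same pattern can be reproduced outside HOOPS AI. The following minimal NumPy sketch (not the merger's internal code) shows concatenation along the face dimension plus construction of the file_id provenance array for the three hypothetical files above:
import numpy as np

# Hypothetical per-file face_areas arrays (one entry per source file)
per_file_face_areas = [
    np.array([1.5, 2.3, 4.1], dtype=np.float32),  # File 1: 3 faces
    np.array([3.7, 5.2], dtype=np.float32),       # File 2: 2 faces
    np.array([2.1, 1.8, 6.3], dtype=np.float32),  # File 3: 3 faces
]

# Concatenate along the "face" dimension
merged_face_areas = np.concatenate(per_file_face_areas)

# Provenance: repeat each file's integer ID once per face it contributed
file_id = np.concatenate([
    np.full(len(arr), i, dtype=np.int32)
    for i, arr in enumerate(per_file_face_areas)
])

print(merged_face_areas)  # [1.5 2.3 4.1 3.7 5.2 2.1 1.8 6.3]
print(file_id)            # [0 0 0 1 1 2 2 2]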
Schema-Driven vs Heuristic Discovery
The DatasetMerger supports two modes for discovering data structure:
1. Schema-Driven Discovery (Preferred)
When a schema is available (from SchemaBuilder.set_schema()), the merger uses explicit group definitions:
schema = {
    "groups": {
        "faces": {
            "primary_dimension": "face",
            "arrays": {
                "face_indices": {"dims": ["face"], "dtype": "int32"},
                "face_areas": {"dims": ["face"], "dtype": "float32"}
            }
        },
        "edges": {
            "primary_dimension": "edge",
            "arrays": {
                "edge_lengths": {"dims": ["edge"], "dtype": "float32"}
            }
        }
    }
}
merger.set_schema(schema) # Use explicit structure
Benefits:
Predictable: Structure is explicit and documented
Validated: Arrays are checked against schema dimensions
Extensible: New groups/arrays follow defined patterns
2. Heuristic Discovery (Fallback)
Without a schema, the merger scans files and uses naming patterns to infer structure:
# Heuristic rules:
# - Arrays with "face" in name → "faces" group
# - Arrays with "edge" in name → "edges" group
# - Arrays with "graph" in name → "graph" group
# - Arrays in "graph/edges/" subgroup → flattened to "graph/edges_source"
discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5)
Pattern Matching:
if "face" in array_name.lower():
group = "faces"
elif "edge" in array_name.lower():
group = "edges"
elif "duration" in array_name.lower() or "size" in array_name.lower():
group = "file" # File-level metadata
Using the DatasetMerger Class
The DatasetMerger class provides a high-level interface for consolidating individual encoded CAD files into unified datasets. This section demonstrates how to initialize the merger, discover data structures, and execute the merging process with various configurations.
Initialization
Creating a DatasetMerger instance requires specifying the source files to merge, the output location, and configuration parameters for parallel processing. The merger uses Dask for distributed computation, enabling efficient processing of large datasets that may not fit in memory.
from hoops_ai.storage.datasetstorage import DatasetMerger
# Create merger instance with configuration
merger = DatasetMerger(
zip_files=["path/to/part_001.data", "path/to/part_002.data", ...],
merged_store_path="merged_dataset.dataset",
file_id_codes={"part_001": 0, "part_002": 1, ...}, # From DatasetInfo
dask_client_params={'n_workers': 4, 'threads_per_worker': 4},
delete_source_files=True # Clean up after merging
)
The initialization parameters control both the data sources and the merge behavior:
Parameters:
zip_files (List[str]): List of paths to individual .data files to merge. These are typically the output from parallel CAD encoding tasks.
merged_store_path (str): Path where the merged .dataset file will be created. This becomes the single unified dataset used by DatasetExplorer and DatasetLoader.
file_id_codes (Dict[str, int]): Mapping from file stem (filename without extension) to integer ID. This dictionary provides provenance tracking, allowing you to identify which source file contributed each piece of data in the merged dataset. Typically provided by DatasetInfo.
dask_client_params (Dict): Configuration for Dask’s parallel processing engine. Controls the number of worker processes (n_workers) and threads per worker (threads_per_worker). Adjust based on available CPU cores and memory.
delete_source_files (bool): If True, automatically deletes individual .data files after successful merging. This saves disk space but should only be enabled if the source files are no longer needed.
The file_id_codes dictionary is crucial for maintaining data provenance. During merging, a file_id array is automatically added to each group, tracking which source file each element came from. This enables filtering and analysis by source file using DatasetExplorer.
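As a small illustration of how this provenance works (the mapping values here are hypothetical), the file_id_codes dictionary can be inverted to resolve any file_id value found in the merged arrays back to its source file stem:
# Hypothetical mapping, as supplied by DatasetInfo.build_code_mappings()
file_id_codes = {"part_001": 0, "part_002": 1, "part_003": 2}

# Invert the mapping: integer ID -> file stem
id_to_name = {code: stem for stem, code in file_id_codes.items()}

# Example: a face in the merged dataset carries file_id == 1
print(id_to_name[1])  # "part_002"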
Group Discovery
Before merging data, the DatasetMerger must understand the structure of the encoded files. This discovery process can be automatic (scanning files and inferring structure) or explicit (using a predefined schema).
Automatic Discovery
The merger can automatically discover data groups and arrays by scanning a sample of input files and analyzing their structure:
# Discover groups by scanning input files
discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5)
# Print the discovered structure for verification
merger.print_discovered_structure()
Example output:
Discovered Data Structure:
==================================================
Group: faces
Arrays: ['face_indices', 'face_areas', 'face_types', 'face_uv_grids']
Group: edges
Arrays: ['edge_indices', 'edge_lengths', 'edge_types']
Group: graph
Arrays: ['edges_source', 'edges_destination', 'num_nodes']
The max_files_to_scan parameter controls how many files are examined during discovery. Scanning more files ensures a complete picture of the data structure, but increases discovery time. For large datasets with consistent structure, scanning 5-10 files is typically sufficient.
Automatic discovery uses heuristic rules to group arrays (e.g., arrays with “face” in the name go to the “faces” group). While convenient, this approach may misclassify arrays with unconventional naming or fail to capture semantic relationships between arrays.
Manual Schema Setting
For production pipelines, explicitly setting a schema is recommended. This provides precise control over data organization and validates that all encoded files conform to the expected structure:
from hoops_ai.storage.datasetstorage import SchemaBuilder
# Build schema definition
builder = SchemaBuilder(domain="CAD_analysis", version="1.0")
# Define faces group
faces_group = builder.create_group("faces", "face", "Face-level geometric data")
faces_group.create_array("face_areas", ["face"], "float32", "Surface area per face")
faces_group.create_array("face_types", ["face"], "int32", "Face type codes")
# Define edges group
edges_group = builder.create_group("edges", "edge", "Edge-level data")
edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths")
# Build and apply schema
schema = builder.build()
merger.set_schema(schema)
Using a schema provides several benefits:
Validation: The merger verifies that all arrays in the source files match the schema definitions, catching encoding errors early
Documentation: Schema serves as explicit documentation of the dataset structure
Consistency: Guarantees that all source files have the same structure, preventing merge failures
Type Safety: Ensures data types match expectations, avoiding silent type coercion issues
The schema-based approach is particularly important when merging files from different encoding runs or when integrating data from multiple sources.
Merging Execution
Once the DatasetMerger is initialized and the data structure is discovered or defined, you can execute the merge operation. The merger supports two execution modes: single-pass merging for smaller datasets and batch merging for large datasets that exceed available memory.
Single-Pass Merge
For datasets where all source files can be loaded into memory simultaneously, single-pass merging provides the fastest execution:
merger.merge_data(
face_chunk=500_000, # Chunk size for face arrays
edge_chunk=500_000, # Chunk size for edge arrays
faceface_flat_chunk=100_000, # Chunk size for face-face relationships
batch_size=None, # None = merge all files at once
consolidate_metadata=True, # Consolidate Zarr metadata after merge
force_compression=True # Output as compressed .dataset file
)
The chunk size parameters control how Dask partitions arrays during processing. Larger chunks reduce overhead but increase memory usage. The optimal chunk size depends on your dataset characteristics:
face_chunk/edge_chunk: Set based on the typical number of faces/edges per file. For CAD models with 1000-10000 faces, 500,000 is a reasonable default.
faceface_flat_chunk: For face adjacency matrices, use smaller chunks (100,000) to manage the quadratic memory growth.
Process Overview:
The single-pass merge follows these steps:
Load: All .data files are loaded into xarray Datasets, organized by group (faces, edges, graph, etc.)
Concatenate: Arrays within each group are concatenated along their primary dimension (face, edge, etc.)
Provenance: A file_id array is added to each group, tracking which source file contributed each element
Write: The merged data is written to the .dataset file with optional compression
This approach is memory-intensive but fast, making it ideal for datasets with hundreds or thousands of files where the total data size fits comfortably in available RAM.
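The sketch below illustrates steps 2 and 3 with plain xarray (an assumption about the internal representation, not the merger's actual code): each per-file group is treated as an xarray Dataset, concatenated along its primary dimension, and tagged with a file_id variable.
import numpy as np
import xarray as xr

# Two hypothetical per-file "faces" groups as xarray Datasets
ds1 = xr.Dataset({"face_areas": ("face", np.array([1.5, 2.3, 4.1], dtype=np.float32))})
ds2 = xr.Dataset({"face_areas": ("face", np.array([3.7, 5.2], dtype=np.float32))})

# Step 2 (Concatenate): join along the group's primary dimension
merged = xr.concat([ds1, ds2], dim="face")

# Step 3 (Provenance): attach a file_id variable over the same dimension
merged["file_id"] = ("face", np.concatenate([
    np.full(ds.sizes["face"], i, dtype=np.int32)
    for i, ds in enumerate([ds1, ds2])
]))

print(merged.sizes["face"])      # 5
print(merged["file_id"].values)  # [0 0 0 1 1]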
Batch Merge
For large datasets that exceed available memory, batch merging processes files in smaller groups and progressively builds the final dataset:
merger.merge_data(
face_chunk=500_000,
edge_chunk=500_000,
faceface_flat_chunk=100_000,
batch_size=200, # Process 200 files at a time
consolidate_metadata=True,
force_compression=True
)
Setting batch_size to a positive integer activates batch mode. The merger processes files in batches, creating partial .dataset files, then merges these partial datasets into the final result.
Process Overview:
Batch merging follows a two-stage approach:
Partial Merges: The source files are divided into batches of batch_size files. Each batch is merged into a partial .dataset file. If delete_source_files=True, source files are deleted after each batch is successfully merged, progressively freeing disk space.
Final Merge: All partial .dataset files are merged into the final unified .dataset file. The partial datasets are then deleted (if delete_source_files=True).
Mathematical Formulation:
For \(N\) total files with batch size \(B\), the number of batches is:
\[K = \left\lceil \frac{N}{B} \right\rceil\]
Batch \(i\) (for \(i = 0, 1, \ldots, K-1\)) contains files:
\[\{f_j : iB \le j < \min((i+1)B,\, N)\}\]
Each batch is merged into a partial dataset \(D_{\text{batch}_i}\). The final merge concatenates these partial results:
\[D_{\text{final}} = D_{\text{batch}_0} \oplus D_{\text{batch}_1} \oplus \cdots \oplus D_{\text{batch}_{K-1}}\]
where \(\oplus\) denotes the concatenation operation along the primary dimension of each group.
Benefits of Batch Merging:
Memory Efficiency: Only one batch needs to be in memory at a time, enabling processing of arbitrarily large datasets
Progress Tracking: Partial merges provide natural checkpoints, making it easier to resume if the process is interrupted
Disk Space Management: Source files can be deleted incrementally, reducing peak disk space requirements
Choosing Batch Size:
The optimal batch size balances memory usage against merge overhead:
Too Small (e.g., 10-50 files): Many partial merges increase overhead and total processing time
Too Large (e.g., 1000+ files): May exceed available memory, causing slow disk swapping
Recommended: 100-300 files per batch for typical CAD datasets with 1000-100,000 faces per file
Monitor memory usage during the first batch and adjust batch_size if you see excessive swapping or out-of-memory errors.
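For a quick sanity check before launching a long merge, the batch count implied by the formula above can be computed directly (the file count here is illustrative):
import math

n_files = 10_000   # total number of .data files (illustrative)
batch_size = 200   # files processed per batch

num_batches = math.ceil(n_files / batch_size)
print(num_batches)  # 50 partial merges before the final merge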
Special Processing: Matrix Flattening
Some encoded data includes face-to-face relationship matrices, such as distance distributions or interaction features. These matrices have dimensions \([\text{face} \times \text{face} \times \text{bins}]\), where the number of faces varies per file. The DatasetMerger applies special flattening logic to enable efficient concatenation.
Per-File Structure:
Each file contains a matrix representing relationships between all pairs of faces:
File 1: a3_distance [3 faces × 3 faces × 64 bins]
File 2: a3_distance [2 faces × 2 faces × 64 bins]
Direct concatenation is impossible because the first two dimensions differ between files.
Flattening Process:
For each file \(i\) with \(n_i\) faces and relationship matrix \(A_i \in \mathbb{R}^{n_i \times n_i \times B}\), where \(B\) is the number of bins, the flattening operation reshapes the matrix:
\[A_i \in \mathbb{R}^{n_i \times n_i \times B} \longrightarrow A_i^{\text{flat}} \in \mathbb{R}^{(n_i \cdot n_i) \times B}\]
This converts the 3D matrix into a 2D array where each row represents one face-pair relationship and columns represent bins.
Merged Structure:
After flattening all files, the merger concatenates along the face-pair dimension:
Merged: a3_distance [13 face-pairs × 64 bins]
└─ (3×3=9) + (2×2=4) = 13 pairs
The total number of rows is:
\[N_{\text{face-pairs}} = \sum_{i=1}^{N} n_i^2\]
Provenance Tracking:
A file_id_faceface array tracks which file each face-pair came from:
file_id_faceface = [0,0,0,0,0,0,0,0,0, 1,1,1,1]
└────File 1 (9)────┘ └File 2 (4)┘
This enables filtering face-pair relationships by source file during analysis.
Benefits of Flattening:
Concatenation: Flattened matrices have consistent dimensions, enabling straightforward concatenation
Storage Efficiency: Avoids sparse 3D storage where many file-pair combinations would be empty
Query Performance: 2D structure supports efficient slicing and filtering operations
The flattening process is automatic and transparent—you don’t need to modify encoded data. The merger detects face-face relationship arrays by naming convention and applies flattening automatically.
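A minimal NumPy sketch of the flattening and concatenation described above (illustration only; the merger performs this internally):
import numpy as np

BINS = 64

# Hypothetical per-file face-face matrices: [n_i x n_i x BINS]
matrices = [
    np.random.rand(3, 3, BINS).astype(np.float32),  # File 1: 3 faces -> 9 pairs
    np.random.rand(2, 2, BINS).astype(np.float32),  # File 2: 2 faces -> 4 pairs
]

# Flatten each matrix to [n_i * n_i, BINS], then concatenate along the pair axis
flattened = [m.reshape(-1, BINS) for m in matrices]
merged = np.concatenate(flattened, axis=0)

# Provenance: one file ID per face-pair row
file_id_faceface = np.concatenate([
    np.full(f.shape[0], i, dtype=np.int32) for i, f in enumerate(flattened)
])

print(merged.shape)      # (13, 64)
print(file_id_faceface)  # [0 0 0 0 0 0 0 0 0 1 1 1 1]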
Using the DatasetInfo Class
The DatasetInfo class complements the DatasetMerger by handling metadata extraction, routing, and storage. While DatasetMerger consolidates numerical array data into the .dataset file, DatasetInfo processes JSON metadata files and organizes them into structured Parquet files for efficient querying and analysis.
Initialization
Creating a DatasetInfo instance requires specifying the source metadata files, output locations for different metadata types, and an optional schema that defines routing rules:
from hoops_ai.storage.datasetstorage import DatasetInfo
ds_info = DatasetInfo(
info_files=["path/to/part_001.json", "path/to/part_002.json", ...],
merged_store_path="merged_dataset.infoset", # Output for file-level metadata
attribute_file_path="merged_dataset.attribset", # Output for categorical metadata
schema=schema_dict # Optional: schema for metadata routing
)
The initialization parameters control metadata sources and destinations:
Parameters:
info_files (List[str]): List of paths to JSON metadata files, typically one per encoded CAD file. These JSON files contain metadata captured during the encoding process, such as file sizes, processing times, labels, and custom attributes.
merged_store_path (str): Path where the .infoset Parquet file will be created. This file contains file-level metadata with one row per source file, enabling efficient queries by file characteristics.
attribute_file_path (str): Path where the .attribset Parquet file will be created. This file contains categorical metadata and label descriptions, organized in a normalized table structure.
schema (Dict): Optional schema dictionary defining metadata routing rules. When provided, the schema determines which metadata fields belong in the .infoset file versus the .attribset file. Without a schema, DatasetInfo uses type-based heuristics for routing.
The separation of metadata into two files serves different purposes: .infoset provides per-file information for filtering and stratification, while .attribset stores shared categorical definitions for labels and enumerations.
Metadata Processing
After initialization, the DatasetInfo processes metadata files through a series of steps that parse, route, and store the information in optimized formats.
Parse and Route Metadata
The core metadata processing workflow involves parsing JSON files, building provenance mappings, and storing results in Parquet format:
# Parse all JSON metadata files
ds_info.parse_info_files()
# Build file ID mappings for provenance
file_id_codes = ds_info.build_code_mappings(zip_files)
# Returns: {"part_001": 0, "part_002": 1, ...}
# Store metadata to Parquet files
ds_info.store_info_to_parquet(table_name="file_info")
Each method in this workflow serves a specific purpose:
parse_info_files() Process:
The parse_info_files() method orchestrates the complete metadata extraction and routing pipeline:
Load JSON files: Reads all metadata files specified during initialization
Route metadata: Classifies each field as file-level or categorical based on schema rules or type heuristics
Extract descriptions: Merges nested description structures (e.g., label definitions, enum values)
Validate: Checks metadata against schema requirements if a schema was provided
Aggregate: Combines metadata from all files into unified data structures
Store: Writes file-level metadata to .infoset and categorical metadata to .attribset
Cleanup: Optionally deletes processed JSON files to save disk space
build_code_mappings() Purpose:
The build_code_mappings() method creates the integer ID mapping that DatasetMerger uses for provenance tracking. It extracts file stems (filenames without extensions) from the zip_files list and assigns sequential integer IDs. These IDs appear in the file_id arrays within the merged dataset, enabling you to trace each data element back to its source file.
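The result is equivalent in shape to the sketch below, which assigns sequential IDs keyed by file stem (the exact ordering rule used internally is an assumption here):
import pathlib

# Hypothetical list of encoded .data files
zip_files = ["out/part_001.data", "out/part_002.data", "out/part_003.data"]

# Sequential integer IDs keyed by file stem (filename without extension)
file_id_codes = {pathlib.Path(f).stem: i for i, f in enumerate(zip_files)}
print(file_id_codes)  # {'part_001': 0, 'part_002': 1, 'part_003': 2}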
store_info_to_parquet() Execution:
The store_info_to_parquet() method writes the parsed and routed metadata to Parquet files. The table_name parameter identifies the table within the .infoset file, allowing multiple metadata tables to coexist if needed.
Schema-Driven Routing
When a schema is provided during initialization, DatasetInfo uses sophisticated routing rules to determine where each metadata field should be stored. This schema-driven approach provides precise control over metadata organization and enables validation of metadata structure.
Schema Definition
A complete schema defines metadata categories, field specifications, and routing patterns:
# Schema defines routing rules
schema = {
    "metadata": {
        "file_level": {
            "size_cadfile": {"dtype": "int64", "required": False},
            "processing_time": {"dtype": "float32", "required": False}
        },
        "categorical": {
            "file_label": {"dtype": "int32", "values": [0, 1, 2, 3, 4]}
        },
        "routing_rules": {
            "file_level_patterns": ["size_*", "duration_*", "processing_*"],
            "categorical_patterns": ["*_label", "category", "type"],
            "default_numeric": "file_level",
            "default_categorical": "categorical"
        }
    }
}
ds_info = DatasetInfo(..., schema=schema)
The schema structure includes:
file_level: Dictionary of metadata fields that should appear in the .infoset file. Each field specifies a data type and whether it’s required.
categorical: Dictionary of categorical metadata fields for the .attribset file. Fields can specify allowed values for validation.
routing_rules: Patterns and defaults that handle fields not explicitly defined in the file_level or categorical sections:
file_level_patterns: Wildcard patterns (using *) that match field names for file-level routing
categorical_patterns: Wildcard patterns for categorical routing
default_numeric: Where to route numeric fields not matched by other rules
default_categorical: Where to route string/boolean fields not matched by other rules
Routing Logic
DatasetInfo applies a hierarchical decision process to determine the destination for each metadata field:
Explicit Check: Is the field name explicitly defined in the file_level or categorical sections?
Pattern Match: Does the field name match any wildcard patterns in routing_rules?
Type-Based Default: Based on the field’s data type, apply the default routing (numeric → file_level, string/boolean → categorical)
Routing Examples:
Consider these metadata fields being processed:
# Field: "size_cadfile" with value 1024000
# 1. Explicit check: Found in file_level definitions ✓
# 2. Routing decision: → .infoset file
# Field: "file_label" with value 3
# 1. Explicit check: Found in categorical definitions ✓
# 2. Routing decision: → .attribset file
# Field: "custom_metric" with value 42.5 (not in schema)
# 1. Explicit check: Not found
# 2. Pattern match: No match to any routing patterns
# 3. Type-based default: Numeric value → apply default_numeric rule
# 4. Routing decision: → .infoset file
This hierarchical approach provides flexibility: you can explicitly control routing for critical fields while relying on patterns and defaults for auxiliary metadata.
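The decision process can be pictured as the following sketch. The route_field helper is hypothetical (not part of the DatasetInfo API) and reuses the routing rules from the schema example above:
import fnmatch

schema = {"metadata": {
    "file_level": {"size_cadfile": {"dtype": "int64"}},
    "categorical": {"file_label": {"dtype": "int32"}},
    "routing_rules": {
        "file_level_patterns": ["size_*", "duration_*", "processing_*"],
        "categorical_patterns": ["*_label", "category", "type"],
        "default_numeric": "file_level",
        "default_categorical": "categorical",
    },
}}

def route_field(name, value):
    """Hypothetical sketch: explicit check -> pattern match -> type-based default."""
    meta = schema["metadata"]
    if name in meta["file_level"]:
        return ".infoset"
    if name in meta["categorical"]:
        return ".attribset"
    rules = meta["routing_rules"]
    if any(fnmatch.fnmatch(name, p) for p in rules["file_level_patterns"]):
        return ".infoset"
    if any(fnmatch.fnmatch(name, p) for p in rules["categorical_patterns"]):
        return ".attribset"
    is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    default = rules["default_numeric"] if is_numeric else rules["default_categorical"]
    return ".infoset" if default == "file_level" else ".attribset"

print(route_field("size_cadfile", 1024000))  # .infoset (explicit)
print(route_field("file_label", 3))          # .attribset (explicit)
print(route_field("custom_metric", 42.5))    # .infoset (numeric default)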
Understanding Output Files
The DatasetMerger and DatasetInfo produce three primary output files that together comprise a complete, queryable dataset. Understanding the structure and purpose of each file enables effective use of the merged data in analysis and machine learning workflows.
.dataset Files
The .dataset file contains all numerical array data from the merged CAD files, organized into logical groups and compressed for efficient storage.
Format: Compressed Zarr (ZipStore) with .dataset extension
Contents: All numerical array data organized by groups (faces, edges, graph, etc.)
Structure:
my_flow.dataset (ZipStore)
├── faces/
│ ├── face_indices [N_faces]
│ ├── face_areas [N_faces]
│ ├── face_types [N_faces]
│ ├── face_uv_grids [N_faces × U × V × 7]
│ └── file_id [N_faces] ← Provenance tracking
├── edges/
│ ├── edge_indices [N_edges]
│ ├── edge_lengths [N_edges]
│ ├── edge_types [N_edges]
│ └── file_id [N_edges] ← Provenance tracking
├── graph/
│ ├── edges_source [N_graph_edges]
│ ├── edges_destination [N_graph_edges]
│ ├── num_nodes [N_files]
│ └── file_id [N_graph_edges] ← Provenance tracking
└── faceface/
├── a3_distance [N_face_pairs × bins]
├── d2_distance [N_face_pairs × bins]
├── extended_adjacency [N_face_pairs]
└── file_id [N_face_pairs] ← Provenance tracking
Each group contains arrays concatenated from all source files. The dimensions (N_faces, N_edges, etc.) represent the total count across all merged files. The file_id arrays enable tracing each element back to its source file.
Key Features:
Compression: ZipStore format provides transparent compression, reducing disk space while maintaining fast access
Chunking: Large arrays are divided into chunks for efficient partial loading
Lazy Loading: Data is loaded on-demand rather than reading the entire file into memory
Group Organization: Logical separation enables loading only relevant data for specific analyses
Access Pattern:
Use the DatasetExplorer or DatasetLoader classes to query and analyze data from .dataset files. These classes provide high-level interfaces for filtering, statistical analysis, and ML dataset preparation without requiring direct Zarr manipulation.
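For illustration only, and assuming the zarr v2 API, the sketch below shows that the .dataset file is a standard Zarr ZipStore underneath; the path and group names follow the example structure above:
import xarray as xr
import zarr

# Prefer DatasetExplorer/DatasetLoader in real workflows; this is a low-level peek
store = zarr.ZipStore("my_flow.dataset", mode="r")  # hypothetical path

# Lazily open one group; chunks are read on demand
faces = xr.open_zarr(store, group="faces")

# Example: areas of all faces contributed by source file 0
areas_file0 = faces["face_areas"].where(faces["file_id"] == 0, drop=True)
print(areas_file0.values)

store.close()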
.infoset Files
The .infoset file contains file-level metadata organized in a tabular structure with one row per source CAD file. This format enables efficient filtering and querying based on file characteristics.
Format: Parquet (columnar storage format)
Contents: File-level metadata (one row per file)
Example Schema:
┌────┬──────────────┬─────────────┬───────────────┬──────────────────┬────────┐
│ id │ name │ description │ size_cadfile │ processing_time │ subset │
├────┼──────────────┼─────────────┼───────────────┼──────────────────┼────────┤
│ 0 │ part_001 │ Bracket │ 1024000 │ 12.5 │ train │
│ 1 │ part_002 │ Housing │ 2048000 │ 18.3 │ train │
│ 2 │ part_003 │ Cover │ 512000 │ 8.1 │ test │
└────┴──────────────┴─────────────┴───────────────┴──────────────────┴────────┘
The id column corresponds to the file IDs used in file_id arrays within the .dataset file, enabling cross-referencing between metadata and array data.
Standard Columns:
id: Integer file ID (matches file_id in dataset arrays)
name: File stem (filename without extension)
description: Human-readable description of the file or part
size_cadfile: Original CAD file size in bytes
processing_time: Encoding duration in seconds
subset: Dataset split assignment (train, validation, test)
Additional Columns (schema-dependent):
flow_name: Name of the flow that processed this file
stream_cache_png: Path to PNG visualization
stream_cache_3d: Path to 3D model (SCS/STL format)
Custom fields: Any additional metadata defined in your schema
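Because .infoset is plain Parquet, it can also be inspected directly with pandas (the path and columns below follow the example schema above):
import pandas as pd

info = pd.read_parquet("my_flow.infoset")  # hypothetical path

# Select training files larger than 1 MB
train_files = info[(info["subset"] == "train") & (info["size_cadfile"] > 1_000_000)]

# The 'id' column matches the file_id arrays in the .dataset file,
# so these IDs can be used to filter array data by source file
print(train_files[["id", "name", "size_cadfile"]])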
.attribset Files
The .attribset file contains categorical metadata and label descriptions in a normalized table structure. This file provides human-readable names and descriptions for categorical values used throughout the dataset.
Format: Parquet (columnar storage format)
Contents: Categorical metadata and label descriptions
Example Schema:
┌──────────────┬────┬──────────────┬─────────────────────┐
│ table_name │ id │ name │ description │
├──────────────┼────┼──────────────┼─────────────────────┤
│ file_label │ 0 │ Simple │ Basic geometry │
│ file_label │ 1 │ Medium │ Moderate complexity │
│ file_label │ 2 │ Complex │ High complexity │
│ face_types │ 0 │ Plane │ Planar surface │
│ face_types │ 1 │ Cylinder │ Cylindrical surface │
│ face_types │ 2 │ Sphere │ Spherical surface │
│ edge_types │ 0 │ Line │ Linear edge │
│ edge_types │ 1 │ Arc │ Circular arc │
└──────────────┴────┴──────────────┴─────────────────────┘
The table uses a normalized structure where each row defines one categorical value. The table_name column groups related definitions together.
Column Definitions:
table_name: Name of the categorical attribute (e.g., “file_label”, “face_types”)
id: Integer code used in the dataset arrays
name: Short, human-readable name for this category
description: Detailed description of what this category represents
Purpose:
The .attribset file enables interpretation of integer codes in the dataset. For example, when you see face_types = 1 in the dataset arrays, you can look up the .attribset file to find that code 1 means “Cylinder” with description “Cylindrical surface.”
Usage:
DatasetExplorer automatically loads .attribset data and uses it to provide human-readable labels in visualizations and summaries. When creating distribution plots or categorical analyses, the descriptions from this file appear in legends and axis labels.
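Since .attribset is also plain Parquet, a code-to-name lookup can be built directly with pandas when working outside DatasetExplorer (path and table names follow the example above):
import pandas as pd

attrs = pd.read_parquet("my_flow.attribset")  # hypothetical path

# Build an id -> name lookup for one categorical table
face_type_names = (
    attrs[attrs["table_name"] == "face_types"]
    .set_index("id")["name"]
    .to_dict()
)

print(face_type_names.get(1))  # "Cylinder" in the example table above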
Integration with DatasetExplorer and DatasetLoader
The merged outputs (.dataset, .infoset, .attribset) serve as input for downstream analysis and machine learning workflows. The DatasetExplorer class provides tools for querying and analyzing merged datasets, while DatasetLoader prepares data for ML training with stratified splitting and framework integration.
Quick Start
The following example demonstrates basic usage of both DatasetExplorer and DatasetLoader with merged datasets:
from hoops_ai.dataset import DatasetExplorer, DatasetLoader
# Analysis with DatasetExplorer
explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow")
explorer.print_table_of_contents()
# Query and analyze
face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20)
file_codes = explorer.get_file_list(
group="faces",
where=lambda ds: ds['complexity_level'] >= 4
)
explorer.close()
# ML Preparation with DatasetLoader
loader = DatasetLoader(
merged_store_path="cad_pipeline.dataset",
parquet_file_path="cad_pipeline.infoset"
)
# Stratified split
loader.split(key="complexity_level", train=0.7, validation=0.15, test=0.15)
# Get datasets
train_dataset = loader.get_dataset("train")
val_dataset = loader.get_dataset("validation")
# PyTorch integration
train_torch = train_dataset.to_torch()
This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader.
For Comprehensive Documentation:
For detailed information on using these tools, see:
Dataset Exploration and Mining - Complete guide covering query operations, distribution analysis, metadata filtering, stratified splitting for ML training, PyTorch integration, and complete workflow examples with performance optimization techniques
The exploration and loading documentation provides in-depth coverage of all available operations, including advanced filtering, custom item loaders, and framework-specific adapters.
Custom Merging Tasks
While the Flow module’s automatic merging (via auto_dataset_export=True) handles most use cases, you may need custom merging behavior for specialized requirements. Creating a custom merging task using the @flowtask.custom decorator provides complete control over merge parameters and processing logic.
Advanced: Custom Merging Task
The following example demonstrates a custom merging task that specifies non-standard chunk sizes, batch configurations, and cleanup behavior:
from hoops_ai.flowmanager import flowtask
from typing import List
@flowtask.custom(
    name="custom_dataset_merge",
    inputs=["cad_files_encoded"],
    outputs=["merged_dataset_path"],
    parallel_execution=False
)
def custom_merge_task(encoded_files: List[str]) -> str:
    """Custom merging logic with specific parameters"""
    from hoops_ai.storage.datasetstorage import DatasetMerger, DatasetInfo
    import pathlib

    # Find all .data and .json files
    data_files = [f"{f}.data" for f in encoded_files]
    json_files = [f"{f}.json" for f in encoded_files]
    output_dir = pathlib.Path("./custom_merged")
    output_dir.mkdir(exist_ok=True)

    # Process metadata
    ds_info = DatasetInfo(
        info_files=json_files,
        merged_store_path=str(output_dir / "custom.infoset"),
        attribute_file_path=str(output_dir / "custom.attribset"),
        schema=cad_schema
    )
    ds_info.parse_info_files()
    file_id_codes = ds_info.build_code_mappings(data_files)
    ds_info.store_info_to_parquet()

    # Merge data arrays with custom configuration
    merger = DatasetMerger(
        zip_files=data_files,
        merged_store_path=str(output_dir / "custom.dataset"),
        file_id_codes=file_id_codes,
        dask_client_params={'n_workers': 16, 'threads_per_worker': 2},
        delete_source_files=False
    )
    merger.set_schema(cad_schema)
    merger.merge_data(
        face_chunk=1_000_000,  # Custom chunk sizes
        edge_chunk=1_000_000,
        batch_size=500         # Process in larger batches
    )

    return str(output_dir / "custom.dataset")

# Use in flow with custom merging
import hoops_ai
custom_flow = hoops_ai.create_flow(
    name="custom_pipeline",
    tasks=[gather_cad_files, encode_cad_geometry, custom_merge_task],
    auto_dataset_export=False  # Disable automatic merging
)
This approach is useful when you need to:
Use non-standard chunk sizes optimized for your specific data characteristics
Preserve source files (delete_source_files=False) for archival or debugging
Apply custom Dask configurations for specific hardware environments
Integrate additional processing steps between encoding and merging
The custom task has access to the full DatasetMerger and DatasetInfo APIs, enabling any merging configuration supported by these classes.
Performance Considerations
Optimizing merge performance requires understanding memory constraints, parallelization strategies, and compression tradeoffs. This section provides guidelines for tuning DatasetMerger for different dataset sizes and hardware configurations.
Memory Management
Large datasets can exhaust available memory during merging. Batch merging provides a solution by processing files in manageable groups and progressively building the final dataset.
Batch Merging for Large Datasets
For datasets with thousands of files or very large individual files, batch merging prevents out-of-memory errors:
# For 10,000+ files, use batch merging
merger.merge_data(
batch_size=500, # Process 500 files at a time
face_chunk=500_000,
edge_chunk=500_000
)
Batch merging offers several benefits:
Benefits:
Prevents out-of-memory errors
Enables progress tracking
Allows incremental cleanup of source files
The optimal batch_size depends on your system’s available memory and the size of individual files. Monitor memory usage during the first batch and adjust accordingly.
Parallel Processing
DatasetMerger uses Dask for parallel processing, distributing merge operations across multiple workers. Proper Dask configuration significantly impacts merge performance.
Dask Configuration
The dask_client_params parameter controls parallel execution behavior:
dask_client_params = {
    'n_workers': 8,               # Number of parallel workers
    'threads_per_worker': 4,      # Threads per worker
    'processes': True,            # Use separate processes (not threads)
    'memory_limit': '8GB',        # Per-worker memory limit
    'dashboard_address': ':8787'  # Monitoring dashboard URL
}
merger = DatasetMerger(
    zip_files=files,
    merged_store_path="output.dataset",
    dask_client_params=dask_client_params
)
Each parameter affects performance differently:
n_workers: Total number of parallel processes. Set based on CPU cores available. More workers enable greater parallelism but increase memory requirements.
threads_per_worker: Number of threads within each worker process. Increase for CPU-intensive operations, decrease for I/O-intensive tasks.
processes: When True, workers run as separate processes with isolated memory. This is recommended for memory-intensive operations to prevent interference.
memory_limit: Maximum memory per worker. Set conservatively to prevent system-wide memory exhaustion. If workers exceed this limit, Dask spills data to disk.
dashboard_address: Enables the Dask dashboard for real-time monitoring of task execution, memory usage, and worker status.
Tuning Guidelines
Different workload characteristics require different Dask configurations:
Many Small Files (I/O bound):
More workers, fewer threads per worker
Example: n_workers=16, threads_per_worker=2
Rationale: File I/O is the bottleneck; maximize concurrent file reads
Few Large Files (CPU bound):
Fewer workers, more threads per worker
Example: n_workers=4, threads_per_worker=8
Rationale: Computation dominates; leverage multi-threading for array operations
Memory Constrained Systems:
Reduce memory_limit per worker
Decrease batch_size to process fewer files at once
Example: n_workers=4, memory_limit='4GB', batch_size=100
Rationale: Prevent memory swapping, which severely degrades performance
Optimal Configuration Discovery:
Start with conservative settings (few workers, moderate memory limits) and monitor the Dask dashboard during a test merge. Increase parallelism if CPU utilization is low, or decrease it if you observe memory pressure or disk swapping.
Compression
The DatasetMerger uses Zarr’s compression capabilities to reduce disk space requirements for merged datasets. Understanding the compression settings helps predict storage needs and access performance.
Zarr Compression Settings
DatasetMerger applies the following compression configuration automatically:
Codec: Zstd level 12 (high compression ratio with reasonable decompression speed)
Chunks: Automatic sizing based on array dimensions (typically ~1M elements per chunk)
Filters: Delta encoding for integer arrays (stores differences rather than absolute values, improving compression)
These settings balance compression ratio against decompression speed for typical CAD data patterns.
Typical Compression Ratios
Compression effectiveness varies by data type:
- Raw array data (face areas, edge lengths): 10-20x compression
Floating-point arrays with spatial locality compress very well
- Face-face histograms (distance distributions): 5-10x compression
Histogram bins often contain many zeros, enabling good compression
- Graph structures (edge indices, adjacency): 3-5x compression
Integer sequences with predictable patterns compress moderately well
These ratios mean a dataset that would occupy 100 GB uncompressed typically requires 5-10 GB as a compressed .dataset file.
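For readers who want to reproduce comparable settings outside the merger, the sketch below uses the zarr v2 / numcodecs API with the codec and filter named above; it is an illustration of equivalent settings, not the merger's internal code:
import numpy as np
import zarr
from numcodecs import Zstd, Delta

store = zarr.ZipStore("example.dataset", mode="w")  # hypothetical output path
root = zarr.group(store)
faces = root.create_group("faces")

face_indices = np.arange(1_000_000, dtype=np.int32)
faces.create_dataset(
    "face_indices",
    data=face_indices,
    chunks=(1_000_000,),          # ~1M elements per chunk
    compressor=Zstd(level=12),    # high-ratio codec
    filters=[Delta(dtype="i4")],  # delta encoding for integer arrays
)

store.close()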
Summary
The DatasetMerger serves as the critical bridge between individual encoded CAD files and ML-ready unified datasets, providing the consolidation layer that enables efficient analysis and training.
Key Capabilities
Automatic Integration: Seamlessly invoked by the Flow module with auto_dataset_export=True, requiring no manual setup for standard workflows
Schema-Driven Organization: Uses SchemaBuilder definitions for predictable, validated merging with explicit control over data structure
Parallel Processing: Leverages Dask’s distributed computation engine for efficient large-scale data consolidation across multiple CPU cores
Provenance Tracking: Maintains file_id arrays in every group, enabling traceability from merged data back to source files
Multiple Output Formats: Produces three complementary files serving different purposes:
.dataset: Numerical array data in compressed Zarr format
.infoset: File-level metadata in queryable Parquet format
.attribset: Categorical metadata and label descriptions in Parquet format
Downstream Integration
The merged output files serve as inputs for analysis and machine learning tools:
DatasetExplorer: Query, filter, and analyze merged datasets with support for statistical analysis, distribution creation, and metadata-based filtering
DatasetLoader: Prepare stratified train/validation/test splits for ML training with support for PyTorch, TensorFlow, and custom frameworks
Complete System Architecture
The following diagram illustrates how DatasetMerger fits within the complete HOOPS AI data pipeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HOOPS AI Data Pipeline │
└─────────────────────────────────────────────────────────────────────────────┘
STEP 1: SCHEMA DEFINITION (Optional but Recommended)
┌──────────────────────────────────────────────────────────────┐
│ SchemaBuilder │
│ • Define groups (faces, edges, graph) │
│ • Define arrays and dimensions │
│ • Set metadata routing rules │
│ • Build schema dictionary │
└────────────────────────┬─────────────────────────────────────┘
│ schema.json
↓
STEP 2: ENCODING (Per-File Parallel Processing)
┌──────────────────────────────────────────────────────────────┐
│ Flow → CADEncodingTask (via ParallelTask) │
│ │
│ CAD File 1 → Encoder → DataStorage → part_001.data ────┐ │
│ CAD File 2 → Encoder → DataStorage → part_002.data ────┤ │
│ ... │ │
│ CAD File N → Encoder → DataStorage → part_N.data ──────┤ │
│ │ │
│ Each .data file (Zarr format) contains: │ │
│ • faces/face_areas, face_types, face_uv_grids │ │
│ • edges/edge_lengths, edge_types │ │
│ • graph/edges_source, edges_destination │ │
│ │ │
│ Metadata files (JSON): │ │
│ part_001.json, part_002.json, ..., part_N.json │ │
└──────────────────────────────────────────────────────────┼───┘
│
┌─────────────────────────────────┘
│ All .data and .json files
↓
STEP 3: AUTOMATIC MERGING (AutoDatasetExportTask)
┌──────────────────────────────────────────────────────────────┐
│ DatasetInfo │
│ • Load all .json metadata files │
│ • Route metadata using schema rules │
│ • Create file_id mappings │
│ • Output: {flow_name}.infoset (file-level metadata) │
│ {flow_name}.attribset (categorical metadata) │
└────────────────────────┬─────────────────────────────────────┘
│
┌────────────────────────┴─────────────────────────────────────┐
│ DatasetMerger │
│ • Discover groups from schemas or heuristics │
│ • Load all .data files as xarray Datasets │
│ • Concatenate arrays by group (faces, edges, graph) │
│ • Add file_id provenance tracking │
│ • Handle special processing (matrix flattening) │
│ • Output: {flow_name}.dataset (compressed Zarr) │
└────────────────────────┬─────────────────────────────────────┘
│
│ Three output files:
│ • {flow_name}.dataset
│ • {flow_name}.infoset
│ • {flow_name}.attribset
↓
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT FILES │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.dataset (Zarr/ZipStore) │ │
│ │ ───────────────────────────────────── │ │
│ │ • Numerical array data organized by groups │ │
│ │ • faces/, edges/, graph/, faceface/ │ │
│ │ • Each group has file_id for provenance │ │
│ │ • Compressed for efficient storage │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.infoset (Parquet) │ │
│ │ ───────────────────────────────── │ │
│ │ • File-level metadata (one row per file) │ │
│ │ • Columns: id, name, description, │ │
│ │ size_cadfile, processing_time, etc. │ │
│ │ • Queryable with pandas │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ {flow_name}.attribset (Parquet) │ │
│ │ ───────────────────────────────────── │ │
│ │ • Categorical metadata & descriptions │ │
│ │ • Label mappings (id → name → description) │ │
│ │ • Face types, edge types, etc. │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────┬───────────────────────────────────────┘
│
│ Consumed by analysis & ML tools
↓
STEP 4: ANALYSIS & ML TRAINING
┌──────────────────────────────────────────────────────────────┐
│ DatasetExplorer │
│ • Query merged datasets (get_array_data) │
│ • Statistical analysis (compute_statistics) │
│ • Distribution creation (create_distribution) │
│ • Metadata filtering (filter_files_by_metadata) │
│ • Visualization support │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ DatasetLoader │
│ • Load merged datasets for ML training │
│ • Stratified train/val/test splitting │
│ • PyTorch/TensorFlow adapter support │
│ • Custom item loader functions │
│ • Batch loading for training loops │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ ML Training Pipeline │
│ • Load train/val/test datasets │
│ • Create DataLoaders with batching │
│ • Train neural networks (GNNs, CNNs, etc.) │
│ • Model evaluation and inference │
└──────────────────────────────────────────────────────────────┘
Key Integration Points
The DatasetMerger participates in several critical integration points within the HOOPS AI ecosystem:
SchemaBuilder → DataStorage: Schema defines how individual files are organized during encoding
DataStorage → DatasetMerger: Individual .data files are merged using the schema structure to determine grouping and concatenation logic
SchemaBuilder → DatasetInfo: Schema routing rules determine which metadata fields go to .infoset versus .attribset
DatasetMerger Output → DatasetExplorer: The merged .dataset file enables efficient exploration, querying, and statistical analysis
DatasetMerger Output → DatasetLoader: The merged .dataset and .infoset files provide the foundation for ML training data preparation
This integrated system enables seamless progression from raw CAD files to production ML models with minimal manual intervention. The schema-driven approach ensures consistency across all pipeline stages, while automatic merging eliminates the need for custom data consolidation code in most workflows.