######################## Data Merging in HOOPS AI ######################## .. sidebar:: .. contents:: Table of Contents :local: :depth: 1 Overview ======== The **DatasetMerger** is a critical component in HOOPS AI's data pipeline that consolidates multiple individual encoded CAD files (`.data` files) into a single, unified, compressed dataset (`.dataset` file). It performs parallel array concatenation, metadata aggregation, and schema-driven organization to create ML-ready datasets. **Key Purpose**: The DatasetMerger transforms a collection of per-file encoded data into a unified, queryable, analysis-ready dataset suitable for machine learning training and exploration. The system follows an automated pipeline pattern:: Individual .data files → DatasetMerger → Unified .dataset → DatasetExplorer/DatasetLoader → ML Pipeline .. important:: This document focuses on the **merging process** (consolidation of individual files). For information on **using** the merged datasets (analysis, querying, ML preparation), see: - :doc:`explore-dataset` - Comprehensive guide for dataset exploration and ML training preparation Architecture & Role =================== Position in HOOPS AI Pipeline ---------------------------- The DatasetMerger operates in the **ETL (Extract-Transform-Load) phase** of the HOOPS AI pipeline: .. code-block:: text ┌──────────────────────────────────────────────────────────────────────┐ │ HOOPS AI Data Pipeline │ └──────────────────────────────────────────────────────────────────────┘ 1. ENCODING PHASE (Per-File) ┌─────────────────────────────────────────────────────┐ │ CAD File → Encoder → DataStorage → .data file │ │ (Repeated for each file in parallel) │ └─────────────────────────────────────────────────────┘ ↓ 2. MERGING PHASE (DatasetMerger) ← YOU ARE HERE ┌─────────────────────────────────────────────────────┐ │ Multiple .data files → DatasetMerger → .dataset │ │ Multiple .json files → DatasetInfo → .infoset │ │ → .attribset │ └─────────────────────────────────────────────────────┘ ↓ 3. ANALYSIS/ML PHASE ┌─────────────────────────────────────────────────────┐ │ .dataset → DatasetExplorer → Query/Filter/Analyze │ │ .dataset → DatasetLoader → Train/Val/Test Splits │ │ → ML Model Training │ └─────────────────────────────────────────────────────┘ **Why Merging is Essential:** - **Unified Access**: ML models need a single dataset, not thousands of individual files - **Efficient Queries**: Merged data enables fast filtering and statistical analysis - **Memory Efficiency**: Dask-based chunked operations handle datasets larger than RAM - **Metadata Correlation**: File-level and categorical metadata linked to data arrays Core Concepts ---------------- Data Organization ~~~~~~~~~~~~~~~~~~ Individual encoded files have a group-based structure: .. code-block:: text part_001.data (ZipStore Zarr) ├── faces/ │ ├── face_indices (array) │ ├── face_areas (array) │ ├── face_types (array) │ └── face_uv_grids (array) ├── edges/ │ ├── edge_indices (array) │ ├── edge_lengths (array) │ └── edge_types (array) ├── graph/ │ ├── edges_source (array) │ ├── edges_destination (array) │ └── num_nodes (array) └── metadata.json After merging, the dataset consolidates all files: .. code-block:: text my_flow.dataset (ZipStore Zarr) ├── faces/ │ ├── face_indices (concatenated from all files) │ ├── face_areas (concatenated from all files) │ └── file_id (provenance tracking: which face came from which file) ├── edges/ │ └── ... (similarly concatenated) └── graph/ └── ... 
(similarly concatenated) Group-Based Merging ~~~~~~~~~~~~~~~~~~~~ The DatasetMerger organizes arrays into **logical groups** for concatenation: **Mathematical Representation:** For a group :math:`G` with primary dimension :math:`d` (e.g., "face"), given :math:`N` files with arrays: .. math:: A_i^G = \{a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}\} where :math:`a_{i,j}` is the :math:`j`-th element of array :math:`A` from file :math:`i`, and :math:`n_i` is the count of dimension :math:`d` in file :math:`i`. The merged array is: .. math:: A_{\text{merged}}^G = A_1^G \oplus A_2^G \oplus \cdots \oplus A_N^G where :math:`\oplus` denotes concatenation along dimension :math:`d`: .. math:: A_{\text{merged}}^G = \{a_{1,1}, \ldots, a_{1,n_1}, a_{2,1}, \ldots, a_{2,n_2}, \ldots, a_{N,1}, \ldots, a_{N,n_N}\} **Example:** .. code-block:: text File 1: face_areas = [1.5, 2.3, 4.1] (3 faces) File 2: face_areas = [3.7, 5.2] (2 faces) File 3: face_areas = [2.1, 1.8, 6.3] (3 faces) Merged: face_areas = [1.5, 2.3, 4.1, 3.7, 5.2, 2.1, 1.8, 6.3] (8 faces total) **Provenance Tracking:** A `file_id` array is added to track origin: .. code-block:: text file_id = [0, 0, 0, 1, 1, 2, 2, 2] └─File 1─┘ └File 2┘ └─File 3─┘ Schema-Driven vs Heuristic Discovery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The DatasetMerger supports two modes for discovering data structure: 1. Schema-Driven Discovery (Preferred) ************************************** When a schema is available (from `SchemaBuilder.set_schema()`), the merger uses explicit group definitions: .. code-block:: python schema = { "groups": { "faces": { "primary_dimension": "face", "arrays": { "face_indices": {"dims": ["face"], "dtype": "int32"}, "face_areas": {"dims": ["face"], "dtype": "float32"} } }, "edges": { "primary_dimension": "edge", "arrays": { "edge_lengths": {"dims": ["edge"], "dtype": "float32"} } } } } merger.set_schema(schema) # Use explicit structure **Benefits:** - **Predictable**: Structure is explicit and documented - **Validated**: Arrays are checked against schema dimensions - **Extensible**: New groups/arrays follow defined patterns 2. Heuristic Discovery (Fallback) ******************************** Without a schema, the merger scans files and uses naming patterns to infer structure: .. code-block:: python # Heuristic rules: # - Arrays with "face" in name → "faces" group # - Arrays with "edge" in name → "edges" group # - Arrays with "graph" in name → "graph" group # - Arrays in "graph/edges/" subgroup → flattened to "graph/edges_source" discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5) **Pattern Matching:** .. code-block:: python if "face" in array_name.lower(): group = "faces" elif "edge" in array_name.lower(): group = "edges" elif "duration" in array_name.lower() or "size" in array_name.lower(): group = "file" # File-level metadata Using the DatasetMerger Class ============================== The DatasetMerger class provides a high-level interface for consolidating individual encoded CAD files into unified datasets. This section demonstrates how to initialize the merger, discover data structures, and execute the merging process with various configurations. Initialization -------------- Creating a DatasetMerger instance requires specifying the source files to merge, the output location, and configuration parameters for parallel processing. The merger uses Dask for distributed computation, enabling efficient processing of large datasets that may not fit in memory. .. 
code-block:: python from hoops_ai.storage.datasetstorage import DatasetMerger # Create merger instance with configuration merger = DatasetMerger( zip_files=["path/to/part_001.data", "path/to/part_002.data", ...], merged_store_path="merged_dataset.dataset", file_id_codes={"part_001": 0, "part_002": 1, ...}, # From DatasetInfo dask_client_params={'n_workers': 4, 'threads_per_worker': 4}, delete_source_files=True # Clean up after merging ) The initialization parameters control both the data sources and the merge behavior: **Parameters:** - **zip_files** (List[str]): List of paths to individual ``.data`` files to merge. These are typically the output from parallel CAD encoding tasks. - **merged_store_path** (str): Path where the merged ``.dataset`` file will be created. This becomes the single unified dataset used by DatasetExplorer and DatasetLoader. - **file_id_codes** (Dict[str, int]): Mapping from file stem (filename without extension) to integer ID. This dictionary provides provenance tracking, allowing you to identify which source file contributed each piece of data in the merged dataset. Typically provided by DatasetInfo. - **dask_client_params** (Dict): Configuration for Dask's parallel processing engine. Controls the number of worker processes (``n_workers``) and threads per worker (``threads_per_worker``). Adjust based on available CPU cores and memory. - **delete_source_files** (bool): If ``True``, automatically deletes individual ``.data`` files after successful merging. This saves disk space but should only be enabled if the source files are no longer needed. The file_id_codes dictionary is crucial for maintaining data provenance. During merging, a ``file_id`` array is automatically added to each group, tracking which source file each element came from. This enables filtering and analysis by source file using DatasetExplorer. Group Discovery --------------- Before merging data, the DatasetMerger must understand the structure of the encoded files. This discovery process can be automatic (scanning files and inferring structure) or explicit (using a predefined schema). Automatic Discovery ~~~~~~~~~~~~~~~~~~~ The merger can automatically discover data groups and arrays by scanning a sample of input files and analyzing their structure: .. code-block:: python # Discover groups by scanning input files discovered_groups = merger.discover_groups_from_files(max_files_to_scan=5) # Print the discovered structure for verification merger.print_discovered_structure() **Example output:** .. code-block:: text Discovered Data Structure: ================================================== Group: faces Arrays: ['face_indices', 'face_areas', 'face_types', 'face_uv_grids'] Group: edges Arrays: ['edge_indices', 'edge_lengths', 'edge_types'] Group: graph Arrays: ['edges_source', 'edges_destination', 'num_nodes'] The ``max_files_to_scan`` parameter controls how many files are examined during discovery. Scanning more files ensures a complete picture of the data structure, but increases discovery time. For large datasets with consistent structure, scanning 5-10 files is typically sufficient. Automatic discovery uses heuristic rules to group arrays (e.g., arrays with "face" in the name go to the "faces" group). While convenient, this approach may misclassify arrays with unconventional naming or fail to capture semantic relationships between arrays. Manual Schema Setting ~~~~~~~~~~~~~~~~~~~~~~ For production pipelines, explicitly setting a schema is recommended. 
This provides precise control over data organization and validates that all encoded files conform to the expected structure: .. code-block:: python from hoops_ai.storage.datasetstorage import SchemaBuilder # Build schema definition builder = SchemaBuilder(domain="CAD_analysis", version="1.0") # Define faces group faces_group = builder.create_group("faces", "face", "Face-level geometric data") faces_group.create_array("face_areas", ["face"], "float32", "Surface area per face") faces_group.create_array("face_types", ["face"], "int32", "Face type codes") # Define edges group edges_group = builder.create_group("edges", "edge", "Edge-level data") edges_group.create_array("edge_lengths", ["edge"], "float32", "Edge lengths") # Build and apply schema schema = builder.build() merger.set_schema(schema) Using a schema provides several benefits: - **Validation**: The merger verifies that all arrays in the source files match the schema definitions, catching encoding errors early - **Documentation**: Schema serves as explicit documentation of the dataset structure - **Consistency**: Guarantees that all source files have the same structure, preventing merge failures - **Type Safety**: Ensures data types match expectations, avoiding silent type coercion issues The schema-based approach is particularly important when merging files from different encoding runs or when integrating data from multiple sources. Merging Execution ----------------- Once the DatasetMerger is initialized and the data structure is discovered or defined, you can execute the merge operation. The merger supports two execution modes: single-pass merging for smaller datasets and batch merging for large datasets that exceed available memory. Single-Pass Merge ~~~~~~~~~~~~~~~~~ For datasets where all source files can be loaded into memory simultaneously, single-pass merging provides the fastest execution: .. code-block:: python merger.merge_data( face_chunk=500_000, # Chunk size for face arrays edge_chunk=500_000, # Chunk size for edge arrays faceface_flat_chunk=100_000, # Chunk size for face-face relationships batch_size=None, # None = merge all files at once consolidate_metadata=True, # Consolidate Zarr metadata after merge force_compression=True # Output as compressed .dataset file ) The chunk size parameters control how Dask partitions arrays during processing. Larger chunks reduce overhead but increase memory usage. The optimal chunk size depends on your dataset characteristics: - **face_chunk/edge_chunk**: Set based on the typical number of faces/edges per file. For CAD models with 1000-10000 faces, 500,000 is a reasonable default. - **faceface_flat_chunk**: For face adjacency matrices, use smaller chunks (100,000) to manage the quadratic memory growth. **Process Overview:** The single-pass merge follows these steps: 1. **Load**: All ``.data`` files are loaded into xarray Datasets, organized by group (faces, edges, graph, etc.) 2. **Concatenate**: Arrays within each group are concatenated along their primary dimension (face, edge, etc.) 3. **Provenance**: A ``file_id`` array is added to each group, tracking which source file contributed each element 4. **Write**: The merged data is written to the ``.dataset`` file with optional compression This approach is memory-intensive but fast, making it ideal for datasets with hundreds or thousands of files where the total data size fits comfortably in available RAM. 
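To make the Load → Concatenate → Provenance steps above concrete, the following is a minimal, self-contained sketch of the same idea using plain ``numpy`` and ``xarray``. The values mirror the ``face_areas`` example earlier in this document; the actual merger layers Dask chunking, compression, and schema validation on top of this basic operation.

.. code-block:: python

   import numpy as np
   import xarray as xr

   # Stand-ins for the face arrays decoded from two .data files
   per_file_areas = [
       np.array([1.5, 2.3, 4.1], dtype="float32"),  # file 0: 3 faces
       np.array([3.7, 5.2], dtype="float32"),       # file 1: 2 faces
   ]

   datasets = []
   for file_id, areas in enumerate(per_file_areas):
       datasets.append(
           xr.Dataset({
               "face_areas": ("face", areas),
               # Provenance: tag every face with the integer ID of its source file
               "file_id": ("face", np.full(areas.shape[0], file_id, dtype="int64")),
           })
       )

   # Concatenate along the group's primary dimension, as the merger does per group
   merged = xr.concat(datasets, dim="face")

   print(merged["face_areas"].values)  # [1.5 2.3 4.1 3.7 5.2]
   print(merged["file_id"].values)     # [0 0 0 1 1]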
Batch Merge
~~~~~~~~~~~

For large datasets that exceed available memory, batch merging processes files in smaller groups and progressively builds the final dataset:

.. code-block:: python

   merger.merge_data(
       face_chunk=500_000,
       edge_chunk=500_000,
       faceface_flat_chunk=100_000,
       batch_size=200,              # Process 200 files at a time
       consolidate_metadata=True,
       force_compression=True
   )

Setting ``batch_size`` to a positive integer activates batch mode. The merger processes files in batches, creating partial ``.dataset`` files, then merges these partial datasets into the final result.

**Process Overview:**

Batch merging follows a two-stage approach:

1. **Partial Merges**: The source files are divided into batches of ``batch_size`` files. Each batch is merged into a partial ``.dataset`` file. If ``delete_source_files=True``, source files are deleted after each batch is successfully merged, progressively freeing disk space.

2. **Final Merge**: All partial ``.dataset`` files are merged into the final unified ``.dataset`` file. The partial datasets are then deleted (if ``delete_source_files=True``).

**Mathematical Formulation:**

For :math:`N` total files with batch size :math:`B`, the number of batches is:

.. math::

   \text{Number of batches} = k = \lceil N / B \rceil

Batch :math:`i` (for :math:`i = 1, \ldots, k`) contains the files:

.. math::

   \text{Batch}_i = \{f_{(i-1)B+1}, f_{(i-1)B+2}, \ldots, f_{\min(iB,\, N)}\}

Each batch is merged into a partial dataset :math:`D_{\text{batch}_i}`. The final merge concatenates these partial results:

.. math::

   D_{\text{final}} = D_{\text{batch}_1} \oplus D_{\text{batch}_2} \oplus \cdots \oplus D_{\text{batch}_k}

where :math:`\oplus` denotes the concatenation operation along the primary dimension of each group.

**Benefits of Batch Merging:**

- **Memory Efficiency**: Only one batch needs to be in memory at a time, enabling processing of arbitrarily large datasets
- **Progress Tracking**: Partial merges provide natural checkpoints, making it easier to resume if the process is interrupted
- **Disk Space Management**: Source files can be deleted incrementally, reducing peak disk space requirements

**Choosing Batch Size:**

The optimal batch size balances memory usage against merge overhead:

- **Too Small** (e.g., 10-50 files): Many partial merges increase overhead and total processing time
- **Too Large** (e.g., 1000+ files): May exceed available memory, causing slow disk swapping
- **Recommended**: 100-300 files per batch for typical CAD datasets with 1000-100,000 faces per file

Monitor memory usage during the first batch and adjust ``batch_size`` if you see excessive swapping or out-of-memory errors.

Special Processing: Matrix Flattening
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some encoded data includes face-to-face relationship matrices, such as distance distributions or interaction features. These matrices have dimensions :math:`[\text{face} \times \text{face} \times \text{bins}]`, where the number of faces varies per file. The DatasetMerger applies special flattening logic to enable efficient concatenation.

**Per-File Structure:**

Each file contains a matrix representing relationships between all pairs of faces:

.. code-block:: text

   File 1: a3_distance [3 faces × 3 faces × 64 bins]
   File 2: a3_distance [2 faces × 2 faces × 64 bins]

Direct concatenation is impossible because the first two dimensions differ between files.
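Before the formal description that follows, a small ``numpy`` sketch shows both the failure and the fix. The array values here are random placeholders; in the real pipeline the merger detects and flattens these arrays automatically.

.. code-block:: python

   import numpy as np

   BINS = 64
   rng = np.random.default_rng(0)

   # Hypothetical per-file face-face histograms: [n_faces x n_faces x bins]
   a3_file1 = rng.random((3, 3, BINS), dtype=np.float32)  # 3 faces -> 3x3 pairs
   a3_file2 = rng.random((2, 2, BINS), dtype=np.float32)  # 2 faces -> 2x2 pairs

   # np.concatenate([a3_file1, a3_file2], axis=0) would fail: 3x3 vs 2x2 leading dims

   # Flatten each matrix to [n_faces * n_faces, bins], then concatenate the face-pairs
   flat1 = a3_file1.reshape(-1, BINS)                     # 9 face-pairs
   flat2 = a3_file2.reshape(-1, BINS)                     # 4 face-pairs
   merged = np.concatenate([flat1, flat2], axis=0)

   # Provenance for the flattened dimension
   file_id_faceface = np.concatenate([
       np.zeros(flat1.shape[0], dtype="int64"),
       np.ones(flat2.shape[0], dtype="int64"),
   ])

   print(merged.shape)        # (13, 64)
   print(file_id_faceface)    # [0 0 0 0 0 0 0 0 0 1 1 1 1]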
**Flattening Process:** For each file :math:`i` with :math:`n_i` faces and relationship matrix :math:`A_i \in \mathbb{R}^{n_i \times n_i \times B}`, where :math:`B` is the number of bins, the flattening operation reshapes the matrix: .. math:: \text{Flattened}_i = \text{reshape}(A_i, [n_i \times n_i, B]) This converts the 3D matrix into a 2D array where each row represents one face-pair relationship and columns represent bins. **Merged Structure:** After flattening all files, the merger concatenates along the face-pair dimension: .. code-block:: text Merged: a3_distance [13 face-pairs × 64 bins] └─ (3×3=9) + (2×2=4) = 13 pairs The total number of rows is: .. math:: \text{Total face-pairs} = \sum_{i=1}^{N} (n_i \times n_i) **Provenance Tracking:** A ``file_id_faceface`` array tracks which file each face-pair came from: .. code-block:: text file_id_faceface = [0,0,0,0,0,0,0,0,0, 1,1,1,1] └────File 1 (9)────┘ └File 2 (4)┘ This enables filtering face-pair relationships by source file during analysis. **Benefits of Flattening:** - **Concatenation**: Flattened matrices have consistent dimensions, enabling straightforward concatenation - **Storage Efficiency**: Avoids sparse 3D storage where many file-pair combinations would be empty - **Query Performance**: 2D structure supports efficient slicing and filtering operations The flattening process is automatic and transparent—you don't need to modify encoded data. The merger detects face-face relationship arrays by naming convention and applies flattening automatically. Using the DatasetInfo Class ============================ The DatasetInfo class complements the DatasetMerger by handling metadata extraction, routing, and storage. While DatasetMerger consolidates numerical array data into the ``.dataset`` file, DatasetInfo processes JSON metadata files and organizes them into structured Parquet files for efficient querying and analysis. Initialization -------------- Creating a DatasetInfo instance requires specifying the source metadata files, output locations for different metadata types, and an optional schema that defines routing rules: .. code-block:: python from hoops_ai.storage.datasetstorage import DatasetInfo ds_info = DatasetInfo( info_files=["path/to/part_001.json", "path/to/part_002.json", ...], merged_store_path="merged_dataset.infoset", # Output for file-level metadata attribute_file_path="merged_dataset.attribset", # Output for categorical metadata schema=schema_dict # Optional: schema for metadata routing ) The initialization parameters control metadata sources and destinations: **Parameters:** - **info_files** (List[str]): List of paths to JSON metadata files, typically one per encoded CAD file. These JSON files contain metadata captured during the encoding process, such as file sizes, processing times, labels, and custom attributes. - **merged_store_path** (str): Path where the ``.infoset`` Parquet file will be created. This file contains file-level metadata with one row per source file, enabling efficient queries by file characteristics. - **attribute_file_path** (str): Path where the ``.attribset`` Parquet file will be created. This file contains categorical metadata and label descriptions, organized in a normalized table structure. - **schema** (Dict): Optional schema dictionary defining metadata routing rules. When provided, the schema determines which metadata fields belong in the ``.infoset`` file versus the ``.attribset`` file. Without a schema, DatasetInfo uses type-based heuristics for routing. 
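The JSON files listed in ``info_files`` are produced during encoding, one per CAD file. Purely as an illustration of the kind of content DatasetInfo consumes (the actual field names depend on your encoder and schema), one such file could be written like this:

.. code-block:: python

   import json

   # Hypothetical contents of part_001.json -- field names are illustrative,
   # not a fixed contract; your encoder and schema determine what is recorded.
   part_001_metadata = {
       "name": "part_001",
       "size_cadfile": 1024000,   # numeric field, typically routed to .infoset
       "processing_time": 12.5,   # numeric field, typically routed to .infoset
       "file_label": 3,           # categorical field, typically routed to .attribset
   }

   with open("part_001.json", "w") as f:
       json.dump(part_001_metadata, f, indent=2)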
The separation of metadata into two files serves different purposes: ``.infoset`` provides per-file information for filtering and stratification, while ``.attribset`` stores shared categorical definitions for labels and enumerations. Metadata Processing ------------------- After initialization, the DatasetInfo processes metadata files through a series of steps that parse, route, and store the information in optimized formats. Parse and Route Metadata ~~~~~~~~~~~~~~~~~~~~~~~~~ The core metadata processing workflow involves parsing JSON files, building provenance mappings, and storing results in Parquet format: .. code-block:: python # Parse all JSON metadata files ds_info.parse_info_files() # Build file ID mappings for provenance file_id_codes = ds_info.build_code_mappings(zip_files) # Returns: {"part_001": 0, "part_002": 1, ...} # Store metadata to Parquet files ds_info.store_info_to_parquet(table_name="file_info") Each method in this workflow serves a specific purpose: **parse_info_files() Process:** The ``parse_info_files()`` method orchestrates the complete metadata extraction and routing pipeline: 1. **Load JSON files**: Reads all metadata files specified during initialization 2. **Route metadata**: Classifies each field as file-level or categorical based on schema rules or type heuristics 3. **Extract descriptions**: Merges nested description structures (e.g., label definitions, enum values) 4. **Validate**: Checks metadata against schema requirements if a schema was provided 5. **Aggregate**: Combines metadata from all files into unified data structures 6. **Store**: Writes file-level metadata to ``.infoset`` and categorical metadata to ``.attribset`` 7. **Cleanup**: Optionally deletes processed JSON files to save disk space **build_code_mappings() Purpose:** The ``build_code_mappings()`` method creates the integer ID mapping that DatasetMerger uses for provenance tracking. It extracts file stems (filenames without extensions) from the zip_files list and assigns sequential integer IDs. These IDs appear in the ``file_id`` arrays within the merged dataset, enabling you to trace each data element back to its source file. **store_info_to_parquet() Execution:** The ``store_info_to_parquet()`` method writes the parsed and routed metadata to Parquet files. The ``table_name`` parameter identifies the table within the ``.infoset`` file, allowing multiple metadata tables to coexist if needed. Schema-Driven Routing ---------------------- When a schema is provided during initialization, DatasetInfo uses sophisticated routing rules to determine where each metadata field should be stored. This schema-driven approach provides precise control over metadata organization and enables validation of metadata structure. Schema Definition ~~~~~~~~~~~~~~~~~ A complete schema defines metadata categories, field specifications, and routing patterns: .. 
code-block:: python # Schema defines routing rules schema = { "metadata": { "file_level": { "size_cadfile": {"dtype": "int64", "required": False}, "processing_time": {"dtype": "float32", "required": False} }, "categorical": { "file_label": {"dtype": "int32", "values": [0, 1, 2, 3, 4]} }, "routing_rules": { "file_level_patterns": ["size_*", "duration_*", "processing_*"], "categorical_patterns": ["*_label", "category", "type"], "default_numeric": "file_level", "default_categorical": "categorical" } } } ds_info = DatasetInfo(..., schema=schema) The schema structure includes: - **file_level**: Dictionary of metadata fields that should appear in the ``.infoset`` file. Each field specifies a data type and whether it's required. - **categorical**: Dictionary of categorical metadata fields for the ``.attribset`` file. Fields can specify allowed values for validation. - **routing_rules**: Patterns and defaults that handle fields not explicitly defined in ``file_level`` or ``categorical`` sections: - **file_level_patterns**: Wildcard patterns (using ``*``) that match field names for file-level routing - **categorical_patterns**: Wildcard patterns for categorical routing - **default_numeric**: Where to route numeric fields not matched by other rules - **default_categorical**: Where to route string/boolean fields not matched by other rules Routing Logic ~~~~~~~~~~~~~ DatasetInfo applies a hierarchical decision process to determine the destination for each metadata field: 1. **Explicit Check**: Is the field name explicitly defined in ``file_level`` or ``categorical`` sections? 2. **Pattern Match**: Does the field name match any wildcard patterns in ``routing_rules``? 3. **Type-Based Default**: Based on the field's data type, apply the default routing (numeric → file_level, string/boolean → categorical) **Routing Examples:** Consider these metadata fields being processed: .. code-block:: python # Field: "size_cadfile" with value 1024000 # 1. Explicit check: Found in file_level definitions ✓ # 2. Routing decision: → .infoset file # Field: "file_label" with value 3 # 1. Explicit check: Found in categorical definitions ✓ # 2. Routing decision: → .attribset file # Field: "custom_metric" with value 42.5 (not in schema) # 1. Explicit check: Not found # 2. Pattern match: No match to any routing patterns # 3. Type-based default: Numeric value → apply default_numeric rule # 4. Routing decision: → .infoset file This hierarchical approach provides flexibility: you can explicitly control routing for critical fields while relying on patterns and defaults for auxiliary metadata. Understanding Output Files =========================== The DatasetMerger and DatasetInfo produce three primary output files that together comprise a complete, queryable dataset. Understanding the structure and purpose of each file enables effective use of the merged data in analysis and machine learning workflows. .dataset Files -------------- The ``.dataset`` file contains all numerical array data from the merged CAD files, organized into logical groups and compressed for efficient storage. **Format:** Compressed Zarr (ZipStore) with ``.dataset`` extension **Contents:** All numerical array data organized by groups (faces, edges, graph, etc.) **Structure:** .. 
code-block:: text my_flow.dataset (ZipStore) ├── faces/ │ ├── face_indices [N_faces] │ ├── face_areas [N_faces] │ ├── face_types [N_faces] │ ├── face_uv_grids [N_faces × U × V × 7] │ └── file_id [N_faces] ← Provenance tracking ├── edges/ │ ├── edge_indices [N_edges] │ ├── edge_lengths [N_edges] │ ├── edge_types [N_edges] │ └── file_id [N_edges] ← Provenance tracking ├── graph/ │ ├── edges_source [N_graph_edges] │ ├── edges_destination [N_graph_edges] │ ├── num_nodes [N_files] │ └── file_id [N_graph_edges] ← Provenance tracking └── faceface/ ├── a3_distance [N_face_pairs × bins] ├── d2_distance [N_face_pairs × bins] ├── extended_adjacency [N_face_pairs] └── file_id [N_face_pairs] ← Provenance tracking Each group contains arrays concatenated from all source files. The dimensions (``N_faces``, ``N_edges``, etc.) represent the total count across all merged files. The ``file_id`` arrays enable tracing each element back to its source file. **Key Features:** - **Compression**: ZipStore format provides transparent compression, reducing disk space while maintaining fast access - **Chunking**: Large arrays are divided into chunks for efficient partial loading - **Lazy Loading**: Data is loaded on-demand rather than reading the entire file into memory - **Group Organization**: Logical separation enables loading only relevant data for specific analyses **Access Pattern:** Use the DatasetExplorer or DatasetLoader classes to query and analyze data from ``.dataset`` files. These classes provide high-level interfaces for filtering, statistical analysis, and ML dataset preparation without requiring direct Zarr manipulation. .infoset Files -------------- The ``.infoset`` file contains file-level metadata organized in a tabular structure with one row per source CAD file. This format enables efficient filtering and querying based on file characteristics. **Format:** Parquet (columnar storage format) **Contents:** File-level metadata (one row per file) **Example Schema:** .. code-block:: text ┌────┬──────────────┬─────────────┬───────────────┬──────────────────┬────────┐ │ id │ name │ description │ size_cadfile │ processing_time │ subset │ ├────┼──────────────┼─────────────┼───────────────┼──────────────────┼────────┤ │ 0 │ part_001 │ Bracket │ 1024000 │ 12.5 │ train │ │ 1 │ part_002 │ Housing │ 2048000 │ 18.3 │ train │ │ 2 │ part_003 │ Cover │ 512000 │ 8.1 │ test │ └────┴──────────────┴─────────────┴───────────────┴──────────────────┴────────┘ The ``id`` column corresponds to the file IDs used in ``file_id`` arrays within the ``.dataset`` file, enabling cross-referencing between metadata and array data. **Standard Columns:** - **id**: Integer file ID (matches ``file_id`` in dataset arrays) - **name**: File stem (filename without extension) - **description**: Human-readable description of the file or part - **size_cadfile**: Original CAD file size in bytes - **processing_time**: Encoding duration in seconds - **subset**: Dataset split assignment (train, validation, test) **Additional Columns** (schema-dependent): - **flow_name**: Name of the flow that processed this file - **stream_cache_png**: Path to PNG visualization - **stream_cache_3d**: Path to 3D model (SCS/STL format) - **Custom fields**: Any additional metadata defined in your schema .attribset Files ---------------- The ``.attribset`` file contains categorical metadata and label descriptions in a normalized table structure. This file provides human-readable names and descriptions for categorical values used throughout the dataset. 
**Format:** Parquet (columnar storage format) **Contents:** Categorical metadata and label descriptions **Example Schema:** .. code-block:: text ┌──────────────┬────┬──────────────┬─────────────────────┐ │ table_name │ id │ name │ description │ ├──────────────┼────┼──────────────┼─────────────────────┤ │ file_label │ 0 │ Simple │ Basic geometry │ │ file_label │ 1 │ Medium │ Moderate complexity │ │ file_label │ 2 │ Complex │ High complexity │ │ face_types │ 0 │ Plane │ Planar surface │ │ face_types │ 1 │ Cylinder │ Cylindrical surface │ │ face_types │ 2 │ Sphere │ Spherical surface │ │ edge_types │ 0 │ Line │ Linear edge │ │ edge_types │ 1 │ Arc │ Circular arc │ └──────────────┴────┴──────────────┴─────────────────────┘ The table uses a normalized structure where each row defines one categorical value. The ``table_name`` column groups related definitions together. **Column Definitions:** - **table_name**: Name of the categorical attribute (e.g., "file_label", "face_types") - **id**: Integer code used in the dataset arrays - **name**: Short, human-readable name for this category - **description**: Detailed description of what this category represents **Purpose:** The ``.attribset`` file enables interpretation of integer codes in the dataset. For example, when you see ``face_types = 1`` in the dataset arrays, you can look up the ``.attribset`` file to find that code 1 means "Cylinder" with description "Cylindrical surface." **Usage:** DatasetExplorer automatically loads ``.attribset`` data and uses it to provide human-readable labels in visualizations and summaries. When creating distribution plots or categorical analyses, the descriptions from this file appear in legends and axis labels. Integration with DatasetExplorer and DatasetLoader =================================================== The merged outputs (``.dataset``, ``.infoset``, ``.attribset``) serve as input for downstream analysis and machine learning workflows. The **DatasetExplorer** class provides tools for querying and analyzing merged datasets, while **DatasetLoader** prepares data for ML training with stratified splitting and framework integration. Quick Start ----------- The following example demonstrates basic usage of both DatasetExplorer and DatasetLoader with merged datasets: .. code-block:: python from hoops_ai.dataset import DatasetExplorer, DatasetLoader # Analysis with DatasetExplorer explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow") explorer.print_table_of_contents() # Query and analyze face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20) file_codes = explorer.get_file_list( group="faces", where=lambda ds: ds['complexity_level'] >= 4 ) explorer.close() # ML Preparation with DatasetLoader loader = DatasetLoader( merged_store_path="cad_pipeline.dataset", parquet_file_path="cad_pipeline.infoset" ) # Stratified split loader.split(key="complexity_level", train=0.7, validation=0.15, test=0.15) # Get datasets train_dataset = loader.get_dataset("train") val_dataset = loader.get_dataset("validation") # PyTorch integration train_torch = train_dataset.to_torch() This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader. 
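Because ``.infoset`` and ``.attribset`` are ordinary Parquet files, they can also be inspected directly with pandas when you want a quick look outside the helper classes. A minimal sketch, with column names taken from the examples in the previous section (adjust the paths to your own flow output):

.. code-block:: python

   import pandas as pd

   # File-level metadata: one row per source CAD file
   file_info = pd.read_parquet("cad_pipeline.infoset")
   print(file_info[["id", "name", "size_cadfile", "subset"]].head())

   # Categorical definitions: normalized (table_name, id, name, description) rows
   attributes = pd.read_parquet("cad_pipeline.attribset")
   labels = attributes[attributes["table_name"] == "file_label"]

   # Decode an integer label code into its human-readable name
   code_to_name = dict(zip(labels["id"], labels["name"]))
   print(code_to_name.get(2))  # e.g. "Complex" in the example table above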
**For Comprehensive Documentation:** For detailed information on using these tools, see: - :doc:`explore-dataset` - Complete guide covering query operations, distribution analysis, metadata filtering, stratified splitting for ML training, PyTorch integration, and complete workflow examples with performance optimization techniques The exploration and loading documentation provides in-depth coverage of all available operations, including advanced filtering, custom item loaders, and framework-specific adapters. Custom Merging Tasks --------------------- While the Flow module's automatic merging (via ``auto_dataset_export=True``) handles most use cases, you may need custom merging behavior for specialized requirements. Creating a custom merging task using the ``@flowtask.custom`` decorator provides complete control over merge parameters and processing logic. Advanced: Custom Merging Task ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following example demonstrates a custom merging task that specifies non-standard chunk sizes, batch configurations, and cleanup behavior: .. code-block:: python from hoops_ai.flowmanager import flowtask from typing import List @flowtask.custom( name="custom_dataset_merge", inputs=["cad_files_encoded"], outputs=["merged_dataset_path"], parallel_execution=False ) def custom_merge_task(encoded_files: List[str]) -> str: """Custom merging logic with specific parameters""" from hoops_ai.storage.datasetstorage import DatasetMerger, DatasetInfo import pathlib # Find all .data and .json files data_files = [f"{f}.data" for f in encoded_files] json_files = [f"{f}.json" for f in encoded_files] output_dir = pathlib.Path("./custom_merged") output_dir.mkdir(exist_ok=True) # Process metadata ds_info = DatasetInfo( info_files=json_files, merged_store_path=str(output_dir / "custom.infoset"), attribute_file_path=str(output_dir / "custom.attribset"), schema=cad_schema ) ds_info.parse_info_files() file_id_codes = ds_info.build_code_mappings(data_files) ds_info.store_info_to_parquet() # Merge data arrays with custom configuration merger = DatasetMerger( zip_files=data_files, merged_store_path=str(output_dir / "custom.dataset"), file_id_codes=file_id_codes, dask_client_params={'n_workers': 16, 'threads_per_worker': 2}, delete_source_files=False ) merger.set_schema(cad_schema) merger.merge_data( face_chunk=1_000_000, # Custom chunk sizes edge_chunk=1_000_000, batch_size=500 # Process in larger batches ) return str(output_dir / "custom.dataset") # Use in flow with custom merging import hoops_ai custom_flow = hoops_ai.create_flow( name="custom_pipeline", tasks=[gather_cad_files, encode_cad_geometry, custom_merge_task], auto_dataset_export=False # Disable automatic merging ) This approach is useful when you need to: - Use non-standard chunk sizes optimized for your specific data characteristics - Preserve source files (``delete_source_files=False``) for archival or debugging - Apply custom Dask configurations for specific hardware environments - Integrate additional processing steps between encoding and merging The custom task has access to the full DatasetMerger and DatasetInfo APIs, enabling any merging configuration supported by these classes. Performance Considerations ========================== Optimizing merge performance requires understanding memory constraints, parallelization strategies, and compression tradeoffs. This section provides guidelines for tuning DatasetMerger for different dataset sizes and hardware configurations. 
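As a rough starting point for the guidelines below, you can estimate a workable ``batch_size`` from the memory you are willing to dedicate to the merge and the typical on-disk size of one ``.data`` file. The sketch below is a heuristic only; the ``expansion_factor`` accounting for decompression and intermediate copies is an assumption, not a measured constant.

.. code-block:: python

   import pathlib

   def estimate_batch_size(data_files, memory_budget_gb, expansion_factor=3.0,
                           min_batch=50, max_batch=500):
       """Keep (batch_size x average decoded file size) under the memory budget."""
       sizes = [pathlib.Path(f).stat().st_size for f in data_files]
       avg_on_disk = sum(sizes) / max(len(sizes), 1)
       # Assumed in-memory footprint per file after decompression and concatenation
       avg_in_memory = avg_on_disk * expansion_factor
       budget_bytes = memory_budget_gb * 1024**3
       batch = int(budget_bytes // max(avg_in_memory, 1))
       return max(min_batch, min(batch, max_batch))

   # Example: cap the merge at roughly 16 GB of working memory
   # batch = estimate_batch_size(data_files, memory_budget_gb=16)
   # merger.merge_data(face_chunk=500_000, edge_chunk=500_000, batch_size=batch)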
Memory Management ----------------- Large datasets can exhaust available memory during merging. Batch merging provides a solution by processing files in manageable groups and progressively building the final dataset. Batch Merging for Large Datasets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For datasets with thousands of files or very large individual files, batch merging prevents out-of-memory errors: .. code-block:: python # For 10,000+ files, use batch merging merger.merge_data( batch_size=500, # Process 500 files at a time face_chunk=500_000, edge_chunk=500_000 ) Batch merging offers several benefits: **Benefits:** - Prevents out-of-memory errors - Enables progress tracking - Allows incremental cleanup of source files The optimal ``batch_size`` depends on your system's available memory and the size of individual files. Monitor memory usage during the first batch and adjust accordingly. Parallel Processing ------------------- DatasetMerger uses Dask for parallel processing, distributing merge operations across multiple workers. Proper Dask configuration significantly impacts merge performance. Dask Configuration ~~~~~~~~~~~~~~~~~~ The ``dask_client_params`` parameter controls parallel execution behavior: .. code-block:: python dask_client_params = { 'n_workers': 8, # Number of parallel workers 'threads_per_worker': 4, # Threads per worker 'processes': True, # Use separate processes (not threads) 'memory_limit': '8GB', # Per-worker memory limit 'dashboard_address': ':8787' # Monitoring dashboard URL } merger = DatasetMerger( zip_files=files, merged_store_path="output.dataset", dask_client_params=dask_client_params ) Each parameter affects performance differently: - **n_workers**: Total number of parallel processes. Set based on CPU cores available. More workers enable greater parallelism but increase memory requirements. - **threads_per_worker**: Number of threads within each worker process. Increase for CPU-intensive operations, decrease for I/O-intensive tasks. - **processes**: When ``True``, workers run as separate processes with isolated memory. This is recommended for memory-intensive operations to prevent interference. - **memory_limit**: Maximum memory per worker. Set conservatively to prevent system-wide memory exhaustion. If workers exceed this limit, Dask spills data to disk. - **dashboard_address**: Enables the Dask dashboard for real-time monitoring of task execution, memory usage, and worker status. Tuning Guidelines ~~~~~~~~~~~~~~~~~ Different workload characteristics require different Dask configurations: **Many Small Files** (I/O bound): - More workers, fewer threads per worker - Example: ``n_workers=16, threads_per_worker=2`` - Rationale: File I/O is the bottleneck; maximize concurrent file reads **Few Large Files** (CPU bound): - Fewer workers, more threads per worker - Example: ``n_workers=4, threads_per_worker=8`` - Rationale: Computation dominates; leverage multi-threading for array operations **Memory Constrained Systems**: - Reduce ``memory_limit`` per worker - Decrease ``batch_size`` to process fewer files at once - Example: ``n_workers=4, memory_limit='4GB', batch_size=100`` - Rationale: Prevent memory swapping which severely degrades performance **Optimal Configuration Discovery**: Start with conservative settings (few workers, moderate memory limits) and monitor the Dask dashboard during a test merge. Increase parallelism if CPU utilization is low, or decrease it if you observe memory pressure or disk swapping. 
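The tuning rules above can be folded into a small helper that derives ``dask_client_params`` from the machine and the workload shape. This is only a convenience sketch of the heuristics described in this section, not part of the DatasetMerger API:

.. code-block:: python

   import os

   def suggest_dask_params(io_bound: bool, total_memory_gb: int) -> dict:
       # Rules of thumb from this section: I/O-bound -> many workers, few threads;
       # CPU-bound -> few workers, more threads. Memory is split across workers.
       cores = os.cpu_count() or 4
       if io_bound:                       # many small files
           n_workers = cores
           threads_per_worker = 2
       else:                              # few large files
           n_workers = max(cores // 4, 1)
           threads_per_worker = 8
       return {
           "n_workers": n_workers,
           "threads_per_worker": threads_per_worker,
           "processes": True,
           "memory_limit": f"{max(total_memory_gb // n_workers, 2)}GB",
           "dashboard_address": ":8787",
       }

   # Example: 16-core machine with 64 GB RAM, merging many small .data files
   # merger = DatasetMerger(..., dask_client_params=suggest_dask_params(True, 64))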
Compression ----------- The DatasetMerger uses Zarr's compression capabilities to reduce disk space requirements for merged datasets. Understanding the compression settings helps predict storage needs and access performance. Zarr Compression Settings ~~~~~~~~~~~~~~~~~~~~~~~~~~ DatasetMerger applies the following compression configuration automatically: - **Codec**: Zstd level 12 (high compression ratio with reasonable decompression speed) - **Chunks**: Automatic sizing based on array dimensions (typically ~1M elements per chunk) - **Filters**: Delta encoding for integer arrays (stores differences rather than absolute values, improving compression) These settings balance compression ratio against decompression speed for typical CAD data patterns. Typical Compression Ratios ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compression effectiveness varies by data type: - **Raw array data** (face areas, edge lengths): 10-20x compression - Floating-point arrays with spatial locality compress very well - **Face-face histograms** (distance distributions): 5-10x compression - Histogram bins often contain many zeros, enabling good compression - **Graph structures** (edge indices, adjacency): 3-5x compression - Integer sequences with predictable patterns compress moderately well These ratios mean a dataset that would occupy 100 GB uncompressed typically requires 5-10 GB as a compressed ``.dataset`` file. Summary ======= The **DatasetMerger** serves as the critical bridge between individual encoded CAD files and ML-ready unified datasets, providing the consolidation layer that enables efficient analysis and training. Key Capabilities ---------------- - **Automatic Integration**: Seamlessly invoked by the Flow module with ``auto_dataset_export=True``, requiring no manual setup for standard workflows - **Schema-Driven Organization**: Uses SchemaBuilder definitions for predictable, validated merging with explicit control over data structure - **Parallel Processing**: Leverages Dask's distributed computation engine for efficient large-scale data consolidation across multiple CPU cores - **Provenance Tracking**: Maintains ``file_id`` arrays in every group, enabling traceability from merged data back to source files - **Multiple Output Formats**: Produces three complementary files serving different purposes: - ``.dataset``: Numerical array data in compressed Zarr format - ``.infoset``: File-level metadata in queryable Parquet format - ``.attribset``: Categorical metadata and label descriptions in Parquet format Downstream Integration ----------------------- The merged output files serve as inputs for analysis and machine learning tools: - **DatasetExplorer**: Query, filter, and analyze merged datasets with support for statistical analysis, distribution creation, and metadata-based filtering - **DatasetLoader**: Prepare stratified train/validation/test splits for ML training with support for PyTorch, TensorFlow, and custom frameworks Complete System Architecture ----------------------------- The following diagram illustrates how DatasetMerger fits within the complete HOOPS AI data pipeline: .. 
code-block:: text ┌─────────────────────────────────────────────────────────────────────────────┐ │ HOOPS AI Data Pipeline │ └─────────────────────────────────────────────────────────────────────────────┘ STEP 1: SCHEMA DEFINITION (Optional but Recommended) ┌──────────────────────────────────────────────────────────────┐ │ SchemaBuilder │ │ • Define groups (faces, edges, graph) │ │ • Define arrays and dimensions │ │ • Set metadata routing rules │ │ • Build schema dictionary │ └────────────────────────┬─────────────────────────────────────┘ │ schema.json ↓ STEP 2: ENCODING (Per-File Parallel Processing) ┌──────────────────────────────────────────────────────────────┐ │ Flow → CADEncodingTask (via ParallelTask) │ │ │ │ CAD File 1 → Encoder → DataStorage → part_001.data ────┐ │ │ CAD File 2 → Encoder → DataStorage → part_002.data ────┤ │ │ ... │ │ │ CAD File N → Encoder → DataStorage → part_N.data ──────┤ │ │ │ │ │ Each .data file (Zarr format) contains: │ │ │ • faces/face_areas, face_types, face_uv_grids │ │ │ • edges/edge_lengths, edge_types │ │ │ • graph/edges_source, edges_destination │ │ │ │ │ │ Metadata files (JSON): │ │ │ part_001.json, part_002.json, ..., part_N.json │ │ └──────────────────────────────────────────────────────────┼───┘ │ ┌─────────────────────────────────┘ │ All .data and .json files ↓ STEP 3: AUTOMATIC MERGING (AutoDatasetExportTask) ┌──────────────────────────────────────────────────────────────┐ │ DatasetInfo │ │ • Load all .json metadata files │ │ • Route metadata using schema rules │ │ • Create file_id mappings │ │ • Output: {flow_name}.infoset (file-level metadata) │ │ {flow_name}.attribset (categorical metadata) │ └────────────────────────┬─────────────────────────────────────┘ │ ┌────────────────────────┴─────────────────────────────────────┐ │ DatasetMerger │ │ • Discover groups from schemas or heuristics │ │ • Load all .data files as xarray Datasets │ │ • Concatenate arrays by group (faces, edges, graph) │ │ • Add file_id provenance tracking │ │ • Handle special processing (matrix flattening) │ │ • Output: {flow_name}.dataset (compressed Zarr) │ └────────────────────────┬─────────────────────────────────────┘ │ │ Three output files: │ • {flow_name}.dataset │ • {flow_name}.infoset │ • {flow_name}.attribset ↓ ┌─────────────────────────────────────────────────────────────┐ │ OUTPUT FILES │ │ │ │ ┌────────────────────────────────────────────┐ │ │ │ {flow_name}.dataset (Zarr/ZipStore) │ │ │ │ ───────────────────────────────────── │ │ │ │ • Numerical array data organized by groups │ │ │ │ • faces/, edges/, graph/, faceface/ │ │ │ │ • Each group has file_id for provenance │ │ │ │ • Compressed for efficient storage │ │ │ └────────────────────────────────────────────┘ │ │ │ │ ┌────────────────────────────────────────────┐ │ │ │ {flow_name}.infoset (Parquet) │ │ │ │ ───────────────────────────────── │ │ │ │ • File-level metadata (one row per file) │ │ │ │ • Columns: id, name, description, │ │ │ │ size_cadfile, processing_time, etc. │ │ │ │ • Queryable with pandas │ │ │ └────────────────────────────────────────────┘ │ │ │ │ ┌────────────────────────────────────────────┐ │ │ │ {flow_name}.attribset (Parquet) │ │ │ │ ───────────────────────────────────── │ │ │ │ • Categorical metadata & descriptions │ │ │ │ • Label mappings (id → name → description) │ │ │ │ • Face types, edge types, etc. 
│ │ │ └────────────────────────────────────────────┘ │ └─────────────────────┬───────────────────────────────────────┘ │ │ Consumed by analysis & ML tools ↓ STEP 4: ANALYSIS & ML TRAINING ┌──────────────────────────────────────────────────────────────┐ │ DatasetExplorer │ │ • Query merged datasets (get_array_data) │ │ • Statistical analysis (compute_statistics) │ │ • Distribution creation (create_distribution) │ │ • Metadata filtering (filter_files_by_metadata) │ │ • Visualization support │ └──────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────┐ │ DatasetLoader │ │ • Load merged datasets for ML training │ │ • Stratified train/val/test splitting │ │ • PyTorch/TensorFlow adapter support │ │ • Custom item loader functions │ │ • Batch loading for training loops │ └──────────────────────────────────────────────────────────────┘ ↓ ┌──────────────────────────────────────────────────────────────┐ │ ML Training Pipeline │ │ • Load train/val/test datasets │ │ • Create DataLoaders with batching │ │ • Train neural networks (GNNs, CNNs, etc.) │ │ • Model evaluation and inference │ └──────────────────────────────────────────────────────────────┘ Key Integration Points ---------------------- The DatasetMerger participates in several critical integration points within the HOOPS AI ecosystem: 1. **SchemaBuilder → DataStorage**: Schema defines how individual files are organized during encoding 2. **DataStorage → DatasetMerger**: Individual ``.data`` files are merged using the schema structure to determine grouping and concatenation logic 3. **SchemaBuilder → DatasetInfo**: Schema routing rules determine which metadata fields go to ``.infoset`` versus ``.attribset`` 4. **DatasetMerger Output → DatasetExplorer**: The merged ``.dataset`` file enables efficient exploration, querying, and statistical analysis 5. **DatasetMerger Output → DatasetLoader**: The merged ``.dataset`` and ``.infoset`` files provide the foundation for ML training data preparation This integrated system enables seamless progression from raw CAD files to production ML models with minimal manual intervention. The schema-driven approach ensures consistency across all pipeline stages, while automatic merging eliminates the need for custom data consolidation code in most workflows.