Dataset Exploration and Mining

Introduction

The DatasetExplorer and DatasetLoader are complementary tools that work with merged datasets (.dataset, .infoset, .attribset files) produced by the DatasetMerger during Flow execution. Together, they provide comprehensive capabilities for dataset analysis, querying, and ML training preparation.

These classes form the analysis and ML preparation layer of the HOOPS AI pipeline, consuming the unified datasets produced by the automatic merging process described in Data Merging in HOOPS AI.

Key Purposes

The dataset exploration and ML preparation module consists of two core components:

DatasetExplorer

Query, analyze, and visualize merged CAD datasets. This tool provides read-only exploration operations and statistical analysis without modifying the underlying data.

DatasetLoader

Prepare datasets for ML training with stratified train/validation/test splitting. The DatasetLoader manages dataset splitting using stratification techniques to ensure each subset maintains the same proportion of categories as the original dataset, preventing evaluation bias.

Pipeline Position

The typical workflow follows this progression:

DatasetMerger Output → DatasetExplorer (Analysis) → DatasetLoader (ML Prep) → Training
(.dataset/.infoset/.attribset)

This integration ensures a seamless progression from data consolidation through exploratory analysis to ML model training.

Architecture Overview

Position in Pipeline

DatasetExplorer and DatasetLoader operate in the Analysis & ML Preparation Phase of the HOOPS AI pipeline:

┌──────────────────────────────────────────────────────────────────────┐
│                    HOOPS AI Complete Pipeline                        │
└──────────────────────────────────────────────────────────────────────┘

1. ENCODING PHASE (Per-File)
   ┌─────────────────────────────────────────────────────┐
   │  @flowtask.transform                                │
   │  CAD File → Encoder → Storage → .data file          │
   └─────────────────────────────────────────────────────┘
                            ↓
2. MERGING PHASE (Automatic)
   ┌─────────────────────────────────────────────────────┐
   │  AutoDatasetExportTask (auto_dataset_export=True)   │
   │  Multiple .data → DatasetMerger → .dataset          │
   │  Multiple .json → DatasetInfo → .infoset/.attribset │
   └─────────────────────────────────────────────────────┘
                            ↓
3. ANALYSIS PHASE (DatasetExplorer) ← YOU ARE HERE
   ┌─────────────────────────────────────────────────────┐
   │  .dataset + .infoset + .attribset                   │
   │      ↓                                              │
   │  DatasetExplorer                                    │
   │   - Query arrays by group/key                       │
   │   - Analyze distributions                           │
   │   - Filter by metadata                              │
   │   - Statistical summaries                           │
   │   - Cross-group queries                             │
   └─────────────────────────────────────────────────────┘
                            ↓
4. ML PREPARATION PHASE (DatasetLoader)
   ┌─────────────────────────────────────────────────────┐
   │  DatasetLoader                                      │
   │   - Stratified train/val/test split                 │
   │   - Multi-label support                             │
   │   - Framework-agnostic CADDataset                   │
   │   - PyTorch adapter (.to_torch())                   │
   │   - Custom item loaders                             │
   └─────────────────────────────────────────────────────┘
                            ↓
5. ML TRAINING PHASE
   ┌─────────────────────────────────────────────────────┐
   │  PyTorch DataLoader → Training Loop → Model         │
   └─────────────────────────────────────────────────────┘

Input Files

Both DatasetExplorer and DatasetLoader consume the output files generated by the DatasetMerger during Flow execution.

Required Files

1. .dataset file (Compressed Zarr)

The .dataset file contains all merged array data organized by groups. This file uses the ZipStore format with Zstd compression for efficient storage and access via xarray and Dask for parallel operations.

  • Structure: {flow_name}.dataset

  • Format: Zarr ZipStore with compression

  • Access: xarray with Dask parallel processing
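
If you need to inspect a .dataset file without the DatasetExplorer, it can usually be opened directly; the following is a minimal sketch assuming the zarr 2.x ZipStore API and a standard one-Zarr-group-per-data-group layout (the path and group name are illustrative):

import xarray as xr
import zarr

# Open the compressed store directly (read-only); assumes zarr 2.x ZipStore
store = zarr.ZipStore("flow_output/flows/cad_pipeline/cad_pipeline.dataset", mode="r")

# Lazily open one group; "faces" is assumed to exist in this dataset
faces = xr.open_zarr(store, group="faces")
print(list(faces.data_vars))

store.close()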

2. .infoset file (Parquet)

The .infoset file contains file-level metadata with one row per CAD file. This columnar storage format enables efficient querying using pandas DataFrame operations.

  • Structure: Columnar storage with id, name, description, and custom fields

  • Format: Parquet

  • Access: pandas DataFrame operations

3. .attribset file (Parquet) - Optional

The .attribset file contains categorical metadata and label descriptions, mapping numeric codes to human-readable names and descriptions.

  • Structure: table_name, id, name, description columns

  • Format: Parquet

  • Access: pandas DataFrame operations
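
Because the .infoset and .attribset files are plain Parquet, they can also be read directly with pandas when the explorer is not needed; a short sketch (file paths are illustrative):

import pandas as pd

# File-level metadata: one row per CAD file
file_info = pd.read_parquet("flow_output/flows/cad_pipeline/cad_pipeline.infoset")
print(file_info.columns.tolist())

# Categorical metadata: code-to-name mappings, keyed by table_name
attribs = pd.read_parquet("flow_output/flows/cad_pipeline/cad_pipeline.attribset")
print(attribs["table_name"].unique())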

File Location

Files are generated by Flow execution in the following directory structure:

flow_output/flows/{flow_name}/
├── {flow_name}.dataset      ← Merged data arrays
├── {flow_name}.infoset      ← File-level metadata
├── {flow_name}.attribset    ← Categorical metadata
└── {flow_name}.flow         ← Flow specification (JSON)
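
Given this layout, the individual file paths can be derived from the flow name and output directory; a small sketch (the flow name cad_pipeline is illustrative):

import pathlib

flow_name = "cad_pipeline"  # illustrative flow name
flow_dir = pathlib.Path("flow_output/flows") / flow_name

dataset_path = flow_dir / f"{flow_name}.dataset"      # merged data arrays
infoset_path = flow_dir / f"{flow_name}.infoset"      # file-level metadata
attribset_path = flow_dir / f"{flow_name}.attribset"  # categorical metadata
flow_file = flow_dir / f"{flow_name}.flow"            # flow specification (JSON)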

Relationship to DatasetMerger

Understanding the relationship between DatasetMerger and the exploration/loading tools clarifies their distinct roles:

DatasetMerger (automatic during Flow)
  • Input: Individual .data and .json files (per CAD file)

  • Process: Concatenate arrays, route metadata, add provenance tracking

  • Output: Unified .dataset, .infoset, .attribset files

DatasetExplorer (user-driven analysis)
  • Input: Output files from DatasetMerger

  • Process: Query, filter, analyze, visualize

  • Output: Statistics, distributions, filtered file lists

DatasetLoader (ML preparation)
  • Input: Output files from DatasetMerger

  • Process: Stratified splitting, dataset creation

  • Output: Train/val/test CADDataset objects

Key Distinction

The three components serve different operational roles:

  • DatasetMerger: Write-heavy operation (consolidate many files into one)

  • DatasetExplorer: Read-heavy operation (query and analyze unified data)

  • DatasetLoader: Read + Index operation (split and serve data for training)

This separation of concerns enables efficient workflows where data consolidation occurs once, followed by iterative analysis and multiple ML training experiments without re-merging.

DatasetExplorer

The DatasetExplorer class provides methods for discovering, querying, and analyzing merged datasets. This class focuses on read-only exploration operations and statistical analysis without modifying the underlying data.

Initialization and Setup

The DatasetExplorer supports multiple initialization methods to accommodate different workflow preferences.

Initialization Methods

Method 1: Using Flow Output JSON File (Recommended)

The most convenient approach uses the .flow JSON file generated by Flow execution. This file contains all necessary paths and the explorer automatically resolves them:

from hoops_ai.dataset import DatasetExplorer

# Initialize using flow file
explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow")

The flow file contains keys such as flow_data (pointing to the Zarr dataset), flow_info (pointing to the Parquet metadata), and flow_attributes (pointing to attribute metadata). The explorer automatically resolves relative paths based on the flow file location.

Method 2: Explicit File Paths

For scenarios where you need direct control over file paths or when working outside the Flow framework:

# Initialize with explicit paths
explorer = DatasetExplorer(
    merged_store_path="path/to/flow_name.dataset",
    parquet_file_path="path/to/flow_name.infoset",
    parquet_file_attribs="path/to/flow_name.attribset"  # Optional
)

This approach is useful when files are in non-standard locations or when integrating with external data processing pipelines.

Method 3: With Custom Dask Configuration

For large datasets or specific performance requirements, you can customize the Dask parallel processing configuration:

# Initialize with custom Dask settings
explorer = DatasetExplorer(
    flow_output_file="path/to/flow_name.flow",
    dask_client_params={
        'n_workers': 8,
        'threads_per_worker': 4,
        'memory_limit': '8GB'
    }
)

Dask Configuration

The DatasetExplorer uses Dask for parallel processing of large datasets. Dask is a parallel computing library that processes data in chunks across multiple CPU cores, enabling work with datasets larger than available RAM.

By default, DatasetExplorer creates a local Dask cluster with sensible defaults. You can customize the Dask configuration by providing parameters:

Parameters:
  • flow_output_file (str, optional): Path to .flow JSON containing all file paths

  • merged_store_path (str, optional): Path to .dataset file

  • parquet_file_path (str, optional): Path to .infoset file

  • parquet_file_attribs (str, optional): Path to .attribset file

  • dask_client_params (dict, optional): Dask configuration for parallel operations

For very large datasets, configuring Dask with more workers and increased memory limits improves performance. However, for smaller datasets or systems with limited resources, the default configuration is sufficient.

Discovering Dataset Structure

Before querying specific data, understanding the available groups and arrays in the dataset is essential.

Available Groups

Groups represent logical collections of related data. To discover available groups:

# Get list of available groups
available_groups = explorer.available_groups()
print(f"Groups: {available_groups}")
# Output: {'faces', 'edges', 'graph', 'machining'}

Each group corresponds to a category of CAD data (faces, edges, graph structures, etc.) as defined in the schema used during encoding.

Available Arrays within Groups

Each group contains multiple arrays storing different attributes. To discover arrays within a specific group:

# Get arrays in the faces group
available_arrays = explorer.available_arrays("faces")
print(f"Face arrays: {available_arrays}")
# Output: {'face_indices', 'face_areas', 'face_types', 'face_uv_grids', 'file_id_code_faces'}

The array names reflect the data stored: geometric properties (areas, lengths), categorical types, and provenance tracking (file_id_code_* arrays).

Retrieving Metadata Descriptions

The Parquet metadata file contains description tables that map numeric codes to human-readable names. To retrieve these descriptions:

# Get face type descriptions
face_types = explorer.get_descriptions(table_name="face_types")
print(face_types)
# Output:
#   id     name           description
# 0  0   Plane          Planar surface
# 1  1   Cylinder       Cylindrical surface
# 2  2   Cone           Conical surface
# 3  3   Sphere         Spherical surface

The get_descriptions method accepts several parameters:

  • table_name: The name of the metadata table (e.g., "face_types", "edge_types", "label")

  • key_id: Optional integer to filter results to a specific ID

  • use_wildchar: Optional boolean to enable wildcard matching in table names

To search for label-related tables using wildcards:

# Find label tables using wildcard
label_tables = explorer.get_descriptions(table_name="label", use_wildchar=True)
print(label_tables)
# Returns all tables with "label" in the name

Querying Data

The DatasetExplorer provides multiple methods for accessing data at different granularities: individual arrays, complete groups, or file-specific subsets.

Get Array Data

To retrieve a complete array for a specific group:

# Get complete array data for a group
face_areas = explorer.get_array_data(group_name="faces", array_name="face_areas")
# Returns: xr.DataArray with shape [N_total_faces]

# Access underlying NumPy array
face_areas_np = face_areas.values
print(f"Total faces: {len(face_areas_np)}")
print(f"Mean area: {face_areas_np.mean():.2f}")

The returned object is an xarray DataArray, which provides labeled multi-dimensional array functionality similar to pandas for higher-dimensional data. The .values attribute accesses the underlying NumPy array.

Get Group Data

To access all arrays within a group as a single dataset:

# Get entire dataset for a group
faces_ds = explorer.get_group_data("faces")
print(faces_ds)
# Output:
# <xarray.Dataset>
# Dimensions:        (face: 48530)
# Coordinates:
#   * face           (face) int64 0 1 2 3 ... 48527 48528 48529
# Data variables:
#     face_indices   (face) int32 ...
#     face_areas     (face) float32 ...
#     face_types     (face) int32 ...
#     file_id_code_faces (face) int32 ...

# Access multiple arrays
face_areas = faces_ds['face_areas']
face_types = faces_ds['face_types']

Each returned dataset is an xarray.Dataset object containing data variables (arrays) with their associated coordinates and dimensions. This provides a convenient way to work with related arrays together.

Get File-Specific Data

To retrieve data for a specific CAD file within the merged dataset:

# Get data for a specific file
file_id_code = 5
face_subset = explorer.file_dataset(file_id_code=file_id_code, group="faces")
print(f"File {file_id_code} has {len(face_subset.face)} faces")

# Access arrays for this file only
file_face_areas = face_subset['face_areas'].values
print(f"Face areas for file {file_id_code}: {file_face_areas}")

The provenance tracking (file_id_code_* arrays) enables efficient filtering to extract data belonging to a single file from the merged dataset.

Filter by Condition

To identify files matching specific criteria:

# Get files matching a boolean condition
def high_complexity_filter(ds):
    """Select faces with area greater than 100 (used here as a complexity proxy)"""
    return ds['face_areas'] > 100

file_codes = explorer.get_file_list(
    group="faces",
    where=high_complexity_filter
)
print(f"Found {len(file_codes)} files with large faces")

# Convert file codes to file names
file_names = [explorer.decode_file_id_code(code) for code in file_codes]

The where parameter accepts a callable (function or lambda) that receives an xarray.Dataset and returns an xarray.DataArray of boolean values. The method returns an array of file ID codes where the condition is True.

Distribution Analysis

Computing distributions and histograms helps understand data balance and inform stratification strategies for ML training.

Creating Distributions

To compute the distribution of attributes across the entire dataset:

# Create distribution with automatic binning
distribution = explorer.create_distribution(
    key="face_areas",
    group="faces",
    bins=20
)

# Access distribution components
print(f"Bin edges: {distribution['bin_edges']}")
print(f"Histogram counts: {distribution['hist']}")
print(f"Files per bin: {distribution['file_ids_in_bins']}")

# Example output:
# bin_edges: [0.5, 1.5, 2.5, ..., 20.5]
# hist: [145, 302, 567, ..., 89]
# file_ids_in_bins: [['part_001', 'part_003'], ['part_002', 'part_005'], ...]

When bins=None, the method automatically detects categorical data and creates one bin per unique category. For continuous numeric variables, specify the number of bins to create evenly spaced bins spanning the data range.

The returned dictionary contains:
  • bin_edges: Array of bin boundaries

  • hist: Count of items in each bin

  • file_ids_in_bins: Lists of file IDs whose items fall in each bin
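
As noted above, passing bins=None to create_distribution produces one bin per unique category for categorical keys; a short example using the documented calls (the key face_types comes from the earlier discovery step):

# Categorical key: bins=None creates one bin per unique face type
type_dist = explorer.create_distribution(key="face_types", group="faces", bins=None)

print(f"Number of categories: {len(type_dist['hist'])}")
print(f"Counts per category: {type_dist['hist']}")

# Map numeric codes to human-readable names via the .attribset descriptions
face_type_names = explorer.get_descriptions(table_name="face_types")
print(face_type_names[['id', 'name']])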

Visualizing Distributions

Distribution results can be visualized using standard plotting libraries:

import matplotlib.pyplot as plt
import numpy as np

dist = explorer.create_distribution(key="face_areas", group="faces", bins=30)

# Plot histogram
bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1])
plt.bar(bin_centers, dist['hist'], width=(dist['bin_edges'][1] - dist['bin_edges'][0]))
plt.xlabel('Face Area')
plt.ylabel('Count')
plt.title('Face Area Distribution')
plt.show()

This visualization helps identify class imbalance and guides decisions about data augmentation or weighted loss functions during training.
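
For example, if a categorical label turns out to be imbalanced, the bin counts can be turned into inverse-frequency class weights for a weighted loss; a minimal sketch assuming a PyTorch training setup (the key complexity_level is taken from the examples in this document):

import numpy as np
import torch

# Per-category counts from a categorical distribution (bins=None)
label_dist = explorer.create_distribution(key="complexity_level", group="faces", bins=None)
counts = np.asarray(label_dist['hist'], dtype=np.float64)

# Inverse-frequency weights, normalized so they average to 1 (guard against empty bins)
weights = counts.sum() / (len(counts) * np.maximum(counts, 1.0))
class_weights = torch.tensor(weights, dtype=torch.float32)

# Pass the weights to a weighted loss during training
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)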

Metadata Queries

The DatasetExplorer provides methods to access file-level metadata and categorical descriptions stored in the Parquet files.

File-Level Metadata

To retrieve metadata for all files in the dataset:

# Get metadata for all files
all_file_info = explorer.get_file_info_all()
print(all_file_info.head())
# Output:
#    id         name  size_cadfile  processing_time  complexity_level  subset
# 0   0   part_001      1024000             12.5                     3   train
# 1   1   part_002      2048000             18.3                     4   train
# 2   2   part_003       512000              8.1                     2    test

The returned pandas DataFrame contains complete metadata for every file, enabling bulk analysis and reporting.
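
Because this is a regular pandas DataFrame, standard operations support quick reports; for example (the column names follow the sample output above and may differ in your dataset):

# Summary statistics for numeric metadata columns
print(all_file_info[['size_cadfile', 'processing_time']].describe())

# File counts per complexity level
print(all_file_info.groupby('complexity_level')['id'].count())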

To retrieve metadata for a specific file:

# Get metadata for specific file
file_info = explorer.get_parquet_info_by_code(file_id_code=5)
print(file_info)

Categorical Metadata (Labels/Descriptions)

To access label descriptions from the .attribset file:

# Get label descriptions
complexity_labels = explorer.get_descriptions(table_name="complexity_level")
print(complexity_labels)
# Output:
#   id     name           description
# 0  1   Simple      Basic geometry
# 1  2   Medium      Moderate complexity
# 2  3   Complex     High complexity
# 3  4   Very Complex   Advanced features

# Get specific label description
label_3 = explorer.get_descriptions(table_name="complexity_level", key_id=3)
print(label_3['name'].values[0])  # Output: "Complex"

Stream Cache Paths (Visualizations)

To retrieve paths to visualization assets (PNG thumbnails and 3D stream cache files):

# Get paths to PNG and 3D stream cache files
stream_paths = explorer.get_stream_cache_paths()
print(stream_paths[['id', 'name', 'stream_cache_png', 'stream_cache_3d']])

# Get stream cache for specific file
file_stream = explorer.get_stream_cache_paths(file_id_code=10)
png_path = file_stream['stream_cache_png'].values[0]
scs_path = file_stream['stream_cache_3d'].values[0]
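
The returned PNG path can be displayed directly, for instance with matplotlib; a small sketch that assumes the thumbnail file exists on disk:

import matplotlib.pyplot as plt

# Display the thumbnail retrieved above
img = plt.imread(png_path)
plt.imshow(img)
plt.axis('off')
plt.title(str(file_stream['name'].values[0]))
plt.show()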

Advanced Features

The DatasetExplorer provides advanced analytical capabilities for multi-label analysis and stratification.

Membership Matrix

Membership matrices are critical for stratified dataset splitting. A membership matrix is a 2D array where each row represents a file and each column represents a category or bin. The cell value indicates membership: for binary matrices, a value of 1 means the file contains at least one item in that category; for count matrices, the value indicates how many items belong to that category.

# Create membership matrix for multi-label analysis
matrix, file_codes, categories = explorer.build_membership_matrix(
    group="faces",
    key="face_types",
    bins_or_categories=None,  # Auto-discover categories
    as_counts=False  # False: boolean membership, True: per-file counts
)

print(f"Matrix shape: {matrix.shape}")  # (N_files, N_categories)
print(f"File codes: {file_codes[:10]}")
print(f"Categories: {categories}")

# Use for stratification analysis
import pandas as pd
df = pd.DataFrame(matrix, columns=categories)
df['file_code'] = file_codes
print(df.head())

Count-based matrices (as_counts=True) provide more detailed information about the distribution of features within each file, which can be valuable for certain analysis tasks.
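
The matrix lends itself to quick multi-label summaries, for example how many files contain each category and how often categories co-occur; a sketch using NumPy on the boolean matrix returned above:

import numpy as np

m = np.asarray(matrix, dtype=bool)

# Number of files containing at least one item of each category
files_per_category = m.sum(axis=0)
for cat, n in zip(categories, files_per_category):
    print(f"Category {cat}: present in {n} of {len(file_codes)} files")

# Category co-occurrence: how often two categories appear in the same file
co_occurrence = m.astype(int).T @ m.astype(int)
print(co_occurrence)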

Resource Management

The DatasetExplorer creates internal resources (such as Dask clients) that should be cleaned up when no longer needed:

# Close resources when done
explorer.close(close_dask=True)

Always close the explorer to free memory and terminate Dask workers, especially when working with large datasets or in interactive environments.
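
In scripts and notebooks, a try/finally block guarantees cleanup even if a query raises partway through:

explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow")
try:
    groups = explorer.available_groups()
    # ... queries and analysis ...
finally:
    explorer.close(close_dask=True)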

DatasetLoader

The DatasetLoader class manages dataset splitting and provides framework-agnostic access to training, validation, and test subsets. This class builds upon the DatasetExplorer to handle stratified splitting and subset management.

Initialization

The DatasetLoader requires paths to the Zarr dataset and Parquet metadata file:

from hoops_ai.dataset import DatasetLoader

# Basic initialization
loader = DatasetLoader(
    merged_store_path="path/to/flow_name.dataset",
    parquet_file_path="path/to/flow_name.infoset"
)

You can optionally provide a custom item_loader_func that defines how to load individual items from files. If no loader function is provided, the loader returns raw file paths and metadata when items are accessed.

Custom Item Loader (Experimental)

For advanced use cases, you can define a custom item loader function:

def custom_loader(graph_file, label_file, data_id):
    """Custom function to load and process items"""
    import dgl
    import numpy as np

    # Load graph
    graph = dgl.load_graphs(graph_file)[0][0]

    # Load label
    label = np.load(label_file)

    # Return as dictionary
    return {
        'graph': graph,
        'label': label,
        'id': data_id,
        'num_nodes': graph.number_of_nodes(),
        'num_edges': graph.number_of_edges()
    }

loader = DatasetLoader(
    merged_store_path="path/to/flow_name.dataset",
    parquet_file_path="path/to/flow_name.infoset",
    item_loader_func=custom_loader
)

Parameters:
  • merged_store_path (str): Path to .dataset file

  • parquet_file_path (str): Path to .infoset file

  • item_loader_func (callable, optional): Custom function to load items
    • Signature: func(graph_file, label_file, data_id) -> item

    • If None, returns raw file paths

Stratified Splitting

The DatasetLoader’s split method performs stratified splitting of the dataset into training, validation, and test subsets.

Basic Stratified Split

To perform a stratified split by a categorical key:

# Perform stratified split by a categorical key
train_size, val_size, test_size = loader.split(
    key="complexity_level",  # Metadata key to stratify on
    group="faces",           # Group containing the key
    train=0.7,               # 70% training
    validation=0.15,         # 15% validation
    test=0.15,               # 15% testing
    random_state=42          # For reproducibility
)

print(f"Dataset split:")
print(f"  Train: {train_size} files")
print(f"  Validation: {val_size} files")
print(f"  Test: {test_size} files")

Stratification ensures that each subset (train, validation, test) maintains the same label distribution as the overall dataset. This is critical for training models that generalize well to unseen data.

Mathematical Formulation

For stratified splitting with key \(K\) having \(C\) categories, the split aims to preserve the distribution:

\[P(k_i | \text{train}) \approx P(k_i | \text{validation}) \approx P(k_i | \text{test}) \approx P(k_i)\]

where \(k_i \in K\) is a category and \(P(k_i)\) is its proportion in the full dataset.
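
You can sanity-check this property by comparing the stratification key's proportions across subsets; the following sketch assumes the subset assignment is available as a 'subset' column in the file-level metadata, as shown in the earlier metadata sample (the file path is illustrative):

import pandas as pd

# File-level metadata with the stratification key and a 'subset' column (assumed)
file_info = pd.read_parquet("path/to/flow_name.infoset")

per_subset = (
    file_info.groupby('subset')['complexity_level']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(per_subset)  # each row should roughly match the overall distribution
print(file_info['complexity_level'].value_counts(normalize=True))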

Multi-Label Stratification

For files with multiple labels (e.g., multiple face types per file), the loader uses MultilabelStratifiedShuffleSplit. This approach creates a membership matrix:

\[\mathbf{M} \in \{0, 1\}^{N \times C}\]

where:
  • \(N\) = number of files

  • \(C\) = number of categories

  • \(M_{ij} = 1\) if file \(i\) has category \(j\), else 0

The split preserves label co-occurrence patterns, ensuring that combinations of labels are proportionally represented in each subset.
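
The underlying mechanism can be illustrated with the iterative-stratification package directly; this is a conceptual sketch with a toy membership matrix, not the loader's internal code, and it assumes the package providing MultilabelStratifiedShuffleSplit is installed:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Toy binary membership matrix: one row per file, one column per category
M = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
])
X = np.arange(len(M)).reshape(-1, 1)  # placeholder "features" (file indices)

splitter = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(X, M))
print("train files:", train_idx, "test files:", test_idx)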

Dataset Access

After splitting, you can retrieve framework-agnostic dataset objects for each subset.

Retrieving Subsets

To get datasets for each subset:

# Get framework-agnostic datasets
train_dataset = loader.get_dataset("train")
val_dataset = loader.get_dataset("validation")
test_dataset = loader.get_dataset("test")

print(f"Train: {len(train_dataset)} samples")
print(f"Val: {len(val_dataset)} samples")
print(f"Test: {len(test_dataset)} samples")

# Access individual items
item = train_dataset.get_item(0)
print(f"Item: {item}")

CADDataset Class

The CADDataset is a framework-agnostic wrapper that provides consistent access to dataset subsets:

# Properties
train_dataset.indices          # Indices into parent dataset
train_dataset.parent_dataset   # Reference to DatasetLoader

# Methods
item = train_dataset.get_item(i)       # Get item by local index
raw = train_dataset.get_raw_data(i)    # Get file paths without loading

ML Framework Integration

The DatasetLoader provides integration with popular ML frameworks through adapter methods.

PyTorch Integration

To convert a CADDataset to a PyTorch-compatible dataset:

from torch.utils.data import DataLoader

# Get training dataset
train_dataset = loader.get_dataset("train")

# Convert to PyTorch
train_torch = train_dataset.to_torch()

# Create DataLoader
train_loader = DataLoader(
    train_torch,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True  # For GPU training
)

# Training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        # Unpack batch
        graphs = batch['graph']
        labels = batch['label']
        file_ids = batch['id']

        # Your training code
        outputs = model(graphs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

The .to_torch() method returns a PyTorch-compatible dataset object that can be used directly with PyTorch’s DataLoader for batching and parallel loading.
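
If your custom item loader returns objects that PyTorch's default collation cannot stack (for example the DGL graphs from the earlier custom_loader sketch), a custom collate_fn can be supplied to the DataLoader; this is a hedged example assuming items shaped like that sketch's output dictionary:

import dgl
import torch
from torch.utils.data import DataLoader

def collate_cad_items(items):
    """Batch a list of item dicts shaped like the custom_loader output (assumed)."""
    return {
        'graph': dgl.batch([item['graph'] for item in items]),
        'label': torch.as_tensor([item['label'] for item in items]),
        'id': [item['id'] for item in items],
    }

train_loader = DataLoader(
    train_torch,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    collate_fn=collate_cad_items
)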

Resource Management

The DatasetLoader creates internal resources (such as DatasetExplorer instances) that should be cleaned up when no longer needed:

# Close resources
loader.close_resources(clear_split_history=True)

Always close the loader to free memory and clean up temporary resources, especially when running multiple experiments or in long-running processes.

Complete Workflow Examples

This section demonstrates complete end-to-end workflows integrating DatasetExplorer and DatasetLoader.

Example 1: Basic Analysis and ML Preparation

The following example demonstrates the typical progression from dataset exploration through ML preparation:

import hoops_ai
from hoops_ai.flowmanager import flowtask
from hoops_ai.dataset import DatasetExplorer, DatasetLoader
import pathlib

# Assume flow already executed and created:
# - cad_pipeline.dataset
# - cad_pipeline.infoset
# - cad_pipeline.attribset
# - cad_pipeline.flow

flow_file = pathlib.Path("flow_output/flows/cad_pipeline/cad_pipeline.flow")

# ===== STEP 1: Explore Dataset =====
print("="*70)
print("STEP 1: DATASET EXPLORATION")
print("="*70)

explorer = DatasetExplorer(flow_output_file=str(flow_file))

# Print overview
explorer.print_table_of_contents()

# Analyze face area distribution
face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20)
print(f"\nFace area distribution:")
print(f"  Range: [{face_dist['bin_edges'][0]:.2f}, {face_dist['bin_edges'][-1]:.2f}]")
print(f"  Total faces: {face_dist['hist'].sum()}")
print(f"  Mean bin count: {face_dist['hist'].mean():.1f}")

# Filter files by complexity
high_complexity_filter = lambda ds: ds['complexity_level'] >= 4
complex_files = explorer.get_file_list(group="faces", where=high_complexity_filter)
print(f"\nHigh complexity files: {len(complex_files)}")

# Close explorer
explorer.close()

# ===== STEP 2: Prepare ML Dataset =====
print("\n" + "="*70)
print("STEP 2: ML DATASET PREPARATION")
print("="*70)

# Initialize loader
flow_path = pathlib.Path(flow_file)
loader = DatasetLoader(
    merged_store_path=str(flow_path.parent / f"{flow_path.stem}.dataset"),
    parquet_file_path=str(flow_path.parent / f"{flow_path.stem}.infoset")
)

# Stratified split
train_size, val_size, test_size = loader.split(
    key="complexity_level",
    group="faces",
    train=0.7,
    validation=0.15,
    test=0.15,
    random_state=42
)

print(f"\nDataset split:")
print(f"  Train: {train_size} files")
print(f"  Validation: {val_size} files")
print(f"  Test: {test_size} files")

# Get datasets
train_dataset = loader.get_dataset("train")
val_dataset = loader.get_dataset("validation")
test_dataset = loader.get_dataset("test")

# ===== STEP 3: Prepare for Training =====
print("\n" + "="*70)
print("STEP 3: PYTORCH INTEGRATION")
print("="*70)

from torch.utils.data import DataLoader

# Convert to PyTorch
train_torch = train_dataset.to_torch()
val_torch = val_dataset.to_torch()

# Create data loaders
train_loader = DataLoader(train_torch, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_torch, batch_size=32, shuffle=False, num_workers=4)

print(f"\nDataLoaders created:")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")

# Test iteration
batch = next(iter(train_loader))
print(f"\nSample batch keys: {list(batch.keys())}")

# ===== STEP 4: Training Loop (Skeleton) =====
print("\n" + "="*70)
print("STEP 4: TRAINING (SKELETON)")
print("="*70)

num_epochs = 10
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")

    # Training phase
    for batch_idx, batch in enumerate(train_loader):
        # Your training code here
        pass

    # Validation phase
    for batch in val_loader:
        # Your validation code here
        pass

print("\nWorkflow complete!")
loader.close_resources()

This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader.

Example 2: Advanced Analysis with Visualization

This example demonstrates multi-dimensional analysis with visualization:

from hoops_ai.dataset import DatasetExplorer
from hoops_ai.insights import DatasetViewer
import matplotlib.pyplot as plt
import numpy as np

# Initialize explorer
explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow")

# ===== Multi-Dimensional Analysis =====

# 1. Face area distribution
face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=30)

# 2. Edge length distribution
edge_dist = explorer.create_distribution(key="edge_lengths", group="edges", bins=30)

# 3. Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot face area histogram
ax1 = axes[0, 0]
bin_centers = 0.5 * (face_dist['bin_edges'][1:] + face_dist['bin_edges'][:-1])
ax1.bar(bin_centers, face_dist['hist'], width=(face_dist['bin_edges'][1] - face_dist['bin_edges'][0]))
ax1.set_xlabel('Face Area')
ax1.set_ylabel('Count')
ax1.set_title('Face Area Distribution')

# Plot edge length histogram
ax2 = axes[0, 1]
bin_centers = 0.5 * (edge_dist['bin_edges'][1:] + edge_dist['bin_edges'][:-1])
ax2.bar(bin_centers, edge_dist['hist'], width=(edge_dist['bin_edges'][1] - edge_dist['bin_edges'][0]))
ax2.set_xlabel('Edge Length')
ax2.set_ylabel('Count')
ax2.set_title('Edge Length Distribution')

# Plot file count per bin
ax3 = axes[1, 0]
file_counts = [len(files) for files in face_dist['file_ids_in_bins']]
ax3.plot(bin_centers, file_counts, marker='o')
ax3.set_xlabel('Face Area')
ax3.set_ylabel('Number of Files')
ax3.set_title('Files per Face Area Bin')

# Plot complexity distribution
complexity_stats = explorer.get_array_statistics(group_name="faces", array_name="complexity_level")
ax4 = axes[1, 1]
ax4.text(0.1, 0.9, f"Mean: {complexity_stats['mean']:.2f}", transform=ax4.transAxes)
ax4.text(0.1, 0.8, f"Std: {complexity_stats['std']:.2f}", transform=ax4.transAxes)
ax4.text(0.1, 0.7, f"Min: {complexity_stats['min']:.2f}", transform=ax4.transAxes)
ax4.text(0.1, 0.6, f"Max: {complexity_stats['max']:.2f}", transform=ax4.transAxes)
ax4.set_title('Dataset Statistics')
ax4.axis('off')

plt.tight_layout()
plt.savefig('dataset_analysis.png', dpi=300)
plt.show()

# ===== Visual Inspection =====

# Get high complexity files for visual inspection
high_complexity_filter = lambda ds: ds['complexity_level'] >= 4
complex_file_codes = explorer.get_file_list(group="faces", where=high_complexity_filter)

# Use DatasetViewer for visual inspection
viewer = DatasetViewer.from_explorer(explorer)
fig = viewer.show_preview_as_image(
    complex_file_codes[:25],  # First 25 complex files
    k=25,
    grid_cols=5,
    label_format='id',
    figsize=(15, 12)
)
plt.savefig('complex_files_preview.png', dpi=300)
plt.show()

explorer.close()

This example demonstrates how to perform comprehensive analysis combining statistical summaries, distribution analysis, and visual inspection of the dataset.

Best Practices

The following best practices help ensure efficient and correct usage of the dataset exploration and loading tools.

For DatasetExplorer

1. Use flow_output_file parameter

This simplifies initialization and ensures correct file paths:

explorer = DatasetExplorer(flow_output_file="path/to/flow.flow")

2. Close resources

Always close when done to free memory and Dask workers:

explorer.close(close_dask=True)

3. Check available groups first

Use available_groups() and available_arrays() before querying:

groups = explorer.available_groups()
if 'faces' in groups:
    face_data = explorer.get_group_data('faces')

4. Print table of contents early

Understand dataset structure before analysis:

explorer.print_table_of_contents()

For DatasetLoader

1. Set random_state

Ensure reproducible splits:

loader.split(key="label", random_state=42)

2. Clean up resources

Close explorer and clear caches:

loader.close_resources(clear_split_history=True)

Performance Considerations

Understanding performance characteristics helps optimize dataset operations for different use cases.

Memory Management

DatasetExplorer:

The DatasetExplorer uses Dask for out-of-core processing, enabling work with data larger than RAM. Zarr chunking enables partial array loading. Configure Dask workers based on available memory:

dask_params = {
    'n_workers': 4,
    'threads_per_worker': 2,
    'memory_limit': '8GB'  # Per worker
}

DatasetLoader:

The DatasetLoader keeps only indices in memory, not full data. Custom loaders should be memory-efficient. Use PyTorch DataLoader num_workers for parallel loading.

Parallel Processing

DatasetExplorer Parallelism:

  • Distribution computation: Dask parallel histogram

  • Cross-group queries: Parallel joins

  • Subgraph search: Parallel pattern matching

DatasetLoader Parallelism:

  • PyTorch DataLoader num_workers: Controls loading parallelism

  • Set based on CPU cores: num_workers = min(4, cpu_count())

  • Use pin_memory=True for GPU training

Summary

DatasetExplorer and DatasetLoader provide a complete solution for dataset analysis and ML preparation within the HOOPS AI pipeline.

Key Capabilities

DatasetExplorer: Analysis & Exploration
  • ✅ Query arrays by group and key

  • ✅ Analyze distributions with histograms

  • ✅ Filter files by metadata conditions

  • ✅ Statistical analysis and visualization

  • ✅ Cross-group queries and joins

DatasetLoader: ML Preparation
  • ✅ Stratified train/val/test splitting

  • ✅ Multi-label stratification support

  • ✅ Framework-agnostic CADDataset

  • ✅ PyTorch integration with .to_torch()

  • ✅ Custom item loaders for preprocessing

Integration with HOOPS AI Pipeline

  • ✅ Automatic consumption of DatasetMerger outputs

  • ✅ Schema-driven group and array discovery

  • ✅ Seamless connection to Flow-based workflows

  • ✅ Support for visualization assets (PNG, 3D cache)

These tools complete the HOOPS AI data pipeline, enabling users to go from raw CAD files to trained ML models with minimal custom code.

See Also

For related topics and additional information: