#################################### Dataset Exploration and Mining #################################### .. sidebar:: Table of Contents .. contents:: :local: :depth: 1 ************** Introduction ************** The **DatasetExplorer** and **DatasetLoader** are complementary tools that work with merged datasets (``.dataset``, ``.infoset``, ``.attribset`` files) produced by the DatasetMerger during Flow execution. Together, they provide comprehensive capabilities for dataset analysis, querying, and ML training preparation. These classes form the **analysis and ML preparation layer** of the HOOPS AI pipeline, consuming the unified datasets produced by the automatic merging process described in :doc:`dataset-merger`. Key Purposes ============ The dataset exploration and ML preparation module consists of two core components: **DatasetExplorer** Query, analyze, and visualize merged CAD datasets. This tool provides read-only exploration operations and statistical analysis without modifying the underlying data. **DatasetLoader** Prepare datasets for ML training with stratified train/validation/test splitting. The DatasetLoader manages dataset splitting using stratification techniques to ensure each subset maintains the same proportion of categories as the original dataset, preventing evaluation bias. Pipeline Position ================= The typical workflow follows this progression: .. code-block:: text DatasetMerger Output → DatasetExplorer (Analysis) → DatasetLoader (ML Prep) → Training (.dataset/.infoset/.attribset) This integration ensures seamless progression from merged data consolidation through exploratory analysis to ML model training. ***************************** Architecture Overview ***************************** Position in Pipeline ==================== DatasetExplorer and DatasetLoader operate in the **Analysis & ML Preparation Phase** of the HOOPS AI pipeline: .. code-block:: text ┌──────────────────────────────────────────────────────────────────────┐ │ HOOPS AI Complete Pipeline │ └──────────────────────────────────────────────────────────────────────┘ 1. ENCODING PHASE (Per-File) ┌─────────────────────────────────────────────────────┐ │ @flowtask.transform │ │ CAD File → Encoder → Storage → .data file │ └─────────────────────────────────────────────────────┘ ↓ 2. MERGING PHASE (Automatic) ┌─────────────────────────────────────────────────────┐ │ AutoDatasetExportTask (auto_dataset_export=True) │ │ Multiple .data → DatasetMerger → .dataset │ │ Multiple .json → DatasetInfo → .infoset/.attribset │ └─────────────────────────────────────────────────────┘ ↓ 3. ANALYSIS PHASE (DatasetExplorer) ← YOU ARE HERE ┌─────────────────────────────────────────────────────┐ │ .dataset + .infoset + .attribset │ │ ↓ │ │ DatasetExplorer │ │ - Query arrays by group/key │ │ - Analyze distributions │ │ - Filter by metadata │ │ - Statistical summaries │ │ - Cross-group queries │ └─────────────────────────────────────────────────────┘ ↓ 4. ML PREPARATION PHASE (DatasetLoader) ┌─────────────────────────────────────────────────────┐ │ DatasetLoader │ │ - Stratified train/val/test split │ │ - Multi-label support │ │ - Framework-agnostic CADDataset │ │ - PyTorch adapter (.to_torch()) │ │ - Custom item loaders │ └─────────────────────────────────────────────────────┘ ↓ 5. 
ML TRAINING PHASE ┌─────────────────────────────────────────────────────┐ │ PyTorch DataLoader → Training Loop → Model │ └─────────────────────────────────────────────────────┘ Input Files =========== Both DatasetExplorer and DatasetLoader consume the output files generated by the DatasetMerger during Flow execution. Required Files -------------- **1. .dataset file (Compressed Zarr)** The ``.dataset`` file contains all merged array data organized by groups. This file uses the ZipStore format with Zstd compression for efficient storage and access via xarray and Dask for parallel operations. - Structure: ``{flow_name}.dataset`` - Format: Zarr ZipStore with compression - Access: xarray with Dask parallel processing **2. .infoset file (Parquet)** The ``.infoset`` file contains file-level metadata with one row per CAD file. This columnar storage format enables efficient querying using pandas DataFrame operations. - Structure: Columnar storage with ``id``, ``name``, ``description``, and custom fields - Format: Parquet - Access: pandas DataFrame operations **3. .attribset file (Parquet) - Optional** The ``.attribset`` file contains categorical metadata and label descriptions, mapping numeric codes to human-readable names and descriptions. - Structure: ``table_name``, ``id``, ``name``, ``description`` columns - Format: Parquet - Access: pandas DataFrame operations File Location ------------- Files are generated by Flow execution in the following directory structure: .. code-block:: text flow_output/flows/{flow_name}/ ├── {flow_name}.dataset ← Merged data arrays ├── {flow_name}.infoset ← File-level metadata ├── {flow_name}.attribset ← Categorical metadata └── {flow_name}.flow ← Flow specification (JSON) Relationship to DatasetMerger ============================== Understanding the relationship between DatasetMerger and the exploration/loading tools clarifies their distinct roles: **DatasetMerger (automatic during Flow)** - **Input**: Individual ``.data`` and ``.json`` files (per CAD file) - **Process**: Concatenate arrays, route metadata, add provenance tracking - **Output**: Unified ``.dataset``, ``.infoset``, ``.attribset`` files **DatasetExplorer (user-driven analysis)** - **Input**: Output files from DatasetMerger - **Process**: Query, filter, analyze, visualize - **Output**: Statistics, distributions, filtered file lists **DatasetLoader (ML preparation)** - **Input**: Output files from DatasetMerger - **Process**: Stratified splitting, dataset creation - **Output**: Train/val/test CADDataset objects Key Distinction --------------- The three components serve different operational roles: - **DatasetMerger**: Write-heavy operation (consolidate many files into one) - **DatasetExplorer**: Read-heavy operation (query and analyze unified data) - **DatasetLoader**: Read + Index operation (split and serve data for training) This separation of concerns enables efficient workflows where data consolidation occurs once, followed by iterative analysis and multiple ML training experiments without re-merging. ************** DatasetExplorer ************** The :class:`DatasetExplorer ` class provides methods for discovering, querying, and analyzing merged datasets. This class focuses on read-only exploration operations and statistical analysis without modifying the underlying data. Initialization and Setup ========================= The DatasetExplorer supports multiple initialization methods to accommodate different workflow preferences. 
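The expected files can be verified, and the file-level metadata inspected, with standard tools before any explorer is created. The sketch below is an optional pre-flight check rather than part of the DatasetExplorer API: it assumes the directory layout shown above and a hypothetical flow named ``cad_pipeline``, and it reads the ``.infoset`` file with plain pandas because it is ordinary Parquet.

.. code-block:: python

   import pathlib
   import pandas as pd

   # Assumed example location; adjust to your own flow output directory
   flow_dir = pathlib.Path("flow_output/flows/cad_pipeline")
   flow_name = flow_dir.name

   # The merge step is expected to produce these three files (plus the .flow spec)
   expected = [flow_dir / f"{flow_name}{ext}" for ext in (".dataset", ".infoset", ".attribset")]
   missing = [path.name for path in expected if not path.exists()]
   if missing:
       raise FileNotFoundError(f"Merged outputs not found: {missing}")

   # The .infoset file is plain Parquet with one row per CAD file
   info = pd.read_parquet(flow_dir / f"{flow_name}.infoset")
   print(info.columns.tolist())
   print(f"{len(info)} files in the merged dataset")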
Initialization Methods ---------------------- **Method 1: Using Flow Output JSON File (Recommended)** The most convenient approach uses the ``.flow`` JSON file generated by Flow execution. This file contains all necessary paths and the explorer automatically resolves them: .. code-block:: python from hoops_ai.dataset import DatasetExplorer # Initialize using flow file explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow") The flow file contains keys such as ``flow_data`` (pointing to the Zarr dataset), ``flow_info`` (pointing to the Parquet metadata), and ``flow_attributes`` (pointing to attribute metadata). The explorer automatically resolves relative paths based on the flow file location. **Method 2: Explicit File Paths** For scenarios where you need direct control over file paths or when working outside the Flow framework: .. code-block:: python # Initialize with explicit paths explorer = DatasetExplorer( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset", parquet_file_attribs="path/to/flow_name.attribset" # Optional ) This approach is useful when files are in non-standard locations or when integrating with external data processing pipelines. **Method 3: With Custom Dask Configuration** For large datasets or specific performance requirements, you can customize the Dask parallel processing configuration: .. code-block:: python # Initialize with custom Dask settings explorer = DatasetExplorer( flow_output_file="path/to/flow_name.flow", dask_client_params={ 'n_workers': 8, 'threads_per_worker': 4, 'memory_limit': '8GB' } ) Dask Configuration ------------------ The DatasetExplorer uses Dask for parallel processing of large datasets. Dask is a parallel computing library that processes data in chunks across multiple CPU cores, enabling work with datasets larger than available RAM. By default, DatasetExplorer creates a local Dask cluster with sensible defaults. You can customize the Dask configuration by providing parameters: **Parameters:** - ``flow_output_file`` (str, optional): Path to ``.flow`` JSON containing all file paths - ``merged_store_path`` (str, optional): Path to ``.dataset`` file - ``parquet_file_path`` (str, optional): Path to ``.infoset`` file - ``parquet_file_attribs`` (str, optional): Path to ``.attribset`` file - ``dask_client_params`` (dict, optional): Dask configuration for parallel operations For very large datasets, configuring Dask with more workers and increased memory limits improves performance. However, for smaller datasets or systems with limited resources, the default configuration is sufficient. Discovering Dataset Structure ============================== Before querying specific data, understanding the available groups and arrays in the dataset is essential. Available Groups ---------------- Groups represent logical collections of related data. To discover available groups: .. code-block:: python # Get list of available groups available_groups = explorer.available_groups() print(f"Groups: {available_groups}") # Output: {'faces', 'edges', 'graph', 'machining'} Each group corresponds to a category of CAD data (faces, edges, graph structures, etc.) as defined in the schema used during encoding. Available Arrays within Groups ------------------------------- Each group contains multiple arrays storing different attributes. To discover arrays within a specific group: .. 
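Concrete values for ``dask_client_params`` depend on the machine running the analysis. The sketch below derives a configuration from the local CPU count; the sizing heuristic (half the cores as workers, a fixed per-worker memory budget) is only an illustrative assumption to tune for your hardware, while the parameter keys are the ones documented above.

.. code-block:: python

   import os

   from hoops_ai.dataset import DatasetExplorer

   cpu_count = os.cpu_count() or 4

   # Illustrative heuristic only: half the cores as Dask workers, two threads
   # each, and a fixed memory budget per worker.
   dask_client_params = {
       "n_workers": max(1, cpu_count // 2),
       "threads_per_worker": 2,
       "memory_limit": "4GB",  # per worker
   }

   explorer = DatasetExplorer(
       flow_output_file="path/to/flow_name.flow",
       dask_client_params=dask_client_params,
   )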
code-block:: python # Get arrays in the faces group available_arrays = explorer.available_arrays("faces") print(f"Face arrays: {available_arrays}") # Output: {'face_indices', 'face_areas', 'face_types', 'face_uv_grids', 'file_id_code_faces'} The array names reflect the data stored: geometric properties (areas, lengths), categorical types, and provenance tracking (``file_id_code_*`` arrays). Retrieving Metadata Descriptions --------------------------------- The Parquet metadata file contains description tables that map numeric codes to human-readable names. To retrieve these descriptions: .. code-block:: python # Get face type descriptions face_types = explorer.get_descriptions(table_name="face_types") print(face_types) # Output: # id name description # 0 0 Plane Planar surface # 1 1 Cylinder Cylindrical surface # 2 2 Cone Conical surface # 3 3 Sphere Spherical surface The :meth:`get_descriptions ` method accepts several parameters: - ``table_name``: The name of the metadata table (e.g., ``"face_types"``, ``"edge_types"``, ``"label"``) - ``key_id``: Optional integer to filter results to a specific ID - ``use_wildchar``: Optional boolean to enable wildcard matching in table names To search for label-related tables using wildcards: .. code-block:: python # Find label tables using wildcard label_tables = explorer.get_descriptions("label", None, True) print(label_tables) # Returns all tables with "label" in the name Print Dataset Overview ---------------------- To get a comprehensive overview of the entire dataset structure: .. code-block:: python # Print complete table of contents explorer.print_table_of_contents() This command outputs a formatted summary showing all groups, their arrays with shapes and data types, and metadata file information. Example output: .. code-block:: console ======================================== DATASET TABLE OF CONTENTS ======================================== Available Groups: -------------------------------------------------- Group: faces Arrays: - face_indices: (48530,) int32 - face_areas: (48530,) float32 - face_types: (48530,) int32 - face_uv_grids: (48530, 20, 20, 7) float32 - file_id_code_faces: (48530,) int32 Group: edges Arrays: - edge_indices: (72845,) int32 - edge_lengths: (72845,) float32 - edge_types: (72845,) int32 - file_id_code_edges: (72845,) int32 Group: machining Arrays: - machining_category: (100,) int32 - material_type: (100,) int32 - file_id_code_machining: (100,) int32 Metadata Files: - Info: cad_pipeline.infoset (file-level metadata) - Attributes: cad_pipeline.attribset (categorical metadata) Total Files: 100 Querying Data ============= The DatasetExplorer provides multiple methods for accessing data at different granularities: individual arrays, complete groups, or file-specific subsets. Get Array Data -------------- To retrieve a complete array for a specific group: .. code-block:: python # Get complete array data for a group face_areas = explorer.get_array_data(group_name="faces", array_name="face_areas") # Returns: xr.DataArray with shape [N_total_faces] # Access underlying NumPy array face_areas_np = face_areas.values print(f"Total faces: {len(face_areas_np)}") print(f"Mean area: {face_areas_np.mean():.2f}") The returned object is an xarray DataArray, which provides labeled multi-dimensional array functionality similar to pandas for higher-dimensional data. The ``.values`` attribute accesses the underlying NumPy array. Get Group Data -------------- To access all arrays within a group as a single dataset: .. 
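Because ``get_array_data`` returns an ordinary xarray DataArray, standard NumPy reductions apply directly to its ``.values``. A small sketch summarizing ``face_areas`` with a few robust statistics, using only the calls shown above:

.. code-block:: python

   import numpy as np

   # Pull a single array from the merged dataset and summarize it with NumPy
   face_areas = explorer.get_array_data(group_name="faces",
                                        array_name="face_areas").values

   print(f"count : {face_areas.size}")
   print(f"mean  : {face_areas.mean():.2f}")
   print(f"std   : {face_areas.std():.2f}")
   print(f"median: {np.median(face_areas):.2f}")
   print(f"p95   : {np.percentile(face_areas, 95):.2f}")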
code-block:: python # Get entire dataset for a group faces_ds = explorer.get_group_data("faces") print(faces_ds) # Output: # <xarray.Dataset> # Dimensions: (face: 48530) # Coordinates: # * face (face) int64 0 1 2 3 ... 48527 48528 48529 # Data variables: # face_indices (face) int32 ... # face_areas (face) float32 ... # face_types (face) int32 ... # file_id_code_faces (face) int32 ... # Access multiple arrays face_areas = faces_ds['face_areas'] face_types = faces_ds['face_types'] Each returned dataset is an xarray.Dataset object containing data variables (arrays) with their associated coordinates and dimensions. This provides a convenient way to work with related arrays together. Get File-Specific Data ---------------------- To retrieve data for a specific CAD file within the merged dataset: .. code-block:: python # Get data for a specific file file_id_code = 5 face_subset = explorer.file_dataset(file_id_code=file_id_code, group="faces") print(f"File {file_id_code} has {len(face_subset.face)} faces") # Access arrays for this file only file_face_areas = face_subset['face_areas'].values print(f"Face areas for file {file_id_code}: {file_face_areas}") The provenance tracking (``file_id_code_*`` arrays) enables efficient filtering to extract data belonging to a single file from the merged dataset. Filter by Condition ------------------- To identify files matching specific criteria: .. code-block:: python # Get files matching a boolean condition def high_complexity_filter(ds): """Filter for files containing large faces""" # Example: faces with area > 100 return ds['face_areas'] > 100 file_codes = explorer.get_file_list( group="faces", where=high_complexity_filter ) print(f"Found {len(file_codes)} files with large faces") # Convert file codes to file names file_names = [explorer.decode_file_id_code(code) for code in file_codes] The ``where`` parameter accepts a callable (function or lambda) that receives an xarray.Dataset and returns an xarray.DataArray of boolean values. The method returns an array of file ID codes where the condition is True. Distribution Analysis ===================== Computing distributions and histograms helps understand data balance and inform stratification strategies for ML training. Creating Distributions ----------------------- To compute the distribution of attributes across the entire dataset: .. code-block:: python # Create distribution with automatic binning distribution = explorer.create_distribution( key="face_areas", group="faces", bins=20 ) # Access distribution components print(f"Bin edges: {distribution['bin_edges']}") print(f"Histogram counts: {distribution['hist']}") print(f"Files per bin: {distribution['file_ids_in_bins']}") # Example output: # bin_edges: [0.5, 1.5, 2.5, ..., 20.5] # hist: [145, 302, 567, ..., 89] # file_ids_in_bins: [['part_001', 'part_003'], ['part_002', 'part_005'], ...] When ``bins=None``, the method automatically detects categorical data and creates one bin per unique category. For continuous numeric variables, specify the number of bins to create evenly spaced bins spanning the data range. The returned dictionary contains: - ``bin_edges``: Array of bin boundaries - ``hist``: Count of items in each bin - ``file_ids_in_bins``: Lists of file IDs whose items fall in each bin Visualizing Distributions -------------------------- Distribution results can be visualized using standard plotting libraries: ..
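For categorical keys such as ``face_types``, passing ``bins=None`` yields one bin per category, and the numeric codes can be joined with the ``.attribset`` descriptions to produce a readable balance table. The sketch below combines ``create_distribution`` and ``get_descriptions`` as documented earlier; it assumes the description table has the ``id``/``name`` columns shown above and that bins follow the category codes in ascending order.

.. code-block:: python

   import pandas as pd

   # One bin per unique face type (categorical auto-detection)
   type_dist = explorer.create_distribution(key="face_types", group="faces", bins=None)

   # Map numeric codes to human-readable names from the .attribset file
   descriptions = explorer.get_descriptions(table_name="face_types")
   code_to_name = dict(zip(descriptions["id"], descriptions["name"]))

   balance = pd.DataFrame({
       "code": range(len(type_dist["hist"])),   # assumed: bins ordered by category code
       "count": type_dist["hist"],
       "n_files": [len(files) for files in type_dist["file_ids_in_bins"]],
   })
   balance["name"] = balance["code"].map(code_to_name)
   print(balance.sort_values("count", ascending=False))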
code-block:: python import matplotlib.pyplot as plt import numpy as np dist = explorer.create_distribution(key="face_areas", group="faces", bins=30) # Plot histogram bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1]) plt.bar(bin_centers, dist['hist'], width=(dist['bin_edges'][1] - dist['bin_edges'][0])) plt.xlabel('Face Area') plt.ylabel('Count') plt.title('Face Area Distribution') plt.show() This visualization helps identify class imbalance and guides decisions about data augmentation or weighted loss functions during training. Metadata Queries ================ The DatasetExplorer provides methods to access file-level metadata and categorical descriptions stored in the Parquet files. File-Level Metadata ------------------- To retrieve metadata for all files in the dataset: .. code-block:: python # Get metadata for all files all_file_info = explorer.get_file_info_all() print(all_file_info.head()) # Output: # id name size_cadfile processing_time complexity_level subset # 0 0 part_001 1024000 12.5 3 train # 1 1 part_002 2048000 18.3 4 train # 2 2 part_003 512000 8.1 2 test The returned pandas DataFrame contains complete metadata for every file, enabling bulk analysis and reporting. To retrieve metadata for a specific file: .. code-block:: python # Get metadata for specific file file_info = explorer.get_parquet_info_by_code(file_id_code=5) print(file_info) Categorical Metadata (Labels/Descriptions) ------------------------------------------- To access label descriptions from the ``.attribset`` file: .. code-block:: python # Get label descriptions complexity_labels = explorer.get_descriptions(table_name="complexity_level") print(complexity_labels) # Output: # id name description # 0 1 Simple Basic geometry # 1 2 Medium Moderate complexity # 2 3 Complex High complexity # 3 4 Very Complex Advanced features # Get specific label description label_3 = explorer.get_descriptions(table_name="complexity_level", key_id=3) print(label_3['name'].values[0]) # Output: "Complex" Stream Cache Paths (Visualizations) ------------------------------------ To retrieve paths to visualization assets (PNG thumbnails and 3D stream cache files): .. code-block:: python # Get paths to PNG and 3D stream cache files stream_paths = explorer.get_stream_cache_paths() print(stream_paths[['id', 'name', 'stream_cache_png', 'stream_cache_3d']]) # Get stream cache for specific file file_stream = explorer.get_stream_cache_paths(file_id_code=10) png_path = file_stream['stream_cache_png'].values[0] scs_path = file_stream['stream_cache_3d'].values[0] Advanced Features ================= The DatasetExplorer provides advanced analytical capabilities for multi-label analysis and stratification. Membership Matrix ----------------- Membership matrices are critical for stratified dataset splitting. A membership matrix is a 2D array where each row represents a file and each column represents a category or bin. The cell value indicates membership: for binary matrices, a value of 1 means the file contains at least one item in that category; for count matrices, the value indicates how many items belong to that category. .. 
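The provenance arrays (``file_id_code_*``) also make it easy to compute per-file aggregates and attach them to the file-level metadata. The sketch below counts the faces contributed by each file and merges the result into the ``.infoset`` DataFrame; it assumes the ``id`` column of the metadata matches the file ID codes, as in the examples above.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Count how many faces each file contributed to the merged dataset
   file_codes = explorer.get_array_data(group_name="faces",
                                        array_name="file_id_code_faces").values
   codes, counts = np.unique(file_codes, return_counts=True)
   faces_per_file = pd.DataFrame({"id": codes, "n_faces": counts})

   # Join the per-file counts onto the file-level metadata
   info = explorer.get_file_info_all()
   enriched = info.merge(faces_per_file, on="id", how="left")
   print(enriched[["id", "name", "n_faces"]].sort_values("n_faces", ascending=False).head())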
code-block:: python # Create membership matrix for multi-label analysis matrix, file_codes, categories = explorer.build_membership_matrix( group="faces", key="face_types", bins_or_categories=None, # Auto-discover categories as_counts=False # False: boolean membership, True: item counts per category ) print(f"Matrix shape: {matrix.shape}") # (N_files, N_categories) print(f"File codes: {file_codes[:10]}") print(f"Categories: {categories}") # Use for stratification analysis import pandas as pd df = pd.DataFrame(matrix, columns=categories) df['file_code'] = file_codes print(df.head()) Count-based matrices (``as_counts=True``) provide more detailed information about the distribution of features within each file, which can be valuable for certain analysis tasks. Resource Management =================== The DatasetExplorer creates internal resources (such as Dask clients) that should be cleaned up when no longer needed: .. code-block:: python # Close resources when done explorer.close(close_dask=True) Always close the explorer to free memory and terminate Dask workers, especially when working with large datasets or in interactive environments. ************** DatasetLoader ************** The :class:`DatasetLoader` class manages dataset splitting and provides framework-agnostic access to training, validation, and test subsets. This class builds upon the DatasetExplorer to handle stratified splitting and subset management. Initialization ============== The DatasetLoader requires paths to the Zarr dataset and Parquet metadata file: .. code-block:: python from hoops_ai.dataset import DatasetLoader # Basic initialization loader = DatasetLoader( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset" ) You can optionally provide a custom ``item_loader_func`` that defines how to load individual items from files. If no loader function is provided, the loader returns raw file paths and metadata when items are accessed. Custom Item Loader (Experimental) ---------------------------------- For advanced use cases, you can define a custom item loader function: .. code-block:: python def custom_loader(graph_file, label_file, data_id): """Custom function to load and process items""" import dgl import numpy as np # Load graph graph = dgl.load_graphs(graph_file)[0][0] # Load label label = np.load(label_file) # Return as dictionary return { 'graph': graph, 'label': label, 'id': data_id, 'num_nodes': graph.number_of_nodes(), 'num_edges': graph.number_of_edges() } loader = DatasetLoader( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset", item_loader_func=custom_loader ) **Parameters:** - ``merged_store_path`` (str): Path to ``.dataset`` file - ``parquet_file_path`` (str): Path to ``.infoset`` file - ``item_loader_func`` (callable, optional): Custom function to load items - Signature: ``func(graph_file, label_file, data_id) -> item`` - If None, returns raw file paths Stratified Splitting ===================== The DatasetLoader's ``split`` method performs stratified splitting of the dataset into training, validation, and test subsets. Basic Stratified Split ----------------------- To perform a stratified split by a categorical key: ..
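Before choosing split fractions, it is worth checking how many files contain each category, since rare categories are the ones most likely to cause trouble during stratified splitting. The following is a short sketch based on the membership matrix returned by ``build_membership_matrix``; the 5% rarity threshold is an arbitrary example value.

.. code-block:: python

   matrix, file_codes, categories = explorer.build_membership_matrix(
       group="faces",
       key="face_types",
       bins_or_categories=None,
       as_counts=False,   # boolean membership: file contains the category or not
   )

   files_per_category = matrix.sum(axis=0)          # files containing each category
   coverage = files_per_category / len(file_codes)  # fraction of all files

   for category, n_files, fraction in zip(categories, files_per_category, coverage):
       flag = "  <-- rare" if fraction < 0.05 else ""
       print(f"{category}: {int(n_files)} files ({fraction:.1%}){flag}")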
code-block:: python # Perform stratified split by a categorical key train_size, val_size, test_size = loader.split( key="complexity_level", # Metadata key to stratify on group="faces", # Group containing the key train=0.7, # 70% training validation=0.15, # 15% validation test=0.15, # 15% testing random_state=42 # For reproducibility ) print(f"Dataset split:") print(f" Train: {train_size} files") print(f" Validation: {val_size} files") print(f" Test: {test_size} files") Stratification ensures that each subset (train, validation, test) maintains the same label distribution as the overall dataset. This is critical for training models that generalize well to unseen data. Mathematical Formulation ------------------------ For stratified splitting with key :math:`K` having :math:`C` categories, the split aims to preserve the distribution: .. math:: P(k_i | \text{train}) \approx P(k_i | \text{validation}) \approx P(k_i | \text{test}) \approx P(k_i) where :math:`k_i \in K` is a category and :math:`P(k_i)` is its proportion in the full dataset. Multi-Label Stratification --------------------------- For files with multiple labels (e.g., multiple face types per file), the loader uses ``MultilabelStratifiedShuffleSplit``. This approach creates a membership matrix: .. math:: \mathbf{M} \in \{0, 1\}^{N \times C} where: - :math:`N` = number of files - :math:`C` = number of categories - :math:`M_{ij} = 1` if file :math:`i` has category :math:`j`, else 0 The split preserves label co-occurrence patterns, ensuring that combinations of labels are proportionally represented in each subset. Dataset Access ============== After splitting, you can retrieve framework-agnostic dataset objects for each subset. Retrieving Subsets ------------------ To get datasets for each subset: .. code-block:: python # Get framework-agnostic datasets train_dataset = loader.get_dataset("train") val_dataset = loader.get_dataset("validation") test_dataset = loader.get_dataset("test") print(f"Train: {len(train_dataset)} samples") print(f"Val: {len(val_dataset)} samples") print(f"Test: {len(test_dataset)} samples") # Access individual items item = train_dataset.get_item(0) print(f"Item: {item}") CADDataset Class ---------------- The ``CADDataset`` is a framework-agnostic wrapper that provides consistent access to dataset subsets: .. code-block:: python # Properties train_dataset.indices # Indices into parent dataset train_dataset.parent_dataset # Reference to DatasetLoader # Methods item = train_dataset.get_item(i) # Get item by local index raw = train_dataset.get_raw_data(i) # Get file paths without loading ML Framework Integration ======================== The DatasetLoader provides integration with popular ML frameworks through adapter methods. PyTorch Integration ------------------- To convert a CADDataset to a PyTorch-compatible dataset: .. 
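When a custom item loader returns objects that PyTorch's default collation cannot stack, such as the DGL graphs produced by the loader sketched earlier, a ``collate_fn`` can be supplied to the DataLoader created in the example that follows. This is a minimal sketch under those assumptions (dict items with ``graph``, ``label`` and ``id`` keys, fixed-shape labels); adapt the keys to whatever your loader returns.

.. code-block:: python

   import dgl
   import torch

   def collate_cad_items(items):
       """Batch dict items like those returned by the custom loader above."""
       return {
           "graph": dgl.batch([item["graph"] for item in items]),        # one batched DGL graph
           "label": torch.as_tensor([item["label"] for item in items]),  # assumes fixed-shape labels
           "id": [item["id"] for item in items],
       }

   # Example: DataLoader(train_torch, batch_size=32, shuffle=True,
   #                     collate_fn=collate_cad_items)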
code-block:: python from torch.utils.data import DataLoader # Get training dataset train_dataset = loader.get_dataset("train") # Convert to PyTorch train_torch = train_dataset.to_torch() # Create DataLoader train_loader = DataLoader( train_torch, batch_size=32, shuffle=True, num_workers=4, pin_memory=True # For GPU training ) # Training loop for epoch in range(num_epochs): for batch in train_loader: # Unpack batch graphs = batch['graph'] labels = batch['label'] file_ids = batch['id'] # Your training code outputs = model(graphs) loss = criterion(outputs, labels) loss.backward() optimizer.step() The ``.to_torch()`` method returns a PyTorch-compatible dataset object that can be used directly with PyTorch's DataLoader for batching and parallel loading. Resource Management =================== The DatasetLoader creates internal resources (such as DatasetExplorer instances) that should be cleaned up when no longer needed: .. code-block:: python # Close resources loader.close_resources(clear_split_history=True) Always close the loader to free memory and clean up temporary resources, especially when running multiple experiments or in long-running processes. ************************** Complete Workflow Examples ************************** This section demonstrates complete end-to-end workflows integrating DatasetExplorer and DatasetLoader. Example 1: Basic Analysis and ML Preparation ============================================= The following example demonstrates the typical progression from dataset exploration through ML preparation: .. code-block:: python import hoops_ai from hoops_ai.flowmanager import flowtask from hoops_ai.dataset import DatasetExplorer, DatasetLoader import pathlib # Assume flow already executed and created: # - cad_pipeline.dataset # - cad_pipeline.infoset # - cad_pipeline.attribset # - cad_pipeline.flow flow_file = pathlib.Path("flow_output/flows/cad_pipeline/cad_pipeline.flow") # ===== STEP 1: Explore Dataset ===== print("="*70) print("STEP 1: DATASET EXPLORATION") print("="*70) explorer = DatasetExplorer(flow_output_file=str(flow_file)) # Print overview explorer.print_table_of_contents() # Analyze face area distribution face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20) print(f"\nFace area distribution:") print(f" Range: [{face_dist['bin_edges'][0]:.2f}, {face_dist['bin_edges'][-1]:.2f}]") print(f" Total faces: {face_dist['hist'].sum()}") print(f" Mean bin count: {face_dist['hist'].mean():.1f}") # Filter files by complexity high_complexity_filter = lambda ds: ds['complexity_level'] >= 4 complex_files = explorer.get_file_list(group="faces", where=high_complexity_filter) print(f"\nHigh complexity files: {len(complex_files)}") # Close explorer explorer.close() # ===== STEP 2: Prepare ML Dataset ===== print("\n" + "="*70) print("STEP 2: ML DATASET PREPARATION") print("="*70) # Initialize loader flow_path = pathlib.Path(flow_file) loader = DatasetLoader( merged_store_path=str(flow_path.parent / f"{flow_path.stem}.dataset"), parquet_file_path=str(flow_path.parent / f"{flow_path.stem}.infoset") ) # Stratified split train_size, val_size, test_size = loader.split( key="complexity_level", group="faces", train=0.7, validation=0.15, test=0.15, random_state=42 ) print(f"\nDataset split:") print(f" Train: {train_size} files") print(f" Validation: {val_size} files") print(f" Test: {test_size} files") # Get datasets train_dataset = loader.get_dataset("train") val_dataset = loader.get_dataset("validation") test_dataset = loader.get_dataset("test") # ===== 
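# --- Optional sanity check (added sketch, not part of the original example) ---
# Compare the reported split sizes with the subset lengths and peek at one raw
# item (file paths/metadata only, per get_raw_data) before converting to PyTorch.
print(f"\nSplit sizes reported:  {train_size}/{val_size}/{test_size}")
print(f"Subset lengths:        {len(train_dataset)}/{len(val_dataset)}/{len(test_dataset)}")
print(f"Sample raw train item: {train_dataset.get_raw_data(0)}")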
STEP 3: Prepare for Training ===== print("\n" + "="*70) print("STEP 3: PYTORCH INTEGRATION") print("="*70) from torch.utils.data import DataLoader # Convert to PyTorch train_torch = train_dataset.to_torch() val_torch = val_dataset.to_torch() # Create data loaders train_loader = DataLoader(train_torch, batch_size=32, shuffle=True, num_workers=4) val_loader = DataLoader(val_torch, batch_size=32, shuffle=False, num_workers=4) print(f"\nDataLoaders created:") print(f" Train batches: {len(train_loader)}") print(f" Val batches: {len(val_loader)}") # Test iteration batch = next(iter(train_loader)) print(f"\nSample batch keys: {list(batch.keys())}") # ===== STEP 4: Training Loop (Skeleton) ===== print("\n" + "="*70) print("STEP 4: TRAINING (SKELETON)") print("="*70) num_epochs = 10 for epoch in range(num_epochs): print(f"\nEpoch {epoch+1}/{num_epochs}") # Training phase for batch_idx, batch in enumerate(train_loader): # Your training code here pass # Validation phase for batch in val_loader: # Your validation code here pass print("\nWorkflow complete!") loader.close_resources() This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader. Example 2: Advanced Analysis with Visualization ================================================ This example demonstrates multi-dimensional analysis with visualization: .. code-block:: python from hoops_ai.dataset import DatasetExplorer from hoops_ai.insights import DatasetViewer import matplotlib.pyplot as plt import numpy as np # Initialize explorer explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow") # ===== Multi-Dimensional Analysis ===== # 1. Face area distribution face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=30) # 2. Edge length distribution edge_dist = explorer.create_distribution(key="edge_lengths", group="edges", bins=30) # 3. 
Create visualization fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # Plot face area histogram ax1 = axes[0, 0] bin_centers = 0.5 * (face_dist['bin_edges'][1:] + face_dist['bin_edges'][:-1]) ax1.bar(bin_centers, face_dist['hist'], width=(face_dist['bin_edges'][1] - face_dist['bin_edges'][0])) ax1.set_xlabel('Face Area') ax1.set_ylabel('Count') ax1.set_title('Face Area Distribution') # Plot edge length histogram ax2 = axes[0, 1] bin_centers = 0.5 * (edge_dist['bin_edges'][1:] + edge_dist['bin_edges'][:-1]) ax2.bar(bin_centers, edge_dist['hist'], width=(edge_dist['bin_edges'][1] - edge_dist['bin_edges'][0])) ax2.set_xlabel('Edge Length') ax2.set_ylabel('Count') ax2.set_title('Edge Length Distribution') # Plot file count per bin ax3 = axes[1, 0] file_counts = [len(files) for files in face_dist['file_ids_in_bins']] ax3.plot(bin_centers, file_counts, marker='o') ax3.set_xlabel('Face Area') ax3.set_ylabel('Number of Files') ax3.set_title('Files per Face Area Bin') # Plot complexity distribution complexity_stats = explorer.get_array_statistics(group_name="faces", array_name="complexity_level") ax4 = axes[1, 1] ax4.text(0.1, 0.9, f"Mean: {complexity_stats['mean']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.8, f"Std: {complexity_stats['std']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.7, f"Min: {complexity_stats['min']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.6, f"Max: {complexity_stats['max']:.2f}", transform=ax4.transAxes) ax4.set_title('Dataset Statistics') ax4.axis('off') plt.tight_layout() plt.savefig('dataset_analysis.png', dpi=300) plt.show() # ===== Visual Inspection ===== # Get high complexity files for visual inspection high_complexity_filter = lambda ds: ds['complexity_level'] >= 4 complex_file_codes = explorer.get_file_list(group="faces", where=high_complexity_filter) # Use DatasetViewer for visual inspection viewer = DatasetViewer.from_explorer(explorer) fig = viewer.show_preview_as_image( complex_file_codes[:25], # First 25 complex files k=25, grid_cols=5, label_format='id', figsize=(15, 12) ) plt.savefig('complex_files_preview.png', dpi=300) plt.show() explorer.close() This example demonstrates how to perform comprehensive analysis combining statistical summaries, distribution analysis, and visual inspection of the dataset. ************************** Best Practices ************************** The following best practices help ensure efficient and correct usage of the dataset exploration and loading tools. For DatasetExplorer =================== **1. Use flow_output_file parameter** This simplifies initialization and ensures correct file paths: .. code-block:: python explorer = DatasetExplorer(flow_output_file="path/to/flow.flow") **2. Close resources** Always close when done to free memory and Dask workers: .. code-block:: python explorer.close(close_dask=True) **3. Check available groups first** Use ``available_groups()`` and ``available_arrays()`` before querying: .. code-block:: python groups = explorer.available_groups() if 'faces' in groups: face_data = explorer.get_group_data('faces') **4. Print table of contents early** Understand dataset structure before analysis: .. code-block:: python explorer.print_table_of_contents() For DatasetLoader ================= **1. Set random_state** Ensure reproducible splits: .. code-block:: python loader.split(key="label", random_state=42) **2. Clean up resources** Close explorer and clear caches: .. 
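To guarantee that cleanup happens even when analysis or training code raises, both close calls can be placed in a ``finally`` block. A minimal sketch combining the two documented cleanup methods (the bare loader call follows below):

.. code-block:: python

   from hoops_ai.dataset import DatasetExplorer, DatasetLoader

   explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow")
   loader = DatasetLoader(
       merged_store_path="path/to/flow_name.dataset",
       parquet_file_path="path/to/flow_name.infoset",
   )
   try:
       # ... exploration, splitting and training experiments ...
       pass
   finally:
       explorer.close(close_dask=True)
       loader.close_resources(clear_split_history=True)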
code-block:: python loader.close_resources(clear_split_history=True) ************************** Performance Considerations ************************** Understanding performance characteristics helps optimize dataset operations for different use cases. Memory Management ================= **DatasetExplorer:** The DatasetExplorer uses Dask for out-of-core processing, enabling work with data larger than RAM. Zarr chunking enables partial array loading. Configure Dask workers based on available memory: .. code-block:: python dask_params = { 'n_workers': 4, 'threads_per_worker': 2, 'memory_limit': '8GB' # Per worker } **DatasetLoader:** The DatasetLoader keeps only indices in memory, not full data. Custom loaders should be memory-efficient. Use PyTorch DataLoader ``num_workers`` for parallel loading. Parallel Processing =================== **DatasetExplorer Parallelism:** - Distribution computation: Dask parallel histogram - Cross-group queries: Parallel joins - Subgraph search: Parallel pattern matching **DatasetLoader Parallelism:** - PyTorch DataLoader ``num_workers``: Controls loading parallelism - Set based on CPU cores: ``num_workers = min(4, cpu_count())`` - Use ``pin_memory=True`` for GPU training ************** Summary ************** **DatasetExplorer and DatasetLoader** provide a complete solution for dataset analysis and ML preparation within the HOOPS AI pipeline. Key Capabilities ================ **DatasetExplorer: Analysis & Exploration** - ✅ Query arrays by group and key - ✅ Analyze distributions with histograms - ✅ Filter files by metadata conditions - ✅ Statistical analysis and visualization - ✅ Cross-group queries and joins **DatasetLoader: ML Preparation** - ✅ Stratified train/val/test splitting - ✅ Multi-label stratification support - ✅ Framework-agnostic CADDataset - ✅ PyTorch integration with ``.to_torch()`` - ✅ Custom item loaders for preprocessing Integration with HOOPS AI Pipeline =================================== - ✅ Automatic consumption of DatasetMerger outputs - ✅ Schema-driven group and array discovery - ✅ Seamless connection to Flow-based workflows - ✅ Support for visualization assets (PNG, 3D cache) These tools complete the HOOPS AI data pipeline, enabling users to go from raw CAD files to trained ML models with minimal custom code. ************** See Also ************** For related topics and additional information: - :doc:`dataset-merger` - Understanding the data merging process that produces the input files - :doc:`cad-data-encoding` - Encoding CAD data for machine learning - :doc:`flow` - Flow-based data processing pipelines - :doc:`storage` - Storage abstractions for data persistence