#################################### Dataset Exploration and Mining #################################### .. sidebar:: Table of Contents .. contents:: :local: :depth: 1 ************** Introduction ************** The **DatasetExplorer** and **DatasetLoader** are complementary tools that work with merged datasets (``.dataset``, ``.infoset``, ``.attribset`` files) produced by the DatasetMerger during Flow execution. Together, they provide comprehensive capabilities for dataset analysis, querying, and ML training preparation. These classes form the **analysis and ML preparation layer** of the HOOPS AI pipeline, consuming the unified datasets produced by the automatic merging process described in :doc:`dataset-merger`. Key Purposes ============ The dataset exploration and ML preparation module consists of two core components: **DatasetExplorer** Query, analyze, and visualize merged CAD datasets. This tool provides read-only exploration operations and statistical analysis without modifying the underlying data. **DatasetLoader** Prepare datasets for ML training with stratified train/validation/test splitting. The DatasetLoader manages dataset splitting using stratification techniques to ensure each subset maintains the same proportion of categories as the original dataset, preventing evaluation bias. Pipeline Position ================= The typical workflow follows this progression: .. code-block:: text DatasetMerger Output → DatasetExplorer (Analysis) → DatasetLoader (ML Prep) → Training (.dataset/.infoset/.attribset) This integration ensures seamless progression from merged data consolidation through exploratory analysis to ML model training. ***************************** Architecture Overview ***************************** Position in Pipeline ==================== DatasetExplorer and DatasetLoader operate in the **Analysis & ML Preparation Phase** of the HOOPS AI pipeline: .. code-block:: text ┌──────────────────────────────────────────────────────────────────────┐ │ HOOPS AI Complete Pipeline │ └──────────────────────────────────────────────────────────────────────┘ 1. ENCODING PHASE (Per-File) ┌─────────────────────────────────────────────────────┐ │ @flowtask.transform │ │ CAD File → Encoder → Storage → .data file │ └─────────────────────────────────────────────────────┘ ↓ 2. MERGING PHASE (Automatic) ┌─────────────────────────────────────────────────────┐ │ AutoDatasetExportTask (auto_dataset_export=True) │ │ Multiple .data → DatasetMerger → .dataset │ │ Multiple .json → DatasetInfo → .infoset/.attribset │ └─────────────────────────────────────────────────────┘ ↓ 3. ANALYSIS PHASE (DatasetExplorer) ← YOU ARE HERE ┌─────────────────────────────────────────────────────┐ │ .dataset + .infoset + .attribset │ │ ↓ │ │ DatasetExplorer │ │ - Query arrays by group/key │ │ - Analyze distributions │ │ - Filter by metadata │ │ - Statistical summaries │ │ - Cross-group queries │ └─────────────────────────────────────────────────────┘ ↓ 4. ML PREPARATION PHASE (DatasetLoader) ┌─────────────────────────────────────────────────────┐ │ DatasetLoader │ │ - Stratified train/val/test split │ │ - Multi-label support │ │ - Framework-agnostic CADDataset │ │ - PyTorch adapter (.to_torch()) │ │ - Custom item loaders │ └─────────────────────────────────────────────────────┘ ↓ 5. 
ML TRAINING PHASE ┌─────────────────────────────────────────────────────┐ │ PyTorch DataLoader → Training Loop → Model │ └─────────────────────────────────────────────────────┘ Input Files =========== Both DatasetExplorer and DatasetLoader consume the output files generated by the DatasetMerger during Flow execution. Required Files -------------- **1. .dataset file (Compressed Zarr)** The ``.dataset`` file contains all merged array data organized by groups. This file uses the ZipStore format with Zstd compression for efficient storage and access via xarray and Dask for parallel operations. - Structure: ``{flow_name}.dataset`` - Format: Zarr ZipStore with compression - Access: xarray with Dask parallel processing **2. .infoset file (Parquet)** The ``.infoset`` file contains file-level metadata with one row per CAD file. This columnar storage format enables efficient querying using pandas DataFrame operations. - Structure: Columnar storage with ``id``, ``name``, ``description``, and custom fields - Format: Parquet - Access: pandas DataFrame operations **3. .attribset file (Parquet) - Optional** The ``.attribset`` file contains categorical metadata and label descriptions, mapping numeric codes to human-readable names and descriptions. - Structure: ``table_name``, ``id``, ``name``, ``description`` columns - Format: Parquet - Access: pandas DataFrame operations File Location ------------- Files are generated by Flow execution in the following directory structure: .. code-block:: text flow_output/flows/{flow_name}/ ├── {flow_name}.dataset ← Merged data arrays ├── {flow_name}.infoset ← File-level metadata ├── {flow_name}.attribset ← Categorical metadata └── {flow_name}.flow ← Flow specification (JSON) Relationship to DatasetMerger ============================== Understanding the relationship between DatasetMerger and the exploration/loading tools clarifies their distinct roles: **DatasetMerger (automatic during Flow)** - **Input**: Individual ``.data`` and ``.json`` files (per CAD file) - **Process**: Concatenate arrays, route metadata, add provenance tracking - **Output**: Unified ``.dataset``, ``.infoset``, ``.attribset`` files **DatasetExplorer (user-driven analysis)** - **Input**: Output files from DatasetMerger - **Process**: Query, filter, analyze, visualize - **Output**: Statistics, distributions, filtered file lists **DatasetLoader (ML preparation)** - **Input**: Output files from DatasetMerger - **Process**: Stratified splitting, dataset creation - **Output**: Train/val/test CADDataset objects Key Distinction --------------- The three components serve different operational roles: - **DatasetMerger**: Write-heavy operation (consolidate many files into one) - **DatasetExplorer**: Read-heavy operation (query and analyze unified data) - **DatasetLoader**: Read + Index operation (split and serve data for training) This separation of concerns enables efficient workflows where data consolidation occurs once, followed by iterative analysis and multiple ML training experiments without re-merging. ************** DatasetExplorer ************** The :class:`DatasetExplorer ` class provides methods for discovering, querying, and analyzing merged datasets. This class focuses on read-only exploration operations and statistical analysis without modifying the underlying data. Initialization and Setup ========================= The DatasetExplorer supports multiple initialization methods to accommodate different workflow preferences. 
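The expected files can be verified, and the file-level metadata inspected, with standard tools before any explorer is created. The sketch below is an optional pre-flight check rather than part of the DatasetExplorer API: it assumes the directory layout shown above and a hypothetical flow named ``cad_pipeline``, and it reads the ``.infoset`` file with plain pandas because it is ordinary Parquet.

.. code-block:: python

   import pathlib
   import pandas as pd

   # Assumed example location; adjust to your own flow output directory
   flow_dir = pathlib.Path("flow_output/flows/cad_pipeline")
   flow_name = flow_dir.name

   # The merge step is expected to produce these three files (plus the .flow spec)
   expected = [flow_dir / f"{flow_name}{ext}" for ext in (".dataset", ".infoset", ".attribset")]
   missing = [path.name for path in expected if not path.exists()]
   if missing:
       raise FileNotFoundError(f"Merged outputs not found: {missing}")

   # The .infoset file is plain Parquet with one row per CAD file
   info = pd.read_parquet(flow_dir / f"{flow_name}.infoset")
   print(info.columns.tolist())
   print(f"{len(info)} files in the merged dataset")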
Initialization Methods ---------------------- **Method 1: Using Flow Output JSON File (Recommended)** The most convenient approach uses the ``.flow`` JSON file generated by Flow execution. This file contains all necessary paths and the explorer automatically resolves them: .. code-block:: python from hoops_ai.dataset import DatasetExplorer # Initialize using flow file explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow") The flow file contains keys such as ``flow_data`` (pointing to the Zarr dataset), ``flow_info`` (pointing to the Parquet metadata), and ``flow_attributes`` (pointing to attribute metadata). The explorer automatically resolves relative paths based on the flow file location. **Method 2: Explicit File Paths** For scenarios where you need direct control over file paths or when working outside the Flow framework: .. code-block:: python # Initialize with explicit paths explorer = DatasetExplorer( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset", parquet_file_attribs="path/to/flow_name.attribset" # Optional ) This approach is useful when files are in non-standard locations or when integrating with external data processing pipelines. **Method 3: With Custom Dask Configuration** For large datasets or specific performance requirements, you can customize the Dask parallel processing configuration: .. code-block:: python # Initialize with custom Dask settings explorer = DatasetExplorer( flow_output_file="path/to/flow_name.flow", dask_client_params={ 'n_workers': 8, 'threads_per_worker': 4, 'memory_limit': '8GB' } ) Dask Configuration ------------------ The DatasetExplorer uses Dask for parallel processing of large datasets. Dask is a parallel computing library that processes data in chunks across multiple CPU cores, enabling work with datasets larger than available RAM. By default, DatasetExplorer creates a local Dask cluster with sensible defaults. You can customize the Dask configuration by providing parameters: **Parameters:** - ``flow_output_file`` (str, optional): Path to ``.flow`` JSON containing all file paths - ``merged_store_path`` (str, optional): Path to ``.dataset`` file - ``parquet_file_path`` (str, optional): Path to ``.infoset`` file - ``parquet_file_attribs`` (str, optional): Path to ``.attribset`` file - ``dask_client_params`` (dict, optional): Dask configuration for parallel operations For very large datasets, configuring Dask with more workers and increased memory limits improves performance. However, for smaller datasets or systems with limited resources, the default configuration is sufficient. Discovering Dataset Structure ============================== Before querying specific data, understanding the available groups and arrays in the dataset is essential. Available Groups ---------------- Groups represent logical collections of related data. To discover available groups: .. code-block:: python # Get list of available groups available_groups = explorer.available_groups() print(f"Groups: {available_groups}") # Output: {'faces', 'edges', 'graph', 'machining'} Each group corresponds to a category of CAD data (faces, edges, graph structures, etc.) as defined in the schema used during encoding. Available Arrays within Groups ------------------------------- Each group contains multiple arrays storing different attributes. To discover arrays within a specific group: .. 
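Concrete values for ``dask_client_params`` depend on the machine running the analysis. The sketch below derives a configuration from the local CPU count; the sizing heuristic (half the cores as workers, a fixed per-worker memory budget) is only an illustrative assumption to tune for your hardware, while the parameter keys are the ones documented above.

.. code-block:: python

   import os

   from hoops_ai.dataset import DatasetExplorer

   cpu_count = os.cpu_count() or 4

   # Illustrative heuristic only: half the cores as Dask workers, two threads
   # each, and a fixed memory budget per worker.
   dask_client_params = {
       "n_workers": max(1, cpu_count // 2),
       "threads_per_worker": 2,
       "memory_limit": "4GB",  # per worker
   }

   explorer = DatasetExplorer(
       flow_output_file="path/to/flow_name.flow",
       dask_client_params=dask_client_params,
   )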
code-block:: python # Get arrays in the faces group available_arrays = explorer.available_arrays("faces") print(f"Face arrays: {available_arrays}") # Output: {'face_indices', 'face_areas', 'face_types', 'face_uv_grids', 'file_id_code_faces'} The array names reflect the data stored: geometric properties (areas, lengths), categorical types, and provenance tracking (``file_id_code_*`` arrays). Retrieving Metadata Descriptions --------------------------------- The Parquet metadata file contains description tables that map numeric codes to human-readable names. To retrieve these descriptions: .. code-block:: python # Get face type descriptions face_types = explorer.get_descriptions(table_name="face_types") print(face_types) # Output: # id name description # 0 0 Plane Planar surface # 1 1 Cylinder Cylindrical surface # 2 2 Cone Conical surface # 3 3 Sphere Spherical surface The :meth:`get_descriptions ` method accepts several parameters: - ``table_name``: The name of the metadata table (e.g., ``"face_types"``, ``"edge_types"``, ``"label"``) - ``key_id``: Optional integer to filter results to a specific ID - ``use_wildchar``: Optional boolean to enable wildcard matching in table names To search for label-related tables using wildcards: .. code-block:: python # Find label tables using wildcard label_tables = explorer.get_descriptions("label", None, True) print(label_tables) # Returns all tables with "label" in the name Print Dataset Overview ---------------------- To get a comprehensive overview of the entire dataset structure: .. code-block:: python # Print complete table of contents explorer.print_table_of_contents() This command outputs a formatted summary showing all groups, their arrays with shapes and data types, and metadata file information. Example output: .. code-block:: console ======================================== DATASET TABLE OF CONTENTS ======================================== Available Groups: -------------------------------------------------- Group: faces Arrays: - face_indices: (48530,) int32 - face_areas: (48530,) float32 - face_types: (48530,) int32 - face_uv_grids: (48530, 20, 20, 7) float32 - file_id_code_faces: (48530,) int32 Group: edges Arrays: - edge_indices: (72845,) int32 - edge_lengths: (72845,) float32 - edge_types: (72845,) int32 - file_id_code_edges: (72845,) int32 Group: machining Arrays: - machining_category: (100,) int32 - material_type: (100,) int32 - file_id_code_machining: (100,) int32 Metadata Files: - Info: cad_pipeline.infoset (file-level metadata) - Attributes: cad_pipeline.attribset (categorical metadata) Total Files: 100 Querying Data ============= The DatasetExplorer provides multiple methods for accessing data at different granularities: individual arrays, complete groups, or file-specific subsets. Get Array Data -------------- To retrieve a complete array for a specific group: .. code-block:: python # Get complete array data for a group face_areas = explorer.get_array_data(group_name="faces", array_name="face_areas") # Returns: xr.DataArray with shape [N_total_faces] # Access underlying NumPy array face_areas_np = face_areas.values print(f"Total faces: {len(face_areas_np)}") print(f"Mean area: {face_areas_np.mean():.2f}") The returned object is an xarray DataArray, which provides labeled multi-dimensional array functionality similar to pandas for higher-dimensional data. The ``.values`` attribute accesses the underlying NumPy array. Get Group Data -------------- To access all arrays within a group as a single dataset: .. 
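Because ``get_array_data`` returns an ordinary xarray DataArray, standard NumPy reductions apply directly to its ``.values``. A small sketch summarizing ``face_areas`` with a few robust statistics, using only the calls shown above:

.. code-block:: python

   import numpy as np

   # Pull a single array from the merged dataset and summarize it with NumPy
   face_areas = explorer.get_array_data(group_name="faces",
                                        array_name="face_areas").values

   print(f"count : {face_areas.size}")
   print(f"mean  : {face_areas.mean():.2f}")
   print(f"std   : {face_areas.std():.2f}")
   print(f"median: {np.median(face_areas):.2f}")
   print(f"p95   : {np.percentile(face_areas, 95):.2f}")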
code-block:: python # Get entire dataset for a group faces_ds = explorer.get_group_data("faces") print(faces_ds) # Output: # <xarray.Dataset> # Dimensions: (face: 48530) # Coordinates: # * face (face) int64 0 1 2 3 ... 48527 48528 48529 # Data variables: # face_indices (face) int32 ... # face_areas (face) float32 ... # face_types (face) int32 ... # file_id_code_faces (face) int32 ... # Access multiple arrays face_areas = faces_ds['face_areas'] face_types = faces_ds['face_types'] Each returned dataset is an xarray.Dataset object containing data variables (arrays) with their associated coordinates and dimensions. This provides a convenient way to work with related arrays together. Get File-Specific Data ---------------------- To retrieve data for a specific CAD file within the merged dataset: .. code-block:: python # Get data for a specific file file_id_code = 5 face_subset = explorer.file_dataset(file_id_code=file_id_code, group="faces") print(f"File {file_id_code} has {len(face_subset.face)} faces") # Access arrays for this file only file_face_areas = face_subset['face_areas'].values print(f"Face areas for file {file_id_code}: {file_face_areas}") The provenance tracking (``file_id_code_*`` arrays) enables efficient filtering to extract data belonging to a single file from the merged dataset. Filter by Condition ------------------- To identify files matching specific criteria: .. code-block:: python # Get files matching a boolean condition def high_complexity_filter(ds): """Filter for files containing large faces""" # Example: faces with area > 100 return ds['face_areas'] > 100 file_codes = explorer.get_file_list( group="faces", where=high_complexity_filter ) print(f"Found {len(file_codes)} files with large faces") # Convert file codes to file names file_names = [explorer.decode_file_id_code(code) for code in file_codes] The ``where`` parameter accepts a callable (function or lambda) that receives an xarray.Dataset and returns an xarray.DataArray of boolean values. The method returns an array of file ID codes where the condition is True. Distribution Analysis ===================== Computing distributions and histograms helps understand data balance and inform stratification strategies for ML training. Creating Distributions ----------------------- To compute the distribution of attributes across the entire dataset: .. code-block:: python # Create distribution with automatic binning distribution = explorer.create_distribution( key="face_areas", group="faces", bins=20 ) # Access distribution components print(f"Bin edges: {distribution['bin_edges']}") print(f"Histogram counts: {distribution['hist']}") print(f"Files per bin: {distribution['file_ids_in_bins']}") # Example output: # bin_edges: [0.5, 1.5, 2.5, ..., 20.5] # hist: [145, 302, 567, ..., 89] # file_ids_in_bins: [['part_001', 'part_003'], ['part_002', 'part_005'], ...] When ``bins=None``, the method automatically detects categorical data and creates one bin per unique category. For continuous numeric variables, specify the number of bins to create evenly spaced bins spanning the data range. The returned dictionary contains: - ``bin_edges``: Array of bin boundaries - ``hist``: Count of items in each bin - ``file_ids_in_bins``: Lists of file IDs whose items fall in each bin Visualizing Distributions -------------------------- Distribution results can be visualized using standard plotting libraries: ..
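For categorical keys such as ``face_types``, passing ``bins=None`` yields one bin per category, and the numeric codes can be joined with the ``.attribset`` descriptions to produce a readable balance table. The sketch below combines ``create_distribution`` and ``get_descriptions`` as documented earlier; it assumes the description table has the ``id``/``name`` columns shown above and that bins follow the category codes in ascending order.

.. code-block:: python

   import pandas as pd

   # One bin per unique face type (categorical auto-detection)
   type_dist = explorer.create_distribution(key="face_types", group="faces", bins=None)

   # Map numeric codes to human-readable names from the .attribset file
   descriptions = explorer.get_descriptions(table_name="face_types")
   code_to_name = dict(zip(descriptions["id"], descriptions["name"]))

   balance = pd.DataFrame({
       "code": range(len(type_dist["hist"])),   # assumed: bins ordered by category code
       "count": type_dist["hist"],
       "n_files": [len(files) for files in type_dist["file_ids_in_bins"]],
   })
   balance["name"] = balance["code"].map(code_to_name)
   print(balance.sort_values("count", ascending=False))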
code-block:: python import matplotlib.pyplot as plt import numpy as np dist = explorer.create_distribution(key="face_areas", group="faces", bins=30) # Plot histogram bin_centers = 0.5 * (dist['bin_edges'][1:] + dist['bin_edges'][:-1]) plt.bar(bin_centers, dist['hist'], width=(dist['bin_edges'][1] - dist['bin_edges'][0])) plt.xlabel('Face Area') plt.ylabel('Count') plt.title('Face Area Distribution') plt.show() This visualization helps identify class imbalance and guides decisions about data augmentation or weighted loss functions during training. Metadata Queries ================ The DatasetExplorer provides methods to access file-level metadata and categorical descriptions stored in the Parquet files. File-Level Metadata ------------------- To retrieve metadata for all files in the dataset: .. code-block:: python # Get metadata for all files all_file_info = explorer.get_file_info_all() print(all_file_info.head()) # Output: # id name size_cadfile processing_time complexity_level subset # 0 0 part_001 1024000 12.5 3 train # 1 1 part_002 2048000 18.3 4 train # 2 2 part_003 512000 8.1 2 test The returned pandas DataFrame contains complete metadata for every file, enabling bulk analysis and reporting. To retrieve metadata for a specific file: .. code-block:: python # Get metadata for specific file file_info = explorer.get_parquet_info_by_code(file_id_code=5) print(file_info) Categorical Metadata (Labels/Descriptions) ------------------------------------------- To access label descriptions from the ``.attribset`` file: .. code-block:: python # Get label descriptions complexity_labels = explorer.get_descriptions(table_name="complexity_level") print(complexity_labels) # Output: # id name description # 0 1 Simple Basic geometry # 1 2 Medium Moderate complexity # 2 3 Complex High complexity # 3 4 Very Complex Advanced features # Get specific label description label_3 = explorer.get_descriptions(table_name="complexity_level", key_id=3) print(label_3['name'].values[0]) # Output: "Complex" Stream Cache Paths (Visualizations) ------------------------------------ To retrieve paths to visualization assets (PNG thumbnails and 3D stream cache files): .. code-block:: python # Get paths to PNG and 3D stream cache files stream_paths = explorer.get_stream_cache_paths() print(stream_paths[['id', 'name', 'stream_cache_png', 'stream_cache_3d']]) # Get stream cache for specific file file_stream = explorer.get_stream_cache_paths(file_id_code=10) png_path = file_stream['stream_cache_png'].values[0] scs_path = file_stream['stream_cache_3d'].values[0] Advanced Features ================= The DatasetExplorer provides advanced analytical capabilities for multi-label analysis and stratification. Membership Matrix ----------------- Membership matrices are critical for stratified dataset splitting. A membership matrix is a 2D array where each row represents a file and each column represents a category or bin. The cell value indicates membership: for binary matrices, a value of 1 means the file contains at least one item in that category; for count matrices, the value indicates how many items belong to that category. .. 
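The provenance arrays (``file_id_code_*``) also make it easy to compute per-file aggregates and attach them to the file-level metadata. The sketch below counts the faces contributed by each file and merges the result into the ``.infoset`` DataFrame; it assumes the ``id`` column of the metadata matches the file ID codes, as in the examples above.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Count how many faces each file contributed to the merged dataset
   file_codes = explorer.get_array_data(group_name="faces",
                                        array_name="file_id_code_faces").values
   codes, counts = np.unique(file_codes, return_counts=True)
   faces_per_file = pd.DataFrame({"id": codes, "n_faces": counts})

   # Join the per-file counts onto the file-level metadata
   info = explorer.get_file_info_all()
   enriched = info.merge(faces_per_file, on="id", how="left")
   print(enriched[["id", "name", "n_faces"]].sort_values("n_faces", ascending=False).head())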
code-block:: python # Create membership matrix for multi-label analysis matrix, file_codes, categories = explorer.build_membership_matrix( group="faces", key="face_types", bins_or_categories=None, # Auto-discover categories as_counts=False # False: boolean membership, True: item counts per category ) print(f"Matrix shape: {matrix.shape}") # (N_files, N_categories) print(f"File codes: {file_codes[:10]}") print(f"Categories: {categories}") # Use for stratification analysis import pandas as pd df = pd.DataFrame(matrix, columns=categories) df['file_code'] = file_codes print(df.head()) Count-based matrices (``as_counts=True``) provide more detailed information about the distribution of features within each file, which can be valuable for certain analysis tasks. Resource Management =================== The DatasetExplorer creates internal resources (such as Dask clients) that should be cleaned up when no longer needed: .. code-block:: python # Close resources when done explorer.close(close_dask=True) Always close the explorer to free memory and terminate Dask workers, especially when working with large datasets or in interactive environments. ************** DatasetLoader ************** The :class:`DatasetLoader` class manages dataset splitting and provides framework-agnostic access to training, validation, and test subsets. This class builds upon the DatasetExplorer to handle stratified splitting and subset management. Initialization ============== The DatasetLoader requires paths to the Zarr dataset and Parquet metadata file: .. code-block:: python from hoops_ai.dataset import DatasetLoader # Basic initialization loader = DatasetLoader( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset" ) You can optionally provide a custom ``item_loader_func`` that defines how to load individual items from files. If no loader function is provided, the loader returns raw file paths and metadata when items are accessed. Custom Item Loader (Experimental) ---------------------------------- For advanced use cases, you can define a custom item loader function: .. code-block:: python def custom_loader(graph_file, label_file, data_id): """Custom function to load and process items""" import dgl import numpy as np # Load graph graph = dgl.load_graphs(graph_file)[0][0] # Load label label = np.load(label_file) # Return as dictionary return { 'graph': graph, 'label': label, 'id': data_id, 'num_nodes': graph.number_of_nodes(), 'num_edges': graph.number_of_edges() } loader = DatasetLoader( merged_store_path="path/to/flow_name.dataset", parquet_file_path="path/to/flow_name.infoset", item_loader_func=custom_loader ) **Parameters:** - ``merged_store_path`` (str): Path to ``.dataset`` file - ``parquet_file_path`` (str): Path to ``.infoset`` file - ``item_loader_func`` (callable, optional): Custom function to load items - Signature: ``func(graph_file, label_file, data_id) -> item`` - If None, returns raw file paths Stratified Splitting ===================== The DatasetLoader's ``split`` method performs stratified splitting of the dataset into training, validation, and test subsets. Basic Stratified Split ----------------------- To perform a stratified split by a categorical key: ..
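Before choosing split fractions, it is worth checking how many files contain each category, since rare categories are the ones most likely to cause trouble during stratified splitting. The following is a short sketch based on the membership matrix returned by ``build_membership_matrix``; the 5% rarity threshold is an arbitrary example value.

.. code-block:: python

   matrix, file_codes, categories = explorer.build_membership_matrix(
       group="faces",
       key="face_types",
       bins_or_categories=None,
       as_counts=False,   # boolean membership: file contains the category or not
   )

   files_per_category = matrix.sum(axis=0)          # files containing each category
   coverage = files_per_category / len(file_codes)  # fraction of all files

   for category, n_files, fraction in zip(categories, files_per_category, coverage):
       flag = "  <-- rare" if fraction < 0.05 else ""
       print(f"{category}: {int(n_files)} files ({fraction:.1%}){flag}")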
code-block:: python # Perform stratified split by a categorical key train_size, val_size, test_size = loader.split( key="complexity_level", # Metadata key to stratify on group="faces", # Group containing the key train=0.7, # 70% training validation=0.15, # 15% validation test=0.15, # 15% testing random_state=42 # For reproducibility ) print(f"Dataset split:") print(f" Train: {train_size} files") print(f" Validation: {val_size} files") print(f" Test: {test_size} files") Stratification ensures that each subset (train, validation, test) maintains the same label distribution as the overall dataset. This is critical for training models that generalize well to unseen data. Mathematical Formulation ------------------------ For stratified splitting with key :math:`K` having :math:`C` categories, the split aims to preserve the distribution: .. math:: P(k_i | \text{train}) \approx P(k_i | \text{validation}) \approx P(k_i | \text{test}) \approx P(k_i) where :math:`k_i \in K` is a category and :math:`P(k_i)` is its proportion in the full dataset. Multi-Label Stratification --------------------------- For files with multiple labels (e.g., multiple face types per file), the loader uses ``MultilabelStratifiedShuffleSplit``. This approach creates a membership matrix: .. math:: \mathbf{M} \in \{0, 1\}^{N \times C} where: - :math:`N` = number of files - :math:`C` = number of categories - :math:`M_{ij} = 1` if file :math:`i` has category :math:`j`, else 0 The split preserves label co-occurrence patterns, ensuring that combinations of labels are proportionally represented in each subset. Dataset Access ============== After splitting, you can retrieve framework-agnostic dataset objects for each subset. Retrieving Subsets ------------------ To get datasets for each subset: .. code-block:: python # Get framework-agnostic datasets train_dataset = loader.get_dataset("train") val_dataset = loader.get_dataset("validation") test_dataset = loader.get_dataset("test") print(f"Train: {len(train_dataset)} samples") print(f"Val: {len(val_dataset)} samples") print(f"Test: {len(test_dataset)} samples") # Access individual items item = train_dataset.get_item(0) print(f"Item: {item}") CADDataset Class ---------------- The ``CADDataset`` is a framework-agnostic wrapper that provides consistent access to dataset subsets: .. code-block:: python # Properties train_dataset.indices # Indices into parent dataset train_dataset.parent_dataset # Reference to DatasetLoader # Methods item = train_dataset.get_item(i) # Get item by local index raw = train_dataset.get_raw_data(i) # Get file paths without loading ML Framework Integration ======================== The DatasetLoader provides integration with popular ML frameworks through adapter methods. PyTorch Integration ------------------- To convert a CADDataset to a PyTorch-compatible dataset: .. 
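When a custom item loader returns objects that PyTorch's default collation cannot stack, such as the DGL graphs produced by the loader sketched earlier, a ``collate_fn`` can be supplied to the DataLoader created in the example that follows. This is a minimal sketch under those assumptions (dict items with ``graph``, ``label`` and ``id`` keys, fixed-shape labels); adapt the keys to whatever your loader returns.

.. code-block:: python

   import dgl
   import torch

   def collate_cad_items(items):
       """Batch dict items like those returned by the custom loader above."""
       return {
           "graph": dgl.batch([item["graph"] for item in items]),        # one batched DGL graph
           "label": torch.as_tensor([item["label"] for item in items]),  # assumes fixed-shape labels
           "id": [item["id"] for item in items],
       }

   # Example: DataLoader(train_torch, batch_size=32, shuffle=True,
   #                     collate_fn=collate_cad_items)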
code-block:: python from torch.utils.data import DataLoader # Get training dataset train_dataset = loader.get_dataset("train") # Convert to PyTorch train_torch = train_dataset.to_torch() # Create DataLoader train_loader = DataLoader( train_torch, batch_size=32, shuffle=True, num_workers=4, pin_memory=True # For GPU training ) # Training loop for epoch in range(num_epochs): for batch in train_loader: # Unpack batch graphs = batch['graph'] labels = batch['label'] file_ids = batch['id'] # Your training code outputs = model(graphs) loss = criterion(outputs, labels) loss.backward() optimizer.step() The ``.to_torch()`` method returns a PyTorch-compatible dataset object that can be used directly with PyTorch's DataLoader for batching and parallel loading. Resource Management =================== The DatasetLoader creates internal resources (such as DatasetExplorer instances) that should be cleaned up when no longer needed: .. code-block:: python # Close resources loader.close_resources(clear_split_history=True) Always close the loader to free memory and clean up temporary resources, especially when running multiple experiments or in long-running processes. ************************** Complete Workflow Examples ************************** This section demonstrates complete end-to-end workflows integrating DatasetExplorer and DatasetLoader. Example 1: Basic Analysis and ML Preparation ============================================= The following example demonstrates the typical progression from dataset exploration through ML preparation: .. code-block:: python import hoops_ai from hoops_ai.flowmanager import flowtask from hoops_ai.dataset import DatasetExplorer, DatasetLoader import pathlib # Assume flow already executed and created: # - cad_pipeline.dataset # - cad_pipeline.infoset # - cad_pipeline.attribset # - cad_pipeline.flow flow_file = pathlib.Path("flow_output/flows/cad_pipeline/cad_pipeline.flow") # ===== STEP 1: Explore Dataset ===== print("="*70) print("STEP 1: DATASET EXPLORATION") print("="*70) explorer = DatasetExplorer(flow_output_file=str(flow_file)) # Print overview explorer.print_table_of_contents() # Analyze face area distribution face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=20) print(f"\nFace area distribution:") print(f" Range: [{face_dist['bin_edges'][0]:.2f}, {face_dist['bin_edges'][-1]:.2f}]") print(f" Total faces: {face_dist['hist'].sum()}") print(f" Mean bin count: {face_dist['hist'].mean():.1f}") # Filter files by complexity high_complexity_filter = lambda ds: ds['complexity_level'] >= 4 complex_files = explorer.get_file_list(group="faces", where=high_complexity_filter) print(f"\nHigh complexity files: {len(complex_files)}") # Close explorer explorer.close() # ===== STEP 2: Prepare ML Dataset ===== print("\n" + "="*70) print("STEP 2: ML DATASET PREPARATION") print("="*70) # Initialize loader flow_path = pathlib.Path(flow_file) loader = DatasetLoader( merged_store_path=str(flow_path.parent / f"{flow_path.stem}.dataset"), parquet_file_path=str(flow_path.parent / f"{flow_path.stem}.infoset") ) # Stratified split train_size, val_size, test_size = loader.split( key="complexity_level", group="faces", train=0.7, validation=0.15, test=0.15, random_state=42 ) print(f"\nDataset split:") print(f" Train: {train_size} files") print(f" Validation: {val_size} files") print(f" Test: {test_size} files") # Get datasets train_dataset = loader.get_dataset("train") val_dataset = loader.get_dataset("validation") test_dataset = loader.get_dataset("test") # ===== 
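# --- Optional sanity check (added sketch, not part of the original example) ---
# Compare the reported split sizes with the subset lengths and peek at one raw
# item (file paths/metadata only, per get_raw_data) before converting to PyTorch.
print(f"\nSplit sizes reported:  {train_size}/{val_size}/{test_size}")
print(f"Subset lengths:        {len(train_dataset)}/{len(val_dataset)}/{len(test_dataset)}")
print(f"Sample raw train item: {train_dataset.get_raw_data(0)}")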
STEP 3: Prepare for Training ===== print("\n" + "="*70) print("STEP 3: PYTORCH INTEGRATION") print("="*70) from torch.utils.data import DataLoader # Convert to PyTorch train_torch = train_dataset.to_torch() val_torch = val_dataset.to_torch() # Create data loaders train_loader = DataLoader(train_torch, batch_size=32, shuffle=True, num_workers=4) val_loader = DataLoader(val_torch, batch_size=32, shuffle=False, num_workers=4) print(f"\nDataLoaders created:") print(f" Train batches: {len(train_loader)}") print(f" Val batches: {len(val_loader)}") # Test iteration batch = next(iter(train_loader)) print(f"\nSample batch keys: {list(batch.keys())}") # ===== STEP 4: Training Loop (Skeleton) ===== print("\n" + "="*70) print("STEP 4: TRAINING (SKELETON)") print("="*70) num_epochs = 10 for epoch in range(num_epochs): print(f"\nEpoch {epoch+1}/{num_epochs}") # Training phase for batch_idx, batch in enumerate(train_loader): # Your training code here pass # Validation phase for batch in val_loader: # Your validation code here pass print("\nWorkflow complete!") loader.close_resources() This workflow illustrates the typical progression: explore and validate the merged dataset using DatasetExplorer, then prepare training data using DatasetLoader. Example 2: Advanced Analysis with Visualization ================================================ This example demonstrates multi-dimensional analysis with visualization: .. code-block:: python from hoops_ai.dataset import DatasetExplorer from hoops_ai.insights import DatasetViewer import matplotlib.pyplot as plt import numpy as np # Initialize explorer explorer = DatasetExplorer(flow_output_file="cad_pipeline.flow") # ===== Multi-Dimensional Analysis ===== # 1. Face area distribution face_dist = explorer.create_distribution(key="face_areas", group="faces", bins=30) # 2. Edge length distribution edge_dist = explorer.create_distribution(key="edge_lengths", group="edges", bins=30) # 3. 
Create visualization fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # Plot face area histogram ax1 = axes[0, 0] bin_centers = 0.5 * (face_dist['bin_edges'][1:] + face_dist['bin_edges'][:-1]) ax1.bar(bin_centers, face_dist['hist'], width=(face_dist['bin_edges'][1] - face_dist['bin_edges'][0])) ax1.set_xlabel('Face Area') ax1.set_ylabel('Count') ax1.set_title('Face Area Distribution') # Plot edge length histogram ax2 = axes[0, 1] bin_centers = 0.5 * (edge_dist['bin_edges'][1:] + edge_dist['bin_edges'][:-1]) ax2.bar(bin_centers, edge_dist['hist'], width=(edge_dist['bin_edges'][1] - edge_dist['bin_edges'][0])) ax2.set_xlabel('Edge Length') ax2.set_ylabel('Count') ax2.set_title('Edge Length Distribution') # Plot file count per bin ax3 = axes[1, 0] file_counts = [len(files) for files in face_dist['file_ids_in_bins']] ax3.plot(bin_centers, file_counts, marker='o') ax3.set_xlabel('Face Area') ax3.set_ylabel('Number of Files') ax3.set_title('Files per Face Area Bin') # Plot complexity distribution complexity_stats = explorer.get_array_statistics(group_name="faces", array_name="complexity_level") ax4 = axes[1, 1] ax4.text(0.1, 0.9, f"Mean: {complexity_stats['mean']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.8, f"Std: {complexity_stats['std']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.7, f"Min: {complexity_stats['min']:.2f}", transform=ax4.transAxes) ax4.text(0.1, 0.6, f"Max: {complexity_stats['max']:.2f}", transform=ax4.transAxes) ax4.set_title('Dataset Statistics') ax4.axis('off') plt.tight_layout() plt.savefig('dataset_analysis.png', dpi=300) plt.show() # ===== Visual Inspection ===== # Get high complexity files for visual inspection high_complexity_filter = lambda ds: ds['complexity_level'] >= 4 complex_file_codes = explorer.get_file_list(group="faces", where=high_complexity_filter) # Use DatasetViewer for visual inspection viewer = DatasetViewer.from_explorer(explorer) fig = viewer.show_preview_as_image( complex_file_codes[:25], # First 25 complex files k=25, grid_cols=5, label_format='id', figsize=(15, 12) ) plt.savefig('complex_files_preview.png', dpi=300) plt.show() explorer.close() This example demonstrates how to perform comprehensive analysis combining statistical summaries, distribution analysis, and visual inspection of the dataset. ************************** Best Practices ************************** The following best practices help ensure efficient and correct usage of the dataset exploration and loading tools. For DatasetExplorer =================== **1. Use flow_output_file parameter** This simplifies initialization and ensures correct file paths: .. code-block:: python explorer = DatasetExplorer(flow_output_file="path/to/flow.flow") **2. Close resources** Always close when done to free memory and Dask workers: .. code-block:: python explorer.close(close_dask=True) **3. Check available groups first** Use ``available_groups()`` and ``available_arrays()`` before querying: .. code-block:: python groups = explorer.available_groups() if 'faces' in groups: face_data = explorer.get_group_data('faces') **4. Print table of contents early** Understand dataset structure before analysis: .. code-block:: python explorer.print_table_of_contents() For DatasetLoader ================= **1. Set random_state** Ensure reproducible splits: .. code-block:: python loader.split(key="label", random_state=42) **2. Clean up resources** Close explorer and clear caches: .. 
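To guarantee that cleanup happens even when analysis or training code raises, both close calls can be placed in a ``finally`` block. A minimal sketch combining the two documented cleanup methods (the bare loader call follows below):

.. code-block:: python

   from hoops_ai.dataset import DatasetExplorer, DatasetLoader

   explorer = DatasetExplorer(flow_output_file="path/to/flow_name.flow")
   loader = DatasetLoader(
       merged_store_path="path/to/flow_name.dataset",
       parquet_file_path="path/to/flow_name.infoset",
   )
   try:
       # ... exploration, splitting and training experiments ...
       pass
   finally:
       explorer.close(close_dask=True)
       loader.close_resources(clear_split_history=True)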
code-block:: python loader.close_resources(clear_split_history=True) ************************** Performance Considerations ************************** Understanding performance characteristics helps optimize dataset operations for different use cases. Memory Management ================= **DatasetExplorer:** The DatasetExplorer uses Dask for out-of-core processing, enabling work with data larger than RAM. Zarr chunking enables partial array loading. Configure Dask workers based on available memory: .. code-block:: python dask_params = { 'n_workers': 4, 'threads_per_worker': 2, 'memory_limit': '8GB' # Per worker } **DatasetLoader:** The DatasetLoader keeps only indices in memory, not full data. Custom loaders should be memory-efficient. Use PyTorch DataLoader ``num_workers`` for parallel loading. Parallel Processing =================== **DatasetExplorer Parallelism:** - Distribution computation: Dask parallel histogram - Cross-group queries: Parallel joins - Subgraph search: Parallel pattern matching **DatasetLoader Parallelism:** - PyTorch DataLoader ``num_workers``: Controls loading parallelism - Set based on CPU cores: ``num_workers = min(4, cpu_count())`` - Use ``pin_memory=True`` for GPU training ************** Summary ************** **DatasetExplorer and DatasetLoader** provide a complete solution for dataset analysis and ML preparation within the HOOPS AI pipeline. Key Capabilities ================ **DatasetExplorer: Analysis & Exploration** - ✅ Query arrays by group and key - ✅ Analyze distributions with histograms - ✅ Filter files by metadata conditions - ✅ Statistical analysis and visualization - ✅ Cross-group queries and joins **DatasetLoader: ML Preparation** - ✅ Stratified train/val/test splitting - ✅ Multi-label stratification support - ✅ Framework-agnostic CADDataset - ✅ PyTorch integration with ``.to_torch()`` - ✅ Custom item loaders for preprocessing Integration with HOOPS AI Pipeline =================================== - ✅ Automatic consumption of DatasetMerger outputs - ✅ Schema-driven group and array discovery - ✅ Seamless connection to Flow-based workflows - ✅ Support for visualization assets (PNG, 3D cache) These tools complete the HOOPS AI data pipeline, enabling users to go from raw CAD files to trained ML models with minimal custom code. ************** See Also ************** For related topics and additional information: - :doc:`dataset-merger` - Understanding the data merging process that produces the input files - :doc:`cad-data-encoding` - Encoding CAD data for machine learning - :doc:`flow` - Flow-based data processing pipelines - :doc:`storage` - Storage abstractions for data persistence