hoops_ai.storage.DatasetMerger
- class hoops_ai.storage.DatasetMerger(zip_files=None, merged_store_path=None, file_id_codes=None, dask_client_params=None, delete_source_files=True)
Bases: object
Handles the merging of multiple .zip Zarr files into a single store (plus optional partial batching logic).
- Parameters:
- close_dask_resources(close_global=False)
Close the Dask client and cluster that this class manages. If close_global is True, also close any globally active Dask client.
- Parameters:
close_global (bool)
- Return type:
None
- discover_groups_from_files(max_files_to_scan=5)
Dynamically discover the group and array structure from the input files. Uses stored schemas if available and falls back to naming heuristics for backward compatibility.
Parameters:
- max_files_to_scan (int)
Maximum number of files to scan for structure discovery
Returns:
- Dict[str, Dict[str, List[str]]]
Discovered group structure in DATA_GROUPS format
- get_discovered_groups()
Get the discovered groups structure.
Returns:
- Optional[Dict[str, Dict[str, List[str]]]]
The discovered groups or None if not discovered yet
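Because get_discovered_groups() may return None, callers need to handle both cases. The following is a minimal sketch that assumes only the nested Dict[str, Dict[str, List[str]]] type documented above. The helper name summarize_groups, the word "section" for the middle level, and every group and array name in the example are hypothetical; the source does not say what the middle level represents.

```python
from typing import Dict, List, Optional

# Type alias matching the documented return type.
GroupStructure = Dict[str, Dict[str, List[str]]]


def summarize_groups(groups: Optional[GroupStructure]) -> List[str]:
    """Return one summary line per group, or a notice if nothing was discovered."""
    if groups is None:
        # Discovery has not run yet (the documented None case).
        return ["no structure discovered yet"]
    lines = []
    for group_name, sections in groups.items():
        # Count every array name listed under this group.
        array_count = sum(len(names) for names in sections.values())
        lines.append(
            f"{group_name}: {len(sections)} section(s), {array_count} array(s)"
        )
    return lines


# Hypothetical structure for illustration; real names come from the input files.
example: GroupStructure = {
    "faces": {"arrays": ["face_areas", "face_types"]},
    "edges": {"arrays": ["edge_lengths"]},
}

for line in summarize_groups(example):
    print(line)
```

In real use, the argument would be the value returned by get_discovered_groups() on a DatasetMerger instance.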
- merge_data(face_chunk=500000, edge_chunk=500000, faceface_flat_chunk=100000, batch_size=None, consolidate_metadata=True, force_compression=True)
Merges all zip_files into a single Zarr store.
If force_compression is True, the output is written in ZipStore format with a .dataset extension. If batch_size is None, all files are merged in a single pass; otherwise they are merged in partial batches, and those partial stores are then merged into the final store.
Parameters:
- face_chunk (int)
Chunk size for face arrays
- edge_chunk (int)
Chunk size for edge arrays
- faceface_flat_chunk (int)
Chunk size for faceface arrays
- batch_size (Optional[int])
If provided, process files in batches of this size
- consolidate_metadata (bool)
Whether to consolidate metadata after merging
- force_compression (bool)
If True, ensures output is a compressed .dataset file instead of a directory
If True, ensures output is a compressed .dataset file instead of a directory
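The effect of batch_size on how the input files are grouped can be pictured with the sketch below. It is not the library's implementation: plan_batches is a hypothetical helper that only shows the partitioning described above, where None yields a single batch and an integer yields consecutive batches of at most that size.

```python
from typing import List, Optional, Sequence


def plan_batches(files: Sequence[str], batch_size: Optional[int]) -> List[List[str]]:
    """Group files the way the batch_size description suggests (illustrative only)."""
    if batch_size is None:
        # No batching: everything is merged in one pass.
        return [list(files)]
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    # Consecutive slices of at most batch_size files; the last may be shorter.
    return [
        list(files[start:start + batch_size])
        for start in range(0, len(files), batch_size)
    ]


# Hypothetical file names for illustration.
files = ["a.zip", "b.zip", "c.zip", "d.zip", "e.zip"]
print(plan_batches(files, None))
print(plan_batches(files, 2))
```

With batching, each group would first be merged into a partial store, and the partial stores would then be merged into the final store.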
- print_discovered_structure()
Print the discovered group and array structure in a readable format.
- Return type:
None
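One possible end-to-end workflow built from the methods documented on this page is sketched below. It uses only the constructor keywords and method names listed above, but the call order (discover, print, merge, close) is an assumption, as is passing delete_source_files=False to keep the inputs. The run_merge helper is hypothetical, and it takes the merger class as an argument so the sketch runs without the library installed.

```python
from typing import Any, Optional, Sequence


def run_merge(
    merger_cls: Any,
    zip_files: Sequence[str],
    merged_store_path: str,
    batch_size: Optional[int] = None,
) -> Any:
    """Discover structure, merge, and always release Dask resources (a sketch)."""
    merger = merger_cls(
        zip_files=list(zip_files),
        merged_store_path=merged_store_path,
        delete_source_files=False,  # assumption: keep the inputs after merging
    )
    try:
        # Assumed order: inspect the structure first, then merge.
        merger.discover_groups_from_files(max_files_to_scan=5)
        merger.print_discovered_structure()
        merger.merge_data(batch_size=batch_size)
        return merger.get_discovered_groups()
    finally:
        # Close the client and cluster even if a step above raised.
        merger.close_dask_resources()
```

In real use, merger_cls would be hoops_ai.storage.DatasetMerger, for example run_merge(DatasetMerger, ["part1.zip", "part2.zip"], "merged", batch_size=100), where the file names and output path are placeholders.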