Package Manager
===============

The package manager module is responsible for managing the creation and organization of data transfer packages in the CCAT data transfer system. While designed to be agnostic to specific observatories and archives, it is currently being developed for a specific use case.

Current Implementation Context
------------------------------

The system is being developed primarily to facilitate data transfer from the CCAT-prime observatory to the CCAT Data Archive at the University of Cologne. This specific implementation serves as the initial use case, demonstrating the system's capabilities in a real-world scenario.

Future Extensibility
~~~~~~~~~~~~~~~~~~~~

Although the current focus is on the CCAT to University of Cologne transfer, the system is designed with extensibility in mind. The architecture allows for potential future expansion to include additional data archives or observatories.

Data Hierarchy
--------------

RawDataFiles
~~~~~~~~~~~~

RawDataFiles are the fundamental units of data created by instruments at the CCAT-prime observatory. These files contain the raw observational data and are typically associated with specific observation units.

RawDataPackages
~~~~~~~~~~~~~~~

RawDataPackages are the atomic units of storage in the CCAT system. They consist of one or more RawDataFiles grouped together, but are limited in size (typically not exceeding 50 GB). This grouping allows for more efficient storage and management of data.

DataTransferPackages
~~~~~~~~~~~~~~~~~~~~

DataTransferPackages are optimized collections of RawDataPackages created specifically for network transfer. They are designed to maximize bandwidth utilization in long-distance, high-latency networks, such as those between the CCAT-prime observatory and the University of Cologne.

Data Transfer Optimization
--------------------------

The package manager implements a strategy to optimize data transfer from the upstream observatory to downstream archives:

1. Grouping: RawDataPackages are grouped into DataTransferPackages of an optimal size (typically around 50 GB) based on experience from projects like ALMA.

2. Parallel Transfer: Multiple DataTransferPackages can be sent in parallel, utilizing available network bandwidth more effectively.

3. Load Balancing: A round-robin system is implemented to distribute DataTransferPackages across multiple transfer routes. While currently there is only one primary route (CCAT to University of Cologne), this feature is in place to support potential future expansion to multiple archives.

4. Asynchronous Processing: Celery tasks are used to handle the creation and transfer of packages asynchronously, allowing for efficient use of computational resources.

Primary and Secondary Data Transfers
------------------------------------

The system distinguishes between primary and secondary data transfers:

- Primary transfers: From the CCAT-prime observatory to the University of Cologne archive.
- Secondary transfers: While not currently in use, the system supports transfers between potential future downstream archives, using the DataTransferPackages created during primary transfers.

Database Integration
--------------------

The package manager interacts closely with the database, performing operations such as:

- Querying for RawDataPackages not yet assigned to a DataTransferPackage
- Creating new DataTransferPackage entries
- Updating the status of RawDataPackages and RawDataFiles

Error Handling and Logging
--------------------------

Comprehensive error handling and logging are implemented throughout the module to ensure robust operation and facilitate debugging. This is especially important given the distributed nature of the system and the critical nature of astronomical data.

Configuration
-------------

The module uses configuration settings to control various aspects of its operation, such as maximum package sizes and Redis keys for managing transfer routes. These settings can be adjusted to optimize performance for the specific network conditions between CCAT and the University of Cologne.

This module plays a crucial role in preparing data for transfer and managing the flow of data from the CCAT-prime observatory to the University of Cologne archive, ensuring efficient use of network resources and reliable data transmission. Its design allows for potential future expansion to include additional data archives, although this is not currently implemented.
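
The grouping and round-robin steps listed under Data Transfer Optimization can be illustrated with a minimal sketch. It is not the module's actual code: the class names mirror the data hierarchy described in this document, but their fields, the helper functions ``group_into_transfer_packages`` and ``assign_routes``, the route name ``ccat_to_cologne``, and the in-memory route cycle (the real system is said to use Redis keys) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterable, List, Sequence, Tuple

GB = 10**9
# Assumed limit, based on the "around 50 GB" figure quoted in this document.
MAX_TRANSFER_PACKAGE_BYTES = 50 * GB


@dataclass
class RawDataPackage:
    """Hypothetical stand-in for a RawDataPackage database entry."""
    name: str
    size_bytes: int


@dataclass
class DataTransferPackage:
    """Hypothetical stand-in for a DataTransferPackage database entry."""
    raw_packages: List[RawDataPackage] = field(default_factory=list)

    @property
    def size_bytes(self) -> int:
        return sum(p.size_bytes for p in self.raw_packages)


def group_into_transfer_packages(
    raw_packages: Iterable[RawDataPackage],
    max_bytes: int = MAX_TRANSFER_PACKAGE_BYTES,
) -> List[DataTransferPackage]:
    """Greedily fill transfer packages in input order up to max_bytes.

    A single raw package larger than max_bytes ends up alone in its own
    transfer package rather than being rejected.
    """
    packages: List[DataTransferPackage] = []
    current = DataTransferPackage()
    for raw in raw_packages:
        # Close the current package if adding this one would exceed the limit.
        if current.raw_packages and current.size_bytes + raw.size_bytes > max_bytes:
            packages.append(current)
            current = DataTransferPackage()
        current.raw_packages.append(raw)
    if current.raw_packages:
        packages.append(current)
    return packages


def assign_routes(
    packages: Sequence[DataTransferPackage],
    routes: Sequence[str],
) -> List[Tuple[str, DataTransferPackage]]:
    """Pair each transfer package with a route, cycling through the routes."""
    if not routes:
        raise ValueError("at least one transfer route is required")
    route_cycle = cycle(routes)
    return [(next(route_cycle), pkg) for pkg in packages]


# Usage: five 20 GB raw packages, one route (the current single-route setup).
raw = [RawDataPackage(f"raw_{i}", 20 * GB) for i in range(5)]
transfer_packages = group_into_transfer_packages(raw)
assignments = assign_routes(transfer_packages, ["ccat_to_cologne"])
for route, pkg in assignments:
    print(route, [p.name for p in pkg.raw_packages], pkg.size_bytes // GB, "GB")
```

With a single route every package is assigned to it, so the round-robin step only changes the outcome once a second route is configured. In the real module each assignment would presumably be handed to a Celery task for asynchronous transfer; that step is left out here because the task names and signatures are not given in this document.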