Site, Data Location, and Operation Concepts
===========================================

This document explains the key concepts in the CCAT data transfer system, focusing on how Sites, Data Locations, and Operations work together to manage data flow across the observatory infrastructure.

.. uml:: _static/ccat_data_flow.puml
   :caption: CCAT Data Transfer System Architecture
   :align: center

Overview
--------

The data transfer system is built around a hierarchical structure that organizes data storage and processing across different physical and logical locations. Think of it as a warehouse and logistics system where:

- **Sites** are like cities or countries
- **Data Locations** are like specific warehouses or storage facilities within those cities
- **Operations** are like the different warehouse activities (packing, unpacking, storing, shipping, etc.)

Sites
-----

A **Site** represents a physical or logical location where data can be stored or processed. Examples include:

- **CCAT** (the telescope site in Chile)
- **Cologne** (the CCAT data center)
- **Other potential sites** (e.g. other data centers, observatories, etc.)

Each site has:

- A unique name (e.g., "CCAT", "Cologne")
- A short name for technical use (e.g., "ccat", "cologne")
- A geographical location description (e.g., "Atacama", "Germany", "USA")

Think of sites as the major hubs in the data network: each represents a different physical location with its own infrastructure.

Data Locations
--------------

**Data Locations** are specific storage or processing areas within a site, like specific rooms or servers within a building.

Each data location has:

- A **Location Type** (what it is used for)
- A **Storage Type** (how data is stored)
- A **Priority** (which location to use first if multiple options exist)

Location Types
~~~~~~~~~~~~~~

The system recognizes four main types of data locations:

**SOURCE** - Telescope Instrument Computers
    These are the computers that collect raw data from observations. Think of them as the "data collection points" where observations first land. There can be multiple SOURCE locations for each instrument.

**BUFFER** - Input/Output Buffers
    These are temporary storage areas from which data can be sent and at which data can be received. They act as staging areas where data is organized and prepared for the next step. Multiple buffers can exist with different priorities.

**LONG_TERM_ARCHIVE** - Permanent Storage
    These are the final destinations for data: permanent storage systems where data is kept for long-term access. Think of them as the "vaults" where data is safely stored.

**PROCESSING** - Temporary Processing Areas
    These are specialized areas where data undergoes analysis or transformation. They are like "workshops" where data is processed before being stored or transferred.

Storage Types
~~~~~~~~~~~~~

Each data location also has a storage type that determines how data is physically stored:

**DISK** - Traditional disk storage
    Regular hard drives or SSDs. Fast access, good for temporary storage and processing.

**S3** - Object storage (like AWS S3)
    Cloud-based storage that is well suited to large amounts of data and long-term archiving.

**TAPE** - Tape storage
    Traditional tape drives for very long-term, cost-effective storage.

The storage type determines how the data location is accessed (e.g. via SSH, S3, tape, etc.).
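To make these concepts concrete, the sketch below shows one way the data model could be expressed in Python. It is an illustration only, not the actual CCAT implementation; the class and field names (``Site``, ``DataLocation``, ``LocationType``, ``StorageType``, ``priority``, ``is_active``) are assumptions chosen to mirror the attributes described above.

.. code-block:: python

    from dataclasses import dataclass
    from enum import Enum, auto


    class LocationType(Enum):
        """What a data location is used for."""
        SOURCE = auto()
        BUFFER = auto()
        LONG_TERM_ARCHIVE = auto()
        PROCESSING = auto()


    class StorageType(Enum):
        """How data is physically stored (determines the access method)."""
        DISK = auto()
        S3 = auto()
        TAPE = auto()


    @dataclass
    class Site:
        """A physical or logical place where data is stored or processed."""
        name: str        # e.g. "Cologne"
        short_name: str  # e.g. "cologne", used in technical identifiers
        location: str    # e.g. "Germany"


    @dataclass
    class DataLocation:
        """A specific storage or processing area within a site."""
        site: Site
        name: str
        location_type: LocationType
        storage_type: StorageType
        priority: int = 0       # lower number = higher priority
        is_active: bool = True  # inactive locations are skipped


    # Example: the Cologne site with a disk buffer and a tape archive.
    cologne = Site(name="Cologne", short_name="cologne", location="Germany")
    buffer_a = DataLocation(cologne, "buffer", LocationType.BUFFER, StorageType.DISK)
    archive = DataLocation(cologne, "archive", LocationType.LONG_TERM_ARCHIVE, StorageType.TAPE)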
Operations
----------

**Operations** are the different types of work that can be performed on data at each location. The system automatically determines which operations are available based on the location type:

Source Locations (Telescope Computers)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Raw Data Package Creation** - Organizing raw observation data into packages
- **Deletion** - Removing data that has been successfully transferred to the long-term archive
- **Monitoring** - Checking system health and performance

Buffer Locations (Temporary Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Data Transfer Package Creation** - Preparing data for transfer between sites
- **Data Transfer Unpacking** - Extracting data from transfer packages
- **Data Transfer** - Moving data between locations
- **Deletion** - Cleaning up temporary files
- **Long-Term Archive Transfer** - Moving data to permanent storage
- **Monitoring** - System health checks

Long-Term Archive Locations (Permanent Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Long-Term Archive Transfer** - Moving data between archive locations
- **Monitoring** - Ensuring data integrity

Processing Locations (Analysis Areas)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Staging** - Retrieving data from the site's local long-term archive
- **Deletion** - Cleaning up processed data
- **Monitoring** - System health checks

How It All Works Together
-------------------------

The system uses a queue-based approach in which the different types of work are automatically routed to the appropriate locations. Here is how it works:

1. **Data Collection**: Raw data arrives at SOURCE locations (telescope computers). Each file enters the system as a ``RawDataFile`` that is registered via the OpsDB API and linked to the ``ExecutedObsUnit`` and ``InstrumentModule`` it belongs to.
2. **Initial Processing**: Data is organized into packages (``RawDataPackage``) at SOURCE locations. This is done by the ``raw_data_package_manager``.
3. **Transfer Preparation**: Data packages are prepared for transfer (``DataTransferPackage``) at BUFFER locations. This is done by the ``data_transfer_package_manager``.
4. **Data Movement**: Data is transferred between sites using the appropriate transfer method (disk-to-disk, disk-to-S3, etc.). This is done by the ``transfer_manager``.
5. **Unpacking**: Data is extracted and verified at the destination BUFFER locations. This is done by the ``data_integrity_manager``.
6. **Archiving**: Data is moved to LONG_TERM_ARCHIVE locations for permanent storage. This is done by the ``archive_manager``.
7. **Processing**: Data can be staged to PROCESSING locations for analysis. This is done by the ``staging_manager``.
8. **Cleanup**: Temporary files are deleted as data moves through the system. This is done by the ``deletion_manager``.

Queue Routing
~~~~~~~~~~~~~

The system automatically creates queues (like different work stations in a warehouse) for each combination of location and operation. For example:

- ``ccat_telescope_computer_raw_data_package_creation``
- ``cologne_buffer_data_transfer``
- ``cornell_archive_long_term_archive_transfer``

This ensures that work is always sent to the right place and does not interfere with other operations.

Priority and Failover
~~~~~~~~~~~~~~~~~~~~~

Multiple data locations of the same type can exist at a site (like having multiple backup servers). The system uses:

- **Priority levels** (lower numbers = higher priority)
- **Active/Inactive status** to handle maintenance or failures
- **Automatic failover** when primary locations are unavailable

This means that if the main buffer is full or offline, the system automatically uses the next available buffer, as sketched below.
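The sketch below illustrates these two mechanisms: building a queue name from the site short name, location, and operation, and choosing the highest-priority active location when several exist. The function names (``queue_name``, ``pick_location``) and the simplified ``DataLocation`` stand-in are illustrative assumptions, not the actual manager code.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Iterable, Optional


    @dataclass
    class DataLocation:
        """Simplified stand-in for a data location (see the earlier sketch)."""
        site_short_name: str
        name: str
        priority: int = 0       # lower number = higher priority
        is_active: bool = True


    def queue_name(location: DataLocation, operation: str) -> str:
        """Build the per-location, per-operation queue name,
        e.g. 'cologne_buffer_data_transfer'."""
        return f"{location.site_short_name}_{location.name}_{operation}"


    def pick_location(candidates: Iterable[DataLocation]) -> Optional[DataLocation]:
        """Return the active location with the best (lowest) priority,
        or None if every candidate is inactive."""
        active = [loc for loc in candidates if loc.is_active]
        return min(active, key=lambda loc: loc.priority, default=None)


    # Example: the primary Cologne buffer is offline, so work fails over to
    # the secondary buffer and is routed to that buffer's data_transfer queue.
    buffers = [
        DataLocation("cologne", "buffer", priority=0, is_active=False),
        DataLocation("cologne", "buffer2", priority=1),
    ]
    target = pick_location(buffers)
    if target is not None:
        print(queue_name(target, "data_transfer"))  # cologne_buffer2_data_transfer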
Data Flow Example
-----------------

Here is a typical data journey:

1. **Observation**: Telescope collects data → SOURCE location (telescope computer)
2. **Package Creation**: Raw data organized → SOURCE location creates packages
3. **Transfer**: Data moved → BUFFER location at the destination site
4. **Unpacking**: Data extracted and verified → BUFFER location
5. **Archiving**: Data moved to permanent storage → LONG_TERM_ARCHIVE location
6. **Processing**: Data staged for analysis → PROCESSING location (if needed)
7. **Cleanup**: Temporary files deleted → various locations

The system handles all the routing, queuing, and error handling automatically, so data flows smoothly from collection to permanent storage without manual intervention.

Key Benefits
------------

- **Automatic Routing**: Work goes to the right place automatically
- **Fault Tolerance**: The system continues working even if some locations fail
- **Scalability**: Easy to add new sites and locations
- **Flexibility**: Different storage types for different needs
- **Monitoring**: Built-in health checks and performance tracking

This architecture allows the CCAT observatory to efficiently manage large amounts of astronomical data across multiple international sites while maintaining data integrity and system reliability.