CDI-Workflow description of the EOSC Future WP6 Task 3, Science Project 9 ‘Climate Neutral and Smart Cities’

Main Process Sequence

Description: Main Sequence of the process

Processing Agent: EOSC project team at Sikt - Norwegian Agency for Shared Services in Education and Research
Purpose: Integrate climate data from ERA5 and air quality data from the EEA with the ESS survey data
Production Environment: Sikt - Norwegian Agency for Shared Services in Education and Research acting as a participant of SP9

Overview Diagram of the Process Activities (in sequential order)

Note

Move the mouse cursor over an activity to see more information. Click on an activity to go to the corresponding page.

digraph Diagram { graph [ stylesheet="../_static/custom.css" fontnames = "svg" /* "... rock solid standards compliant SVG", see: https://graphviz.org/faq/font/#what-about-svg-fonts */ rankdir="LR" nodesep="0.7" ranksep="2" tooltip=" " ]; node [ shape="rect" style="filled, rounded" width="7" height="0.7" fontcolor="white" fillcolor="#4363d8" fontname="sans-serif" fontsize="20pt" ]; activity_9d3db1fd_79cc_4ef8_816d_3a28499a6d0a [ label="Integrate climate and air quality data with ESS" URL="../activity_9d3db1fd-79cc-4ef8-816d-3a28499a6d0a.html" target="_parent" tooltip="Process activity:\nIntegrate climate data from Copernicus ERA5 and air quality data from the European Environment Agency (EEA) with data from the European Social Survey (ESS) for Berlin, Oslo, Stockholm, Brussels, London, Paris, Vienna, Prague, Budapest, and Madrid" ]; activity_34be5d18_c1f7_4e3d_82e7_81909e0989f4 [ label="ERA5 Data (Copernicus)" URL="../activity_34be5d18-c1f7-4e3d-82e7-81909e0989f4.html" target="_parent" tooltip="Process activity:\nIngest and prepare ERA5 data (Copernicus)" ]; activity_9d3db1fd_79cc_4ef8_816d_3a28499a6d0a -> activity_34be5d18_c1f7_4e3d_82e7_81909e0989f4; activity_70e828d0_4231_4b9b_bb69_7882483fb591 [ label="ERA5 Get raw input" URL="../activity_70e828d0-4231-4b9b-bb69-7882483fb591.html" target="_parent" tooltip="Process activity:\nThe process involves obtaining NUTS (Nomenclature of Territorial Units for Statistics) polygons for the relevant regions, followed by calling a public API with GPS coordinates derived from the polygons. A single API call is made per month, resulting in a gridded data response file with a default resolution of 0.1 degrees latitude/longitude. Each month and region corresponds to one variable, resulting in over 20,000 files. The process is relatively slow: each call takes around a minute, and a substantial number of calls is needed to collect the complete dataset.
There are over 12,000 raw input files in NetCDF4 format covering the period from 1990 to 2022." ]; activity_34be5d18_c1f7_4e3d_82e7_81909e0989f4 -> activity_70e828d0_4231_4b9b_bb69_7882483fb591; activity_8a364e23_1e2f_4f75_9b61_2ad1ad39fe68 [ label="ERA5 Marshalling data" URL="../activity_8a364e23-1e2f-4f75-9b61-2ad1ad39fe68.html" target="_parent" tooltip="Process activity:\nThe process involves reading NetCDF files into pandas DataFrames, obtaining estimated population data for grids from the Global Human Settlement Layer (GHSL) data based on Eurostat, merging the population data with ERA5 data, and writing the merged data to disk in Parquet format. External experts perform quality checks on the merged data, which could be either a one-off or a regular quality assurance check. The process uses over 12,000 NetCDF4 files as input as well as data from the GHSL. The output of the process is a single Parquet file named \"Interim data for review\" with its corresponding URI." ]; activity_34be5d18_c1f7_4e3d_82e7_81909e0989f4 -> activity_8a364e23_1e2f_4f75_9b61_2ad1ad39fe68; activity_f75ac23d_46b5_4e7f_a52d_638ab34a7b81 [ label="ERA5 Data Processing" URL="../activity_f75ac23d-46b5-4e7f-a52d-638ab34a7b81.html" target="_parent" tooltip="Process activity:\nThe process involves creating a date variable from timestamps based on the time zone of each region, considering that the data is recorded hourly. It also addresses unit differences, converting Kelvin to Celsius and meters to millimeters. The data is then grouped by date, variable, and region: temperature is averaged and its maximum and minimum values are obtained, precipitation is accumulated by date, and the maximum wind gust value is identified. Moving averages are calculated for variables using different time windows (7-day, 30-day, 90-day, 365-day).
Baseline values for temperature, precipitation, wind gust, and deviations from the baseline (anomalies) are determined based on the period from 1991 to 2020. Data older than 2015 is removed, and a group-by operation is performed, collapsing the data by region using population-weighted averages. It is important to note that the ERA5 data may contain imputed and missing values. In memory, each row corresponds to a region, with mesh-blocks aggregated per day to calculate region-level values by taking the average of all variables weighted by the population of each block. The resulting data is stored to disk in CSV, SAV, or other suitable formats, as the data size remains manageable." ]; activity_34be5d18_c1f7_4e3d_82e7_81909e0989f4 -> activity_f75ac23d_46b5_4e7f_a52d_638ab34a7b81; activity_a142bdc5_8e35_4de1_94f6_c3a1e298ed79 [ label="EEA Air Quality" URL="../activity_a142bdc5-8e35-4de1-94f6-c3a1e298ed79.html" target="_parent" tooltip="Process activity:\nIngest and prepare data from EEA Air Quality" ]; activity_9d3db1fd_79cc_4ef8_816d_3a28499a6d0a -> activity_a142bdc5_8e35_4de1_94f6_c3a1e298ed79; activity_6a8934ad_b0c7_4a2c_8899_129ce9e9b4a3 [ label="EEA Get raw input" URL="../activity_6a8934ad-b0c7-4a2c-8899-129ce9e9b4a3.html" target="_parent" tooltip="Process activity:\nThe process involves obtaining a list of stations with GPS coordinates from a pan-European metadata CSV file and selecting specific stations based on their GPS coordinates and ID. Only background stations are selected. The NUTS regions, defined by their GPS coordinates for polygons, are taken into account when the data is collected." 
]; activity_a142bdc5_8e35_4de1_94f6_c3a1e298ed79 -> activity_6a8934ad_b0c7_4a2c_8899_129ce9e9b4a3; activity_d79858ab_3d27_48d2_ba6a_64f37b7dd06d [ label="EEA Marshalling data" URL="../activity_d79858ab-3d27-48d2-ba6a-64f37b7dd06d.html" target="_parent" tooltip="Process activity:\nAll the collected data, including pollutant-by-station-by-hour information, is consolidated into a single file named \"eea-stations\", with each row representing a specific pollutant, station, and hour. The merged data is stored in Parquet format, serving as a checkpoint for external reviewers to validate the analysis." ]; activity_a142bdc5_8e35_4de1_94f6_c3a1e298ed79 -> activity_d79858ab_3d27_48d2_ba6a_64f37b7dd06d; activity_c6b1e1d8_2e41_441c_83a7_3e809d7a07db [ label="EEA Data Processing" URL="../activity_c6b1e1d8-2e41-441c-83a7-3e809d7a07db.html" target="_parent" tooltip="Process activity:\nThe process involves collapsing the data on time to group it by day, using maximum values. It is then collapsed by region, also using the maximum value for each region. Index variables are calculated based on the concentrations provided by the EEA. These index variables, which classify air quality (e.g., Good, Fair, Moderate), are based on the European Air Quality Index as of August 2023. Derived variables are computed to determine the worst quality and the number of days with poor quality for different time periods, with specific dates serving as reference points. The resulting data is stored as a file in a format suited to its size." 
]; activity_a142bdc5_8e35_4de1_94f6_c3a1e298ed79 -> activity_c6b1e1d8_2e41_441c_83a7_3e809d7a07db; activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184 [ label="Merging of ERA5, EEA and ESS data" URL="../activity_25c58b1b-c5ed-48fc-8e53-4d2fdd682184.html" target="_parent" tooltip="Process activity:\nMerging of ERA5, EEA and ESS data" ]; activity_9d3db1fd_79cc_4ef8_816d_3a28499a6d0a -> activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184; activity_970e8311_50e6_480f_b227_749354e5f879 [ label="Merging of ERA5, EEA and ESS8 data" URL="../activity_970e8311-50e6-480f-b227-749354e5f879.html" target="_parent" tooltip="Process activity:\nThe process involves merging ERA5, EEA, and ESS8 data, selecting only the observations relevant to the specified regions based on coverage, and saving the integrated data as a single file for ESS Round 8." ]; activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184 -> activity_970e8311_50e6_480f_b227_749354e5f879; activity_b9e2a47a_1391_4169_a207_bd3a53a50dd2 [ label="Merging of ERA5, EEA and ESS9 data" URL="../activity_b9e2a47a-1391-4169-a207-bd3a53a50dd2.html" target="_parent" tooltip="Process activity:\nThe process involves merging ERA5, EEA, and ESS9 data, selecting only the observations relevant to the specified regions based on coverage, and saving the integrated data as a single file for ESS Round 9." ]; activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184 -> activity_b9e2a47a_1391_4169_a207_bd3a53a50dd2; activity_213886c1_9e5d_4d9d_8d3e_96b017181769 [ label="Merging of ERA5, EEA and ESS10 data" URL="../activity_213886c1-9e5d-4d9d-8d3e-96b017181769.html" target="_parent" tooltip="Process activity:\nThe process involves merging ERA5, EEA, and ESS10 face-to-face data, selecting only the observations relevant to the specified regions based on coverage, and saving the integrated data as a single file for ESS Round 10." 
]; activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184 -> activity_213886c1_9e5d_4d9d_8d3e_96b017181769; activity_13e9d99b_d48f_4a8c_934e_2b9a8d942e4e [ label="Merging of ERA5, EEA and ESS10 Self-completion data" URL="../activity_13e9d99b-d48f-4a8c-934e-2b9a8d942e4e.html" target="_parent" tooltip="Process activity:\nThe process involves merging ERA5, EEA, and ESS10 Self-completion data, selecting only the observations relevant to the specified regions based on coverage, and saving the integrated data as a single file for ESS Round 10 (Self-completion)." ]; activity_25c58b1b_c5ed_48fc_8e53_4d2fdd682184 -> activity_13e9d99b_d48f_4a8c_934e_2b9a8d942e4e; }
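
The "ERA5 Data Processing" activity in the diagram (local date derivation, unit conversion, daily aggregation, population-weighted collapse by region, moving averages) can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the column names (region, cell_id, time, t2m_kelvin, precip_m, gust, population) and the tz_by_region mapping are all hypothetical, timestamps are assumed to be timezone-aware UTC, and the input is assumed to have no missing values and no gaps between days.

```python
import pandas as pd


def to_local_date(hourly: pd.DataFrame, tz_by_region: dict) -> pd.Series:
    """Calendar date of each hourly record in its region's own time zone."""
    parts = []
    for region, group in hourly.groupby("region"):
        local = group["time"].dt.tz_convert(tz_by_region[region])
        # Drop the zone info, then truncate to midnight of the local day.
        parts.append(local.dt.tz_localize(None).dt.normalize())
    return pd.concat(parts).reindex(hourly.index)


def daily_cell_values(hourly: pd.DataFrame, tz_by_region: dict) -> pd.DataFrame:
    """Convert units and aggregate hourly grid-cell records to daily values."""
    df = hourly.copy()
    df["date"] = to_local_date(df, tz_by_region)
    df["temp_c"] = df["t2m_kelvin"] - 273.15   # Kelvin to Celsius
    df["precip_mm"] = df["precip_m"] * 1000.0  # meters to millimeters
    return df.groupby(["region", "cell_id", "date"], as_index=False).agg(
        temp_mean=("temp_c", "mean"),
        temp_max=("temp_c", "max"),
        temp_min=("temp_c", "min"),
        precip_sum=("precip_mm", "sum"),
        gust_max=("gust", "max"),
        population=("population", "first"),  # assumed constant per cell
    )


def population_weighted(daily: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Collapse grid cells to one row per region and date, weighting by population."""
    weighted = daily[["region", "date", "population"]].copy()
    for col in value_cols:
        weighted[col] = daily[col] * daily["population"]
    sums = weighted.groupby(["region", "date"]).sum()
    # Weighted mean = sum(value * population) / sum(population).
    return sums[value_cols].div(sums["population"], axis=0).reset_index()


def add_moving_average(region_daily: pd.DataFrame, col: str, window_days: int) -> pd.DataFrame:
    """Add a trailing moving average per region (assumes one row per day)."""
    out = region_daily.sort_values(["region", "date"]).copy()
    out[f"{col}_ma{window_days}"] = out.groupby("region")[col].transform(
        lambda s: s.rolling(window_days, min_periods=1).mean()
    )
    return out
```

Calling daily_cell_values, then population_weighted, then add_moving_average with windows of 7, 30, 90 and 365 days mirrors the order of steps in the tooltip. Baseline and anomaly computation and the removal of data from before 2015 are left out of this sketch.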