{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3454c683",
   "metadata": {},
   "source": [
    "# MAUNA LOA CO₂ FLASK DATA ANALYSIS\n",
    "## A Beginner's Guide (Python Version)\n",
    "\n",
    "This tutorial will walk you through how to work with real atmospheric CO₂ data collected at the **Mauna Loa Observatory** in Hawaii. This data is part of NOAA's Global Monitoring Laboratory (GML) long-term record of greenhouse gases. The Mauna Loa Observatory provides one of the longest continuous records of atmospheric CO₂, making it a vital source for atmospheric research worldwide.\n",
    "\n",
    "Whether you're new to Python, atmospheric science, or data analysis, this guide is designed to help you get started step-by-step.\n",
    "\n",
    "---\n",
    "\n",
    "### In this tutorial, you'll learn how to:\n",
    "\n",
    "- **Load data** in two common scientific formats:\n",
    "  - Plain text (.txt)\n",
    "  - NetCDF (.nc)\n",
    "\n",
    "- **Explore the dataset structure**, including:\n",
    "  - Viewing metadata\n",
    "  - Listing available variables\n",
    "  - Checking data size and data types\n",
    "\n",
    "- **Filter the data**, such as:\n",
    "  - Removing rejected values (flagged entries)\n",
    "  - Narrowing down by date range or custom conditions\n",
    "\n",
    "- **Visualize CO₂ levels** over time using Python plotting tools\n",
    "\n",
    "- **Apply NOAA GML’s curve-fitting method** to:\n",
    "  - Identify long-term trends\n",
    "  - Separate out seasonal variations\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56824ebc",
   "metadata": {},
   "source": [
    "## Step 1: Import Required Python Packages\n",
    "\n",
    "To begin our analysis, we need to load some Python libraries. These packages will help us:\n",
    "- Load and work with data tables\n",
    "- Perform calculations and filter data\n",
    "- Create time series plots\n",
    "\n",
    "We’ll also import two custom Python modules provided by NOAA GML:\n",
    "- `ccg_filter.py`: contains the filtering and curve-fitting tools\n",
    "- `ccg_dates.py`: provides date conversion and handling functions used in GML’s analysis\n",
    "\n",
    "**Make sure these two files (`ccg_filter.py` and `ccg_dates.py`) are in the same folder as this notebook.**\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5616c75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import common Python packages\n",
    "import pandas as pd  # For loading and working with tables (dataframes)\n",
    "import matplotlib.pyplot as plt  # For plotting\n",
    "import matplotlib.dates as mdates  # For formatting time axes in plots\n",
    "\n",
    "# Import NOAA GML custom modules\n",
    "import ccg_dates  # Date utilities\n",
    "import ccg_filter  # Data filtering and curve fitting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f6872b9",
   "metadata": {},
   "source": [
    "#### Tip: Set Default Plotting Styles Early\n",
    "\n",
    "You can customize the default appearance of all your plots by setting `matplotlib` parameters at the beginning of your notebook. This isn’t required, but it helps keep your plots consistent and easier to read.\n",
    "\n",
    "This is especially useful when you're making multiple plots throughout your analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f8fa4f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set default plot styles for consistent formatting\n",
    "plt.rcParams['font.size'] = 16           # Base font size for text\n",
    "plt.rcParams['axes.titlesize'] = 18      # Title font size for each plot\n",
    "plt.rcParams['axes.labelsize'] = 16      # Font size for axis labels\n",
    "plt.rcParams['xtick.labelsize'] = 14     # Font size for tick labels on x-axis\n",
    "plt.rcParams['ytick.labelsize'] = 14     # Font size for tick labels on y-axis\n",
    "plt.rcParams['legend.fontsize'] = 14     # Font size for plot legends"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ef7d18a",
   "metadata": {},
   "source": [
    "## Step 2: Load the CO₂ Data (Text Format)\n",
    "\n",
    "NOAA's Global Monitoring Laboratory provides Mauna Loa CO₂ measurements in an ASCII text format. This file contains both metadata (information about the data) and the actual measurement values.\n",
    "If you prefer to use the NetCDF file instead of the text file, see the Appendix for loading and processing instructions.\n",
    "\n",
    "In this step, we'll:\n",
    "- Locate and read the file\n",
    "- Skip the metadata lines\n",
    "- Load the data into a table (called a DataFrame) using the `pandas` library\n",
    "\n",
    "To learn more about how these CO₂ flask samples are collected, visit:\n",
    "https://gml.noaa.gov/ccgg/behind_the_scenes/surface.html "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c2f05b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Insert the path to your downloaded ASCII text file\n",
    "file_path = \"co2_mlo_surface-flask_1_ccgg_event.txt\"\n",
    "\n",
    "# First, find out how many header lines there are (they start with \"#\")\n",
    "with open(file_path, \"r\") as file:\n",
    "    header_lines = [line.strip() for line in file if line.startswith('#')]\n",
    "\n",
    "# Count the number of header lines so we can skip them when reading the data\n",
    "n = len(header_lines)\n",
    "\n",
    "# Print the number of lines to skip\n",
    "print(f\"Number of header lines: {n}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "272ab99d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now read the data, skipping the header lines\n",
    "flask_data = pd.read_csv(\n",
    "    file_path,\n",
    "    sep=r'\\s+',       # Use whitespace as the separator\n",
    "    skiprows=n,      # Skip the metadata lines\n",
    "    header=0         # First remaining line is the column header\n",
    ")\n",
    "\n",
    "# Display the first 5 rows of the data\n",
    "flask_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c8a5740",
   "metadata": {},
   "source": [
    "## Step 3: Explore the Data\n",
    "\n",
    "Now that we have loaded the CO₂ data, let’s explore the dataset.  \n",
    "\n",
    "\n",
    "We will:\n",
    "- Review metadata or file header info\n",
    "- Examine column names and data types\n",
    "- Check dataset size (rows × columns)\n",
    "- Verify date/time formats\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fade73fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the metadata/header rows at the top of the file\n",
    "with open(file_path, \"r\") as file:\n",
    "    header_lines = [line.strip() for line in file if line.startswith('#')]\n",
    "\n",
    "print(\"File Header / Metadata:\\n\")\n",
    "for line in header_lines:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cf19fde",
   "metadata": {},
   "source": [
    ">The header metadata gives context about how the data was collected and what each column means - don't skip it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f75bae9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the column names in the DataFrame\n",
    "flask_data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad64daee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the number of rows and columns (shape of the DataFrame)\n",
    "flask_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "896ea463",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the data type of each column\n",
    "flask_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0782867b",
   "metadata": {},
   "source": [
    "For easier plotting and analysis, we will convert the 'datetime' column values to a datetime object. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e417f56",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the 'datetime' column to actual datetime objects, if it's not already\n",
    "flask_data['datetime'] = pd.to_datetime(flask_data['datetime'])\n",
    "\n",
    "# Check data types again to confirm the change\n",
    "flask_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f75dddf6",
   "metadata": {},
   "source": [
    "## Step 4: Filter the Data and Plot CO₂ Levels Over Time\n",
    "\n",
    "Quality control (QC) flags are included in this dataset to help identify measurements that should be rejected for scientific use. \n",
    "\n",
    "### Quality Control (QC) Flags\n",
    "\n",
    "The `qcflag` column uses a three-character code to describe measurement quality. Example: `\"C..\"`, `\".S.\"`, or `\"...\"`.\n",
    "\n",
    "- If the first character **is not** a dot (`.`), it means the measurement was **rejected** by the principal investigator.\n",
    "- If the first character **is** a dot (`.`), the measurement is considered **valid**.\n",
    "\n",
    "To ensure we're working with only high-quality data — and to keep our dataset manageable — we’ll:\n",
    "\n",
    "- Remove any rows where the **first character in `qcflag` is not `.`** (indicating the measurement was rejected)\n",
    "- Limit the dataset to a span of **15 years** (2008-2023)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0982177",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter to include only measurements from 2008-2023\n",
    "flask_data = flask_data[(flask_data['year'] >= 2008) &\n",
    "                        (flask_data['year'] <= 2023)]\n",
    "\n",
    "# Create a mask for QC flags that start with '.' (valid data)\n",
    "valid_qc_mask = flask_data['qcflag'].str.match(r'^\\..*')\n",
    "\n",
    "# Filter to include only valid data\n",
    "good_flask_data = flask_data[valid_qc_mask].reset_index(drop=True)\n",
    "\n",
    "# Display the first few rows of the cleaned data\n",
    "good_flask_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5c37582",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the new shape of the filtered dataset\n",
    "# The number of columns will remain the same, but the amount of rows will decrease\n",
    "good_flask_data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3bea5ab",
   "metadata": {},
   "source": [
    "The plot below shows **both valid and rejected** flask CO₂ measurements. \n",
    "\n",
    "Rejected data points often appear as outliers or noisy scatter, making it harder to see the real pattern in atmospheric CO₂ levels.\n",
    "\n",
    "> *Each point represents one flask sample, plotted by sample date vs. CO₂ mole fraction.*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a55a0c2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot showing both valid and rejected measurements\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "# Create mask for valid data (qcflag starts with '.')\n",
    "mask = flask_data['qcflag'].str.match(r'^\\..*')\n",
    "\n",
    "# Plot valid points\n",
    "plt.scatter(\n",
    "    flask_data.loc[mask, 'datetime'],\n",
    "    flask_data.loc[mask, 'value'],\n",
    "    s=25,\n",
    "    label='Valid Data (.**)'  # QC flag starts with a dot\n",
    ")\n",
    "\n",
    "# Plot rejected points (first character is not a dot)\n",
    "plt.scatter(\n",
    "    flask_data.loc[~mask, 'datetime'],\n",
    "    flask_data.loc[~mask, 'value'],\n",
    "    s=25,\n",
    "    color='red',\n",
    "    label='Rejected Data (*..)'\n",
    ")\n",
    "\n",
    "# Format axes\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO₂ at Mauna Loa Observatory: Full Dataset\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO₂ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81fe8c62",
   "metadata": {},
   "source": [
    "### Now Plot Only the Valid Data\n",
    "\n",
    "Filtering out rejected and missing values reveals clearer patterns in the time series.\n",
    "\n",
    "> *Note: The range of CO₂ mole fraction values decreases significantly - from a spread of ~300 ppm to a much cleaner ~50 ppm.*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c39e50f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot valid data only\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(good_flask_data['datetime'], good_flask_data['value'], s=25)\n",
    "\n",
    "# Format axes\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO₂ at Mauna Loa Observatory: Valid Data Only\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO₂ mole fraction (ppm)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5af0c083",
   "metadata": {},
   "source": [
    "#### Additional Data Exploration\n",
    "\n",
    "To deepen our understanding, let's calculate some summary statistics on the filtered dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4b0092f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Summary statistics for CO₂ levels 2008 - 2023 (valid data only):\")\n",
    "print(good_flask_data['value'].describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10a5d3dd",
   "metadata": {},
   "source": [
    "We can also visualize the distribution of CO₂ measurements using a histogram: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ca3e349",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(10, 6))\n",
    "plt.hist(good_flask_data['value'], bins=30, color='skyblue', edgecolor='black')\n",
    "plt.title(\"Distribution of CO$_2$ Mole Fraction (ppm)\")\n",
    "plt.xlabel(\"CO$_2$ (ppm)\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63464bb9",
   "metadata": {},
   "source": [
    "## Step 5: Apply NOAA GML's Curve Fitting Method\n",
    "\n",
    "This method helps us to extract important patterns in the CO₂ data:  \n",
    "- A long-term trend (overall rise in CO₂)  \n",
    "- A yearly seasonal cycle (non-sinusoidal oscillations)  \n",
    "- Short-term variations\n",
    "\n",
    "\n",
    "This technique is based on the work by Thoning et al. (1989) and NOAA GML’s implementation. For more details and to follow along with the full tutorial, visit: https://gml.noaa.gov/ccgg/mbl/crvfit/crvfit.html\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9daf89b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set x and y data for the filter\n",
    "xp = good_flask_data['time_decimal']\n",
    "yp = good_flask_data['value']\n",
    "\n",
    "# Create filter object using NOAA GML's curve fitting class\n",
    "filt = ccg_filter.ccgFilter(\n",
    "    xp,  # Time values in decimal years\n",
    "    yp,  # CO2 mole fraction values\n",
    "    shortterm=80,        # Short term smoothing cutoff in days\n",
    "    longterm=667,        # Long term trend cutoff in days\n",
    "    # Interval for evenly spaced data (0 means use original spacing)\n",
    "    sampleinterval=0,\n",
    "    numpolyterms=3,      # Polynomial terms (quadratic)\n",
    "    numharmonics=4,      # Number of harmonics for seasonal cycle\n",
    "    timezero=-1,         # Reference time zero for polynomial fitting\n",
    "    gap=0,               # No gap-filling in interpolation\n",
    "    use_gain_factor=False,  # Disable gain factor in harmonics\n",
    "    debug=False          # Disable debug messages\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a32b08c0",
   "metadata": {},
   "source": [
    "We can now extract components of the fitted model for plotting:\n",
    "\n",
    "- `x0`: equally spaced time points interpolated from xp  \n",
    "- `y1`: function fit value at time x\n",
    "- `y2`: polynomial fit value at time x\n",
    "- `y3`: smoothed data value at time x\n",
    "- `y4`: trend at time x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49a2819f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract components from the filter\n",
    "x0 = filt.xinterp\n",
    "y1 = filt.getFunctionValue(x0)\n",
    "y2 = filt.getPolyValue(x0)\n",
    "y3 = filt.getSmoothValue(x0)\n",
    "y4 = filt.getTrendValue(x0)\n",
    "\n",
    "# Convert decimal dates to datetime for plotting\n",
    "x0_dates = [ccg_dates.datetimeFromDecimalDate(i) for i in x0]\n",
    "xp_dates = [ccg_dates.datetimeFromDecimalDate(i) for i in xp]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539faa1f",
   "metadata": {},
   "source": [
    "We can plot various aspects of the `filt` object to visualize the different components: a long-term trend, a non-sinusoidal yearly cycle, and short term variations."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88723280",
   "metadata": {},
   "source": [
    "### Plot 1: Function Fit vs. Measured Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a36d34c",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(25, 12))\n",
    "plt.scatter(xp_dates, yp, s=30, alpha=0.9,\n",
    "            edgecolors='black', label=\"Measured Data\")\n",
    "plt.plot(x0_dates, y1, color='orange', linewidth=3,\n",
    "         alpha=0.9, label=\"Function Fit\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))  # Minor ticks every year\n",
    "plt.gca().xaxis.set_major_locator(\n",
    "    mdates.YearLocator(3))  # Major ticks every 3 years\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4356de35",
   "metadata": {},
   "source": [
    "### Plot 2: Polynomial and Function Fit vs. Measured Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d8eed29",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(25, 12))\n",
    "plt.scatter(xp_dates, yp, s=30, alpha=0.9,\n",
    "            edgecolors='black', label=\"Measured Data\")\n",
    "plt.plot(x0_dates, y1, color='orange', linewidth=3,\n",
    "         alpha=0.9, label=\"Function Fit\")\n",
    "plt.plot(x0_dates, y2, color='cyan', linewidth=3,\n",
    "         alpha=0.9, label=\"Polynomial\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9ab060a",
   "metadata": {},
   "source": [
    "\n",
    "### Plot 3: Residuals from Function Fit, Smoothed Residuals, and Trend Components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55488065",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Residuals from function, smoothed residuals, and trend of residuals\n",
    "resid_from_func = filt.resid\n",
    "resid_smooth = filt.smooth\n",
    "resid_trend = filt.trend\n",
    "\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(xp_dates, resid_from_func, s=30, edgecolors='black',\n",
    "            label=\"Residuals From Function Fit\")\n",
    "plt.plot(x0_dates, resid_smooth, c='red', linewidth=3,\n",
    "         alpha=0.8, label=\"Smoothed Residuals\")\n",
    "plt.plot(x0_dates, resid_trend, c='blue', linewidth=3,\n",
    "         alpha=0.8, label=\"Trend of Residuals\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('Residual CO$_2$ (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "427db0f9",
   "metadata": {},
   "source": [
    "At this stage, the curve fitting process is complete. Now, we combine different components of the fitted function and filtered data to isolate the specific signals we're interested in analyzing. The key components are:\n",
    "\n",
    "- **Smoothed Data**: Represents the data with only short-term fluctuations removed. It combines the fitted function with residuals filtered using the short-term cutoff.\n",
    "- **Trend**: Captures the long-term upward movement in CO₂ levels after removing seasonal cycles. This includes the polynomial portion of the fit plus residuals filtered with the long-term cutoff.\n",
    "- **Detrended Seasonal Cycle**: Shows the yearly oscillation after removing the overall trend. This consists of the harmonic (seasonal) part of the fit combined with short-term filtered residuals.\n",
    "- **Seasonal Amplitude**: Measures the size of the seasonal cycle, calculated as the difference between peak and trough in the detrended seasonal cycle.\n",
    "- **Growth Rate**: The rate at which the long-term trend increases, found by taking the first derivative of the trend component.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0746346",
   "metadata": {},
   "source": [
    "## Plot 4: Measured Data with Smoothed and Trend Curves\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d03e665f",
   "metadata": {},
   "outputs": [],
   "source": [
    "trend = filt.getTrendValue(xp)\n",
    "\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(xp_dates, yp, s=35, edgecolors='black', label=\"Measured Data\")\n",
    "plt.plot(xp_dates, trend, c='blue', linewidth=3,\n",
    "         alpha=0.8, label=\"Trend Curve\")\n",
    "plt.plot(x0_dates, y3, c='red', linewidth=3, alpha=0.8, label=\"Smoothed Curve\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94b61e8c",
   "metadata": {},
   "source": [
    "### Plot 5: Smoothed Seasonal Cycle\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e047044",
   "metadata": {},
   "outputs": [],
   "source": [
    "detrend = yp - trend\n",
    "harmonics = filt.getHarmonicValue(x0)\n",
    "smooth_cycle = harmonics + filt.smooth - filt.trend\n",
    "\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(xp_dates, detrend, s=40, edgecolors='black',\n",
    "            label=\"Detrended Seasonal Cycle\")\n",
    "plt.plot(x0_dates, smooth_cycle, c='red', linewidth=3,\n",
    "         alpha=0.8, label=\"Smoothed Curve\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('Detrended CO$_2$ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c8f9a1b",
   "metadata": {},
   "source": [
    "### Plot 6: Growth Rate of the Trend Curve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e407a54",
   "metadata": {},
   "outputs": [],
   "source": [
    "growth_rate = filt.getGrowthRateValue(xp)\n",
    "\n",
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.plot(xp_dates, growth_rate, c='green', linewidth=3,\n",
    "         alpha=0.8, label=\"Trend Curve Growth Rate\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ Growth Rate (ppm/year)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ddac603",
   "metadata": {},
   "source": [
    "## Step 6: Monthly Means\n",
    "\n",
    "Using monthly means computed by the filter, we can create a DataFrame and plot to visualize monthly variations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "216521b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get monthly means (year, month, avg, sd, n)\n",
    "mm = filt.getMonthlyMeans()\n",
    "\n",
    "mm_df = pd.DataFrame(mm, columns=['year', 'month', 'avg', 'sd', 'n'])\n",
    "\n",
    "# Convert year + month to datetime for plotting\n",
    "mm_df['date'] = pd.to_datetime(mm_df[['year', 'month']].assign(day=1))\n",
    "\n",
    "mm_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09c784ad",
   "metadata": {},
   "source": [
    "### Plot 7: Monthly Means with Measured Data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73254a96",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(xp_dates, yp, s=20, edgecolors='black', label=\"Measured Data\")\n",
    "plt.plot(mm_df['date'], mm_df['avg'], linestyle='-', marker='o', markersize=8,\n",
    "         color='orange', markeredgecolor=\"black\", label=\"Monthly Mean\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ mole fraction (ppm)')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d468087a",
   "metadata": {},
   "source": [
    "## Summary and Takeaways\n",
    "\n",
    "In this notebook, you learned how to work with real atmospheric CO₂ data from Mauna Loa Observatory. You practiced loading and exploring data, filtering out invalid measurements, and visualizing CO₂ levels over time. \n",
    "\n",
    "Then, you applied NOAA’s curve fitting method to separate long-term trends from seasonal and short-term variations in the data. This technique helps scientists better understand how CO₂ is changing in the atmosphere.\n",
    "\n",
    "By completing these steps, you gained hands-on experience with data cleaning, time series analysis, and advanced curve fitting - all important skills for studying environmental data.\n",
    "\n",
    "Keep exploring and experimenting with these tools to deepen your understanding of atmospheric science and data analysis!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a79f802",
   "metadata": {},
   "source": [
    "## Appendix: Loading CO₂ Data (NetCDF Format)\n",
    "\n",
    "In addition to the text file format, NOAA provides atmospheric CO₂ flask measurements in **NetCDF format** — a widely-used format for storing structured scientific data.\n",
    "\n",
    "This section shows how to:\n",
    "\n",
    "- Open the NetCDF file using the `xarray` package\n",
    "- View metadata and available variables\n",
    "- Convert selected variables into a `pandas.DataFrame`\n",
    "- Use the resulting DataFrame with the rest of this tutorial\n",
    "\n",
    "> *This is optional. The NetCDF and text files contain the same data. You only need to load one.*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a482adf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# *Note: You may need to install the package netcdf4, which is a dependency of xarray. This will likely not be included in your Anaconda Python environment.\n",
    "\n",
    "# if necessary, uncomment the following line and run this cell to install\n",
    "# pip install netcdf4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fec263c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import required libraries\n",
    "import xarray as xr\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.dates as mdates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75fed278",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to your NetCDF file\n",
    "file_path = \"co2_mlo_surface-flask_1_ccgg_event.nc\"\n",
    "\n",
    "# Open dataset\n",
    "ds = xr.open_dataset(file_path)\n",
    "\n",
    "# View dataset structure\n",
    "ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "720cbb5e",
   "metadata": {},
   "source": [
    "> The unique format of the NetCDF file allows the user to view both the metadata and dataset variables with only the command above - no need for extra print statements. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9034014",
   "metadata": {},
   "source": [
    "We'll extract a subset of useful fields to create a structured DataFrame for analysis comparable to that performed above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e9456d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Variables to include in the DataFrame\n",
    "variables_to_use = [\n",
    "    'datetime', 'time_decimal', 'air_sample_container_id', 'value',\n",
    "    'value_unc', 'latitude', 'longitude', 'altitude', 'elevation',\n",
    "    'intake_height', 'method', 'event_number', 'instrument',\n",
    "    'analysis_datetime', 'qcflag'\n",
    "]\n",
    "\n",
    "# Create the DataFrame\n",
    "flask_data_nc = ds[variables_to_use].to_dataframe().reset_index()\n",
    "\n",
    "# Decode all byte string columns in flask_data_nc\n",
    "for col in flask_data_nc.columns:\n",
    "    if flask_data_nc[col].dtype == object and isinstance(flask_data_nc[col].iloc[0], bytes):\n",
    "        flask_data_nc[col] = flask_data_nc[col].str.decode('utf-8')\n",
    "\n",
    "\n",
    "# Preview the data\n",
    "flask_data_nc.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4f67f91",
   "metadata": {},
   "source": [
    "You can now use the `flask_data_nc` DataFrame (created from the NetCDF file) just like the text-based version. The structure is the same, so the rest of the analysis will work in the same way.\n",
    "\n",
    "\n",
    "So, continue with filtering, plotting, and curve fitting just as described in steps 3-5 above:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f04fe616",
   "metadata": {},
   "source": [
    "### Step 3(b): Explore the NetCDF DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f011c7c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names\n",
    "flask_data_nc.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4de6f7c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dimensions of the DataFrame: (rows, columns)\n",
    "flask_data_nc.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2e16f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data types of each column\n",
    "flask_data_nc.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "713f6700",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the 'datetime' column to datetime format (if it isn't already)\n",
    "flask_data_nc['datetime'] = pd.to_datetime(flask_data_nc['datetime'])\n",
    "flask_data_nc.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0bdd9d7",
   "metadata": {},
   "source": [
    "### Step 4(b): Filter and Visualize the CO₂ Data\n",
    "\n",
    "We'll remove any rows that:\n",
    "\n",
    "- Have a `qcflag` starting with any character other than `\".\"`\n",
    "- Are not in the time range of 2008 - 2023\n",
    "\n",
    "For simplicity, we will only include the plot of the filtered, valid CO₂ data points. The process of filtering to remove invalid or questionable data works the same way regardless of the data source, whether from the text file or the NetCDF file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ff49fee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter to years 2008-2023\n",
    "flask_data_nc = flask_data_nc[(flask_data_nc['datetime'].dt.year >= 2008) &\n",
    "                              (flask_data_nc['datetime'].dt.year <= 2023)]\n",
    "\n",
    "# Keep only entries where the first character of the qcflag is \".\"\n",
    "good_nc_data = flask_data_nc[flask_data_nc['qcflag'].str.match(\n",
    "    r'^\\..*')].reset_index(drop=True)\n",
    "\n",
    "good_nc_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6f8035c",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(25, 12))\n",
    "\n",
    "plt.scatter(good_nc_data['datetime'], good_nc_data['value'], s=25)\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Atmospheric CO$_2$ at Mauna Loa Observatory: Valid Data Only (NetCDF)\")\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('CO$_2$ mole fraction (ppm)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7568b2df",
   "metadata": {},
   "source": [
    "### Step 5(b): Apply Curve Fitting to NetCDF Data\n",
    "In this step, we begin the process of curve fitting and extracting the signal components from the CO₂ data.\n",
    "\n",
    "Since the NetCDF data structure can vary slightly, we'll demonstrate the first few cells here. The rest of the analysis follows the same process as with the text-based dataset, so you can apply those steps directly once the initial data preparation is done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae25a5b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "xp = good_nc_data['time_decimal']\n",
    "yp = good_nc_data['value']\n",
    "\n",
    "filt = ccg_filter.ccgFilter(xp, yp,\n",
    "                            shortterm=80,\n",
    "                            longterm=667,\n",
    "                            sampleinterval=0,\n",
    "                            numpolyterms=3,\n",
    "                            numharmonics=4,\n",
    "                            timezero=-1,\n",
    "                            gap=0,\n",
    "                            use_gain_factor=False,\n",
    "                            debug=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3676a28",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get time and curve components\n",
    "x0 = filt.xinterp\n",
    "y1 = filt.getFunctionValue(x0)\n",
    "y2 = filt.getPolyValue(x0)\n",
    "y3 = filt.getSmoothValue(x0)\n",
    "y4 = filt.getTrendValue(x0)\n",
    "\n",
    "# Convert to datetime for plotting\n",
    "x0_dates = [ccg_dates.datetimeFromDecimalDate(i) for i in x0]\n",
    "xp_dates = [ccg_dates.datetimeFromDecimalDate(i) for i in xp]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ee21a85",
   "metadata": {},
   "source": [
    "#### Plot Function Fit with Flask Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ecc8889",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(25, 12))\n",
    "plt.scatter(xp_dates, yp, s=30, edgecolors='black', label=\"Measured Data\")\n",
    "plt.plot(x0_dates, y1, color='orange', linewidth=3, label=\"Function Fit\")\n",
    "\n",
    "plt.gca().xaxis.set_minor_locator(mdates.YearLocator(1))\n",
    "plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))\n",
    "plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))\n",
    "plt.grid(True, which='both', axis='x', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "plt.grid(True, which='major', axis='y', linestyle='--',\n",
    "         linewidth=0.7, color='lightgray')\n",
    "\n",
    "plt.title(\"Function Fit: Atmospheric CO₂ at Mauna Loa (NetCDF)\")\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"CO₂ mole fraction (ppm)\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab1cc1e0",
   "metadata": {},
   "source": [
    "This appendix demonstrated how to load and prepare atmospheric CO₂ data from a NetCDF file using the `xarray` and `pandas` libraries. While the core dataset is the same as the text version, some formatting steps, like converting byte strings, are unique to NetCDF.\n",
    "\n",
    "Once processed, the NetCDF data can be used in exactly the same way as the text-based data for filtering, curve fitting, and visualization. This flexibility allows you to work with whichever format best suits your needs or workflow.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "prod6",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}