platform-demo-scripts/notebooks/Transforming mseeds from Lumineos to SeisBench dataset.ipynb

2023-07-05 09:58:06 +02:00
{
"cells": [
{
"cell_type": "markdown",
"id": "c6ec59ca-b58c-443c-9a98-25b824705bb5",
"metadata": {},
"source": [
"*This notebook shows how to create a SeisBench dataset from an Excel (xlsx) event catalog and a folder of mseed files.*\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a00a8204-932b-4488-85a0-6eea6f306523",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import obspy\n",
"import pandas as pd\n",
"import seisbench\n",
"import seisbench.data as sbd\n",
"import seisbench.util as sbu"
]
},
{
"cell_type": "markdown",
"id": "5e44f9bb-4ae8-412c-a14d-3cc0885504c6",
"metadata": {},
"source": [
"# Creating a dataset\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "70c64dc6-e4dd-4c01-939d-a28914866f5d",
"metadata": {},
"source": [
"##### The catalog uses a custom format with the following columns: \n",
"###### `Datetime`, `X`, `Y`, `Depth`, `Mw`, `Phases`, `mseed_name`\n",
"###### `Phases` is a string of detected phases separated by commas, each formatted as `<Phase> <Station> <Datetime>`, e.g. \"Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-01-01 10:09:45.696\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "143d04f7-e00a-4724-895e-f3dad72896e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Datetime</th>\n",
" <th>X</th>\n",
" <th>Y</th>\n",
" <th>Depth</th>\n",
" <th>Mw</th>\n",
" <th>Phases</th>\n",
" <th>mseed_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.582503e+06</td>\n",
" <td>5.702646e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-...</td>\n",
" <td>20200101100941.mseed</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Datetime X Y Depth Mw \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
"\n",
" Phases mseed_name \n",
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
"catalog.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "03257d45-299d-4ed1-bc64-03303d2a9873",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-01-01 10:09:45.696, Pg GROD 2020-01-01 10:09:45.206, Sg GROD 2020-01-01 10:09:46.655, Pg GUZI 2020-01-01 10:09:45.116, Sg GUZI 2020-01-01 10:09:46.561, Pg JEDR 2020-01-01 10:09:44.920, Sg JEDR 2020-01-01 10:09:46.285, Pg MOSK2 2020-01-01 10:09:45.417, Sg MOSK2 2020-01-01 10:09:46.921, Pg NWLU 2020-01-01 10:09:45.686, Sg NWLU 2020-01-01 10:09:47.175, Pg PCHB 2020-01-01 10:09:45.213, Sg PCHB 2020-01-01 10:09:46.565, Pg PPOL 2020-01-01 10:09:44.755, Sg PPOL 2020-01-01 10:09:46.069, Pg RUDN 2020-01-01 10:09:44.502, Sg RUDN 2020-01-01 10:09:45.756, Pg RYNR 2020-01-01 10:09:43.442, Sg RYNR 2020-01-01 10:09:44.394, Pg RZEC 2020-01-01 10:09:46.075, Sg RZEC 2020-01-01 10:09:47.587, Pg SGOR 2020-01-01 10:09:45.817, Sg SGOR 2020-01-01 10:09:47.284, Pg TRBC2 2020-01-01 10:09:44.833, Sg TRBC2 2020-01-01 10:09:46.095, Pg TRN2 2020-01-01 10:09:44.488, Sg TRN2 2020-01-01 10:09:45.698, Pg TRZS 2020-01-01 10:09:46.232, Sg TRZS 2020-01-01 10:09:47.727, Pg ZMST 2020-01-01 10:09:43.592, Sg ZMST 2020-01-01 10:09:44.553, Pg LUBW 2020-01-01 10:09:43.119, Sg LUBW 2020-01-01 10:09:43.929'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog.Phases[0]"
]
},
{
"cell_type": "markdown",
"id": "fe0627b1-6fa0-4b5a-8a60-d80626b5c9be",
"metadata": {},
"source": [
"#### SeisBench dataset format \n",
"\n",
"A dataset consists of two components: \n",
"* a metadata file, `metadata.csv`, holding the properties of the associated waveforms\n",
"* a waveforms file, `waveforms.hdf5`, containing the raw waveforms\n",
"\n",
"\n",
"The dataset is created with the `WaveformDataWriter` provided by SeisBench.\n",
"\n",
"### Define train/dev/test split\n",
"\n",
"Strategy: \n",
"Assign picks chronologically: the first 70% to train, the next 15% to dev, and the last 15% to test. \n",
"(Note: picks are counted from the `Phases` column of `Catalog_20_21.xlsx`, so the proportions in the converted dataset differ slightly because not all traces are available.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "07ab344c-f03f-49aa-8fa2-537fbb154716",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='Datetime'>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 700x300 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"catalog['pick_count'] = catalog.Phases.apply(lambda x: x.count(\"Pg\"))\n",
"catalog.index = catalog.Datetime\n",
"catalog = catalog.sort_index()\n",
"catalog['pick_count_cumsum'] = catalog.pick_count.cumsum()\n",
"catalog.pick_count_cumsum.plot(figsize=(7,3))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4fabb4c3-4056-4bad-94e6-f907977762a6",
"metadata": {},
"outputs": [],
"source": [
"# Use .iloc for positional access: the index is a DatetimeIndex\n",
"train_th = 0.7 * catalog.pick_count_cumsum.iloc[-1]\n",
"dev_th = 0.85 * catalog.pick_count_cumsum.iloc[-1]\n",
"\n",
"# Events are sorted chronologically, so everything past dev_th keeps the default 'test'\n",
"catalog['split'] = 'test'\n",
"for i, event in catalog.iterrows():\n",
"    if event['pick_count_cumsum'] < train_th:\n",
"        catalog.loc[i, 'split'] = 'train'\n",
"    elif event['pick_count_cumsum'] < dev_th:\n",
"        catalog.loc[i, 'split'] = 'dev'\n",
"    else:\n",
"        break"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "35254721-fe1e-447c-9195-84695868f1d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6996718237224566"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog[catalog.split == 'train'].pick_count.sum() / catalog.pick_count.sum()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "95d297ed-0da7-4985-954e-645a8a89b6a0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.149929676511955"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog[catalog.split == 'dev'].pick_count.sum() / catalog.pick_count.sum()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "28451050-6b1c-4fe6-a905-799383515d5b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.15039849976558836"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"catalog[catalog.split == 'test'].pick_count.sum() / catalog.pick_count.sum()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "be679508-478a-4195-b5e7-a9fea9bbf724",
"metadata": {},
"outputs": [],
"source": [
"def get_event_params(event):\n",
"    # NOTE: X/Y are stored unconverted; the sample values (~5.6e6) suggest\n",
"    # projected coordinates rather than degrees.\n",
"    event_params = {\n",
"        'source_origin_time': event.Datetime,\n",
"        'source_latitude_deg': event.Y,\n",
"        'source_longitude_deg': event.X,\n",
"        'source_depth_km': event.Depth,\n",
"        'source_magnitude': event.Mw,\n",
"        'split': event.split\n",
"    }\n",
"    return event_params\n",
"\n",
"def get_event_picks(event):\n",
"    # Each annotation has the form '<Phase> <Station> <Date> <Time>'\n",
"    picks = [ann.split(' ') for ann in event.Phases.split(', ')]\n",
"    picks = pd.DataFrame(picks, columns=['pick', 'station', 'date', 'hour'])\n",
"    picks.index = pd.DatetimeIndex(picks.date + ' ' + picks.hour, tz=\"UTC\")\n",
"\n",
"    return picks\n",
"\n",
"def get_mseed(fname):\n",
"    return obspy.read(fname)\n",
"\n",
"\n",
"def get_trace_params(trace):\n",
"    trace_params = {\n",
"        \"station_network_code\": trace.stats.network,\n",
"        \"station_code\": trace.stats.station,\n",
"        \"trace_channel\": trace.stats.channel\n",
"    }\n",
"    return trace_params\n",
"\n",
"def get_waves_timestamps(station, phases_string):\n",
"    # Placeholder: not implemented and not used below\n",
"    p_ts = None\n",
"    s_ts = None\n",
"\n",
"    return p_ts, s_ts"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "77373c87-019d-4c7d-90b3-54e36874750a",
"metadata": {},
"outputs": [],
"source": [
"output_path = input_path + \"seisbench_format/\"\n",
"metadata_path = output_path + \"metadata.csv\"\n",
"waveforms_path = output_path + \"waveforms.hdf5\"\n",
"train = 0.7\n",
"dev = 0.15\n",
"test = 0.15"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "6f098f39-85aa-43e0-90e8-b66c90a11d31",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Traces converted: 35784it [01:01, 578.58it/s]\n"
]
}
],
"source": [
"with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:\n",
"\n",
"    # Define data format\n",
"    writer.data_format = {\n",
"        \"dimension_order\": \"CW\",\n",
"        \"component_order\": \"ZNE\",\n",
"    }\n",
"\n",
"    for event in catalog.itertuples():\n",
"        # if \"2020-03-03 05:04:43\" not in str(event.Datetime):\n",
"        #     continue\n",
"        event_params = get_event_params(event)\n",
"        event_picks = get_event_picks(event)\n",
"        if pd.isna(event.mseed_name):\n",
"            continue\n",
"        if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
"            mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name\n",
"        elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
"            mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name\n",
"        else:\n",
"            continue\n",
"\n",
"        stream = get_mseed(mseed_path)\n",
"\n",
"        for pick_time, pick in event_picks.iterrows():\n",
"            waveforms = stream.select(station=pick.station)\n",
"            if len(waveforms) == 0:\n",
"                # No waveform data available\n",
"                continue\n",
"\n",
"            trace_params = get_trace_params(waveforms[0])\n",
"\n",
"            sampling_rate = waveforms[0].stats.sampling_rate\n",
"            # Check that the traces have the same sampling rate\n",
"            assert all(trace.stats.sampling_rate == sampling_rate for trace in waveforms)\n",
"\n",
"            actual_t_start, data, _ = sbu.stream_to_array(\n",
"                waveforms,\n",
"                component_order=writer.data_format[\"component_order\"],\n",
"            )\n",
"\n",
"            trace_params[\"trace_sampling_rate_hz\"] = sampling_rate\n",
"            trace_params[\"trace_start_time\"] = str(actual_t_start)\n",
"\n",
"            pick_time = obspy.core.utcdatetime.UTCDateTime(pick_time)\n",
"            pick_idx = (pick_time - actual_t_start) * sampling_rate\n",
"\n",
"            trace_params[f\"trace_{pick.pick}_arrival_sample\"] = int(pick_idx)\n",
"            # sample = (pick.time - actual_t_start) * sampling_rate\n",
"            # trace_params[f\"trace_{pick.phase_hint}_arrival_sample\"] = int(sample)\n",
"            # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
"\n",
"            writer.add_trace({**event_params, **trace_params}, data)\n",
"\n",
"        # break"
]
},
{
"cell_type": "markdown",
"id": "a7a66d99-4dfa-4c3a-937b-6df437eb8833",
"metadata": {},
"source": [
"### Load converted dataset"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cdb07cfb-96c5-4444-81c8-1362fa3ceea8",
"metadata": {},
"outputs": [],
"source": [
"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "33c77509-7aab-4833-a372-16030941395d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed dataset - 35784 traces\n"
]
}
],
"source": [
"print(data)"
]
},
{
"cell_type": "markdown",
"id": "4d3440c7-318b-41f3-8035-48ce3cd9a764",
"metadata": {},
"source": [
"#### Plot sample"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.lines.Line2D at 0x14d6c12d0>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 1500x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = plt.figure(figsize=(15, 5))\n",
"ax = fig.add_subplot(111)\n",
"ax.plot(data.get_waveforms(0).T)\n",
"ax.axvline(data.metadata[\"trace_Pg_arrival_sample\"].iloc[0], color=\"green\", lw=3)\n",
"# ax.axvline(data.metadata[\"trace_Sg_arrival_sample\"].iloc[0], color=\"black\", lw=3)"
]
},
{
"cell_type": "markdown",
"id": "1110dd5f-a6ff-4cb0-bd94-4904116e3233",
"metadata": {},
"source": [
"#### Check train/dev/test proportions"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training examples: 24738 69.1%\n",
"Development examples: 5508 15.4%\n",
"Test examples: 5538 15.5 %\n"
]
}
],
"source": [
"all_samples = len(data.train()) + len(data.dev()) + len(data.test())\n",
"print(f\"Training examples: {len(data.train())} {len(data.train())/all_samples * 100:.1f}%\" )\n",
"print(f\"Development examples: {len(data.dev())} {len(data.dev())/all_samples * 100:.1f}%\")\n",
"print(f\"Test examples: {len(data.test())} {len(data.test())/all_samples * 100:.1f} %\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "de82db24-d983-4592-a0eb-f96beecb2f69",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>source_origin_time</th>\n",
" <th>source_latitude_deg</th>\n",
" <th>source_longitude_deg</th>\n",
" <th>source_depth_km</th>\n",
" <th>source_magnitude</th>\n",
" <th>split</th>\n",
" <th>station_network_code</th>\n",
" <th>station_code</th>\n",
" <th>trace_channel</th>\n",
" <th>trace_sampling_rate_hz</th>\n",
" <th>trace_start_time</th>\n",
" <th>trace_Pg_arrival_sample</th>\n",
" <th>trace_name</th>\n",
" <th>trace_Sg_arrival_sample</th>\n",
" <th>trace_chunk</th>\n",
" <th>trace_component_order</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.702646e+06</td>\n",
" <td>5.582503e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>train</td>\n",
" <td>PL</td>\n",
" <td>BRDW</td>\n",
" <td>EHE</td>\n",
" <td>100.0</td>\n",
" <td>2020-01-01T10:09:36.480000Z</td>\n",
" <td>792.0</td>\n",
" <td>bucket0$0,:3,:2001</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>ZNE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.702646e+06</td>\n",
" <td>5.582503e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>train</td>\n",
" <td>PL</td>\n",
" <td>BRDW</td>\n",
" <td>EHE</td>\n",
" <td>100.0</td>\n",
" <td>2020-01-01T10:09:36.480000Z</td>\n",
" <td>NaN</td>\n",
" <td>bucket0$1,:3,:2001</td>\n",
" <td>921.0</td>\n",
" <td></td>\n",
" <td>ZNE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.702646e+06</td>\n",
" <td>5.582503e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>train</td>\n",
" <td>PL</td>\n",
" <td>GROD</td>\n",
" <td>EHE</td>\n",
" <td>100.0</td>\n",
" <td>2020-01-01T10:09:36.480000Z</td>\n",
" <td>872.0</td>\n",
" <td>bucket0$2,:3,:2001</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>ZNE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.702646e+06</td>\n",
" <td>5.582503e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>train</td>\n",
" <td>PL</td>\n",
" <td>GROD</td>\n",
" <td>EHE</td>\n",
" <td>100.0</td>\n",
" <td>2020-01-01T10:09:36.480000Z</td>\n",
" <td>NaN</td>\n",
" <td>bucket0$3,:3,:2001</td>\n",
" <td>1017.0</td>\n",
" <td></td>\n",
" <td>ZNE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>2020-01-01 10:09:42.200</td>\n",
" <td>5.702646e+06</td>\n",
" <td>5.582503e+06</td>\n",
" <td>0.7</td>\n",
" <td>2.469231</td>\n",
" <td>train</td>\n",
" <td>PL</td>\n",
" <td>GUZI</td>\n",
" <td>CNE</td>\n",
" <td>100.0</td>\n",
" <td>2020-01-01T10:09:36.476000Z</td>\n",
" <td>864.0</td>\n",
" <td>bucket0$4,:3,:2001</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" <td>ZNE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"\n",
" source_depth_km source_magnitude split station_network_code station_code \\\n",
"0 0.7 2.469231 train PL BRDW \n",
"1 0.7 2.469231 train PL BRDW \n",
"2 0.7 2.469231 train PL GROD \n",
"3 0.7 2.469231 train PL GROD \n",
"4 0.7 2.469231 train PL GUZI \n",
"\n",
" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
"\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \n",
"1 NaN bucket0$1,:3,:2001 921.0 \n",
"2 872.0 bucket0$2,:3,:2001 NaN \n",
"3 NaN bucket0$3,:3,:2001 1017.0 \n",
"4 864.0 bucket0$4,:3,:2001 NaN \n",
"\n",
" trace_chunk trace_component_order \n",
"0 ZNE \n",
"1 ZNE \n",
"2 ZNE \n",
"3 ZNE \n",
"4 ZNE "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.metadata.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37fe0dd1-ba9b-46ff-9abd-eb40f73649e3",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ccd6908-cff7-42b2-a6a3-51ac0557a7dc",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}