Added scripts converting mseeds from Bogdanka to SeisBench format, extended README, modified logging
This commit is contained in:
parent 78ac51478c
commit aa39980573

78 README.md
@@ -2,10 +2,9 @@
This repo contains notebooks and scripts demonstrating how to:

- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various cnn models available in seisbench library and compare their performance of detecting P and S waves, check the [script](scripts/pipeline.py)
- Train various cnn models available in seisbench library and compare their performance of detecting P phase, check the [script](scripts/pipeline.py)
- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell

WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models_on_igf_data"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
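These variables are read at runtime via `python-dotenv`. A minimal sketch of what the training code does with them (it mirrors the `wandb.login` block added to the training script in this commit):

```python
import os
from dotenv import load_dotenv
import wandb

# Read the .env file and authenticate with Weights & Biases
load_dotenv()
wandb_api_key = os.environ.get("WANDB_API_KEY")
if wandb_api_key is None:
    raise ValueError("WANDB_API_KEY environment variable is not set.")
wandb.login(key=wandb_api_key)
```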
2. Transform data into seisbench format.

* Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
* Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`

To utilize the functionality of the SeisBench library, the data needs to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.

The script assumes that:

* the data is stored in the following directory structure:

  `input_path/year/station_network_code/station_code/trace_channel.D`, e.g.

  `input_path/2018/PL/ALBE/EHE.D/`

* the file names follow the pattern:

  `station_network_code.station_code..trace_channel.D.year.day_of_year`,

  e.g. `PL.ALBE..EHE.D.2018.282`

* the events catalog is stored in QuakeML format
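For illustration, the sketch below shows how a trace file path is derived from these conventions; it mirrors the `get_trace_path` helper in `utils/mseeds_to_seisbench.py`, and the station and pick time are made-up example values:

```python
import pandas as pd

def get_trace_path(input_path, trace_params):
    # Compose the expected mseed path from the naming convention described above:
    # input_path/year/network/station/channel.D/network.station..channel.D.year.day_of_year
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]
    return f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"

# Hypothetical pick on station PL.ALBE, channel EHE:
params = {
    "time": pd.Timestamp("2018-10-09T12:00:00"),
    "station_network_code": "PL",
    "station_code": "ALBE",
    "trace_channel": "EHE",
}
print(get_trace_path("input_path", params))
# input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
```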
Run the script `mseeds_to_seisbench.py` located in the `utils` directory:

```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```

If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, the partition name and the paths) and submit the job to the queue using the `sbatch` command on a login node of e.g. Ares:

```
cd utils
sbatch convert_data.sh
```

If your data has a different structure or format, use the notebooks to gain an understanding of the SeisBench format and what needs to be done to transform your data:

* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook

3. Adjust the `config.json` and specify:

* `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
* `data_path` - the path to the data in the SeisBench format
* `experiment_count` - the number of experiments to run for each model type
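A minimal sketch of what `config.json` might look like for the Bogdanka dataset; the paths follow the defaults shipped in this repo, while the `experiment_count` value and the file location are illustrative assumptions (editing the file by hand is equivalent):

```python
import json

# Illustrative config for the Bogdanka dataset; experiment_count = 20 is an assumed value,
# and further keys such as "sweep_files" may be required by the pipeline.
config = {
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
    "experiment_count": 20,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```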
4. Run the pipeline script

`python pipeline.py`

The script performs the following steps:

* Generates evaluation targets in the `datasets/<dataset_name>/targets` directory.
* Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, producing the lowest validation loss.

  This step utilizes the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process and store the results.
  The results are available at `https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`.
  Weights and training logs can be downloaded from the platform.

  Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:

  * Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
  * Metrics and hyperparameters are saved in `<run_id>` folders

* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.

  The results are saved in the `scripts/pred/results.csv` file.
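As a quick way to compare the models after a run, the collected results can be inspected with pandas (a minimal sketch; it only assumes that the pipeline has already written `scripts/pred/results.csv`, and the column names depend on `collect_results.py`):

```python
import pandas as pd

# Load the aggregated evaluation results written by the pipeline
results = pd.read_csv("scripts/pred/results.csv")
print(results.head())
```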
The default settings are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.

For example, to change the sweep configuration file for the GPD model, run:

`python pipeline.py --gpd_config <new config file>`

The new config file should be placed in the `experiments` folder or in the location specified by the `configs_path` parameter in the config.json file.

### Troubleshooting
config.json

@@ -1,7 +1,7 @@
{
  "dataset_name": "igf",
  "dataset_name": "bogdanka",
  "data_path": "datasets/igf/seisbench_format/",
  "data_path": "datasets/bogdanka/seisbench_format/",
  "targets_path": "datasets/targets/igf",
  "targets_path": "datasets/targets",
  "models_path": "weights",
  "configs_path": "experiments",
  "sampling_rate": 100,
29 poetry.lock generated
@@ -283,6 +283,14 @@ python-versions = "*"
[package.dependencies]
six = ">=1.4.0"

[[package]]
name = "et-xmlfile"
version = "1.1.0"
description = "An implementation of lxml.xmlfile for the standard library"
category = "main"
optional = false
python-versions = ">=3.6"

[[package]]
name = "exceptiongroup"
version = "1.1.2"

@@ -971,6 +979,17 @@ imaging = ["cartopy"]
"io.shapefile" = ["pyshp"]
tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]

[[package]]
name = "openpyxl"
version = "3.1.2"
description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
category = "main"
optional = false
python-versions = ">=3.6"

[package.dependencies]
et-xmlfile = "*"

[[package]]
name = "overrides"
version = "7.3.1"

@@ -1766,7 +1785,7 @@ test = ["websockets"]
[metadata]
lock-version = "1.1"
python-versions = "^3.10"
content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64"
content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"

[metadata.files]
anyio = [

@@ -2076,6 +2095,10 @@ docker-pycreds = [
    {file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
    {file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
]
et-xmlfile = [
    {file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
    {file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
]
exceptiongroup = [
    {file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
    {file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},

@@ -2622,6 +2645,10 @@ obspy = [
    {file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
    {file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
]
openpyxl = [
    {file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
    {file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
]
overrides = [
    {file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
    {file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},
pyproject.toml

@@ -16,6 +16,7 @@ wandb = "^0.15.4"
torchmetrics = "^0.11.4"
ipykernel = "^6.24.0"
jupyterlab = "^4.0.2"
openpyxl = "^3.1.2"

[tool.poetry.dev-dependencies]
@@ -15,8 +15,8 @@ config = load_config(config_path)

data_path = f"{project_path}/{config['data_path']}"
models_path = f"{project_path}/{config['models_path']}"
targets_path = f"{project_path}/{config['targets_path']}"
dataset_name = config['dataset_name']
targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
configs_path = f"{project_path}/{config['configs_path']}"

sweep_files = config['sweep_files']

@@ -29,11 +29,11 @@ data_aliases = {
    "instance": "InstanceCountsCombined",
    "iquique": "Iquique",
    "lendb": "LenDB",
    "scedc": "SCEDC"
    "scedc": "SCEDC",
}


def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False):
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
    weights = Path(weights)
    targets = Path(os.path.abspath(targets))
    print(targets)

@@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
    for task in ["1", "23"]:
        task_csv = targets / f"task{task}.csv"

        print(task_csv)

        if not task_csv.is_file():
            continue

@@ -227,9 +225,7 @@ if __name__ == "__main__":
    parser.add_argument(
        "--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
    )
    parser.add_argument(
        "--test_run", action="store_true", required=False, default=False
    )
    args = parser.parse_args()

    main(

@@ -239,8 +235,7 @@ if __name__ == "__main__":
        batchsize=args.batchsize,
        num_workers=args.num_workers,
        sampling_rate=args.sampling_rate,
        sweep_id=args.sweep_id,
        sweep_id=args.sweep_id
        test_run=args.test_run
    )
    running_time = str(
        datetime.timedelta(seconds=time.perf_counter() - code_start_time)
@@ -3,6 +3,7 @@
# This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
# -----------------

import os
import os.path
import argparse
from pytorch_lightning.loggers import WandbLogger, CSVLogger

@@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count


torch.multiprocessing.set_sharing_strategy('file_system')
os.system("ulimit -n unlimited")

load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')
@@ -17,8 +17,8 @@ import eval
import collect_results
from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files

logging.root.setLevel(logging.INFO)
logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)


def load_sweep_config(model_name, args):

@@ -76,16 +76,19 @@ def main():
    args = parser.parse_args()

    # generate labels
    logger.info("Started generating labels for the dataset.")
    generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)

    # find the best hyperparams for the models
    logger.info("Started training the models.")
    for model_name in ["GPD", "PhaseNet"]:
        sweep_id = find_the_best_params(model_name, args)
        generate_predictions(sweep_id, model_name)

    # collect results
    logger.info("Collecting results.")
    collect_results.traverse_path("pred", "pred/results.csv")
    logger.info("Results saved in pred/results.csv")


if __name__ == "__main__":
    main()
@@ -20,18 +20,13 @@ import torch
import os
import logging
from pathlib import Path
from dotenv import load_dotenv

import models, data, util
import time
import datetime
import wandb
#
# load_dotenv()
# wandb_api_key = os.environ.get('WANDB_API_KEY')
# if wandb_api_key is None:
#     raise ValueError("WANDB_API_KEY environment variable is not set.")
#
# wandb.login(key=wandb_api_key)


def train(config, experiment_name, test_run):
    """

@@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):


if __name__ == "__main__":

    load_dotenv()
    wandb_api_key = os.environ.get('WANDB_API_KEY')
    if wandb_api_key is None:
        raise ValueError("WANDB_API_KEY environment variable is not set.")

    wandb.login(key=wandb_api_key)

    code_start_time = time.perf_counter()

    torch.manual_seed(42)
@@ -16,7 +16,7 @@ load_dotenv()

logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)


def load_best_model_data(sweep_id, weights):
1134 utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb Normal file
File diff suppressed because one or more lines are too long
@@ -88,8 +88,8 @@
"</div>"
],
"text/plain": [
" Datetime X Y Depth Mw \n",
" Datetime X Y Depth Mw \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
"\n",
" Phases mseed_name \n",
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "

@@ -101,7 +101,7 @@
}
],
"source": [
"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n",
"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
"catalog.head(1)"
]

@@ -317,7 +317,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Traces converted: 35784it [00:52, 679.39it/s]\n"
"Traces converted: 35784it [01:01, 578.58it/s]\n"
]
}
],

@@ -339,8 +339,10 @@
" continue\n",
" if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
" else:\n",
" elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
" else: \n",
" continue\n",
" \n",
" \n",
" stream = get_mseed(mseed_path)\n",

@@ -374,6 +376,8 @@
" # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
" \n",
" writer.add_trace({**event_params, **trace_params}, data)\n",
"\n",
" # break\n",
" \n",
" "
]

@@ -393,7 +397,25 @@
"metadata": {},
"outputs": [],
"source": [
"data = sbd.WaveformDataset(output_path, sampling_rate=100)"
"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "33c77509-7aab-4833-a372-16030941395d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed dataset - 35784 traces\n"
]
}
],
"source": [
"print(data)"
]
},
{

@@ -406,17 +428,17 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.lines.Line2D at 0x7f7ed04a8820>"
"<matplotlib.lines.Line2D at 0x14d6c12d0>"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},

@@ -449,7 +471,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
"metadata": {},
"outputs": [

@@ -472,7 +494,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 15,
"id": "de82db24-d983-4592-a0eb-f96beecb2f69",
"metadata": {},
"outputs": [

@@ -622,29 +644,29 @@
"</div>"
],
"text/plain": [
" index source_origin_time source_latitude_deg source_longitude_deg \n",
" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"\n",
" source_depth_km source_magnitude split station_network_code station_code \n",
" source_depth_km source_magnitude split station_network_code station_code \\\n",
"0 0.7 2.469231 train PL BRDW \\\n",
"0 0.7 2.469231 train PL BRDW \n",
"1 0.7 2.469231 train PL BRDW \n",
"2 0.7 2.469231 train PL GROD \n",
"3 0.7 2.469231 train PL GROD \n",
"4 0.7 2.469231 train PL GUZI \n",
"\n",
" trace_channel trace_sampling_rate_hz trace_start_time \n",
" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
"\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \n",
"1 NaN bucket0$1,:3,:2001 921.0 \n",
"2 872.0 bucket0$2,:3,:2001 NaN \n",
"3 NaN bucket0$3,:3,:2001 1017.0 \n",

@@ -658,7 +680,7 @@
"4 ZNE "
]
},
"execution_count": 14,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}

@@ -700,7 +722,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.6"
}
},
"nbformat": 4,
19 utils/convert_data.sh Normal file

@@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb


## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train

input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"

python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
250 utils/mseeds_to_seisbench.py Normal file

@@ -0,0 +1,250 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
import argparse


logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)


logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                # trace_id = tr.id
                # start = tr.meta.starttime
                # end = tr.meta.endtime

                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)


def split_events(events, input_path):

    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # check if mseed exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}
    # print(origin)

    mag = event.preferred_magnitude()

    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
            "uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            # print(pick_time, " - selected trace: ", tr)
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
    """
    Convert mseed files to seisbench dataset format
    :param input_path: folder with mseed files
    :param catalog_path: path to events catalog in quakeml format
    :param output_path: folder to save seisbench dataset
    :return:
    """
    logger.info("Loading events catalog ...")
    events = read_events(catalog_path)
    events_stats = split_events(events, input_path)

    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    logger.debug("Catalog loaded, starting conversion ...")

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
    parser.add_argument('--input_path', type=str, help='Path to mseed files')
    parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
    parser.add_argument('--output_path', type=str, help='Path to output files')
    args = parser.parse_args()

    convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)
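A minimal check of the converted dataset; this mirrors the verification cells in the Bogdanka notebook (adjust the path to whatever `output_path` you used):

```python
import seisbench.data as sbd

# Load the dataset written by mseeds_to_seisbench.py and print its summary
data = sbd.WaveformDataset("datasets/bogdanka/seisbench_format", sampling_rate=100)
print(data)
```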
230 utils/utils.py Normal file

@@ -0,0 +1,230 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging

logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)

logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                # trace_id = tr.id
                # start = tr.meta.starttime
                # end = tr.meta.endtime

                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)


def split_events(events, input_path):

    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # check if mseed exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}
    # print(origin)

    mag = event.preferred_magnitude()

    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
            "uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            # print(pick_time, " - selected trace: ", tr)
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format():
    input_path = "/net/pr2/projects/plgrid/plggeposai"
    logger.info("Loading events catalog ...")
    events = read_events(input_path + "/BOIS_all.xml")
    events_stats = split_events(events, input_path)
    output_path = input_path + "/seisbench_format"
    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]
            # b = False

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":
    convert_mseed_to_seisbench_format()
    # create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])