Added scripts converting mseeds from Bogdanka to SeisBench format, extended README, modified logging
parent 78ac51478c
commit aa39980573

README.md (70 lines changed)
@@ -2,10 +2,9 @@
This repo contains notebooks and scripts demonstrating how to:
- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the seisbench library and compare their performance in detecting P phase, check the [script](scripts/pipeline.py)
- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the seisbench library and compare their performance in detecting P and S waves, check the [script](scripts/pipeline.py)
- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell

WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models_on_igf_data"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
2. Transform data into seisbench format. (unofficial)
* Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
* Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
2. Transform data into seisbench format.

3. Run the pipeline script:
To utilize the functionality of the SeisBench library, the data needs to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.

The script assumes that (see the sketch below):
* the data is stored in the following directory structure:
  `input_path/year/station_network_code/station_code/trace_channel.D` e.g.
  `input_path/2018/PL/ALBE/EHE.D/`
* the file names follow the pattern:
  `station_network_code.station_code..trace_channel.D.year.day_of_year`
  e.g. `PL.ALBE..EHE.D.2018.282`
* the events catalog is stored in QuakeML format
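A minimal sketch of how the expected path of a single trace file is composed from these elements (the helper name and example values are illustrative, the layout follows the assumptions above):

```
from pathlib import Path

def expected_trace_path(input_path, year, net, station, channel, day_of_year):
    # directory layout: input_path/year/net/station/channel.D/
    # file name: net.station..channel.D.year.day_of_year
    file_name = f"{net}.{station}..{channel}.D.{year}.{day_of_year}"
    return Path(input_path) / str(year) / net / station / f"{channel}.D" / file_name

# prints: input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
print(expected_trace_path("input_path", 2018, "PL", "ALBE", "EHE", 282))
```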
`python pipeline.py`
Run the script `mseeds_to_seisbench.py` located in the `utils` directory:
```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```

If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, cluster name and paths) and submit the job to the queue with the `sbatch` command on a login node of e.g. Ares:

```
cd utils
sbatch convert_data.sh
```
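Once the conversion finishes, you can quickly verify the resulting dataset by loading it with SeisBench (a minimal sketch; the path corresponds to the `output_path` used above, and the printed trace count is only an example):

```
import seisbench.data as sbd

output_path = "datasets/bogdanka/seisbench_format"

# loads metadata.csv + waveforms.hdf5 and resamples traces to 100 Hz on access
data = sbd.WaveformDataset(output_path, sampling_rate=100)
print(data)                  # e.g. "Unnamed dataset - 35784 traces"
print(data.metadata.head())  # event and trace metadata as a pandas DataFrame
```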
If your data has a different structure or format, use the notebooks to gain an understanding of the SeisBench format and what needs to be done to transform your data:
* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook
3. Adjust the `config.json` and specify (an example file is shown below):
* `dataset_name` - the name of the dataset, used to name the folder with evaluation targets and predictions
* `data_path` - the path to the data in the SeisBench format
* `experiment_count` - the number of experiments to run for each model type
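For illustration, a minimal `config.json` might look as follows (paths and names mirror this repository's defaults; the `experiment_count` value is only an example):

```
{
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
    "experiment_count": 5
}
```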
4. Run the pipeline script
`python pipeline.py`

The script performs the following steps:
* Generates evaluation targets
* Generates evaluation targets in the `datasets/<dataset_name>/targets` directory.
* Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, producing the lowest validation loss.

  This step utilizes the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process and store the results.
  The results are available at
  `https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
* Uses the best performing model of each type to generate predictions
* Evaluates the performance of each model by comparing the predictions with the evaluation targets
* Saves the results in the `scripts/pred` directory
  `https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`

  Weights and training logs can be downloaded from the platform.
  Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
  * Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
  * Metrics and hyperparameters are saved in `<run_id>` folders

* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
  The results are saved in the `scripts/pred/results.csv` file.
The default settings are saved in the config.json file. To change the settings, edit config.json or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for the GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.
The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.

### Troubleshooting
@@ -1,7 +1,7 @@
{
    "dataset_name": "igf",
    "data_path": "datasets/igf/seisbench_format/",
    "targets_path": "datasets/targets/igf",
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
poetry.lock (generated, 29 lines changed)
@ -283,6 +283,14 @@ python-versions = "*"
|
||||
[package.dependencies]
|
||||
six = ">=1.4.0"
|
||||
|
||||
[[package]]
|
||||
name = "et-xmlfile"
|
||||
version = "1.1.0"
|
||||
description = "An implementation of lxml.xmlfile for the standard library"
|
||||
category = "main"
|
||||
optional = false
|
||||
python-versions = ">=3.6"
|
||||
|
||||
[[package]]
|
||||
name = "exceptiongroup"
|
||||
version = "1.1.2"
|
||||
@ -971,6 +979,17 @@ imaging = ["cartopy"]
|
||||
"io.shapefile" = ["pyshp"]
|
||||
tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]
|
||||
|
||||
[[package]]
|
||||
name = "openpyxl"
|
||||
version = "3.1.2"
|
||||
description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
|
||||
category = "main"
|
||||
optional = false
|
||||
python-versions = ">=3.6"
|
||||
|
||||
[package.dependencies]
|
||||
et-xmlfile = "*"
|
||||
|
||||
[[package]]
|
||||
name = "overrides"
|
||||
version = "7.3.1"
|
||||
@ -1766,7 +1785,7 @@ test = ["websockets"]
|
||||
[metadata]
|
||||
lock-version = "1.1"
|
||||
python-versions = "^3.10"
|
||||
content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64"
|
||||
content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"
|
||||
|
||||
[metadata.files]
|
||||
anyio = [
|
||||
@ -2076,6 +2095,10 @@ docker-pycreds = [
|
||||
{file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
|
||||
{file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
|
||||
]
|
||||
et-xmlfile = [
|
||||
{file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
|
||||
{file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
|
||||
]
|
||||
exceptiongroup = [
|
||||
{file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
|
||||
{file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},
|
||||
@ -2622,6 +2645,10 @@ obspy = [
|
||||
{file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
|
||||
{file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
|
||||
]
|
||||
openpyxl = [
|
||||
{file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
|
||||
{file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
|
||||
]
|
||||
overrides = [
|
||||
{file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
|
||||
{file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},
|
||||
|
@@ -16,6 +16,7 @@ wandb = "^0.15.4"
torchmetrics = "^0.11.4"
ipykernel = "^6.24.0"
jupyterlab = "^4.0.2"
openpyxl = "^3.1.2"

[tool.poetry.dev-dependencies]
|
@@ -15,8 +15,8 @@ config = load_config(config_path)

data_path = f"{project_path}/{config['data_path']}"
models_path = f"{project_path}/{config['models_path']}"
targets_path = f"{project_path}/{config['targets_path']}"
dataset_name = config['dataset_name']
targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
configs_path = f"{project_path}/{config['configs_path']}"

sweep_files = config['sweep_files']
@@ -29,11 +29,11 @@ data_aliases = {
    "instance": "InstanceCountsCombined",
    "iquique": "Iquique",
    "lendb": "LenDB",
    "scedc": "SCEDC"
    "scedc": "SCEDC",
}


def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False):
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
    weights = Path(weights)
    targets = Path(os.path.abspath(targets))
    print(targets)
@@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
    for task in ["1", "23"]:
        task_csv = targets / f"task{task}.csv"

        print(task_csv)

        if not task_csv.is_file():
            continue

@@ -227,9 +225,7 @@ if __name__ == "__main__":
    parser.add_argument(
        "--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
    )
    parser.add_argument(
        "--test_run", action="store_true", required=False, default=False
    )

    args = parser.parse_args()

    main(
@@ -239,8 +235,7 @@ if __name__ == "__main__":
        batchsize=args.batchsize,
        num_workers=args.num_workers,
        sampling_rate=args.sampling_rate,
        sweep_id=args.sweep_id,
        test_run=args.test_run
        sweep_id=args.sweep_id
    )
    running_time = str(
        datetime.timedelta(seconds=time.perf_counter() - code_start_time)
@@ -3,6 +3,7 @@
# This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
# -----------------

import os
import os.path
import argparse
from pytorch_lightning.loggers import WandbLogger, CSVLogger
@@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count


torch.multiprocessing.set_sharing_strategy('file_system')
os.system("ulimit -n unlimited")

load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')
@@ -17,8 +17,8 @@ import eval
import collect_results
from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files

logging.root.setLevel(logging.INFO)
logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)


def load_sweep_config(model_name, args):
@@ -76,16 +76,19 @@ def main():
    args = parser.parse_args()

    # generate labels
    logger.info("Started generating labels for the dataset.")
    generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)

    # find the best hyperparams for the models
    logger.info("Started training the models.")
    for model_name in ["GPD", "PhaseNet"]:
        sweep_id = find_the_best_params(model_name, args)
        generate_predictions(sweep_id, model_name)

    # collect results
    logger.info("Collecting results.")
    collect_results.traverse_path("pred", "pred/results.csv")

    logger.info("Results saved in pred/results.csv")


if __name__ == "__main__":
    main()
@@ -20,18 +20,13 @@ import torch
import os
import logging
from pathlib import Path
from dotenv import load_dotenv

import models, data, util
import time
import datetime
import wandb
#
# load_dotenv()
# wandb_api_key = os.environ.get('WANDB_API_KEY')
# if wandb_api_key is None:
#     raise ValueError("WANDB_API_KEY environment variable is not set.")
#
# wandb.login(key=wandb_api_key)


def train(config, experiment_name, test_run):
    """
@@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):


if __name__ == "__main__":

    load_dotenv()
    wandb_api_key = os.environ.get('WANDB_API_KEY')
    if wandb_api_key is None:
        raise ValueError("WANDB_API_KEY environment variable is not set.")

    wandb.login(key=wandb_api_key)

    code_start_time = time.perf_counter()

    torch.manual_seed(42)
@@ -16,7 +16,7 @@ load_dotenv()


logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)


def load_best_model_data(sweep_id, weights):
utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb (new file, 1134 lines)
File diff suppressed because one or more lines are too long
@ -88,8 +88,8 @@
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Datetime X Y Depth Mw \n",
|
||||
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n",
|
||||
" Datetime X Y Depth Mw \\\n",
|
||||
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
|
||||
"\n",
|
||||
" Phases mseed_name \n",
|
||||
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
|
||||
@ -101,7 +101,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n",
|
||||
"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
|
||||
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
|
||||
"catalog.head(1)"
|
||||
]
|
||||
@ -317,7 +317,7 @@
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Traces converted: 35784it [00:52, 679.39it/s]\n"
|
||||
"Traces converted: 35784it [01:01, 578.58it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -339,8 +339,10 @@
|
||||
" continue\n",
|
||||
" if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
|
||||
" mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
|
||||
" else:\n",
|
||||
" elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
|
||||
" mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
|
||||
" else: \n",
|
||||
" continue\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" stream = get_mseed(mseed_path)\n",
|
||||
@ -374,6 +376,8 @@
|
||||
" # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
|
||||
" \n",
|
||||
" writer.add_trace({**event_params, **trace_params}, data)\n",
|
||||
"\n",
|
||||
" # break\n",
|
||||
" \n",
|
||||
" "
|
||||
]
|
||||
@ -393,7 +397,25 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data = sbd.WaveformDataset(output_path, sampling_rate=100)"
|
||||
"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "33c77509-7aab-4833-a372-16030941395d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Unnamed dataset - 35784 traces\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -406,17 +428,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 13,
|
||||
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<matplotlib.lines.Line2D at 0x7f7ed04a8820>"
|
||||
"<matplotlib.lines.Line2D at 0x14d6c12d0>"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
},
|
||||
@ -449,7 +471,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 14,
|
||||
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -472,7 +494,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 15,
|
||||
"id": "de82db24-d983-4592-a0eb-f96beecb2f69",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@ -622,29 +644,29 @@
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" index source_origin_time source_latitude_deg source_longitude_deg \n",
|
||||
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n",
|
||||
" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
|
||||
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
|
||||
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
|
||||
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
|
||||
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
|
||||
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
|
||||
"\n",
|
||||
" source_depth_km source_magnitude split station_network_code station_code \n",
|
||||
"0 0.7 2.469231 train PL BRDW \\\n",
|
||||
" source_depth_km source_magnitude split station_network_code station_code \\\n",
|
||||
"0 0.7 2.469231 train PL BRDW \n",
|
||||
"1 0.7 2.469231 train PL BRDW \n",
|
||||
"2 0.7 2.469231 train PL GROD \n",
|
||||
"3 0.7 2.469231 train PL GROD \n",
|
||||
"4 0.7 2.469231 train PL GUZI \n",
|
||||
"\n",
|
||||
" trace_channel trace_sampling_rate_hz trace_start_time \n",
|
||||
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n",
|
||||
" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
|
||||
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
|
||||
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
|
||||
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
|
||||
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
|
||||
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
|
||||
"\n",
|
||||
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n",
|
||||
"0 792.0 bucket0$0,:3,:2001 NaN \\\n",
|
||||
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
|
||||
"0 792.0 bucket0$0,:3,:2001 NaN \n",
|
||||
"1 NaN bucket0$1,:3,:2001 921.0 \n",
|
||||
"2 872.0 bucket0$2,:3,:2001 NaN \n",
|
||||
"3 NaN bucket0$3,:3,:2001 1017.0 \n",
|
||||
@ -658,7 +680,7 @@
|
||||
"4 ZNE "
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@ -700,7 +722,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.7"
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
utils/convert_data.sh (new file, 19 lines)
@@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb


## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train

input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"

python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
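After submitting the job, its status can be checked with standard SLURM tools (generic cluster usage, not specific to this repository), e.g.:

```
squeue -u $USER
```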
utils/mseeds_to_seisbench.py (new file, 250 lines)
@ -0,0 +1,250 @@
|
||||
import os
|
||||
import pandas as pd
|
||||
import glob
|
||||
from pathlib import Path
|
||||
|
||||
import obspy
|
||||
from obspy.core.event import read_events
|
||||
|
||||
import seisbench
|
||||
import seisbench.data as sbd
|
||||
import seisbench.util as sbu
|
||||
import sys
|
||||
import logging
|
||||
import argparse
|
||||
|
||||
|
||||
logging.basicConfig(filename="output.out",
|
||||
filemode='a',
|
||||
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
|
||||
datefmt='%H:%M:%S',
|
||||
level=logging.DEBUG)
|
||||
|
||||
|
||||
|
||||
logger = logging.getLogger('converter')
|
||||
|
||||
def create_traces_catalog(directory, years):
    traces = []
    for year in years:
        year_directory = f"{directory}/{year}"
        # collect all files for the given year (skip sub-directories)
        files = [f for f in glob.glob(f"{year_directory}/**/*", recursive=True) if os.path.isfile(f)]
        for f in files:
            st = obspy.read(f)

            for tr in st.traces:
                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

    traces_catalog = pd.DataFrame(traces)
    # pandas to_csv has no `append` argument; use mode='a' to append to an existing file
    traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)
|
||||
|
||||
|
||||
def split_events(events, input_path):
|
||||
|
||||
logger.info("Splitting available events into train, dev and test sets ...")
|
||||
events_stats = pd.DataFrame()
|
||||
events_stats.index.name = "event"
|
||||
|
||||
for i, event in enumerate(events):
|
||||
#check if mseed exists
|
||||
actual_picks = 0
|
||||
for pick in event.picks:
|
||||
trace_params = get_trace_params(pick)
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
if os.path.isfile(trace_path):
|
||||
actual_picks += 1
|
||||
|
||||
events_stats.loc[i, "pick_count"] = actual_picks
|
||||
|
||||
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
|
||||
|
||||
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
|
||||
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
|
||||
|
||||
events_stats['split'] = 'test'
|
||||
for i, event in events_stats.iterrows():
|
||||
if event['pick_count_cumsum'] < train_th:
|
||||
events_stats.loc[i, 'split'] = 'train'
|
||||
elif event['pick_count_cumsum'] < dev_th:
|
||||
events_stats.loc[i, 'split'] = 'dev'
|
||||
else:
|
||||
break
|
||||
|
||||
return events_stats
|
||||
|
||||
|
||||
def get_event_params(event):
|
||||
origin = event.preferred_origin()
|
||||
if origin is None:
|
||||
return {}
|
||||
# print(origin)
|
||||
|
||||
mag = event.preferred_magnitude()
|
||||
|
||||
source_id = str(event.resource_id)
|
||||
|
||||
event_params = {
|
||||
"source_id": source_id,
|
||||
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
|
||||
"source_latitude_deg": origin.latitude,
|
||||
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
|
||||
"source_longitude_deg": origin.longitude,
|
||||
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
|
||||
"source_depth_km": origin.depth / 1e3,
|
||||
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
|
||||
"uncertainty"] is not None else None,
|
||||
}
|
||||
|
||||
if mag is not None:
|
||||
event_params["source_magnitude"] = mag.mag
|
||||
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
|
||||
event_params["source_magnitude_type"] = mag.magnitude_type
|
||||
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
|
||||
|
||||
return event_params
|
||||
|
||||
|
||||
def get_trace_params(pick):
|
||||
net = pick.waveform_id.network_code
|
||||
sta = pick.waveform_id.station_code
|
||||
|
||||
trace_params = {
|
||||
"station_network_code": net,
|
||||
"station_code": sta,
|
||||
"trace_channel": pick.waveform_id.channel_code,
|
||||
"station_location_code": pick.waveform_id.location_code,
|
||||
"time": pick.time
|
||||
}
|
||||
|
||||
return trace_params
|
||||
|
||||
|
||||
def find_trace(pick_time, traces):
|
||||
for tr in traces:
|
||||
if pick_time > tr.stats.endtime:
|
||||
continue
|
||||
if pick_time >= tr.stats.starttime:
|
||||
# print(pick_time, " - selected trace: ", tr)
|
||||
return tr
|
||||
|
||||
logger.warning(f"no matching trace for peak: {pick_time}")
|
||||
return None
|
||||
|
||||
|
||||
def get_trace_path(input_path, trace_params):
|
||||
year = trace_params["time"].year
|
||||
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
|
||||
net = trace_params["station_network_code"]
|
||||
station = trace_params["station_code"]
|
||||
tr_channel = trace_params["trace_channel"]
|
||||
|
||||
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
|
||||
return path
|
||||
|
||||
|
||||
def load_trace(input_path, trace_params):
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
trace = None
|
||||
|
||||
if not os.path.isfile(trace_path):
|
||||
logger.warning(trace_path + " not found")
|
||||
else:
|
||||
stream = obspy.read(trace_path)
|
||||
if len(stream.traces) > 1:
|
||||
trace = find_trace(trace_params["time"], stream.traces)
|
||||
elif len(stream.traces) == 0:
|
||||
logger.warning(f"no data in: {trace_path}")
|
||||
else:
|
||||
trace = stream.traces[0]
|
||||
|
||||
return trace
|
||||
|
||||
|
||||
def load_stream(input_path, trace_params, time_before=60, time_after=60):
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
sampling_rate, stream = None, None
|
||||
pick_time = trace_params["time"]
|
||||
|
||||
if not os.path.isfile(trace_path):
|
||||
print(trace_path + " not found")
|
||||
else:
|
||||
stream = obspy.read(trace_path)
|
||||
stream = stream.slice(pick_time - time_before, pick_time + time_after)
|
||||
if len(stream.traces) == 0:
|
||||
print(f"no data in: {trace_path}")
|
||||
else:
|
||||
sampling_rate = stream.traces[0].stats.sampling_rate
|
||||
|
||||
return sampling_rate, stream
|
||||
|
||||
|
||||
def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
|
||||
"""
|
||||
Convert mseed files to seisbench dataset format
|
||||
:param input_path: folder with mseed files
|
||||
:param catalog_path: path to events catalog in quakeml format
|
||||
:param output_path: folder to save seisbench dataset
|
||||
:return:
|
||||
"""
|
||||
logger.info("Loading events catalog ...")
|
||||
events = read_events(catalog_path)
|
||||
events_stats = split_events(events, input_path)
|
||||
|
||||
metadata_path = output_path + "/metadata.csv"
|
||||
waveforms_path = output_path + "/waveforms.hdf5"
|
||||
|
||||
logger.debug("Catalog loaded, starting conversion ...")
|
||||
|
||||
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
|
||||
writer.data_format = {
|
||||
"dimension_order": "CW",
|
||||
"component_order": "ZNE",
|
||||
}
|
||||
for i, event in enumerate(events):
|
||||
logger.debug(f"Converting {i} event")
|
||||
event_params = get_event_params(event)
|
||||
event_params["split"] = events_stats.loc[i, "split"]
|
||||
|
||||
for pick in event.picks:
|
||||
trace_params = get_trace_params(pick)
|
||||
sampling_rate, stream = load_stream(input_path, trace_params)
|
||||
if stream is None:
|
||||
continue
|
||||
|
||||
actual_t_start, data, _ = sbu.stream_to_array(
|
||||
stream,
|
||||
component_order=writer.data_format["component_order"],
|
||||
)
|
||||
|
||||
trace_params["trace_sampling_rate_hz"] = sampling_rate
|
||||
trace_params["trace_start_time"] = str(actual_t_start)
|
||||
|
||||
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
|
||||
pick_idx = (pick_time - actual_t_start) * sampling_rate
|
||||
|
||||
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
|
||||
|
||||
writer.add_trace({**event_params, **trace_params}, data)
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
|
||||
parser.add_argument('--input_path', type=str, help='Path to mseed files')
|
||||
parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
|
||||
parser.add_argument('--output_path', type=str, help='Path to output files')
|
||||
args = parser.parse_args()
|
||||
|
||||
|
||||
convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)
|
utils/utils.py (new file, 230 lines)
@ -0,0 +1,230 @@
|
||||
import os
|
||||
import pandas as pd
|
||||
import glob
|
||||
from pathlib import Path
|
||||
|
||||
import obspy
|
||||
from obspy.core.event import read_events
|
||||
|
||||
import seisbench.data as sbd
|
||||
import seisbench.util as sbu
|
||||
import sys
|
||||
import logging
|
||||
|
||||
logging.basicConfig(filename="output.out",
|
||||
filemode='a',
|
||||
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
|
||||
datefmt='%H:%M:%S',
|
||||
level=logging.DEBUG)
|
||||
|
||||
logger = logging.getLogger('converter')
|
||||
|
||||
def create_traces_catalog(directory, years):
    traces = []
    for year in years:
        year_directory = f"{directory}/{year}"
        # collect all files for the given year (skip sub-directories)
        files = [f for f in glob.glob(f"{year_directory}/**/*", recursive=True) if os.path.isfile(f)]
        for f in files:
            st = obspy.read(f)

            for tr in st.traces:
                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

    traces_catalog = pd.DataFrame(traces)
    # pandas to_csv has no `append` argument; use mode='a' to append to an existing file
    traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)
|
||||
|
||||
|
||||
def split_events(events, input_path):
|
||||
|
||||
logger.info("Splitting available events into train, dev and test sets ...")
|
||||
events_stats = pd.DataFrame()
|
||||
events_stats.index.name = "event"
|
||||
|
||||
for i, event in enumerate(events):
|
||||
#check if mseed exists
|
||||
actual_picks = 0
|
||||
for pick in event.picks:
|
||||
trace_params = get_trace_params(pick)
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
if os.path.isfile(trace_path):
|
||||
actual_picks += 1
|
||||
|
||||
events_stats.loc[i, "pick_count"] = actual_picks
|
||||
|
||||
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
|
||||
|
||||
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
|
||||
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
|
||||
|
||||
events_stats['split'] = 'test'
|
||||
for i, event in events_stats.iterrows():
|
||||
if event['pick_count_cumsum'] < train_th:
|
||||
events_stats.loc[i, 'split'] = 'train'
|
||||
elif event['pick_count_cumsum'] < dev_th:
|
||||
events_stats.loc[i, 'split'] = 'dev'
|
||||
else:
|
||||
break
|
||||
|
||||
return events_stats
|
||||
|
||||
|
||||
def get_event_params(event):
|
||||
origin = event.preferred_origin()
|
||||
if origin is None:
|
||||
return {}
|
||||
# print(origin)
|
||||
|
||||
mag = event.preferred_magnitude()
|
||||
|
||||
source_id = str(event.resource_id)
|
||||
|
||||
event_params = {
|
||||
"source_id": source_id,
|
||||
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
|
||||
"source_latitude_deg": origin.latitude,
|
||||
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
|
||||
"source_longitude_deg": origin.longitude,
|
||||
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
|
||||
"source_depth_km": origin.depth / 1e3,
|
||||
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
|
||||
"uncertainty"] is not None else None,
|
||||
}
|
||||
|
||||
if mag is not None:
|
||||
event_params["source_magnitude"] = mag.mag
|
||||
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
|
||||
event_params["source_magnitude_type"] = mag.magnitude_type
|
||||
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
|
||||
|
||||
return event_params
|
||||
|
||||
|
||||
def get_trace_params(pick):
|
||||
net = pick.waveform_id.network_code
|
||||
sta = pick.waveform_id.station_code
|
||||
|
||||
trace_params = {
|
||||
"station_network_code": net,
|
||||
"station_code": sta,
|
||||
"trace_channel": pick.waveform_id.channel_code,
|
||||
"station_location_code": pick.waveform_id.location_code,
|
||||
"time": pick.time
|
||||
}
|
||||
|
||||
return trace_params
|
||||
|
||||
|
||||
def find_trace(pick_time, traces):
|
||||
for tr in traces:
|
||||
if pick_time > tr.stats.endtime:
|
||||
continue
|
||||
if pick_time >= tr.stats.starttime:
|
||||
# print(pick_time, " - selected trace: ", tr)
|
||||
return tr
|
||||
|
||||
logger.warning(f"no matching trace for peak: {pick_time}")
|
||||
return None
|
||||
|
||||
|
||||
def get_trace_path(input_path, trace_params):
|
||||
year = trace_params["time"].year
|
||||
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
|
||||
net = trace_params["station_network_code"]
|
||||
station = trace_params["station_code"]
|
||||
tr_channel = trace_params["trace_channel"]
|
||||
|
||||
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
|
||||
return path
|
||||
|
||||
|
||||
def load_trace(input_path, trace_params):
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
trace = None
|
||||
|
||||
if not os.path.isfile(trace_path):
|
||||
logger.warning(trace_path + " not found")
|
||||
else:
|
||||
stream = obspy.read(trace_path)
|
||||
if len(stream.traces) > 1:
|
||||
trace = find_trace(trace_params["time"], stream.traces)
|
||||
elif len(stream.traces) == 0:
|
||||
logger.warning(f"no data in: {trace_path}")
|
||||
else:
|
||||
trace = stream.traces[0]
|
||||
|
||||
return trace
|
||||
|
||||
|
||||
def load_stream(input_path, trace_params, time_before=60, time_after=60):
|
||||
trace_path = get_trace_path(input_path, trace_params)
|
||||
sampling_rate, stream = None, None
|
||||
pick_time = trace_params["time"]
|
||||
|
||||
if not os.path.isfile(trace_path):
|
||||
print(trace_path + " not found")
|
||||
else:
|
||||
stream = obspy.read(trace_path)
|
||||
stream = stream.slice(pick_time - time_before, pick_time + time_after)
|
||||
if len(stream.traces) == 0:
|
||||
print(f"no data in: {trace_path}")
|
||||
else:
|
||||
sampling_rate = stream.traces[0].stats.sampling_rate
|
||||
|
||||
return sampling_rate, stream
|
||||
|
||||
|
||||
def convert_mseed_to_seisbench_format():
|
||||
input_path = "/net/pr2/projects/plgrid/plggeposai"
|
||||
logger.info("Loading events catalog ...")
|
||||
events = read_events(input_path + "/BOIS_all.xml")
|
||||
events_stats = split_events(events, input_path)
|
||||
output_path = input_path + "/seisbench_format"
|
||||
metadata_path = output_path + "/metadata.csv"
|
||||
waveforms_path = output_path + "/waveforms.hdf5"
|
||||
|
||||
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
|
||||
writer.data_format = {
|
||||
"dimension_order": "CW",
|
||||
"component_order": "ZNE",
|
||||
}
|
||||
for i, event in enumerate(events):
|
||||
logger.debug(f"Converting {i} event")
|
||||
event_params = get_event_params(event)
|
||||
event_params["split"] = events_stats.loc[i, "split"]
|
||||
# b = False
|
||||
|
||||
for pick in event.picks:
|
||||
trace_params = get_trace_params(pick)
|
||||
sampling_rate, stream = load_stream(input_path, trace_params)
|
||||
if stream is None:
|
||||
continue
|
||||
|
||||
actual_t_start, data, _ = sbu.stream_to_array(
|
||||
stream,
|
||||
component_order=writer.data_format["component_order"],
|
||||
)
|
||||
|
||||
trace_params["trace_sampling_rate_hz"] = sampling_rate
|
||||
trace_params["trace_start_time"] = str(actual_t_start)
|
||||
|
||||
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
|
||||
pick_idx = (pick_time - actual_t_start) * sampling_rate
|
||||
|
||||
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
|
||||
|
||||
writer.add_trace({**event_params, **trace_params}, data)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
convert_mseed_to_seisbench_format()
|
||||
# create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])