# Demo notebooks and scripts for EPOS AI Platform


This repository contains notebooks and scripts that demonstrate how to:
- Prepare data for training a SeisBench model that detects P and S waves, i.e. transform MSEED files into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html): see the [notebook](notebooks/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](scripts/mseeds_to_seisbench.py)

[//]: # (- [to update] Explore available data, check the [notebook]&#40;notebooks/Explore%20igf%20data.ipynb&#41;)
- Train various CNN models available in the SeisBench library and compare how well they detect P and S waves: see the [script](scripts/pipeline.py)
  
[//]: # (- [to update] Validate model performance, check the [notebook]&#40;notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb&#41;)
[//]: # (- [to update] Use model for detecting P phase, check the [notebook]&#40;notebooks/Present%20model%20predictions.ipynb&#41;)


### Acknowledgments
This code is based on the [pick-benchmark](https://github.com/seisbench/pick-benchmark), the repository accompanying the paper:
[Which picker fits my data? A quantitative evaluation of deep learning based seismic pickers](https://doi.org/10.1029/2021JB023499)

### Installation method 1

Please download and install [Mambaforge](https://github.com/conda-forge/miniforge#mambaforge) following the [official guide](https://github.com/conda-forge/miniforge#install).

After installation, clone this repository from within the Mambaforge environment:

```
git clone ssh://git@git.plgrid.pl:7999/eai/platform-demo-scripts.git
```
Then, on Linux or Windows, run:

```
cd platform-demo-scripts
mamba env create -f epos-ai-train.yml
```
or, on macOS:
```
cd platform-demo-scripts
mamba env create -f epos-ai-train-osx.yml
```

This will create a conda environment named `platform-demo-scripts` with all required packages installed.

Before running the notebooks and scripts from this repository, activate the `platform-demo-scripts` environment:

```
conda activate platform-demo-scripts
```

### Installation method 2

Please [install Poetry](https://python-poetry.org/docs/#installation), a tool for dependency management and packaging in Python.
With this method, Poetry alone creates the Python environment and installs the dependencies.

To install all dependencies, run:

```
poetry install
```

Before running the notebooks and scripts from this repository, activate the Poetry environment:

```
poetry shell
```
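
Whichever installation method you choose, a quick check that the activated environment can locate the main libraries may save time later. The sketch below is only an illustration: it assumes the libraries are importable as `seisbench` and `wandb`, and the helper name `find_missing_modules` is hypothetical.

```python
import importlib.util

# Assumption: import names of the two main libraries mentioned in this README.
REQUIRED_MODULES = ("seisbench", "wandb")


def find_missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `find_missing_modules()` inside the activated environment should return an empty list if both packages were installed.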

### Usage

1. Prepare a `.env` file with the following content:
   ```
   WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
   WANDB_API_KEY="your key"
   WANDB_USER="your user"
   WANDB_PROJECT="training_seisbench_models"
   BENCHMARK_DEFAULT_WORKER=2
   ```
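
How the scripts read this file is not shown here. As a rough sketch, assuming the file uses plain `KEY=VALUE` lines as in the example above, it could be parsed with the standard library alone. The function names `parse_env_text` and `load_env_file` are hypothetical and not part of this repository.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as used in the example above.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Read `path` and copy its variables into os.environ, keeping existing values."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Using `setdefault` means that variables already exported in the shell take precedence over the file, which is one possible design choice rather than a requirement.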

2. Transform the data into the SeisBench format.

    To use the functionality of the SeisBench library, the data must first be converted to the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html).

    If your waveforms are stored in the MSEED format and your catalog in the QuakeML format, you can use the prepared script `mseeds_to_seisbench.py` to perform the conversion. Please make sure that your data has the same structure as the data used in this project.
    The script assumes that:
    * the data is stored in the following directory structure:
      `input_path/year/station_network_code/station_code/trace_channel.D`, e.g.
      `input_path/2018/PL/ALBE/EHE.D/`
    * the file names follow the pattern:
      `station_network_code.station_code..trace_channel.D.year.day_of_year`, e.g.
      `PL.ALBE..EHE.D.2018.282`

    Run the `mseeds_to_seisbench.py` script with the following arguments:
    ```
    python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
    ```
    If you want to run the script on a cluster, you can use the template script `convert_data_template.sh`.
    After adjusting the grant name, the paths to the conda environment and the paths to the data, submit the job to the queue with the `sbatch` command on a login node of, e.g., Ares:
    ```
    sbatch convert_data_template.sh
    ```
   
    If your data has a different structure or format, the following notebooks explain the SeisBench format and what is needed to convert your data:
    * [SeisBench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
    * [Transforming mseeds from Bogdanka to Seisbench format](notebooks/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) notebook
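
To check whether your files match the directory structure and file-name pattern that `mseeds_to_seisbench.py` assumes, it can help to build the expected path for one station, channel and day. The following is a minimal sketch, not code from this repository: the function name is hypothetical, and zero-padding the day of year to three digits is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath


def expected_mseed_path(input_path, network, station, channel, day):
    """Build the path described above for one station, channel and day."""
    year = f"{day.year:04d}"
    # Assumption: the day of year is zero-padded to three digits (e.g. 007).
    day_of_year = f"{day.timetuple().tm_yday:03d}"
    channel_dir = f"{channel}.D"
    # The double dot is copied from the documented file-name pattern.
    file_name = f"{network}.{station}..{channel}.D.{year}.{day_of_year}"
    return PurePosixPath(input_path) / year / network / station / channel_dir / file_name
```

For example, `expected_mseed_path("input_path", "PL", "ALBE", "EHE", date(2018, 10, 9))` gives `input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282`, since 9 October is day 282 of 2018.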
    

3. Adjust the `config.json` and specify: 
   * `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
   * `data_path` - the path to the data in the Seisbench format
   * `experiment_count` - the number of experiments to run for each model type
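
Before starting a long run, it may be worth confirming that `config.json` contains the three keys listed above. This sketch assumes that `experiment_count` is a positive integer, which the text suggests but does not state. The helper `load_config` is hypothetical and may differ from how the pipeline reads its configuration.

```python
import json

# Keys documented in this README; the real config may contain more.
REQUIRED_KEYS = ("dataset_name", "data_path", "experiment_count")


def load_config(path="config.json"):
    """Load the JSON config and fail early if a documented key is missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    # Assumption: experiment_count is a positive integer.
    count = config["experiment_count"]
    if not isinstance(count, int) or count < 1:
        raise ValueError("experiment_count should be a positive integer")
    return config
```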
   

4. Run the pipeline script:
   `python pipeline.py`

   The script performs the following steps:
   1. Generates evaluation targets in the `datasets/<dataset_name>/targets` directory.
   1. Trains multiple versions of GPD, PhaseNet and ... models to find the hyperparameters that produce the lowest validation loss.

      This step uses the Weights & Biases platform to run the hyperparameter search (called sweeping), track the training process and store the results.
      The results are available at
      `https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`.
      Weights and training logs can be downloaded from the platform.
      Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
      * The weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
      * Metrics and hyperparameters are saved in `<run_id>` folders

   1. Uses the best-performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
   1. Evaluates the performance of each model by comparing the predictions with the evaluation targets and calculating MAE metrics.
      The results are saved in the `scripts/pred/results.csv` file. They are also logged to the Weights & Biases platform as summary metrics of the corresponding runs.
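
The checkpoint file names described in the training step encode the sweep, run, epoch and validation loss, so they can be inspected without opening the files. The sketch below parses that naming pattern with a regular expression. It assumes that sweep and run IDs contain no hyphens and that the loss is written as a plain decimal; the function names and the example file name in the note are hypothetical.

```python
import re

# Mirrors: <dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<n>-val_loss=<x>.ckpt
CKPT_PATTERN = re.compile(
    r"^(?P<prefix>.+)_sweep=(?P<sweep_id>[^-]+)-run=(?P<run_id>[^-]+)"
    r"-epoch=(?P<epoch>\d+)-val_loss=(?P<val_loss>\d+(?:\.\d+)?)\.ckpt$"
)


def parse_checkpoint_name(file_name):
    """Return the fields encoded in a checkpoint file name, or None if it does not match."""
    match = CKPT_PATTERN.match(file_name)
    if match is None:
        return None
    info = match.groupdict()
    info["epoch"] = int(info["epoch"])
    info["val_loss"] = float(info["val_loss"])
    return info


def lowest_loss_checkpoint(file_names):
    """Return the matching file name with the smallest validation loss, or None."""
    parsed = [(parse_checkpoint_name(name), name) for name in file_names]
    valid = [(info["val_loss"], name) for info, name in parsed if info is not None]
    return min(valid)[1] if valid else None
```

The `prefix` field keeps `<dataset_name>_<model_name>` together, because a dataset name may itself contain underscores and cannot be split reliably. For a made-up name such as `demo_GPD_sweep=abc123-run=xyz789-epoch=12-val_loss=0.0345.ckpt`, the parser returns epoch `12` and validation loss `0.0345`.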
    
   <br/>

   Default settings are stored in `config.json`. To change them, edit `config.json` or pass the new settings as arguments to the script. For example, to change the sweep configuration file for the GPD model, run:

   ```python pipeline.py --gpd_config <new config file>```
      
   Place the new config file in the `experiments` folder, or in the folder given by the `configs_path` parameter in `config.json`.
    
   If you have multiple datasets, you can run the pipeline for each dataset separately by specifying the dataset name as an argument:
    
   ```python pipeline.py --dataset <dataset_name>```
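
The idea that command-line arguments take precedence over defaults from `config.json` can be illustrated with `argparse`. This is a sketch under stated assumptions, not the implementation in `pipeline.py`: only the `--dataset` and `--gpd_config` flags come from this README, while the mapping onto the config keys `dataset_name` and `gpd_config` and both function names are hypothetical.

```python
import argparse


def build_parser():
    """CLI with the two options mentioned above; None means 'not given'."""
    parser = argparse.ArgumentParser(description="Sketch of config overrides")
    parser.add_argument("--dataset", default=None, help="dataset name")
    parser.add_argument("--gpd_config", default=None, help="sweep config file for GPD")
    return parser


def merge_settings(defaults, args):
    """Return a copy of `defaults` in which explicitly passed arguments win."""
    merged = dict(defaults)
    overrides = {
        # Assumption: --dataset maps onto the `dataset_name` config key.
        "dataset_name": args.dataset,
        # Assumption: the config stores the GPD sweep file under `gpd_config`.
        "gpd_config": args.gpd_config,
    }
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged
```

Leaving the argument defaults at `None` makes it possible to tell "not passed" apart from a real value, so untouched settings keep their values from the config file.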

### Troubleshooting

* Problem with reading the catalog file: make sure that your QuakeML XML file has the following opening and closing tags:
```
<?xml version="1.0"?>
<q:quakeml xmlns="http://quakeml.org/xmlns/bed/1.2" xmlns:q="http://quakeml.org/xmlns/quakeml/1.2">
  ....
</q:quakeml>
```
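
A catalog file can be checked for this root element before running the conversion. The sketch below uses only the standard library and tests whether the root tag is `quakeml` in the `http://quakeml.org/xmlns/quakeml/1.2` namespace shown above. The function name is hypothetical, and passing this check does not guarantee that the rest of the catalog is valid.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}localname".
EXPECTED_ROOT = "{http://quakeml.org/xmlns/quakeml/1.2}quakeml"


def has_quakeml_root(xml_text):
    """Return True if the root element is q:quakeml in the expected namespace."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == EXPECTED_ROOT
```

To check a file, read it in binary mode and pass the bytes, e.g. `has_quakeml_root(open(catalog_path, "rb").read())`, where `catalog_path` is your own path.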

* `wandb: ERROR Run .. errored: OSError(24, 'Too many open files')`: see https://github.com/wandb/wandb/issues/2825
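
The linked issue is the place to look for wandb-specific advice. Independently of that, error 24 generally means the process has reached its limit of open file descriptors. On Unix systems the current limit can be inspected, and the soft limit raised up to the hard limit, from Python. Whether this resolves the wandb error is an assumption, and the target value of 4096 is arbitrary.

```python
import resource  # available on Unix-like systems only


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit towards `target`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # Only ever increase the soft limit; leave an unlimited soft limit alone.
    if soft != resource.RLIM_INFINITY and wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

The change applies only to the current process and its children, so the function would need to run at the start of the training script.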

### Licence

The code is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE.txt) file for details.

### Copyright

Copyright © 2023 ACK Cyfronet AGH, Poland.
 
This work was partially funded by the EPOS Project, which is funded within the framework of PL-POIR4.2.