Added scripts converting mseeds from Bogdanka to seisbench format, extended readme, modified logging

This commit is contained in:
2023-09-26 10:50:46 +02:00
parent 78ac51478c
commit aa39980573
15 changed files with 1788 additions and 66 deletions
+57 -21
@@ -2,10 +2,9 @@
This repo contains notebooks and scripts demonstrating how to:
- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various cnn models available in seisbench library and compare their performance of detecting P phase, check the [script](scripts/pipeline.py)
- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the seisbench library and compare their performance in detecting P and S waves, check the [script](scripts/pipeline.py)
- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell
WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models_on_igf_data"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
2. Transform data into seisbench format. (unofficial)
* Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
* Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
2. Transform data into seisbench format.
To use the functionality of the Seisbench library, the data need to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.
The script assumes that:
* the data is stored in the following directory structure:
`input_path/year/station_network_code/station_code/trace_channel.D` e.g.
`input_path/2018/PL/ALBE/EHE.D/`
* the file names follow the pattern:
`station_network_code.station_code..trace_channel.D.year.day_of_year`
e.g. `PL.ALBE..EHE.D.2018.282`
* the event catalog is stored in QuakeML format
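As a rough illustration of the layout described above, the sketch below builds the expected file location from its components and splits a file name back into them. The helper names `expected_mseed_path` and `parse_mseed_name` are hypothetical and not part of `mseeds_to_seisbench.py`; zero-padding the day of year to three digits is also an assumption.

```python
from pathlib import Path


def expected_mseed_path(input_path, network, station, channel, year, day_of_year):
    """Build the file path implied by the directory and file name patterns."""
    # Assumption: the day of year is zero-padded to three digits (e.g. 007).
    file_name = f"{network}.{station}..{channel}.D.{year}.{day_of_year:03d}"
    return Path(input_path) / str(year) / network / station / f"{channel}.D" / file_name


def parse_mseed_name(file_name):
    """Split a file name such as PL.ALBE..EHE.D.2018.282 into its parts."""
    parts = file_name.split(".")
    if len(parts) != 7:
        raise ValueError(f"Unexpected file name: {file_name}")
    network, station, _location, channel, _data_type, year, day = parts
    return {
        "network": network,
        "station": station,
        "channel": channel,
        "year": int(year),
        "day_of_year": int(day),
    }


path = expected_mseed_path("input_path", "PL", "ALBE", "EHE", 2018, 282)
info = parse_mseed_name(path.name)
```

Running such a check over a few files before the conversion can show early whether the data follows the expected structure.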
Run the script `mseeds_to_seisbench` located in the `utils` directory
3. Run the pipeline script:
```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```
If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, computing name and paths) and submit the job to the queue with the `sbatch` command on a login node of e.g. Ares:
```
cd utils
sbatch convert_data.sh
```
If your data has a different structure or format, use the notebooks to gain an understanding of the Seisbench format and what needs to be done to transform your data:
* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) notebook
`python pipeline.py`
3. Adjust the `config.json` and specify:
* `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
* `data_path` - the path to the data in the Seisbench format
* `experiment_count` - the number of experiments to run for each model type
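A minimal sketch of checking these three settings before starting the pipeline. The example values and the helper `missing_settings` are assumptions made for illustration; the real `config.json` may contain further keys.

```python
import json

# The three settings named above; other keys in config.json are ignored here.
REQUIRED_SETTINGS = ("dataset_name", "data_path", "experiment_count")


def missing_settings(config):
    """Return the required settings that are absent from the config dict."""
    return [key for key in REQUIRED_SETTINGS if key not in config]


# Hypothetical example values, not taken from the project.
example_config = json.loads(
    """
    {
        "dataset_name": "bogdanka",
        "data_path": "datasets/bogdanka/seisbench_format",
        "experiment_count": 20
    }
    """
)

missing = missing_settings(example_config)
if missing:
    raise SystemExit(f"config.json is missing: {', '.join(missing)}")
```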
4. Run the pipeline script
`python pipeline.py`
The script performs the following steps:
* Generates evaluation targets
* Generates evaluation targets in `datasets/<dataset_name>/targets` directory.
* Trains multiple versions of GPD, PhaseNet and ... models to find the hyperparameters that produce the lowest validation loss.
This step uses the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process and store the results.
The results are available at
`https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
* Uses the best performing model of each type to generate predictions
* Evaluates the performance of each model by comparing the predictions with the evaluation targets
* Saves the results in the `scripts/pred` directory
*
The default settings are saved in config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.
The results are available at
`https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
Weights and training logs can be downloaded from the platform.
Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
* Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
* Metrics and hyperparameters are saved in `<run_id>` folders
* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
The results are saved in the `scripts/pred/results.csv` file.
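Because the checkpoint file names described above encode the sweep, run, epoch and validation loss, the best local checkpoint can be picked from the names alone. The sketch below is an illustration, not the pipeline's own selection logic: the helper names and the example file names are hypothetical, and it assumes that sweep and run IDs contain no hyphens.

```python
import re

# Pattern following the documented naming scheme:
# <dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<n>-val_loss=<x>.ckpt
CKPT_PATTERN = re.compile(
    r"^(?P<prefix>.+)_sweep=(?P<sweep_id>[^-]+)-run=(?P<run_id>[^-]+)"
    r"-epoch=(?P<epoch>\d+)-val_loss=(?P<val_loss>\d+(?:\.\d+)?)\.ckpt$"
)


def parse_checkpoint_name(file_name):
    """Return the fields encoded in a checkpoint name, or None if it does not match."""
    match = CKPT_PATTERN.match(file_name)
    if match is None:
        return None
    info = match.groupdict()
    info["epoch"] = int(info["epoch"])
    info["val_loss"] = float(info["val_loss"])
    return info


def best_checkpoint(file_names):
    """Return the matching file name with the lowest validation loss, if any."""
    candidates = []
    for name in file_names:
        info = parse_checkpoint_name(name)
        if info is not None:
            candidates.append((info["val_loss"], name))
    return min(candidates)[1] if candidates else None


# Hypothetical file names used only to exercise the helpers.
names = [
    "bogdanka_GPD_sweep=abc123-run=run001-epoch=12-val_loss=0.0345.ckpt",
    "bogdanka_GPD_sweep=abc123-run=run002-epoch=8-val_loss=0.0412.ckpt",
]
best = best_checkpoint(names)
```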
The default settings are saved in the `config.json` file. To change the settings, edit the `config.json` file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.
### Troubleshooting