Demo notebooks and scripts for EPOS AI Platform
This repo contains notebooks and scripts demonstrating how to:
- Prepare data for training a SeisBench model that detects P and S waves (i.e. transform MSEED files into the SeisBench data format); see the notebook and the script.
- Train various CNN models available in the SeisBench library and compare their performance in detecting P and S waves; see the script.
Acknowledgments
This code is based on the pick-benchmark, the repository accompanying the paper: Which picker fits my data? A quantitative evaluation of deep learning based seismic pickers
Installation method 1
Please download and install Mambaforge following the official guide.
After successful installation and within the Mambaforge environment please clone this repository:
git clone https://epos-apps.grid.cyfronet.pl/epos-ai/platform-demo-scripts.git
and please run for Linux or Windows platforms:
cd platform-demo-scripts
mamba env create -f epos-ai-train.yml
or for OSX:
cd platform-demo-scripts
mamba env create -f epos-ai-train-osx.yml
This will create a conda environment named platform-demo-scripts
with all required packages installed.
To run the notebooks and scripts from this repository it is necessary to activate the platform-demo-scripts
environment by running:
conda activate platform-demo-scripts
Installation method 2
Please install Poetry, a tool for dependency management and packaging in Python. With this method, Poetry alone creates the Python environment and installs the dependencies.
To install all dependencies, run:
poetry install
To run the notebooks and scripts from this repository it is necessary to activate the poetry environment by running:
poetry shell
Usage
- Prepare a .env file with the following content:
WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
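Before starting a long run, it can help to check that the .env file defines every Weights & Biases setting listed above. The following is a minimal sketch, not part of the repository: it assumes one KEY=VALUE pair per line, and the helper names `parse_env_text` and `missing_keys` are hypothetical.

```python
# Keys taken from the .env example in this README.
REQUIRED_KEYS = ("WANDB_HOST", "WANDB_API_KEY", "WANDB_USER", "WANDB_PROJECT")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so '"your key"' becomes 'your key'.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


example = """
WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
"""
settings = parse_env_text(example)
print(missing_keys(settings))
```

To check a real file, the text could be read first, for example with `pathlib.Path(".env").read_text()`.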
- Transform data into the SeisBench format.
To use the functionality of the SeisBench library, data must first be transformed into the SeisBench data format.
If your data is stored in the MSEED format and the catalog in the QuakeML format, you can use the prepared script mseeds_to_seisbench.py to perform the transformation. Please make sure that your data has the same structure as the data used in this project. The script assumes that:
- the data is stored in the following directory structure:
input_path/year/station_network_code/station_code/trace_channel.D
e.g. input_path/2018/PL/ALBE/EHE.D/
- the file names follow the pattern:
station_network_code.station_code..trace_channel.D.year.day_of_year
e.g. PL.ALBE..EHE.D.2018.282
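The layout described above can be expressed as a small helper, which may be useful for checking whether existing files match what the script expects. This is an illustrative sketch, not code from the repository: the function names are hypothetical, and zero-padding the day of year to three digits is an assumption (the example only shows a three-digit day).

```python
from pathlib import PurePosixPath


def expected_mseed_path(input_path, network, station, channel, year, day_of_year):
    """Build the path implied by the directory and file-name patterns above."""
    directory = PurePosixPath(input_path) / str(year) / network / station / f"{channel}.D"
    # Assumption: day of year is zero-padded to three digits (e.g. 007).
    file_name = f"{network}.{station}..{channel}.D.{year}.{day_of_year:03d}"
    return directory / file_name


def parse_mseed_name(file_name):
    """Split a file name such as PL.ALBE..EHE.D.2018.282 into its parts."""
    network, station, location, channel, data_type, year, day = file_name.split(".")
    return {
        "network": network,
        "station": station,
        "location": location,  # empty in the pattern above
        "channel": channel,
        "year": int(year),
        "day_of_year": int(day),
    }


print(expected_mseed_path("input_path", "PL", "ALBE", "EHE", 2018, 282))
print(parse_mseed_name("PL.ALBE..EHE.D.2018.282"))
```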
Run the mseeds_to_seisbench.py script with the following arguments:
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
If you want to run the script on a cluster, you can use the template script convert_data_template.sh. After adjusting the grant name, the paths to the conda environment, and the paths to the data, submit the job to the queue with the sbatch command on a login node of, e.g., Ares:
sbatch convert_data_template.sh
If your data has a different structure or format, check the notebooks to gain an understanding of the Seisbench format and what needs to be done to transform your data:
- Seisbench example or
- [Transforming mseeds from Bogdanka to Seisbench format](notebooks/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook
- Adjust config.json and specify:
- dataset_name - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
- data_path - the path to the data in the SeisBench format
- experiment_count - the number of experiments to run for each model type
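A quick check of the three settings listed above can catch typos before the pipeline starts. The sketch below is only an illustration: the values are made up, the `validate_config` helper is hypothetical, and treating experiment_count as a positive integer is an assumption. The real config.json may contain further keys.

```python
import json

# The three fields this README asks you to set in config.json.
REQUIRED_FIELDS = ("dataset_name", "data_path", "experiment_count")


def validate_config(config):
    """Raise an error if a documented field is missing or clearly invalid."""
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    # Assumption: experiment_count is a positive integer.
    count = config["experiment_count"]
    if not isinstance(count, int) or count < 1:
        raise ValueError("experiment_count should be a positive integer")
    return config


# Hypothetical example values, not taken from the repository.
example_text = """
{
  "dataset_name": "my_dataset",
  "data_path": "datasets/my_dataset/seisbench_format",
  "experiment_count": 5
}
"""
config = validate_config(json.loads(example_text))
print(config["dataset_name"], config["experiment_count"])
```

For a file on disk, the dictionary could be loaded with `json.load` on an open file handle and passed to the same function.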
- Run the pipeline script:
python pipeline.py
The script performs the following steps:
- Generates evaluation targets in the datasets/<dataset_name>/targets directory.
- Trains multiple versions of the GPD, PhaseNet, BasicPhaseAE, and EQTransformer models to find the hyperparameters that produce the lowest validation loss.
This step uses the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process, and store the results. The results are available at
https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>
Weights and training logs can be downloaded from the platform. Additionally, the most important data are saved locally in the weights/<dataset_name>_<model_name>/ directory:
- Weights of the best checkpoint of each model are saved as
<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt
- Metrics and hyperparams are saved in <run_id> folders
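Because the checkpoint file name encodes the sweep, run, epoch, and validation loss, these values can be recovered from the name alone. The following sketch is not part of the repository. It assumes that sweep and run identifiers contain no hyphens and that the validation loss is written as a plain decimal number; the example file name is invented.

```python
import re

# Mirrors the documented pattern:
# <dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<n>-val_loss=<x>.ckpt
CHECKPOINT_PATTERN = re.compile(
    r"^(?P<prefix>.+)_sweep=(?P<sweep_id>[^-]+)-run=(?P<run_id>[^-]+)"
    r"-epoch=(?P<epoch>\d+)-val_loss=(?P<val_loss>\d+(?:\.\d+)?)\.ckpt$"
)


def parse_checkpoint_name(file_name):
    """Return the fields encoded in a checkpoint name, or None if it does not match."""
    match = CHECKPOINT_PATTERN.match(file_name)
    if match is None:
        return None
    info = match.groupdict()
    info["epoch"] = int(info["epoch"])
    info["val_loss"] = float(info["val_loss"])
    return info


def best_checkpoint(file_names):
    """Pick the matching file name with the lowest validation loss."""
    candidates = []
    for name in file_names:
        info = parse_checkpoint_name(name)
        if info is not None:
            candidates.append((info["val_loss"], name))
    return min(candidates)[1] if candidates else None


# Hypothetical file name used only for illustration.
name = "mydata_PhaseNet_sweep=abc123-run=xyz789-epoch=12-val_loss=0.0345.ckpt"
print(parse_checkpoint_name(name))
```

The prefix is kept as a single field because dataset and model names may themselves contain underscores, so splitting them reliably would need extra knowledge.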
- Uses the best-performing model of each type to generate predictions. The predictions are saved in the scripts/pred/<dataset_name>_<model_name>/<run_id> directory.
- Evaluates the performance of each model by comparing the predictions with the evaluation targets and calculating MAE metrics. The results are saved in the scripts/pred/results.csv file. They are additionally logged in the Weights & Biases platform as summary metrics of the corresponding runs.
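To make the evaluation step concrete, the sketch below shows how a mean absolute error (MAE) between predicted and target pick positions could be computed. It is a generic illustration, not the pipeline's own implementation: the function names, the use of sample indices, and the 100 Hz sampling rate are all assumptions.

```python
def mean_absolute_error(predicted, target):
    """Average of the absolute differences between paired values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("at least one pair of values is required")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def pick_mae_seconds(predicted_samples, target_samples, sampling_rate_hz):
    """MAE of pick positions, converted from samples to seconds."""
    return mean_absolute_error(predicted_samples, target_samples) / sampling_rate_hz


# Hypothetical P-wave picks (sample indices) for three traces.
predicted = [1010, 1995, 3020]
target = [1000, 2000, 3000]
print(pick_mae_seconds(predicted, target, sampling_rate_hz=100.0))
```

A lower value means the predicted onsets lie closer to the reference picks on average.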
The default settings for the maximum number of experiments and the paths are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script. For example, to change the sweep configuration file for the GPD model, run:
python pipeline.py --gpd_config <new config file>
The new config file should be placed in the experiments folder or in the location specified by the configs_path parameter in the config.json file.
Sweep configs define the maximum number of epochs to run and the hyperparameter search space for the following parameters:
- batch_size
- learning_rate
The PhaseNet model has additional parameters:
- norm - normalization method, options ('peak', 'std')
- pretrained - pretrained SeisBench models used for transfer learning
- finetuning - the type of layers to finetune first, options ('all', 'top', 'encoder', 'decoder')
- lr_reduce_factor - factor by which the learning rate is reduced after unfreezing layers
The GPD model has additional parameters for filtering:
- highpass - highpass filter frequency
- lowpass - lowpass filter frequency
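For orientation, a sweep configuration for the GPD model might look roughly like the dictionary below, which follows the general Weights & Biases sweep layout (method, metric, parameters). This is a sketch under assumptions, not a copy of the files in the experiments folder: the search method, the metric name `val_loss`, the `epochs` key, and all value ranges are hypothetical.

```python
import json

# Hypothetical sweep configuration for the GPD model.
gpd_sweep = {
    "method": "bayes",  # assumption: the repository may use another search method
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        # Search space for the two parameters shared by all models.
        "batch_size": {"values": [64, 128, 256]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-4,
            "max": 1e-2,
        },
        # GPD-specific filter frequencies (illustrative values).
        "highpass": {"values": [1, 2]},
        "lowpass": {"values": [10, 20]},
        # Hypothetical key for the maximum number of epochs.
        "epochs": {"value": 20},
    },
}

print(json.dumps(gpd_sweep, indent=2))
```

The existing files in the experiments folder remain the authoritative reference for the key names the pipeline actually reads.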
The sweep configs are saved in the experiments folder.
If you have multiple datasets, you can run the pipeline for each dataset separately by specifying the dataset name as an argument:
python pipeline.py --dataset <dataset_name>
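Running the pipeline for several datasets in a row can be scripted. The sketch below only relies on the --dataset argument shown above; the helper names and dataset names are hypothetical, and it assumes the script is started from the directory that contains pipeline.py. By default it prints the commands instead of executing them.

```python
import subprocess
import sys


def pipeline_commands(dataset_names, script="pipeline.py"):
    """Build one command per dataset, using the current Python interpreter."""
    return [[sys.executable, script, "--dataset", name] for name in dataset_names]


def run_all(dataset_names, dry_run=True):
    """Print the commands, or run them one after another when dry_run is False."""
    for command in pipeline_commands(dataset_names):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first dataset whose run fails.
            subprocess.run(command, check=True)


# Hypothetical dataset names.
run_all(["dataset_a", "dataset_b"])
```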
Troubleshooting
- Problem with reading the catalog file: please make sure that your QuakeML XML file has the following opening and closing tags:
<?xml version="1.0"?>
<q:quakeml xmlns="http://quakeml.org/xmlns/bed/1.2" xmlns:q="http://quakeml.org/xmlns/quakeml/1.2">
....
</q:quakeml>
- wandb: ERROR Run .. errored: OSError(24, 'Too many open files'): see https://github.com/wandb/wandb/issues/2825
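The catalog check from the first item can be automated with the Python standard library. The sketch below tests whether the root element of an XML document is the namespaced quakeml tag shown above. The helper name `has_quakeml_root` is hypothetical, and the check covers only the root tag, not the rest of the QuakeML schema.

```python
import xml.etree.ElementTree as ET

# Root tag in ElementTree notation: {namespace}local_name.
QUAKEML_ROOT = "{http://quakeml.org/xmlns/quakeml/1.2}quakeml"


def has_quakeml_root(xml_text):
    """Return True if the document parses and its root is q:quakeml."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == QUAKEML_ROOT


good = (
    '<?xml version="1.0"?>'
    '<q:quakeml xmlns="http://quakeml.org/xmlns/bed/1.2" '
    'xmlns:q="http://quakeml.org/xmlns/quakeml/1.2">'
    "<eventParameters/>"
    "</q:quakeml>"
)
# Same content without the q:quakeml wrapper.
bad = '<?xml version="1.0"?><eventParameters xmlns="http://quakeml.org/xmlns/bed/1.2"/>'

print(has_quakeml_root(good), has_quakeml_root(bad))
```

For a catalog stored on disk, `ET.parse(path).getroot().tag` could be compared with the same constant.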
Licence
The code is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
Copyright
Copyright © 2023 ACK Cyfronet AGH, Poland.
This work was partially funded by the EPOS Project, funded within the framework of PL-POIR4.2.