Added scripts converting mseeds from Bogdanka to seisbench format, extended readme, modified logging

2023-09-26 10:50:46 +02:00
parent 78ac51478c
commit aa39980573
15 changed files with 1788 additions and 66 deletions
@@ -2,10 +2,9 @@


 This repo contains notebooks and scripts demonstrating how to:
- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
-
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various cnn models available in seisbench library and compare their performance of detecting P phase, check the [script](scripts/pipeline.py) 
+- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
+- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
+- Train various cnn models available in seisbench library and compare their performance of detecting P and S waves, check the [script](scripts/pipeline.py) 
  
 - [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
 - [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell
   WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
   WANDB_API_KEY="your key"
   WANDB_USER="your user"
-   WANDB_PROJECT="training_seisbench_models_on_igf_data"
+   WANDB_PROJECT="training_seisbench_models"
   BENCHMARK_DEFAULT_WORKER=2

-2. Transform data into seisbench format. (unofficial)
-   * Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
-   * Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
+2. Transform data into seisbench format. 
+    
+    To use the SeisBench library, the data must first be converted to the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the conversion. Please make sure that your data has the same structure as the data used in this project.
+    The script assumes that:
+   *  the data is stored in the following directory structure:
+   `input_path/year/station_network_code/station_code/trace_channel.D` e.g.
+   `input_path/2018/PL/ALBE/EHE.D/`
+    * the file names follow the pattern:  
+    `station_network_code.station_code..trace_channel.D.year.day_of_year`
+   e.g. `PL.ALBE..EHE.D.2018.282`
+    * the event catalog is stored in QuakeML format
+   
+    Run the script `mseeds_to_seisbench.py`, located in the `utils` directory:

-3. Run the pipeline script:
+    ```
+    cd utils
+    python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
+    ```
+    If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, computing name and paths) and submit the job to the queue with the `sbatch` command on a login node of, e.g., Ares:
+   
+    ```
+    cd utils
+    sbatch convert_data.sh
+   ```
+   
+    If your data has a different structure or format, use the notebooks to gain an understanding of the Seisbench format and what needs to be done to transform your data:
+   * [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
+   * [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) notebook
+    
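The directory layout and file-name pattern assumed in step 2 can be checked before starting a long conversion. The following is a minimal sketch, not part of `mseeds_to_seisbench.py`: the helper names `parse_mseed_name` and `expected_mseed_path` are hypothetical, and it assumes that the empty field between the station and channel codes is a location code and that each file sits directly inside its `trace_channel.D` directory.

```python
from pathlib import Path


def parse_mseed_name(file_name: str) -> dict:
    """Split a name such as 'PL.ALBE..EHE.D.2018.282' into its fields."""
    parts = file_name.split(".")
    if len(parts) != 7:
        raise ValueError(f"unexpected file name: {file_name!r}")
    network, station, location, channel, data_type, year, day = parts
    # Assumption: the third field (empty in the README example) is a
    # location code, and the fifth field is the literal 'D' of the pattern.
    return {
        "network": network,
        "station": station,
        "location": location,
        "channel": channel,
        "data_type": data_type,
        "year": int(year),
        "day_of_year": int(day),
    }


def expected_mseed_path(input_path: str, file_name: str) -> Path:
    """Build input_path/year/network/station/channel.D/file_name."""
    fields = parse_mseed_name(file_name)
    return (
        Path(input_path)
        / str(fields["year"])
        / fields["network"]
        / fields["station"]
        / f"{fields['channel']}.{fields['data_type']}"
        / file_name
    )


if __name__ == "__main__":
    path = expected_mseed_path("input_path", "PL.ALBE..EHE.D.2018.282")
    print(path.as_posix())
    # prints input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
```

Comparing the result with the files actually present under `input_path` is a quick way to spot data that does not follow the expected structure.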

-   `python pipeline.py`
+3. Adjust the `config.json` and specify: 
+   * `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
+   * `data_path` - the path to the data in the Seisbench format
+   * `experiment_count` - the number of experiments to run for each model type
+   
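As an illustration of step 3, the sketch below checks that a configuration contains the three keys listed above. The values shown are hypothetical placeholders, the real `config.json` may contain further settings, and the helper `missing_keys` is not part of the repository.

```python
import json

# The three settings named in step 3 of the README.
REQUIRED_KEYS = ("dataset_name", "data_path", "experiment_count")


def missing_keys(config: dict) -> list:
    """Return the required keys that are absent from the configuration."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical values, for illustration only.
example_config = json.loads(
    """
    {
      "dataset_name": "bogdanka",
      "data_path": "datasets/bogdanka/seisbench_format",
      "experiment_count": 20
    }
    """
)

if __name__ == "__main__":
    print(missing_keys(example_config))  # prints []
```

To check the real file, load it with `json.load(open("config.json"))` and pass the result to `missing_keys`.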
+
+4. Run the pipeline script:
+`python pipeline.py`

   The script performs the following steps:
-   * Generates evaluation targets
+   * Generates evaluation targets in `datasets/<dataset_name>/targets` directory. 
     * Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, producing the lowest validation loss.
+     
     This step utilizes the Weights & Biases platform to perform the hyperparameters search (called sweeping) and track the training process and store the results.
-     The results are available at   
-     `https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
-   * Uses the best performing model of each type to generate predictions
-   * Evaluates the performance of each model by comparing the predictions with the evaluation targets
-   * Saves the results in the `scripts/pred` directory  
-   * 
-   The default settings are saved in config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script. 
-   For example, to change the sweep configuration file for GPD model, run:
-   `python pipeline.py --gpd_config <new config file>`
-   The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.
+          The results are available at   
+          `https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
+     Weights and training logs can be downloaded from the platform. 
+    Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
+     * Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
+     * Metrics and hyperparameters are saved in `<run_id>` folders
+       
+   * Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
+   * Evaluates the performance of each model by comparing the predictions with the evaluation targets. 
+   The results are saved in the `scripts/pred/results.csv` file.
+
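The checkpoint naming scheme described above makes it possible to find the best local checkpoint without opening the Weights & Biases interface. The sketch below is a hedged example: it assumes that sweep and run ids contain no hyphens and that `val_loss` is written as a plain decimal number; the file names in the usage example are made up.

```python
import re

# Matches the suffix "_sweep=<sweep_id>-run=<run_id>-epoch=<n>-val_loss=<x>.ckpt".
# Assumption: ids contain no '-' and val_loss is a plain decimal number.
CKPT_SUFFIX = re.compile(
    r"_sweep=(?P<sweep_id>[^-]+)"
    r"-run=(?P<run_id>[^-]+)"
    r"-epoch=(?P<epoch>\d+)"
    r"-val_loss=(?P<val_loss>\d+(?:\.\d+)?)\.ckpt$"
)


def parse_checkpoint_name(name: str) -> dict:
    """Extract sweep id, run id, epoch and validation loss from a file name."""
    match = CKPT_SUFFIX.search(name)
    if match is None:
        raise ValueError(f"not a recognised checkpoint name: {name!r}")
    return {
        # Everything before "_sweep=", i.e. "<dataset_name>_<model_name>".
        "prefix": name[: match.start()],
        "sweep_id": match.group("sweep_id"),
        "run_id": match.group("run_id"),
        "epoch": int(match.group("epoch")),
        "val_loss": float(match.group("val_loss")),
    }


def best_checkpoint(names: list) -> str:
    """Return the file name with the lowest validation loss."""
    return min(names, key=lambda n: parse_checkpoint_name(n)["val_loss"])


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = [
        "bogdanka_GPD_sweep=abc123-run=xyz789-epoch=12-val_loss=0.0345.ckpt",
        "bogdanka_GPD_sweep=abc123-run=qwe456-epoch=7-val_loss=0.0512.ckpt",
    ]
    print(best_checkpoint(names))
```

The prefix is returned unsplit because both the dataset name and the model name may themselves contain underscores.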
+  The default settings are saved in the `config.json` file. To change the settings, edit `config.json` or pass the new settings as arguments to the script.
+  For example, to change the sweep configuration file for the GPD model, run:
+  `python pipeline.py --gpd_config <new config file>`
+  The new config file should be placed in the `experiments` folder or in the location specified by the `configs_path` parameter in `config.json`.
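The precedence between `config.json` and command-line arguments described above can be pictured with the following sketch. It is an assumption about the general pattern, not the actual code of `pipeline.py`: only the `--gpd_config` flag comes from this README, while the function name `merge_settings`, the matching `gpd_config` key and the file names are hypothetical.

```python
import argparse


def merge_settings(config: dict, argv: list) -> dict:
    """Return config with any explicitly passed CLI values taking precedence."""
    parser = argparse.ArgumentParser()
    # Flag taken from the README; its default of None means "not passed".
    parser.add_argument("--gpd_config", default=None)
    args = parser.parse_args(argv)

    merged = dict(config)  # copy, so the defaults stay untouched
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Hypothetical defaults, for illustration only.
    defaults = {"dataset_name": "bogdanka", "gpd_config": "sweep_gpd.yaml"}
    print(merge_settings(defaults, ["--gpd_config", "my_sweep_gpd.yaml"]))
```

With an empty argument list the defaults are returned unchanged, which mirrors running `python pipeline.py` without extra arguments.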

 ### Troubleshooting