Added scripts converting mseeds from Bogdanka to SeisBench format, extended README, modified logging
parent 78ac51478c
commit aa39980573
README.md (78)
@@ -2,10 +2,9 @@

 This repo contains notebooks and scripts demonstrating how to:
-- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
-- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
-- Train various cnn models available in seisbench library and compare their performance of detecting P phase, check the [script](scripts/pipeline.py)
+- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
+- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
+- Train various cnn models available in seisbench library and compare their performance of detecting P and S waves, check the [script](scripts/pipeline.py)
 - [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
 - [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell

 WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
 WANDB_API_KEY="your key"
 WANDB_USER="your user"
-WANDB_PROJECT="training_seisbench_models_on_igf_data"
+WANDB_PROJECT="training_seisbench_models"
 BENCHMARK_DEFAULT_WORKER=2

-2. Transform data into seisbench format. (unofficial)
-* Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
-* Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
-
-3. Run the pipeline script:
-`python pipeline.py`
+2. Transform data into seisbench format.
+
+To utilize the functionality of the SeisBench library, the data needs to be transformed to the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.
+
+The script assumes that:
+* the data is stored in the following directory structure:
+`input_path/year/station_network_code/station_code/trace_channel.D` e.g.
+`input_path/2018/PL/ALBE/EHE.D/`
+* the file names follow the pattern:
+`station_network_code.station_code..trace_channel.D.year.day_of_year`
+e.g. `PL.ALBE..EHE.D.2018.282`
+* the events catalog is stored in QuakeML format
+
+Run the script `mseeds_to_seisbench.py` located in the `utils` directory:
+
+```
+cd utils
+python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
+```
+
+If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, partition and paths) and send the job to the queue using the sbatch command on a login node of e.g. Ares:
+
+```
+cd utils
+sbatch convert_data.sh
+```
+
+If your data has a different structure or format, use the notebooks to gain an understanding of the SeisBench format and what needs to be done to transform your data:
+* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
+* [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook
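As a side note on the layout described in the new README text above: the sketch below shows how a pick's network, station, channel and time map to the expected file path. It is an illustration only; the helper name is invented, and the repository's own version of this logic is `get_trace_path` in `utils/mseeds_to_seisbench.py`.

```python
from pathlib import Path
import datetime

def expected_trace_path(input_path, net, sta, channel, when):
    # input_path/<year>/<net>/<sta>/<channel>.D/<net>.<sta>..<channel>.D.<year>.<day_of_year>
    doy = when.timetuple().tm_yday  # 282 for 2018-10-09; pad to 3 digits if your archive does
    return (Path(input_path) / str(when.year) / net / sta / f"{channel}.D"
            / f"{net}.{sta}..{channel}.D.{when.year}.{doy}")

# reproduces the README example: input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
print(expected_trace_path("input_path", "PL", "ALBE", "EHE", datetime.date(2018, 10, 9)))
```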
+3. Adjust the `config.json` and specify:
+* `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
+* `data_path` - the path to the data in the SeisBench format
+* `experiment_count` - the number of experiments to run for each model type
+
+4. Run the pipeline script
+`python pipeline.py`

 The script performs the following steps:
-* Generates evaluation targets
+* Generates evaluation targets in `datasets/<dataset_name>/targets` directory.
 * Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, producing the lowest validation loss.

 This step utilizes the Weights & Biases platform to perform the hyperparameters search (called sweeping) and track the training process and store the results.
 The results are available at
-`https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
-* Uses the best performing model of each type to generate predictions
-* Evaluates the performance of each model by comparing the predictions with the evaluation targets
-* Saves the results in the `scripts/pred` directory
-*
-The default settings are saved in config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
-For example, to change the sweep configuration file for GPD model, run:
-`python pipeline.py --gpd_config <new config file>`
-The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.
+`https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
+Weights and training logs can be downloaded from the platform.
+Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
+* Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
+* Metrics and hyperparams are saved in `<run_id>` folders
+* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
+* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
+The results are saved in the `scripts/pred/results.csv` file.
+
+The default settings are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
+For example, to change the sweep configuration file for GPD model, run:
+`python pipeline.py --gpd_config <new config file>`
+The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.

 ### Troubleshooting
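Given the checkpoint naming scheme introduced in the README text above, a small illustrative sketch for picking the checkpoint with the lowest validation loss from a `weights/<dataset_name>_<model_name>/` folder. Nothing here is repository code; the file name pattern is taken from the README, the rest is an assumption.

```python
import re
from pathlib import Path

def best_checkpoint(weights_dir):
    # file names end with ...-val_loss=<val_loss>.ckpt, per the naming scheme above
    def val_loss(path):
        match = re.search(r"val_loss=([0-9.]+)\.ckpt$", path.name)
        return float(match.group(1)) if match else float("inf")
    checkpoints = sorted(Path(weights_dir).glob("*.ckpt"), key=val_loss)
    return checkpoints[0] if checkpoints else None

print(best_checkpoint("weights/bogdanka_GPD"))
```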
@@ -1,7 +1,7 @@
 {
-  "dataset_name": "igf",
-  "data_path": "datasets/igf/seisbench_format/",
-  "targets_path": "datasets/targets/igf",
+  "dataset_name": "bogdanka",
+  "data_path": "datasets/bogdanka/seisbench_format/",
+  "targets_path": "datasets/targets",
   "models_path": "weights",
   "configs_path": "experiments",
   "sampling_rate": 100,
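For orientation, a minimal sketch of how these settings could be read and resolved. This is a hypothetical snippet, not project code; the repository's own logic lives in `config_loader.py` (which, after this commit, appends the dataset name to the targets folder).

```python
import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())

dataset_name = config["dataset_name"]                      # "bogdanka"
data_path = Path(config["data_path"])                      # datasets/bogdanka/seisbench_format/
targets_path = Path(config["targets_path"]) / dataset_name # targets grouped per dataset
print(dataset_name, data_path, targets_path)
```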
poetry.lock (29, generated)

@@ -283,6 +283,14 @@ python-versions = "*"
 [package.dependencies]
 six = ">=1.4.0"

+[[package]]
+name = "et-xmlfile"
+version = "1.1.0"
+description = "An implementation of lxml.xmlfile for the standard library"
+category = "main"
+optional = false
+python-versions = ">=3.6"
+
 [[package]]
 name = "exceptiongroup"
 version = "1.1.2"
@@ -971,6 +979,17 @@ imaging = ["cartopy"]
 "io.shapefile" = ["pyshp"]
 tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]

+[[package]]
+name = "openpyxl"
+version = "3.1.2"
+description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
+category = "main"
+optional = false
+python-versions = ">=3.6"
+
+[package.dependencies]
+et-xmlfile = "*"
+
 [[package]]
 name = "overrides"
 version = "7.3.1"
@@ -1766,7 +1785,7 @@ test = ["websockets"]
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.10"
-content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64"
+content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"

 [metadata.files]
 anyio = [
@@ -2076,6 +2095,10 @@ docker-pycreds = [
     {file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
     {file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
 ]
+et-xmlfile = [
+    {file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
+    {file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
+]
 exceptiongroup = [
     {file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
     {file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},
@@ -2622,6 +2645,10 @@ obspy = [
     {file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
     {file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
 ]
+openpyxl = [
+    {file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
+    {file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
+]
 overrides = [
     {file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
     {file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},
@@ -16,6 +16,7 @@ wandb = "^0.15.4"
 torchmetrics = "^0.11.4"
 ipykernel = "^6.24.0"
 jupyterlab = "^4.0.2"
+openpyxl = "^3.1.2"

 [tool.poetry.dev-dependencies]
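The openpyxl dependency added above is what lets pandas read the Excel event catalog used in the notebooks. A short usage sketch; the file name comes from the notebook shown later in this commit, the path is an assumption.

```python
import pandas as pd

# pandas uses openpyxl as its engine for .xlsx files
catalog = pd.read_excel("datasets/igf/Catalog_20_21.xlsx", index_col=0, engine="openpyxl")
print(catalog.head(1))  # columns include Datetime, X, Y, Depth, Mw, Phases, mseed_name
```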
@@ -15,8 +15,8 @@ config = load_config(config_path)

 data_path = f"{project_path}/{config['data_path']}"
 models_path = f"{project_path}/{config['models_path']}"
-targets_path = f"{project_path}/{config['targets_path']}"
 dataset_name = config['dataset_name']
+targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
 configs_path = f"{project_path}/{config['configs_path']}"

 sweep_files = config['sweep_files']
@@ -29,11 +29,11 @@ data_aliases = {
     "instance": "InstanceCountsCombined",
     "iquique": "Iquique",
     "lendb": "LenDB",
-    "scedc": "SCEDC"
+    "scedc": "SCEDC",
 }


-def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False):
+def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
     weights = Path(weights)
     targets = Path(os.path.abspath(targets))
     print(targets)
@@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
     for task in ["1", "23"]:
         task_csv = targets / f"task{task}.csv"

-        print(task_csv)
-
         if not task_csv.is_file():
             continue

@@ -227,9 +225,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
     )
-    parser.add_argument(
-        "--test_run", action="store_true", required=False, default=False
-    )
     args = parser.parse_args()

     main(
@@ -239,8 +235,7 @@ if __name__ == "__main__":
         batchsize=args.batchsize,
         num_workers=args.num_workers,
         sampling_rate=args.sampling_rate,
-        sweep_id=args.sweep_id,
-        test_run=args.test_run
+        sweep_id=args.sweep_id
     )
     running_time = str(
         datetime.timedelta(seconds=time.perf_counter() - code_start_time)
@@ -3,6 +3,7 @@
 # This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
 # -----------------

+import os
 import os.path
 import argparse
 from pytorch_lightning.loggers import WandbLogger, CSVLogger
@@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count


 torch.multiprocessing.set_sharing_strategy('file_system')
+os.system("ulimit -n unlimited")

 load_dotenv()
 wandb_api_key = os.environ.get('WANDB_API_KEY')
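One note on the line added above: `os.system("ulimit -n unlimited")` raises the limit only in the short-lived shell it spawns, not in the running Python process. If raising the open-file limit in-process is the goal, a sketch using the standard `resource` module would look like the following (an alternative approach, not what the commit does):

```python
import resource

# raise the soft limit for open file descriptors up to the current hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```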
@@ -17,8 +17,8 @@ import eval
 import collect_results
 from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files

+logging.root.setLevel(logging.INFO)
 logger = logging.getLogger('pipeline')
-logger.setLevel(logging.INFO)


 def load_sweep_config(model_name, args):
@@ -76,16 +76,19 @@ def main():
     args = parser.parse_args()

     # generate labels
+    logger.info("Started generating labels for the dataset.")
     generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)

     # find the best hyperparams for the models
+    logger.info("Started training the models.")
     for model_name in ["GPD", "PhaseNet"]:
         sweep_id = find_the_best_params(model_name, args)
         generate_predictions(sweep_id, model_name)

     # collect results
+    logger.info("Collecting results.")
     collect_results.traverse_path("pred", "pred/results.csv")
+    logger.info("Results saved in pred/results.csv")


 if __name__ == "__main__":
     main()
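For context on the logging tweak above: child loggers such as `pipeline` or `converter` are left at NOTSET, so they inherit their effective level from the root logger; moving the INFO level to the root therefore applies to every module logger at once. A minimal self-contained illustration (not project code):

```python
import logging

logging.basicConfig(format="%(name)s %(levelname)s %(message)s")
logging.root.setLevel(logging.INFO)  # what pipeline.py now does

for name in ("pipeline", "converter"):
    # child loggers stay at NOTSET and inherit the root's INFO level
    logging.getLogger(name).info("visible at INFO")
```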
@@ -20,18 +20,13 @@ import torch
 import os
 import logging
 from pathlib import Path
+from dotenv import load_dotenv

 import models, data, util
 import time
 import datetime
 import wandb
-#
-# load_dotenv()
-# wandb_api_key = os.environ.get('WANDB_API_KEY')
-# if wandb_api_key is None:
-#     raise ValueError("WANDB_API_KEY environment variable is not set.")
-#
-# wandb.login(key=wandb_api_key)


 def train(config, experiment_name, test_run):
     """
@@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):


 if __name__ == "__main__":
+
+    load_dotenv()
+    wandb_api_key = os.environ.get('WANDB_API_KEY')
+    if wandb_api_key is None:
+        raise ValueError("WANDB_API_KEY environment variable is not set.")
+
+    wandb.login(key=wandb_api_key)
+
     code_start_time = time.perf_counter()

     torch.manual_seed(42)
@@ -16,7 +16,7 @@ load_dotenv()


 logging.basicConfig()
-logging.getLogger().setLevel(logging.DEBUG)
+logging.getLogger().setLevel(logging.INFO)


 def load_best_model_data(sweep_id, weights):
utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb (1134, new file)
File diff suppressed because one or more lines are too long
@@ -88,8 +88,8 @@
 "</div>"
 ],
 "text/plain": [
-" Datetime X Y Depth Mw \n",
-"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n",
+" Datetime X Y Depth Mw \\\n",
+"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
 "\n",
 " Phases mseed_name \n",
 "0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
@@ -101,7 +101,7 @@
 }
 ],
 "source": [
-"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n",
+"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
 "catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
 "catalog.head(1)"
 ]
@@ -317,7 +317,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"Traces converted: 35784it [00:52, 679.39it/s]\n"
+"Traces converted: 35784it [01:01, 578.58it/s]\n"
 ]
 }
 ],
@@ -339,8 +339,10 @@
 " continue\n",
 " if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
 " mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
-" else:\n",
+" elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
 " mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
+" else: \n",
+" continue\n",
 " \n",
 " \n",
 " stream = get_mseed(mseed_path)\n",
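The notebook cell changed above now falls back from the `mseeds_2020` folder to `mseeds_2021` and skips the event otherwise. The same lookup can be written as a small helper; this is a sketch, not code from the notebook:

```python
import os

def find_mseed(input_path, mseed_name, year_folders=("mseeds_2020", "mseeds_2021")):
    # return the first existing path under input_path/mseeds/<year folder>/, else None
    for folder in year_folders:
        candidate = os.path.join(input_path, "mseeds", folder, mseed_name)
        if os.path.exists(candidate):
            return candidate
    return None
```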
@@ -374,6 +376,8 @@
 " # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
 " \n",
 " writer.add_trace({**event_params, **trace_params}, data)\n",
+"\n",
+" # break\n",
 " \n",
 " "
 ]
@@ -393,7 +397,25 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"data = sbd.WaveformDataset(output_path, sampling_rate=100)"
+"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 12,
+"id": "33c77509-7aab-4833-a372-16030941395d",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Unnamed dataset - 35784 traces\n"
+]
+}
+],
+"source": [
+"print(data)"
 ]
 },
 {
@@ -406,17 +428,17 @@
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": 13,
 "id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
-"<matplotlib.lines.Line2D at 0x7f7ed04a8820>"
+"<matplotlib.lines.Line2D at 0x14d6c12d0>"
 ]
 },
-"execution_count": 12,
+"execution_count": 13,
 "metadata": {},
 "output_type": "execute_result"
 },
@@ -449,7 +471,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 13,
+"execution_count": 14,
 "id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
 "metadata": {},
 "outputs": [
@@ -472,7 +494,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 14,
+"execution_count": 15,
 "id": "de82db24-d983-4592-a0eb-f96beecb2f69",
 "metadata": {},
 "outputs": [
@@ -622,29 +644,29 @@
 "</div>"
 ],
 "text/plain": [
-" index source_origin_time source_latitude_deg source_longitude_deg \n",
-"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n",
+" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
+"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
 "1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
 "2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
 "3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
 "4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
 "\n",
-" source_depth_km source_magnitude split station_network_code station_code \n",
-"0 0.7 2.469231 train PL BRDW \\\n",
+" source_depth_km source_magnitude split station_network_code station_code \\\n",
+"0 0.7 2.469231 train PL BRDW \n",
 "1 0.7 2.469231 train PL BRDW \n",
 "2 0.7 2.469231 train PL GROD \n",
 "3 0.7 2.469231 train PL GROD \n",
 "4 0.7 2.469231 train PL GUZI \n",
 "\n",
-" trace_channel trace_sampling_rate_hz trace_start_time \n",
-"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n",
+" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
+"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
 "1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
 "2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
 "3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
 "4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
 "\n",
-" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n",
-"0 792.0 bucket0$0,:3,:2001 NaN \\\n",
+" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
+"0 792.0 bucket0$0,:3,:2001 NaN \n",
 "1 NaN bucket0$1,:3,:2001 921.0 \n",
 "2 872.0 bucket0$2,:3,:2001 NaN \n",
 "3 NaN bucket0$3,:3,:2001 1017.0 \n",
@@ -658,7 +680,7 @@
 "4 ZNE "
 ]
 },
-"execution_count": 14,
+"execution_count": 15,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -700,7 +722,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.7"
+"version": "3.10.6"
 }
 },
 "nbformat": 4,
utils/convert_data.sh (19, new file)

@@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb


## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train

input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"

python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
utils/mseeds_to_seisbench.py (250, new file)

@@ -0,0 +1,250 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
import argparse


logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)

logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        # append to the existing catalog file
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)


def split_events(events, input_path):
    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # count only picks whose mseed file actually exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    # 70% of picks go to train, the next 15% to dev, the rest to test
    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}

    mag = event.preferred_magnitude()
    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3
        if origin.depth_errors["uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
    """
    Convert mseed files to seisbench dataset format
    :param input_path: folder with mseed files
    :param catalog_path: path to events catalog in quakeml format
    :param output_path: folder to save seisbench dataset
    :return:
    """
    logger.info("Loading events catalog ...")
    events = read_events(catalog_path)
    events_stats = split_events(events, input_path)

    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    logger.debug("Catalog loaded, starting conversion ...")

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
    parser.add_argument('--input_path', type=str, help='Path to mseed files')
    parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
    parser.add_argument('--output_path', type=str, help='Path to output files')
    args = parser.parse_args()

    convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)
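After the conversion finishes, the result can be sanity-checked the same way the Bogdanka notebook does. The snippet below is a usage sketch; the printed line is what the notebook reports for this dataset, and the output path matches convert_data.sh.

```python
import seisbench.data as sbd

data = sbd.WaveformDataset("datasets/bogdanka/seisbench_format", sampling_rate=100)
print(data)  # e.g. "Unnamed dataset - 35784 traces"
print(data.metadata[["trace_Pg_arrival_sample", "trace_Sg_arrival_sample"]].head())
```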
utils/utils.py (230, new file)

@@ -0,0 +1,230 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging

logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)

logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        # append to the existing catalog file
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)


def split_events(events, input_path):
    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # count only picks whose mseed file actually exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}

    mag = event.preferred_magnitude()
    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3
        if origin.depth_errors["uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format():
    input_path = "/net/pr2/projects/plgrid/plggeposai"
    logger.info("Loading events catalog ...")
    events = read_events(input_path + "/BOIS_all.xml")
    events_stats = split_events(events, input_path)
    output_path = input_path + "/seisbench_format"
    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]
            # b = False

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":
    convert_mseed_to_seisbench_format()
    # create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])