Added scripts converting mseeds from Bogdanka to seisbench format, extended readme, modified logging

This commit is contained in:
Krystyna Milian 2023-09-26 10:50:46 +02:00
parent 78ac51478c
commit aa39980573
15 changed files with 1788 additions and 66 deletions

View File

@ -2,10 +2,9 @@
This repo contains notebooks and scripts demonstrating how to:
- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the seisbench library and compare their performance in detecting P and S waves, check the [script](scripts/pipeline.py)
- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use the model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@ -68,31 +67,68 @@ poetry shell
WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
2. Transform data into seisbench format.
To utilize the functionality of the Seisbench library, the data needs to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.
The script assumes that:
* the data is stored in the following directory structure:
  `input_path/year/station_network_code/station_code/trace_channel.D` e.g.
  `input_path/2018/PL/ALBE/EHE.D/`
* the file names follow the pattern:
  `station_network_code.station_code..trace_channel.D.year.day_of_year`
  e.g. `PL.ALBE..EHE.D.2018.282`
* the events catalog is stored in QuakeML format
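For orientation, the two conventions above combine into a single file path per pick. The snippet below mirrors `get_trace_path` from `utils/mseeds_to_seisbench.py`; the example path in the comment reuses the `PL.ALBE..EHE.D.2018.282` file from above.
```
import pandas as pd

def get_trace_path(input_path, trace_params):
    # input_path/year/net/station/channel.D/net.station..channel.D.year.day_of_year
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    channel = trace_params["trace_channel"]
    return f"{input_path}/{year}/{net}/{station}/{channel}.D/{net}.{station}..{channel}.D.{year}.{day_of_year}"

# e.g. a pick on station PL.ALBE, channel EHE, on 2018-10-09 (day 282) resolves to:
# input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
```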
Run the script `mseeds_to_seisbench.py` located in the `utils` directory:
```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```
If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, partition and paths) and submit the job to the queue using the `sbatch` command on a login node of e.g. Ares:
```
cd utils
sbatch convert_data.sh
```
If your data has a different structure or format, use the notebooks to gain an understanding of the Seisbench format and what needs to be done to transform your data:
* the [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* the [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) notebook
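In short, a SeisBench dataset is just a `metadata.csv` plus a `waveforms.hdf5` written through `seisbench.data.WaveformDataWriter`. Below is a minimal, self-contained sketch with a dummy stream and illustrative metadata values (the key names follow those used in `utils/mseeds_to_seisbench.py`); the real conversion logic lives in that script.
```
import numpy as np
import obspy
import seisbench.data as sbd
import seisbench.util as sbu

# Dummy 3-component stream, only to illustrate the writer API
traces = [
    obspy.Trace(
        np.random.randn(2001).astype("float32"),
        header={"network": "PL", "station": "ALBE", "channel": ch, "sampling_rate": 100.0},
    )
    for ch in ("EHZ", "EHN", "EHE")
]
stream = obspy.Stream(traces)

with sbd.WaveformDataWriter("metadata.csv", "waveforms.hdf5") as writer:
    writer.data_format = {"dimension_order": "CW", "component_order": "ZNE"}
    t_start, data, _ = sbu.stream_to_array(stream, component_order="ZNE")
    writer.add_trace(
        {
            "station_network_code": "PL",
            "station_code": "ALBE",
            "trace_sampling_rate_hz": 100.0,
            "trace_start_time": str(t_start),
            "trace_Pg_arrival_sample": 792,  # pick position in samples
            "split": "train",
        },
        data,
    )
```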
3. Adjust the `config.json` file (see the example fragment after this list) and specify:
* `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
* `data_path` - the path to the data in the Seisbench format
* `experiment_count` - the number of experiments to run for each model type
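For reference, the relevant part of `config.json` for the Bogdanka dataset looks roughly like the fragment below; the `experiment_count` value is only an example, and the file contains further keys not shown here.
```
{
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
    "experiment_count": 20,
    ...
}
```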
4. Run the pipeline script:
`python pipeline.py`
The script performs the following steps:
* Generates evaluation targets in the `datasets/<dataset_name>/targets` directory.
* Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, i.e. those producing the lowest validation loss.
This step utilizes the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process and store the results.
The results are available at
`https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
Weights and training logs can be downloaded from the platform.
Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
  * Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
  * Metrics and hyperparameters are saved in `<run_id>` folders
* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
The results are saved in the `scripts/pred/results.csv` file.
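After a full run, the summary in `scripts/pred/results.csv` can be inspected with pandas, for example as below; the exact column names depend on `collect_results.py`, so check the header first.
```
import pandas as pd

# run from the scripts/ directory
results = pd.read_csv("pred/results.csv")
print(results.columns.tolist())  # see which metrics were collected
print(results.head())
```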
The default settings are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for the GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.
### Troubleshooting

View File

@ -1,7 +1,7 @@
{
"dataset_name": "bogdanka",
"data_path": "datasets/bogdanka/seisbench_format/",
"targets_path": "datasets/targets",
"models_path": "weights",
"configs_path": "experiments",
"sampling_rate": 100,

poetry.lock generated
View File

@ -283,6 +283,14 @@ python-versions = "*"
[package.dependencies] [package.dependencies]
six = ">=1.4.0" six = ">=1.4.0"
[[package]]
name = "et-xmlfile"
version = "1.1.0"
description = "An implementation of lxml.xmlfile for the standard library"
category = "main"
optional = false
python-versions = ">=3.6"
[[package]] [[package]]
name = "exceptiongroup" name = "exceptiongroup"
version = "1.1.2" version = "1.1.2"
@ -971,6 +979,17 @@ imaging = ["cartopy"]
"io.shapefile" = ["pyshp"] "io.shapefile" = ["pyshp"]
tests = ["packaging", "pyproj", "pytest", "pytest-json-report"] tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]
[[package]]
name = "openpyxl"
version = "3.1.2"
description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
category = "main"
optional = false
python-versions = ">=3.6"
[package.dependencies]
et-xmlfile = "*"
[[package]] [[package]]
name = "overrides" name = "overrides"
version = "7.3.1" version = "7.3.1"
@ -1766,7 +1785,7 @@ test = ["websockets"]
[metadata] [metadata]
lock-version = "1.1" lock-version = "1.1"
python-versions = "^3.10" python-versions = "^3.10"
content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64" content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"
[metadata.files] [metadata.files]
anyio = [ anyio = [
@ -2076,6 +2095,10 @@ docker-pycreds = [
{file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"}, {file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
{file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"}, {file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
] ]
et-xmlfile = [
{file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
{file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
]
exceptiongroup = [ exceptiongroup = [
{file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"}, {file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
{file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"}, {file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},
@ -2622,6 +2645,10 @@ obspy = [
{file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"}, {file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
{file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"}, {file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
] ]
openpyxl = [
{file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
{file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
]
overrides = [ overrides = [
{file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"}, {file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
{file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"}, {file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},

View File

@ -16,6 +16,7 @@ wandb = "^0.15.4"
torchmetrics = "^0.11.4" torchmetrics = "^0.11.4"
ipykernel = "^6.24.0" ipykernel = "^6.24.0"
jupyterlab = "^4.0.2" jupyterlab = "^4.0.2"
openpyxl = "^3.1.2"
[tool.poetry.dev-dependencies] [tool.poetry.dev-dependencies]

View File

@ -15,8 +15,8 @@ config = load_config(config_path)
data_path = f"{project_path}/{config['data_path']}" data_path = f"{project_path}/{config['data_path']}"
models_path = f"{project_path}/{config['models_path']}" models_path = f"{project_path}/{config['models_path']}"
targets_path = f"{project_path}/{config['targets_path']}"
dataset_name = config['dataset_name'] dataset_name = config['dataset_name']
targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
configs_path = f"{project_path}/{config['configs_path']}" configs_path = f"{project_path}/{config['configs_path']}"
sweep_files = config['sweep_files'] sweep_files = config['sweep_files']

View File

@ -29,11 +29,11 @@ data_aliases = {
"instance": "InstanceCountsCombined", "instance": "InstanceCountsCombined",
"iquique": "Iquique", "iquique": "Iquique",
"lendb": "LenDB", "lendb": "LenDB",
"scedc": "SCEDC" "scedc": "SCEDC",
} }
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False): def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
weights = Path(weights) weights = Path(weights)
targets = Path(os.path.abspath(targets)) targets = Path(os.path.abspath(targets))
print(targets) print(targets)
@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
for task in ["1", "23"]: for task in ["1", "23"]:
task_csv = targets / f"task{task}.csv" task_csv = targets / f"task{task}.csv"
print(task_csv)
if not task_csv.is_file(): if not task_csv.is_file():
continue continue
@ -227,9 +225,7 @@ if __name__ == "__main__":
parser.add_argument( parser.add_argument(
"--sweep_id", type=str, help="wandb sweep_id", required=False, default=None "--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
) )
parser.add_argument(
"--test_run", action="store_true", required=False, default=False
)
args = parser.parse_args() args = parser.parse_args()
main( main(
@ -239,8 +235,7 @@ if __name__ == "__main__":
batchsize=args.batchsize, batchsize=args.batchsize,
num_workers=args.num_workers, num_workers=args.num_workers,
sampling_rate=args.sampling_rate, sampling_rate=args.sampling_rate,
sweep_id=args.sweep_id, sweep_id=args.sweep_id
test_run=args.test_run
) )
running_time = str( running_time = str(
datetime.timedelta(seconds=time.perf_counter() - code_start_time) datetime.timedelta(seconds=time.perf_counter() - code_start_time)

View File

@ -3,6 +3,7 @@
# This work was partially funded by EPOS Project funded in frame of PL-POIR4.2 # This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
# ----------------- # -----------------
import os
import os.path import os.path
import argparse import argparse
from pytorch_lightning.loggers import WandbLogger, CSVLogger from pytorch_lightning.loggers import WandbLogger, CSVLogger
@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count
torch.multiprocessing.set_sharing_strategy('file_system') torch.multiprocessing.set_sharing_strategy('file_system')
os.system("ulimit -n unlimited")
load_dotenv() load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY') wandb_api_key = os.environ.get('WANDB_API_KEY')

View File

@ -17,8 +17,8 @@ import eval
import collect_results import collect_results
from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files
logging.root.setLevel(logging.INFO)
logger = logging.getLogger('pipeline') logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)
def load_sweep_config(model_name, args): def load_sweep_config(model_name, args):
@ -76,16 +76,19 @@ def main():
args = parser.parse_args() args = parser.parse_args()
# generate labels # generate labels
logger.info("Started generating labels for the dataset.")
generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None) generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)
# find the best hyperparams for the models # find the best hyperparams for the models
logger.info("Started training the models.")
for model_name in ["GPD", "PhaseNet"]: for model_name in ["GPD", "PhaseNet"]:
sweep_id = find_the_best_params(model_name, args) sweep_id = find_the_best_params(model_name, args)
generate_predictions(sweep_id, model_name) generate_predictions(sweep_id, model_name)
# collect results # collect results
logger.info("Collecting results.")
collect_results.traverse_path("pred", "pred/results.csv") collect_results.traverse_path("pred", "pred/results.csv")
logger.info("Results saved in pred/results.csv")
if __name__ == "__main__": if __name__ == "__main__":
main() main()

View File

@ -20,18 +20,13 @@ import torch
import os import os
import logging import logging
from pathlib import Path from pathlib import Path
from dotenv import load_dotenv
import models, data, util import models, data, util
import time import time
import datetime import datetime
import wandb import wandb
#
# load_dotenv()
# wandb_api_key = os.environ.get('WANDB_API_KEY')
# if wandb_api_key is None:
# raise ValueError("WANDB_API_KEY environment variable is not set.")
#
# wandb.login(key=wandb_api_key)
def train(config, experiment_name, test_run): def train(config, experiment_name, test_run):
""" """
@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):
if __name__ == "__main__": if __name__ == "__main__":
load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')
if wandb_api_key is None:
raise ValueError("WANDB_API_KEY environment variable is not set.")
wandb.login(key=wandb_api_key)
code_start_time = time.perf_counter() code_start_time = time.perf_counter()
torch.manual_seed(42) torch.manual_seed(42)

View File

@ -16,7 +16,7 @@ load_dotenv()
logging.basicConfig() logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG) logging.getLogger().setLevel(logging.INFO)
def load_best_model_data(sweep_id, weights): def load_best_model_data(sweep_id, weights):

File diff suppressed because one or more lines are too long

View File

@ -88,8 +88,8 @@
"</div>" "</div>"
], ],
"text/plain": [ "text/plain": [
" Datetime X Y Depth Mw \n", " Datetime X Y Depth Mw \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n", "0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
"\n", "\n",
" Phases mseed_name \n", " Phases mseed_name \n",
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed " "0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
@ -101,7 +101,7 @@
} }
], ],
"source": [ "source": [
"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n", "input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n", "catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
"catalog.head(1)" "catalog.head(1)"
] ]
@ -317,7 +317,7 @@
"name": "stderr", "name": "stderr",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Traces converted: 35784it [00:52, 679.39it/s]\n" "Traces converted: 35784it [01:01, 578.58it/s]\n"
] ]
} }
], ],
@ -339,8 +339,10 @@
" continue\n", " continue\n",
" if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n", " if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n", " mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
" else:\n", " elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n", " mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
" else: \n",
" continue\n",
" \n", " \n",
" \n", " \n",
" stream = get_mseed(mseed_path)\n", " stream = get_mseed(mseed_path)\n",
@ -375,6 +377,8 @@
" \n", " \n",
" writer.add_trace({**event_params, **trace_params}, data)\n", " writer.add_trace({**event_params, **trace_params}, data)\n",
"\n", "\n",
" # break\n",
" \n",
" " " "
] ]
}, },
@ -393,7 +397,25 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"data = sbd.WaveformDataset(output_path, sampling_rate=100)" "data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "33c77509-7aab-4833-a372-16030941395d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed dataset - 35784 traces\n"
]
}
],
"source": [
"print(data)"
] ]
}, },
{ {
@ -406,17 +428,17 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 12, "execution_count": 13,
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253", "id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"data": { "data": {
"text/plain": [ "text/plain": [
"<matplotlib.lines.Line2D at 0x7f7ed04a8820>" "<matplotlib.lines.Line2D at 0x14d6c12d0>"
] ]
}, },
"execution_count": 12, "execution_count": 13,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
}, },
@ -449,7 +471,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 13, "execution_count": 14,
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3", "id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
@ -472,7 +494,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 14, "execution_count": 15,
"id": "de82db24-d983-4592-a0eb-f96beecb2f69", "id": "de82db24-d983-4592-a0eb-f96beecb2f69",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
@ -622,29 +644,29 @@
"</div>" "</div>"
], ],
"text/plain": [ "text/plain": [
" index source_origin_time source_latitude_deg source_longitude_deg \n", " index source_origin_time source_latitude_deg source_longitude_deg \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n", "0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n", "1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n", "2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n", "3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n", "4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"\n", "\n",
" source_depth_km source_magnitude split station_network_code station_code \n", " source_depth_km source_magnitude split station_network_code station_code \\\n",
"0 0.7 2.469231 train PL BRDW \\\n", "0 0.7 2.469231 train PL BRDW \n",
"1 0.7 2.469231 train PL BRDW \n", "1 0.7 2.469231 train PL BRDW \n",
"2 0.7 2.469231 train PL GROD \n", "2 0.7 2.469231 train PL GROD \n",
"3 0.7 2.469231 train PL GROD \n", "3 0.7 2.469231 train PL GROD \n",
"4 0.7 2.469231 train PL GUZI \n", "4 0.7 2.469231 train PL GUZI \n",
"\n", "\n",
" trace_channel trace_sampling_rate_hz trace_start_time \n", " trace_channel trace_sampling_rate_hz trace_start_time \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n", "0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n", "1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n", "2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n", "3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n", "4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
"\n", "\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n", " trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \\\n", "0 792.0 bucket0$0,:3,:2001 NaN \n",
"1 NaN bucket0$1,:3,:2001 921.0 \n", "1 NaN bucket0$1,:3,:2001 921.0 \n",
"2 872.0 bucket0$2,:3,:2001 NaN \n", "2 872.0 bucket0$2,:3,:2001 NaN \n",
"3 NaN bucket0$3,:3,:2001 1017.0 \n", "3 NaN bucket0$3,:3,:2001 1017.0 \n",
@ -658,7 +680,7 @@
"4 ZNE " "4 ZNE "
] ]
}, },
"execution_count": 14, "execution_count": 15,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -700,7 +722,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.9.7" "version": "3.10.6"
} }
}, },
"nbformat": 4, "nbformat": 4,

utils/convert_data.sh Normal file
View File

@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb
## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train
input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path

View File

@ -0,0 +1,250 @@
import os
import pandas as pd
import glob
from pathlib import Path
import obspy
from obspy.core.event import read_events
import seisbench
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
import argparse
logging.basicConfig(filename="output.out",
filemode='a',
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
datefmt='%H:%M:%S',
level=logging.DEBUG)
logger = logging.getLogger('converter')
def create_traces_catalog(directory, years):
for year in years:
directory = f"{directory}/{year}"
files = glob.glob(directory)
traces = []
for i, f in enumerate(files):
st = obspy.read(f)
for tr in st.traces:
# trace_id = tr.id
# start = tr.meta.starttime
# end = tr.meta.endtime
trs = pd.Series({
'trace_id': tr.id,
'trace_st': tr.meta.starttime,
'trace_end': tr.meta.endtime,
'stream_fname': f
})
traces.append(trs)
traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)  # to_csv has no 'append' argument; use mode='a'
def split_events(events, input_path):
logger.info("Splitting available events into train, dev and test sets ...")
events_stats = pd.DataFrame()
events_stats.index.name = "event"
for i, event in enumerate(events):
#check if mseed exists
actual_picks = 0
for pick in event.picks:
trace_params = get_trace_params(pick)
trace_path = get_trace_path(input_path, trace_params)
if os.path.isfile(trace_path):
actual_picks += 1
events_stats.loc[i, "pick_count"] = actual_picks
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
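# split by cumulative pick count: events covering the first ~70% of picks go to train, the next ~15% to dev, the rest to test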
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
events_stats['split'] = 'test'
for i, event in events_stats.iterrows():
if event['pick_count_cumsum'] < train_th:
events_stats.loc[i, 'split'] = 'train'
elif event['pick_count_cumsum'] < dev_th:
events_stats.loc[i, 'split'] = 'dev'
else:
break
return events_stats
def get_event_params(event):
origin = event.preferred_origin()
if origin is None:
return {}
# print(origin)
mag = event.preferred_magnitude()
source_id = str(event.resource_id)
event_params = {
"source_id": source_id,
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
"source_latitude_deg": origin.latitude,
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
"source_longitude_deg": origin.longitude,
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
"source_depth_km": origin.depth / 1e3,
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
"uncertainty"] is not None else None,
}
if mag is not None:
event_params["source_magnitude"] = mag.mag
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
event_params["source_magnitude_type"] = mag.magnitude_type
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
return event_params
def get_trace_params(pick):
net = pick.waveform_id.network_code
sta = pick.waveform_id.station_code
trace_params = {
"station_network_code": net,
"station_code": sta,
"trace_channel": pick.waveform_id.channel_code,
"station_location_code": pick.waveform_id.location_code,
"time": pick.time
}
return trace_params
def find_trace(pick_time, traces):
for tr in traces:
if pick_time > tr.stats.endtime:
continue
if pick_time >= tr.stats.starttime:
# print(pick_time, " - selected trace: ", tr)
return tr
logger.warning(f"no matching trace for peak: {pick_time}")
return None
def get_trace_path(input_path, trace_params):
year = trace_params["time"].year
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
net = trace_params["station_network_code"]
station = trace_params["station_code"]
tr_channel = trace_params["trace_channel"]
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
return path
def load_trace(input_path, trace_params):
trace_path = get_trace_path(input_path, trace_params)
trace = None
if not os.path.isfile(trace_path):
logger.warning(trace_path + " not found")
else:
stream = obspy.read(trace_path)
if len(stream.traces) > 1:
trace = find_trace(trace_params["time"], stream.traces)
elif len(stream.traces) == 0:
logger.warning(f"no data in: {trace_path}")
else:
trace = stream.traces[0]
return trace
def load_stream(input_path, trace_params, time_before=60, time_after=60):
trace_path = get_trace_path(input_path, trace_params)
sampling_rate, stream = None, None
pick_time = trace_params["time"]
if not os.path.isfile(trace_path):
print(trace_path + " not found")
else:
stream = obspy.read(trace_path)
stream = stream.slice(pick_time - time_before, pick_time + time_after)
if len(stream.traces) == 0:
print(f"no data in: {trace_path}")
else:
sampling_rate = stream.traces[0].stats.sampling_rate
return sampling_rate, stream
def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
"""
Convert mseed files to seisbench dataset format
:param input_path: folder with mseed files
:param catalog_path: path to events catalog in quakeml format
:param output_path: folder to save seisbench dataset
:return:
"""
logger.info("Loading events catalog ...")
events = read_events(catalog_path)
events_stats = split_events(events, input_path)
metadata_path = output_path + "/metadata.csv"
waveforms_path = output_path + "/waveforms.hdf5"
logger.debug("Catalog loaded, starting conversion ...")
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
writer.data_format = {
"dimension_order": "CW",
"component_order": "ZNE",
}
for i, event in enumerate(events):
logger.debug(f"Converting {i} event")
event_params = get_event_params(event)
event_params["split"] = events_stats.loc[i, "split"]
for pick in event.picks:
trace_params = get_trace_params(pick)
sampling_rate, stream = load_stream(input_path, trace_params)
if stream is None:
continue
actual_t_start, data, _ = sbu.stream_to_array(
stream,
component_order=writer.data_format["component_order"],
)
trace_params["trace_sampling_rate_hz"] = sampling_rate
trace_params["trace_start_time"] = str(actual_t_start)
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
pick_idx = (pick_time - actual_t_start) * sampling_rate
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
writer.add_trace({**event_params, **trace_params}, data)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
parser.add_argument('--input_path', type=str, help='Path to mseed files')
parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
parser.add_argument('--output_path', type=str, help='Path to output files')
args = parser.parse_args()
convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)

utils/utils.py Normal file
View File

@ -0,0 +1,230 @@
import os
import pandas as pd
import glob
from pathlib import Path
import obspy
from obspy.core.event import read_events
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
logging.basicConfig(filename="output.out",
filemode='a',
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
datefmt='%H:%M:%S',
level=logging.DEBUG)
logger = logging.getLogger('converter')
def create_traces_catalog(directory, years):
for year in years:
directory = f"{directory}/{year}"
files = glob.glob(directory)
traces = []
for i, f in enumerate(files):
st = obspy.read(f)
for tr in st.traces:
# trace_id = tr.id
# start = tr.meta.starttime
# end = tr.meta.endtime
trs = pd.Series({
'trace_id': tr.id,
'trace_st': tr.meta.starttime,
'trace_end': tr.meta.endtime,
'stream_fname': f
})
traces.append(trs)
traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode='a', index=False)  # to_csv has no 'append' argument; use mode='a'
def split_events(events, input_path):
logger.info("Splitting available events into train, dev and test sets ...")
events_stats = pd.DataFrame()
events_stats.index.name = "event"
for i, event in enumerate(events):
#check if mseed exists
actual_picks = 0
for pick in event.picks:
trace_params = get_trace_params(pick)
trace_path = get_trace_path(input_path, trace_params)
if os.path.isfile(trace_path):
actual_picks += 1
events_stats.loc[i, "pick_count"] = actual_picks
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
events_stats['split'] = 'test'
for i, event in events_stats.iterrows():
if event['pick_count_cumsum'] < train_th:
events_stats.loc[i, 'split'] = 'train'
elif event['pick_count_cumsum'] < dev_th:
events_stats.loc[i, 'split'] = 'dev'
else:
break
return events_stats
def get_event_params(event):
origin = event.preferred_origin()
if origin is None:
return {}
# print(origin)
mag = event.preferred_magnitude()
source_id = str(event.resource_id)
event_params = {
"source_id": source_id,
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
"source_latitude_deg": origin.latitude,
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
"source_longitude_deg": origin.longitude,
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
"source_depth_km": origin.depth / 1e3,
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
"uncertainty"] is not None else None,
}
if mag is not None:
event_params["source_magnitude"] = mag.mag
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
event_params["source_magnitude_type"] = mag.magnitude_type
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
return event_params
def get_trace_params(pick):
net = pick.waveform_id.network_code
sta = pick.waveform_id.station_code
trace_params = {
"station_network_code": net,
"station_code": sta,
"trace_channel": pick.waveform_id.channel_code,
"station_location_code": pick.waveform_id.location_code,
"time": pick.time
}
return trace_params
def find_trace(pick_time, traces):
for tr in traces:
if pick_time > tr.stats.endtime:
continue
if pick_time >= tr.stats.starttime:
# print(pick_time, " - selected trace: ", tr)
return tr
logger.warning(f"no matching trace for peak: {pick_time}")
return None
def get_trace_path(input_path, trace_params):
year = trace_params["time"].year
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
net = trace_params["station_network_code"]
station = trace_params["station_code"]
tr_channel = trace_params["trace_channel"]
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
return path
def load_trace(input_path, trace_params):
trace_path = get_trace_path(input_path, trace_params)
trace = None
if not os.path.isfile(trace_path):
logger.warning(trace_path + " not found")
else:
stream = obspy.read(trace_path)
if len(stream.traces) > 1:
trace = find_trace(trace_params["time"], stream.traces)
elif len(stream.traces) == 0:
logger.warning(f"no data in: {trace_path}")
else:
trace = stream.traces[0]
return trace
def load_stream(input_path, trace_params, time_before=60, time_after=60):
trace_path = get_trace_path(input_path, trace_params)
sampling_rate, stream = None, None
pick_time = trace_params["time"]
if not os.path.isfile(trace_path):
print(trace_path + " not found")
else:
stream = obspy.read(trace_path)
stream = stream.slice(pick_time - time_before, pick_time + time_after)
if len(stream.traces) == 0:
print(f"no data in: {trace_path}")
else:
sampling_rate = stream.traces[0].stats.sampling_rate
return sampling_rate, stream
def convert_mseed_to_seisbench_format():
input_path = "/net/pr2/projects/plgrid/plggeposai"
logger.info("Loading events catalog ...")
events = read_events(input_path + "/BOIS_all.xml")
events_stats = split_events(events, input_path)
output_path = input_path + "/seisbench_format"
metadata_path = output_path + "/metadata.csv"
waveforms_path = output_path + "/waveforms.hdf5"
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
writer.data_format = {
"dimension_order": "CW",
"component_order": "ZNE",
}
for i, event in enumerate(events):
logger.debug(f"Converting {i} event")
event_params = get_event_params(event)
event_params["split"] = events_stats.loc[i, "split"]
# b = False
for pick in event.picks:
trace_params = get_trace_params(pick)
sampling_rate, stream = load_stream(input_path, trace_params)
if stream is None:
continue
actual_t_start, data, _ = sbu.stream_to_array(
stream,
component_order=writer.data_format["component_order"],
)
trace_params["trace_sampling_rate_hz"] = sampling_rate
trace_params["trace_start_time"] = str(actual_t_start)
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
pick_idx = (pick_time - actual_t_start) * sampling_rate
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
writer.add_trace({**event_params, **trace_params}, data)
if __name__ == "__main__":
convert_mseed_to_seisbench_format()
# create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])