Added scripts converting mseeds from Bogdanka to SeisBench format, extended README, modified logging

parent 78ac51478c
commit aa39980573

README.md (78 changed lines)
@@ -2,10 +2,9 @@

This repo contains notebooks and scripts demonstrating how to:

- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the SeisBench library and compare their performance in detecting P phase, check the [script](scripts/pipeline.py)
- Prepare data for training a seisbench model detecting P and S waves (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the SeisBench library and compare their performance in detecting P and S waves, check the [script](scripts/pipeline.py)

- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@@ -68,31 +67,68 @@ poetry shell

WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models_on_igf_data"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2

2. Transform data into seisbench format. (unofficial)
   * Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
   * Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
2. Transform data into seisbench format.

3. Run the pipeline script:

To utilize the functionality of the SeisBench library, the data needs to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.

The script assumes that:
 * the data is stored in the following directory structure:
   `input_path/year/station_network_code/station_code/trace_channel.D`, e.g.
   `input_path/2018/PL/ALBE/EHE.D/`
 * the file names follow the pattern:
   `station_network_code.station_code..trace_channel.D.year.day_of_year`,
   e.g. `PL.ALBE..EHE.D.2018.282`
 * the events catalog is stored in QuakeML format

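For reference, the trace path the converter looks for can be composed as in the sketch below (it mirrors the `get_trace_path` helper from `utils/mseeds_to_seisbench.py`; the station values are the ones from the example above):

```
import obspy
import pandas as pd

def expected_trace_path(input_path, net, station, channel, time):
    # Directory and file-name convention described above
    year = time.year
    day_of_year = pd.Timestamp(str(time)).day_of_year
    return f"{input_path}/{year}/{net}/{station}/{channel}.D/{net}.{station}..{channel}.D.{year}.{day_of_year}"

print(expected_trace_path("input_path", "PL", "ALBE", "EHE", obspy.UTCDateTime("2018-10-09")))
# -> input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
```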
`python pipeline.py`

Run the script `mseeds_to_seisbench.py` located in the `utils` directory:

```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```
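The script writes the converted dataset (`metadata.csv` and `waveforms.hdf5`) into the directory passed as `--output_path`. To sanity-check the conversion, the dataset can be loaded back with SeisBench, e.g. (a minimal sketch; the path matches `data_path` from `config.json`):

```
import seisbench.data as sbd

data = sbd.WaveformDataset("datasets/bogdanka/seisbench_format/", sampling_rate=100)
print(data)                  # reports the number of traces in the dataset
print(data.metadata.head())  # trace and event metadata as a pandas DataFrame
```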
If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, computing resources and paths) and send the job to the queue using the `sbatch` command on the login node of e.g. Ares:

```
cd utils
sbatch convert_data.sh
```

If your data has a different structure or format, use the notebooks to gain an understanding of the SeisBench format and what needs to be done to transform your data:
* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* the [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook

3. Adjust the `config.json` and specify:
   * `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
   * `data_path` - the path to the data in the SeisBench format
   * `experiment_count` - the number of experiments to run for each model type

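For reference, a minimal `config.json` could look as follows (the paths and `dataset_name` mirror the configuration used in this project, while the `experiment_count` value is only illustrative; the full file also contains further keys such as `sweep_files`):

```
{
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
    "experiment_count": 20
}
```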
4. Run the pipeline script:

`python pipeline.py`

The script performs the following steps:

* Generates evaluation targets
* Generates evaluation targets in the `datasets/<dataset_name>/targets` directory.
* Trains multiple versions of GPD, PhaseNet and ... models to find the best hyperparameters, producing the lowest validation loss.

  This step utilizes the Weights & Biases platform to perform the hyperparameter search (called sweeping), track the training process and store the results.
  The results are available at
  `https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
* Uses the best performing model of each type to generate predictions
* Evaluates the performance of each model by comparing the predictions with the evaluation targets
* Saves the results in the `scripts/pred` directory

The default settings are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for the GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.

The results are available at
`https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
Weights and training logs can be downloaded from the platform.
Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
* Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
* Metrics and hyperparameters are saved in `<run_id>` folders

* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
  The results are saved in the `scripts/pred/results.csv` file.

The default settings are saved in the config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for the GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.

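Besides the web UI, a finished sweep can also be inspected programmatically. A minimal sketch using the W&B public API (the entity, project and sweep id are placeholders; it assumes the training runs log a `val_loss` metric):

```
import wandb

api = wandb.Api()
sweep = api.sweep("<WANDB_USER>/<WANDB_PROJECT>/<sweep_id>")
best_run = sweep.best_run()  # run with the best value of the sweep's optimization metric
print(best_run.name, best_run.summary.get("val_loss"))
```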
### Troubleshooting

config.json

@@ -1,7 +1,7 @@
{
    "dataset_name": "igf",
    "data_path": "datasets/igf/seisbench_format/",
    "targets_path": "datasets/targets/igf",
    "dataset_name": "bogdanka",
    "data_path": "datasets/bogdanka/seisbench_format/",
    "targets_path": "datasets/targets",
    "models_path": "weights",
    "configs_path": "experiments",
    "sampling_rate": 100,
poetry.lock (generated, 29 changed lines)

@@ -283,6 +283,14 @@ python-versions = "*"
[package.dependencies]
six = ">=1.4.0"

[[package]]
name = "et-xmlfile"
version = "1.1.0"
description = "An implementation of lxml.xmlfile for the standard library"
category = "main"
optional = false
python-versions = ">=3.6"

[[package]]
name = "exceptiongroup"
version = "1.1.2"
@@ -971,6 +979,17 @@ imaging = ["cartopy"]
"io.shapefile" = ["pyshp"]
tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]

[[package]]
name = "openpyxl"
version = "3.1.2"
description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
category = "main"
optional = false
python-versions = ">=3.6"

[package.dependencies]
et-xmlfile = "*"

[[package]]
name = "overrides"
version = "7.3.1"
@@ -1766,7 +1785,7 @@ test = ["websockets"]
[metadata]
lock-version = "1.1"
python-versions = "^3.10"
content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64"
content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"

[metadata.files]
anyio = [
@@ -2076,6 +2095,10 @@ docker-pycreds = [
    {file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
    {file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
]
et-xmlfile = [
    {file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
    {file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
]
exceptiongroup = [
    {file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
    {file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},
@@ -2622,6 +2645,10 @@ obspy = [
    {file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
    {file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
]
openpyxl = [
    {file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
    {file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
]
overrides = [
    {file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
    {file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},
pyproject.toml

@@ -16,6 +16,7 @@ wandb = "^0.15.4"
torchmetrics = "^0.11.4"
ipykernel = "^6.24.0"
jupyterlab = "^4.0.2"
openpyxl = "^3.1.2"

[tool.poetry.dev-dependencies]
@@ -15,8 +15,8 @@ config = load_config(config_path)

data_path = f"{project_path}/{config['data_path']}"
models_path = f"{project_path}/{config['models_path']}"
targets_path = f"{project_path}/{config['targets_path']}"
dataset_name = config['dataset_name']
targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
configs_path = f"{project_path}/{config['configs_path']}"

sweep_files = config['sweep_files']
@@ -29,11 +29,11 @@ data_aliases = {
    "instance": "InstanceCountsCombined",
    "iquique": "Iquique",
    "lendb": "LenDB",
    "scedc": "SCEDC"
    "scedc": "SCEDC",
}


def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False):
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
    weights = Path(weights)
    targets = Path(os.path.abspath(targets))
    print(targets)
@@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
    for task in ["1", "23"]:
        task_csv = targets / f"task{task}.csv"

        print(task_csv)

        if not task_csv.is_file():
            continue

@@ -227,9 +225,7 @@ if __name__ == "__main__":
    parser.add_argument(
        "--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
    )
    parser.add_argument(
        "--test_run", action="store_true", required=False, default=False
    )

    args = parser.parse_args()

    main(
@@ -239,8 +235,7 @@ if __name__ == "__main__":
        batchsize=args.batchsize,
        num_workers=args.num_workers,
        sampling_rate=args.sampling_rate,
        sweep_id=args.sweep_id,
        test_run=args.test_run
        sweep_id=args.sweep_id
    )
    running_time = str(
        datetime.timedelta(seconds=time.perf_counter() - code_start_time)
@@ -3,6 +3,7 @@
# This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
# -----------------

import os
import os.path
import argparse
from pytorch_lightning.loggers import WandbLogger, CSVLogger
@@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count


torch.multiprocessing.set_sharing_strategy('file_system')
os.system("ulimit -n unlimited")

load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')
@@ -17,8 +17,8 @@ import eval
import collect_results
from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files

logging.root.setLevel(logging.INFO)
logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)


def load_sweep_config(model_name, args):
@@ -76,16 +76,19 @@ def main():
    args = parser.parse_args()

    # generate labels
    logger.info("Started generating labels for the dataset.")
    generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)

    # find the best hyperparams for the models
    logger.info("Started training the models.")
    for model_name in ["GPD", "PhaseNet"]:
        sweep_id = find_the_best_params(model_name, args)
        generate_predictions(sweep_id, model_name)

    # collect results
    logger.info("Collecting results.")
    collect_results.traverse_path("pred", "pred/results.csv")

    logger.info("Results saved in pred/results.csv")


if __name__ == "__main__":
    main()
@@ -20,18 +20,13 @@ import torch
import os
import logging
from pathlib import Path
from dotenv import load_dotenv

import models, data, util
import time
import datetime
import wandb
#
# load_dotenv()
# wandb_api_key = os.environ.get('WANDB_API_KEY')
# if wandb_api_key is None:
#     raise ValueError("WANDB_API_KEY environment variable is not set.")
#
# wandb.login(key=wandb_api_key)


def train(config, experiment_name, test_run):
    """
@@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):


if __name__ == "__main__":

    load_dotenv()
    wandb_api_key = os.environ.get('WANDB_API_KEY')
    if wandb_api_key is None:
        raise ValueError("WANDB_API_KEY environment variable is not set.")

    wandb.login(key=wandb_api_key)

    code_start_time = time.perf_counter()

    torch.manual_seed(42)
@@ -16,7 +16,7 @@ load_dotenv()


logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)


def load_best_model_data(sweep_id, weights):
utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb (new file, 1134 lines): file diff suppressed because one or more lines are too long
@@ -88,8 +88,8 @@
"</div>"
],
"text/plain": [
" Datetime X Y Depth Mw \n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n",
" Datetime X Y Depth Mw \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
"\n",
" Phases mseed_name \n",
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
@@ -101,7 +101,7 @@
}
],
"source": [
"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n",
"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
"catalog.head(1)"
]
@@ -317,7 +317,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Traces converted: 35784it [00:52, 679.39it/s]\n"
"Traces converted: 35784it [01:01, 578.58it/s]\n"
]
}
],
@@ -339,8 +339,10 @@
" continue\n",
" if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
" else:\n",
" elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
" else: \n",
" continue\n",
" \n",
" \n",
" stream = get_mseed(mseed_path)\n",
@@ -374,6 +376,8 @@
" # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
" \n",
" writer.add_trace({**event_params, **trace_params}, data)\n",
"\n",
" # break\n",
" \n",
" "
]
@@ -393,7 +397,25 @@
"metadata": {},
"outputs": [],
"source": [
"data = sbd.WaveformDataset(output_path, sampling_rate=100)"
"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "33c77509-7aab-4833-a372-16030941395d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed dataset - 35784 traces\n"
]
}
],
"source": [
"print(data)"
]
},
{
@@ -406,17 +428,17 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.lines.Line2D at 0x7f7ed04a8820>"
"<matplotlib.lines.Line2D at 0x14d6c12d0>"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
@@ -449,7 +471,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
"metadata": {},
"outputs": [
@@ -472,7 +494,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 15,
"id": "de82db24-d983-4592-a0eb-f96beecb2f69",
"metadata": {},
"outputs": [
@@ -622,29 +644,29 @@
"</div>"
],
"text/plain": [
" index source_origin_time source_latitude_deg source_longitude_deg \n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n",
" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"\n",
" source_depth_km source_magnitude split station_network_code station_code \n",
"0 0.7 2.469231 train PL BRDW \\\n",
" source_depth_km source_magnitude split station_network_code station_code \\\n",
"0 0.7 2.469231 train PL BRDW \n",
"1 0.7 2.469231 train PL BRDW \n",
"2 0.7 2.469231 train PL GROD \n",
"3 0.7 2.469231 train PL GROD \n",
"4 0.7 2.469231 train PL GUZI \n",
"\n",
" trace_channel trace_sampling_rate_hz trace_start_time \n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n",
" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
"\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n",
"0 792.0 bucket0$0,:3,:2001 NaN \\\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \n",
"1 NaN bucket0$1,:3,:2001 921.0 \n",
"2 872.0 bucket0$2,:3,:2001 NaN \n",
"3 NaN bucket0$3,:3,:2001 1017.0 \n",
@@ -658,7 +680,7 @@
"4 ZNE "
]
},
"execution_count": 14,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -700,7 +722,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.6"
}
},
"nbformat": 4,
utils/convert_data.sh (new file, 19 lines)

@@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb


## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train

input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"

python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
utils/mseeds_to_seisbench.py (new file, 250 lines)

@@ -0,0 +1,250 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
import argparse


logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)

logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                # trace_id = tr.id
                # start = tr.meta.starttime
                # end = tr.meta.endtime

                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        # pandas to_csv has no "append" argument; use mode="a" to append
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode="a", index=False)


def split_events(events, input_path):

    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # check if mseed exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}
    # print(origin)

    mag = event.preferred_magnitude()

    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
            "uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            # print(pick_time, " - selected trace: ", tr)
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
    """
    Convert mseed files to seisbench dataset format
    :param input_path: folder with mseed files
    :param catalog_path: path to events catalog in quakeml format
    :param output_path: folder to save seisbench dataset
    :return:
    """
    logger.info("Loading events catalog ...")
    events = read_events(catalog_path)
    events_stats = split_events(events, input_path)

    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    logger.debug("Catalog loaded, starting conversion ...")

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
    parser.add_argument('--input_path', type=str, help='Path to mseed files')
    parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
    parser.add_argument('--output_path', type=str, help='Path to output files')
    args = parser.parse_args()

    convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)
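The converter can also be called directly from Python instead of the command line (a sketch; the paths are placeholders and the snippet assumes it is run from the `utils` directory):

```
from mseeds_to_seisbench import convert_mseed_to_seisbench_format

convert_mseed_to_seisbench_format(
    input_path="/path/to/mseeds",            # yearly mseed directory tree, as described in the README
    catalog_path="/path/to/BOIS_all.xml",    # events catalog in QuakeML format
    output_path="/path/to/seisbench_format", # destination for metadata.csv and waveforms.hdf5
)
```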
utils/utils.py (new file, 230 lines)

@@ -0,0 +1,230 @@
import os
import pandas as pd
import glob
from pathlib import Path

import obspy
from obspy.core.event import read_events

import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging

logging.basicConfig(filename="output.out",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.DEBUG)

logger = logging.getLogger('converter')


def create_traces_catalog(directory, years):
    for year in years:
        directory = f"{directory}/{year}"
        files = glob.glob(directory)
        traces = []
        for i, f in enumerate(files):
            st = obspy.read(f)

            for tr in st.traces:
                # trace_id = tr.id
                # start = tr.meta.starttime
                # end = tr.meta.endtime

                trs = pd.Series({
                    'trace_id': tr.id,
                    'trace_st': tr.meta.starttime,
                    'trace_end': tr.meta.endtime,
                    'stream_fname': f
                })
                traces.append(trs)

        traces_catalog = pd.DataFrame(pd.concat(traces)).transpose()
        # pandas to_csv has no "append" argument; use mode="a" to append
        traces_catalog.to_csv("data/bogdanka/traces_catalog.csv", mode="a", index=False)


def split_events(events, input_path):

    logger.info("Splitting available events into train, dev and test sets ...")
    events_stats = pd.DataFrame()
    events_stats.index.name = "event"

    for i, event in enumerate(events):
        # check if mseed exists
        actual_picks = 0
        for pick in event.picks:
            trace_params = get_trace_params(pick)
            trace_path = get_trace_path(input_path, trace_params)
            if os.path.isfile(trace_path):
                actual_picks += 1

        events_stats.loc[i, "pick_count"] = actual_picks

    events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()

    train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
    dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]

    events_stats['split'] = 'test'
    for i, event in events_stats.iterrows():
        if event['pick_count_cumsum'] < train_th:
            events_stats.loc[i, 'split'] = 'train'
        elif event['pick_count_cumsum'] < dev_th:
            events_stats.loc[i, 'split'] = 'dev'
        else:
            break

    return events_stats


def get_event_params(event):
    origin = event.preferred_origin()
    if origin is None:
        return {}
    # print(origin)

    mag = event.preferred_magnitude()

    source_id = str(event.resource_id)

    event_params = {
        "source_id": source_id,
        "source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
        "source_latitude_deg": origin.latitude,
        "source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
        "source_longitude_deg": origin.longitude,
        "source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
        "source_depth_km": origin.depth / 1e3,
        "source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
            "uncertainty"] is not None else None,
    }

    if mag is not None:
        event_params["source_magnitude"] = mag.mag
        event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
        event_params["source_magnitude_type"] = mag.magnitude_type
        event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None

    return event_params


def get_trace_params(pick):
    net = pick.waveform_id.network_code
    sta = pick.waveform_id.station_code

    trace_params = {
        "station_network_code": net,
        "station_code": sta,
        "trace_channel": pick.waveform_id.channel_code,
        "station_location_code": pick.waveform_id.location_code,
        "time": pick.time
    }

    return trace_params


def find_trace(pick_time, traces):
    for tr in traces:
        if pick_time > tr.stats.endtime:
            continue
        if pick_time >= tr.stats.starttime:
            # print(pick_time, " - selected trace: ", tr)
            return tr

    logger.warning(f"no matching trace for pick: {pick_time}")
    return None


def get_trace_path(input_path, trace_params):
    year = trace_params["time"].year
    day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
    net = trace_params["station_network_code"]
    station = trace_params["station_code"]
    tr_channel = trace_params["trace_channel"]

    path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
    return path


def load_trace(input_path, trace_params):
    trace_path = get_trace_path(input_path, trace_params)
    trace = None

    if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        if len(stream.traces) > 1:
            trace = find_trace(trace_params["time"], stream.traces)
        elif len(stream.traces) == 0:
            logger.warning(f"no data in: {trace_path}")
        else:
            trace = stream.traces[0]

    return trace


def load_stream(input_path, trace_params, time_before=60, time_after=60):
    trace_path = get_trace_path(input_path, trace_params)
    sampling_rate, stream = None, None
    pick_time = trace_params["time"]

    if not os.path.isfile(trace_path):
        print(trace_path + " not found")
    else:
        stream = obspy.read(trace_path)
        stream = stream.slice(pick_time - time_before, pick_time + time_after)
        if len(stream.traces) == 0:
            print(f"no data in: {trace_path}")
        else:
            sampling_rate = stream.traces[0].stats.sampling_rate

    return sampling_rate, stream


def convert_mseed_to_seisbench_format():
    input_path = "/net/pr2/projects/plgrid/plggeposai"
    logger.info("Loading events catalog ...")
    events = read_events(input_path + "/BOIS_all.xml")
    events_stats = split_events(events, input_path)  # split_events requires input_path
    output_path = input_path + "/seisbench_format"
    metadata_path = output_path + "/metadata.csv"
    waveforms_path = output_path + "/waveforms.hdf5"

    with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
        }
        for i, event in enumerate(events):
            logger.debug(f"Converting {i} event")
            event_params = get_event_params(event)
            event_params["split"] = events_stats.loc[i, "split"]
            # b = False

            for pick in event.picks:
                trace_params = get_trace_params(pick)
                sampling_rate, stream = load_stream(input_path, trace_params)
                if stream is None:
                    continue

                actual_t_start, data, _ = sbu.stream_to_array(
                    stream,
                    component_order=writer.data_format["component_order"],
                )

                trace_params["trace_sampling_rate_hz"] = sampling_rate
                trace_params["trace_start_time"] = str(actual_t_start)

                pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
                pick_idx = (pick_time - actual_t_start) * sampling_rate

                trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)

                writer.add_trace({**event_params, **trace_params}, data)


if __name__ == "__main__":
    convert_mseed_to_seisbench_format()
    # create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])