Added scripts converting mseeds from Bogdanka to seisbench format, extended readme, modified logging

This commit is contained in:
Krystyna Milian 2023-09-26 10:50:46 +02:00
parent 78ac51478c
commit aa39980573
15 changed files with 1788 additions and 66 deletions

View File

@ -2,10 +2,9 @@
This repo contains notebooks and scripts demonstrating how to:
- Prepare IGF data for training a seisbench model detecting P phase (i.e. transform mseeds into [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20to%20SeisBench%20dataset.ipynb).
- Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various cnn models available in seisbench library and compare their performance of detecting P phase, check the [script](scripts/pipeline.py)
- Prepare data for training a SeisBench model detecting P and S waves (i.e. transform mseeds into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html)), check the [notebook](utils/Transforming%20mseeds%20from%20Bogdanka%20to%20Seisbench%20format.ipynb) and the [script](utils/mseeds_to_seisbench.py)
- [to update] Explore available data, check the [notebook](notebooks/Explore%20igf%20data.ipynb)
- Train various CNN models available in the SeisBench library and compare their performance in detecting P and S waves, check the [script](scripts/pipeline.py)
- [to update] Validate model performance, check the [notebook](notebooks/Check%20model%20performance%20depending%20on%20station-random%20window.ipynb)
- [to update] Use model for detecting P phase, check the [notebook](notebooks/Present%20model%20predictions.ipynb)
@ -68,31 +67,68 @@ poetry shell
WANDB_HOST="https://epos-ai.grid.cyfronet.pl/"
WANDB_API_KEY="your key"
WANDB_USER="your user"
WANDB_PROJECT="training_seisbench_models_on_igf_data"
WANDB_PROJECT="training_seisbench_models"
BENCHMARK_DEFAULT_WORKER=2
2. Transform data into seisbench format. (unofficial)
* Download original data from the [drive](https://drive.google.com/drive/folders/1InVI9DLaD7gdzraM2jMzeIrtiBSu-UIK?usp=drive_link)
* Run the notebook: `utils/Transforming mseeds to SeisBench dataset.ipynb`
2. Transform data into seisbench format.
To utilize the functionality of the SeisBench library, the data needs to be transformed into the [SeisBench data format](https://seisbench.readthedocs.io/en/stable/pages/data_format.html). If your data is in the MSEED format, you can use the prepared script `mseeds_to_seisbench.py` to perform the transformation. Please make sure that your data has the same structure as the data used in this project.
The script assumes that:
* the data is stored in the following directory structure:
`input_path/year/station_network_code/station_code/trace_channel.D` e.g.
`input_path/2018/PL/ALBE/EHE.D/`
* the file names follow the pattern:
`station_network_code.station_code..trace_channel.D.year.day_of_year`
e.g. `PL.ALBE..EHE.D.2018.282`
* events catalog is stored in quakeML format
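For reference, here is a minimal sketch (mirroring `get_trace_path` in `utils/mseeds_to_seisbench.py`) of the file path the converter expects for a single trace; the input path below is only a placeholder:
```
# Sketch only: rebuilds the path convention assumed by utils/mseeds_to_seisbench.py
def expected_trace_path(input_path, year, day_of_year, net, station, channel):
    # e.g. input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282
    return (f"{input_path}/{year}/{net}/{station}/{channel}.D/"
            f"{net}.{station}..{channel}.D.{year}.{day_of_year}")

# placeholder input path; station, channel and day taken from the example above
print(expected_trace_path("datasets/bogdanka/mseeds", 2018, 282, "PL", "ALBE", "EHE"))
```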
Run the script `mseeds_to_seisbench.py` located in the `utils` directory:
3. Run the pipeline script:
```
cd utils
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path
```
If you want to run the script on a cluster, you can use the script `convert_data.sh` as a template (adjust the grant name, the partition and the paths) and submit the job to the queue using the `sbatch` command on a login node of e.g. Ares:
```
cd utils
sbatch convert_data.sh
```
If your data has a different structure or format, use the notebooks to gain an understanding of the SeisBench format and what needs to be done to transform your data:
* [Seisbench example](https://colab.research.google.com/github/seisbench/seisbench/blob/main/examples/01a_dataset_basics.ipynb) or
* [Transforming mseeds from Bogdanka to Seisbench format](utils/Transforming mseeds from Bogdanka to Seisbench format.ipynb) notebook
`python pipeline.py`
3. Adjust the `config.json` and specify:
* `dataset_name` - the name of the dataset, which will be used to name the folder with evaluation targets and predictions
* `data_path` - the path to the data in the Seisbench format
* `experiment_count` - the number of experiments to run for each model type
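A minimal sketch of checking these settings before launching the pipeline, assuming the flat JSON layout used by `config.json` in this repo:
```
import json

with open("config.json") as f:
    config = json.load(f)

# keys read by scripts/config_loader.py
print(config["dataset_name"])      # e.g. "bogdanka"
print(config["data_path"])         # e.g. "datasets/bogdanka/seisbench_format/"
print(config["experiment_count"])  # number of training runs per model type
```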
4. Run the pipeline script
`python pipeline.py`
The script performs the following steps:
* Generates evaluation targets
* Generates evaluation targets in the `datasets/targets/<dataset_name>` directory.
* Trains multiple versions of the GPD, PhaseNet and ... models to find the hyperparameters producing the lowest validation loss.
This step uses the Weights & Biases platform to perform the hyperparameter search (called sweeping), to track the training process and to store the results.
The results are available at
`https://epos-ai.grid.cyfronet.pl/<your user name>/<your project name>`
* Uses the best performing model of each type to generate predictions
* Evaluates the performance of each model by comparing the predictions with the evaluation targets
* Saves the results in the `scripts/pred` directory
The default settings are saved in config.json file. To change the settings, edit the config.json file or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` or as specified in the `configs_path` parameter in the config.json file.
The results are available at
`https://epos-ai.grid.cyfronet.pl/<WANDB_USER>/<WANDB_PROJECT>`
Weights and training logs can be downloaded from the platform.
Additionally, the most important data are saved locally in the `weights/<dataset_name>_<model_name>/` directory:
* Weights of the best checkpoint of each model are saved as `<dataset_name>_<model_name>_sweep=<sweep_id>-run=<run_id>-epoch=<epoch_number>-val_loss=<val_loss>.ckpt`
* Metrics and hyperparameters are saved in `<run_id>` folders
* Uses the best performing model of each type to generate predictions. The predictions are saved in the `scripts/pred/<dataset_name>_<model_name>/<run_id>` directory.
* Evaluates the performance of each model by comparing the predictions with the evaluation targets.
The results are saved in the `scripts/pred/results.csv` file.
The default settings are saved in the `config.json` file. To change them, edit `config.json` or pass the new settings as arguments to the script.
For example, to change the sweep configuration file for GPD model, run:
`python pipeline.py --gpd_config <new config file>`
The new config file should be placed in the `experiments` folder or as specified in the `configs_path` parameter in the config.json file.
### Troubleshooting

View File

@ -1,7 +1,7 @@
{
"dataset_name": "igf",
"data_path": "datasets/igf/seisbench_format/",
"targets_path": "datasets/targets/igf",
"dataset_name": "bogdanka",
"data_path": "datasets/bogdanka/seisbench_format/",
"targets_path": "datasets/targets",
"models_path": "weights",
"configs_path": "experiments",
"sampling_rate": 100,

29
poetry.lock generated
View File

@ -283,6 +283,14 @@ python-versions = "*"
[package.dependencies]
six = ">=1.4.0"
[[package]]
name = "et-xmlfile"
version = "1.1.0"
description = "An implementation of lxml.xmlfile for the standard library"
category = "main"
optional = false
python-versions = ">=3.6"
[[package]]
name = "exceptiongroup"
version = "1.1.2"
@ -971,6 +979,17 @@ imaging = ["cartopy"]
"io.shapefile" = ["pyshp"]
tests = ["packaging", "pyproj", "pytest", "pytest-json-report"]
[[package]]
name = "openpyxl"
version = "3.1.2"
description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
category = "main"
optional = false
python-versions = ">=3.6"
[package.dependencies]
et-xmlfile = "*"
[[package]]
name = "overrides"
version = "7.3.1"
@ -1766,7 +1785,7 @@ test = ["websockets"]
[metadata]
lock-version = "1.1"
python-versions = "^3.10"
content-hash = "2f8790f8c3e1a78ff23f0a0f0e954c97d2b0033fc6a890d4ef1355c6922dcc64"
content-hash = "86f528987bd303e300f586a26f506318d7bdaba445886a6a5a36f86f9e89b229"
[metadata.files]
anyio = [
@ -2076,6 +2095,10 @@ docker-pycreds = [
{file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
{file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
]
et-xmlfile = [
{file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
{file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
]
exceptiongroup = [
{file = "exceptiongroup-1.1.2-py3-none-any.whl", hash = "sha256:e346e69d186172ca7cf029c8c1d16235aa0e04035e5750b4b95039e65204328f"},
{file = "exceptiongroup-1.1.2.tar.gz", hash = "sha256:12c3e887d6485d16943a309616de20ae5582633e0a2eda17f4e10fd61c1e8af5"},
@ -2622,6 +2645,10 @@ obspy = [
{file = "obspy-1.4.0-cp39-cp39-win_amd64.whl", hash = "sha256:2090a95b08b214575892c3d99bb3362b13a3b0f4689d4ee55f95ea4d8a2cbc26"},
{file = "obspy-1.4.0.tar.gz", hash = "sha256:336a6e1d9a485732b08173cb5dc1dd720a8e53f3b54c180a62bb8ceaa5fe5c06"},
]
openpyxl = [
{file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
{file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
]
overrides = [
{file = "overrides-7.3.1-py3-none-any.whl", hash = "sha256:6187d8710a935d09b0bcef8238301d6ee2569d2ac1ae0ec39a8c7924e27f58ca"},
{file = "overrides-7.3.1.tar.gz", hash = "sha256:8b97c6c1e1681b78cbc9424b138d880f0803c2254c5ebaabdde57bb6c62093f2"},

View File

@ -16,6 +16,7 @@ wandb = "^0.15.4"
torchmetrics = "^0.11.4"
ipykernel = "^6.24.0"
jupyterlab = "^4.0.2"
openpyxl = "^3.1.2"
[tool.poetry.dev-dependencies]

View File

@ -15,8 +15,8 @@ config = load_config(config_path)
data_path = f"{project_path}/{config['data_path']}"
models_path = f"{project_path}/{config['models_path']}"
targets_path = f"{project_path}/{config['targets_path']}"
dataset_name = config['dataset_name']
targets_path = f"{project_path}/{config['targets_path']}/{dataset_name}"
configs_path = f"{project_path}/{config['configs_path']}"
sweep_files = config['sweep_files']

View File

@ -29,11 +29,11 @@ data_aliases = {
"instance": "InstanceCountsCombined",
"iquique": "Iquique",
"lendb": "LenDB",
"scedc": "SCEDC"
"scedc": "SCEDC",
}
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None, test_run=False):
def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, sweep_id=None):
weights = Path(weights)
targets = Path(os.path.abspath(targets))
print(targets)
@ -100,8 +100,6 @@ def main(weights, targets, sets, batchsize, num_workers, sampling_rate=None, swe
for task in ["1", "23"]:
task_csv = targets / f"task{task}.csv"
print(task_csv)
if not task_csv.is_file():
continue
@ -227,9 +225,7 @@ if __name__ == "__main__":
parser.add_argument(
"--sweep_id", type=str, help="wandb sweep_id", required=False, default=None
)
parser.add_argument(
"--test_run", action="store_true", required=False, default=False
)
args = parser.parse_args()
main(
@ -239,8 +235,7 @@ if __name__ == "__main__":
batchsize=args.batchsize,
num_workers=args.num_workers,
sampling_rate=args.sampling_rate,
sweep_id=args.sweep_id,
test_run=args.test_run
sweep_id=args.sweep_id
)
running_time = str(
datetime.timedelta(seconds=time.perf_counter() - code_start_time)

View File

@ -3,6 +3,7 @@
# This work was partially funded by EPOS Project funded in frame of PL-POIR4.2
# -----------------
import os
import os.path
import argparse
from pytorch_lightning.loggers import WandbLogger, CSVLogger
@ -22,6 +23,7 @@ from config_loader import models_path, dataset_name, seed, experiment_count
torch.multiprocessing.set_sharing_strategy('file_system')
os.system("ulimit -n unlimited")
load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')

View File

@ -17,8 +17,8 @@ import eval
import collect_results
from config_loader import data_path, targets_path, sampling_rate, dataset_name, sweep_files
logging.root.setLevel(logging.INFO)
logger = logging.getLogger('pipeline')
logger.setLevel(logging.INFO)
def load_sweep_config(model_name, args):
@ -76,16 +76,19 @@ def main():
args = parser.parse_args()
# generate labels
logger.info("Started generating labels for the dataset.")
generate_eval_targets.main(data_path, targets_path, "2,3", sampling_rate, None)
# find the best hyperparams for the models
logger.info("Started training the models.")
for model_name in ["GPD", "PhaseNet"]:
sweep_id = find_the_best_params(model_name, args)
generate_predictions(sweep_id, model_name)
# collect results
logger.info("Collecting results.")
collect_results.traverse_path("pred", "pred/results.csv")
logger.info("Results saved in pred/results.csv")
if __name__ == "__main__":
main()

View File

@ -20,18 +20,13 @@ import torch
import os
import logging
from pathlib import Path
from dotenv import load_dotenv
import models, data, util
import time
import datetime
import wandb
#
# load_dotenv()
# wandb_api_key = os.environ.get('WANDB_API_KEY')
# if wandb_api_key is None:
# raise ValueError("WANDB_API_KEY environment variable is not set.")
#
# wandb.login(key=wandb_api_key)
def train(config, experiment_name, test_run):
"""
@ -210,6 +205,14 @@ def generate_phase_mask(dataset, phases):
if __name__ == "__main__":
load_dotenv()
wandb_api_key = os.environ.get('WANDB_API_KEY')
if wandb_api_key is None:
raise ValueError("WANDB_API_KEY environment variable is not set.")
wandb.login(key=wandb_api_key)
code_start_time = time.perf_counter()
torch.manual_seed(42)

View File

@ -16,7 +16,7 @@ load_dotenv()
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)
def load_best_model_data(sweep_id, weights):

File diff suppressed because one or more lines are too long

View File

@ -88,8 +88,8 @@
"</div>"
],
"text/plain": [
" Datetime X Y Depth Mw \n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \\\n",
" Datetime X Y Depth Mw \\\n",
"0 2020-01-01 10:09:42.200 5.582503e+06 5.702646e+06 0.7 2.469231 \n",
"\n",
" Phases mseed_name \n",
"0 Pg BRDW 2020-01-01 10:09:44.400, Sg BRDW 2020-... 20200101100941.mseed "
@ -101,7 +101,7 @@
}
],
"source": [
"input_path = str(Path.cwd().parent) + \"/data/igf/\"\n",
"input_path = str(Path.cwd().parent) + \"/datasets/igf/\"\n",
"catalog = pd.read_excel(input_path + \"Catalog_20_21.xlsx\", index_col=0)\n",
"catalog.head(1)"
]
@ -317,7 +317,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Traces converted: 35784it [00:52, 679.39it/s]\n"
"Traces converted: 35784it [01:01, 578.58it/s]\n"
]
}
],
@ -339,8 +339,10 @@
" continue\n",
" if os.path.exists(input_path + \"mseeds/mseeds_2020/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2020/\" + event.mseed_name \n",
" else:\n",
" elif os.path.exists(input_path + \"mseeds/mseeds_2021/\" + event.mseed_name):\n",
" mseed_path = input_path + \"mseeds/mseeds_2021/\" + event.mseed_name \n",
" else: \n",
" continue\n",
" \n",
" \n",
" stream = get_mseed(mseed_path)\n",
@ -374,6 +376,8 @@
" # trace_params[f\"trace_{pick.phase_hint}_status\"] = pick.evaluation_mode\n",
" \n",
" writer.add_trace({**event_params, **trace_params}, data)\n",
"\n",
" # break\n",
" \n",
" "
]
@ -393,7 +397,25 @@
"metadata": {},
"outputs": [],
"source": [
"data = sbd.WaveformDataset(output_path, sampling_rate=100)"
"data = sbd.WaveformDataset(output_path, sampling_rate=100)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "33c77509-7aab-4833-a372-16030941395d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed dataset - 35784 traces\n"
]
}
],
"source": [
"print(data)"
]
},
{
@ -406,17 +428,17 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "1753f65e-fe5d-4cfa-ab42-ae161ac4a253",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.lines.Line2D at 0x7f7ed04a8820>"
"<matplotlib.lines.Line2D at 0x14d6c12d0>"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
@ -449,7 +471,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"id": "bf7dae75-c90b-44f8-a51d-44e8abaaa3c3",
"metadata": {},
"outputs": [
@ -472,7 +494,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 15,
"id": "de82db24-d983-4592-a0eb-f96beecb2f69",
"metadata": {},
"outputs": [
@ -622,29 +644,29 @@
"</div>"
],
"text/plain": [
" index source_origin_time source_latitude_deg source_longitude_deg \n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \\\n",
" index source_origin_time source_latitude_deg source_longitude_deg \\\n",
"0 0 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"1 1 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"2 2 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"3 3 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"4 4 2020-01-01 10:09:42.200 5.702646e+06 5.582503e+06 \n",
"\n",
" source_depth_km source_magnitude split station_network_code station_code \n",
"0 0.7 2.469231 train PL BRDW \\\n",
" source_depth_km source_magnitude split station_network_code station_code \\\n",
"0 0.7 2.469231 train PL BRDW \n",
"1 0.7 2.469231 train PL BRDW \n",
"2 0.7 2.469231 train PL GROD \n",
"3 0.7 2.469231 train PL GROD \n",
"4 0.7 2.469231 train PL GUZI \n",
"\n",
" trace_channel trace_sampling_rate_hz trace_start_time \n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \\\n",
" trace_channel trace_sampling_rate_hz trace_start_time \\\n",
"0 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"1 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"2 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"3 EHE 100.0 2020-01-01T10:09:36.480000Z \n",
"4 CNE 100.0 2020-01-01T10:09:36.476000Z \n",
"\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \n",
"0 792.0 bucket0$0,:3,:2001 NaN \\\n",
" trace_Pg_arrival_sample trace_name trace_Sg_arrival_sample \\\n",
"0 792.0 bucket0$0,:3,:2001 NaN \n",
"1 NaN bucket0$1,:3,:2001 921.0 \n",
"2 872.0 bucket0$2,:3,:2001 NaN \n",
"3 NaN bucket0$3,:3,:2001 1017.0 \n",
@ -658,7 +680,7 @@
"4 ZNE "
]
},
"execution_count": 14,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@ -700,7 +722,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.6"
}
},
"nbformat": 4,

19
utils/convert_data.sh Normal file
View File

@ -0,0 +1,19 @@
#!/bin/bash
#SBATCH --job-name=mseeds_to_seisbench
#SBATCH --time=1:00:00
#SBATCH --account=plgeposai22gpu-gpu
#SBATCH --partition plgrid
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=24gb
## activate conda environment
source /net/pr2/projects/plgrid/plggeposai/kmilian/mambaforge/bin/activate
conda activate epos-ai-train
input_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka"
catalog_path="/net/pr2/projects/plgrid/plggeposai/datasets/bogdanka/BOIS_all.xml"
output_path="/net/pr2/projects/plgrid/plggeposai/kmilian/platform-demo-scripts/datasets/bogdanka/seisbench_format"
python mseeds_to_seisbench.py --input_path $input_path --catalog_path $catalog_path --output_path $output_path

View File

@ -0,0 +1,250 @@
import os
import pandas as pd
import glob
from pathlib import Path
import obspy
from obspy.core.event import read_events
import seisbench
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
import argparse
logging.basicConfig(filename="output.out",
filemode='a',
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
datefmt='%H:%M:%S',
level=logging.DEBUG)
logger = logging.getLogger('converter')
def create_traces_catalog(directory, years):
    for year in years:
        year_dir = f"{directory}/{year}"
        # collect the daily mseed files for this year; the glob pattern assumes the layout described in the README
        files = glob.glob(f"{year_dir}/**/*.D.{year}.*", recursive=True)
traces = []
for i, f in enumerate(files):
st = obspy.read(f)
for tr in st.traces:
# trace_id = tr.id
# start = tr.meta.starttime
# end = tr.meta.endtime
trs = pd.Series({
'trace_id': tr.id,
'trace_st': tr.meta.starttime,
'trace_end': tr.meta.endtime,
'stream_fname': f
})
traces.append(trs)
    traces_catalog = pd.DataFrame(traces)
    # append to the catalog file; write the header only if the file does not exist yet
    catalog_csv = "data/bogdanka/traces_catalog.csv"
    traces_catalog.to_csv(catalog_csv, mode='a', header=not os.path.exists(catalog_csv), index=False)
def split_events(events, input_path):
logger.info("Splitting available events into train, dev and test sets ...")
events_stats = pd.DataFrame()
events_stats.index.name = "event"
for i, event in enumerate(events):
#check if mseed exists
actual_picks = 0
for pick in event.picks:
trace_params = get_trace_params(pick)
trace_path = get_trace_path(input_path, trace_params)
if os.path.isfile(trace_path):
actual_picks += 1
events_stats.loc[i, "pick_count"] = actual_picks
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
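    # chronological split by cumulative pick count: roughly the first 70% of picks go to train, the next 15% to dev, the rest to test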
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
events_stats['split'] = 'test'
for i, event in events_stats.iterrows():
if event['pick_count_cumsum'] < train_th:
events_stats.loc[i, 'split'] = 'train'
elif event['pick_count_cumsum'] < dev_th:
events_stats.loc[i, 'split'] = 'dev'
else:
break
return events_stats
def get_event_params(event):
origin = event.preferred_origin()
if origin is None:
return {}
# print(origin)
mag = event.preferred_magnitude()
source_id = str(event.resource_id)
event_params = {
"source_id": source_id,
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
"source_latitude_deg": origin.latitude,
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
"source_longitude_deg": origin.longitude,
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
"source_depth_km": origin.depth / 1e3,
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
"uncertainty"] is not None else None,
}
if mag is not None:
event_params["source_magnitude"] = mag.mag
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
event_params["source_magnitude_type"] = mag.magnitude_type
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
return event_params
def get_trace_params(pick):
net = pick.waveform_id.network_code
sta = pick.waveform_id.station_code
trace_params = {
"station_network_code": net,
"station_code": sta,
"trace_channel": pick.waveform_id.channel_code,
"station_location_code": pick.waveform_id.location_code,
"time": pick.time
}
return trace_params
def find_trace(pick_time, traces):
for tr in traces:
if pick_time > tr.stats.endtime:
continue
if pick_time >= tr.stats.starttime:
# print(pick_time, " - selected trace: ", tr)
return tr
logger.warning(f"no matching trace for peak: {pick_time}")
return None
def get_trace_path(input_path, trace_params):
year = trace_params["time"].year
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
net = trace_params["station_network_code"]
station = trace_params["station_code"]
tr_channel = trace_params["trace_channel"]
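    # e.g. input_path/2018/PL/ALBE/EHE.D/PL.ALBE..EHE.D.2018.282 (layout described in the README)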
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
return path
def load_trace(input_path, trace_params):
trace_path = get_trace_path(input_path, trace_params)
trace = None
if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
else:
stream = obspy.read(trace_path)
if len(stream.traces) > 1:
trace = find_trace(trace_params["time"], stream.traces)
elif len(stream.traces) == 0:
logger.warning(f"no data in: {trace_path}")
else:
trace = stream.traces[0]
return trace
def load_stream(input_path, trace_params, time_before=60, time_after=60):
trace_path = get_trace_path(input_path, trace_params)
sampling_rate, stream = None, None
pick_time = trace_params["time"]
if not os.path.isfile(trace_path):
print(trace_path + " not found")
else:
stream = obspy.read(trace_path)
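        # keep only a window of time_before/time_after seconds around the pick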
stream = stream.slice(pick_time - time_before, pick_time + time_after)
if len(stream.traces) == 0:
print(f"no data in: {trace_path}")
else:
sampling_rate = stream.traces[0].stats.sampling_rate
return sampling_rate, stream
def convert_mseed_to_seisbench_format(input_path, catalog_path, output_path):
"""
Convert mseed files to seisbench dataset format
:param input_path: folder with mseed files
:param catalog_path: path to events catalog in quakeml format
:param output_path: folder to save seisbench dataset
:return:
"""
logger.info("Loading events catalog ...")
events = read_events(catalog_path)
events_stats = split_events(events, input_path)
metadata_path = output_path + "/metadata.csv"
waveforms_path = output_path + "/waveforms.hdf5"
logger.debug("Catalog loaded, starting conversion ...")
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
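        # SeisBench array layout: channels-first arrays ('CW') with components ordered Z, N, E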
writer.data_format = {
"dimension_order": "CW",
"component_order": "ZNE",
}
for i, event in enumerate(events):
logger.debug(f"Converting {i} event")
event_params = get_event_params(event)
event_params["split"] = events_stats.loc[i, "split"]
for pick in event.picks:
trace_params = get_trace_params(pick)
sampling_rate, stream = load_stream(input_path, trace_params)
if stream is None:
continue
actual_t_start, data, _ = sbu.stream_to_array(
stream,
component_order=writer.data_format["component_order"],
)
trace_params["trace_sampling_rate_hz"] = sampling_rate
trace_params["trace_start_time"] = str(actual_t_start)
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
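                # convert the absolute pick time into a sample index relative to the trace start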
pick_idx = (pick_time - actual_t_start) * sampling_rate
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
writer.add_trace({**event_params, **trace_params}, data)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert mseed files to seisbench format')
parser.add_argument('--input_path', type=str, help='Path to mseed files')
parser.add_argument('--catalog_path', type=str, help='Path to events catalog in quakeml format')
parser.add_argument('--output_path', type=str, help='Path to output files')
args = parser.parse_args()
convert_mseed_to_seisbench_format(args.input_path, args.catalog_path, args.output_path)

230
utils/utils.py Normal file
View File

@ -0,0 +1,230 @@
import os
import pandas as pd
import glob
from pathlib import Path
import obspy
from obspy.core.event import read_events
import seisbench.data as sbd
import seisbench.util as sbu
import sys
import logging
logging.basicConfig(filename="output.out",
filemode='a',
format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
datefmt='%H:%M:%S',
level=logging.DEBUG)
logger = logging.getLogger('converter')
def create_traces_catalog(directory, years):
    for year in years:
        year_dir = f"{directory}/{year}"
        # collect the daily mseed files for this year; the glob pattern assumes the layout described in the README
        files = glob.glob(f"{year_dir}/**/*.D.{year}.*", recursive=True)
traces = []
for i, f in enumerate(files):
st = obspy.read(f)
for tr in st.traces:
# trace_id = tr.id
# start = tr.meta.starttime
# end = tr.meta.endtime
trs = pd.Series({
'trace_id': tr.id,
'trace_st': tr.meta.starttime,
'trace_end': tr.meta.endtime,
'stream_fname': f
})
traces.append(trs)
    traces_catalog = pd.DataFrame(traces)
    # append to the catalog file; write the header only if the file does not exist yet
    catalog_csv = "data/bogdanka/traces_catalog.csv"
    traces_catalog.to_csv(catalog_csv, mode='a', header=not os.path.exists(catalog_csv), index=False)
def split_events(events, input_path):
logger.info("Splitting available events into train, dev and test sets ...")
events_stats = pd.DataFrame()
events_stats.index.name = "event"
for i, event in enumerate(events):
#check if mseed exists
actual_picks = 0
for pick in event.picks:
trace_params = get_trace_params(pick)
trace_path = get_trace_path(input_path, trace_params)
if os.path.isfile(trace_path):
actual_picks += 1
events_stats.loc[i, "pick_count"] = actual_picks
events_stats['pick_count_cumsum'] = events_stats.pick_count.cumsum()
train_th = 0.7 * events_stats.pick_count_cumsum.values[-1]
dev_th = 0.85 * events_stats.pick_count_cumsum.values[-1]
events_stats['split'] = 'test'
for i, event in events_stats.iterrows():
if event['pick_count_cumsum'] < train_th:
events_stats.loc[i, 'split'] = 'train'
elif event['pick_count_cumsum'] < dev_th:
events_stats.loc[i, 'split'] = 'dev'
else:
break
return events_stats
def get_event_params(event):
origin = event.preferred_origin()
if origin is None:
return {}
# print(origin)
mag = event.preferred_magnitude()
source_id = str(event.resource_id)
event_params = {
"source_id": source_id,
"source_origin_uncertainty_sec": origin.time_errors["uncertainty"],
"source_latitude_deg": origin.latitude,
"source_latitude_uncertainty_km": origin.latitude_errors["uncertainty"],
"source_longitude_deg": origin.longitude,
"source_longitude_uncertainty_km": origin.longitude_errors["uncertainty"],
"source_depth_km": origin.depth / 1e3,
"source_depth_uncertainty_km": origin.depth_errors["uncertainty"] / 1e3 if origin.depth_errors[
"uncertainty"] is not None else None,
}
if mag is not None:
event_params["source_magnitude"] = mag.mag
event_params["source_magnitude_uncertainty"] = mag.mag_errors["uncertainty"]
event_params["source_magnitude_type"] = mag.magnitude_type
event_params["source_magnitude_author"] = mag.creation_info.agency_id if mag.creation_info is not None else None
return event_params
def get_trace_params(pick):
net = pick.waveform_id.network_code
sta = pick.waveform_id.station_code
trace_params = {
"station_network_code": net,
"station_code": sta,
"trace_channel": pick.waveform_id.channel_code,
"station_location_code": pick.waveform_id.location_code,
"time": pick.time
}
return trace_params
def find_trace(pick_time, traces):
for tr in traces:
if pick_time > tr.stats.endtime:
continue
if pick_time >= tr.stats.starttime:
# print(pick_time, " - selected trace: ", tr)
return tr
logger.warning(f"no matching trace for peak: {pick_time}")
return None
def get_trace_path(input_path, trace_params):
year = trace_params["time"].year
day_of_year = pd.Timestamp(str(trace_params["time"])).day_of_year
net = trace_params["station_network_code"]
station = trace_params["station_code"]
tr_channel = trace_params["trace_channel"]
path = f"{input_path}/{year}/{net}/{station}/{tr_channel}.D/{net}.{station}..{tr_channel}.D.{year}.{day_of_year}"
return path
def load_trace(input_path, trace_params):
trace_path = get_trace_path(input_path, trace_params)
trace = None
if not os.path.isfile(trace_path):
        logger.warning(trace_path + " not found")
else:
stream = obspy.read(trace_path)
if len(stream.traces) > 1:
trace = find_trace(trace_params["time"], stream.traces)
elif len(stream.traces) == 0:
logger.warning(f"no data in: {trace_path}")
else:
trace = stream.traces[0]
return trace
def load_stream(input_path, trace_params, time_before=60, time_after=60):
trace_path = get_trace_path(input_path, trace_params)
sampling_rate, stream = None, None
pick_time = trace_params["time"]
if not os.path.isfile(trace_path):
print(trace_path + " not found")
else:
stream = obspy.read(trace_path)
stream = stream.slice(pick_time - time_before, pick_time + time_after)
if len(stream.traces) == 0:
print(f"no data in: {trace_path}")
else:
sampling_rate = stream.traces[0].stats.sampling_rate
return sampling_rate, stream
def convert_mseed_to_seisbench_format():
input_path = "/net/pr2/projects/plgrid/plggeposai"
logger.info("Loading events catalog ...")
events = read_events(input_path + "/BOIS_all.xml")
    events_stats = split_events(events, input_path)
output_path = input_path + "/seisbench_format"
metadata_path = output_path + "/metadata.csv"
waveforms_path = output_path + "/waveforms.hdf5"
with sbd.WaveformDataWriter(metadata_path, waveforms_path) as writer:
writer.data_format = {
"dimension_order": "CW",
"component_order": "ZNE",
}
for i, event in enumerate(events):
logger.debug(f"Converting {i} event")
event_params = get_event_params(event)
event_params["split"] = events_stats.loc[i, "split"]
# b = False
for pick in event.picks:
trace_params = get_trace_params(pick)
sampling_rate, stream = load_stream(input_path, trace_params)
if stream is None:
continue
actual_t_start, data, _ = sbu.stream_to_array(
stream,
component_order=writer.data_format["component_order"],
)
trace_params["trace_sampling_rate_hz"] = sampling_rate
trace_params["trace_start_time"] = str(actual_t_start)
pick_time = obspy.core.utcdatetime.UTCDateTime(trace_params["time"])
pick_idx = (pick_time - actual_t_start) * sampling_rate
trace_params[f"trace_{pick.phase_hint}_arrival_sample"] = int(pick_idx)
writer.add_trace({**event_params, **trace_params}, data)
if __name__ == "__main__":
convert_mseed_to_seisbench_format()
# create_traces_catalog("/net/pr2/projects/plgrid/plggeposai/", ["2018", "2019"])