WDC for Geophysics, Beijing (中国地球物理学科中心)
 
   

Author-submitted data information


ID 624
Title Core drilling data and geoscience tabular dataset augmentation code
Creator Pengfei Lv
Subject Dataset augmentation, geoscience tabular data, generative adversarial network
Publisher Xiukuan Zhao
Description This code augments geoscience tabular datasets. It addresses class imbalance during generation with a three-stage strategy: local augmentation at the input, mini-batch sampling during training, and a classification integrator at the output. The code covers data preprocessing, model training, and visualization analysis. Detailed instructions follow:

1. Data (Data.xlsx)

The dataset Data.xlsx contains features and class labels, which are used to train the model and generate new data. To use it for training and data generation:
  • Load and preprocess the data (change the path in df = pd.read_excel('Data.xlsx') to point to your file).
  • The generated data will be saved as generated_Data_consistent.xlsx. Adjust the save path as needed.
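The load-and-preprocess step can be sketched as follows. This is a minimal sketch rather than the packaged script: the helper names split_features_labels, min_max_scale, and load_and_preprocess are hypothetical, and min-max scaling is only one possible normalization, chosen here as an assumption. The 'Class' column name follows the label-column convention this record describes for custom data.

```python
import pandas as pd


def split_features_labels(df, label_col="Class"):
    """Separate the feature columns from the class-label column."""
    if label_col not in df.columns:
        raise KeyError(f"expected a '{label_col}' column")
    features = df.drop(columns=[label_col]).astype(float)
    labels = df[label_col]
    return features, labels


def min_max_scale(features):
    """Scale each feature column to [0, 1]; constant columns map to 0."""
    lo = features.min()
    # Replace a zero range with 1 so constant columns do not divide by zero.
    span = (features.max() - lo).replace(0, 1.0)
    return (features - lo) / span


def load_and_preprocess(path="Data.xlsx"):
    """Hypothetical wrapper: read the workbook, then split and scale it."""
    df = pd.read_excel(path)  # change to your file path
    features, labels = split_features_labels(df)
    return min_max_scale(features), labels
```

After training, a generated DataFrame could be written with generated.to_excel('generated_Data_consistent.xlsx', index=False), where generated stands for whatever table the model produces.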

2. Custom Data (MyData.xlsx)

To run the code with your custom data, follow these steps:
  • Save your data as MyData.xlsx, ensuring that it includes feature columns and a 'Class' label column.
  • Change the load call in the code to df = pd.read_excel('MyData.xlsx') so that it reads your data.
  • Adjust data preprocessing (e.g., normalization methods) according to the characteristics of your data.
  • Run the code to train the model; the generated data is saved automatically once training finishes.
  • Use built-in visualization functions (e.g., plot_histograms() and t-SNE) to analyze the distribution of the generated data.
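Before training on a custom file, it can help to confirm that the table has the expected layout. The check below is a hedged sketch, not part of the distributed code: check_custom_data is a hypothetical helper, and the requirement that every feature column be numeric and free of missing values is an assumption about what the model expects.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_custom_data(df, label_col="Class"):
    """Validate the table layout and return the number of rows per class."""
    if label_col not in df.columns:
        raise ValueError(f"missing '{label_col}' label column")
    feature_cols = [c for c in df.columns if c != label_col]
    if not feature_cols:
        raise ValueError("no feature columns found")
    # Assumption: features must be numeric for normalization and training.
    non_numeric = [c for c in feature_cols if not is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric feature columns: {non_numeric}")
    if df[feature_cols].isna().any().any():
        raise ValueError("feature columns contain missing values")
    return df[label_col].value_counts().to_dict()
```

Calling check_custom_data(pd.read_excel('MyData.xlsx')) returns a dictionary of class counts, which also shows at a glance how imbalanced the classes are.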
Notes:
  • Adjust Experimental Parameters: You can modify training parameters such as latent_dim, batch_size, etc., to optimize generation results.
  • Class Imbalance: If one class (say, Class A) significantly outnumbers another (Class B), use the local augmentation method and adjust the augmentation ratio as needed.
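To illustrate what an augmentation ratio controls, the sketch below oversamples a minority class by adding small Gaussian jitter to existing rows. It is a generic stand-in and should not be read as the local augmentation method implemented in the distributed code, whose details are not given in this record. The function name augment_minority and the parameters ratio, noise_scale, and seed are hypothetical, and a two-class setting is assumed.

```python
import numpy as np


def augment_minority(X, y, minority_label, ratio=1.0, noise_scale=0.05, seed=0):
    """Append jittered copies of minority rows.

    ratio is the target minority/majority count ratio after augmentation
    (assumption: every row not labelled minority_label counts as majority).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = X[y == minority_label]
    majority_count = int(np.sum(y != minority_label))
    n_new = max(0, int(round(ratio * majority_count)) - len(minority))
    if n_new == 0 or len(minority) == 0:
        return X, y
    # Pick minority rows at random and perturb them around their values,
    # scaling the noise by each feature's spread within the minority class.
    picks = rng.integers(0, len(minority), size=n_new)
    sigma = minority.std(axis=0)
    noise = rng.normal(0.0, 1.0, size=(n_new, X.shape[1]))
    new_rows = minority[picks] + noise_scale * sigma * noise
    X_aug = np.vstack([X, new_rows])
    y_aug = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_aug, y_aug
```

With ratio=1.0 the minority class is brought up to the majority count; smaller values leave some imbalance in place, which may be preferable when the minority class has very few genuine samples to perturb.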

3. Software Requirements

The code can be executed directly with the following software and versions:
  • Python 3.8 or higher
  • PyTorch 1.9.0 or higher
  • pandas 1.3.0 or higher
  • NumPy 1.21.0
  • scikit-learn 0.24.2
  • matplotlib 3.4.2
  • seaborn 0.11.1 or higher
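One way to confirm an environment against the list above is sketched below. The helper names version_tuple and check_requirements are hypothetical. The sketch treats every listed version as a minimum, which is an assumption for NumPy, scikit-learn, and matplotlib, since the list gives exact versions for those three. Package names are the PyPI distribution names (for example, torch for PyTorch).

```python
import re
from importlib import metadata

# Versions from the list above, keyed by PyPI distribution name.
REQUIRED = {
    "torch": "1.9.0",
    "pandas": "1.3.0",
    "numpy": "1.21.0",
    "scikit-learn": "0.24.2",
    "matplotlib": "3.4.2",
    "seaborn": "0.11.1",
}


def version_tuple(version):
    """Leading numeric components, e.g. '1.9.0+cu111' gives (1, 9, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(required=REQUIRED):
    """Return {package: problem} for packages missing or older than listed."""
    problems = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems
```

An empty dictionary from check_requirements() means every listed package is present at or above the listed version; the Python version itself can be checked separately with sys.version_info >= (3, 8).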
Contributor Pengfei Lv
Date 2022-2023
Type
Format .xlsx, .py
URL http://www.geophys.ac.cn/ArticleData/20241111DatasetAugmentation.zip
DOI 10.12197/2024GA025
Source
Language eng
Relation
Coverage
Rights Institute of Geology and Geophysics, Chinese Academy of Sciences