sdgp

Synthetic data generator plus (sdgp) project

For questions about this package, contact the developer, Damodhar Jangam, at damodhar918@outlook.com.

Overview

Synthetic data generator plus (sdgp) is a Python script that generates mock data from a given configuration. It can also edit and scale existing data to create high-volume datasets, which makes it useful for testing and prototyping.

Features

Package Installation

Install on a Local Machine (optional)

Go through the following sequence:

PS > python -m venv .venv
PS > .\.venv\Scripts\activate
PS > pip install -r requirements.txt
PS > python setup.py install
# The package is now installed and can be invoked as sdgp.
PS > sdgp -h
# Before proceeding, please review the usage section.
PS > sdgp -c m 1000000 csv test test_conf.csv
PS > sdgp -c e 1000000 parquet test test_conf.csv
PS > deactivate # when you need to exit the virtual environment

Install on an edge node (optional)

Go through the following sequence:

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python setup.py install
# The package is now installed and can be invoked as sdgp.
$ sdgp -h
# Before proceeding, please review the usage section.
$ sdgp -c m 1000000 csv test test_conf.csv
$ sdgp -c e 1000000 parquet test test_conf.csv
$ deactivate # when you need to exit the virtual environment

You are now ready to proceed: the Synthetic data generator plus package is installed and available within your virtual environment.

Usage

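To run the script, you need a configuration file plus a few command-line arguments. The configuration lists one rule per output column, with a name, a type, and pipe-separated values: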

name            type                 values
id1             uniqueIndex          800000000
date1           date                 2022-10-26 |%Y-%m-%d
time1           time                 00:00:00|23:59:59
dateRange1      dateRange            2021-10-10 | 2022-10-26 |%Y-%m-%d
incometime2     dateRange            2021-10-10 | 2022-10-26 |%Y-%m-%d %H:%M:%S
outcometime3    dependentDateRange   incometime2|1D|3W|%Y-%m-%d %H:%M:%S
model1          category             Customers|Lending|Web_Lending
model           category             Customers|Lending|Web_Lending |
null1           category             |
gender1         category             0|1|~0.4|0.5|0.1
probability1    floatRange           0.001|1|3
float1          floatRange           0.001|0.3|5
number1         intRange             10|25
test1           constant             Done
name1           regexPattern         ([a-z]{3,10})\, ([a-z]{3,10})
phone_number    regexPattern         (\+[4-9]{2,3})\-([4-9]{5})\-([4-9]{5})
zip_code        regexPattern         ([4-9]{5})
email_address   regexPattern         ([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)
compositeKey1   composite            dateRange1 | model1 |number1 |phone_number|zip_code

The same kind of configuration written as a CSV file:

name,type,values
id1,uniqueIndex,800000000000000000000000000000
date1,date,2022-10-26|%Y-%m-%d
time1,time,00:00:00|23:59:59|%H:%M:%S
dateRange1,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d
incometime2,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d %H:%M:%S
outcometime3,dependentDateRange,incometime2|1D|3W|%Y-%m-%d %H:%M:%S
model1,category,Customers|Lending|Web_Lending
model,category,Customers|Lending|Web_Lending|
null1,category,
gender1,category,0|1|~0.4|0.5|0.1
probability1,floatRange,0.001|1|3
float1,floatRange,0.001|0.3|5
number1,intRange,10|25
test1,constant,Done
name1,regexPattern,"([a-z]{3,10})\, ([a-z]{3,10})"
phone_number,regexPattern,"(\+[4-9]{2,3})\-([4-9]{5})\-([4-9]{5})"
zip_code,regexPattern,([4-9]{5})
email_address,regexPattern,"([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)"
compositeKey1,composite,dateRange1|model1|number1|phone_number|zip_code
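
To make the layout of the configuration file concrete, here is a minimal Python sketch that reads such a file with the standard csv module and splits each values field on the | separator. The helper name load_rules and the returned structure are illustrative assumptions, not part of sdgp itself. Regex patterns are kept whole because they may contain | as part of the pattern.

```python
import csv
import io

# A few rows taken from the sample configuration above.
SAMPLE_CONF = r'''name,type,values
date1,date,2022-10-26|%Y-%m-%d
model,category,Customers|Lending|Web_Lending|
number1,intRange,10|25
name1,regexPattern,"([a-z]{3,10})\, ([a-z]{3,10})"
'''


def load_rules(text):
    """Parse configuration CSV text into a list of rule dicts (hypothetical helper)."""
    rules = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["values"] or ""
        if row["type"] == "regexPattern":
            # Keep the pattern intact: "|" can be part of the regex itself.
            parts = [raw]
        else:
            parts = [part.strip() for part in raw.split("|")]
        rules.append({"name": row["name"], "type": row["type"], "values": parts})
    return rules


rules = load_rules(SAMPLE_CONF)
for rule in rules:
    print(rule)
```

Note that the trailing | in the model row yields an empty last value, which matches the empty model cell visible in the sample output.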

Explanation of the data patterns defined in the configuration file:

Each row in this CSV file defines a rule for generating or handling data in a specific column of another dataset. The rules include generating unique indices, fixed or random dates/times, categorical values, float values within a range, integer values within a range, or constant values.
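
As an illustration of how a few of these rule types could behave, the sketch below generates values for intRange, floatRange, and dependentDateRange. The reading of the fields is an assumption inferred from the sample configuration and output (min|max for intRange, min|max|decimals for floatRange, and base column|minimum offset|maximum offset|format for dependentDateRange, with D meaning days and W meaning weeks). The function names are hypothetical and this is not sdgp's own implementation.

```python
import random
from datetime import datetime, timedelta

# Assumed meaning of the offset suffixes: D = days, W = weeks.
UNIT_SECONDS = {"D": 86400, "W": 7 * 86400}


def parse_offset(token):
    """Turn a token such as '1D' or '3W' into a timedelta (assumed syntax)."""
    return timedelta(seconds=int(token[:-1]) * UNIT_SECONDS[token[-1]])


def dependent_date(base, low, high, fmt, rng):
    """Return base shifted by a random offset between low and high."""
    start = datetime.strptime(base, fmt)
    lo, hi = parse_offset(low), parse_offset(high)
    offset = timedelta(seconds=rng.uniform(lo.total_seconds(), hi.total_seconds()))
    return (start + offset).strftime(fmt)


def float_range(low, high, decimals, rng):
    """Return a random float in [low, high], rounded to the given decimals."""
    return round(rng.uniform(low, high), decimals)


def int_range(low, high, rng):
    """Return a random integer in [low, high], both ends included."""
    return rng.randint(low, high)


rng = random.Random(0)  # seeded so repeated runs give the same values
fmt = "%Y-%m-%d %H:%M:%S"
print(dependent_date("2021-10-11 19:45:49", "1D", "3W", fmt, rng))
print(float_range(0.001, 0.3, 5, rng))
print(int_range(10, 25, rng))
```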

datetime formats you can use in the script:
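
The format strings in the sample configuration (%Y-%m-%d, %H:%M:%S, and their combination) look like the directives accepted by Python's datetime.strftime and datetime.strptime. Assuming that is what the script expects, the following sketch shows what each of those three formats produces for one fixed moment; the variable names are illustrative only.

```python
from datetime import datetime

moment = datetime(2022, 10, 26, 23, 59, 59)
formats = ("%Y-%m-%d", "%H:%M:%S", "%Y-%m-%d %H:%M:%S")

formatted = {}
for fmt in formats:
    text = moment.strftime(fmt)
    # Round-trip check: text produced by a format must parse with the same format.
    datetime.strptime(text, fmt)
    formatted[fmt] = text
    print(f"{fmt!r:24} -> {text}")
```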

To run the script, use the following command:

sdgp -c <choice> <volume> <format> <csv_file> <conf_csv_file>

positional arguments:
  volume                The size. An integer value that specifies how many rows to generate mock data. Recommended
                        minimum value is more than volume size or more than 1000.
  {csv,parquet}         The type of format to save the mock data. csv for CSV format, parquet for Parquet format.
  csv_file              The CSV file name. A string value that specifies the name of the CSV file to read or write.
  conf_csv_file         The configuration CSV file name. A string value that specifies the name of the configuration
                        CSV file to read. This argument is required if mode is e or g.
options:
  -h, --help            show this help message and exit
  -c {m,e,g}, --choice {m,e,g}
                        The type of function to select. m for mock data, e for edit mock data, g for generate high
                        volume data.
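
The help text above has the shape of output produced by Python's argparse module. As a rough sketch under that assumption, a parser with the same arguments could be declared as follows. This is a reconstruction from the help text, not sdgp's actual source. Making -c required and conf_csv_file optional are both assumptions, the latter based on the usage examples where the g mode is called without a configuration file.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented sdgp arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="sdgp")
    parser.add_argument(
        "-c", "--choice", choices=["m", "e", "g"], required=True,
        help="m for mock data, e for edit mock data, g for generate high volume data",
    )
    parser.add_argument("volume", type=int, help="how many rows to generate")
    parser.add_argument("format", choices=["csv", "parquet"], help="output format")
    parser.add_argument("csv_file", help="name of the CSV file to read or write")
    # Optional positional: assumed from the examples, where mode g omits it.
    parser.add_argument("conf_csv_file", nargs="?", help="configuration CSV file name")
    return parser


args = build_parser().parse_args(["-c", "m", "50000", "csv", "mock_table", "conf.csv"])
print(args.choice, args.volume, args.format, args.csv_file, args.conf_csv_file)
```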

For example:

sdgp -c m 50000 csv mock_table conf.csv # Generate 50000 rows of mock data and save as mock_table_50000.csv
sdgp -c e 100000 parquet edit_table.csv conf.csv # Edit the existing data according to conf.csv, generate 100000 records and save as edit_table_100000.parquet
sdgp -c g 1000000 csv scale.csv # Generate 1000000 rows of mock data by scaling existing data and save as scale_1000000.csv
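
The comments in these examples suggest that the output file is named after the input name, the volume, and the format. A small sketch of that naming rule, assuming a trailing .csv on the input name is dropped first (output_name is a hypothetical helper, not a function exposed by sdgp):

```python
def output_name(csv_file, volume, fmt):
    """Derive '<name>_<volume>.<fmt>' as the example comments suggest (assumed rule)."""
    name = csv_file[:-4] if csv_file.lower().endswith(".csv") else csv_file
    return f"{name}_{volume}.{fmt}"


print(output_name("mock_table", 50000, "csv"))          # mock_table_50000.csv
print(output_name("edit_table.csv", 100000, "parquet"))  # edit_table_100000.parquet
print(output_name("scale.csv", 1000000, "csv"))          # scale_1000000.csv
```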

Sample output for sdgp -c m 1000000 csv test .\test_conf.csv:

id1,date1,model1,model,gender1,probability1,float1,number1,test1,time1,dateRange1,incometime2,outcometime3,name1,phone_number,zip_code,email_address,compositeKey1
800000000000000000000000022022,2022-10-26,Lending,,1,0.526,0.01349,18,Done,09:17:27,2021-12-21,2021-10-11 19:45:49,2021-10-29 15:22:51,"xfdqsmj, pfnzgqd",+69-68479-47968,45568,S@euly.in,5291fed2490313181144993e6f9d0e478a774cbe
800000000000000000000000069854,2022-10-26,Lending,Web_Lending,1,0.466,0.13702,23,Done,13:06:09,2022-06-16,2022-05-06 05:33:04,2022-05-15 23:17:00,"spjcnaumo, fmxd",+769-58564-59786,74648,6HzAItG@rcfsb.com,553d8e0a445569f8c329c4f2cad5bd0a217e2cf8
800000000000000000000000052417,2022-10-26,Lending,Customers,1,0.474,0.07092,15,Done,13:27:22,2022-05-12,2022-06-09 00:01:55,2022-06-20 04:30:03,"wnzebd, xuhqai",+99-88586-45856,49977,c0@u.in,65d22a12c4c95d2d14615f9d5b4c6582cd60c45f
800000000000000000000000068698,2022-10-26,Customers,Web_Lending,0,0.012,0.12498,23,Done,00:19:00,2022-09-15,2022-04-25 08:29:39,2022-05-14 14:29:06,"kccxqzujf, aqitzbuj",+47-86496-46488,75598,1h4xIF@dx.in,f2c1cd5e87cc1a5ed0cf800bcb7228c3c4f621cb
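
Generated rows can be read back and checked against the rules that produced them. The sketch below does this for the first two sample rows using only the standard library. The limits it checks (number1 in 10..25, float1 in 0.001..0.3, outcometime3 between 1 day and 3 weeks after incometime2) are an interpretation of the sample configuration, and check_row is a hypothetical helper.

```python
import csv
import io
from datetime import datetime, timedelta

# Header and first two rows copied from the sample output above.
SAMPLE = '''id1,date1,model1,model,gender1,probability1,float1,number1,test1,time1,dateRange1,incometime2,outcometime3,name1,phone_number,zip_code,email_address,compositeKey1
800000000000000000000000022022,2022-10-26,Lending,,1,0.526,0.01349,18,Done,09:17:27,2021-12-21,2021-10-11 19:45:49,2021-10-29 15:22:51,"xfdqsmj, pfnzgqd",+69-68479-47968,45568,S@euly.in,5291fed2490313181144993e6f9d0e478a774cbe
800000000000000000000000069854,2022-10-26,Lending,Web_Lending,1,0.466,0.13702,23,Done,13:06:09,2022-06-16,2022-05-06 05:33:04,2022-05-15 23:17:00,"spjcnaumo, fmxd",+769-58564-59786,74648,6HzAItG@rcfsb.com,553d8e0a445569f8c329c4f2cad5bd0a217e2cf8
'''

FMT = "%Y-%m-%d %H:%M:%S"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))


def check_row(row):
    """Return a list of problems found in one generated row (empty if none)."""
    problems = []
    gap = datetime.strptime(row["outcometime3"], FMT) - datetime.strptime(row["incometime2"], FMT)
    if not timedelta(days=1) <= gap <= timedelta(weeks=3):
        problems.append("outcometime3 is not 1 day to 3 weeks after incometime2")
    if not 10 <= int(row["number1"]) <= 25:
        problems.append("number1 outside 10..25")
    if not 0.001 <= float(row["float1"]) <= 0.3:
        problems.append("float1 outside 0.001..0.3")
    return problems


for row in rows:
    print(row["id1"], check_row(row) or "ok")
```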

[Image: sample output]

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

If you have any questions, feedback, or suggestions, please feel free to contact me at damodhar918@outlook.com. You can also open an issue or submit a pull request on GitHub if you want to contribute to this project. I hope you find this project useful and interesting. Thank you for reading! 😊