Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one’s own choice?
When you start learning and practicing data science, often the biggest worry is not the algorithms or techniques but availability of raw data. Fortunately, there are many high-quality real-life datasets available on the web for trying out cool machine learning techniques. However, from my personal experience, I found that the same is not true when it comes to learning SQL. Now, for data science - having a basic familiarity of SQL is almost as important as knowing how to write code in Python or R. But access to a large enough database with real data (such as name, age, credit card, SSN, address, birthday, etc.) is not nearly as common as access to toy datasets on Kaggle, specifically designed or curated for machine learning task.
Apart from the beginners in data science, even seasoned software testers may find it useful to have a simple tool where with a few lines of code they can generate arbitrarily large data sets with random (fake) yet meaningful entries.
It is a lightweight, pure-python library to generate random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and save them in either Pandas dataframe object, or as a SQLite table in a database file, or in a MS Excel file.
How to install it?
It’s (current version 1.0.5) hosted on PyPI (Python Package Index repository). Remember you need to have Faker installed to make this work. So, just type,
pip install pydbgen
Note, it’s currently only tested on Python 3.6. It won’t work on Python 2 installations.
How to use it?
You have to initiate a pydb object to start using it.
from pydbgen import pydbgen
After that, you can access the various internal functions exposed by the pydbobject. For example to print random US cities,
This generates a .DB file which can be used with MySQL or SQLite database server. The resulting database table was opened in DB Browser for SQLite and looks like following,
How to generate an Excel file?
Similar to above, just use the following code to generate an Excel file with random data. Note the 'simple_phone' set is set to False, thereby generating complex long-form phone numbers. This can come handy to experiment with more involved data extraction codes!
A cool way to generate random email IDs for scrap use?
One built-in method in pydbgen is realistic_email , which generates random email IDs from a seed name. Can you think of any use of this on the web where you don’t want to give out your real email ID but something close?
The current version is 1.0.5 and may contain many bugs. If you notice any and your program crashes during execution (except for bad entry by you), please let me know. Also, if you have cool idea to contribute to the source code, the Github repo is all open for you. Some questions readily come to mind,
Can we integrate some machine learning/statistical modeling with this random data generator?
Should visualization function be added to the generator?
Possibilities are endless and exciting…
We also use analytics & advertising services. To opt-out click for more information.