Data Traffic Control - make your data files human-friendly
One of my least favorite things to do in my job as a Data Scientist is to open data files. I open data files dozens of times a day: to check on model results, modify configs, and view log files. And yet, for such a common task, it always fills me with a small amount of disgruntlement.
Consider an example. Let’s say I just trained a model, which I saved to the filesystem as a .pkl. I want to open it up and see what hyper-parameters it landed on. To access that file, I would open up IPython and type the following:
import pickle
data_dir = '/cluster/dataset/tenant/projects/my_project'
with open(data_dir+'/models/2021-03-01_16-55-32.pkl', 'rb') as f:
    model = pickle.load(f)
That’s 163 characters to type, just to open a file! Not to mention the annoyance of having to look up the time stamp of the latest model filename, and the frustration of remembering “Does pickle use r mode, or rb mode?”
I didn’t become a computer scientist to memorize long tedious strings! I became a computer scientist to automate all the things! That’s why I built data traffic control, a Python library that automates everyday interactions with your data.
Here’s what the same data file access looks like in data traffic control:
import datatc as dtc
dd = dtc.DataDirectory.load('my_project')
model = dd['models'].latest().load()
That’s about 40% less typing! And more importantly, a lot less to remember: no finicky filepaths, no obtuse file read syntax, and no long timestamped filenames to look up.
Data files are the building materials we work with every day, all day. Working with them should be effortless.
data traffic control: a tour
Curious how data traffic control makes data interactions easier? Let’s take a closer look.
Navigate your data directories with ease, without having to memorize long file paths
Data traffic control remembers your project data directories for you. The first time you use data traffic control for your project, register that project’s data directory:
import datatc as dtc
dtc.DataDirectory.register_project('project_name', '/path/to/project/data/dir/')
From then on, you can boot up a data traffic control DataDirectory object with a simple:
dd = dtc.DataDirectory.load('project_name')
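Curious how a library could remember a project path between sessions? Here is a minimal sketch, assuming a small JSON config file that maps project names to directories. The function names and the config file location are hypothetical, chosen for illustration; this is not a description of datatc's internals.

```python
import json
import tempfile
from pathlib import Path


def register_project(config_file: Path, name: str, data_dir: str) -> None:
    """Record a project name -> data directory mapping in a JSON file."""
    registry = json.loads(config_file.read_text()) if config_file.exists() else {}
    registry[name] = data_dir
    config_file.write_text(json.dumps(registry, indent=2))


def lookup_project(config_file: Path, name: str) -> Path:
    """Return the data directory registered under a project name."""
    registry = json.loads(config_file.read_text())
    if name not in registry:
        raise KeyError(f"project {name!r} is not registered")
    return Path(registry[name])


# Demo against a throwaway config file (hypothetical location).
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "projects.json"
    register_project(config, "project_name", "/path/to/project/data/dir/")
    registered_dir = lookup_project(config, "project_name")
```

Because the mapping lives on disk, the path only has to be typed once; every later session can look it up by name.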
Why do I need this DataDirectory object, you ask? DataDirectory provides a human-friendly interface for easily accessing your files.
ls your project directory and print a nicely formatted file tree:
>>> dd.ls()
project_data/
    raw/
        iris.csv
    processed/
        clean_iris.csv
        clean_iris_bugfix.csv
    feature_sets/
        features.csv
        new_features.csv
        features_bugfix.csv
    models/
        2019-11-05_model.pkl
        2019-11-27_model_v2.pkl
        2019-12-04_model_v3.pkl
        2020-01-13_model_final.pkl
        2020-01-16_model_final2.pkl
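An indented listing like this is straightforward to produce with a recursive walk. The sketch below uses only the standard library; `format_tree` is a hypothetical helper written for this post, not datatc's `ls` implementation, and it sorts entries alphabetically, which may differ from the order datatc prints.

```python
import tempfile
from pathlib import Path
from typing import List


def format_tree(root: Path, indent: int = 0) -> List[str]:
    """Return the lines of an indented listing of root and everything under it."""
    lines = [" " * indent + root.name + "/"]
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        if entry.is_dir():
            # Recurse into subdirectories, one indent level deeper.
            lines.extend(format_tree(entry, indent + 4))
        else:
            lines.append(" " * (indent + 4) + entry.name)
    return lines


# Demo on a small throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project_data"
    (root / "raw").mkdir(parents=True)
    (root / "raw" / "iris.csv").touch()
    (root / "processed").mkdir()
    (root / "processed" / "clean_iris.csv").touch()
    tree_lines = format_tree(root)
    print("\n".join(tree_lines))
```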
Load data files without having to think about it (or looking up the syntax yet again)
Navigate the subdirectories using square brackets ([]). Once you’ve located the right file, just call .load(). Don’t worry about what format the file is in: data traffic control will intuit how to load the file!
For example, you could access the clean_iris_bugfix.csv file like this:
processed_df = dd['processed']['clean_iris_bugfix.csv'].load()
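How can a loader work out the file format on its own? One common approach is to dispatch on the file extension. Here is a minimal sketch using only the standard library; the `load_any` helper, its table of loaders, and the toy data are all hypothetical, and this should not be read as datatc's actual implementation.

```python
import csv
import json
import pickle
import tempfile
from pathlib import Path


def _load_pickle(path: Path):
    # Pickle files are binary, so they must be opened in 'rb' mode.
    with open(path, "rb") as f:
        return pickle.load(f)


def _load_json(path: Path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def _load_csv(path: Path):
    # Returns a list of dicts, one per row, keyed by the header line.
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Map each supported extension to the function that knows how to read it.
LOADERS = {".pkl": _load_pickle, ".json": _load_json, ".csv": _load_csv}


def load_any(path: Path):
    """Load a file by picking a loader based on its extension."""
    suffix = path.suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"no loader registered for {suffix!r} files")
    return LOADERS[suffix](path)


# Demo with throwaway files containing toy data.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "model.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump({"max_depth": 3}, f)
    csv_path = Path(tmp) / "clean_iris.csv"
    csv_path.write_text("sepal_length,species\n5.1,setosa\n", encoding="utf-8")

    loaded_model = load_any(pkl_path)
    loaded_rows = load_any(csv_path)
```

The caller never has to remember a file mode or a reader function; supporting a new format means adding one entry to the table.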
If that’s too much typing (I admire your high standards), you can use helper methods like select and latest to quickly access the file you want.
The select shortcut
For example, you could access the features_bugfix.csv file like this:
features_df = dd['feature_sets']['features_bugfix.csv'].load()
…but even better, you can use the select shortcut, which matches filenames with a search substring:
features_df = dd['feature_sets'].select('bugfix').load()
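To make the matching idea concrete, here is a small sketch of substring selection over a list of filenames. The standalone `select` function below is hypothetical, and raising an error on zero or multiple matches is my own design choice for the sketch; check the datatc documentation for how the real method handles ambiguity.

```python
from typing import List


def select(filenames: List[str], hint: str) -> str:
    """Return the single filename containing hint, or raise if it is ambiguous."""
    matches = [name for name in filenames if hint in name]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {hint!r}, found {len(matches)}: {matches}"
        )
    return matches[0]


feature_files = ["features.csv", "new_features.csv", "features_bugfix.csv"]
selected = select(feature_files, "bugfix")
print(selected)
```

Failing loudly on an ambiguous hint such as 'features' (which matches all three files here) avoids silently loading the wrong file.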
The latest shortcut
Similarly, you could open up the latest model file by writing:
latest_model = dd['models']['2020-01-16_model_final2.pkl'].load()
… or use the latest shortcut to get there faster:
latest_model = dd['models'].latest().load()
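One plausible way to find the newest file is to read the date at the start of each filename. The sketch below assumes every filename begins with a YYYY-MM-DD prefix, as the model files in the tree above do; the `date_prefix` and `latest` functions are hypothetical and not necessarily how datatc decides which file is newest.

```python
from datetime import datetime
from typing import List


def date_prefix(filename: str) -> datetime:
    """Parse the leading YYYY-MM-DD portion of a filename."""
    return datetime.strptime(filename[:10], "%Y-%m-%d")


def latest(filenames: List[str]) -> str:
    """Return the filename with the most recent date prefix."""
    if not filenames:
        raise ValueError("no files to choose from")
    return max(filenames, key=date_prefix)


model_files = [
    "2019-11-05_model.pkl",
    "2019-11-27_model_v2.pkl",
    "2019-12-04_model_v3.pkl",
    "2020-01-13_model_final.pkl",
    "2020-01-16_model_final2.pkl",
]
newest = latest(model_files)
print(newest)
```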
Saving files works the same way. Navigate to the destination directory, and call save with your data object and a name.
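Saving can mirror loading: pick the writer from the extension of the name you pass in. A minimal standard-library sketch follows; the `save_any` helper, the two supported formats, and the toy data are hypothetical, and this is not datatc's save implementation.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_any(obj, directory: Path, name: str) -> Path:
    """Write obj into directory under name, choosing the format from the extension."""
    path = directory / name
    suffix = path.suffix.lower()
    if suffix == ".pkl":
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    elif suffix == ".json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(obj, f)
    else:
        raise ValueError(f"no saver registered for {suffix!r} files")
    return path


# Demo: save toy data to a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_any({"max_depth": 3}, Path(tmp), "params.json")
    saved_name = saved_path.name
    round_trip = json.loads(saved_path.read_text(encoding="utf-8"))
```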
Could data traffic control save you time interacting with your data files? Install it and give it a whirl!
pip install datatc
You can find the full documentation on readthedocs.io