Data Traffic Control - make your data files human-friendly
One of my least favorite things to do in my job as a Data Scientist is to open data files. I open data files dozens of times a day- to check on model results, modify configs, and view log files. And yet for being such a common task, it always fills me with a small amount of disgruntlement.
Consider an example. Let’s say I just trained a model, which I saved to the filesystem as a .pkl
. I want to open it up and see what hyper-parameters it landed on. To access that file, I would open up IPython and type the following:
import pickle
data_dir = '/cluster/dataset/tenant/projects/my_project'
with open(data_dir+'models/2021-03-01_16-55-32.pkl', 'rb+') as f:
model = pickle.load(f)
That’s 163 characters to type, just to open a file! Not to mention the annoyance of having to look up the time stamp of the latest model filename, and the frustration of remembering “Does pickle use r
mode, or rb
, or rb+
…?"
I didn’t become a computer scientist to memorize long tedious strings! I became a computer scientist to automate all the things! That’s why I built data traffic control- a python library that automates everyday interactions with your data.
Here’s what the same data file access looks like in data traffic control:
import datatc as dtc
dd = dtc.DataDirectory.load('my_project')
model = dd['models'].latest().load()
That’s almost half as much typing! And more importantly, a lot less to remember: no finicky filepaths, no obtuse file read syntax, and no long timestamped filenames to look up.
Data files are the building materials we work with every day, all day. Working with them should be effortless.
data traffic control: a tour
Curious how data traffic control makes data interactions easier? Let’s take a closer look.
Navigate your data directories with ease, without having to memorize long file paths
Data traffic control remembers your project data directories for you. The first time you use data traffic control for your project, register that project’s data directory:
import datatc as dtc
dtc.DataDirectory.register_project('project_name', '/path/to/project/data/dir/')
From then on, you can boot up a data traffic control DataDirectory
object with a simple load
call:
dd = dtc.DataDirectory.load('project_name')
Why do I need this DataDirectory
object, you ask? DataDirectory
provides a human-friendly interface for easily accessing your files.
For example, DataDirectory
can ls
your project directory and print a nicely formatted file tree:
>>> dd.ls()
project_data/
raw/
iris.csv
processed/
clean_iris.csv
clean_iris_bugfix.csv
feature_sets/
features.csv
new_features.csv
features_bugfix.csv
models/
2019-11-05_model.pkl
2019-11-27_model_v2.pkl
2019-12-04_model_v3.pkl
2020-01-13_model_final.pkl
2020-01-16_model_final2.pkl
Load data files without having to think about it (or looking up the syntax yet again)
Navigate the subdirectories using []
. Once you’ve located the right file, just call .load()
. Don’t worry about what format the file is in- data traffic control will intuit how to load the file!
For example, you could access the clean_iris_bugfix.csv
file like this:
processed_df = dd['processed']['clean_iris_bugfix.csv'].load()
If that’s too much typing (I admire your high standards), you can use helper methods like select
and latest
to quickly access the file you want.
The select shortcut
For example, you could access the features_bugfix.csv
file like this:
features_df = dd['feature_sets']['features_bugfix.csv'].load()
…but even better, you can use the select
shortcut, which matches filenames with a search substring:
features_df = dd['feature_sets'].select('bugfix').load()
The latest shortcut
Similarly, you could open up the latest model file by writing:
latest_model = dd['models']['2020-01-16_model_final2.pkl'].load()
… or use the latest
shortcut to get there faster:
latest_model = dd['models'].latest().load()
Saving files
Saving files works the same way. Navigate to the destination directory, and call save with your data object and a name.
dd['feature_sets'].save(new_features_df, 'new_features.csv')
pip install datatc
Could data traffic control save you time interacting with your data files? Install it and give it a whirl!
You can find the full documentation on readthedocs.io