Shareable data and code in YDP Projects

Datasets, everywhere...

One major capability I want in YDP is the ability to create datasets that can be treated as global, reproducible datasets, produced by a dataset generator. Although I presume datasets will primarily be used inside their own projects, I think there's great value in being able to make these datasets global and accessible to any other project.

The Problem

At the moment, the problem is the implementation. This was no issue when YDP was a sandbox test project on my local filesystem, where I could simply WRITE a file, then READ it back under the same name. But in a real production architecture, we're currently using Docker to manage each project's independent filesystem.

When a dataset is generated in Project 1's file system (FS) and then needs to be used in Project 2's FS, something along the lines of the following has to occur:

  • Flag a Python script as a generator for a dataset.
  • Execute that Python script and save the dataset to Project 1's FS.
  • Add a command to the YDP commands which exports a dataset when needed.
  • When Project 2 wants this dataset, ask the Host to call Project 1's export command, stream the data to the Host, which holds it in memory, then stream it on to Project 2 at runtime (see the sketch after this list).
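
As a rough sketch of what the "flag a generator" step could look like from Project 1's point of view (the decorator name and marker attribute here are hypothetical, not an existing YDP API):

```python
# generate_users.py -- hypothetical Project 1 script, sketch only
import csv


def dataset_generator(name):
    """Register a function as the generator for a named dataset (sketch)."""
    def wrap(fn):
        fn.ydp_dataset = name  # marker the YDP host could look for
        return fn
    return wrap


@dataset_generator("users")
def build_users(out_path="users.csv"):
    # Write the dataset into Project 1's own filesystem.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows([(1, "ada"), (2, "grace")])
    return out_path


if __name__ == "__main__":
    build_users()
```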

See the diagram below for an outline of this idea:

The advantages of this idea are that no additional storage is required and data is sent around on the fly, with no need to partition "extra" space.
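
To make the "no extra space" point concrete, here's a minimal sketch of the streaming step, assuming the Host relays fixed-size chunks from Project 1's export to Project 2 rather than buffering the whole dataset (the file paths and chunk size are placeholders):

```python
# stream_dataset.py -- hypothetical host-side relay, sketch only
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; memory use stays bounded by this


def read_chunks(src: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the exported dataset chunk by chunk instead of loading it whole."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        yield chunk


def relay(src: BinaryIO, dst: BinaryIO) -> None:
    """Pipe Project 1's export stream straight through to Project 2."""
    for chunk in read_chunks(src):
        dst.write(chunk)


if __name__ == "__main__":
    # Stand-ins for "Project 1 export" -> Host -> "Project 2 import".
    with open("project1_users.csv", "rb") as src, open("project2_users.csv", "wb") as dst:
        relay(src, dst)
```

Keeping the relay chunked is also what would let the same path handle the larger-than-RAM payloads mentioned in the conclusion.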

Host, Volumes, and Projects

The only problem with this architecture is clustering two hosts together. You'd still need some sort of streaming implementation in that setup, because how do you share volumes between hosts? Docker may already have a great solution for this.

The advantage of this idea is that I/O is direct, and bringing in a new dataset is instant and requires no "streaming". But streaming may still be the best and simplest method.
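
For comparison, the shared-volume setup might look like this from Project 2's side, assuming the Host mounts a common Docker volume into both containers at `/shared` (the path and dataset name are placeholders):

```python
# read_shared.py -- hypothetical Project 2 script, sketch only
import csv
from pathlib import Path

# If the same Docker volume is mounted into both projects,
# Project 2 can read Project 1's dataset directly -- no streaming step.
SHARED_DIR = Path("/shared/datasets")


def load_users():
    with open(SHARED_DIR / "users.csv", newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    for row in load_users():
        print(row["id"], row["name"])
```

On a single host this is as direct as it gets; the open question above is what replaces the shared volume once two hosts are involved.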

See this article for an outline of how this is possible.

Conclusion

There's no conclusion at the moment. I'll implement the streaming idea first, test it with a larger-than-RAM payload, then go from there and see how it works. If it works well, I can solve the cross-host problem by streaming between hosts as well.
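
For the larger-than-RAM test, one simple approach is to synthesise the payload in chunks too, so the generator itself never holds the full dataset in memory (the target size below is a placeholder; pick something above the host's RAM):

```python
# make_big_payload.py -- sketch of a synthetic larger-than-RAM test file
import os

TARGET_BYTES = 20 * 1024**3      # e.g. 20 GiB
CHUNK = os.urandom(1024 * 1024)  # reuse one random 1 MiB block to keep generation cheap


def write_payload(path="big_payload.bin", target=TARGET_BYTES):
    written = 0
    with open(path, "wb") as f:
        while written < target:
            f.write(CHUNK)
            written += len(CHUNK)
    return written


if __name__ == "__main__":
    print(f"wrote {write_payload():,} bytes")
```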

If you'd like to get your hands on this data platform as soon as it's open to the public, make sure to follow this blog for regular updates and voice what you need in a data platform.
