Your Data Place Project Pipeline - From Folders, to ZIPs, to Docker.

So, to refresh readers, before Your Data Place becomes what I see as a bare minimum platform, it will require a few "functions":

  • Project building.
  • Common library importing/calling.
  • Dataset builders.
  • Dataset uploading.
  • Dataset versioning (potentially).

"A project is just like any other Python project. You build out the desired functionality, write some custom functions to help in that process, then have a script to orchestrate this logic. Your Data Place wants to build this comfortable environment, and add helpful features like function sharing to help improve productivity."

As for the first point, project building, there are a few things I'm trying to work out. To explain, let me lay out the current project functionality:

  1. User creates a project.
  2. Server creates folders for the project.
  3. User asks for the folder structure, and the server sends it back.
  4. Project page renders the files from the server using a file explorer component.
  5. User clicks on a file in the file explorer.
  6. User writes some new content into the file, and it's saved in the server.

My immediate thought on this was "this is so cool!", but once the excitement of taking a good step in the right direction wears off, further objections and refinements come into play:

  • What if I want to do backups on these folder filesystems? I'd need to store them as ZIPs.
  • Why not replace the raw filesystem altogether with a ZIP for each project?
  • What are the security risks of simply storing this data? This SO post seems to say there's no issue with storing text data as-is, but that doesn't close the attack vector: if the "save" logic is loose, it could allow overwriting files that get executed.

So, from those thoughts, it starts to become clear that storing the files in a ZIP would be a good option. But, as seen from the latest posts, I want to use Docker! One thing I NEED if this is to be scalable is the ability to scale containers, not images. Images are huge, and as outlined in an SO post, it looks as if saving Docker "containers" is not a production practice. So, what is a proposed solution that doesn't compromise the user experience?

(What we need from the "filesystem implementation")

  • Ability to query the directory structure.
  • Ability to provide back (to the server) a file's contents.
  • Ability to pack the working directory into a ZIP on command.
  • Ability to use some form of SDK in the container to import packages (perhaps the server sets an environment variable to give a container credentials that only grant access to that user's content?)

(At the stage of saving the file)

  1. Server gets a file path + file data, asking to "save" it.
  2. Write the file to a temp location, then use a Docker copy command to move the file into the container.
  3. Delete the local temp file.
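Those three steps can be sketched as follows. The container name, the "/workspace" working directory, and the injectable "run" parameter (which makes the function testable without a Docker daemon) are all my assumptions:

```python
import subprocess
import tempfile
from pathlib import Path

def save_file_to_container(container: str, relative_path: str, content: str,
                           workdir: str = "/workspace", run=subprocess.run):
    """Steps 1-3: stage the file locally, docker-cp it into the
    container, then delete the local temp copy."""
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)
    cmd = ["docker", "cp", str(tmp_path),
           f"{container}:{workdir}/{relative_path}"]
    try:
        run(cmd, check=True)   # raises if the copy fails
    finally:
        tmp_path.unlink()      # step 3: delete the local temp file
    return cmd
```

The try/finally matters: the temp file is cleaned up even when the copy fails, so a broken container can't leave stray files piling up on the host.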

(At the stage of pulling the container project into the server)

  1. Client asks to download the project as a ZIP.
  2. Server gets the files/ZIP, via one of two options:
    1. Server executes a Docker copy command from the container project directory to the host.
    2. Server runs a common script "zipproj.py" inside the container; when done, it copies the resulting ZIP from the container to the host.
  3. Server ZIPs these files (if necessary), backs them up, and sends them to the client.
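Here's a sketch of the first option (Docker copy of the whole project directory, then ZIP on the host). As before, the "/workspace" working directory and the injectable "run" parameter are assumptions for illustration:

```python
import subprocess
import tempfile
import zipfile
from pathlib import Path

def download_project_zip(container: str, workdir: str = "/workspace",
                         run=subprocess.run) -> Path:
    """Copy the project directory out of the container, then ZIP it
    on the host for the client (and for backups)."""
    staging = Path(tempfile.mkdtemp())
    run(["docker", "cp", f"{container}:{workdir}", str(staging)], check=True)
    copied = staging / Path(workdir).name
    zip_path = staging / "project.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for entry in sorted(copied.rglob("*")):
            if entry.is_file():
                zf.write(entry, entry.relative_to(copied).as_posix())
    return zip_path
```

The "zipproj.py" option trades host CPU for container CPU; the sketch above keeps the container dumb and does all packaging on the host.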

These are both good options, and I think that moving into this implementation will yield a lot of unexpected new ideas once that round is complete (the same way that implementing the first, rather basic idea, yielded this idea).

If you'd like to get your hands on this data platform as soon as it's open to the public, make sure to follow this blog for regular updates and voice what you need in a data platform.
