Persistence
PrimeHub provides several types of persistent data stores. This document describes the characteristics of each type and the conditions under which each performs best.
Volume Types
User Volume
A user's private storage.
A User Volume is good for:
- Storing personal data (e.g. datasets, code)
A User Volume has the following limitations:
- It can only be accessed via notebooks.
Shared Volume
A volume shared among group members. All members can read and write data to this volume. It is like an NFS server for a group.
A Shared Volume is good for:
- Storing shared data among members in a group
- Exchanging data among notebooks, jobs, and apps
A Shared Volume cannot be used for:
- Downloading and uploading data through an API/CLI/SDK
A group's Shared Volume is not enabled by default. Please contact the system administrator to enable it. For more information, please see Group Management.
PHFS Storage
PHFS (PrimeHub File System) is shared among group members, like a Shared Volume. However, PHFS has the added benefit of being object storage, similar to S3. Because of the characteristics of object storage, PHFS is the most accessible of all the storage types.
Data stored in PHFS can be found under the subpath /groups/<group> of an object storage bucket.
There are several ways to access data stored in PHFS:
- PHFS can be mounted in notebooks, apps, and jobs.
- Users can download/upload content from the Shared Files UI in the User Portal.
- Users can list and download files from the PrimeHub SDK/CLI.
PHFS is good for:
- Uploading and downloading files via the Shared Files UI
- Data exchange through PrimeHub SDK/CLI
- Storing the artifacts of a job's output
- The source of model files for model deployment
Although PHFS can be accessed through the file system, its access mode is not fully POSIX-compatible. It does not support random access or append writes; it is suitable only for sequential reads and sequential writes.
Due to this limitation, PHFS cannot be used for:
- Uploading a file larger than 1 MB from the notebook UI (i.e. the JupyterLab upload feature). An error will occur and the uploaded file will be truncated to 1 MB. To upload files larger than 1 MB, please use the Shared Files UI.
- The output of training. Some ML frameworks cannot write training results to PHFS successfully. For example, in TensorFlow, writing model files in HDF5 format to PHFS causes the error "Problems closing file (file write failed: ...)" because HDF5 uses seek while writing. To store training results in PHFS, first write them to a User Volume or Shared Volume, and then copy them to PHFS.
- The input of training. PHFS has the worst performance of all the storage types. To train on a dataset over multiple runs, we recommend putting it in a User Volume or Shared Volume.
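The staging workaround described above (write to a User Volume or Shared Volume first, then copy to PHFS) can be sketched in Python as follows. This is a minimal illustration, not PrimeHub tooling: the helper names copy_sequentially and export_to_phfs, the 1 MiB chunk size, and the example paths in the comments are assumptions made for this sketch. The copy uses only in-order reads and writes, which is the access pattern PHFS is described as supporting.

```python
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read/write; an arbitrary choice for this sketch


def copy_sequentially(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst using only in-order reads and writes (no seek, no append).

    Returns the number of bytes copied.
    """
    copied = 0
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            writer.write(chunk)
            copied += len(chunk)
    return copied


def export_to_phfs(local_path, phfs_dir="/phfs"):
    """Copy a finished local file into a PHFS directory, keeping its file name.

    phfs_dir must already exist; "/phfs" is the mount point given in this document.
    """
    target = os.path.join(phfs_dir, os.path.basename(local_path))
    copy_sequentially(local_path, target)
    return target


# Hypothetical usage inside a notebook or job:
#   1. Let the framework write with seek on the User Volume, e.g. /home/jovyan/model.h5
#   2. export_to_phfs("/home/jovyan/model.h5")
```

The same copy_sequentially helper can be used in the other direction, to stage a dataset from PHFS onto a User Volume or Shared Volume once before repeated training runs.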
PHFS is not installed by default. To enable it, please see the documentation on configuring the PrimeHub Store and PHFS.
Dataset Volume
A Dataset Volume is a storage type that can be shared among multiple groups. The following permission settings can be configured:
- Read-only on a global or per-group basis
- Writable on a per-group basis
There are several kinds of Dataset Volumes we can create:
- Persistent volume (PV): Like a Shared Volume, but can be shared among multiple groups rather than just a single group.
- NFS: A volume that connects to an external NFS server.
- Host Path: A special kind of volume that mounts the host filesystem.
- Git: A special kind of volume which syncs the upstream git repository periodically. The actual data is stored on the host filesystem.
- Env: Technically, this is not a volume, but a method to configure environment variables to be used in notebooks and jobs.
A Dataset Volume is good for:
- Sharing among groups. In an education environment, for example, datasets could be shared among multiple teams (groups) of students with read-only permissions, while the teaching assistants could be in another group with write permissions.
- Special storage destinations (e.g. external NFS server, host path, git sync)
A Dataset Volume has the following limitations:
- Data cannot be downloaded or uploaded through an API/CLI/SDK.
- If the volume is to be used by only one group, a Shared Volume is preferred for its ease of use.
A Dataset Volume is configured by the system administrator. For more information, please see Dataset Management. For some types of Dataset Volume, an upload server can also be configured to upload data to the volume.
Comparison
Type | Shared by | API/UI Access | Use case |
---|---|---|---|
User Volume | No | No | Private data |
Shared Volume | Members of a group | No | Shared data in group |
PHFS | Members of a group | Yes | Data import/export |
Dataset Volume | Multiple groups | No | Shared data among groups |
All four storage options can be accessed via the file system. The following table describes the mount points and characteristics:
Type | Available in | Mount point | Characteristic |
---|---|---|---|
User Volume | Notebooks | /home/jovyan | Best performance (like block device) |
Shared Volume | Notebooks, Apps, Jobs | /project/<group> | Good performance (like NFS) |
PHFS | Notebooks, Apps, Jobs | /phfs | Limited access mode: sequential read/write only (like object storage) |
Dataset Volume | Notebooks, Apps, Jobs | /datasets/<dataset> | Good performance (like NFS) |
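As a rough illustration of the mount points in the table above, the following Python sketch resolves the path for each storage type and reports whether that path is present and writable in the current environment. The names MOUNT_TEMPLATES, mount_point, and describe are hypothetical helpers written for this example and are not part of the PrimeHub SDK. The check only inspects the local file system, so it says nothing about how the administrator configured permissions.

```python
import os

# Mount points taken from the table above; {name} stands for <group> or <dataset>.
MOUNT_TEMPLATES = {
    "user": "/home/jovyan",        # private volume, notebooks only
    "shared": "/project/{name}",   # volume shared within a group
    "phfs": "/phfs",               # object storage, sequential access only
    "dataset": "/datasets/{name}", # volume shared among groups
}


def mount_point(storage, name=None):
    """Return the expected mount path for a storage type.

    name is the group name for "shared" and the dataset name for "dataset".
    """
    template = MOUNT_TEMPLATES[storage]
    if "{name}" in template:
        if not name:
            raise ValueError(f"{storage} storage needs a group or dataset name")
        return template.format(name=name)
    return template


def describe(path):
    """Report whether path exists as a directory and is writable by this process."""
    return {
        "path": path,
        "present": os.path.isdir(path),
        "writable": os.access(path, os.W_OK),
    }


# Hypothetical usage: check a dataset before trying to write to it.
#   info = describe(mount_point("dataset", "my-dataset"))
#   if not info["writable"]:
#       print("Dataset is read-only for this group; write results elsewhere.")
```

A check like this can be useful at the start of a job, since a Dataset Volume may be read-only for some groups and a User Volume is available only in notebooks.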