Dataset Upload
Provides an upload server for uploading data to datasets of type pv.
Configuration
Prerequisite
PRIMEHUB_FEATURE_USER_PORTAL must be set to true, and PRIMEHUB_DOMAIN must be set.
Settings
Add the following variable to the .env file:
Name | Value |
---|---|
PRIMEHUB_FEATURE_DATASET_UPLOAD | true |
Install
make release-install-primehub
Migration
Set the PRIMEHUB_STORAGE_CLASS environment variable to the correct storage class.
Troubleshooting
- Check Primehub Console Container's Environment Variables
The environment variables should be added automatically.
PRIMEHUB_FEATURE_DATASET_UPLOAD will be added to the graphql and ui containers when PRIMEHUB_FEATURE_DATASET_UPLOAD is true in your cluster's .env file.
CMS_APP_PREFIX will be added to the graphql container.
PRIMEHUB_GROUP_SC will be added to the graphql container. Its value is taken from:
primehub-console:
  graphql:
    primehubGroupSC: {value}
If you did not specify a value for primehubGroupSC in the YAML, it falls back to the PRIMEHUB_STORAGE_CLASS environment variable.
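The fallback described above can be sketched as follows. This is only an illustration of the precedence, not the chart's actual logic: the function name resolve_group_sc and the dict-shaped inputs are hypothetical, and treating an empty value as "not specified" is an assumption.

```python
from typing import Optional


def resolve_group_sc(values: dict, env: dict) -> Optional[str]:
    """Return the storage class that would end up in PRIMEHUB_GROUP_SC.

    `values` stands in for the parsed YAML, `env` for the cluster's .env
    settings. Both shapes are assumptions made for this sketch.
    """
    # Walk primehub-console -> graphql -> primehubGroupSC, tolerating missing keys.
    graphql = (values.get("primehub-console") or {}).get("graphql") or {}
    explicit = graphql.get("primehubGroupSC")
    if explicit:
        return explicit
    # Nothing set in the YAML: fall back to PRIMEHUB_STORAGE_CLASS.
    return env.get("PRIMEHUB_STORAGE_CLASS")
```

If PRIMEHUB_GROUP_SC in the graphql container does not match what this precedence predicts, the YAML value or the .env setting is the first thing to check.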
- Check Issuer
If you are using the letsencrypt-prod-dns issuer, your dataset upload ingress annotations should contain:
certmanager.k8s.io/acme-challenge-type: "dns01"
certmanager.k8s.io/acme-dns01-provider: "clouddns"
certmanager.k8s.io/cluster-issuer: "letsencrypt-prod-dns"
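A small helper can make this check less error-prone. The sketch below compares an ingress's annotations against the three listed above; the function name missing_annotations is hypothetical, and it assumes you have already loaded the annotations into a dict (for example from the metadata.annotations field of the ingress as JSON).

```python
REQUIRED_ANNOTATIONS = {
    "certmanager.k8s.io/acme-challenge-type": "dns01",
    "certmanager.k8s.io/acme-dns01-provider": "clouddns",
    "certmanager.k8s.io/cluster-issuer": "letsencrypt-prod-dns",
}


def missing_annotations(annotations: dict) -> dict:
    """Return the required annotations that are absent or have a different value."""
    return {
        key: expected
        for key, expected in REQUIRED_ANNOTATIONS.items()
        if annotations.get(key) != expected
    }
```

An empty result means all three annotations are present with the expected values.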
Design
We use the tus protocol for resumable file uploads. The backend is tusd and the frontend package is uppy. To let users view and edit uploaded files, there is also a Flask server, which uses the Flask-AutoIndex package to list files. The dataset upload deployment therefore contains two containers, and both mount the pv dataset.
Metacontroller is used to automatically create desired resources based on our settings.
Application code is under modules/primehub-dataset-upload. K8s and metacontroller related code is under modules/charts/primehub.
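To make the resumable part of the design more concrete, the sketch below builds the request headers defined by tus protocol 1.0.0: a creation request announces the total size and metadata, and each later PATCH states the offset it continues from. The helper names are hypothetical, and passing the file name under the metadata key filename is an assumption about what the client sends; this is not code from modules/primehub-dataset-upload.

```python
import base64

TUS_VERSION = "1.0.0"


def encode_upload_metadata(metadata: dict) -> str:
    """Encode metadata as tus expects: comma-separated 'key base64(value)' pairs."""
    pairs = []
    for key, value in metadata.items():
        encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
        pairs.append(f"{key} {encoded}")
    return ",".join(pairs)


def creation_headers(size: int, filename: str) -> dict:
    """Headers for the POST request that creates a new upload."""
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Length": str(size),
        # Assumption: the file name travels under the 'filename' metadata key.
        "Upload-Metadata": encode_upload_metadata({"filename": filename}),
    }


def patch_headers(offset: int) -> dict:
    """Headers for a PATCH request that continues an upload at `offset`."""
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Offset": str(offset),
        "Content-Type": "application/offset+octet-stream",
    }
```

After an interruption, a client sends a HEAD request to the upload URL, reads Upload-Offset from the response, and resumes with a PATCH from that offset.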
Start/Stop Dataset Upload Server
When a dataset has the annotation dataset.primehub.io/uploadServer: "true", a dataset upload server is started for it. Otherwise, the server is stopped.
Currently, the dataset upload URL is https://<primehub domain>/admin/dataset/<namespace>/<dataset name>/browse/.
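The rule and the URL pattern above can be expressed as a short sketch. The function names are hypothetical, and comparing the annotation against the exact string "true" is an assumption about how the controller evaluates it.

```python
UPLOAD_SERVER_ANNOTATION = "dataset.primehub.io/uploadServer"


def upload_server_enabled(annotations: dict) -> bool:
    """True when the dataset's annotations request an upload server."""
    # Assumption: only the exact string "true" enables the server.
    return annotations.get(UPLOAD_SERVER_ANNOTATION) == "true"


def browse_url(domain: str, namespace: str, dataset_name: str) -> str:
    """Build the browse URL for a dataset's upload server."""
    return f"https://{domain}/admin/dataset/{namespace}/{dataset_name}/browse/"
```

For example, browse_url("example.com", "hub", "demo") yields https://example.com/admin/dataset/hub/demo/browse/, where all three arguments are placeholder values.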
Enable HTTP Auth for Dataset Upload Server
First, create a secret from an auth file generated by htpasswd. For example:
htpasswd -c auth <name>
kubectl -n hub create secret generic dataset-upload-<name> --from-file=auth
Then add the annotation dataset.primehub.io/uploadServerAuthSecretName: dataset-upload-<name> to the dataset to enable HTTP auth.
The username is <name>.
Current Post-Finish Hook in Tus Server (Tusd)
- Create the target directory if needed
- Rename the .bin file to its real file name
- Remove the .info file generated by tusd
- If the file is a zip file, unzip it
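The four steps above can be sketched in Python. This is a hedged illustration rather than the actual hook: the function post_finish and its parameters are hypothetical, and it assumes the .info file is JSON that stores the original name under MetaData.filename, that zip files are detected by content, and that the archive is kept after extraction.

```python
import json
import shutil
import zipfile
from pathlib import Path


def post_finish(upload_dir: Path, upload_id: str, dataset_dir: Path) -> Path:
    """Move a finished tusd upload into the dataset and unzip it if it is a zip."""
    bin_path = upload_dir / f"{upload_id}.bin"
    info_path = upload_dir / f"{upload_id}.info"

    # Assumption: the .info file is JSON with the name under MetaData.filename.
    info = json.loads(info_path.read_text())
    # Keep only the last path component so a crafted name cannot leave dataset_dir.
    filename = Path(info["MetaData"]["filename"]).name

    # 1. Create the target directory if needed.
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # 2. Rename the .bin file to its real file name.
    target = dataset_dir / filename
    shutil.move(str(bin_path), str(target))

    # 3. Remove the .info file generated by tusd.
    info_path.unlink()

    # 4. If the file is a zip file, unzip it next to the archive.
    if zipfile.is_zipfile(target):
        with zipfile.ZipFile(target) as archive:
            archive.extractall(dataset_dir)

    return target
```

Reducing the name to its last path component is a precaution added for this sketch. Whether the real hook sanitizes names or deletes the archive after extraction is not stated here.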
Other Notes
- The CLI's resume capability currently only handles bad network conditions. It does not handle the case where a user cancels an upload job, and the web UI and CLI cannot resume each other's uploads. (https://github.com/tus/tus-js-client/issues/62)
- Mechanism to clean up temporary state files.
CLI
- Download from https://github.com/avvertix/tus-client-cli/releases/tag/v0.3.0
- ./tus-client-macos upload <filepath> https://<primehub domain>/admin/dataset/<name>/upload/files/