Model Deployment (Alpha)
Allow users to deploy a model as a service.
Features
- Multiple ML framework support: Supports TensorFlow 1, TensorFlow 2, Keras, PyTorch, XGBoost, MXNet, scikit-learn, and LightGBM
- Multiple language support: Supports Python, Java, R, NodeJS, and Go (please see the Seldon wrapper documentation)
- Horizontal scale-out: The deployed model service can be scaled to multiple replicas to achieve load balancing and fault tolerance easily.
- Deployment history: Track the deployment history.
- Resource constraint: The resource usage of model deployments is constrained by the group resource quota.
- Ingress: Create an ingress resource to route external requests to the internal model service.
User Journey
Deploy a model
- (Admin) Enable model deployment for a group
- Create a deployment. Select the instance type, model image, and the number of replicas.
- Wait until the deployment is ready
- Use `curl` to test against the model deployment endpoint (see the example in the API Endpoint section below)
Package a model
- Train a model with an ML framework and select the best model for deployment
- Wrap the model file and build the image with `s2i` (see the sketch after this list)
- Test the packaged image locally
- Push the image to a Docker registry
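A minimal sketch of the packaging flow for the Python language wrapper, using the Seldon s2i builder image; the builder image tag, target image name, and registry are placeholders, and an `.s2i/environment` file (with `MODEL_NAME`, `API_TYPE`, `SERVICE_TYPE`) is assumed to exist in the source directory:

```bash
# Build the model image with s2i, using the Seldon Python 3 builder image.
s2i build . seldonio/seldon-core-s2i-python3:0.18 my-registry/spam-classifier:1.0.0

# Test the packaged image locally; the Python wrapper serves REST on port 5000 by default.
docker run --rm -p 5000:5000 my-registry/spam-classifier:1.0.0

# Push the image to the Docker registry so the cluster can pull it.
docker push my-registry/spam-classifier:1.0.0
```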
Update a deployment
- Select a deployment
- Click the Update button
- Change the image and deploy
Configuration
Please add this variable to the `.env` file.
Name | Value |
---|---|
PRIMEHUB_FEATURE_MODEL_DEPLOYEMNT | true |
Chart value
Path | Description | Default Value |
---|---|---|
modelDeployment.enabled | Enable the model deployment | false |
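A minimal sketch of enabling the feature, assuming a Helm-based PrimeHub installation; the `.env` location, release name, and chart reference are placeholders, not the actual install procedure:

```bash
# Turn on the console feature flag (variable name as listed in the table above).
echo "PRIMEHUB_FEATURE_MODEL_DEPLOYEMNT=true" >> .env

# Enable the chart value on the next upgrade (release and chart names are placeholders).
helm upgrade primehub primehub/primehub --set modelDeployment.enabled=true
```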
Design
Seldon
Seldon is a community model deployment solution. We chose Seldon because it provides a common way to package models from different frameworks and programming languages into a Docker image.
Seldon also provides an operator under the Seldon Core project that manages a `SeldonDeployment` resource and reconciles it into the underlying deployments and services. However, for simplicity, we decided not to use the `SeldonDeployment` resource. Instead, we define `PhDeployment` and a controller that generates the underlying Deployment and Service directly.
Custom Resource
A custom resource `PhDeployment` is defined for the PrimeHub model deployment. It is very similar to a native Kubernetes Deployment: the controller spawns a Deployment according to the `PhDeployment`'s spec. The difference is that the spec contains PrimeHub-specific concepts, such as User, Group, and InstanceType.
Here is an example of a `PhDeployment`:
```yaml
apiVersion: primehub.io/v1alpha1
kind: PhDeployment
metadata:
  name: spam-classifier-abcxy
  namespace: hub
spec:
  displayName: "spam classifier"
  userId: "4d203a08-896a-4aa8-86e2-882f4d4aadec"
  userName: "phadmin"
  groupId: "ca6b032e-b8be-44d2-9646-092622d6ba15"
  groupName: "phusers"
  stop: false
  description: |
    This is my first deployment.
    This is my first deployment.
    This is my first deployment.
  predictors:
  - name: predictor1
    replicas: 2
    modelImage: sandhya1594/spam-classifier:1.0.0.1
    instanceType: cpu-only
    metadata:
      LEARNING_RATE: "0.02"
      MINI_BATCH: "20"
      ACCURACY: "0.98"
status:
  phase: deploying
  message: "Deploying"
  replicas: 2
  availableReplicas: 2
  endpoint: https://primehub.local/deployment/user-defined-postfix/predict
  history:
  - time: 2020-03-23T02:03:15Z
    spec: <PhDeploymentSpec>
  - time: 2020-03-22T23:45:23Z
    spec: <PhDeploymentSpec>
```
The `PhDeployment` resource has the following children:
- Ingress: The ingress resource to route traffic to the given model deployment
- Service: The service resource of a given deployment
- Deployment: The deployment of the user's image.
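A quick way to inspect these children for the example above; this is only a sketch that assumes the generated resources carry the `PhDeployment` name as a prefix:

```bash
# List the Deployment, Service, and Ingress spawned for the example PhDeployment.
# The naming convention is an assumption; adjust to whatever the controller generates.
kubectl -n hub get deployment,service,ingress | grep spam-classifier-abcxy
```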
Model Image
Each deployment requires a model image. The model image is responsible for translating REST requests into the internal model prediction calls.
In the Seldon documentation, there are two ways to prepare the model image:
Pre-Packaged Inference Servers
- MLflow Server
- SKLearn server
- TensorFlow Serving
- XGBoost server
Language Wrappers
- Python Language Wrapper (Production)
- Java Language Wrapper (Incubating)
- R Language Wrapper (ALPHA)
- NodeJS Language Wrapper (ALPHA)
- Go Language Wrapper (ALPHA)
Currently, PrimeHub model deployment ONLY supports the language wrapper solution. In the future, we may provide a guideline for writing a Dockerfile that packs the model file into a pre-packaged server image.
API Endpoint
According to the Seldon external prediction API and internal microservice API, the prediction endpoint is:

```
<prefix>/api/v0.1/predictions
# or
<prefix>/api/v1.0/predictions
```

For a PrimeHub model deployment, the full endpoint would be:

```
https://<primehub-domain>/deployment/<deployment-name>/api/v1.0/predictions
```
The input and output of the prediction endpoint are a tensor or an ndarray.
You can also send unstructured data (e.g. an image file); please find more examples in our model deployment examples.
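For example, a prediction request with an ndarray payload could look like the following sketch (the domain, deployment name, and values are placeholders):

```bash
# Send a 1x2 ndarray to the prediction endpoint of a deployment named spam-classifier-abcxy.
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"data": {"ndarray": [[1.0, 2.0]]}}' \
  https://<primehub-domain>/deployment/spam-classifier-abcxy/api/v1.0/predictions
```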
Deployment Phases
There are 5 phases for a `PhDeployment`:
- Deploying: The model is deploying. When a deployment is created, updated, or started, it enters this phase immediately.
- Deployed: The model is deployed successfully. All replicas are in the available state.
- Stopping: The deployment is stopping. When a deployment is stopped, it enters this phase immediately.
- Stopped: The deployment is stopped successfully.
- Failed: The model deployment failed.
There are several reasons for the `Failed` phase, including:
- Group or instance type not found
- Image invalid or cannot be pulled successfully
- Group resource not enough
- Cluster resource not enough
Resource Constraint
A model deployment consumes only group quota.
The pods of the model deployment have the label `primehub.io/group=escape(<group>)`. PrimeHub's validating webhook rejects pod creation if the new pod would exceed the group resource quota.
Once the quota is exceeded, the deployment changes to the `Failed` phase with the "group resource not enough" error message.
Replicas Log
The pods of the model deployment have the labels:
- `app=primehub-group`
- `primehub.io/phdeployment=<deployment-id>`

The GraphQL server lists the pods by these labels and shows the log of the container named `model`.
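The same labels can be used to inspect the logs manually; a sketch assuming the `hub` namespace:

```bash
# Find the pods of a given deployment by its labels.
kubectl -n hub get pods -l app=primehub-group,primehub.io/phdeployment=<deployment-id>

# Show the log of the model container in one of those pods.
kubectl -n hub logs <pod-name> -c model
```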
Deployment History
Whenever the `.spec` changes, a new record is appended under `.status.history`. It contains `time` for the update time and `spec` for a snapshot of the newly updated `.spec`.
The history array only keeps the latest 32 records.
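A sketch of reading the history directly from the resource, assuming the `hub` namespace and the example deployment name:

```bash
# Print the update time of every record under .status.history.
kubectl -n hub get phdeployment spam-classifier-abcxy \
  -o jsonpath='{range .status.history[*]}{.time}{"\n"}{end}'
```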
Monitoring
We use the Seldon engine to export Prometheus metrics. Under the hood, it accepts the prediction request and forwards it to the user's wrapped model container, while keeping track of the count and latency of each request. The metric details are described in the Seldon metrics documentation.
Seldon has a project named Seldon Analytics, which installs Prometheus and Grafana. However, our preferred Prometheus/Grafana installation is prometheus-operator. To adapt the metrics to prometheus-operator, we implement our own PodMonitor and Grafana dashboard to visualize the collected metrics.
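A minimal sketch of such a prometheus-operator `PodMonitor`; the port name, metrics path, and interval are assumptions, not the actual PrimeHub manifest:

```bash
# Apply a PodMonitor that scrapes the Seldon engine metrics from model deployment pods.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: primehub-model-deployment
  namespace: hub
spec:
  selector:
    matchLabels:
      app: primehub-group
  podMetricsEndpoints:
  - port: metrics          # assumed name of the container port exposing the metrics
    path: /prometheus      # assumed metrics path of the Seldon engine
    interval: 15s
EOF
```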