TensorFlow Serving Installation and Configuration

TensorFlow Serving is a high-performance serving system designed for deploying TensorFlow models in production, providing REST and gRPC APIs with automatic model versioning and hot-swapping. This guide covers installing TensorFlow Serving on Linux, serving models, configuring multi-model setups, enabling GPU inference, and optimizing performance for production workloads.

Prerequisites

  • Ubuntu 20.04/22.04 (TensorFlow Serving has best support on Ubuntu)
  • TensorFlow 2.x for model training/exporting
  • For GPU: NVIDIA GPU with CUDA 11.8+
  • Docker (recommended for easy version management)
  • 4GB+ RAM minimum; 8GB+ for most production models

Installing TensorFlow Serving

Method 1: Docker (Recommended)

# CPU-only
docker pull tensorflow/serving:latest

# With GPU support
docker pull tensorflow/serving:latest-gpu

# Verify
docker run --rm tensorflow/serving:latest --version

Method 2: Package Manager (Ubuntu)

# Add TensorFlow Serving repository
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | \
  sudo tee /etc/apt/sources.list.d/tensorflow-serving.list

curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | \
  sudo apt-key add -

sudo apt-get update

# Install (uses AVX/SSE4 optimizations)
sudo apt-get install -y tensorflow-model-server

# Or the universal version (no CPU optimizations, for older hardware)
sudo apt-get install -y tensorflow-model-server-universal

# Verify
tensorflow_model_server --version

Preparing Your Model

TensorFlow Serving requires the SavedModel format with a versioned directory structure:

/models/
  my_model/
    1/           <- version number
      saved_model.pb
      variables/
        variables.index
        variables.data-00000-of-00001
    2/           <- newer version (auto-loaded)
      saved_model.pb
      variables/
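
Before pointing TF Serving at a directory, it helps to verify it follows this layout. A stdlib sketch (the path in the usage comment is an example):

```python
import os

def served_versions(model_base_path):
    """Return sorted version numbers that contain a saved_model.pb."""
    versions = []
    for entry in os.listdir(model_base_path):
        full = os.path.join(model_base_path, entry)
        # TF Serving only loads numeric subdirectories
        if entry.isdigit() and os.path.isfile(os.path.join(full, "saved_model.pb")):
            versions.append(int(entry))
    return sorted(versions)

# served_versions("/models/my_model")
```

By default TF Serving serves the highest version present, so `served_versions(...)[-1]` is the version clients will hit.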

Export a model in Python:

import tensorflow as tf

# Example: simple classification model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, ...)

# Export for TF Serving
MODEL_DIR = "/models/my_model/1"
tf.saved_model.save(model, MODEL_DIR)

# Inspect the SavedModel signature from a shell (shown below), or in a
# Jupyter notebook with: !saved_model_cli show --dir {MODEL_DIR} --all

Verify the SavedModel:

saved_model_cli show \
  --dir /models/my_model/1 \
  --tag_set serve \
  --signature_def serving_default

Serving a Model

With Docker

# CPU serving
docker run -d \
  --name tf-serving \
  -p 8501:8501 \
  -p 8500:8500 \
  -v /models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest

# Check logs
docker logs tf-serving -f

With tensorflow_model_server binary

tensorflow_model_server \
  --rest_api_port=8501 \
  --grpc_port=8500 \
  --model_name=my_model \
  --model_base_path=/models/my_model &

# Or as a foreground process for debugging
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_model_warmup=true \
  --file_system_poll_wait_seconds=60  # Check for new versions every 60s

REST API Usage

TensorFlow Serving exposes a REST API on port 8501:

# Check server health
curl http://localhost:8501/v1/models/my_model
# {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}

# Make a prediction (single instance)
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]]
  }'

# Batch prediction
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
      [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    ]
  }'

# Specify a model version
curl -X POST http://localhost:8501/v1/models/my_model/versions/1:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]]}'
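
The examples above use the row-oriented "instances" format; TF Serving's REST API also accepts a column-oriented "inputs" format, which is convenient for models with multiple named inputs. A stdlib sketch of building both payloads (the input key "dense_input" is an assumption — check your signature with saved_model_cli):

```python
import json

features = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
]

# Row format ("instances"): a list with one entry per example
row_payload = json.dumps({"instances": features})

# Column format ("inputs"): tensors keyed by input name,
# useful when the model takes several named inputs
col_payload = json.dumps({
    "signature_name": "serving_default",
    "inputs": {"dense_input": features},
})
```

Either payload can be POSTed to the :predict endpoint shown above.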

Python client:

import requests
import json

def predict(instances, model_name="my_model", host="localhost"):
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    payload = {"instances": instances}
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()["predictions"]

# Usage
result = predict([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]])
print(result)

gRPC API Usage

gRPC is faster than REST for high-throughput production inference:

# Install gRPC client libraries
pip install grpcio tensorflow-serving-api

Python client:

import grpc
import numpy as np
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
import tensorflow as tf

# Connect to TF Serving
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build request
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"

# Input/output tensor keys depend on your model's signature —
# confirm them with: saved_model_cli show --dir /models/my_model/1 --all
input_data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]])
request.inputs["dense_input"].CopyFrom(
    tf.make_tensor_proto(input_data, dtype=tf.float32)
)

# Make prediction
response = stub.Predict(request, timeout=5.0)
output_key = list(response.outputs.keys())[0]  # or use the exact key from the signature
output = tf.make_ndarray(response.outputs[output_key])
print(f"Prediction: {output}")
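
To quantify the REST-vs-gRPC difference for your own model, a small timing harness is enough; a stdlib sketch that works with any zero-argument predict callable (the callables you pass in are assumed to wrap the clients above):

```python
import time

def mean_latency_ms(predict_fn, warmup=5, iterations=100):
    """Average wall-clock latency of predict_fn, in milliseconds."""
    for _ in range(warmup):          # discard cold-start effects
        predict_fn()
    start = time.perf_counter()
    for _ in range(iterations):
        predict_fn()
    return (time.perf_counter() - start) / iterations * 1000.0
```

Compare `mean_latency_ms(rest_predict)` against `mean_latency_ms(grpc_predict)` with identical batch sizes; the gap typically widens as payloads grow.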

Multi-Model Configuration

Serve multiple models simultaneously using a model configuration file:

cat > /models/models.config << 'EOF'
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
  }
  config {
    name: "text_classifier"
    base_path: "/models/text_classifier"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
  config {
    name: "image_model"
    base_path: "/models/image_model"
    model_platform: "tensorflow"
  }
}
EOF

# Start server with config file
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_config_file=/models/models.config \
  --model_config_file_poll_wait_seconds=60 &
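
With several models loaded, it is worth confirming each has reached the AVAILABLE state. GET /v1/models/<name> returns the status JSON shown earlier; a stdlib helper to interpret it (a sketch):

```python
import json

def available_versions(status_json):
    """Parse GET /v1/models/<name> output; return versions in AVAILABLE state."""
    status = json.loads(status_json)
    return [
        entry["version"]
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    ]
```

Pair it with requests or urllib against http://localhost:8501/v1/models/text_classifier to confirm both pinned versions are up.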

GPU Serving

# With Docker and GPU
docker run -d \
  --gpus all \
  --name tf-serving-gpu \
  -p 8501:8501 \
  -p 8500:8500 \
  -v /models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest-gpu

# Limit to specific GPUs
docker run -d \
  --gpus '"device=0"' \
  -p 8501:8501 \
  -v /models:/models \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest-gpu

# Monitor GPU usage
watch -n 1 nvidia-smi
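
GPU throughput usually improves substantially with server-side batching, which groups concurrent requests into a single forward pass. A sketch of enabling it (the parameter values are illustrative starting points, not tuned recommendations):

```shell
cat > /models/batching.config << 'EOF'
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }
EOF

tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/models/batching.config
```

Larger max_batch_size raises throughput at the cost of per-request latency; batch_timeout_micros bounds how long a request waits for the batch to fill.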

Troubleshooting

"Model not found" or 404 errors

# Check model directory structure
ls -la /models/my_model/1/

# Verify saved_model.pb exists
find /models -name "saved_model.pb"

# Check serving logs
docker logs tf-serving

High latency on first request

# Enable model warmup — create a warmup file
mkdir -p /models/my_model/1/assets.extra
# Create tf_serving_warmup_requests with sample requests

# Or pre-warm at startup
tensorflow_model_server \
  --enable_model_warmup=true \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model

"Signature not found" errors

# List available signatures
saved_model_cli show --dir /models/my_model/1 --all

# Use the correct signature name in your request
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"signature_name": "serving_default", "instances": [...]}'

Out of memory on GPU

# Allow incremental GPU memory allocation instead of reserving it all at startup
TF_FORCE_GPU_ALLOW_GROWTH=true tensorflow_model_server ...

# Or cap the fraction of GPU memory the server may use
tensorflow_model_server --per_process_gpu_memory_fraction=0.5 ...

Conclusion

TensorFlow Serving provides production-grade model serving with automatic versioning, REST and gRPC APIs, and GPU acceleration, making it one of the most reliable options for deploying TensorFlow models at scale. Using Docker simplifies deployment and GPU configuration, while the multi-model config file lets you serve all your models from a single server instance.