Create Your First Data Analytics Cluster

Karrots provides an example data analytics setup that builds and deploys an example Python application, following modern cloud software practices. A data analytics team can use this example as a starting point for its own development and runtime environment. Once you have it running, you can easily add your own applications and services. Karrots makes it easy to go from laptop to deployed application.

This tutorial assumes this is your first time using Karrots, so it walks you through the steps needed to create and run your first Karrots cluster. This first cluster will become your team's QA cluster, and it includes CI/CD (continuous-integration/continuous-deployment) services that automate the build, testing and deployment of your team's applications. Once it's running, Karrots handles the heavy lifting: you focus on writing and testing your code without worrying about automation. Karrots also helps you debug and support your deployed applications with observability and metrics tools.

Terminology

Before you create your first Karrots cluster, it helps to learn some devops automation terminology. Don't be overwhelmed; these definitions are only here to help you understand what's going on inside Karrots. You don't need to become a devops expert.

Devops: "Developer-operators" are workers who possess both software development and systems operation skills. To run a modern, cloud-based service you typically need devops staff to set up and maintain the infrastructure that runs your applications.
Infrastructure as code: A practice where your entire infrastructure is defined as code, so that it comes up entirely through automation and you maintain it through automation as well. Workers do not issue commands or use developer consoles to modify the infrastructure.
CI: Continuous integration is the process of building and testing your code on every git commit. It means that at all times the team knows the testing and quality state of the code committed to your git repo. If a commit fails to build or pass a test, the CI system rejects the commit and informs the team.
CD: Continuous delivery is the process of deploying every build that passes CI. By default we only do this automatically for developer branches, not production. Typically a team inspects code manually before promoting it to production. The way to do this is to have the production cluster nominate a specific container tag for each application it should deploy; to release a new production version you edit that tag, then git commit and push the change (see the sketch following these definitions).
Kubernetes: A self-contained, virtual system, originally created by Google, that runs containerized applications at a cloud provider such as AWS, Azure or Google Cloud. A core feature is built-in resiliency and load-based autoscaling. Kubernetes is complex and normally requires a highly skilled devops team to set up and manage, but it also lends itself to full automation more than anything that came before. Karrots leverages this automation capability to build cookie-cutter environments that do not require devops staff to set up or run.
Cluster: In Kubernetes, a cluster is a self-contained, complete environment that runs one or more containerized applications or services. In our cookie-cutter model, the best way to think of a cluster is as an isolated runtime environment such as dev, qa or production. Karrots makes it easy to create clusters that are copies (git branches) of other clusters so that you can experiment in an isolated run environment without impacting the other run environments.
Gitops: A form of automation that uses git to drive all cluster maintenance processes. In every cluster Karrots installs a gitops tool called Flux and maps it to a git repo branch. That branch contains the cluster's runtime goal state, and the Flux service ensures the cluster matches that goal state at all times. Once you create a cluster, the only actions you need for automation are git commit and push; the cluster updates itself to match those git changes.
Environment Control Repo: Every team has its own Karrots git repo that controls all of the team's clusters. This repo has a standard directory structure that contains base services and the deployed clusters. Karrots seeds this repo when you first set it up for the team.
Seed Repo: A public GitHub repo, maintained by the Karrots team, that Karrots uses to seed a team's Environment Control Repo.
Cluster Control Directory: Within every Environment Control Repo there is a clusters directory, and each sub-directory is a "Cluster Control Directory" used to configure, create and destroy a cluster. N.B.: because Karrots uses gitops automation, you will likely have multiple running cluster instances for each cluster control directory, each controlled by a different git branch.
HelmRelease: Helm is a tool for Kubernetes that packages and deploys resources, making it easy for a human to bring up and maintain an application or service. A HelmRelease is a resource used by Flux (gitops) to automate Helm for an application or service without human intervention.
karrots.yaml: Every Cluster Control Directory has a karrots.yaml file that controls the creation of the cluster.
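
As a concrete sketch of the CD promotion flow described above, promoting a new production version is a one-line tag edit followed by a git push. The file path and values here are hypothetical, not the literal layout of the seed repo:

# clusters/production/helm-releases/my-app.yaml (hypothetical)
spec:
  values:
    image:
      repository: <your-registry>/my-app
      tag: "1.4.2"    # edit this tag to promote a new version

Then commit and push; Flux (gitops) notices the change and rolls out the new version:

git commit -am "Promote my-app to 1.4.2"
git push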

Install Kubectl and the Karrots Binary

On macOS, install Homebrew if you haven't already:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install the Karrots binary using brew:

brew tap zero-diff/karrots
brew install karrots

If you don't already have kubectl installed, make sure to install it and create the ~/.kube/config file.

brew install kubectl
mkdir ~/.kube/
touch ~/.kube/config

On Debian-based Linux, install the Karrots binary from the Karrots apt repository instead:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys B4D285C0003B4D71
sudo add-apt-repository "deb https://zero-diff.github.io/karrots/debian-repo/ karrots-github main"
sudo apt-get update
sudo apt-get install karrots

On Linux, if you don't already have kubectl installed, make sure to install it and create the ~/.kube/config file:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
mkdir ~/.kube/
touch ~/.kube/config

If you don't already have helm installed, make sure to install it (the commands below are for Debian-based Linux; on macOS, brew install helm works), then add the fluxcd repo:

curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
helm repo add fluxcd https://charts.fluxcd.io
helm repo update

Now verify that you have the Karrots and kubectl binaries installed:

karrots --help
kubectl version --client

Seed Your Environment Control Repo

Change to the parent directory where you want to create your environment control repo. We recommend this:

cd ~/git

First, create an empty git repo at your git hosting provider. We suggest you pick a name that is meaningful for your team; if you are the only data analytics team in your organization, a name like data-analytics makes sense. Then execute the seed command, passing --directory with the name of your git repo:

karrots seed --directory=data-analytics

This will create a new directory ~/git/data-analytics and populate it using the Karrots seed repo (https://github.com/zero-diff/karrots-seed). It will prompt you to specify your git repo URL, such as git@github.com:<your-team>/data-analytics.git. You will need to be able to push to this repo using an SSH key. (If you aren't sure how to access your git repo over SSH, you can find examples like this one for GitHub: connecting-to-github-with-ssh.)

Note

If the karrots seed command fails, make sure to remove the partially created directory and delete and re-create the git repo before fixing the problem and running the command again.

Once complete, the directory will have the following structure:

base-services/
  charts/
  helm-releases/
  resources/
clusters/
  karrots-helloworld/
    charts/
    helm-releases/
    resources/
    karrots.yaml
  karrots-data-analytics-example/
    charts/
    helm-releases/
    resources/
    karrots.yaml
docs/
README.md

Once seeded, your environment control repo will contain a directory that drives cluster creation and automation for this example:

clusters/karrots-data-analytics-example

This directory contains the elements that Karrots uses to create the karrots-data-analytics-example cluster. Change into this directory before you continue to set up the cluster.

cd ~/git/data-analytics/clusters/karrots-data-analytics-example

Note

It's important not to change anything in the base-services directory. We have organized the base services so that you don't need to edit them directly; any custom configuration these services need comes from secrets and configMaps that you manage in your cluster control directory. (In the future, Karrots will update the base-services directory to keep it current with the latest version.)

Karrots Data Analytics Example

The Karrots data analytics example has two key parts: an example cluster control directory and a HelmRelease for an example data analytics Python application. Before we build a cluster from the example setup, it will be helpful to understand these two parts.

Example Cluster Control Directory

Your environment control repo has two top-level directories: base-services and clusters. The clusters directory contains all of the clusters your environment control repo is capable of creating and running. When Karrots seeds your environment control repo, it adds a folder clusters/karrots-data-analytics-example that contains everything needed to create the example cluster. (In a later section we will walk you through the process of creating this cluster.)

Example HelmRelease / Python Application

The Karrots team manages a GitHub repo, https://github.com/zero-diff/karrots-example-python, that contains a sample Python application that is both a Flask server and a background task processor. In a later section we will walk you through how to use this repo to build your own production applications that you can run in Karrots clusters. Once set up, any changes you commit and push to this repo will automatically build and deploy into your cluster.

Prepare to Create Your First Cluster

Hosting Provider Setup

Before you can use Karrots to create your cluster, you must first give it access to your hosting provider account. You should follow the instructions specific to your hosting provider and then return here:

AWS/EKS hosting provider setup

Azure hosting provider setup

GKE hosting provider setup

Create a Git Deploy Key (Optional)

Because Karrots is gitops-based, we need to supply a pair of deploy keys that gives Flux/CD access to the environment control repo. The private key lives in the cluster as a Kubernetes secret called flux-ssh, and the public key lives in the remote git repo. If your karrots.yaml file has the configuration gitDeploKey.process=manual, then you need to add your git SSH deploy keys by hand. (If the value is gitDeploKey.process=github, then karrots create-cluster will prompt you for a GitHub Personal Access Token so that Karrots can generate and add these deploy keys automatically.)

Generate the Keys

The first step is to generate your deploy key using the following command:

ssh-keygen -t rsa -b 4096 -f karrots-deploy-key

This will create two files: karrots-deploy-key, a private key in OpenSSH format, and karrots-deploy-key.pub, the matching public key. Karrots needs a PEM key to build the flux-ssh secret, so execute the following command to convert the private key in place:

ssh-keygen -p -N "" -m pem -f karrots-deploy-key

Because karrots will prompt you for the private key in Base64 format (to deal with newline characters), it's best to store it in your password manager in that format. To get the private key in Base64, use the following command:

cat karrots-deploy-key | base64

To get the public key, use the following command:

cat karrots-deploy-key.pub

You now need to add this public key to the git repo as a deploy key so that karrots (Flux/CD) can sync changes between the cluster and git.
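
If your repo is on GitHub and you use the GitHub CLI, you can add the deploy key from the command line; the repo name and title below are placeholders, and the web UI works just as well. Add --allow-write if your Flux setup needs to push to the repo:

gh repo deploy-key add karrots-deploy-key.pub --repo <your-team>/data-analytics --title "karrots-flux"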

Configure Your Data Analytics Example Cluster

For your first data analytics cluster, we recommend including the Jenkins and Sealed-Secrets base services.

Configure Karrots.yaml File

Each cluster control directory contains a karrots.yaml file that Karrots uses to create and manage the cluster. You can visit this page for more information about each element of this file: karrots-yaml.

Configure Ambassador

We don't need to configure Ambassador for this example.

Configure RBAC-Manager

We don't need to configure RBAC-Manager for this example.

Configure Sealed Secrets

Create a Sealed-Secrets Certificate

Because Karrots automates all clusters using gitops, it means that everything needed to run a cluster has to exist in git. This includes secrets needed to run the cluster. It would be unsafe to commit these secrets to git in cleartext, so we use Bitnami's Sealed-Secrets to encrypt (seal) these secrets using a certificate stored in the cluster. We need to supply this certificate during the Karrots create-cluster process.

Install kubeseal

Sealed-Secrets comes with a binary that helps you seal your secrets: kubeseal. To install this command execute the following:

brew install kubeseal

Generate an RSA Key Pair (Certificates)

Execute the following command to make two files rsa.key and rsa.crt:

openssl req -x509 -nodes -newkey rsa:4096 -keyout "rsa.key" -out "rsa.crt" -subj "/CN=sealed-secret/O=sealed-secret"

Once you have these keys, you should store them securely in a shared secrets manager like 1Password or LastPass so that members of your team can seal secrets in the future. Later, when you run karrots create-cluster, it will ask you for the rsa.key contents so that it can install the key into the cluster to decrypt secrets encrypted with the rsa.crt file. It is best to store them in Base64 format, which you can get using the following commands:

cat rsa.crt | base64
cat rsa.key | base64

Use the RSA Key to Seal a Secret

Using kubeseal, your rsa.crt certificate and the --cert flag, you can seal a secret so that it is safe to commit to git. You don't need to do it now, but later you will issue a command similar to:

kubeseal --cert "rsa.crt" --format=yaml --scope cluster-wide < mysecret.yaml > mysecret.sealed.yaml
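
For reference, the mysecret.yaml input is an ordinary, unsealed Kubernetes Secret manifest; the name and value below are hypothetical:

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
stringData:
  api-token: "not-a-real-token"

The output, mysecret.sealed.yaml, contains only encrypted data and is safe to commit.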

Configure Sumologic

Get your Sumologic account access ID and access key, then fill in the missing values in the helm release at base-services/helm-releases/sumologic/sumologic.yaml.
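
The exact keys depend on the Sumologic chart version in the seed repo, but the values you fill in typically look something like this sketch (the IDs are placeholders):

spec:
  values:
    sumologic:
      accessId: "<YOUR_ACCESS_ID>"
      accessKey: "<YOUR_ACCESS_KEY>"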

Configure Jenkins

Jenkins needs two important secrets in order to do its work: an administrator login and a token to write images to your container repository (e.g. ECR, GCR). We will create Kubernetes secrets for each of these and then use the Sealed-Secrets kubeseal tool to seal them with your rsa.crt certificate from above.
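
If you later need to seal an additional secret for Jenkins beyond the CASC file, one convenient pattern (the secret name and namespace here are hypothetical) is to generate it with kubectl and pipe it straight into kubeseal:

kubectl create secret generic my-registry-token \
  --namespace jenkins \
  --from-literal=token="<YOUR_TOKEN>" \
  --dry-run=client -o yaml \
| kubeseal --cert "rsa.crt" --format=yaml --scope cluster-wide > my-registry-token.sealed.yaml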

Rename the jenkins-casc.unsealed.sample file in resources/secrets/jenkins to jenkins-casc.unsealed. Fill in the location and credential ID of your container repo (an example for Amazon ECR is already present). Generate and store a password in your password manager, and fill it in for <YOUR_ADMIN_PASSWORD>.

Check configmaps/jenkins-jobs.yaml to make sure that no unnecessary jobs, such as karrots-example-kotlin, are present.

Set Up Your Base Repo

Create a new private repo and copy in the files from https://github.com/zero-diff/karrots-example-python. Make sure that you copy the files rather than clone karrots-example-python. You will need some method of allowing Jenkins to clone your repo. If you're using GitHub, simply create a read-only deploy key for the repo (named something like karrots-git-key): https://docs.github.com/en/developers/overview/managing-deploy-keys#setup-2.

Make sure to save your public and private key in your password manager (in Base64 format). Fill in the private key and passphrase in the appropriate sections of jenkins-casc.unsealed. In your jenkins-jobs.yaml, replace zero-diff/karrots-example-python with your own repo. If you're using a different platform (e.g. GitLab, Bitbucket), you'll need to do this differently, as the job configuration uses the GitHub plugin by default.

Seal the Jenkins CASC

After setting up Jenkins, run the following command to seal the CASC using sealed secrets:

kubeseal --cert "rsa.crt" --format=yaml --scope cluster-wide < resources/secrets/jenkins/jenkins-casc.unsealed > resources/secrets/jenkins/jenkins-casc.yaml

Now commit and push all of your changes.

Configure Jupyterhub

Jupyterhub has three setup tasks: authentication, git-sync and notebook pod PIP requirements.

Authentication

There are a couple of ways to set up Jupyterhub authentication with an existing corporate identity, but the only one we've tested is LDAP. Unfortunately, the out-of-the-box Jupyterhub LDAP does not include TLS stream encryption, so we include some Python in extraConfig: that implements secure LDAP. It's a little bit messy, but it works. The file jupyterhub-config.unsealed.sample contains a baseline that you can edit. Near the bottom of that file you can enter data to configure this LDAP to connect to your corporate identity server that supports LDAP. Configuring LDAP is a bit tricky and depends highly on the server setup; the sample shows how to validate against Google Workspace LDAP.
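
For orientation, in a Zero to JupyterHub-style values file this kind of configuration sits under hub.extraConfig. The snippet below is illustrative only, not the actual contents of the sample file, which additionally wraps the LDAP connection in TLS:

hub:
  extraConfig:
    ldap-auth: |
      # Illustrative settings; the sample file replaces these with a secure-LDAP variant.
      c.LDAPAuthenticator.server_address = "ldap.google.com"
      c.LDAPAuthenticator.server_port = 636
      c.LDAPAuthenticator.bind_dn_template = ["uid={username},ou=Users,dc=example,dc=com"]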

Note

To use Google LDAP you will first need to sign up for Google Cloud Identity Premium. This will allow you to enable the LDAP "application" in your Google Workspace. (It will cost a few dollars per month for each user with LDAP enabled.) You can find more information here: Google Cloud Identity

Note

Jupyterhub has other means of getting identity which we haven't tried yet. If you need SSO/OAuth, it may be possible, but we will need to research how to do it.

Git-Sync

The default and preferred way to organize Jupyterhub notebooks is via the /shared folder available to all notebooks. (We manage this folder via an NFS server that makes the volume accessible in any region or availability zone.)

The /shared folder has two sub-directories:

  • collab: A place, not backed up, where humans share notebooks willy-nilly.
  • model: A place where the data science team publishes proven, tested notebooks via git.

The model directory is the one we're setting up here; it syncs with a git repo you provide. You configure this git synchronization in the config map jupyterhub-sync.yaml. The only value you probably need to change in this file is repo, the git/SSH URL for that repo. The git sync configuration has the following defaults, which you can override with an entry in jupyterhub-sync.yaml:

gitSyncEnv:
  sshKeyFile: "/etc/git-secret/ssh"
  ssh: "true"
  gitKnownHosts: "false"
  rev:
  branch: "main"
  repo: "git@github.com:<user>/<repo_name>.git"
  depth:
  root: "/home/git-sync"
  dest: "main"
  addUser: "true"
  wait: "60"
  maxSyncFailures: "0"
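
For example, a minimal override in jupyterhub-sync.yaml pointing at a hypothetical team repo would only set the values that differ from the defaults:

gitSyncEnv:
  repo: "git@github.com:your-team/notebook-models.git"
  branch: "main"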

Next you need to add a deploy key to the git repo that grants read-only access. Add the private part of the deploy key to the file jupyterhub-sync-secret.unsealed in the ssh field, in Base64 format. Then seal that secret using Sealed-Secrets:

kubeseal --cert "rsa.crt" --format=yaml --scope cluster-wide < resources/jupyterhub/jupyterhub-sync-secret.unsealed > resources/jupyterhub/jupyterhub-sync-secret.yaml

PIP Requirements

When a Jupyterhub notebook pod spins up, it runs a clean, bare notebook/Python image provided by the project. The Python install will not include any PIP modules. Keep in mind that when a user idles for more than 60 minutes, Jupyterhub terminates their pod, and the next time the pod spins up it will not include any modules the user installed by hand using the %%pip install <module> command. To accommodate PIP modules that all users need, update the pip-requirements.yaml file. If in the future your users need new modules to be available, add them to this file.
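
Check the seeded pip-requirements.yaml for the exact schema; conceptually it carries a shared requirements list, and a plausible shape (hypothetical key name and placeholder pins, not recommendations) is:

pipRequirements: |
  pandas==1.3.*
  scikit-learn==1.0.*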

Note

Supply chain attacks are real, and it is up to you to verify the PIP modules you install into the pods. Jupyterhub users also have broad latitude to add their own modules to the currently running pod, so it is important to educate them about the risks of supply chain attacks. Jupyterhub users likewise have broad latitude to install plugins from the user menu; plugins are also subject to supply chain attacks, and you will want to educate your users about how to validate plugins before installing them.

Now commit and push all of your changes.

Create the Cluster

Before you get started, make sure to add a link to your container repository (e.g. ECR, GCR) in helm-releases/karrots-example-python/karrots-example-python.yaml, and make sure you commit and push all the configuration changes you've made to the cluster directory. We can now create the cluster. Enter this command:

karrots create-cluster

Configure NS Records Manually

If you use a DNS service from a provider other than your hosting provider (e.g. Route 53 for EKS), you'll need to configure the NS records for your domain manually. (If the providers are the same, you don't need to do this.) Make sure that you are using the ACME staging URL until you are sure the records are set up.

Find the location of the NS records for your specific provider and copy them by hand to the records at your domain provider. After the cluster becomes available at your domain name, you can change the URL in resources/configs/ambassador.tls.yaml to a production URL, and then wait for it to update or delete the host resource.
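
For example, if the cluster's zone lives in Route 53, you can look up the NS records to copy with the AWS CLI (the domain is a placeholder):

aws route53 list-hosted-zones-by-name --dns-name "example.com" --max-items 1
aws route53 get-hosted-zone --id <ZONE_ID_FROM_ABOVE>

The NameServers entries in the DelegationSet of the second command's output are the values to copy to your domain provider.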

Build the Example Python Repo

If everything is set up correctly, you should be able to access Jenkins at <cluster-name>.<your-domain>/jenkins. Log in with the username admin and the admin password you set earlier.

Set Up a Webhook to Build Automatically

Create a new webhook in the GitHub repository containing the Python app. Set the payload URL to https://<cluster-name>.<your-domain>/jenkins/generic-webhook-trigger/invoke?token=123456, the content type to application/json, and the secret to 12345, and set the webhook to trigger only on the push event.

Whenever you push a tag to the repo, Jenkins should automatically build the container, tag it with the git tag, and push it to your container repository.
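
To trigger a build, push a tag (the tag name below is just an example):

git tag v0.1.0
git push origin v0.1.0

If a push doesn't trigger a build, you can exercise the trigger directly with curl, assuming the token above and a minimal push-style payload:

curl -X POST "https://<cluster-name>.<your-domain>/jenkins/generic-webhook-trigger/invoke?token=123456" \
  -H "Content-Type: application/json" \
  -d '{"ref": "refs/tags/v0.1.0"}'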