Set Up a PyFlink Development Environment
How to set up PyFlink for local development
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Flink is written in Java and Scala; starting with version 1.9.0, it added support for Python, which means we can use Python to develop Flink applications. Python is an open-source, easy-to-learn programming language, so for developers who are already comfortable with it, this could be a natural choice.
Here, I’m going to share how to set up PyFlink for your local development environment, with Visual Studio Code and Remote Development.
Prerequisites
In this setup, I’ll use Docker and Visual Studio Code to develop a PyFlink application. You’ll need:
- Docker
- Visual Studio Code, plus a couple of extensions for Remote Development
- Flink Docker Image
- Apache Flink from PyPI (download the .tar.gz file)
- Apache Flink Libraries from PyPI (download the .tar.gz file)
Build It
First, download the Apache Flink Docker image from the link above; I’ll use version 1.14.0.
docker pull flink:1.14.0
By default, the current Flink Docker image (Flink version 1.14.0) doesn’t have Python and PyFlink installed, so you need to build a new image based on the existing Flink image. Prepare a new directory for the project and create a Dockerfile.
Put the .tar.gz files downloaded from items 4 and 5 of the prerequisites in the same directory as the Dockerfile.
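As a minimal sketch, a Dockerfile along the lines of the official PyFlink Docker instructions could look like this (the wildcard COPY assumes the two .tar.gz files sit next to the Dockerfile; adjust the patterns if your filenames differ):

FROM flink:1.14.0

# Install Python 3 and pip, which the base Flink image lacks
RUN apt-get update -y && \
    apt-get install -y python3 python3-pip python3-dev && \
    rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3 /usr/bin/python

# Install apache-flink-libraries first, then apache-flink,
# from the .tar.gz files placed next to this Dockerfile
COPY apache-flink*.tar.gz /
RUN pip3 install /apache-flink-libraries*.tar.gz && \
    pip3 install /apache-flink*.tar.gz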
Then build a new image with the command below:
docker build --tag pyflink:1.14.0 .
The command above builds a new Docker image with the name pyflink and the tag 1.14.0. If the command succeeds, there should be an image with repository pyflink and tag 1.14.0. Check it with the command below:
docker images
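As an optional sanity check (not part of the original steps), you can verify that Python and PyFlink were installed in the new image by running a one-off container:

docker run --rm pyflink:1.14.0 python -c "import pyflink; print('PyFlink OK')"

If the import fails here, the image build didn’t pick up the .tar.gz files correctly.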
Compose It
Using Docker Compose is one of several ways to run Flink containers; alternatively, you can deploy the containers on a Kubernetes cluster or with Docker Swarm.
The Flink image can be deployed as an Application Cluster or a Session Cluster. A Flink Application Cluster is basically a cluster dedicated to running a single job, while a Session Cluster can be used to run multiple jobs. In this example, I’ll use a Session Cluster.
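Here is a minimal sketch of such a compose file, assuming one Job Manager and one Task Manager built from the pyflink image; the 8085:8081 port mapping and the /home/pyflink volume mount line up with the dashboard URL and the development folder used later in this article:

version: "2.2"
services:
  jobmanager:
    image: pyflink:1.14.0
    command: jobmanager
    ports:
      - "8085:8081"       # Flink Dashboard, exposed on localhost:8085
    volumes:
      - .:/home/pyflink   # sync the project directory into the container
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
  taskmanager:
    image: pyflink:1.14.0
    command: taskmanager
    depends_on:
      - jobmanager
    volumes:
      - .:/home/pyflink
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2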
Save the file as docker-compose.yml, and then launch the cluster with the command:
docker compose up -d
You can check whether the services are up with the command:
docker ps
If everything works as expected, you should be able to open the Apache Flink Dashboard at localhost:8085. Here you can see information about running jobs, completed jobs, and more.
Remote Development
Now that the cluster is running, open Visual Studio Code and install the Remote Development extensions from Microsoft (restart if needed):
Click the gear menu at the bottom left of Visual Studio Code and then click Command Palette, or just press Ctrl + Shift + P. Look for Remote-Containers: Attach to Running Container; when you click it, a list of running containers should appear, in this case the two pyflink containers. Choose one container as your development container; here I choose the Job Manager container. Then click Open Folder in the Explorer menu and choose /home/pyflink, since this folder is synced with the project’s root directory on the host.
After that, you can start developing directly in the container; you can also initialize a Git repository from the host.
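To confirm the whole setup works end to end, a tiny PyFlink Table API script like the sketch below (a hypothetical example, not from the original setup) should run inside the container:

# minimal_check.py — a minimal PyFlink sketch to verify the environment
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a batch TableEnvironment that runs locally inside the container
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Build a tiny in-memory table and print it
table = t_env.from_elements([(1, "hello"), (2, "pyflink")], ["id", "word"])
table.execute().print()

Run it with python minimal_check.py inside the container; if a two-row table prints, PyFlink is working.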
Thank you for taking the time to read my article. Next, I’ll share how to create a streaming data pipeline using PyFlink.