3.1 Database Connection Parameters
Create a directory called .kettle in your home directory. Copy the file transmart-ETL/Kettle/postgres/kettle.properties to this directory. An example of this file is shown in Appendix A. At the top of this file, there should be three variables. These variables contain the details for connecting to the database. Change them to match your situation. The variables are:
- COMMON_DB_SERVER: URL or IP-address of the database server
- COMMON_DB_PORT: Port of the database server. The default port number for Postgres is 5432
- COMMON_DB_NAME: the database name. This is most likely transmart.
3.2 SSH Tunnel
By default Postgres only allows incoming connections from localhost. This means that only applications running on the same machine as Postgres can access the database. If you are performing ETL on the same machine as where the Postgres databases lives, you should have no issue connecting to the database. If you are on a different machine however, you will probably need to create an SSH tunnel to the Postgres database.
An SSH tunnel is a way to make a remote machine set up a connection to a server on your behalf, and act as a middle man between you and the server. We will use this to make the machine where Postgres is hosted connect to itself (Postgres will allow this since the connection originates from localhost) and forward all traffic to us. This is visualized conceptually in Figure 1.
Figure 1: Conceptual representation of an SSH tunnel.
For this, we will use Local forwarding. This means that we will specify all components of the tunnel manually. The Remote host is the host we will connect to, and subsequently the host that will set up another connection on our behalf. The Source port identifies the entrance to the tunnel, i.e. the port number on your own machine. Any free local port could be used, 9001 being a safe choice. The Destination is where the remote host should connect to for us. This must point to the Postgres server, which is by default located at 127.0.0.1, port 5432. Once this is set up, any application on your machine can make a connection to local port 9001, and it will effectively be connected to the Postgres database on a remote host. Now that we know the components of an SSH tunnel, the following sections will describe how to set up this tunnel on Linux and Windows hosts.
3.2.1 Linux hosts
On a Linux system, you can set up a tunnel with the following command:
ssh -L 9001:127.0.0.1:5432 user@remote-host
where you should substitute remote-host with the URL of the host where Postgres lives, and user with your username on that host. This command will start a server listening on port 9001 on your machine.
3.2.2 Windows Hosts
This section assumes you have installed Putty as described above. Putty uses a different format for private keys than OpenSSH. The Putty private key files have a .ppk extension. You can convert between the two formats using PuttyGen. Open Putty and select the Tunnels subcategory from the Connection/SSH category. In the Source port input box, type 9001. In the Destination box, type 127.0.0.1:5432. Make sure the radio button Local is selected. Now click the Add button to add this tunnel to your connection. You can add more than one tunnel over the same SSH connection, if you need to. Your window should look like the example in Figure 2.
If you need to use a private key le for authentication, you can load your private key in the Auth category. Click the Browse button to select your private key le. As discussed above, this should be a .ppk-file. This is shown in Figure 3.
Now return to the Session category to ll out the host name. Give this connection a name in the box right below Saved sessions and click Save. Now you can re-open this session anytime by double clicking its name, without having to re-enter all the parameters. This is shown in Figure 4.
Figure 2: Filling out the tunnel configuration parameters in Putty.
Figure 3: Selecting a private key file in Putty.
Finally click Open to open the connection. When the connection is established, Putty is listening for incoming connections on the local port 9001.
Figure 4: Entering the remote host and saving the session conguration in Putty.
3.2.3 Connecting Through The Tunnel
Now that you have a tunnel set up on port 9001, any connections on that port will be forwarded to postgres-host. On the other side SSH will connect to 127.0.0.1:5432, and allow traffic to
flow from your machine to this connection. To Postgres it effectively looks as if you are connecting from localhost, so it will accept your connection. Security is maintained since all traffic between you and postgres-host is encrypted by SSH. In your kettle.properties le you should set COMMON_DB_SERVER to 127.0.0.1, and COMMON_DB_PORT to 9001, since this is now your entrance point to the Postgres server.
On both Windows and Linux systems, you can test your tunnel connection with the following command:
psql -h 127.0.0.1 -p 9001
On a Windows host you might need to write the full path to psql.exe, but the same arguments can be passed. If you get a password prompt, you have successfully connected and you can close the connection by hitting Ctrl-C. If your tunnel is not set up properly, you will get a connection refused message. In that case double-check your tunnel set-up.
You can choose any port number instead of 9001 for the local side of the tunnel. Although you should pick a number higher than 1023, since the first 1023 port numbers are reserved for system processes.
- Original text https://wiki.transmartfoundation.org/display/transmartwiki/3+Set-Up