

APPENDIX C. TROUBLESHOOTING COMMON ISSUES

Fix or Update a Docker Configuration Setting

Note

The Brainspace 7.0 Install UI stores the installation choices you make in a file in the user's home directory: ~/.brainspace-configs/swarm-configuration.env. Brainspace 7 upgrades rely on this file, so it should NOT be deleted and should ideally be backed up periodically. When an uninstall is performed, the file is renamed with a .bak extension in case you need to recover it.

Because installation choices are stored in a file on disk, you can change a previous choice, or modify the Docker Swarm compose configuration, by editing a file and then updating the stack. To do this, edit any one of the following files (using vi, for example):

  • ~/.brainspace-configs/custom.yml - contains overrides to the docker-compose.swarm.yml configuration that will survive upgrades.

  • ~/.brainspace-configs/swarm-configuration.env - contains the choices made while running the Install UI.

  • swarm.env inside the extracted tar directory - contains default values used when deploying Brainspace via Docker Swarm. Here you can set things like the NFS protocol version prior to running the Brainspace 7 installation (see the example after this list).
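
For example, to pin the NFS protocol version before a fresh install, you could set a value like the following in swarm.env (nfsVers is the setting referenced later in this appendix; use the version your NFS server actually provides):

nfsVers=4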

After editing the file, run the Install UI again and select the “Upgrade” option; the Docker stack will be updated with the new setting. This is a good way to work around issues with the installation process, such as an incorrect NFS protocol version.

Note

When changing the NFS volume configuration, it is not always enough to update the config file and redeploy the stack; you may also need to delete the volumes that contain the incorrect configuration from each Docker host, using the docker volume rm command, before running the update.
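
For example, assuming the default volume names (confirm them first with docker volume ls; brainspace_data and brainspace_localdata are shown here for illustration), you would run the following on each Docker host before redeploying:

docker volume rm brainspace_data brainspace_localdata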

Common Issues

Machines with multiple interfaces/IPs

Error:

Error response from daemon: could not choose an IP address to advertise since this system has multiple addresses on different interfaces (10.0.2.15 on eth0 and 192.168.33.10 on eth1) - specify one with --advertise-addr

To resolve this, set the environment variable SWARM_IP to the IP address of the network interface you want the other hosts to use when joining the Swarm. Add this variable to /etc/profile so that it persists across sessions and restarts.
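
For example, add a line like the following to /etc/profile (the address shown is the eth1 address from the error above; substitute the address of the interface the other hosts should use):

export SWARM_IP=192.168.33.10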

Alternatively, you can initialize the Swarm manually prior to running the install and use the “join” option in the Install UI:

docker swarm init --advertise-addr <IP address to use>

Connectivity Issues Between Containers

If you encounter 502 errors when using the app, see errors like “No route to host” in the brains-ui service logs, or see connection timeouts between containers, try running docker system prune on each node and then restarting the brains-ui service.

If that doesn’t work, restart the Docker service itself using “systemctl restart docker”.
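
A minimal sketch of that sequence, run on each node (the service name brainspace_brains-ui assumes the stack was deployed as “brainspace”; confirm it with docker service ls):

# prune unused containers, networks, and dangling images
docker system prune

# force the brains-ui tasks to be recreated
docker service update --force brainspace_brains-ui

# if the problem persists, restart the Docker daemon itself
systemctl restart docker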

Problems Downloading Images from ECR

If, when checking the status of the stack, you see errors like “No such image: 053522176595.dkr.ecr.us-east-1.amazonaws.com” or if services spend an excessively long time in the “Preparing” state, this can indicate problems authenticating with or connecting to the Amazon ECR registry. A good debugging strategy in this situation is to try downloading one of the images manually using the docker pull command.

The batch-tools image can be quite large, and on slow networks, this can cause problems when the Brainspace stack is starting up. To mitigate this, you can download the batch-tools image manually prior to running the install, so it will be in place during the stack deployment.

Note

This needs to be done on each Brainspace host, except the DB host if a separate DB host is used.

To do this, first authenticate with the Docker Registry you are using. If using ECR use the script:

./aws-ecr-authenticate.sh

…then pull the batch-tools image by itself:

docker pull <batch-tools image from docker-compose.swarm.yml>

For example:

docker pull 053522176595.dkr.ecr.us-east-1.amazonaws.com/prod/brainspace/batch-tools:line.r12.1-githash-1e14f0f

Dataset Startup Failed

Error:

While trying to start a dataset in Brainspace, the following error is encountered:

java.lang.NullPointerException: Cannot invoke "com.purediscovery.batchtools.util.OutputConfig.contains(com.purediscovery.batchtools.util.Output)" because "this.outputConfig" is null

Resolution:

This error means that Brainspace couldn’t find a file it needs to start up the dataset, specifically:

/localdata/brainspace/datasets/<dataset UID>/output.config

If this occurs after a migration from a Brainspace 6 environment, check to ensure that all dataset data from the ‘localdata’ share has been rsync’ed to the correct location in the new ‘localdata’ share.
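
A sketch of how that check might look, assuming the old Brainspace 6 share is mounted at /mnt/old-localdata (both mount points are assumptions; substitute your actual paths):

# dry run first to see what is missing, then run again without -n to copy
rsync -avn /mnt/old-localdata/brainspace/datasets/ /localdata/brainspace/datasets/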

If this is not the result of an incomplete or incorrect Brainspace 6 migration, it’s possible that the process of deploying the dataset was interrupted. Inspect the files in the directory above and note any that appear to be missing. To work around this, you can use the “Force Deploy Dataset” option under the dataset’s build options.

NFS Related Issues

In some cases, the installation will complete successfully and the “brainspace” Docker stack will have deployed, but one or more services may fail to start due to an incorrect NFS configuration. In this case, the NFS errors can usually be seen by issuing the following command:

docker stack ps brainspace --no-trunc

Below is a list of common errors and their resolutions.

Error:

failed to mount local volume ... permission denied

Resolution:

Ensure that all 3 hosts are defined in the /etc/exports file. Even if data and localdata are shared from the app host, the app host’s IP address must be listed in the /etc/exports file so that the brains-app container running on the application host has permission to connect to the NFS share.
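
A hypothetical /etc/exports on the NFS server with all three host IPs listed (the paths, addresses, and export options shown are illustrative only):

/brainspace_shares_data       10.224.66.223(rw,sync) 10.224.66.224(rw,sync) 10.224.66.225(rw,sync)
/brainspace_shares_localdata  10.224.66.223(rw,sync) 10.224.66.224(rw,sync) 10.224.66.225(rw,sync)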

Error:

error mounting /data or /localdata ... invalid protocol

Resolution:

Check to ensure that the NFS server protocol is version 4.0. If it is version 3, 4.1, or anything other than 4.0, you will need to configure the same protocol version in the swarm.env file and then completely uninstall and reinstall Brainspace, being sure to remove the data and localdata volumes on each Brainspace host.

If /data and /localdata are mounted on the Brainspace hosts, you can check the NFS information using:

cat /proc/mounts

For example:

:/data /var/lib/docker/volumes/brainspace_data/_data nfs4 rw,relatime,vers=4.0

Note the vers attribute should match the following attribute in swarm.env:

nfsVers=4

Error:

error mounting /data ... no such file

Resolution:

Check the location of the data and localdata shares on the NFS server. The path that must be configured is where the data and localdata shares reside on the NFS server, not where they are mounted on the local host. The path to the data and localdata shares on the NFS server can be set during the install process by entering the correct values when prompted: “Enter the NFS source directory for ‘localdata' or 'data’ share”.

If /data and /localdata are mounted on the Brainspace hosts, you can check the NFS information using:

cat /proc/mounts

Example:

ip-10-224-66-52.ec2.internal:/brainspace_shares_data /data nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.224.66.223,local_lock=none,addr=10.224.66.52 0 0

In the example above, the source directory that needs to be configured for the data share during installation is /brainspace_shares_data, since this is the location of the data share on the NFS server.

Last Resort

Finally, if you’ve tried all of the above and still can’t resolve the issue with configuring the NFS share inside Docker, mount the /data and /localdata shares on each Brainspace host (app, ana, oda), for example via /etc/fstab as sketched below, and then use the “bind mount” option when performing the install.
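
A sketch of what those host-level mounts might look like in /etc/fstab, reusing the NFS server and export path from the earlier /proc/mounts example (the server, paths, and options must be adjusted to your environment):

ip-10-224-66-52.ec2.internal:/brainspace_shares_data       /data       nfs4  rw,hard,vers=4.1  0 0
ip-10-224-66-52.ec2.internal:/brainspace_shares_localdata  /localdata  nfs4  rw,hard,vers=4.1  0 0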

Enable Docker Debug Logging

Enable Docker daemon debugging:

echo '
{
  "debug": true
}' > /etc/docker/daemon.json

Restart Docker:

systemctl restart docker

Daemon logs can be found under /var/log, for example in /var/log/syslog.

Example:

app dockerd[11191]: time="2024-03-01T23:47:05.294006696Z" level=debug msg="app(ebfb10ae5359): Bulk sync to node 79fc270f87dd took 9.239315ms"

Mar  1 23:47:05 app dockerd[11191]: time="2024-03-01T23:47:05.403770163Z" level=debug msg="pull in progress" current=129209785 image="053522176595.dkr.ecr.us-east-1.amazonaws.com/prod/brainspace/batch-tools:line.r2024.3-githash-4bf7b9b@sha256:4f57d9171d29dae852329ba51a4a47d3a16921666655e033b1b63f8cf82b923a" status=Downloading total=13180108894

Maintenance

  • Run docker system prune occasionally.

  • Clean up and archive log files in log volumes.

  • Back up /var/lib/docker occasionally and before minor version upgrades.

  • Restart the Docker daemon if errors persist.

If errors remain, try removing the stack and then reinstalling it using the Upgrade / Continue option. As long as you do not delete the Docker volumes, your data will not be lost.
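
For example (the stack name “brainspace” is the one used throughout this appendix; the Docker volumes are left in place):

docker stack rm brainspace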