installation of new cluster doesn't complete #34

Description

@boegel

I've made two attempts this afternoon to create a new CitC on AWS using the one-click installer, but for some reason the installation "hangs".

The management node is created, and I can SSH into it, but the finish command keeps producing this (with or without a limits.yaml file):

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

The last part in /root/ansible-pull.log is this:

TASK [slurm : open all ports] **************************************************
Friday 19 February 2021  14:19:11 +0000 (0:00:00.045)       0:06:17.021 *******

That was over 1 hour ago, no progress since then...
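
To put a number on "no progress since then", here is a minimal Python sketch that finds the last TASK header in the log text and computes how long ago it started. It assumes the log keeps the two-line layout shown above (a `TASK [...]` line followed by a timestamp line) and that day and month names are in English. The `last_task` helper, the regexes, and the hard-coded "now" are illustrative only and are not part of CitC or Ansible.

```python
import re
from datetime import datetime, timezone

# Patterns derived from the two log lines quoted above (assumed layout).
TASK_RE = re.compile(r"^TASK \[(?P<name>.+?)\] \*+\s*$")
STAMP_RE = re.compile(
    r"^(?P<date>\w+ \d{1,2} \w+ \d{4})\s+(?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>[+-]\d{4})"
)


def last_task(log_text):
    """Return (task_name, start_time) for the last TASK header that is
    followed by a timestamp line, or None if there is no such pair."""
    pending_name = None
    result = None
    for line in log_text.splitlines():
        task = TASK_RE.match(line)
        if task:
            pending_name = task.group("name")
            continue
        stamp = STAMP_RE.match(line)
        if stamp and pending_name is not None:
            started = datetime.strptime(
                f"{stamp['date']} {stamp['time']} {stamp['tz']}",
                "%A %d %B %Y %H:%M:%S %z",  # locale-dependent: assumes English names
            )
            result = (pending_name, started)
            pending_name = None
    return result


# Sample taken from the log excerpt above.
sample = """TASK [slurm : open all ports] **************************************************
Friday 19 February 2021  14:19:11 +0000 (0:00:00.045)       0:06:17.021 *******
"""

name, started = last_task(sample)
# Hypothetical "current time", fixed here so the result is reproducible.
now = datetime(2021, 2, 19, 15, 30, tzinfo=timezone.utc)
print(f"last task: {name!r}, stalled for {now - started}")
```

On the management node the same function could be applied to the contents of /root/ansible-pull.log (read as root), with `datetime.now(timezone.utc)` in place of the fixed timestamp.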

/var/log/slurm exists, but it is entirely empty.

Running processes:

root        1515  0.0  1.0 372592 40816 ?        Ss   14:12   0:00 /usr/libexec/platform-python /usr/bin/cloud-init modules --mode=final
root        1997  0.0  0.0 217052   732 ?        S    14:12   0:00  \_ tee -a /var/log/cloud-init-output.log
root        2037  0.0  0.0 235744  3412 ?        S    14:12   0:00  \_ /bin/bash /var/lib/cloud/instance/scripts/part-001
root        4767  0.0  0.9 406240 34832 ?        S    14:12   0:00      \_ /usr/bin/python3 -u /usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
root        9929  7.3  1.6 590508 61548 ?        Sl   14:12   5:24          \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1.com
root       27615  0.0  1.4 583004 54488 ?        S    14:19   0:00              \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1
root       27616  0.0  0.0 235744  3372 ?        S    14:19   0:00                  \_ /bin/sh -c /usr/libexec/platform-python && sleep 0
root       27617  0.0  0.8 415588 30484 ?        S    14:19   0:00                      \_ /usr/libexec/platform-python
dirsrv     17078  0.1  2.1 662068 81740 ?        Ssl  14:14   0:06 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-mgmt -i /run/dirsrv/slapd-mgmt.pid
citc       17138  0.0  0.2  93904  9968 ?        Ss   14:15   0:00 /usr/lib/systemd/systemd --user
citc       17142  0.0  0.1 257440  5068 ?        S    14:15   0:00  \_ (sd-pam)
mysql      21671  0.0  2.4 1776020 93568 ?       Ssl  14:15   0:01 /usr/libexec/mysqld --basedir=/usr
munge      22577  0.0  0.1 125220  4048 ?        Sl   14:17   0:00 /usr/sbin/munged
root       24674  0.0  1.0 509096 41380 ?        Ssl  14:17   0:00 /usr/libexec/platform-python -s /usr/sbin/firewalld --nofork --nopid
root       27703  0.0  0.0 232532  2036 ?        Ss   15:01   0:00 /usr/sbin/anacron -s

Any suggestions on how to figure out what went wrong?

Labels: AWS, bug