Thursday, February 11, 2016

Recovering network access to EC2 instances

So you've screwed something up. You made a typo in your sshd_config file. You added a firewall rule, or a route, or some other thing, and lost your network access to your EC2 instance. And of course whatever you broke, you broke permanently - you wrote your firewall rules directly to /etc/sysconfig/iptables, you made your goofy change to /etc/sysconfig/network-scripts/whatever-interface; so rebooting won't make a damn bit of difference. You read the warnings, you know you shouldn't have. But you did anyway.

Oh, and you don't have any backups. Or you have backups from three months ago. Restoring from your crappy backups would mean hours or days of non-stop work and significant downtime. Or Amazon or whatever other company you're using for backups actually broke your backups, lost your backups, or never actually provided you with the backups you paid for.

Don't panic. You've got this. You remember that Amazon has some sort of Java-based something or other. It's got to be a virtual KVM. You log in to the web console and find out that the Java-based something is a completely worthless SSH client, and not a KVM at all.

You are going to be fired.

Unless you found this post. I will save your backside, sir and/or ma'am.

Well, I will save your backside provided your environment meets a few requirements. I will make them clear so that if you don't meet them you can go look somewhere else for a solution ASAP. Here they are:

    - This is for Linux. If you are using Windows you are fired. Just kidding! You can mount Windows volumes in Linux, but reconfiguring network settings this way is much more complicated, since those settings are often stored in the registry rather than in flat files. This walkthrough only covers the volume management side of things; if you're dealing with Windows, consider mounting the volume on a Linux VM and then using a tool like this one to modify the registry on the broken volume.
    - This only works for EBS volumes. There may be a way to do this with non-EBS instance store volumes, but I haven't had to worry about it, and if a way exists it will be much more complex than this.
    - I'm going to take for granted that you know how to start and stop an EC2 instance, and how to deploy an EC2 instance. I'm assuming this because you had to have done these things to make the instance you just broke. If you broke somebody else's instance and you don't know how to even restart the damn thing, well, first off - lol. And second, you're fired.
    - You need to either already have, or be able to provision, a second Linux EBS-backed EC2 instance in the same availability zone as your broken server.

Those should be the only requirements. It won't matter if your broken volume is magnetic or SSD. Here is what to do:

1. For this to work you need a second EBS-backed EC2 instance running Linux, other than the broken one, within the same region and availability zone (e.g. us-west-1a) as the broken server. It doesn't need to be the same "flavor" of Linux, but it makes things a lot easier if the two kernel versions are reasonably close. If you do not already have one deployed, create one now. Make a note of the instance-id of the second server (if you created your instance a while back, the instance-id will look like this: i-123a45fe; if you just created your instance, the id will be longer, 17 characters, like this: i-1234567890abcdef0).
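
If you'd rather work from a terminal, the AWS CLI can pull the same information - this is just a convenience sketch, and it assumes you have the CLI installed and configured:

    $ aws ec2 describe-instances \
        --query 'Reservations[].Instances[].[InstanceId,Placement.AvailabilityZone,State.Name]' \
        --output table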

2. From the AWS Management Console, select Instances and then highlight the broken instance. Make a note of the instance-id. Then STOP the instance.
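
The CLI equivalent, if you're going that route (the instance-id here is a placeholder for your broken instance's id):

    $ aws ec2 stop-instances --instance-ids i-1234567890abcdef0
    $ aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0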

3. Next, select Volumes. If you haven't already, give the volumes of both the broken instance and your second troubleshooting instance a descriptive Name so you can quickly tell them apart. Make a note of the names and volume-ids and which instances they are attached to.
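
On the CLI, this will list each volume-id along with the instance and device it is attached to:

    $ aws ec2 describe-volumes \
        --query 'Volumes[].[VolumeId,Attachments[0].InstanceId,Attachments[0].Device]' \
        --output table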



4. Highlight the volume of the BROKEN server, right-click, and select DETACH VOLUME.



5. Detaching the volume should only take a moment, but your browser won't show the change right away. Refresh your screen to make sure the volume is detached. Then, right-click the detached volume and select ATTACH VOLUME.


This will open a new window asking you which instance to attach the volume to and what device name to give the volume on the new server. Select your secondary, working server to attach the volume to. It should be alright to leave the default device name, which should be /dev/sdf. The only concern here is that you don't want to give the new volume a device name that is already assigned. If you only have one EBS volume attached to your server, it will automatically be assigned /dev/sda1. If you've customized volume management for your server, you know these settings; if you haven't, then this walkthrough will assume /dev/sdf as the device name for the broken volume on the secondary server.
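
Steps 4 and 5 can also be done from the AWS CLI; the volume-id and instance-id below are placeholders for your broken volume and your secondary server:

    $ aws ec2 detach-volume --volume-id vol-049df61146c4d7901
    $ aws ec2 wait volume-available --volume-ids vol-049df61146c4d7901
    $ aws ec2 attach-volume --volume-id vol-049df61146c4d7901 \
        --instance-id i-1234567890abcdef0 --device /dev/sdf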


6. SSH to your working secondary server and make a new folder under /. You will be mounting the broken disk to this directory:

    # cd /
    # mkdir broken/

7. Here's where things can get a bit complicated, and where a lot of the walkthroughs available on this subject get things wrong. In Step 5 we attached the broken disk to our secondary server under the device name /dev/sdf; but it won't show up as /dev/sdf on your secondary server.

You should have two EBS devices attached: /dev/sda1, which is the default volume, and /dev/sdf, which is the broken drive. /dev/sda1 will show up as /dev/xvda1 - the "s" is translated to "xv" to indicate that it is a virtual disk. /dev/sdf will show up as two additional devices: /dev/xvdf (the whole disk) and /dev/xvdf1 (its first partition). You will want to use /dev/xvdf1.
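
You can confirm what the kernel actually named the devices with lsblk; the output below is representative, and your sizes and names may differ:

    # lsblk
    NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    xvda    202:0    0   8G  0 disk
    └─xvda1 202:1    0   8G  0 part /
    xvdf    202:80   0   8G  0 disk
    └─xvdf1 202:81   0   8G  0 part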

Where you go from here depends on the sort of filesystems that are in use. In most instances, the filesystem in use will be XFS. You can check the filesystem by running this command:

    # mount -l |grep xvd
    /dev/xvda1 on / type xfs (rw,relatime,attr2,inode64,noquota)

The filesystem is shown directly after the "type". This is important because attempting to mount the broken volume directly will fail when it uses XFS, like this:

    # mount /dev/xvdf1 /broken/
    mount: /dev/xvdf1 is write-protected, mounting read-only
    mount: unknown filesystem type '(null)'

Even though the error appears to indicate the volume was mounted "read-only", nothing gets mounted - the /broken/ directory will be empty and `mount -l` will not display /dev/xvdf1.
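
Note that `mount -l` only shows filesystems that are already mounted, so the command above tells you about your secondary server's root volume, not the broken one. To check the broken volume directly without mounting it, `file -s` works on most distros; the output shown is representative:

    # file -s /dev/xvdf1
    /dev/xvdf1: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)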

The problem here is that the filesystem must be specified by using the "-t" flag (using -t auto will also fail). Here is the correct command:

    # mount -t xfs /dev/xvdf1 /broken/

If successful, the command will output nothing. You can confirm by checking for content in the /broken/ directory and by running this:

    # mount -l |grep xvdf1
    /dev/xvdf1 on /broken type xfs (rw,relatime,attr2,inode64,noquota)
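
One extra wrinkle worth knowing about: if your secondary server was launched from the same AMI as the broken one, the two XFS filesystems may share a UUID, and the mount will be refused with a duplicate-UUID complaint in dmesg. The nouuid option tells XFS to ignore the conflict:

    # mount -t xfs -o nouuid /dev/xvdf1 /broken/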

8. You can now navigate through the /broken/ directory as if it were / on the broken server. You can use /broken/var/log/ to identify errors, and rewrite configuration files like those under /broken/etc/sysconfig/network-scripts/. Be sure to remember to prepend /broken/ when navigating! It's easy to forget where you are and change something on your secondary working server, so don't do that, or else ...
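
As an example, if what you broke was sshd_config (see the intro), you can sanity-check your fix before reattaching anything. Keep in mind that sshd resolves host-key paths against the secondary server's own /etc/ssh, so treat this as a syntax check rather than a full validation:

    # sshd -t -f /broken/etc/ssh/sshd_config && echo "config OK"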


9. Once you have reversed whatever was broken, unmount the volume from the secondary server:

    # umount /broken

10. Detach the now-fixed volume from the secondary server.


11. Refresh your window and reattach the volume to the original server. Since this is the instance's root volume, attach it under its original device name, which will typically be /dev/sda1.


12. Restart the server and you should now be back in business.
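
For completeness, the CLI equivalents of steps 10 through 12, again with placeholder ids - note that the volume goes back on the original instance under its root device name:

    $ aws ec2 detach-volume --volume-id vol-049df61146c4d7901
    $ aws ec2 wait volume-available --volume-ids vol-049df61146c4d7901
    $ aws ec2 attach-volume --volume-id vol-049df61146c4d7901 \
        --instance-id i-0abcdef1234567890 --device /dev/sda1
    $ aws ec2 start-instances --instance-ids i-0abcdef1234567890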

There are so many reasons why this process is a huge pain in the ass compared to a virtual KVM utility. I recently had to perform this procedure to resolve a networking issue on a server where most of the services were still responding - HTTP and HTTPS were fine, but SSH was dead. With a virtual console I could have repaired the issue without any downtime. Using this procedure forced me to bring down the server for 5 minutes or so to perform the repairs. That sucks. And up-to-date image backups would not have made anything better; re-imaging the server might have shaved a minute or two off of the total downtime required to run this procedure, but there would still be downtime.

I'm not sure why Amazon has declined to implement this sort of feature; Rackspace and others make it available. My guess would be that there are security issues involved, but that's just a guess. In any case, hopefully this walkthrough helps out.

h/t Several of the images here were taken from a walkthrough by Mike Culver. Mike's screenshots were great and spared me having to take my own; unfortunately his walkthrough as currently published in Amazon's tutorials section fails in a variety of cases, including my recent one, which is why I wrote this.
