Matt Healy - PropTech CTO
Technology Leader, Solutions Architect and Business Analyst
Perth, Western Australia
 

S3 Sync - Danger!

My previous backup arrangement was a nightly job that pushed a full copy of the web server's files to a remote backup server over FTP. It had two glaring problems:

  1. It was insecure. The data was sent across the Internet via an insecure protocol, FTP, which transmits files (and credentials) in the clear. This could have been remedied by using a secure file transfer protocol such as SFTP.

  2. It was inefficient. Every night, we were sending an entire snapshot to the backup server, even though most of the data on the server never changes and really doesn't need backing up every night. To make things worse, the backup took longer and longer each night as the amount of data on the server grew over time. We were needlessly wasting precious bandwidth. The ideal solution is to take a full backup as a baseline, then send only the changes in each subsequent backup.

Eventually the FTP account was no longer available, so I started looking into a better solution. I'd done a lot of work recently with Amazon Web Services, so I decided to investigate it as a backup option. Amazon's S3 offering provides cheap, effectively unlimited object storage in the cloud. I went ahead and created a bucket to store the backup files.

The official AWS Command Line Interface (CLI) provides an easy way to perform operations on S3 (along with most of the other AWS services). In fact, it has a sync command which can mirror a local folder to an S3 bucket, sending only new or modified files on each run. Armed with this new tool, I wrote a basic Perl script which would call the following system command:

system("/usr/bin/aws s3 sync --delete /var/www/html/ s3://maad-backup/html");

I set the script to run automatically at 10pm each night, and went to bed happy with the knowledge that I'd set up a simple backup solution which would cost mere cents per month.
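For the scheduling itself, a crontab entry along these lines does the job (the script path and log file are hypothetical):

# Run the S3 backup every night at 10pm
0 22 * * * /usr/local/bin/s3-backup.pl >> /var/log/s3-backup.log 2>&1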

When I awoke the next morning, I checked my emails and saw that I had a Billing Alarm email from Amazon - my monthly spend had already gone past the alarm threshold I had set! What had gone wrong? Had my account been compromised somehow? I had a sick feeling just thinking about what the current monthly spend would be by now. If indeed I had been compromised, the intruders would have spun up as many powerful EC2 machines as they could, most likely to mine for Bitcoin.

After logging on to my AWS account, I could breathe a sigh of relief - I hadn't been compromised, only stupid. My monthly spend was at about $50, up from my usual $25-$30. Drilling into the bill, I could see that my S3 costs for the month were well above normal. The bill reported that I had made several million PUT operations to S3 - that is, I had uploaded millions of objects. Surely this couldn't be right... There were certainly a lot of files on the web server, but nowhere near millions.

Clicking around the S3 console, I could immediately see the problem. One of the subfolders I had synced to S3 contained a symbolic link - to itself.

[matthewh@folio mydir]$ pwd
/home/matthewh/mydir

[matthewh@folio mydir]$ ln -s ./ mydir

[matthewh@folio mydir]$ ll
total 0
lrwxrwxrwx 1 matthewh matthewh 2 Feb  1 12:43 mydir -> ./
[matthewh@folio mydir]$ cd mydir/
[matthewh@folio mydir]$ cd mydir/
[matthewh@folio mydir]$ cd mydir/
[matthewh@folio mydir]$ cd mydir/
[matthewh@folio mydir]$ pwd
/home/matthewh/mydir/mydir/mydir/mydir/mydir

It turns out that, by default, the aws s3 sync command follows symbolic links and creates separate, distinct objects in the S3 bucket - even when the link points back to its own directory. My script had been running for hours, creating ever-deeper levels of "directories" containing themselves.

I checked the process list and confirmed that the script was no longer running, so I assume it must have hit some limit on the server, or perhaps the cap on S3 key length - object keys max out at 1,024 bytes, so the ever-growing "path" couldn't keep getting longer forever.

After consulting the documentation (this would have been the ideal first step!) I realised my mistake - I needed to include the --no-follow-symlinks parameter.

system("/usr/bin/aws s3 sync --no-follow-symlinks --delete /var/www/html/ s3://maad-backup/html");

Running the script again fixed up the existing mess (by virtue of the --delete parameter) and gave me the desired result - a simple, secure backup of the web server.
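To double-check that the runaway objects really were gone, a recursive listing with a summary is a quick sanity check - it prints the total object count and size at the end:

aws s3 ls s3://maad-backup/html --recursive --summarize | tail -2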

This could have been a lot worse - if I hadn't had a billing alarm set up, and if the script hadn't died on its own, I could have run up a massive AWS bill because of this mistake.
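If you haven't set a billing alarm up, it's worth the five minutes. A sketch using the CLI (billing metrics live only in us-east-1, "Receive Billing Alerts" must be enabled on the account, and the alarm name, threshold and SNS topic here are placeholders):

aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "monthly-spend-over-40-usd" \
    --namespace "AWS/Billing" \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold 40 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts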


 

Server Gremlins

I recently came across an interesting problem with one of the companies I'm involved with. Their main product is an iPad app with a backend server for syncing data back to the cloud. Their production environment resides in the AWS data centre in Sydney, but they have a secondary server in Perth which is still used by some clients on the older version of the app.

Seemingly without cause, the Perth server started having problems uploading image files to S3. I investigated alongside the other technicians, and we ruled out the obvious first:

  1. The code on the server hadn't been touched in over a week, so it couldn't be a code issue.
  2. The same code was working on the production site in Sydney, indicating that the issue was likely with the server itself.
  3. The same error occurred when uploading to a different S3 bucket, so the bucket wasn't the problem.

OK, so all signs pointed to the server being the problem. But where to start? And how could a problem just materialise out of nowhere, with no changes made in the last week? I had already inspected our code and couldn't find anything, so it was time to look into the third-party libraries we used. The backend is written in Perl and utilises the Amazon::S3 module.
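For anyone unfamiliar with it, a typical upload through Amazon::S3 looks roughly like this (the credentials, bucket and key names here are placeholders, not the actual application code):

use strict;
use warnings;
use Amazon::S3;

# Placeholder credentials, pulled from the environment for the example.
my $s3 = Amazon::S3->new({
    aws_access_key_id     => $ENV{AWS_ACCESS_KEY_ID},
    aws_secret_access_key => $ENV{AWS_SECRET_ACCESS_KEY},
    retry                 => 1,
});

my $bucket = $s3->bucket('example-app-uploads');

# add_key_filename reads a local file and PUTs it to S3 under the given key.
$bucket->add_key_filename(
    'images/photo-1234.jpg',
    '/tmp/photo-1234.jpg',
    { content_type => 'image/jpeg' },
) or die $s3->err . ': ' . $s3->errstr;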

I looked into the Bucket.pm file and drilled down to basics. The add_key_filename method makes a call to add_key, and add_key contains a comment stating:

If we're pushing to a bucket that's under DNS flux, we might get a 307

This made me think the problem might be DNS-related, so I deviated a bit and sanity-checked the DNS responses the server was getting. I even tried changing the server's resolver just to make sure, but still had no success.
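A check of that sort can be as simple as comparing what the server's own resolver returns against a public one (the bucket endpoint here is illustrative):

# What does the server's configured resolver return for the S3 endpoint?
dig +short example-app-uploads.s3.amazonaws.com

# And what does a public resolver say?
dig +short example-app-uploads.s3.amazonaws.com @8.8.8.8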

With DNS ruled out, it was back to the Perl modules. Looking again at Amazon::S3, specifically the S3.pm file, I investigated the actual LWP object that gets instantiated. Passing the response it produced through Data::Dumper revealed the cause of our woes:

      'Code' => 'RequestTimeTooSkewed',
      'RequestId' => '183892BBCA7FF3D2',
      'ServerTime' => '2015-06-08T06:25:25Z',
      'Message' => 'The difference between the request time and the current time is too large.',
      'HostId' => 'oOvyHbSk2B7hFlk0UgVREBzq7f5seJhCdbxf8B+cOkrYaZ76qgqt9Z0H+5CU80Xk',
      'MaxAllowedSkewMilliseconds' => '900000',
      'RequestTime' => 'Mon, 08 Jun 2015 06:41:00 GMT'

Unbelievable... after all this investigation, the answer was hiding in plain sight.

The system clock was fast by just over 15 minutes. According to the response, MaxAllowedSkewMilliseconds is 900,000 - that is, 900 seconds, or exactly 15 minutes - and our skew had just crept past it. After syncing the system clock back to the correct time, the issue disappeared.
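For reference, checking and correcting that sort of drift is straightforward with the classic NTP tools (keeping ntpd or chronyd running is the proper long-term fix):

# Query an NTP server and report the offset without touching the clock
ntpdate -q pool.ntp.org

# Step the clock to the correct time (-u uses an unprivileged source port,
# handy when port 123 is firewalled)
ntpdate -u pool.ntp.org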