Encrypted USB drives with eCryptfs

I’ve got several USB flash drives that I carry around with me on a regular basis, and it’s nice to be able to use those for small backups of important things, in addition to my usual system-wide backups that get dumped onto a couple external hard drives. This way, even if something happens to my computer and the backup drives, (the house burns down, my computer and hard drives are stolen, or an EMP bomb goes off in the garage) I still have the important things with me. Being me, I like these important things to be encrypted so that if I lose the USB drive, it’s not cause for panic.

That said, I also use my USB drives for more mundane things, like transferring files between two computers and carrying around my music. Since I’m a Linux geek surrounded by people using Microsoft and Apple products, it’s nice to have something that works with both. After all, it’s hard to sing the praises of Linux effectively when they’re asking questions about why your USB drives won’t work.

Today I drew up a list of features that my ideal USB drive would have.

  1. Must be capable of storing encrypted data in such a way that it can be mounted as a filesystem. Backup software shouldn’t need to worry about the encryption.
  2. The USB drive must be capable of being read and written by Linux, Windows, and Mac OSX. The encrypted data only needs to be read on my personal computers, which all run Linux.
  3. If I have free space left over after backing up, I should be able to use it to transfer files, store music, or whatever else I want to use the USB drive for.

The second requirement means that the USB drive will have to have at least part of it formatted as a filesystem that everyone can read and write to. I could format it as NTFS, which is what Windows uses now, but I don’t believe Mac OSX allows writes to NTFS filesystems, and I have previously had issues with using NTFS on Linux myself. It may be the case that I could work around any Linux-NTFS issues, but I don’t want to have to convince Mac users to install things or tweak system settings to make my USB drive work. I can’t use Mac’s normal filesystem, HFS+, either, because Windows users wouldn’t be able to use it without installing new drivers. Finally, I can’t use something like ext4 or Btrfs, because then it would be limited to Linux users only without extra software. FAT32 is not my favourite filesystem, but it’s supported on all the target platforms with no extra software needed, so that will be my filesystem of choice, guaranteeing that whatever systems I’m likely to encounter will be compatible with my USB drive.

Turning to the first requirement, I need to be able to store encrypted data such that it can be mounted and used like a regular filesystem. In Linux, there are two basic ways of going about this: you can use block-level encryption like dm-crypt/LUKS, which I use for encrypting my hard disk, or you can use filesystem-level encryption like eCryptfs, which I also previously mentioned. To figure out which of these two solutions to use, I’ll look at the final requirement, which is the ability to dynamically resize to avoid wasting space.

I could partition my USB drive and format the first part as FAT32 with the rest encrypted using dm-crypt with whatever filesystem I like (encrypted data only needs to be usable in Linux) on top of the encrypted block device. This is not very nice however, as I’m pre-allocating the space before I know what I’m storing. If I don’t store much in my backup area, then the space is wasted and I can’t use it for other things. On the other hand, if I suddenly need more backup space, I’d have to manually repartition, expand the filesystem, and so on. That’s annoying, so I won’t go that route.

The second alternative, eCryptfs, is a lot better suited to this dynamically resizable storage problem. I can format the entire drive as FAT32, then create a directory to use for eCryptfs backing files. If I then mount an eCryptfs filesystem with that directory as the backing directory, all the files I write into the mount are encrypted before being written onto my USB stick. Now I can just use that as output for my backups, and I have the dynamically resizable encryption scheme that I wanted. It only takes up as much space as the encrypted files, since they are just files stored in the FAT32 filesystem. If I create a new file in the eCryptfs mount, one new file is created in FAT32, and if I delete something in eCryptfs, the file goes away for FAT32 as well.

So the final solution is to format the whole USB drive as FAT32, then stack eCryptfs on top of that. Now I can safely carry around my backups and still have room should I need to use a USB drive like a normal person, or should normal people want to use my USB drive.

What I’ve been up to recently

As part of my degree requirements, I’ve got to complete a large project within a group of four people. The project goal is self selected, but must be useful and related to software engineering. I’ll be working with Michael Chang, Zameer Manji, and Alvin Tran until early 2014 to add integrity protection to eCryptfs, a cryptographic file system that can be used on Linux. In this post, I’ll present some concepts related to the project along with some details about the project itself.

Confidentiality and Integrity

Two important concepts in computer security are those of confidentiality and integrity. There is also usually a third concept mentioned alongside these, that of availability, but I’m only mentioning it here for completeness.

Confidentiality protection attempts to prevent certain parties from reading information while allowing other parties to access the information. In the case of a cryptographic file system like eCryptfs, this is done by encrypting the files before writing them to disk, and decrypting the files when they are needed later. This could be done manually, but it is much easier and less error prone to have the file system handle this sort of thing than to try to encrypt all sensitive information by hand.

Integrity protection attempts to ensure that information has not been unintentionally changed. This might entail actually trying to prevent modifications to the information, or it may simply indicate when the information has been changed. Cryptographically, this can be done using a Message Authentication Code (MAC), which is a short binary string that can be easily calculated with a file and a key, but cannot be calculated without both. Additionally, if the file changes then the calculated MAC will be different. Anyone knowing the key and having access to the file can calculate the MAC and compare it to one that was calculated and stored earlier, and if the two are different, then the file must have been changed.
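Here is a minimal sketch of the MAC idea in Python, using HMAC-SHA256 from the standard library. The key, the file contents, and the function names are made up for illustration; this is not how eCryptfs does (or will do) it, just the general calculate-store-compare pattern described above.

```python
import hashlib
import hmac

def compute_mac(key: bytes, data: bytes) -> bytes:
    """Return an HMAC-SHA256 tag over data using key."""
    return hmac.new(key, data, hashlib.sha256).digest()

def verify_mac(key: bytes, data: bytes, stored_tag: bytes) -> bool:
    """Recompute the tag and compare it to the stored one in constant time."""
    return hmac.compare_digest(compute_mac(key, data), stored_tag)

# Hypothetical key and file contents, for illustration only.
key = b"example key, illustration only"
original = b"pay Alice 10 dollars"
tag = compute_mac(key, original)  # calculated and stored earlier

print(verify_mac(key, original, tag))                 # unchanged contents
print(verify_mac(key, b"pay Alice 90 dollars", tag))  # modified contents
```

The first check succeeds and the second fails, because the modified contents no longer match the stored tag.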

Current state of eCryptfs

The eCryptfs file system is a stacking file system, which means that it relies on a lower file system to handle stuff like I/O and buffering, and just manages file encryption and decryption. Currently, that is all it manages, as it does not include any integrity protection. The contents of files are made unreadable to anyone without the correct key, but it is still possible to modify those files in partly predictable ways, as presented below.

There is already a wide user base for eCryptfs: Ubuntu and its derivatives use it to provide the encrypted home directory feature, and it is also used within Google’s ChromeOS.

Attack against CBC mode

Cipher Block Chaining (CBC) is one of the most common modes of operation for block ciphers, and is used currently by eCryptfs. In this mode of operation, each block of plaintext is XORed with the previous ciphertext block before encryption. This ensures that the same block of plaintext won’t encrypt to the same ciphertext, unless the previous ciphertext block is the same as well. A one-block initialization vector stands in for the previous ciphertext block during the first encryption. CBC decryption just reverses the process, first decrypting the ciphertext block, then XORing it with the previous ciphertext block.

Operations can be expressed in the following way (Taken from Wikipedia)

Encryption: C_i = E_K(P_i \oplus C_{i-1}), C_0 = IV

Decryption: P_i = D_K(C_i) \oplus C_{i-1}, C_0 = IV

Now let’s perform the attack. Let’s say we want to change a certain plaintext block P_n into a different plaintext {P_n}' by flipping some bits. We’ll denote this change as \Delta. That is, {P_n}' = P_n \oplus \Delta

It turns out that if we don’t care what happens to the previous plaintext block, P_{n-1}, all we have to do is replace C_{n-1} with {C_{n-1}}' = C_{n-1} \oplus \Delta

We can substitute this into the decryption formula above to see what will happen.

{P_n}' = D_K(C_n) \oplus {C_{n-1}}'

{P_n}' = D_K(C_n) \oplus {C_{n-1}} \oplus \Delta

{P_n}' = P_n \oplus \Delta

This is an integrity issue, as an attacker can now modify files without ever knowing the key used to encrypt them. It’s also not guaranteed that this modification is detectable, depending on whether the previous block can be checked for validity. If it can be checked, great, but that’s just another form of integrity protection, and the project I’m working on aims to implement integrity protection regardless of the data stored. If it can’t be checked for correctness, or is ignored (maybe it’s a different record in a database) then the modification will go unnoticed.
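To make the attack concrete, here is a small Python sketch. Python’s standard library has no AES, so I’m assuming a toy 16-byte Feistel cipher built from SHA-256 as a stand-in block cipher; it is not secure and is not what eCryptfs uses. The CBC logic follows the formulas above. In this example the attacker picks \Delta from a known plaintext block, but any chosen bit flips work the same way.

```python
import hashlib
import os

BLOCK = 16  # block size in bytes

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Round function of the toy Feistel cipher (illustration only, NOT secure).
    return hashlib.sha256(key + bytes([i]) + half).digest()[:BLOCK // 2]

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, xor(left, _round(key, i, right))
    return left + right

def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for i in reversed(range(4)):
        left, right = xor(right, _round(key, i, left)), left
    return left + right

def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    # C_i = E_K(P_i xor C_{i-1}), with C_0 = IV
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(key, xor(plaintext[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)

def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    # P_i = D_K(C_i) xor C_{i-1}, with C_0 = IV
    prev, out = iv, []
    for i in range(0, len(ciphertext), BLOCK):
        block = ciphertext[i:i + BLOCK]
        out.append(xor(toy_decrypt_block(key, block), prev))
        prev = block
    return b"".join(out)

key, iv = os.urandom(16), os.urandom(BLOCK)
plaintext = b"record one: ok. " + b"pay Alice $10.00"  # two 16-byte blocks
ciphertext = cbc_encrypt(key, iv, plaintext)

# The attacker never touches the key: they only XOR delta into the
# ciphertext block that precedes the block they want to change.
delta = xor(b"pay Alice $10.00", b"pay Alice $90.00")
tampered = xor(ciphertext[:BLOCK], delta) + ciphertext[BLOCK:]

result = cbc_decrypt(key, iv, tampered)
print(result[BLOCK:])                        # b'pay Alice $90.00'
print(result[:BLOCK] != plaintext[:BLOCK])   # True: previous block is garbled
```

The second block decrypts to exactly the attacker’s chosen text, while the first block turns into garbage, which is the trade-off mentioned above.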

Galois Counter Mode

Galois Counter Mode (GCM) is another mode of operation for block ciphers, but in addition to encryption, also produces a piece of data known as an authentication tag. This tag acts as a MAC taken over the data that was encrypted. An attacker could still modify the ciphertext, but now the resultant changes to the plaintext will invalidate the tag, making them detectable. The attacker cannot modify the tag so that it validates the new data, because calculating the tag requires the cryptographic key that was used to encrypt the data, and the attacker does not know this key.

Another benefit to GCM is speed. It’s true that the same effect on security could be had by encrypting the data and calculating a MAC separately, but that requires two passes of the file, one for each operation. GCM does both in one pass over the file, speeding things up. This is important in a file system, as you’d rather have access to your files quickly.

The project aims to implement GCM as the mode of operation for eCryptfs, thus providing both integrity and confidentiality protection. Integrity protection was something the original developers wanted to have from the beginning, but didn’t have the time to implement. I’m proud to be helping to create the first widely used integrity protected cryptographic file system.

Password Restrictions Really Bug Me

Warning, rant ahead.

Maybe it’s just me, but when I hit restrictions on what I can use as my password, I get annoyed. Lower limits, such as “at least 6 characters long” are fine, but there are several things that I can see no reason for, that make me doubt the competence of the programmers involved in the system. When I realised that my bank’s website did all of the things mentioned in this post, I was really annoyed. Thousands of dollars of my money are sitting there, just one string of characters away from an attacker getting it all, and of course, reading the fine print of their security agreement reveals that it’s not their responsibility if my password or reset questions are compromised.

Character restrictions
Passwords are passwords, not HTML, not shell scripts, not anything that needs to be parsed by machines. They should be treated as opaque sequences of bytes, and the only thing that should be done with them while logging in is salting and hashing. When I see restrictions like, “To preserve online security, your information cannot contain unacceptable symbols or words (for example, “%”, “<”, “{”, “www.”, “ftp”, “https”, etc.)”, I’m astounded that they let these people touch code at all. There is no reason at all that they should need to check for that sort of thing in passwords. Passwords should never be displayed, not on their website, not in email, not anywhere. There should be no cause to worry about XSS attacks, SQL injection, or any other sort of incomplete mediation attack via passwords if they’re properly handled as opaque data.

I’m not complaining about entropy figures here. A 12 character string composed of random alphanumeric characters would be approximately 71 bits of entropy. Adding in the printable punctuation and whitespace characters on a standard US keyboard only adds about 7 bits of entropy to that figure for a random 12 character string. The reason I’m annoyed by these restrictions is that they hint at deeper problems about how the password is handled by the system. They also impede those of us who do want to use “special” characters in our passwords for whatever reason, from non-ASCII characters in a preferred language to an obsession over password entropy.
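For anyone who wants to check the arithmetic, here is a quick Python sketch. I’m assuming the “full” alphabet is the 95 printable ASCII characters (letters, digits, 32 punctuation symbols, and the space); counting tabs or other keys would shift the numbers slightly.

```python
import math
import string

def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy of a uniformly random string: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)

alnum = len(string.ascii_letters + string.digits)   # 62 characters
printable = alnum + len(string.punctuation) + 1     # plus punctuation and space

alnum_bits = entropy_bits(alnum, 12)
printable_bits = entropy_bits(printable, 12)

print(round(alnum_bits, 1))                    # alphanumeric only
print(round(printable_bits, 1))                # all printable ASCII
print(round(printable_bits - alnum_bits, 1))   # what the extra characters add
```

Under these assumptions the alphanumeric string comes out at roughly 71.5 bits, and the larger alphabet adds only a handful of bits on top of that.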

Upper limit on length
I would have thought that the days of small fixed size strings were behind us, but apparently not. My bank puts an upper limit of 12 characters on password length, which precludes using an easily memorable passphrase. They put a similarly low limit of 25 characters on the password reset questions and answers. One of the few things they did right is allowing me to write my own security question, as I definitely didn’t want to use the default questions. I then got cut off halfway through writing a short sentence by this low character limit. Is this due to some aspiring database administrator learning that CHARs were faster than VARCHARs, and deciding to speed up logging in by a few milliseconds? Is the process of logging into an account, or resetting a password really where the bottlenecks are? If the problem doesn’t lie with the storage, but with the login system itself, then it’s time the programmers learned about dynamic allocation.

In addition to making it impossible to use longer passwords, this upper length limit also hints at improper handling of the passwords, as properly salted and hashed passwords would be constant length, and the length of the original password would be completely irrelevant to the storage requirements.
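A short Python sketch shows why length shouldn’t matter. I’m using PBKDF2-HMAC-SHA256 from the standard library with an iteration count I picked for the example; I have no idea what my bank actually does, and a real system should choose its own algorithm and parameters.

```python
import hashlib
import os
from typing import Optional, Tuple

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). The password is treated as opaque bytes."""
    if salt is None:
        salt = os.urandom(16)
    # Iteration count is an assumption for this example, not a recommendation.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

short_salt, short_digest = hash_password("hunter2")
long_salt, long_digest = hash_password(
    "a much longer passphrase with <odd> %characters% {and} non-ASCII: café"
)

# Both stored values are the same size, whatever the password looked like.
print(len(short_digest), len(long_digest))   # 32 32
```

The stored digest is 32 bytes either way, so neither the length nor the character set of the original password has any effect on storage.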

Required character classes
This sort of attempt to increase security is what leads to users choosing “Password1” instead of “password” to protect their life savings. No system is idiot proof, so rather than treating the symptoms by attempting to programmatically enforce good passwords, try to treat the problem by educating the idiots on how to choose a good password, and mentioning why they care. Suggest using a diceware passphrase, or use my password suggestions. Of course, this is only effective if users actually can use good passwords, so fixing the first two issues is a priority.

While I’m ranting about programmatically enforcing password strength, I should mention that the only checks I think are valid are checking against a list of common passwords, and checking against information like the account name or other public details associated with the account. These are going to be among the first things a social engineering attacker would try, and common workarounds for required character classes are not going to stop them, making that method of enforcement worthless.
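Those two checks are simple enough to sketch in a few lines of Python. The common-password set here is a tiny stand-in I made up (a real deployment would load a much larger list), and the function name and messages are hypothetical.

```python
from typing import List, Optional

# Tiny stand-in list for illustration; a real list would be far longer.
COMMON_PASSWORDS = {"password", "password1", "123456", "qwerty", "letmein"}

def rejection_reason(password: str, account_details: List[str]) -> Optional[str]:
    """Return why a password should be rejected, or None if it passes."""
    lowered = password.lower()
    if lowered in COMMON_PASSWORDS:
        return "common password"
    for detail in account_details:
        # Reject passwords that contain the account name or other public details.
        if detail and detail.lower() in lowered:
            return "contains account detail"
    return None

details = ["alice", "alice@example.com"]
print(rejection_reason("Password1", details))                     # common password
print(rejection_reason("alice2013", details))                     # contains account detail
print(rejection_reason("correct horse battery staple", details))  # None
```

Note that there are no rules about character classes or maximum length anywhere in it.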

Default password reset questions
Not really a password restriction, but it’s related, and it ticks me off. The usual culprits, along the lines of “what was your first [job, school, pet’s name, car]”, just train people to think this information is unguessable. After all, if so many places use the same questions, there must be a good reason, right? I did a quick experiment where I looked at the public information about some of my friends on Google+, searched through their post history, and if I could find a link to their website, looked at that as well. In many cases, I could find answers to at least one of those questions just from these sources of information. The current system might as well be an all-you-can-eat buffet for social engineers.

Perhaps more obscure questions could be used by default, or perhaps websites should start using alternate methods of authentication, for example, resetting via SMS or OpenID.  Approached from an alternate perspective, why should we even need to give answers to these questions in the first place? What business do random websites have knowing trivia about me, and more importantly, why is it the same default trivia that protects my bank account?

</rant>

Killing Machines

Automated systems are already doing much of our work for us, making everyday decisions to remove the burden from humans. Examples range from Google’s licensed self-driving car navigating the roads alongside human-guided vehicles to the computers doing stock market trading on behalf of investors. Both of these are technologies that do what humans can do, but better, faster, or more reliably. Humans suffer lapses of concentration and fatigue, and do so unpredictably, whereas computers don’t. Computers have their own set of interesting problems, like the priority inversion bug that crashed Mars Pathfinder, but those can be found and removed from systems.

The question that springs to mind for me is “Is there anything we shouldn’t let a computer decide, even if it could make that decision faster or more reliably than a human?” My answer is that a machine should not be allowed to decide whether to kill a human. I’m not against computers aiding humans in acts of war, that’s just technological progression, and it’s happening already. Modern fighter jets are extremely unstable, to aid in maneuverability, so much so that a human pilot could not possibly stabilize the aircraft, so the jets are stabilized by computer. The human pilot still has the decision of whether or not to fire the weapons, and at what targets.

I’d also like to point out that I’m not mentioning anything about sentient computers when I say “decision”. Computers make decisions every time they execute conditional statements, without any sort of capacity for sentience or consciousness. A machine could be programmed to identify humans and kill some subset of them without human intervention, and it would be making a decision whether or not to kill humans. A machine that identified humans and then asked whether or not to kill them would not be making the decision to kill a human, as it would be passed off to a human operator.

By giving a machine the decision to kill a human being, we have created something capable of autonomously waging war. Most people consider war to be something to be avoided and minimized if possible. It is also known that the more indirect the method of killing, the easier it will be for a person to rationalize it, and the less aversion to performing the killing they will have. Psychologically, it is much easier to kill someone by pressing a button that launches a missile than to shoot someone that you can see, and shooting someone is psychologically easier than stabbing someone to death. If we remove the decision entirely from humans, the killing is now out of sight, out of mind, making it much easier for mass killings to take place without psychological consequence for those waging war.

One problem with this answer is how to define “deciding to kill”. Through various means, computers can control much of how we view the world. For example, information people view on the Internet is managed by computers. If the search engine ratings on some website are low, fewer people will be impacted by that website. If the search engines raise the rating, more people are likely to see it. By controlling what we see, computers could indirectly control what we think, and could theoretically manipulate one human into killing another. As such, it is likely an exercise in futility to try to prove that a computer could not decide to kill a human, even though that scenario seems to be very unlikely at this point in time.

SSH X Forwarding

Recently I was messing about with X forwarding through SSH, and I realized something that perhaps should have been obvious, but caught me off guard, so I’ll share it here.

The scenario involves 2 computers and a fairly resource intensive graphical application that could take advantage of 3D acceleration. The computers involved were a desktop computer with a reasonable graphics card, and a netbook with whatever graphics capabilities were built into the motherboard, but no 3D acceleration. I wanted to take advantage of the netbook screen, effectively using it as a second monitor, but running the application on the more powerful desktop. I figured I should be able to forward the X session over SSH, and end up with the netbook displaying it, but all the hard work done on my desktop. Unfortunately, when I set this up, I found the application could not take advantage of 3D acceleration anymore.

The reason for this is that when an X session is forwarded, only the X traffic is transferred across the network. This traffic consists of things like “draw a rectangle here” or “draw this bitmap there”. Where I had gone wrong was assuming that my graphical application would have the 3D acceleration done on the desktop’s fully capable graphics card, then have the resulting bitmap sent over the network. In reality, the netbook was doing all the graphical rendering, leading to a lack of 3D acceleration capability.

Despite the fact that I couldn’t get my 3D acceleration, this is actually a much smarter way of doing things, as it significantly reduces the network traffic involved. My netbook’s screen size is 1024×768, and let’s assume I wanted 30fps and 32 bit colours. The resultant network traffic would be (1024×768 pixels/frame) × (30 frames/second) × (32 bits/pixel), coming out to a little over 750Mb/s just for the image going one way. If I recall correctly, the actual network load while I was doing this was a little over 100Mb/s.
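The back-of-the-envelope figure is easy to reproduce in Python. This assumes uncompressed frames and ignores any protocol overhead, so it is only an estimate of what sending raw bitmaps would cost.

```python
def raw_video_bits_per_second(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Bandwidth needed to send every frame as an uncompressed bitmap."""
    return width * height * fps * bits_per_pixel

bps = raw_video_bits_per_second(1024, 768, 30, 32)
print(bps)         # 754974720 bits per second
print(bps / 1e6)   # about 755 Mb/s
```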

The lesson learned here is that hardware acceleration is done on the display end (the X server) rather than the client end of an X connection.

Turkey Trivia

It’s Thanksgiving here in Canada, and so instead of the usual range of topics, here’s some trivia about that favourite food, the turkey.

  1. Canadians consumed 143.4 million kg (Mkg) of turkey in the year 2011.
  2. At Thanksgiving 2011, 3.0 million whole turkeys were purchased by Canadians, equal to 32% of all whole turkeys that were sold over the year.
  3. At Christmas 2011, 4.4 million whole turkeys were purchased by Canadians, equal to 46% of all whole turkeys that were sold over the year.
  4. Turkeys are omnivorous. Most of their diet is grass and grain, but they will also eat insects, berries and small reptiles.
  5. The wild turkey’s bald head can change color in seconds with excitement or emotion. The birds’ heads can be red, pink, white or blue.
  6. Turkeys see in color and have excellent daytime vision that is three times better than a human’s eyesight and covers 270 degrees, but they have poor vision at night.
  7. The long, red fleshy growth from the base of the beak that hangs down over the neck of a turkey is called the snood.
  8. The fastest time to carve a turkey is 3 min 19.47 sec and was achieved by Paul Kelly (UK) at Little Claydon Farm, Essex, UK, on 3 June 2009.
  9. Turkeys originated in North and Central America, and evidence indicates that they have been around for over 10 million years.
  10. Wild turkeys can fly for short distances at up to 88 kilometres per hour. Wild turkeys are also fast on the ground, running at speeds of up to 40 kilometres per hour.

Info taken from

They’re Always Watching

Right now, as you read this page, you are secretly being watched. Ad networks and analytics companies are tracking you across the Web, using a variety of techniques designed for one purpose: knowing everything about you. There is an entire industry built upon identifying and tracking web users, and chances are you haven’t heard of most of them. This post will discuss some of the ways they track you, how you can make it harder for them, and why you should.

You may be wondering why tracking is so bad, and why you should care. You might think that there are so many people that picking you out of the crowd is practically impossible. It’s not. The problem is that you don’t know who has your data, you don’t know what they have, and you can’t control how they use it. The data associated with you might seem fairly innocuous, such as the IP addresses you’ve used, what sort of sites you visit, or your email address, but in the wrong hands it can be rather dangerous.  Potential employers could see your search history, including not only search terms, but what you clicked on. Phishers and identity thieves could launch targeted attacks using information off your social networking sites and your location from the IP addresses you’ve used. Commerce sites could adjust their prices and charge you more for things you’re interested in, assuming you’re still likely to buy them.

From the point of view of the people doing the tracking, there are two main goals. First they have to get as much information as they can from each place you visit, and second, they have to link it together to create a profile of you. This allows them to build the detailed view of your Web history that they can then sell to anyone willing to pay them for it.

Gathering Information

Every time you make a request to a web server, your browser sends a bunch of information along with that request. It will generally send a “user agent” string identifying the browser you use, the operating system you’re using it on, and some other information such as version numbers. It also sends a header telling the server what the last web page you looked at was and one telling the server what languages you prefer, and if you connect through an HTTP proxy, the proxy may add a header with your original IP address. On top of all this, any cookies previously set by the domain (more on this later) will be sent to the server. This is quite a lot of information, and it may be sent out multiple times, not only for the web page itself, but also for any extra resources in the page, such as images or ad banners.
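To give a feel for what that looks like on the wire, here is a small Python sketch that picks the revealing headers out of a request. The request itself is made up: the hostnames, browser name, IP address, and cookie value are all placeholders, and X-Forwarded-For is just the header name proxies commonly use for the original IP address.

```python
# A made-up request for an ad banner; every value here is a placeholder.
RAW_REQUEST = (
    "GET /banner.png HTTP/1.1\r\n"
    "Host: ads.tracking.example\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0\r\n"
    "Referer: https://shop.example/cart\r\n"
    "Accept-Language: en-CA,en;q=0.8\r\n"
    "X-Forwarded-For: 203.0.113.7\r\n"
    "Cookie: uid=abc123\r\n"
    "\r\n"
)

REVEALING = ("User-Agent", "Referer", "Accept-Language", "X-Forwarded-For", "Cookie")

def parse_headers(raw: str) -> dict:
    """Split a raw HTTP request into a {header name: value} dictionary."""
    headers = {}
    for line in raw.split("\r\n")[1:]:   # skip the request line
        if not line:                     # a blank line ends the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

headers = parse_headers(RAW_REQUEST)
for name in REVEALING:
    print(f"{name}: {headers.get(name)}")
```

Everything printed here arrives at the server before a single line of JavaScript has run.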

This is one major source of information for the tracking companies. All they have to do is arrange for some resource hosted on their servers to be included in the web page you visit, and all that information will be sent to them when the page loads. This isn’t hard, it happens every time you see an advertisement on a page you visit. Even if you don’t see any ads on a page, you’re still not in the clear. Web analytics companies will often place transparent 1×1 pixel images, also known as “web bugs”, in the pages they keep stats on. Every time someone loads the page, their browser sends all that information to the tracking company along with their request for these images.

So how do you stop this from happening? There are two ways to do this: you can change or reduce the information in the headers, or better, you can avoid fetching the third party content altogether. If you never request things from their servers, it becomes much harder to track you. Two browser extensions that help you avoid connecting to these servers are Adblock Plus for Firefox and Chrome, and Do Not Track Plus for Firefox, Chrome, Safari and IE. These extensions attempt to keep you from loading the third party resources that come from advertisers and Web analytics companies without affecting anything else. If the site you’re connecting to directly is tracking you, then they won’t help, but they do a pretty good job at blocking third party tracking.

The other method for reducing the information given out is to avoid giving it out, or at least make it meaningless. There are extensions for Firefox and Chrome (and probably others) that will allow you to suppress sending the URL of the last page you visited along with each request. To protect yourself from trackers identifying where you live based on your IP address, you can use something such as Tor to hide your location. (Remember what I said about proxies sending the original IP address along with the request; Tor is safer, if slower.) Other extensions exist that will allow you to modify the requests you send to web servers by removing or changing other headers.

Building a Profile

To track you across the web, the tracking companies need some way of identifying who is making a request to their servers when a web bug or ad banner is loaded. The easiest way to do this is with HTTP cookies, which are pieces of information that your browser stores, and then sends back to the server on each new request. This works across sites, as long as there are some third party resources to fetch from the same domain on each site. Let’s say you log in to Facebook to check up on your friends. Along with the main page content, your browser fetches some ad banners from ads.tracking.com. Along with these come some tracking cookies, which just contain some unique string. Later on you visit an online store, which also has ads hosted on ads.tracking.com. When your browser goes to fetch these ads, it sends back the cookie it got while you were on Facebook. Now the tracking company knows that those two requests came from the same web browser, and they can link your social information to your shopping history. Of course, as you continue to browse the Web, they’ll link in other stuff, like all your search terms, or the list of all those sites you don’t want people finding out that you visit.
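The tracker’s side of this is almost embarrassingly simple. Here is a toy Python model of it; the Tracker class, its method, and the site URLs are all hypothetical, and a real ad server obviously does this over HTTP rather than with function calls.

```python
import uuid
from collections import defaultdict
from typing import Optional

class Tracker:
    """Toy model of a third-party ad server that links requests by cookie."""

    def __init__(self) -> None:
        self.profiles = defaultdict(list)   # cookie value -> pages seen

    def handle_request(self, referer: str, cookie: Optional[str] = None) -> str:
        """Record the page that embedded the ad; return the cookie to store."""
        uid = cookie or uuid.uuid4().hex    # first visit: hand out a unique string
        self.profiles[uid].append(referer)
        return uid

tracker = Tracker()

# First visit: no cookie yet, so the tracker sets one.
cookie = tracker.handle_request("https://social.example/home")
# Later visit to a different site: the browser sends the same cookie back.
cookie = tracker.handle_request("https://shop.example/cart", cookie)

print(tracker.profiles[cookie])
# ['https://social.example/home', 'https://shop.example/cart']
```

One random string is all it takes to tie the two visits together into a single profile.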

To make matters worse, HTTP cookies are only one type of information that can be stored and sent back later. Flash also allows special cookies to be stored, and these won’t be thrown out when you clear your browser cookies. Java applets and Microsoft Silverlight can do similar things, and are just as dangerous to your privacy as Flash. I recommend turning these off entirely except for a few trusted sites, e.g. YouTube.

An alternative to cookie-based tracking is to use something like JavaScript to fingerprint browsers. Information such as your browser version, operating system, any fonts or plugins installed, and much more can be collected and sent back after the page loads, even if you didn’t send it along with the page request. This information may be enough to uniquely identify you, or at least to place you into a very small group of users that share your exact settings.
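As a rough illustration of fingerprinting, here is a Python sketch of what the tracker might do with the attributes once a script has collected them. The attribute names and values are invented, and hashing a sorted JSON dump is just one simple way I chose to turn them into an identifier; real fingerprinting scripts collect far more.

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Collapse collected browser attributes into a short, stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True)   # order-independent
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Invented attribute values, for illustration only.
visit_one = {
    "browser": "ExampleBrowser 1.0",
    "os": "Linux x86_64",
    "fonts": ["DejaVu Sans", "Liberation Mono"],
    "plugins": ["ExamplePlugin 11.2"],
    "screen": "1024x768",
}
# Same browser on a later visit: attributes arrive in a different order.
visit_two = dict(reversed(list(visit_one.items())))
# A different user whose only difference is one extra font.
other_user = dict(visit_one, fonts=["DejaVu Sans", "Liberation Mono", "Comic Neue"])

print(fingerprint(visit_one) == fingerprint(visit_two))    # True
print(fingerprint(visit_one) == fingerprint(other_user))   # False
```

No cookie is stored anywhere, yet the same browser produces the same identifier on every visit.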

To avoid tracking companies linking your requests together, you would need to avoid storing cookies from their servers and avoid running any code that they provide. The suggestions above for avoiding connecting to tracking servers in the first place will help you out here, but some additional protection is warranted as well. Most browsers have a setting to prevent them from storing third party cookies. This is a good defence against tracking via HTTP cookies. Additionally, most browsers let you clear cookies when you shut down the browser. While this will result in your logins not being remembered across browser sessions, it can help avoid you being tracked across browser sessions as well. Finally, you can also use your browser’s private browsing mode. This will generally prevent cookies from being saved and used outside the private session, in addition to not saving history to your computer. Keep in mind that none of this helps against Flash, Java, or Silverlight.

To avoid browser fingerprinting through JavaScript (as well as a lot of other annoying or dangerous stuff) you could turn JavaScript off completely. This will break the functionality on a lot of sites however, so I would recommend NoScript for Firefox or ScriptNo for Chrome to help you manage which sites you trust to run code in your browser. This will require some tweaking at the beginning as you choose which sites you trust and which you don’t, but it becomes much smoother afterwards, and you’ve greatly reduced the potential for attacks on your browser through these channels too.

If you follow the advice given in this article, you can go a long way towards regaining some control over your personal information. There is likely no magic bullet, and never will be, but you can make it a lot harder for the usual range of tracking techniques to build a profile of you. You can’t take back what you’ve already given away, but you can limit the information you give away in the future.