Understanding files 101

Published on 2020-11-21 code

One topic that seems to always stay relevant to me is the divide between software developers and regular users. The common approach seems to be that regular users are not trusted to understand many fundamental ideas of computers and therefore get very limited apps compared to the power tools that are used by developers. This approach of course only widens the divide.

On the other side of that argument we get something like code.org which is trying to teach kids how to code, but doing so in a toy environment which is completely detached from reality.

I personally believe that we should concentrate on the basics that can be applied in everyday life. If people know how to look something up on the internet, they can teach themselves. If people know how to use right-click menus, common keyboard shortcuts, or browser tabs, they can complete their tasks much more efficiently. Teach a man to fish and all that.

One of these basics that is often overlooked is a general knowledge of common file types. So this is what this article is about. After reading this you will have a basic understanding of what any file is and how you might be able to interact with it.

A file is not a program

Before I get into it, I have to clear up a common misconception:

Files are often associated with programs. We say things like “Word file”, “Excel file”, or “Photoshop file” and we expect that the respective program will launch when we double-click such a file.

But that doesn’t mean that these files are exclusive to those programs. There could be many other programs that can open those files just as well. The widespread claim “you need Adobe Reader to open PDF files” is simply a lie used for marketing.

Just think of an mp3 audio file. When you double-click it, an audio player launches. But you could install a different audio player and set it as the default at any time. The same goes for jpg or png images and many other types of files.

With that out of the way, let’s get to the actual files.

Text files

You probably already know that files just consist of ones and zeroes, also called bits. How these bits are interpreted depends on the specific type, but many file types use the same basic structure: Groups of eight bits (we call them bytes) are mapped to characters. For example, 01100001 is mapped to a. This way we can create simple text files.

There are different mappings, but the most important ones are ASCII and UTF-8. ASCII is old and simple and only contains the most important characters. UTF-8 is newer and more complicated; it contains everything that ASCII does and then a lot more, e.g. Chinese characters or emojis.
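
To make the difference a bit more concrete, here is a small Python sketch. The sample characters are just my own picks for illustration. It shows how many bytes UTF-8 spends on each of them, and that plain ASCII cannot represent a Chinese character at all.

```python
# A few sample characters: a Latin letter, an accented letter,
# a Chinese character, and an emoji (written as escapes to be unambiguous).
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]

byte_counts = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_counts[ch] = len(encoded)
    print(repr(ch), "->", len(encoded), "byte(s):", encoded.hex())

# Text that only uses ASCII characters produces the same bytes in both mappings.
plain_is_identical = "hello".encode("ascii") == "hello".encode("utf-8")

# ASCII has no entry for a Chinese character, so encoding it fails.
try:
    "\u4e2d".encode("ascii")
    ascii_can_encode_chinese = True
except UnicodeEncodeError:
    ascii_can_encode_chinese = False

print(plain_is_identical, ascii_can_encode_chinese)
```

The letter a still fits into a single byte, while the other characters need two, three, and four bytes.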

When I say that bytes are mapped to characters, I use a very loose definition of the term “character”. It includes not only letters and digits, but also punctuation, spaces (00100000), and even line breaks (00001010).
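
If you want to try the mapping yourself, the following Python sketch takes a string of ones and zeroes, cuts it into groups of eight, and looks each group up in ASCII. The function name bits_to_text is made up for this example.

```python
def bits_to_text(bits: str) -> str:
    """Decode a string of '0'/'1' characters, eight bits per character."""
    # Cut the bit string into bytes and convert each one to a number.
    byte_values = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    # Map the numbers to characters using the ASCII table.
    return bytes(byte_values).decode("ascii")


letter = bits_to_text("01100001")
# The letter a, a space, another a, and a line break.
phrase = bits_to_text("01100001" "00100000" "01100001" "00001010")

print(repr(letter))
print(repr(phrase))
```

I print the results with repr so that the space and the line break show up visibly instead of disappearing into the output.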

The programs that are used to view and edit text files are called “text editors”. The default text editor on Windows is called Notepad, the one on macOS is called TextEdit, and a common one on Linux is called gedit. Word is not a text editor in this sense, because it stores its documents in much more complicated files that can also contain formatting and images, which goes far beyond the simple mapping we are talking about here (we will get to that).

If you come across a file that you don’t recognize, it is often a good idea to look at it in a text editor. If the file happens to contain text, you can read it and maybe understand enough to know what to do next.
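
A program can make the same guess for you. The sketch below is a rough heuristic of my own, not an official test: it treats data as text if it contains no zero bytes and can be decoded as UTF-8. The function name looks_like_text is made up.

```python
def looks_like_text(data: bytes) -> bool:
    """Rough guess: does this look like a UTF-8 (or ASCII) text file?"""
    # Zero bytes are very common in binary files and very rare in text.
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text_sample = "Hello w\u00f6rld\n".encode("utf-8")
binary_sample = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"  # start of a png image

print(looks_like_text(text_sample))
print(looks_like_text(binary_sample))
```

For a real file you would read its bytes with open(path, "rb") and pass them in. Keep in mind that this is only a guess: text in an older mapping may be rejected, and a short binary file may slip through.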

XML

Mapping bits to characters already gives us some structure, but apparently not enough. So people have invented different formats on top of that. Probably the most widespread of these formats is XML. An XML file looks roughly like this:

<animals>
    <animal name="dog">
        <sound>Bark</sound>
        <legs>4</legs>
    </animal>
    <animal name="cat">
        <sound>Meow</sound>
        <legs>4</legs>
    </animal>
</animals>

I am not going to explain all details of XML, and I hope it is somewhat self-explanatory. You can usually identify it by the use of all those angle brackets.
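
The nice thing about such a structured format is that programs can read it reliably. As a small sketch, this is how the example above could be read with the XML parser from Python's standard library. The variable names are my own.

```python
import xml.etree.ElementTree as ET

document = """<animals>
    <animal name="dog">
        <sound>Bark</sound>
        <legs>4</legs>
    </animal>
    <animal name="cat">
        <sound>Meow</sound>
        <legs>4</legs>
    </animal>
</animals>"""

root = ET.fromstring(document)

animals = {}
for animal in root.findall("animal"):
    name = animal.get("name")            # the name="..." attribute
    sound = animal.findtext("sound")     # text inside <sound>...</sound>
    legs = int(animal.findtext("legs"))  # text is always a string, so convert
    animals[name] = (sound, legs)
    print(f"The {name} says {sound} and has {legs} legs.")
```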

XML and its close relatives are used virtually everywhere. For example: Every web page is written in HTML, which is not strictly XML but uses the same angle-bracket structure.

ZIP

Text files are great because they are much easier to read compared to a stream of ones and zeroes. However, that comes at the price of being less efficient. For example: “14”, when encoded as two characters, is 0011000100110100. When we encode “14” as a number directly we get 1110, which is obviously much shorter.
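
You can check these numbers with a few lines of Python. This is only a sketch to reproduce the two bit strings from the paragraph above.

```python
# "14" stored as text: one byte per character.
as_text = "14".encode("ascii")
text_bits = "".join(f"{byte:08b}" for byte in as_text)

# 14 stored as a number: just its binary digits.
number_bits = f"{14:b}"

print(text_bits, len(text_bits), "bits")
print(number_bits, len(number_bits), "bits")
```

In practice a number is usually padded to at least one full byte, but even then it takes half the space of the text version.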

Some smart people have come up with a great solution for that: We can compress files to get back some of that efficiency. So now we get the best of both worlds: We can write some understandable text files and then bundle them together into a single compressed ZIP file.
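
Here is a rough sketch of how much compression can win back. It uses Python's zlib module, which implements DEFLATE, the compression method most commonly used inside ZIP files. The sample text is made up and deliberately repetitive, so real files will usually shrink less.

```python
import zlib

# A made-up, very repetitive piece of text.
line = "<animal><sound>Bark</sound><legs>4</legs></animal>\n"
text = (line * 200).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "bytes before compression")
print(len(compressed), "bytes after compression")
print("identical after unpacking:", restored == text)
```

Nothing is lost along the way: unpacking gives back exactly the bytes that went in.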

This is actually how many file types work these days. For example, try changing the file extension of a modern MS Office file (.docx, .xlsx, or .pptx) to .zip. You can now unpack it and see what it contains: You guessed it, a bunch of XML files!
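
You don't even need to rename anything if you use a few lines of Python. The sketch below lists what is inside a ZIP archive. Since I can't ship an Office file with this article, it first builds a tiny stand-in archive in memory. The file names and contents in it are simplified placeholders, and list_archive is a made-up helper name.

```python
import io
import zipfile


def list_archive(source) -> list:
    """Return the names of all files stored in a ZIP archive."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Build a small stand-in archive in memory (placeholder contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", "<document>Hello</document>")
    archive.writestr("[Content_Types].xml", "<Types></Types>")
buffer.seek(0)

names = list_archive(buffer)
print(names)
```

To inspect a real document, pass its path instead of the buffer, for example list_archive("report.docx") for a hypothetical file of that name.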

Conclusion

Of course there are a lot more file types. But text files and ZIP alone already cover a lot of ground.

I believe the awareness of text files might be the biggest factor in the divide between software developers and regular users. All programming is done in text files. Settings for programming tools are usually changed not in some graphical dialog but by editing a configuration file. I write this article not in Word but in my text editor.

I don’t think you have to do everything in a text editor. But I also don’t think you should be dependent on a specific application for each task. Knowing about file types gives you the freedom to switch applications or even interact with files you have never seen before. It also opens up the world of developer power tools, if you ever want to go there. It just generally increases your ability to effectively and responsibly navigate the modern digital world.