If you’ve already made the choice to transition your career from web development to embedded systems engineering, then just skip ahead to the How to Survive the Transition section. If you’re still on the fence, or perhaps just curious about the main team that makes your Android phone or iPhone fast or slow, then you should read on.
Early Career Goals of a Generalist
Shortly after graduating college, I decided against pursuing a master’s degree in order to focus on gaining real-world technical and leadership experience while freely exploring topics that interest me in my own time. As a result, I promised myself to keep exploring new areas in computer science and avoid falling into the comfort trap that breeds stagnation. More than 5 years have passed and I’m happy to have firmly kept that promise. I’ve learned that a new beginning clears out accumulated bad habits and offers a more mature perspective on new ideas. It also boosts your confidence in tackling novel problems, as repeatedly surviving the learning curve on a new team is no cakewalk. Aside from that, you get exposure to different people, products, technologies, release cycles, skill sets and expectations. Nowadays, though, I have shifted to playing the role of a specialist in one area after completing my 5-year generalist phase.
What Drove Me Away From Web Development as a Career
I’m not bashing web engineering in general, but rather cautioning against planning to spend your entire engineering career on building websites for consumer facing products. This career path involves applying the latest web development frameworks and tools for rendering a payload of backend data according to a predefined mockup design (usually created by UX designers). The two dangers for this approach are:
- Most of the heavy computations and algorithms are already done in the backend so a deep understanding of computer science is not needed in general.
- Web development tools and frameworks are constantly becoming simpler and easier to use which makes acquiring that skill set more accessible for non-computer science and engineering types. Consequently, a UX or HCI Designer can easily learn web development and extend their indispensable responsibilities in the web design arena thus replacing the software engineer for that task. Web development tools are on a trajectory to be as accessible to designers as Adobe Photoshop and Illustrator are.
Nowadays, I’m enjoying the more user-friendly nature of web development tools (the MEAN stack, Bootstrap and Heroku) to rapidly turn ideas into beautiful websites (without worrying whether they’re compatible with IE6).
In a Nutshell, What Is Systems Engineering?
The word ‘systems’ is very generic, so I will limit its scope to the collection of hardware components in a power-restricted computing device (e.g. a mobile device). I will break down the various systems engineering roles based on how focused they are on the interworkings of a group of components (macro) as opposed to the inner workings of an individual component (micro).
- Bringing up a device in a new system: Porting the subsystem code of each component into the operating system and implementing the code needed to wire the components together so the system boots properly. For example, to plug LTE functionality into a phone, you’ll have to:
  - Port the modem’s device driver code and add it to the device’s kernel.
  - Fetch the modem’s firmware code from the hardware vendor.
  - Fetch the modem vendor’s code or pre-compiled libraries to:
    - Flash the firmware code during device startup.
    - Kickstart the modem’s initialization process.
  - Implement the system-specific interface for radio communication (e.g. the Android RIL interface).
- Identifying performance bottlenecks in an existing system: Are we constantly maxing out the CPU cycles (i.e. CPU bound) or are we always waiting for the disk to finish reading/writing data (i.e. I/O bound)? Are we running low on memory? Is the CPU throttled due to an improper thermal setup, or is it just heating up due to excessive utilization because some app is tight looping in the background? A huge part of this role involves creating, porting and using code to monitor the system’s behavior (e.g. the Systrace tool for Android).
- Owning the system level code for a subsystem: A subsystem could be any individual hardware component (e.g. wifi, camera, usb…etc.) or general processing component (e.g. CPU scheduler, Memory allocator, flash block allocator…etc.). This includes understanding the subsystem’s inner workings while staying on top of its latest developments.
- Facilitating the subsystem’s integration by any other systems engineer who is bringing up a new device: This involves producing documentation, cookbooks and useful scripts for rapid development. For example: exploring and documenting the variety of disk block allocators, filesystems, CPU schedulers and writing recipes and benchmarking tools to determine the most relevant candidates for a specific new system.
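As a rough illustration of the CPU-bound versus I/O-bound question in the list above, the sketch below compares two snapshots of the aggregate `cpu` line from Linux’s `/proc/stat` and reports what share of the elapsed time went to doing work versus waiting on I/O. The function names and the 50% threshold are assumptions made for this example, not part of any tool mentioned in this post, and the iowait counter is only a coarse hint on multi-core systems.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Assumes a kernel that reports at least these eight counters.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate cpu line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def classify(before, after, threshold=0.5):
    """Compare two snapshots and guess where the time went.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    iowait_share = delta["iowait"] / total
    busy_share = (total - delta["idle"] - delta["iowait"]) / total
    if busy_share >= threshold:
        verdict = "cpu-bound"
    elif iowait_share >= threshold:
        verdict = "io-bound"
    else:
        verdict = "mostly idle"
    return verdict, busy_share, iowait_share


def read_snapshot(path="/proc/stat"):
    """Read the live counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

On a Linux device you would call `read_snapshot()` twice, a few seconds apart, and pass both results to `classify()`. A tracing tool gives a far more detailed picture, but this kind of quick check is often enough to decide where to look first.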
What Drew Me Into Systems Engineering
Aside from the reasons I described in the Early Career Goals of a Generalist section, I decided to specifically pursue operating systems engineering for the following reasons:
- An increased interest in mastering the fundamentals of computation in order to be a better programmer.
- A desire to improve my general system design skills, which are crucial for senior engineers.
- A strong interest in the increased adoption of IoT (Internet of Things) devices, which use their own miniature embedded operating systems.
- I had heard about a lot of phones with stellar specs but terrible performance compared to much older phones, so I wanted a better understanding of hardware limitations and how clever software design can offset them (more on that in the following section).
- A fascination with continued learning: this was my chance to master a new, notoriously challenging topic with very little prior background knowledge.
Lessons Learned From My 3-Year Journey in Systems Engineering
Performance trumps cleaner design in large scale systems
Maximizing a system’s performance often comes at the expense of a simple design. At times, the code may prove challenging to digest and the design may have special implementations for tricky edge cases (e.g. ext4 filesystem stores files in 3 completely different mechanisms depending on the size of the file to reduce wasted space).
Case in point in real-world systems:
- Despite their object-oriented design, kernels are still written in C rather than C++ to avoid the performance overhead of extra C++ boilerplate code.
- MINIX is an operating system that advocates for a microkernel, which separates device drivers from the core kernel so that a malfunctioning driver does not crash the kernel. Despite being hugely influenced by MINIX, Linux opted for a monolithic design to avoid the performance penalty of separating the two. MINIX is taught in college-level operating systems courses due to its simple and clean design, yet it’s hardly relevant outside of academia, unlike Linux (the kernel underneath Android). Read here for further comparison.
- Browser-based apps (i.e. HTML5) are not (yet) the future, as opposed to OS-specific apps (e.g. Android and iPhone apps). Native apps enjoy direct access to the operating system, which is why they currently perform better than browser-based apps. The idea of writing code ONCE for a website and having it run on every device is the dream of every developer, but it won’t be fulfilled until the performance gap between the two is bridged. Firefox OS was one initiative to address this problem, but it failed to win the battle.
Stellar hardware specs and good benchmarks are poor performance indicators
As a systems engineer your role is to first understand the constraints of your system before blindly cranking turbo mode on every subcomponent. For a typical smartphone, the main constraints are:
- Number of hours of battery life with normal usage.
- Peak temperature after which performance throttling kicks in so you don’t burn your hands or pockets.
- Expected longevity of the various parts (usually battery and storage are among the first to experience degradation).
No point in getting an 8 core processor if activating all 8 cores will heat the device to intolerable levels. No point in ordering a 256GB flash storage with only 32GB partitioned space (just to increase its longevity) if the device needs to be replaced every 2 years. Those specs end up being good marketing material.
Here are some examples that show the disconnection between specs/benchmarks and actual performance:
- Benchmarks don’t tell the full story: F2FS is a flash-friendly filesystem that differs from conventional filesystems (e.g. ext4) in that it finds and allocates a new empty segment on the flash partition for every disk write, regardless of whether we’re modifying an existing file chunk or writing a new file. In effect, a scattered write I/O job is converted to a much quicker sequential write. While this sounds like a win for a fresh system with lots of contiguous empty space, it turns out to be a nightmare for an older partition where contiguous empty chunks are scarce. As a result, write speed degrades significantly as the device ages, whereas conventional filesystems stay fairly consistent, so simply comparing write speed benchmarks does not tell the full story.
- Poorly optimized system libraries can slow down the whole system even with a stellar CPU: ‘memcpy’ is a library call that copies an arbitrary number of bytes of memory from one place to another. It is called heavily in large programs, so its execution speed directly affects system performance. It works by repeatedly copying memory from the source address to the destination address in fixed-size chunks (e.g. 8 bits, 32 bits, etc.). A generic one-size-fits-all implementation might not make use of wider memory buses that could carry larger chunks and will thus miss out on speeding up the system. See here for a good analysis.
- Badly configured hardware can slow down good hardware: Flash storage is divided into large chunks called erase blocks, which in turn are divided into pages. After running out of pages, an erase block needs to be erased before it allows further writes, and that erase operation is very slow. If the files on the partition are not aligned on erase blocks, then file chunks will likely overlap extra erase blocks, causing additional writes and erases that slow down storage write speed.
- Good software can speed up slow/cheap hardware: Consider a smartwatch’s cheap flash storage with slow read speeds. One way to potentially speed up disk reads is to compress the contents. If the processor can decompress quickly enough, then the apparent I/O read speed improves because fewer bytes have to be read from storage. Such partitions are formatted with filesystems like squashfs.
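To make the erase block alignment point above concrete, here is a small sketch that counts how many erase blocks a single write touches depending on where it starts. The 128 KiB erase block size and the function name are assumptions chosen for illustration; real flash parts vary and report their own geometry.

```python
ERASE_BLOCK = 128 * 1024  # assumed erase block size in bytes; varies per part


def erase_blocks_touched(offset, length, block_size=ERASE_BLOCK):
    """Return how many erase blocks the byte range [offset, offset + length) overlaps."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


# A write the size of one erase block, first aligned and then shifted by 4 KiB.
aligned = erase_blocks_touched(0, ERASE_BLOCK)
misaligned = erase_blocks_touched(4096, ERASE_BLOCK)
print(aligned, misaligned)  # prints "1 2"
```

The shifted write drags a second erase block into every update of that chunk, which is the kind of extra write and erase work that an aligned partition layout avoids.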
How to Survive the Transition
One big mistake I made while getting up to speed in embedded systems engineering was diving into the theoretical side of the kernel for extended periods of time before getting hands-on experience with a real system. If I were to go back in time and restart my journey, I would attack it in the following manner:
- If you’ve never developed any system level application before then I would suggest you start with Robert Love’s Linux System Programming book first. The book shows you how to develop programs that talk directly to the kernel so I highly suggest you pick up one example program and try to implement it yourself as you read on. One good example that would encompass a lot of concepts from the book would be creating your own simplified dummy system daemon coupled with a program that talks with that daemon. Whatever you do, make sure that you’re writing code otherwise I promise none of what you read will stick.
- Walk through a hands-on embedded systems class. An excellent one is the embedded systems class provided by free-electrons (since renamed Bootlin). If you can’t afford the live training, they have graciously provided all the slides and practical lab documents for free so you can go through it on your own.
- Learn the basics of hardware buses (e.g. I2C, SPI, UART, SCSI, etc.). Every embedded system uses a combination of them to connect its components together, so it helps to understand how they differ in performance, power and limitations. You can simply read about them here. Moreover, if you’ll be adding components to your embedded system, then I highly recommend going through an Arduino crash course, as it also teaches you how to read hardware schematic diagrams.
- Now that you have a basic applied knowledge in systems programming and embedded systems, it’s time to learn the core kernel fundamentals. My recommended book on this subject is Robert Love’s Linux Kernel Development. I would suggest skimming through the whole book first to get a feel of what’s coming up before taking a more thorough deep dive on it. You will find yourself always coming back to that book whenever you’re dealing with a new subsystem so keep it handy.
- Deep dive on a development device that you actually like (e.g. a BeagleBone or an Android device). This involves downloading the platform and kernel source code, building it and flashing the device with your custom builds. To get started with Android I recommend the following:
  - Pick your favorite Nexus/Pixel development device. I would stay away from non-Google devices, as most Android systems documentation is geared towards Google devices. In addition, the devices are bloatware-free and the bootloaders can be unlocked, which makes it possible to flash your own kernels (i.e. boot images).
  - A great book that helped me get a better understanding of the Android platform (even while working for the Android Systems Team) is Yaghmour’s Embedded Android.
  - Download and build the Android Open Source Project code for your device. See https://source.android.com/source/downloading for instructions. Yaghmour’s book will walk you through the instructions, but it is a bit dated at this point, so it’s good to treat the official code and documentation as the source of truth.
- Now would be the time to write some kernel code to solidify all the previous concepts. The majority of the kernel code consists of device driver code, so it’s imperative to understand how drivers work. A great hands-on device driver development course is also offered by free-electrons. As with the embedded systems training, they provide all the slides and training material for free so you can do it on your own.
- Now that you have theoretical and applied kernel knowledge, along with the ability to flash custom kernel and platform builds on your favorite device, it’s time to pick a subsystem (e.g. storage, memory, CPU, display, etc.) to master. Reread the related sections in the kernel book, read the relevant kernel and platform code, research the tools and benchmarks that measure its performance, and tweak that subsystem to get a feel for how it works. For example, if you pick storage, then you can:
  - Learn how to use benchmarking tools such as fio.
  - Try flashing different filesystems on your device partitions and measure how I/O speed changes as a result.
  - Turn on kernel tracing at the block and filesystem level and see what kind of operations the kernel performs under different settings.
  - Read the VFS code in the kernel and the code for different filesystems (e.g. ext4 vs f2fs).
  - If you’re ambitious, try writing your own filesystem from scratch. There are a few simplified ones out there for educational purposes.
- Finally, refer to lwn.net for a plethora of up-to-date kernel-related documentation and discussions.
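For the dummy daemon exercise suggested in the first step above, a starting point might look like the sketch below. It is written in Python rather than the C used in Love’s book, but it leans on the same socket, bind, listen and accept system calls. The request names (`ping`, `pid`, `quit`) and the helper functions are made up for this example, and a background thread stands in for a separate daemon process so the whole thing fits in one file.

```python
import os
import socket
import tempfile
import threading


def serve(server_sock):
    """Answer one short request per connection until a client sends 'quit'."""
    running = True
    while running:
        conn, _ = server_sock.accept()
        with conn:
            # Requests are tiny, so a single recv is assumed to be enough here.
            request = conn.recv(64).decode().strip()
            if request == "ping":
                conn.sendall(b"pong\n")
            elif request == "pid":
                conn.sendall(f"{os.getpid()}\n".encode())
            elif request == "quit":
                conn.sendall(b"bye\n")
                running = False
            else:
                conn.sendall(b"unknown\n")


def ask(path, request):
    """Client side: connect, send one request, return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(request.encode() + b"\n")
        return client.recv(64).decode().strip()


def start_daemon(path):
    """Bind and listen before returning so clients can connect right away."""
    server_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server_sock.bind(path)
    server_sock.listen(1)
    worker = threading.Thread(target=serve, args=(server_sock,), daemon=True)
    worker.start()
    return server_sock, worker


if __name__ == "__main__":
    socket_path = os.path.join(tempfile.mkdtemp(), "dummy.sock")
    server_sock, worker = start_daemon(socket_path)
    print(ask(socket_path, "ping"))
    print(ask(socket_path, "quit"))
    worker.join()
    server_sock.close()
    os.unlink(socket_path)
```

A natural next step is to rewrite it in C as two separate programs, have the server detach from the terminal, and handle partial reads and signals properly, which is where most of the book’s material comes into play.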
Would you like me to provide you with regular feedback and advice on your journey to transition into systems engineering?
Then head to my page on Mindcrumb to start a new systems engineering journey and add me as a reviewer. You’ll be able to share your progress with me (and your friends) and I will provide you with more detailed personal advice and feedback as you progress (FREE).