Creating the Original Xbox Boot Sound: Using “old school” game audio techniques in a modern console

Brian Schmidt
Oct 16, 2021
Updated: Apr 29, 2024

It’s hard to believe it’s been 20 years since the original Xbox hit the store shelves on November 15, 2001. As a member of the original Xbox design team, one of the most fun parts for me was creating the boot sound, which ended up having some very surprising challenges.

In case you need a refresher, here is the original Xbox Boot sequence: https://www.youtube.com/watch?v=E1ebJZUOtL8

One of the very last things we did when designing the original Xbox was create the startup sequence. The purpose of the startup sequence is more than just to be entertaining and serve as the audio/visual brand of the device. There is a practical component as well: hiding the boot time. One of the design and PR precepts of the Xbox was that it was not a PC; PCs at the time were associated with long boot times (a minute or more), blue screens of death and the like. We wanted something that would be entertaining from the moment you pressed the “Start” button, and that felt like a consumer device that instantly turned on.

But the system still has to ‘boot’ itself—the disk drives have to spin up and systems have to initialize. The Xbox hardware and software team went to great efforts to make the boot time fast. In the end they got it down to about 8 seconds. And because of how they designed the bootup software, we could display a cool visual animation during those 8 seconds, ‘hiding’ the boot time by booting while showing the opening visuals.

When it came to sound, though, we had a big challenge.

Usually, creating sound for visuals is straightforward: get a copy of the video, bring it into your favorite DAW (Digital Audio Workstation), use whatever studio software or hardware you have to create a cool sound synced to the visual, render the wave file and you’re done. However, for the Xbox, we couldn’t do that. During the boot sequence, the only memory the system could access was a small “boot ROM” on the motherboard, which stored the Xbox OS kernel as well as the opening visual sequences. That ROM was only 256 kilobytes, and after accounting for the OS kernel and visuals there were only about 25 kilobytes of room left. If you do the math, 25 kilobytes gets you barely half a second of 8-bit mono audio. So, creating the boot sound in a DAW wasn’t an option.
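The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the 48 kHz rate is an assumption (the paragraph does not name a sample rate, though 48 kHz is the MCPX output rate mentioned later), and the function name is made up for this example.

```python
def seconds_of_audio(num_bytes, sample_rate_hz, bits_per_sample=8, channels=1):
    """Return how many seconds of uncompressed PCM audio fit in num_bytes."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return num_bytes / bytes_per_second

# "About 25 kilobytes"; using 25 * 1024 instead changes the result only slightly.
ROM_BUDGET_BYTES = 25 * 1000

at_48k = seconds_of_audio(ROM_BUDGET_BYTES, 48_000)  # assumed full-rate playback
at_6k = seconds_of_audio(ROM_BUDGET_BYTES, 6_000)    # the 6 kHz rate mentioned later in the article

print(f"{at_48k:.2f} s at 48 kHz, {at_6k:.2f} s at 6 kHz")
```

Even at 6 kHz, 8-bit mono, the whole budget covers only a little over four seconds, so a single pre-rendered recording of an 8-second sequence would not fit either way.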

The question then became how to create an 8-second boot sound using only 25 kilobytes of memory. As luck would have it, I’d spent a good deal of my career in game audio prior to Microsoft doing just that: creating sound and music within extremely limited memory for arcade games and for 16-bit console games on the Genesis (Mega Drive) and Super Nintendo. The entire sound budget for Desert Strike (1992), for example, was less than 128 kilobytes. Back then, we didn’t use wave files; we generated both music and sound effects dynamically, typically from hand-coded note-lists driving an on-board synthesizer chip.

The Xbox had a very powerful sound chip made by nVidia, the MCPX: essentially Pro Tools on a chip, along with a sophisticated wavetable synthesizer that could play 256 concurrent sounds, with a programmable filter and DAHDSR envelope. I modified a version of one of my old console/arcade sound drivers to work with the MCPX. This driver took lists of notes, durations, and parameters as input and sent commands to the chip, which generated the actual sound. The sequences themselves were created as simple .txt files.
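To make the idea of a text note-list concrete, here is a minimal Python sketch of reading such a file and laying its events out in time. Everything about the syntax is hypothetical: the article does not show the real file format, so the command layout (for example `note <pitch> <duration_in_ticks>`), the values, and the function names are assumptions for illustration only, not the actual driver.

```python
# Hypothetical note-list text; the real .txt syntax is not shown in the article.
TRACK_TEXT = """
patch PatchSaw1      # select a sound
volume 100
note 36 48           # assumed layout: note <pitch> <duration_in_ticks>
note 43 24
"""

def parse_track(text):
    """Turn each non-empty line into (command, [arguments]), ints where possible."""
    events = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        command, *args = line.split()
        parsed = [int(a) if a.lstrip("-").isdigit() else a for a in args]
        events.append((command, parsed))
    return events

def schedule(events):
    """Attach a start tick to every event; in this sketch only 'note' consumes time."""
    tick = 0
    timeline = []
    for command, args in events:
        timeline.append((tick, command, args))
        if command == "note":
            tick += args[1]
    return timeline

timeline = schedule(parse_track(TRACK_TEXT))
for entry in timeline:
    print(entry)
```

A real driver would send each scheduled event to the synthesizer hardware at its tick rather than printing it.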

The visual sequence was finished first, so I used a hand-held camcorder to record the opening sequence running on a prototype console, then went through the video frame by frame to get the timing of the most important visual elements.

Column “E” is the “system tick” on which each event occurred. A system tick is the unit of a quirky timing system that my sequencer used to control timing and durations.

With the sequencer created and the visual timings in hand, I was ready to create the boot sound.

From the beginning of Xbox, we wanted to emphasize its power. One of the design phrases that sticks in my mind from that time is “immense power striving to break forth into your living room.” I’m not sure that was ever an official design statement, but in creating the boot sound, it was what I had in mind, and it matched what the visuals did as well.

First, I needed to define a sound palette; something that would let me create the boot sound from constituent components and that let me express this power ‘breaking forth into your living room.’ However, I also knew that these audio components had to be exceptionally small; I couldn’t exactly store a 5.1 recording of a thunderbolt in my 25 kilobytes of memory.

I wanted sounds that were rich in harmonics, but could be generated on the fly, so I wrote some simple code to create a few very useful waveforms: white noise, sine and sawtooth waves. Since the output of the MCPX was 48 kHz, these waveforms were full fidelity, 24-bit. The best thing about those waveforms was that because they were generated by code, they required almost none of the precious ROM memory.
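As an illustration of how little is needed to generate such waveforms, here is a short Python sketch. It is not the original code: the function names are invented, the samples are floats in the range -1.0 to 1.0 rather than 24-bit integers, and the 256-sample table size is borrowed from the looping sawtooth described later in the article.

```python
import math
import random

TABLE_SIZE = 256  # assumed; matches the 256-sample sawtooth mentioned later

def sine_table(n=TABLE_SIZE):
    """One cycle of a sine wave."""
    return [math.sin(2 * math.pi * i / n) for i in range(n)]

def sawtooth_table(n=TABLE_SIZE):
    """One cycle of a ramp from -1.0 up to just under +1.0."""
    return [2 * i / n - 1 for i in range(n)]

def white_noise(n=TABLE_SIZE, seed=0):
    """Uniform random samples; a fixed seed keeps the output repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

saw = sawtooth_table()
print(saw[0], saw[128], saw[255])
```

A table like this costs only the few bytes of code that build it. Looped at 48 kHz without any pitch change, a 256-sample cycle repeats 187.5 times per second, and a wavetable synthesizer reaches other pitches by stepping through the table faster or slower.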

But I knew that if I relied purely on simple waveforms like saws and triangles, the sound would have a certain ‘chip-tune’ character to it, which was most definitely not what we were going for. To augment the synthesized waves, I recorded a few 8-bit sounds (8-bit to keep the memory usage minimal), concentrating on the attacks of the sounds. By downsampling the sounds to a horrifyingly low 6 kHz sampling rate, I was able to squeeze three very short sounds into the 25 kilobytes: a thunder sound, a cannon attack and the attack portion of a glockenspiel. To increase the high end of the low-fidelity samples, I wrote some code to resample them to 48 kHz and deliberately distort them via clipping, which sort of worked. I was also able to create a fourth wave, ‘reverse thunder’, by using code to reverse the thunder sound in memory. You can hear the reverse thunder as part of the lead-in to the big green flash about 6 seconds in.
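The three operations in that paragraph (resampling, clipping, reversing) can be sketched as follows. This is a simplified Python illustration under stated assumptions: the article does not say which resampling method was used, so linear interpolation stands in for it, the gain value is arbitrary, and all function names are hypothetical.

```python
def upsample_linear(samples, factor):
    """Upsample by an integer factor using linear interpolation (assumed method)."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])
    return out

def hard_clip(samples, gain, limit=1.0):
    """Boost by gain, then flatten anything beyond +/- limit."""
    return [max(-limit, min(limit, s * gain)) for s in samples]

def reverse(samples):
    """Play the sample backwards, as with the 'reverse thunder'."""
    return samples[::-1]

low_rate = [0.0, 0.6, -0.9, 0.3]          # stand-in for a 6 kHz recording
full_rate = upsample_linear(low_rate, 8)  # 6 kHz * 8 = 48 kHz
brightened = hard_clip(full_rate, gain=2.0)
backwards = reverse(brightened)
print(len(low_rate), len(full_rate))
```

A 6 kHz recording cannot contain anything above 3 kHz. Clipping after the move to 48 kHz flattens the peaks, and those sharp corners add new harmonics above that limit, which is why distortion can make a dull sample seem brighter.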

One nice thing about combining synthesized waveforms with digitized ones is that I could get the full fidelity of the synthesized waveforms (48 kHz sampling rate, 24-bit) combined with the punch of the digitized but far lower fidelity sounds.

Here’s one of the tracks from the boot sequence. The opening, low-pitched “wwwwaaaaaaa” at the beginning of the boot sequence is a 256-sample looping sawtooth wave. The wave is sent through a low-pass filter which slowly opens up. This track selects the patch (PatchSaw1), sets the volume and sets the low-pass filter parameters. The “note” command initiates the sound. As the note plays, the ‘finc’ command is used to gradually increase the cutoff frequency of the low-pass filter from about 300 Hz to about 3 kHz. Another loop then closes the filter slowly again, resulting in the wwwaaaaaauummmm sound which starts the boot sound.

The direction for the boot sequence was the notion of immense power striving to break free from the confines of the box. Increasing the cutoff frequency of the low-pass filter literally puts more energy into the sound itself, so that matched well.
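The filter sweep, and the claim that opening the filter adds energy, can be checked with a small Python sketch. It rests on several assumptions: a simple one-pole low-pass stands in for the MCPX's programmable filter (whose actual design is not described), the 60 Hz pitch and one-second duration are arbitrary, and the sweep is done per sample rather than in steps from a 'finc'-style command.

```python
import math

SAMPLE_RATE = 48_000

def sawtooth(freq_hz, num_samples, sample_rate=SAMPLE_RATE):
    """Naive sawtooth oscillator, ramping from -1.0 to +1.0 each cycle."""
    phase, out = 0.0, []
    for _ in range(num_samples):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq_hz / sample_rate) % 1.0
    return out

def swept_lowpass(samples, start_hz, end_hz, sample_rate=SAMPLE_RATE):
    """One-pole low-pass whose cutoff glides exponentially from start_hz to end_hz."""
    out, y = [], 0.0
    n = len(samples)
    for i, x in enumerate(samples):
        t = i / max(n - 1, 1)
        cutoff = start_hz * (end_hz / start_hz) ** t
        alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff / sample_rate)
        y += alpha * (x - y)  # move the output a fraction of the way toward the input
        out.append(y)
    return out

def rms(samples):
    """Root-mean-square level, a simple measure of signal energy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

dry = sawtooth(60.0, SAMPLE_RATE)            # one second of a low sawtooth
wet = swept_lowpass(dry, 300.0, 3000.0)      # cutoff opens from 300 Hz to 3 kHz
quarter = len(wet) // 4
print(rms(wet[:quarter]), rms(wet[-quarter:]))
```

The RMS level of the last quarter comes out higher than that of the first quarter, because more of the sawtooth's upper harmonics pass through as the cutoff rises.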

The visuals also had some explosive, flashing elements. To create those, I used the digitized thunder and cannon sounds, layered with filtered white noise that had a long release time. That provided the realism of the thunder, but using the white noise let me create the illusion that the thunder was longer than it actually was. And the white noise was at full 48 kHz fidelity, so that masked the otherwise atrocious fidelity of the distorted thunder sample itself.
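The layering idea can be illustrated with a few lines of Python. This is a rough sketch, not the original patch: the short list stands in for the digitized thunder, the noise is left unfiltered for brevity, the fade is a plain linear ramp rather than a real envelope release, and the function names are invented.

```python
import random

def fading_noise(num_samples, seed=0):
    """White noise whose gain falls linearly from 1.0 toward 0.0 (a simple 'release')."""
    rng = random.Random(seed)
    return [(1.0 - i / num_samples) * rng.uniform(-1.0, 1.0) for i in range(num_samples)]

def layer(*sounds):
    """Sum sounds sample by sample, treating shorter ones as silent after they end."""
    length = max(len(s) for s in sounds)
    return [sum(s[i] for s in sounds if i < len(s)) for i in range(length)]

short_hit = [0.9, -0.7, 0.5, -0.3]                   # stand-in for the brief thunder attack
combined = layer(short_hit, fading_noise(48_000))    # one second of tail at 48 kHz
print(len(short_hit), len(combined))
```

The sampled attack is over almost immediately, but the combined sound lasts as long as the noise tail, which is the "longer than it actually was" effect described above.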

The fast, tinkling hi-hat-like notes are actually very short filtered white noise notes, with a fast attack and decay. By narrowly filtering them and playing them at different pitches they take on an almost metallic character.

The bubbly sound in the opening section is a low-pitched triangle wave, but with an extreme pitch LFO, which gives it that organic, warbly, bubbly sound.
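A pitch LFO simply wobbles the oscillator's frequency up and down over time. The Python sketch below shows the idea; the base pitch, LFO rate and depth are made-up values, since the article gives no numbers for this patch, and the function name is hypothetical.

```python
import math

def triangle_with_pitch_lfo(base_hz, lfo_hz, depth_semitones, num_samples,
                            sample_rate=48_000):
    """Triangle oscillator whose pitch is swung by a sine-shaped LFO."""
    phase, out = 0.0, []
    for i in range(num_samples):
        lfo = math.sin(2.0 * math.pi * lfo_hz * i / sample_rate)  # -1.0 .. +1.0
        freq = base_hz * 2.0 ** (depth_semitones * lfo / 12.0)    # bend pitch in semitones
        out.append(4.0 * abs(phase - 0.5) - 1.0)                  # +1 at phase 0, -1 at phase 0.5
        phase = (phase + freq / sample_rate) % 1.0
    return out

# Assumed values: an 80 Hz tone swung by +/- 7 semitones, 6 times per second.
bubbly = triangle_with_pitch_lfo(80.0, 6.0, 7.0, 4_800)
print(len(bubbly), min(bubbly), max(bubbly))
```

Raising `depth_semitones` or `lfo_hz` makes the warble more extreme, while a depth of zero gives a plain steady triangle wave.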

The little jingle at the end combines the glockenspiel attack sample with sine waves played concurrently at the same pitch as the glockenspiel, with a slow decay on the sines to extend the duration, since the attack sample itself was so short.

The final sequence used 9 tracks of sound: digitized and synthesized waveforms played back at various pitches and times, with DSP processing controlled as the sound played. Put it all together and you have the original Xbox boot sequence.

Normally something like a boot sound would go through multiple levels of approval: marketing, executives, etc. But due to the extreme time pressure (this was one of the very last things done on the Xbox before it was finalized for production), we didn’t really have any of that; we were just happy it made sound! If I recall, I did two revisions after the initial concept, which were mainly small tweaks, including the addition of the jingle at the end.

The only bit of contention on the boot sound was how loud it should be! Back then, there weren't any loudness standards, or even recommended practices, for console games. The PS2 had a very loud boot sound (0 dBFS). But I had deliberately created the sound to be softer than that: around -18 dBFS peak in its first version. The reason for the softer boot sound? One additional thing the boot sound is used for is a poor man's volume calibration. If the boot sound is REALLY LOUD, people will naturally turn down the volume of their TV or stereo. This has the unfortunate byproduct of pushing game developers to make their games loud to match the loudness of the boot sound. If their games have to be loud, there's no headroom for the REALLY exciting moments to get loud.

By having a softer boot sound, the thinking was, we would encourage people to turn UP their TV volumes. This would let game developers create games with a greater dynamic range, letting them save their really loud sounds for truly exciting moments in their game. But the Xbox marketing people felt our softer boot sound wasn't as exciting as the louder PS2 boot sound, and they wanted it to be LOUD as well.

In the end we sort of compromised, and if I recall, the Xbox boot sound peaks at around -12 dBFS.
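For readers less used to dBFS, the peak levels in this discussion can be converted to fractions of full scale with the standard amplitude formula (20 times the base-10 logarithm of the ratio). The small Python sketch below does that conversion; the function names are just for this example.

```python
import math

def dbfs_to_linear(db):
    """Convert a peak level in dBFS to a fraction of full scale (1.0 = 0 dBFS)."""
    return 10 ** (db / 20)

def linear_to_dbfs(amplitude):
    """Convert a fraction of full scale back to dBFS."""
    return 20 * math.log10(amplitude)

for level in (0, -12, -18):
    print(f"{level:>4} dBFS -> {dbfs_to_linear(level):.3f} of full scale")
```

So a -18 dBFS peak is roughly one eighth of full scale and -12 dBFS roughly one quarter; each 6 dB step is very close to a doubling or halving of amplitude.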

I also had the rather humbling experience of being the cause of an early production problem! On the early factory runs, there was a report that every so often (maybe one in a thousand boots) an Xbox wouldn’t boot the first time, and would have to be restarted. It turned out there was a bug in my audio code that, if the timing was juuust right, would cause the whole system to crash. Fortunately, one of the razor-Sharp programmers on the team was able to find and fix my boot sound bug, and production could continue. [EDIT: enough people have asked me what the bug was that I'm adding it here. Warning! Wonky explanation follows!]

Recall that I had ported my old arcade/console driver to the Xbox. The arcade systems and consoles had integer-only CPUs; no floating point. That means they could only operate on integers (e.g. 2, 3, 4, -44), but not numbers like "3.14159." But the Xbox processor does have an "FPU", a floating point unit. So in one tiny piece of code that dealt with timing, because it was easier, I used a tiny bit of floating point code. That code was inside an interrupt routine on the Xbox's CPU, and it turns out that on that CPU you are NOT supposed to do any floating point processing in an interrupt routine or, if the timing is juuuust right, it will cause all sorts of problems. The programmer referenced above, Tracy Sharpe, tracked down the issue and rewrote that little bit of code to not use floating point, which eliminated the problem.
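The actual fix is not shown, but a common way to do fractional timing without floating point is fixed-point arithmetic: keep the fraction in the low bits of an integer. The Python sketch below illustrates that general technique only; the 16.16 format, the tick ratio and the function names are assumptions, and this is not the code from the Xbox driver.

```python
FRAC_BITS = 16  # assumed 16.16 fixed-point format: 16 integer bits, 16 fraction bits

def to_fixed(numerator, denominator):
    """Represent numerator/denominator as a fixed-point integer (rounded down)."""
    return (numerator << FRAC_BITS) // denominator

def whole_ticks_after(num_interrupts, ticks_per_interrupt_fixed):
    """Add the fixed-point step once per interrupt using integer math only."""
    accumulator = 0
    for _ in range(num_interrupts):
        accumulator += ticks_per_interrupt_fixed
    return accumulator >> FRAC_BITS  # discard the fraction bits

# Hypothetical rate: the sequencer advances 3 ticks for every 4 timer interrupts.
step = to_fixed(3, 4)
print(step, whole_ticks_after(8, step))
```

Ratios that are not exact in binary (one third, for example) get rounded down slightly, so real code has to decide how much precision it needs; the benefit is that only integer additions and shifts run inside the interrupt routine.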

At the time, I got a lot of questions about why we didn’t create the boot sound in 5.1. Although it is commonplace today, the Xbox was the first console to enable real-time interactive digital surround sound in games, and Dolby was very prominently featured in the Xbox marketing material. So why is the boot sound only stereo?

Digital audio receivers of that era supported digital surround via their digital optical input. As it happens, a receiver starts out assuming any digital signal it gets is a regular stereo signal. If it detects a Dolby Digital signal, it switches itself into Dolby Digital mode. The problem was that that detection takes 2-4 seconds, and the receiver mutes itself during the changeover. By that time, the boot sound would be half over. If we had turned on 5.1 for the boot sound, the first few seconds of it would be totally silent. So, we had to go, unfortunately, with just a stereo boot sound.

Sometimes, despite having all sorts of tech at our disposal, we have to rely on 'old school' techniques to solve creative game audio challenges. You might say, then, in the most literal sense of the word, the Xbox boot sound is a 'chip tune!'

Brian Schmidt is a 34-year veteran of the game audio industry. He worked for 10 years at Microsoft as the program manager for the Xbox and Xbox 360 audio systems and created the Xbox startup sound. Currently he’s an independent composer, sound designer and educator, and is the founder and Executive Director of GameSoundCon.
