I keep being asked about automated bootloader testing, and the phrase which crops up is “SD mux” – hardware to multiplex access to an SD card (typically microSD). Each time, the questioner comes up with a simple solution which can be built over a weekend. So I’ve decided to write out the actual objective, requirements and constraints, to illustrate that this is not a simple problem and that the solution needs to be designed to a fully scalable, reliable and maintainable standard.

The objective

Support bootloader testing by allowing fully automated tests to write a
custom, patched bootloader to the principal boot media of a test device
and hard reset the board. If the bootloader fails to boot the device,
recover automatically by switching the media from the test device to a
known working support device, which has full write access to overwrite
everything on the card and write a known working bootloader.

The environment

100 test devices, one SD mux each (potentially), in a single lab with support for all or any to be switched simultaneously and repeatedly (maybe a hundred times a day to and fro) with 99.99% reliability.
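To put that figure in context, here is a rough back-of-envelope sketch. It assumes that 99.99% means the chance that any single switch succeeds, that each of the 100 devices really is switched 100 times a day, and that failures are independent; none of that is stated more precisely above, and the function names are purely illustrative.

```python
def expected_failures_per_day(devices, switches_per_device, reliability):
    """Expected number of failed switch operations across the lab in one day."""
    return devices * switches_per_device * (1.0 - reliability)


def chance_of_clean_day(devices, switches_per_device, reliability):
    """Probability that every switch in the lab succeeds for a whole day,
    assuming each switch fails independently of the others."""
    return reliability ** (devices * switches_per_device)


if __name__ == "__main__":
    # Assumed workload: 100 devices, 100 switches each, 99.99% per-switch success.
    print(round(expected_failures_per_day(100, 100, 0.9999), 3))
    print(round(chance_of_clean_day(100, 100, 0.9999), 3))
```

Under those assumptions, even 99.99% still means about one failed switch somewhere in the lab per day, and only roughly a 37% chance of getting through a day with no failures at all. Anything less reliable than that target generates a steady stream of false test failures.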

The history

First attempt was a simplistic solution which failed to operate reliably. Next attempt was a complex solution (LMP) which failed to operate as designed in a production environment (partially due to a reliance on USB) and has since suffered from a lack of maintenance. The most recent attempt was another simplistic solution which delivered three devices for test with only one usable and even that became unreliable in testing.

The requirements

(None of these are negotiable and all are born from real bugs or real failures of previous solutions in the above environment.)

  1. Ethernet – yes, really. Physical, cat5/6 RJ45 big, bulky, ugly gigabit ethernet port. No wifi. This is not about design elegance, this is about scalability, maintenance and reliability. Must have a fully working TCP/IP stack with a stable and reliable DHCP client. Stable, predictable, unique MAC addresses for every single board - guaranteed. No dynamic MAC addresses, no hard coded MAC addresses which cannot be modified. Once modified, the MAC address must persist across reboots.
  2. No USB involvement – yes, really. The server writing to the media to recover a bricked device usually has only 2 USB ports but supports 20 devices. Powered hubs are not sufficiently reliable.
  3. Removable media only – eMMC sounds useful but these are prototype development boards and some are already known to intermittently fry SD card controller chips causing permanent and irreversible damage to the SD card. If that happened to eMMC, the entire device would have to be discarded.
  4. Cable connections to the test device. This is a solved problem, the cables already exist due to the second attempt at a fix for this problem which resulted in a serviceable design for just the required cables. Do not consider any fixed connection, the height of the connector will never match all test device requirements and will be a constant source of errors when devices are moved around within the rack.
  5. Guaranteed unique, permanent and stable serial numbers for every device. With 100 devices in a lab, it is absolutely necessary that every single one is uniquely addressable.
  6. Interrogation – there must be an interface for the control device to query the status of the SD mux and be assured that the results reflect reality at all times. The device must allow the control device to read and write to the media without requiring the test device to acknowledge the switch or even be powered on.
  7. No feature creep. There is no need to make this be able to switch ethernet or HDMI or GPIO as well as SD. Follow the software principle of pick one job and do it properly.
  8. Design for scalability – this is not a hobbyist project, this is a serious task requiring genuine design. The problem is not simple, it is not acceptable to make a simple solution.
  9. Power – the device must boot directly from power-on without requiring manual intervention of any kind and boot into a default safe mode where the media is only accessible to the control device. 5V power with a barrel connector is preferable – definitely not power over USB. The device must raise the TCP/IP control interface automatically and be ready to react to commands as soon as the interface is available.
  10. Software: some logic to prevent queued requests from causing repeated switching without any interval in between, e.g. if the device had to be power cycled.
  11. Ongoing support and maintenance of hardware, firmware and software. Test devices continue to develop and will require further changes or fixes as time goes on.
  12. Mounting holes – sounds obvious but the board needs to be mounted in a sensible manner. Dangling off the end of a cat5 cable is not acceptable.

If any of those seem insurmountable or awkward or unappealing, please go back to the drawing board or leave well alone.
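To make requirements 6, 9 and 10 a little more concrete, here is a minimal sketch of the control logic in Python. It is a simulation, not firmware: the class and method names, the serial number format and the two second minimum interval are all assumptions made for illustration. A real device would have to read the actual switch state back from the hardware rather than trust a variable.

```python
import time
from enum import Enum


class Owner(Enum):
    CONTROL = "control"  # media visible only to the control device (safe default)
    TEST = "test"        # media handed over to the test device


class SdMux:
    """Simulated SD mux: safe default at power-on, status query, rate-limited switching."""

    def __init__(self, serial, min_interval=2.0, clock=time.monotonic):
        self.serial = serial              # hypothetical unique serial number
        self.min_interval = min_interval  # assumed minimum seconds between switches
        self._clock = clock
        self._owner = Owner.CONTROL       # requirement 9: boot into safe mode
        self._last_switch = None

    def status(self):
        """Requirement 6: report who currently owns the media."""
        return {"serial": self.serial, "owner": self._owner.value}

    def switch(self, target):
        """Requirement 10: refuse a switch that follows the previous one too closely.

        Returns True if the media is with `target` afterwards, False if the
        request was rejected because it arrived inside the minimum interval.
        """
        if target is self._owner:
            return True  # already in the requested state, nothing to do
        now = self._clock()
        if self._last_switch is not None and now - self._last_switch < self.min_interval:
            return False  # too soon after the last switch, e.g. a queued request
        self._owner = target
        self._last_switch = now
        return True
```

A caller would check the return value of `switch()` and then confirm with `status()`, never assuming that a request it sent has taken effect. Rejecting a request that arrives too early, rather than silently delaying it, is one possible design choice; queueing with an enforced gap would be another.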

Beyond the absolutes, there are other elements. The device is likely to need some kind of CPU and something ARM would be preferable, Cortex-M or Cortex-A if relevant, but creating a cape for a BeagleBone Black is likely to be overkill. The available cables are short, so the device will need to sit quite close to the test device. Test devices never put the SD card slot in the same place twice or in any location which is particularly accessible. Wherever possible, the components on the device should be commodity parts, replaceable and serviceable. The board should not be densely populated – there is no need for the device to be any particular size and overly small boards tend to be awkward to position correctly once cables are connected. There are limits, of course, so boards which end up bigger than typical test devices would seem excessive.

So these are the reasons why we don’t have automated bootloader testing and won’t have it any time soon. If you’ve got this far, maybe there is a design which meets all the criteria so contact me and let’s see if this is a fixable problem after all.