
First of two posts reproducing some existing content for a wider audience due to delays in removing viewing restrictions on the originals. The first is a bit long... Those familiar with LAVA may choose to skip forward to Core elements of automation support.

A summary of this document was presented by Steve McIntyre at Linaro Connect 2018 in Hong Kong. A video of that presentation and the slides created from this document are available online: http://connect.linaro.org/resource/hkg18/hkg18-tr10/

Although the content is based on several years of experience with LAVA, the core elements are likely to be transferable to many other validation, CI and QA tasks.

I recognise that this document may be useful to others, so this blog post is under CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/legalcode. See also https://creativecommons.org/licenses/by-sa/3.0/deed.en

Automation & Risk

Background

Linaro created the LAVA (Linaro Automated Validation Architecture) project in 2010 to automate testing of software using real hardware. Over the seven years of automation in Linaro so far, LAVA has also spread into other labs across the world. Millions of test jobs have been run, across over one hundred different types of devices: ARM, x86 and emulated. Varied primary boot methods have been used alone or in combination, including U-Boot, UEFI, Fastboot, IoT and PXE. The Linaro lab itself has supported over 150 devices, covering more than 40 different device types. Major developments within LAVA include MultiNode and VLAN support.

As a result of this data, the LAVA team have identified a series of automated testing failures which can be traced to decisions made during hardware design or firmware development. The hardest part of the development of LAVA has always been integrating new device types, and these difficulties arise from issues with hardware design and firmware implementations. The experience of the LAVA lab and software teams has highlighted areas where decisions at the hardware design stage have delayed the deployment of automation or made the triage of automation failures much harder than necessary.

This document is a summary of our experience with full background and examples. The aim is to provide background information about why common failures occur, and recommendations on how to design hardware and firmware to reduce problems in the future. We describe some device design features as hard requirements to enable successful automation, and some which are guaranteed to block automation. Specific examples are used, naming particular devices and companies and linking to specific stories. For a generic summary of the data, see Automation and hardware design.

What is LAVA?

LAVA is a continuous integration system for deploying operating systems onto physical and virtual hardware for running tests. Tests can be simple boot testing, bootloader testing and system level testing, although extra hardware may be required for some system tests. Results are tracked over time and data can be exported for further analysis.

LAVA is a collection of participating components in an evolving architecture. LAVA aims to make systematic, automatic and manual quality control more approachable for projects of all sizes.

LAVA is designed for validation during development - testing whether the code that engineers are producing “works”, in whatever sense that means. Depending on context, this could be many things, for example:

  • testing whether changes in the Linux kernel compile and boot
  • testing whether the code produced by gcc is smaller or faster
  • testing whether a kernel scheduler change reduces power consumption for a certain workload etc.

LAVA is good for automated validation. LAVA tests the Linux kernel on a range of supported boards every day. LAVA tests proposed Android changes in Gerrit before they are landed, and does the same for other projects like gcc. Linaro runs a central validation lab in Cambridge, containing racks full of computers supplied by Linaro members and the necessary infrastructure to control them (servers, serial console servers, network switches, etc.).

LAVA is good for providing developers with the ability to run customised tests on a variety of different types of hardware, some of which may be difficult to obtain or integrate. Although LAVA has support for emulation (based on QEMU), LAVA is best at providing test support for real hardware devices.

LAVA is principally aimed at testing changes made by developers across multiple hardware platforms to aid portability and encourage multi-platform development. Systems which are already platform independent or which have been optimised for production may not be testable in LAVA, or may gain little from it.

What is LAVA not?

LAVA is designed for Continuous Integration not management of a board farm.

LAVA is not a set of tests - it is infrastructure to enable users to run their own tests. LAVA concentrates on providing a range of deployment methods and a range of boot methods. Once the login is complete, the test consists of whatever scripts the test writer chooses to execute in that environment.

LAVA is not a test lab - it is the software that can be used in a test lab to control test devices.

LAVA is not a complete CI system - it is software that can form part of a CI loop. LAVA supports data extraction to make it easier to produce a frontend which is directly relevant to particular groups of developers.

LAVA is not a build farm - other tools need to be used to prepare binaries which can be passed to the device using LAVA.

LAVA is not a production test environment for hardware - LAVA is focused on developers and may require changes to the device or the software to enable automation. These changes are often unsuitable for production units. LAVA also expects that most devices will remain available for repeated testing rather than testing the software with a changing set of hardware.

The history of automated bootloader testing

Many attempts have been made to automate bootloader testing and the rest of this document covers the issues in detail. However, it is useful to cover some of the history in this introduction, particularly as that relates to ideas like SDMux - the SD card multiplexer which should allow automated testing of bootloaders like U-Boot on devices where the bootloader is deployed to an SD card. The problem of SDMux details the requirements to provide access to SD card filesystems to and from the dispatcher and the device. Requirements include: ethernet, no reliance on USB, removable media, cable connections, unique serial numbers, introspection and interrogation, avoiding feature creep, scalable design, power control, maintained software and mounting holes. Despite many offers of hardware, no suitable hardware has been found and testing of U-Boot on SD cards is not currently possible in automation. The identification of the requirements for a supportable SDMux unit is closely related to these device requirements.

Core elements of automation support

Reproducibility

The ability to deploy exactly the same software to the same board(s) and run exactly the same tests many times in a row, getting exactly the same results each time.

For automation to work, all device functions which need to be used in automation must always produce the same results on each device of a specific device type, irrespective of any previous operations on that device, given the same starting hardware configuration.

There is no way to automate a device which behaves unpredictably.

Reliability

The ability to run a wide range of test jobs, stressing different parts of the overall deployment, with a variety of tests and always getting a Complete test job. There must be no infrastructure failures and there should be limited variability in the time taken to run the test jobs to avoid the need for excessive Timeouts.

The same hardware configuration and infrastructure must always behave in precisely the same way. The same commands and operations to the device must always generate the same behaviour.

Scriptability

The device must support deployment of files and booting of the device without any need for a human to monitor or interact with the process. The need to press buttons is undesirable but can be managed in some cases by using relays. However, every extra layer of complexity reduces the overall reliability of the automation process and the need for buttons should be limited or eliminated wherever possible. If a device uses LEDs to indicate the success or failure of operations, such LEDs must only be indicative. The device must support full control of that process using only commands and operations which do not rely on observation.

Scalability

All methods used to automate a device must have a minimal footprint in terms of load on the workers, complexity of scripting support and infrastructure requirements. This is a complex area which can trivially impact both reliability and reproducibility, as well as making it much more difficult to debug problems which do arise. Admins must also consider the complexity of combining multiple different devices which each require multiple layers of support.

Remote power control

Devices MUST support automated resets, either by the removal of all power supplied to the DUT or by a full reboot or other reset which clears all previous state of the DUT.

Every boot must reliably start, without interaction, directly from the first application of power, with no need for button presses or other interaction. Relays and other arrangements can be used at the cost of increasing the overall complexity of the solution, so should be avoided wherever possible.

Networking support

Ethernet - all devices using ethernet interfaces in LAVA must have a unique MAC address on each interface. The MAC address must be persistent across reboots. No assumptions should be made about fixed IP addresses, address ranges or pre-defined routes. If more than one interface is available, the boot process must be configurable to always use the same interface every time the device is booted. WiFi is not currently supported as a method of deploying files to devices.

Serial console support

LAVA expects to automate devices by interacting with the serial port immediately after power is applied to the device. The bootloader must interact with the serial port. If a serial port is not available on the device, suitable additional hardware must be provided before integration can begin. All messages about the boot process must be visible using the serial port and the serial port should remain usable for the duration of all test jobs on the device.
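
As a rough illustration (not LAVA source code), the sketch below shows the kind of scripted serial interaction this implies, assuming a U-Boot device whose console is exposed over telnet by ser2net; the host, port number, prompt strings and timeout are illustrative assumptions and depend entirely on the lab and the device.

    # Minimal sketch: interrupt U-Boot over a serial console exposed via
    # telnet by ser2net. Host, port, prompts and timeout are assumptions.
    import pexpect

    def interrupt_bootloader(host="localhost", port=7001, timeout=30):
        console = pexpect.spawn(f"telnet {host} {port}", timeout=timeout)
        # Power is applied externally (e.g. via a PDU); the bootloader banner
        # must appear on serial without any button presses or interaction.
        console.expect("Hit any key to stop autoboot")
        console.sendline("")      # interrupt the autoboot countdown
        console.expect("=>")      # wait for the U-Boot prompt
        return console

    if __name__ == "__main__":
        console = interrupt_bootloader()
        console.sendline("printenv")
        console.expect("=>")
        print(console.before.decode(errors="replace"))

If the device cannot support this style of interaction from the first application of power, integration stalls at this point.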

Persistence

Devices supporting primary SSH connections have persistent deployments and this has implications, some positive, some negative - depending on your use case.

  • Fixed OS - the operating system (OS) you get is the OS of the device and this must not be changed or upgraded.
  • Package interference - if another user installs a conflicting package, your test can fail.
  • Process interference - another process could restart (or crash) a daemon upon which your test relies, so your test will fail.
  • Contention - another job could obtain a lock on a constrained resource, e.g. dpkg or apt, causing your test to fail.
  • Reusable scripts - scripts and utilities your test leaves behind can be reused (or can interfere) with subsequent tests.
  • Lack of reproducibility - an artifact from a previous test can make it impossible to rely on the results of a subsequent test, leading to wasted effort with false positives and false negatives.
  • Maintenance - using persistent filesystems in a test action results in the overlay files being left in that filesystem. Depending on the size of the test definition repositories, this could result in an inevitable increase in used storage becoming a problem on the machine hosting the persistent location. Changes made by the test action can also require intermittent maintenance of the persistent location.

Only use persistent deployments when essential and always take great care to avoid interfering with other tests. Users who deliberately or frequently interfere with other tests can have their submit privilege revoked.
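
Where a persistent deployment is unavoidable, defensive test writing helps limit the damage. The sketch below is an illustrative pattern only (not part of LAVA): every step runs inside a throwaway working directory which is removed whether the steps pass or fail, so nothing is left behind on the persistent filesystem; the commands shown are arbitrary examples.

    # Run each test step inside a temporary working directory so that
    # artifacts do not accumulate on a persistent deployment.
    import subprocess
    import tempfile

    def run_isolated(commands):
        results = []
        with tempfile.TemporaryDirectory(prefix="testjob-") as workdir:
            for cmd in commands:
                # Each step runs with the temporary directory as its cwd.
                proc = subprocess.run(cmd, shell=True, cwd=workdir,
                                      capture_output=True, text=True)
                results.append((cmd, proc.returncode))
        # TemporaryDirectory removes workdir here, even if a step failed.
        return results

    if __name__ == "__main__":
        steps = ["uname -a", "dd if=/dev/zero of=scratch.img bs=1M count=8"]
        for cmd, rc in run_isolated(steps):
            print(f"{'pass' if rc == 0 else 'fail'}: {cmd}")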

The dangers of simplistic testing

Connect and test

Seems simple enough - it doesn’t seem as if you need to deploy a new kernel or rootfs every time, no need to power off or reboot between tests. Just connect and run stuff. After all, you already have a way to manually deploy stuff to the board. The biggest problem with this method is Persistence as above - LAVA keeps the LAVA components separated from each other but tests frequently need to install support which will persist after the test, write files which can interfere with other tests or break the manual deployment in unexpected ways when things go wrong. The second problem within this fallacy is simply the power drain of leaving the devices constantly powered on. In manual testing, you would apply power at the start of your day and power off at the end. In automated testing, these devices would be on all day, every day, because test jobs could be submitted at any time.

ssh instead of serial

This is an over-simplification which will lead to new and unusual bugs and is only a short step on from connect & test with many of the same problems. A core strength of LAVA is demonstrating differences between types of devices by controlling the boot process. By the time the system has booted to the point where sshd is running, many of those differences have been swallowed up in the boot process.

Test everything at the same time

Issues here include:

Breaking the basic scientific method of test one thing at a time

The single system contains multiple components, like the kernel and the rootfs and the bootloader. Each one of those components can fail in ways which can only be picked up when some later component produces a completely misleading and unexpected error message.

Timing

Simply deploying the entire system for every single test job wastes inordinate amounts of time when you do finally identify that the problem is a configuration setting in the bootloader or a missing module for the kernel.

Reproducibility

The larger the deployment, the more complex the boot and the tests become. Many LAVA devices are prototypes and development boards, not production servers. These devices will fail in unpredictable places from time to time. Testing a kernel build multiple times is much more likely to give you consistent averages for duration, performance and other measurements than if the kernel is only tested as part of a complete system.

Automated recovery

Deploying an entire system can go wrong - whether due to an interrupted copy or a broken build, the consequences can mean that the device simply no longer boots.

Every component involved in your test must allow for automated recovery

This means that the boot process must support being interrupted before that component starts to load. With a suitably configured bootloader, it is straightforward to test kernel builds with fully automated recovery on most devices. Deploying a new build of the bootloader itself is much more problematic. Few devices have the necessary management interfaces with support for secondary console access or additional network interfaces which respond very early in boot. It is possible to chainload some bootloaders, allowing the known working bootloader to be preserved.

I already have builds

This may be true; however, automation puts extra demands on what those builds are capable of supporting. When testing manually, there are any number of times when a human will decide that something needs to be entered, tweaked, modified, removed or ignored - all of which the automated system needs to be able to handle. Examples include /etc/resolv.conf and customised tools.

Automation can do everything

It is not possible to automate every test method. Some kinds of tests and some kinds of devices have critical elements which do not work well with automation. These are not problems in LAVA; these are design limitations of the kind of test and the device itself. Your preferred test plan may be infeasible to automate and some level of compromise will be required.

Users are all admins too

This will come back to bite! However, there are other ways in which this can occur even after administrators have restricted users to limited access. Test jobs (including hacking sessions) have full access to the device as root. Users, therefore, can modify the device during a test job and it depends on the device hardware support and device configuration as to what may happen next. Some devices store bootloader configuration in files which are accessible from userspace after boot. Some devices lack a management interface that can intervene when a device fails to boot. Put these two together and admins can face a situation where a test job has corrupted, overridden or modified the bootloader configuration such that the device no longer boots without intervention. Some operating systems require a debug setting to be enabled before the device will be visible to the automation (e.g. the Android Debug Bridge). It is trivial for a user to mistakenly deploy a default or production system which does not have this modification.

LAVA and CI

LAVA is aimed at kernel and system development and testing across a wide variety of hardware platforms. By the time the test has got to the level of automating a GUI, there have been multiple layers of abstraction between the hardware, the kernel, the core system and the components being tested. Following the core principle of testing one element at a time, this means that such tests quickly become platform-independent. This reduces the usefulness of the LAVA systems, moving the test into scope for other CI systems which consider all devices as equivalent slaves. The overhead of LAVA can become an unnecessary burden.

CI needs a timely response - it takes time for a LAVA device to be re-deployed with a system which has already been tested. In order to test a component of the system which is independent of the hardware, kernel or core system a lot of time has been consumed before the “test” itself actually begins. LAVA can support testing pre-deployed systems but this severely restricts the usefulness of such devices for actual kernel or hardware testing.

Automation may need to rely on insecure access. Production builds (hardware and software) take steps to prevent systems being released with known login identities or keys, backdoors and other security holes. Automation relies on at least one of these access methods being exposed, typically a way to access the device as the root or admin user. User identities for login must be declared in the submission and be the same across multiple devices of the same type. These access methods must also be exposed consistently and without requiring any manual intervention or confirmation. For example, mobile devices must be deployed with systems which enable debug access which all production builds will need to block.

Automation relies on remote power control - battery powered devices can be a significant problem in this area. On the one hand, testing can be expected to involve tests of battery performance, low power conditions and recharge support. However, testing will also involve broken builds and failed deployments where the only recourse is to hard reset the device by killing power. With a battery in the loop, this becomes very complex, sometimes involving complex electrical bodges to the hardware to allow the battery to be switched out of the circuit. These changes can themselves change the performance of the battery control circuitry. For example, some devices fail to maintain charge in the battery when held in particular states artificially, so the battery gradually discharges despite being connected to mains power. Devices which have no battery can still be a challenge as some are able to draw power over the serial circuitry or USB attachments, again interfering with the ability of the automation to recover the device from being “bricked”, i.e. unresponsive to the control methods used by the automation and requiring manual admin intervention.

Automation relies on unique identification - all devices in an automation lab must be uniquely identifiable at all times, in all modes and all active power states. Too many components and devices within labs fail to allow for the problems of scale. Details like serial numbers, MAC addresses, IP addresses and bootloader timeouts must be configurable and persistent once configured.

LAVA is not a complete CI solution - even including the hardware support available from some LAVA instances, there are a lot more tools required outside of LAVA before a CI loop will actually work. The triggers from your development workflow to the build farm (which is not LAVA) and the submission to LAVA from that build farm are completely separate and outside the scope of this documentation. LAVA can help with the extraction of the results into information for the developers, but LAVA output is generic and most teams will benefit from some “frontend” which extracts the data from LAVA and generates relevant output for particular development teams.

Features of CI

Frequency

How often is the loop to be triggered?

Set up some test builds and test jobs and run through a variety of use cases to get an idea of how long it takes to get from the commit hook to the results being available to what will become your frontend.

Investigate where the hardware involved in each stage can be improved and analyse what kind of hardware upgrades may be useful.

Reassess the entire loop design and look at splitting the testing if the loop cannot be optimised to the time limits required by the team. The loop exists to serve the team, but the expectations of the team may need to be managed against the cost of hardware upgrades or finite time limits.

Scale

How many branches, variants, configurations and tests are actually needed?

Scale has a direct impact on the affordability and feasibility of the final loop and frontend. Ensure that the build infrastructure can handle the total number of variants, not just at build time but for storage. Developers will need access to the files which demonstrate a particular bug or regression.

Scale also provides benefits: with enough results, anomalies can be identified and ignored.

Identify how many test devices, LAVA instances and Jenkins slaves are needed. (As a hint, start small and design the frontend so that more can be added later.)

Interface

The development of a custom interface is not a small task

Capturing the requirements for the interface may involve lengthy discussions across the development team. Where there are irreconcilable differences, a second frontend may become necessary, potentially pulling the same data and presenting it in a radically different manner.

Include discussions on how or whether to push notifications to the development team. Take time to consider the frequency of notification messages and how to limit the content to only the essential data.

Bisect support can flow naturally from the design of the loop if the loop is carefully designed. Bisect requires that a simple boolean test can be generated, built and executed across a set of commits. If the frontend implements only a single test (for example, does the kernel boot?) then it can be easy to identify how to provide bisect support. Tests which produce hundreds of results need to be slimmed down to a single pass/fail criterion for the bisect to work.
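
As a sketch of how that single pass/fail criterion can drive automation, the script below could be used with git bisect run, which treats exit code 0 as good and 1 as bad. It assumes a LAVA instance at a hypothetical URL, a job definition already built for the current commit, and XML-RPC methods named scheduler.submit_job and scheduler.job_status; the exact method names and result fields vary between LAVA versions, so treat the API details as assumptions.

    # Reduce one LAVA test job to a single boolean for git bisect run.
    import sys
    import time
    import xmlrpc.client

    LAVA_URL = "https://lava.example.com/RPC2"      # hypothetical instance

    def boot_test_passes(job_definition):
        server = xmlrpc.client.ServerProxy(LAVA_URL)
        job_id = server.scheduler.submit_job(job_definition)
        while True:
            status = server.scheduler.job_status(job_id)    # assumed API call
            if status["job_status"] not in ("Submitted", "Running"):
                break
            time.sleep(60)
        # Did the job finish as Complete? That is the single boolean result.
        return status["job_status"] == "Complete"

    if __name__ == "__main__":
        with open(sys.argv[1]) as handle:
            definition = handle.read()       # YAML job built for this commit
        sys.exit(0 if boot_test_passes(definition) else 1)

A build step would need to produce the job definition for each commit before this wrapper is invoked, e.g. as part of the script passed to git bisect run.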

Results

This may take the longest of all elements of the final loop

Just what results do the developers actually want and can those results be delivered? There may be requirements to aggregate results across many LAVA instances, with comparisons based on metadata from the original build as well as the LAVA test.

What level of detail is relevant?

Different results for different members of the team or different teams?

Is the data to be summarised and if so, how?

Resourcing

A frontend has the potential to become complex and to need long-term maintenance and development

Device requirements

At the hardware design stage, there are considerations for the final software relating to how the final hardware is to be tested.

Uniqueness

All units of all devices must identify themselves uniquely to the host machine, as distinct from all other devices which may be connected at the same time. This particularly covers serial connections, but also any storage devices which are exported, network devices and any other method of connectivity.

Example - the WaRP7 integration has been delayed because the USB mass storage does not export a filesystem with a unique identifier, so when two devices are connected, there is no way to distinguish which filesystem relates to which device.

All unique identifiers must be isolated from the software to be deployed onto the device. The automation framework will rely on these identifiers to distinguish one device from up to a dozen identical devices on the same machine. There must be no method of updating or modifying these identifiers using normal deployment / flashing tools. It must not be possible for test software to corrupt the identifiers which are fundamental to how the device is identified amongst the others on the same machine.

All unique identifiers must be stable across multiple reboots and test jobs. Randomly generated identifiers are never suitable.

If the device uses a single FTDI chip which offers a single UART device, then the unique serial number of that UART will typically be a permanent part of the chip. However, a similar FTDI chip which provides two or more UARTs over the same cable would not have serial numbers programmed into the chip but would require a separate piece of flash or other storage into which those serial numbers can be programmed. If that storage is not designed into the hardware, the device will not be capable of providing the required uniqueness.

Example - the WaRP7 exports two UARTs over a single cable but fails to give unique identifiers to either connection, so connecting a second device disconnects the first device when the new tty device replaces the existing one.
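
A lab can audit this kind of problem on a worker before integration begins. The sketch below uses pyudev to list USB serial consoles and flag adapters with a missing or duplicated serial number; it is an illustrative check, not part of LAVA, and assumes the udev property ID_SERIAL_SHORT carries the adapter serial number.

    # List USB serial consoles and flag missing or duplicate serial numbers,
    # which make stable device naming impossible at scale.
    from collections import defaultdict
    import pyudev

    def audit_serial_adapters():
        seen = defaultdict(list)
        context = pyudev.Context()
        for dev in context.list_devices(subsystem="tty", ID_BUS="usb"):
            seen[dev.get("ID_SERIAL_SHORT")].append(dev.device_node)
        for serial, nodes in seen.items():
            if serial is None:
                print(f"NO serial number: {nodes} cannot be told apart reliably")
            elif len(nodes) > 1:
                print(f"DUPLICATE serial {serial}: {nodes}")
            else:
                print(f"ok: {serial} -> {nodes[0]}")

    if __name__ == "__main__":
        audit_serial_adapters()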

If the device uses one or more physical ethernet connector(s) then the MAC address for each interface must not be generated randomly at boot. Each MAC address needs to be:

  • persistent - each reboot must always use the same MAC address for each interface.
  • unique - every device of this type must use a unique MAC address for each interface.

If the device uses fastboot, then the fastboot serial number must be unique so that the device can be uniquely identified and added to the correct container. Additionally, the fastboot serial number must not be modifiable except by the admins.

Example - the initial HiKey 960 integration was delayed because the firmware changed the fastboot serial number to a random value every time the device was rebooted.

Scale

Automation requires more than one device to be deployed - the current minimum is five devices. One device is permanently assigned to the staging environment to ensure that future code changes retain the correct support. In the early stages, this device will be assigned to one of the developers to integrate the device into LAVA. The devices will be deployed onto machines which have many other devices already running test jobs. The new device must not interfere with those devices and this makes some of the device requirements stricter than may be expected.

  • The aim of automation is to create a homogeneous test platform using heterogeneous devices and scalable infrastructure.

  • Do not complicate things.

  • Avoid extra customised hardware

    Relays, hardware modifications and mezzanine boards all increase complexity

    Examples - the X15 needed two relay connections; the 96boards devices initially needed a mezzanine board whose design was rushed, causing months of serial disconnection issues.

  • More complexity raises failure risk nonlinearly

    Example - The lack of onboard serial meant that the 96boards devices could not be tested in isolation from the problematic mezzanine board. Numerous 96boards devices were deemed to be broken when the real fault lay with intermittent failures in the mezzanine. Removing and reconnecting a mezzanine had a high risk of damaging the mezzanine or the device. Once 96boards devices moved to direct connection of FTDI cables into the connector formerly used by the mezzanine, serial disconnection problems disappeared. The more custom hardware has to be designed / connected to a device to support automation, the more difficult it is to debug issues within that infrastructure.

  • Avoid unreliable protocols and connections

    Example - WiFi is not a reliable deployment method, especially inside a large lab with lots of competing signals and devices.

  • This document is not demanding enterprise or server grade support in devices.

    However, automation cannot scale with unreliable components.

    Example - HiKey 6220 and the serial mezzanine board caused massively complex problems when scaled up in LKFT.

  • Server support typically includes automation requirements as a subset:

    RAS, performance, efficiency, scalability, reliability, connectivity and uniqueness

  • Automation racks have similar requirements to data centres.

  • Things need to work reliably at scale

Scale issues also affect the infrastructure which supports the devices as well as the required reliability of the instance as a whole. It can be difficult to scale up from initial development to automation at scale. Numerous tools and utilities prove to be uncooperative, unreliable or poorly isolated from other processes. One result can be that the requirements of automation look more like the expectations of server-type hardware than of mobile hardware. The reality at scale is that server-type hardware has already had fixes implemented for scalability issues whereas many mobile devices only get tested as standalone units.

Connectivity and deployment methods

  • All test software is presumed broken until proven otherwise
  • All infrastructure and device integration support must be proven to be stable before tests can be reliable
  • All devices must provide at least one method of replacing the current software with the test software, at a level lower than you're testing.

The simplest method to automate is TFTP over physical ethernet, e.g. U-Boot or UEFI PXE. This also puts the least load on the device and automation hardware when delivering large images.

Manually writing software to SD is not suitable for automation. This tends to rule out many proposed methods for testing modified builds or configurations of firmware in automation.

See https://linux.codehelp.co.uk/the-problem-of-sd-mux.html for more information on how the requirements of automation affect the hardware design requirements to provide access to SD card filesystems to and from the dispatcher and the device.

Some deployment methods require tools which must be constrained within an LXC. These include but are not limited to:

  • fastboot - due to a common need to have different versions installed for different hardware devices

    Example - Every fastboot device suffers from this problem - any running fastboot process will inspect the entire list of USB devices and attempt to connect to each one, locking out any other fastboot process which may be running at the time, which sees no devices at all.

  • IoT deployment - some deployment tools require patches for specific devices or use tools which are too complex for use on the dispatcher.

    Example - the TI CC3220 IoT device needs a patched build of OpenOCD, the WaRP7 needs a custom flashing tool compiled from a github repository.

Wherever possible, existing deployment methods and common tools are strongly encouraged. New tools are not likely to be as reliable as the existing tools.

Deployments must not make permanent changes to the boot sequence or configuration.

Testing of OS installers may require modifying the installer so that it does not install an updated bootloader or modify the bootloader configuration. The automation needs to control whether the next reboot boots the newly deployed system or starts the next test job; for example, when a test job has been cancelled, the device needs to be immediately ready to run a different test job.

Interfaces

Automation requires driving the device over serial instead of via a touchscreen or other human interface device. This changes the way that the test is executed and can require the use of specialised software on the device to translate text based commands into graphical inputs.

It is possible to test video output in automation but it is not currently possible to drive automation through video input. This includes BIOS-type firmware interaction. UEFI can be used to automatically execute a bootloader like Grub which does support automation over serial. UEFI implementations which use graphical menus cannot be supported interactively.

Reliability

The objective is to have automation support which runs test jobs reliably. Reproducible failures are easy to fix but intermittent faults easily consume months of engineering time and need to be designed out wherever possible. Reliable testing means only 3 or 4 test job failures per week due to hardware or infrastructure bugs across an entire test lab (or instance). This can involve thousands of test jobs across multiple devices. Some instances may have dozens of identical devices, but they still must not exceed the same failure rate.
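
To put that failure budget in context, here is a back-of-envelope calculation; the weekly job volume is an assumption, not a figure from this document.

    # What failure rate does "3 or 4 failures per week" imply?
    jobs_per_week = 5000              # assumed throughput for a busy instance
    allowed_failures = 4
    rate = allowed_failures / jobs_per_week
    print(f"allowed infrastructure failure rate: {rate:.2%}")      # 0.08%
    print(f"required infrastructure reliability: {1 - rate:.2%}")  # 99.92%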

All devices need to reach the minimum standard of reliability, or they are not fit for automation. Some of these criteria might seem rigid, but they are not exclusive to servers or enterprise devices. To be useful, mobile and IoT devices need to meet the same standards, even though the software involved and the deployment methods might be different. The reason is that the Continuous Integration strategy remains the same for all devices. The problem is the same, regardless of underlying considerations.

A developer makes a change; that change triggers a build; that build triggers a test; that test reports back to the developer whether that change worked or had unexpected side effects.

  • False positives and false negatives are expensive in terms of wasted engineering time.
  • False positives can arise when not enough of the software is fully tested, or if the testing is not rigorous enough to spot all problems.
  • False negatives arise when the test itself is unreliable, either because of the test software or the test hardware.

This becomes more noticeable when considering automated bisections which are very powerful in tracking the causes of potential bugs before the product gets released. Every test job must give a reliable result or the bisection will not reliably identify the correct change.
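
A short worked example shows why this matters; the numbers are illustrative assumptions, not measurements.

    # A bisection across ~65,000 commits needs about 16 steps, and each step
    # trusts the result of one test job.
    per_job_reliability = 0.99    # assume 1% of jobs give a misleading result
    steps = 16
    all_steps_correct = per_job_reliability ** steps
    print(f"chance every bisection step is trustworthy: {all_steps_correct:.1%}")
    # ~85% - even a 1% unreliability rate sends the bisection to the wrong
    # commit roughly one time in seven, which is why every test job must give
    # a reliable result.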

Automation and Risk

Linaro kernel functional test framework (LKFT) https://lkft.validation.linaro.org/

We have seen with LKFT that complexity has a non-linear relationship with the reliability of any automation process. This section aims to set out some guidelines and recommendations on just what is acceptable in the tools needed to automate testing on a device. These guidelines are based on our joint lab and software team experiences with a wide variety of hardware and software.

Adding or modifying any tool has a risk of automation failure

Risk increases non-linearly with complexity. Some of this risk can be mitigated by testing the modified code and the complete system.

Dependencies installed count as code in terms of the risks of automation failure

This is a key lesson learnt from our experiences with LAVA V1. We added a remote worker method, which was necessary at the time to improve scalability, but it massively increased the risk of automation failure simply due to the extra complexity that came with the chosen design. These failures did not just show up in the test jobs which actively used the extra features and tools; they caused problems for all jobs running on the system.

The ability in LAVA V2 to use containers for isolation is a key feature

For the majority of use cases, the small extension of the runtime of the test to set up and use a container is negligible. The extra reliability is more than worth the extra cost.

Persistent containers are themselves a risk to automation

Just as with any persistent change to the system.

Pre-installing dependencies in a persistent container does not necessarily lower the overall risk of failure. It merely substitutes one element of risk for another.

All code changes need to be tested

In unit tests and in functional tests. There is a dividing line: if something is installed as a dependency of LAVA, then when that something goes wrong, LAVA engineers will be pressured into fixing the code of that dependency, whether or not we have any particular experience of that language, codebase or use case. Moving that code into a container moves that burden, but also makes triage of the problem much easier by allowing debug builds / options to be substituted easily.

Complexity also increases the difficulty of debugging, again in a nonlinear fashion

A LAVA dependency needs a higher bar in terms of ease of triage.

Complexity cannot be easily measured

Although there are factors which contribute.

Monoliths

Large programs which appear as a single monolith are harder to debug than the UNIX model of one utility joined with other utilities to perform a wider task. (This applies to LAVA itself as much as any one dependency - again, a lesson from V1.)

Feature creep

Continually adding features beyond the original scope makes complex programs worse. A smaller codebase will tend to be simpler to triage than a large codebase, even if that codebase is not monolithic.

Targeted utilities are less risky than large environments

A program which supports protocol after protocol after protocol will be more difficult to maintain than 3 separate programs for each protocol. This only gets worse when the use case for that program only requires the use of one of the many protocols supported by the program. The fact that the other protocols are supported increases the complexity of the program beyond what the use case actually merits.

Metrics in this area are impossible

The risks are nonlinear, the failures are typically intermittent. Even obtaining or applying metrics takes up huge amounts of engineering time.

Mismatches in expectations

The use case of automation rarely matches up with the more widely tested use case of the upstream developers. We aren't testing the code flows typically tested by the upstream developers, so we find different bugs, raising the level of risk. Generally, the simpler it is to deploy a device in automation, the closer the test flow will be to the developer flow.

Most programs are written for the single developer model

Some very widely used programs are written to scale, but this is difficult to determine without experience of trying to run them at scale.

Some programs do require special consideration

QEMU would fail most of these guidelines above, so there are mitigating factors:

  • Programs which can be easily restricted to well understood use cases lower the risk of failure. Not all use cases of the same program need to be covered.
  • Programs which have excellent community and especially in-house support also lower the risk of failure. (Having QEMU experts in Linaro is a massive boost for having QEMU as a dispatcher dependency.)

Unfamiliar languages increase the difficulty of triage

This may affect dependencies in unexpected ways. A program which has lots of bindings into a range of other languages becomes entangled in transitions and bugs in those other languages. This commonly delays the availability of the latest version which may have a critical fix for one use case but which fails to function at all in what may seem to be an unrelated manner.

The dependency chain of the program itself increases the risk of failure in precisely the same manner as the program

In terms of maintenance, this can include the build dependencies of the program as those affect delivery / availability of LAVA in distributions like Debian.

Adding code to only one dispatcher amongst many increases the risk of failure on the instance as a whole

By having an untested element which is at variance to the rest of the system.

Conditional dependencies increase the risk

Optional components can be supported, but they increase the testing burden by extending the matrix of installations.

Presence of the code in Debian main can reduce the risk of failure

This does not outweigh other considerations - there are plenty of packages in Debian (some complex, some not) which would be an unacceptable risk as a dependency of the dispatcher, fastboot for one. A small python utility from github can be a substantially lower risk than a larger program from Debian which has unused functionality.

Sometimes, "complex" simply means "buggy" or "badly designed"

fastboot is not actually a complex piece of code, but we have learnt that it does not currently scale. This is a result of the disparity between the development model and the automation use case. Disparities like that actually equate to complexity, in terms of triage and maintenance. If fastboot were more complex at the codebase level, it might actually become a lower risk than it is currently.

Linaro as a whole does have a clear objective of harmonising the ecosystem

Adding yet another variant of existing support is at odds with the overall objective of the company. Many of the tools required in automation have no direct effect on the distinguishing factors for consumers. Adding another one "just because" is not a good reason to increase the risk of automation failure. Just as with standards.

Having the code on the dispatcher impedes development of that code

Bug fixes will take longer to be applied because the fix needs to go through a distribution or other packaging process managed by the lab admins. Applying a targeted fix inside an LXC is useful for proving that the fix works.

Not all programs can work in an LXC

LAVA also provides ways to test using those programs by deploying the code onto a test device. e.g. the V2 support for fastmodels involves only deploying the fastmodel inside a LAVA Test Shell on a test device, e.g. x86 or mustang or Juno.

Speed of running a test job in LAVA is important for CI

The goal of speed must give way to the requirement for reliability of automation

Resubmitting a test job due to a reliability failure is more harmful to the CI process than letting tests take longer to execute without such failures. Test jobs which run quickly are easier to parallelize by adding more test hardware.

Modifying software on the device

Not all parts of the software stack can be replaced automatically, typically the firmware and/or bootloader will need to be considered carefully. The boot sequence will have important effects on what kind of testing can be done automatically. Automation relies on being able to predict the behaviour of the device, interrupt that default behaviour and then execute the test. For most devices, everything which executes on the device prior to the first point at which the boot sequence can be interrupted can be considered as part of the primary boot software. None of these elements can be safely replaced or modified in automation.

The objective is to deploy the device such that as much of the software stack as possible can be replaced, whilst preserving the predictable behaviour of all devices of this type, so that the next test job always gets a working, clean device in a known state.

Primary boot software

For many devices, this is the bootloader, e.g. U-Boot, UEFI or fastboot.

Some devices include support for a Baseboard Management Controller (BMC) which allows the bootloader and other firmware to be updated even if the device is bricked. The BMC software itself must then be considered as the primary boot software; it cannot be safely replaced.

All testing of the primary boot software will need to be done by developers using local devices. SDMux was an idea which only fitted one specific set of hardware; the problem of testing the primary boot software is a hydra. Adding customised hardware to try to sidestep the primary boot software always increases the complexity and failure rates of the devices.

It is possible to divide the pool of devices into some which only ever use known versions of the primary boot software controlled by admins and other devices which support modifying the primary boot software. However, this causes extra work when processing the results, submitting the test jobs and administering the devices.

A secondary problem here is that it is increasingly common for the methods of updating this software to be esoteric, hacky, restricted and even proprietary.

  • Click-through licences to obtain the tools

  • Greedy tools which hog everything in /dev/bus/usb

  • NIH tools which are almost the same as existing tools but add vendor-specific "functionality"

  • GUI tools

  • Changing jumpers or DIP switches,

    Often in inaccessible locations which require removal of other ancillary hardware

  • Random, untrusted, compiled vendor software running as root

  • The need to press and hold buttons and watch for changes in LED status.

We've seen all of these - in various combinations - just in 2017, as methods of getting devices into a mode where the primary boot software can be updated.

Copyright 2018 Neil Williams linux@codehelp.co.uk

Available under CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/legalcode