
No matter how many times I visit a factory where Embotech’s Automated Vehicle Marshalling (AVM) solution is driving cars autonomously, I go through the same thought process. Every single time.
For the first 1-2 drives I am in awe. I look through the window: no driver. Steering wheel turning smoothly. Big new cars, narrow factory building. Equal margins on either side in a tight corner. Nice. For the next couple of drives, I get anxious. Is it going to keep performing well, or am I going to witness an abort together with these very important guests I have with me? So far so good. Then, by car 10 or so, it quickly starts to get boring. Several different vehicle models, all coming out of the same line, driving from production, through test tracks, to logistics. Can we go for lunch already?
Great B2B autonomy tech is not exciting. It is efficient, it is repetitive, it is seamless, it is boring. Day in, day out, flexibly adapting to the environment without you even noticing it. Factory workers walking around and in between the cars as if it were completely normal to be around a L4 system operating driverless 24/7. It is.
The Solution
So what is behind this AVM solution? It comprises off-the-shelf lidar sensors (3D laser scanners that measure distance with thousands of points), which are installed on the factory walls and ceiling, and on-premise servers, which run the autonomous driving (AD) software and the fleet management system (FMS). The AD software computes a trajectory in real time and sends it via 4G LTE to the vehicle, which runs a small interface software that translates the trajectory into acceleration, steering and braking commands. The FMS translates the jobs given by the customer's overarching system into driving tasks, which are executed by the AD software.
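To give a feel for what "translating a trajectory into commands" means, here is a minimal sketch. All names, units and the bicycle-model steering are my illustrative assumptions, not Embotech's actual interface, which deals with vehicle-specific signals, latency compensation and plausibility checks:

```python
import math
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    t: float          # time from now, seconds
    speed: float      # target speed, m/s
    curvature: float  # path curvature, 1/m (signed)

def to_commands(point: TrajectoryPoint, current_speed: float,
                wheelbase_m: float = 2.9) -> dict:
    """Translate one trajectory point into actuation commands.

    Hypothetical sketch only: a simple bicycle model turns path
    curvature into a steering angle, and the speed delta over the
    point's time horizon into an acceleration or braking request.
    """
    steering_rad = math.atan(point.curvature * wheelbase_m)
    accel = (point.speed - current_speed) / max(point.t, 0.1)
    return {
        "steering_rad": steering_rad,
        "accel_mps2": max(accel, 0.0),   # positive part -> throttle
        "brake_mps2": max(-accel, 0.0),  # negative part -> brake
    }
```

The real interface is then disabled after the vehicle leaves logistics, as described below.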
The AD software is based on two independent software paths running on independent hardware: an AI-based performance path, designed for human-like driving, and a deterministic safety path. The latter is responsible for constant health checking and for stopping the vehicle in case any check fails.
The AI-based performance path builds a model of the world around the vehicle from lidar points in space (a lidar point cloud that is processed into a digital twin of the environment), then calculates the optimal path of the vehicle and tracks its progress, adjusting in real time. Often, we have to "constrain" it to not act optimally, because the optimal path may appear unnatural to humans, leading to potential risks in mixed traffic with humans walking or driving other vehicles. That's one of several things we had not foreseen when we started building this solution 13 years ago. Good AD is understandable and predictable for humans - drivers and pedestrians alike. The performance path learns from vast data sets produced daily, continuously improving perception and driving performance.
The safety path - a non-AI algorithmic layer that is the basis of our safety certification - is designed to check the whole stack and to slam the brakes in case anything goes wrong. It is the last resort. It makes our physical AI system – with the performance and safety paths in parallel - safe by design. That means that we do not have to rely on millions of km driven to make the system safe, because we have a non-AI supervisor that is fully predictable and reliable, even in unforeseen (= no relevant training data) situations. This approach – safety by design – keeps development costs at bay because it drastically reduces the amount of validation that needs to be done on the AI part, which would otherwise be required to formally prove safety by statistics. More on this topic coming soon in an article series by our CTO, Alex Domahidi.
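The supervisor pattern described above - deterministic checks running every cycle, with braking as the fallback - can be sketched roughly as follows. The check names and thresholds are invented for illustration and have nothing to do with the certified implementation:

```python
from typing import Callable, List

class SafetySupervisor:
    """Deterministic last-resort layer: runs simple, fully predictable
    checks every cycle and triggers an emergency stop if any fails.
    Illustrative sketch only - not the certified stack."""

    def __init__(self, checks: List[Callable[[dict], bool]]):
        self.checks = checks  # each returns True if the state is healthy

    def step(self, state: dict) -> str:
        for check in self.checks:
            if not check(state):
                return "EMERGENCY_BRAKE"
        return "FOLLOW_TRAJECTORY"  # defer to the AI performance path

# Example deterministic checks - no learned components involved:
def sensors_alive(state: dict) -> bool:
    return state["lidar_age_ms"] < 200     # data must be fresh

def clear_stop_zone(state: dict) -> bool:
    return state["min_obstacle_dist_m"] > 0.5

supervisor = SafetySupervisor([sensors_alive, clear_stop_zone])
```

The point of the pattern is that every line of the supervisor is auditable and its behavior is fixed, regardless of what the AI path does.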
So, you want to install this solution in your factory or logistics yard. What's next? First, the connection to the vehicle needs to be enabled. Our standardised interface is a small piece of software (a few kB) that enables the vehicle to be driven by external commands and is disabled after the vehicle leaves logistics. Second, the customer sends us a lidar scan of the drivable area (customers often have this for other reasons, but if not, we can do the scan). We use our software tools to semi-automatically place lidars in a 3D environment, making sure that coverage is good while minimizing sensor cost; the output is a lidar placement document containing instructions on exactly where and how to install the lidars. We also specify how many off-the-shelf servers are needed to run the system, and the customer typically organizes those, too, through their standard procurement processes.
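Our placement tool is proprietary, but the underlying trade-off - good coverage with as few sensors as possible - is a set-cover-style problem, and a greedy sketch conveys the intuition (candidate positions and grid cells are stand-ins for the real 3D geometry, occlusion and redundancy analysis):

```python
def place_lidars(candidates: dict, area_cells: set) -> list:
    """Greedy coverage sketch: repeatedly pick the candidate mount
    position that covers the most still-uncovered cells of the
    drivable area.

    candidates: {position_name: set of cells that position can see}.
    Illustrative only - real placement also weighs occlusions,
    mounting height and sensor redundancy.
    """
    uncovered = set(area_cells)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:  # remaining cells cannot be covered by any candidate
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy heuristic is a classic approximation for set cover; a real tool can afford to search harder, since placement is computed once per site.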
Once installation is done, our engineers come on site and spend 1-2 weeks commissioning the system, driving a few cars per day manually. Next, we summon the certification body to the site and conduct confirmation tests in their presence. Our solution is safety certified (think of it as a type approval), so the only thing they are looking at is the delta – the small changes, if any – that are needed to drive this particular use case at this site. This is typically half a day's work or so. Once certified for driverless drives, the system is ramped up progressively over the next 3 months to full output.
Lessons learned
The safest system is the one that doesn't move at all. If it doesn't move, it can't cause any damage. Right? That is what we learned in practice when we first started operating our certified AVM solution back in early 2024. On the first drives with a certified system, it was barely moving. The safety path was slamming the brakes on almost every drive. We spent the next 6 months finding out where the system was too conservative: distance from static obstacles, predictions for moving ones, ghost objects, severe weather, sub-optimal path planning, factory doors, traffic lights, … I could write all day. It is this learning process, not the certification itself, that is the key to a safe and well-functioning system (although certification is a prerequisite to even get there). Predictably, it gets faster the next time you go through the process (as with our Automated Truck Solution business), but that will be the subject of a different article.
Our system is generally robust, completing drives at success rates above 99%. Since we are driving north of 2500 vehicles per day, however, this still means around 20 aborts per day that you have to deal with. That is a combined software and operations challenge, and we have built a suite of software tools that helps us monitor, operate and recover in case of aborts and other failures.
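The arithmetic behind those numbers is simple; assuming a 99.2% success rate (an illustrative figure consistent with "above 99%"), 2500 drives per day leave about 20 aborts to handle:

```python
drives_per_day = 2500
success_rate = 0.992  # assumed; the text only states "above 99%"

# Even an excellent per-drive success rate leaves a steady stream
# of daily aborts at this volume.
aborts_per_day = drives_per_day * (1 - success_rate)
print(round(aborts_per_day))  # 20
```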
The vast majority of aborts can be dealt with by factory workers who are otherwise doing other jobs on the production line. The worker approaches the vehicle and inspects the situation, deciding whether to allow an automatic restart or, if that fails, to pick up the vehicle and drive it to its end destination manually. Once in a while, a few aborted drives can cause a cascading effect, where blocked vehicles in the wrong place (say, a long tunnel) cause further aborts. Even less frequently – extremely rarely – there is an error that takes the whole system down. Major incident! Hustle.
Actually, what we learned from customers is that they do not expect zero problems, and that includes major incidents (it does not, of course, include safety incidents, of which we have had none). The key is how you react and how fast you fix it – the requirement is usually a recovery within about 20-30 minutes. This presents a major operations challenge that has to be solved by in-situ personnel and/or smart software tools.
One of the top causes of such problems is a misaligned lidar. Whether because of a loose mechanical connection, somebody bumping into it, or somebody cleaning it – the result is that the system shuts down that lidar and disables the relevant drivable area for safety purposes. Recalibration can be a very cumbersome and manual process, unless you have the right tools to do it at the press of a button. It took us some time to realize that this is a top issue, and some more time to get the tools right, but it has been a great investment. The key is to reduce manual work, especially when that work keeps a large part of the system down until it's done.
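One simple way to detect a bumped sensor - a simplified illustration, not our actual tooling - is to monitor the residual between the lidar's current view of known static geometry (walls, pillars) and a reference captured at calibration time:

```python
import statistics

def lidar_misaligned(reference_ranges: list, current_ranges: list,
                     threshold_m: float = 0.05) -> bool:
    """Compare measured ranges to known static geometry against a
    reference captured at calibration time. A systematic offset
    across many points suggests the sensor itself has moved.
    Simplified illustration; the threshold is an assumption."""
    residuals = [abs(c - r) for c, r in zip(current_ranges, reference_ranges)]
    # Median is robust against a few outliers (people, vehicles
    # temporarily occluding the reference geometry).
    return statistics.median(residuals) > threshold_m

# On detection, the system would disable the lidar and the drivable
# area it covers, then trigger (semi-)automatic recalibration.
```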
Another challenging area is the deployment of new software versions. We have so far managed to iron out most problems in the software versions running in our deployments. But what about new versions coming in with bug fixes and upgrades? Those can either be installed during production breaks, or slowly introduced and ramped up while production is running. Both methods typically result in a period of hypercare, where engineers from our operations team monitor the system from the early morning when production starts, ready to react if anything goes wrong. Typically, after a couple of hours of stable production, the remaining risk is minimal and the hypercare team can go about their normal duties.
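The ramp-up-during-production variant can be thought of as a guarded canary rollout. The states, thresholds and the two-hour window below are illustrative assumptions, not our actual release policy:

```python
def rollout_decision(hours_stable: float,
                     new_version_error_rate: float,
                     baseline_error_rate: float) -> str:
    """Hypercare logic sketch: keep ramping the new software version
    up while it performs no worse than the baseline; roll back if its
    error rate degrades noticeably. Thresholds are assumptions."""
    if new_version_error_rate > 2 * baseline_error_rate:
        return "ROLLBACK"   # new version is clearly worse - revert
    if hours_stable < 2:
        return "HOLD"       # keep monitoring, no further ramp yet
    return "RAMP_UP"        # widen the share of drives on the new version
```

In practice the "error rate" here would be the abort rate per drive, which ties the release process directly to the operational KPIs discussed below.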
Even though our tools have seen tremendous development, getting the system back up and running inevitably also falls into the hands of our operations teams. These are the people who are on call almost 24/7 to react to issues where an automatic recalibration, reboot, etc. does not help. The customer support team is tasked with first-level support and triage of the problem, while the application engineers come into play when level-2 support is required, i.e. when the issue goes beyond standard IT troubleshooting. The engineers have at their disposal an array of tools, including AI-based root cause identification, which is key to identifying faults in a complex system. It has taken us time to reach the organizational maturity required to make this happen.
Another major area of development has been the measurement of system performance: when the system generates far more data than you can process, getting the right performance KPIs and aligning them with what the customer is measuring is surprisingly elusive. It requires integration into customer ticketing systems, real-time dashboards, automatic root cause analysis, sensor health metrics, pattern recognition and several other steps to ensure that the right priorities are set in product improvement efforts.
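Even a KPI as basic as "success rate" needs an agreed definition before both sides measure the same thing. A minimal sketch, with log fields that are assumptions for illustration:

```python
def success_rate(drives: list) -> float:
    """Share of drives completed autonomously.

    Whether a drive that recovered via automatic restart counts as
    'completed', or a manual takeover counts as 'failed', is exactly
    the kind of definition that must be agreed with the customer.
    The log fields used here are invented for illustration.
    """
    if not drives:
        return 0.0
    completed = sum(
        1 for d in drives
        if d["status"] == "completed" and not d["manual_takeover"]
    )
    return completed / len(drives)
```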
Looking forward
We are currently operational on 4 sites, mostly 24/7 or close to it, and we have now driven upwards of half a million unique vehicles. That is the largest L4 industrial autonomy operation in the world. We are constantly installing new "use cases" (aka driving tasks) on these sites and also installing new sites. But how do you scale that to 200 sites? (This is a real question that we have been asked by a customer.) The honest answer is that this is work in progress. Some thoughts on the steps that are certainly needed:
First of all, a necessary element to deploy efficiently (and thus profitably) is advanced automation via software tools for deployment, commissioning, operation and recovery. That means automatic flagging of errors, automatic classification of the reasons for aborts, constant monitoring of KPIs, thought-through data retention policies (we already generate 10 TB per day, and if we kept all of it we would be bankrupt several times over by now), and more.
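A retention policy at 10 TB per day usually means keeping full-resolution data only around interesting events and keeping aggregates much longer. The tiers and durations below are invented for illustration, not our actual policy:

```python
# Invented retention tiers for illustration: expensive raw data is
# short-lived, event snippets are kept long, cheap aggregates ~forever.
RETENTION_DAYS = {
    "raw_pointcloud": 3,     # full lidar streams: days only
    "abort_snippet": 365,    # full data around aborts: keep for analysis
    "kpi_metrics": 3650,     # aggregated metrics: negligible storage cost
}

def should_delete(category: str, age_days: int) -> bool:
    """Delete data once it outlives its category's retention window.
    Unknown categories default to immediate deletion (window of 0)."""
    return age_days > RETENTION_DAYS.get(category, 0)
```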
Second: get the data right. The foundation of a continuously improving system is a robust data pipeline that allows the system to learn from the right things and improve as fast as possible while minimizing effort.
Third: for efficient deployment, working with the customer is key in order to standardize rollout processes, build playbooks, improve product manuals, and train and enable the customer to use rollout, calibration and operation tools themselves. For the parts that cannot be handed to the customer, but also cannot sustainably be done by Embotech on hundreds of sites, it is necessary to work with outsourcing partners who have people on the ground at factories and logistics centers and can act as rollout and operations partners.
Final thoughts
The road to safety certification of an L4 autonomous driving system is bumpy, but getting it to work well enough to add value for the customer while still being safe – that is much harder. It involves unforeseen challenges such as getting the system to drive sub-optimally so it is perceived as human-like, agreeing on the basics of system performance measurement with the customer, developing lidar auto-calibration tools because even static things move, achieving operational excellence with the support teams, and many others.
Once the system is adding value at scale, scaling further from several sites driving a few thousand vehicles per day to hundreds of sites driving millions of vehicles per day is the ultimate challenge. We have seen tremendous progress in autonomy in recent years, the most public of which has been the advancement of L5 robotaxis to several cities around the world. While that is ongoing – very successfully in terms of technology, albeit far from profitability – L4 systems are quietly moving industrial logistics at scale, profitably.
Our design philosophy supports scaling such technologies at a fraction of the cost of L5 systems: we use AI to i) develop and iterate faster, ii) make the system performant and iii) make operations efficient, but we use deterministic methods to make the system inherently safe, by design. This is what we call an inherently safe physical AI system. Combined with off-the-shelf sensors that are augmented by software to be safe, this creates a cost-competitive yet very capable and safe solution.
During the time you have been reading this, another 15 cars have been driven by Embotech’s L4 AVM solution, driverlessly. The future of autonomy is safe physical AI.
Written by:
Andreas Kyrtatos, CEO of Embotech AG
