Autonomously Hacking for Defense

Code generation for national security

Apr 18, 2024

In recent months, code generation has been top of mind for many as startups like Cognition, Poolside, and Magic announced massive financing rounds to build products that use generative AI to write code. Similarly, startups have released flashy demos that seem to show AI agents completing full coding projects, like Cognition’s recent release of Devin, an autonomous software engineer.

A number of my software engineering friends (and I!) have lamented that our hard-earned computer science degrees are for nothing, since AI will write all software in the future. Now, this may be an overreaction on the part of my fellow CS grads, but it is still exciting to see the progress being made in automating software development. I’m particularly excited to see how generative AI-enabled code generation will improve national security.

So, what does code generation have to do with national security? At face value, code generation may not seem like it has much to do with national security — the US Department of Defense (DoD) is not known for writing software or employing software engineers (although it has had a few software factories pop up in recent years).

However, digging in further it becomes clear that using generative AI to write code has tremendous potential to improve national security. First, and perhaps most obvious, the DoD and intelligence community (IC) rely on software for everything from personnel management to logistics management to intelligence analysis to weapons operations. New code generation products like Cognition’s Devin and GitHub Copilot will enable the DoD’s software vendors to more quickly and securely produce and improve software.

There is significant white space for startups to make a large impact on national security by focusing on specific use cases of code generation. While generic code generation tools like GitHub Copilot are very good at generating code in the languages and frameworks commonly used by open source developers, like Python, React, JavaScript, TypeScript, and Ruby, many of the systems and workflows used by the DoD, IC, and other critical infrastructure operators run on more specialized coding languages. For instance, many DoD codebases are written in legacy languages like Ada, and many edge computing systems (e.g., drones) are written in low-level languages like C/C++.

Beyond simply writing new code, there are several specific use cases of AI-enabled code generation that will be particularly impactful for US national security: code modernization, automated cybersecurity vulnerability remediation, sensor fusion, and edge code deployment.

Code Modernization

One of the most meaningful ways code generation can have an impact on national security is through code modernization. A tremendous amount of code in production in the federal government today is legacy code, much of which suffers from tech debt. Notably, in November 2021 the DoD released a software modernization strategy. Much of the federal government’s (and critical infrastructure providers’) code is a) written in legacy languages like Ada, COBOL, or old versions of Java, b) resides in monolithic codebases that are difficult to understand, and c) depends on legacy on-prem data storage solutions. There are a number of reasons why it is important for organizations to update their legacy code.

First, legacy code tends to be difficult and expensive to improve and secure. Because so few people today know how to write code in legacy coding languages, the DoD and other organizations that rely on legacy coding languages often need to hire expensive consultants who charge hundreds (or even thousands) of dollars per hour to update legacy code. Additionally, legacy and monolithic codebases tend to be difficult to understand, which makes them difficult to update and secure. Organizations with large legacy codebases frequently struggle to retain engineering talent, as talented engineers typically are not interested in the mundane and difficult work involved in maintaining legacy codebases. Due to the cost and difficulty of updating legacy codebases, many vulnerabilities remain unpatched when they are discovered. In addition, legacy codebases often do not have modern security frameworks in place, as they are difficult to add post-facto. Furthermore, organizations typically need to modernize legacy code before they are able to migrate to the cloud. Cloud migration offers an array of benefits for organizations, including improved security posture, scalability, resiliency, efficiency, and significant cost savings.

Clearly, failing to update and secure legacy codebases is tremendously problematic, particularly when mission critical systems like weapons systems, healthcare systems, and banking systems run on legacy code. Yet, many of these legacy codebases remain archaic due to the cost of updating codebases.

Today, modernizing codebases is an expensive and manual task that requires a significant amount of developers’ time, eating into the time they could spend writing net new software. However, AI code generation tools like GitHub Copilot are already showing promise for improving the efficiency of codebase modernization projects. Several IT leaders I’ve spoken with have successfully used GitHub Copilot to translate legacy languages to more modern languages, turn monolithic codebases into modular codebases, and migrate data from legacy data systems to more modern systems (like cloud based systems). However, code modernization experts report that generic code generation tools remain imperfect for code modernization.

Over the past year, a handful of startups, including Grit, Sweep, and Modelcode, have emerged that focus specifically on using generative AI to help organizations modernize code and manage tech debt. Not only do these startups’ products rewrite legacy code, they also generate unit tests and use static code analysis techniques built on abstract syntax trees (ASTs) to check that the newly generated code maintains the same functionality as the legacy code it is replacing.
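To make the AST idea a bit more concrete, here is a minimal sketch using Python's standard `ast` module. It only shows the simplest case: checking that two versions of a function in the same language have the same structure once formatting and comments are ignored. The function names and sample snippets are hypothetical, and I am not claiming this is how any of the startups above work. Comparing code across languages (say, COBOL to Java) is a much harder problem.

```python
import ast


def normalized_ast(source: str) -> str:
    """Parse source and return a dump that ignores formatting and comments."""
    tree = ast.parse(source)
    # ast.dump omits line and column attributes by default,
    # so purely cosmetic differences disappear.
    return ast.dump(tree)


def same_structure(old_source: str, new_source: str) -> bool:
    """True when both snippets parse to the same abstract syntax tree."""
    return normalized_ast(old_source) == normalized_ast(new_source)


# Hypothetical "legacy" function.
LEGACY = """
def total(prices):
    # sum the prices
    result = 0
    for p in prices:
        result = result + p
    return result
"""

# Same logic, different layout and no comment.
REFORMATTED = """
def total(prices):
    result = 0
    for p in prices: result = result + p
    return result
"""

# A rewrite that silently changed behavior (+ became -).
CHANGED = """
def total(prices):
    result = 0
    for p in prices:
        result = result - p
    return result
"""

if __name__ == "__main__":
    print(same_structure(LEGACY, REFORMATTED))
    print(same_structure(LEGACY, CHANGED))
```

A structural match like this is a strict test: a legitimate refactor that renames a variable would fail it, which is one reason such checks are paired with generated unit tests.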

In addition to upgrading legacy code, generative AI can also help organizations reduce the need for code modernization in the future by reducing tech debt as it is written. For instance, many of the startups mentioned are also able to automatically generate unit tests for customers’ codebases (most codebases have shockingly low unit test coverage), helping to make customer code more resilient. Some of these tools are built straight into customers' IDEs and CI/CD pipelines, helping developers identify and fix tech debt as they produce it to ensure hard-to-secure and hard-to-update code does not find its way into production.

Automated Cybersecurity Vulnerability Remediation

There is significant potential for code generation to revolutionize the cybersecurity industry by automating cybersecurity vulnerability remediation. Today, there are many cybersecurity products and services that help organizations identify all the vulnerabilities present in their code bases. However, very few companies actually help organizations remediate identified vulnerabilities. Consequently, many security engineers feel overwhelmed with security alerts that they lack adequate time to address, while developers often express frustration over time-consuming security review processes.

I’ve met a number of startups over the past year that use code generation techniques to automatically generate secure code to patch discovered vulnerabilities. Interestingly, many of these companies did not start out using generative AI. Rather, they started out building products to identify vulnerabilities and then found that they could provide customers even more value by combining their vulnerability identification technology with generative AI-enabled vulnerability remediation. For instance, both Semgrep and GitHub’s CodeQL, which help surface vulnerabilities in organizations’ codebases using a technique called Static Application Security Testing (SAST), have released features that automatically suggest secure AI-generated code to remediate identified vulnerabilities. Additionally, Xeol, which initially started out as a tool to identify end-of-life (EOL) software packages in an organization’s software supply chain, is now rolling out a new feature that automatically generates new code to replace EOL software present in customers’ codebases. Similarly, several startups, including Staris AI and RunSybil, are using LLMs to conduct source code-assisted penetration testing (today, source code-assisted penetration testing is entirely manual and can only be conducted by highly specialized and expensive human penetration testers). After finding vulnerabilities in customers’ codebases, they are able to generate new, secure code (including unit tests) to replace vulnerable code.

As with the code modernization startups, many of these automated vulnerability remediation startups also generate unit tests to ensure that the code they generate is functional, and they use static code analysis to ensure that the new code works the same way as the vulnerable code it is replacing.
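For readers who have not seen a SAST check up close, the "find it, then propose a fix" loop can be sketched in a few lines of Python. This is a toy, not how Semgrep or CodeQL are implemented: the `Finding` class and `find_eval_calls` function are hypothetical, the single rule (flagging bare `eval` calls) is just an illustration, and the suggested fix is a hard-coded string standing in for what would be an AI-generated patch in a real product.

```python
import ast
from dataclasses import dataclass


@dataclass
class Finding:
    line: int          # 1-based line number of the risky call
    message: str       # why it was flagged
    suggestion: str    # placeholder for a generated remediation


def find_eval_calls(source: str) -> list[Finding]:
    """Walk the syntax tree and flag direct calls to the builtin eval()."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        is_eval_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        )
        if is_eval_call:
            findings.append(Finding(
                line=node.lineno,
                message="eval() on untrusted input can execute arbitrary code",
                suggestion="use ast.literal_eval(...) if only literals are expected",
            ))
    return findings


# Hypothetical vulnerable snippet.
VULNERABLE = """
def parse_config(text):
    return eval(text)
"""

# Hypothetical remediated snippet.
REMEDIATED = """
import ast

def parse_config(text):
    return ast.literal_eval(text)
"""

if __name__ == "__main__":
    for finding in find_eval_calls(VULNERABLE):
        print(finding.line, finding.message, "->", finding.suggestion)
```

Re-running the same check on the patched code is the simplest form of verification: the finding should disappear while the unit tests still pass.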

Sensor Fusion

As I’ve written about in the past, as part of the DoD’s CJADC2 strategy, the DoD needs to integrate the data coming in from many disparate sensors and platforms in order to improve DoD situational awareness and decision making on the battlefield. For example, an F-35 has a plethora of sensors onboard that it uses to navigate and monitor its environment. In order to gain a complete picture of the operating environment, the computers onboard the F-35 need to fuse all the data together into one operating picture. This presentation from Lockheed Martin, the F-35’s developer, outlines all the different sensors that are ultimately fused together on the F-35, and this slide shows how different sensors (infrared, electronic warfare, radar, etc.) can all ultimately be fused into one common operating picture:

Now imagine you have a whole host of F-35s working together with additional platforms on the ground, in space, and in the sea, all sharing data from a myriad of sensors. Proponents of CJADC2 often refer to diagrams like the one below to demonstrate their vision for a connected future of warfare:

This vision of the future of warfare is highly reliant on quickly integrating sensor data from many different sources. It takes a significant amount of engineering work to actually integrate the data coming in from all those sensors to create a common operating picture for warfighters. Many sensor developers make it actively difficult to integrate their sensors with others and only provide basic information on how to work with their sensors in interface control documents, which can be difficult to understand. Today, in order to integrate data coming in from different sensors, organizations must employ expensive engineers to manually write sensor integration code. However, a handful of startups (for example, Fid Labs, founded by a former Anduril sensor integration engineer) have already emerged that use generative AI to write sensor integration code. These products are able to ingest sensors’ interface control documents and write code that integrates and standardizes the data coming in from multiple sensors. This form of code generation will bring the DoD one step closer to its ultimate vision for CJADC2 and will reduce the amount of time engineers need to spend writing complicated sensor integration code.
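To give a flavor of what "sensor integration code" means in practice, here is a small Python sketch of the kind of glue an engineer (or a code generation tool) writes after reading two interface control documents. Everything in it is made up for illustration: the `Track` schema, the two message formats, and the field names are assumptions, not any real sensor's interface. It also only covers the standardization step; real sensor fusion goes further and has to decide which reports from different sensors refer to the same object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Common schema that every sensor report is converted into."""
    sensor_id: str
    timestamp_s: float   # seconds since epoch
    lat_deg: float
    lon_deg: float


def from_radar(msg: dict) -> Track:
    # Hypothetical radar format: epoch milliseconds, position in degrees.
    return Track(
        sensor_id=msg["id"],
        timestamp_s=msg["time_ms"] / 1000.0,
        lat_deg=msg["lat"],
        lon_deg=msg["lon"],
    )


def from_infrared(msg: dict) -> Track:
    # Hypothetical infrared format: epoch seconds, position in microdegrees.
    return Track(
        sensor_id=msg["source"],
        timestamp_s=float(msg["t"]),
        lat_deg=msg["lat_udeg"] / 1_000_000,
        lon_deg=msg["lon_udeg"] / 1_000_000,
    )


def common_picture(radar_msgs: list[dict], ir_msgs: list[dict]) -> list[Track]:
    """Standardize reports from both sensors and order them by time."""
    tracks = [from_radar(m) for m in radar_msgs]
    tracks += [from_infrared(m) for m in ir_msgs]
    return sorted(tracks, key=lambda t: t.timestamp_s)


if __name__ == "__main__":
    radar = [{"id": "radar-1", "time_ms": 2000, "lat": 10.0, "lon": 20.0}]
    infrared = [{"source": "ir-1", "t": 1, "lat_udeg": 10_500_000, "lon_udeg": -20_250_000}]
    for track in common_picture(radar, infrared):
        print(track)
```

The tedious part is exactly what the sketch hints at: every sensor disagrees about units, timestamps, and field names, and each mismatch has to be dug out of its documentation.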

Edge Deployment

The DoD is highly reliant on running compute workloads on size, weight, power, and cost (SWaP-C) constrained edge devices like drones, aircraft, and mobile phones. Often these edge systems are in communications-denied or bandwidth-limited environments (e.g., an unmanned surface vehicle in the middle of the ocean), so they cannot rely on data centers to run compute-heavy workloads. Any computations these devices need to run (e.g., computer vision algorithms that enable said unmanned surface vehicle to navigate) need to be performed on the device itself.

The majority of cutting-edge algorithms coming out of research labs are written in high-level data science languages like Python and MATLAB, which are convenient for researchers but relatively inefficient at runtime. However, edge devices typically run code written in highly efficient low-level coding languages like Rust and C/C++. Today, in order to deploy cutting-edge algorithms on edge devices, engineers must manually convert Python and MATLAB code into edge-deployable code written in a language like C/C++. This process is time consuming, and as a result, edge devices often run outdated software because it is too strenuous to rewrite new code to run on the edge.

Over the past year, a number of startups and open source projects have emerged that use generative AI to make it easier to transform high-level languages like Python into lower-level languages like C/C++. Companies like CodeConvert and CodePorting both use AI to make it easy to port code from one language into another language. Code Metal AI is specifically focused on converting code (e.g., computer vision algorithms, signal processing algorithms, etc.) written in Python, Julia, and MATLAB into edge-deployable code that can be run on systems like drones and medical devices. They even optimize the code they generate for particular chipsets to make sure it runs efficiently on SWaP-C constrained hardware.
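Whether a port is written by hand or generated, someone has to check that it still computes the same thing. One common way to do that is differential testing: feed the reference implementation and the port the same inputs and compare the outputs. The sketch below illustrates the idea entirely in Python, so the "ported" function is only a stand-in written in the fixed-buffer style an embedded C/C++ version might use. All names are hypothetical, and this is not a description of how any of the companies above validate their output.

```python
import random


def moving_average_reference(samples, window):
    """Readable reference: average of the last `window` samples at each step."""
    out = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        out.append(sum(recent) / len(recent))
    return out


def moving_average_ported(samples, window):
    """Stand-in for a port: fixed-size ring buffer and running sum, no slicing."""
    buf = [0.0] * window
    total = 0.0
    out = []
    for i, x in enumerate(samples):
        slot = i % window
        total += x - buf[slot]        # add the new sample, drop the one leaving
        buf[slot] = x
        count = min(i + 1, window)    # the window is not yet full at the start
        out.append(total / count)
    return out


def random_signal(n, seed):
    """Reproducible pseudo-random test input in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def outputs_match(samples, window, tol=1e-9):
    """Differential check: both implementations agree within a tolerance."""
    ref = moving_average_reference(samples, window)
    port = moving_average_ported(samples, window)
    return len(ref) == len(port) and all(abs(a - b) <= tol for a, b in zip(ref, port))


if __name__ == "__main__":
    print(outputs_match(random_signal(200, seed=0), window=8))
```

The tolerance matters because a port rarely reproduces floating point results bit for bit; the running-sum version above already accumulates rounding error differently from the reference.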

Using LLMs to generate edge deployable code will have a large effect on national security by improving many of the edge devices we rely on including autonomous vehicles, augmented reality, logistics tracking devices, and medical devices.

In conclusion, while my fellow CS grads and I may fear that our coding skills will soon become obsolete, advances in AI code generation will have a huge impact on improving national security. As always, please let me know your thoughts. Where else are there opportunities for advancements in code generation to revolutionize national security? And please do not hesitate to reach out if you or anyone you know is building at the intersection of national security and commercial technologies!

Gray Matters
