Doctoral Thesis

Co-design for Security and Reliability

Manzhosov, Evgeny

Security is commonly defined as the preservation of three system properties: confidentiality, which is the ability to prevent an unauthorized party from reading data; integrity, which is the ability to prevent an unauthorized party from writing data; and availability, which is the ability of an authorized party to use the system free of interference from an unauthorized party. Simply put, security mechanisms and policies are designed to keep unauthorized parties out of the system.

Reliability, on the other hand, is defined as a system's ability to operate as intended under non-adversarial conditions. It is measured as the probability of producing correct outputs at any given time. When designing reliability techniques, one assumes that the causes of reliability failures (e.g., cosmic rays) do not seek to bypass reliability-enhancing mechanisms and policies. Security mechanisms, in contrast, are designed under the assumption that attackers will attempt to circumvent them by exploiting any design or implementation weakness.

From the above definitions, it may appear that reliability mechanisms should be built after security mechanisms since non-adversarial conditions can be enabled only with strong security mechanisms. However, for security mechanisms to operate as intended, they need to do so under both adversarial and non-adversarial conditions; thus, security mechanisms themselves are required to be reliable. This creates a chicken-and-egg situation that can only be resolved by co-designing security and reliability techniques together.

However, for historical and engineering reasons, reliability features and security solutions are designed in isolation, often competing for the same resources and leading to trade-offs that may compromise both system reliability and security. Moreover, security today is layered on top of reliability. One justification is that a system must first become reliable to attract users and thus become attack-worthy: an unreliable system is as useless to attackers as it is to users.

This justification has been borne out historically: security threats became prominent only after reliable systems were created, leading to the current situation. For example, DRAM standards today provide no extra space for security metadata while guaranteeing up to 25% extra space for reliability features. Consequently, memory reliability schemes today, whether commercial or academic, are designed to use all the space provisioned by memory standards. Under these constraints, security schemes must either forgo memory reliability to free storage for security metadata or store that metadata separately at the cost of decreased performance.

In addition, when security and reliability are designed independently, their integration into a system ignores the interaction between the two features. Consider, for example, the interaction between memory encryption and memory error-correcting codes (ECC). Memory encryption, which is available in most enterprise processors today, is usually layered over a memory reliability technique, and at first glance the two might appear completely orthogonal: the ability of ECC to correct errors is independent of whether the data is encrypted. It turns out, however, that this is not the case. Specifically, when bit-flips occur that ECC does not correct, the diffusive nature of encryption algorithms causes many more data bits to be corrupted once the data is decrypted, which can degrade the overall reliability guarantees of the system.
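This amplification effect can be illustrated with a toy model. The sketch below uses a small Feistel cipher built on SHA-256 as a stand-in for a real memory encryption algorithm (production systems use ciphers such as AES; this construction, its key schedule, and the block size are illustrative assumptions only). Flipping a single bit of the "stored" ciphertext, as an uncorrected memory error would, corrupts a large fraction of the plaintext bits after decryption:

```python
import hashlib

def round_fn(half: bytes, key: bytes) -> bytes:
    """Round function: a keyed hash truncated to the half-block size."""
    return hashlib.sha256(key + half).digest()[:len(half)]

def feistel(block: bytes, keys) -> bytes:
    """Toy balanced Feistel network. Decryption is the same routine
    run with the round keys in reverse order."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for k in keys:
        left, right = right, bytes(a ^ b for a, b in zip(left, round_fn(right, k)))
    return right + left  # final swap makes the routine self-inverting

keys = [b"k1", b"k2", b"k3", b"k4"]          # illustrative round keys
plaintext = b"0123456789abcdef"              # one 16-byte memory block
ciphertext = feistel(plaintext, keys)

# A single uncorrected bit-flip in the encrypted data in memory...
corrupted = bytearray(ciphertext)
corrupted[0] ^= 0x01
decrypted = feistel(bytes(corrupted), list(reversed(keys)))

# ...corrupts roughly half of the plaintext bits after decryption.
diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(plaintext, decrypted))
print(f"plaintext bits corrupted: {diff_bits} of {len(plaintext) * 8}")
```

A one-bit error that a classical ECC would shrug off as "only one bit" thus becomes a multi-bit corruption once decryption diffuses it, which is why layering encryption over an independently designed reliability scheme weakens the latter's guarantees.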

To summarize, simply combining independently developed security and reliability techniques is suboptimal. Moreover, since both features protect the same asset, data, they should be co-designed to ensure seamless integration and to meet both reliability and security requirements. Thus, in this thesis, I show that co-designing reliability with security can improve not only security but also system reliability. To this end, I first show how co-designing memory error correction schemes with security can make the system more secure. My novel Multi-Use ECC, or MUSE, is a memory reliability scheme that allows inlined security metadata to be stored without sacrificing the error correction guarantees of the code. As a result, designs become simpler (and cheaper), as they require no caches for security metadata, and the system does not spend precious memory bandwidth and power fetching it.

Second, I demonstrate that system reliability can be significantly enhanced by ensuring data integrity. My novel Polymorphic ECC scheme leverages data integrity mechanisms to improve error correction guarantees for data in main memory without requiring more storage for ECC. In particular, Polymorphic ECC corrects errors iteratively by validating each correction attempt with a data integrity check. Moreover, due to newly discovered properties of residue coding, Polymorphic ECC can correct mutually exclusive classes of errors with the same code, an impossible task for traditional ECCs. This leads to a more reliable system without any extra storage requirements or decreased security guarantees.
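The core idea of validating correction attempts with an integrity check can be sketched as follows. This is not the Polymorphic ECC construction itself (which rests on residue codes and the thesis's integrity mechanisms); it is a minimal toy that assumes a plain SHA-256 digest as the integrity tag and exhaustive single-bit flips as the candidate corrections:

```python
import hashlib

def store(data: bytes):
    """Write data alongside an integrity tag (here, a plain SHA-256 digest)."""
    return data, hashlib.sha256(data).digest()

def read_and_correct(word: bytes, tag: bytes) -> bytes:
    """Iteratively propose corrections and accept the first candidate
    that passes the integrity check."""
    if hashlib.sha256(word).digest() == tag:
        return word                        # no error detected
    for i in range(len(word) * 8):         # candidates: every single-bit flip
        candidate = bytearray(word)
        candidate[i // 8] ^= 1 << (i % 8)
        if hashlib.sha256(candidate).digest() == tag:
            return bytes(candidate)        # integrity check validates the fix
    raise ValueError("uncorrectable error")

data, tag = store(b"critical cache line")
faulty = bytearray(data)
faulty[3] ^= 0x20                          # inject a single-bit fault
recovered = read_and_correct(bytes(faulty), tag)
print(recovered == data)  # prints True
```

The integrity tag thus does double duty: it detects tampering, and it arbitrates among candidate corrections, which is how integrity machinery can be recruited to strengthen reliability without extra ECC storage.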

Finally, I show that even the most sophisticated security techniques, such as Fully Homomorphic Encryption (FHE), must themselves be reliable, as they are often deployed for sensitive tasks with societal impact (e.g., processing private medical data), and I analyze the costs of making both software-only and hardware-accelerated FHE reliable.

Files

  • Manzhosov_columbia_0054D_19164.pdf (application/pdf, 3 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Sethumadhavan, Simha
Degree
Ph.D., Columbia University
Published Here
May 28, 2025