Malware research

The notion of a self-reproducing computer program can be traced back to initial theories about the operation of complex automata. John von Neumann showed that in theory a program could reproduce itself. This constituted a plausibility result in computability theory. Fred Cohen experimented with computer viruses and confirmed Neumann's postulate and investigated other properties of malware such as detectability and self-obfuscation using rudimentary encryption. His 1988 Doctoral dissertation was on the subject of computer viruses.

Cohen's faculty advisor, Leonard Adleman, presented a rigorous proof that, in the general case, algorithmic determination of the presence of a virus is undecidable. This problem must not be mistaken for that of determination within a broad class of programs that a virus is not present. This problem differs in that it does not require the ability to recognize all viruses.

Adleman's proof is perhaps the deepest result in malware computability theory to date and it relies on Cantor's diagonal argument as well as the halting problem. Ironically, it was later shown by Young and Yung that Adleman's work in cryptography is ideal in constructing a virus that is highly resistant to reverse-engineering by presenting the notion of a cryptovirus. A cryptovirus is a virus that contains and uses a public key and randomly generated symmetric cipher initialization vector (IV) and session key (SK).

In the cryptoviral extortion attack, the virus hybrid encrypts plaintext data on the victim's machine using the randomly generated IV and SK. The IV+SK are then encrypted using the virus writer's public key. In theory the victim must negotiate with the virus writer to get the IV+SK back in order to decrypt the ciphertext (assuming there are no backups). Analysis of the virus reveals the public key, not the IV and SK needed for decryption, or the private key needed to recover the IV and SK. This result was the first to show that computational complexity theory can be used to devise malware that is robust against reverse-engineering.

A growing area of computer virus research is to mathematically model the infection behavior of worms using models such as Lotka–Volterra equations, which has been applied in the study of biological virus. Various virus propagation scenarios have been studied by researchers such as propagation of computer virus, fighting virus with virus like predator codes, effectiveness of patching etc.

Behavioral malware detection has been researched more recently. Most approaches to behavioral detection are based on analysis of system call dependencies. The executed binary code is traced using strace or more precise taint analysis to compute data-flow dependencies among system calls. The result is a directed graph $$G=(V,E)$$ such that nodes are system calls, and edges represent dependencies. For example, $$(s,t)\in E$$ if a result returned by system call $$s$$ (either directly as a result or indirectly through output parameters) is later used as a parameter of system call $$t$$. The origins of the idea to use system calls to analyze software can be found in the work of Forrest et al. Christodorescu et al. point out that malware authors cannot easily reorder system calls without changing the semantics of the program, which makes system call dependency graphs suitable for malware detection. They compute a difference between malware and goodware system call dependency graphs and use the resulting graphs for detection, achieving high detection rates. Kolbitsch et al. pre-compute symbolic expressions and evaluate them on the syscall parameters observed at runtime.

They detect dependencies by observing whether the result obtained by evaluation matches the parameter values observed at runtime. Malware is detected by comparing the dependency graphs of the training and test sets. Fredrikson et al. describe an approach that uncovers distinguishing features in malware system call dependency graphs. They extract significant behaviors using concept analysis and leap mining. Babic et al. recently proposed a novel approach for both malware detection and classification based on grammar inference of tree automata. Their approach infers an automaton from dependency graphs, and they show how such an automaton could be used for detection and classification of malware.

Research in combining static and dynamic malware analysis techniques is also currently being conducted in an effort to minimize the shortcomings of both. Studies by researchers such as Islam et al. are working to integrate static and dynamic techniques in order to better analyze and classify malware and malware variants.