Talk:HashKeeper

Possibly erroneous and/or misleading claims
The first claim made by this article that I question is the claim that the MD5 algorithm, which is a function mapping an infinite space onto a finite one, produces a unique signature for a file. This is mathematically impossible on its face (a function cannot be injective if the cardinality of its domain exceeds the cardinality of its codomain), and furthermore the article on MD5 states that it is not even collision resistant (i.e., it is not difficult to compute two inputs that produce the same key).

The second claim, which the article makes twice, is that examiners can reach conclusions based upon matches from the HashKeeper database with "statistical certainty." This phrase is ambiguous and misleading. There is no such thing as 100% certainty in any statistical context; inferences can only be made with levels of confidence below 100% (e.g., 99% or 95%).

These claims should be either amended or rewritten by an expert more familiar with the database and related mathematics. --Kbolino (talk) 17:02, 7 April 2010 (UTC)
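The pigeonhole argument above can be illustrated with a small sketch. It does not produce a real MD5 collision; it assumes an artificially truncated hash (only the first two bytes of the digest, a choice made here purely for illustration) so that a collision turns up quickly. The function name `find_truncated_collision` and the counter-based inputs are hypothetical.

```python
import hashlib


def find_truncated_collision(prefix_bytes: int = 2):
    """Return two distinct inputs whose MD5 digests share the same prefix.

    Assumption for illustration: the digest is truncated to `prefix_bytes`
    bytes. With 2 bytes there are only 65,536 possible values, so by the
    pigeonhole principle a repeat must occur within 65,537 distinct inputs.
    """
    seen = {}  # maps truncated digest -> first input that produced it
    counter = 0
    while True:
        message = str(counter).encode("ascii")
        prefix = hashlib.md5(message).digest()[:prefix_bytes]
        if prefix in seen:
            return seen[prefix], message
        seen[prefix] = message
        counter += 1


first, second = find_truncated_collision()
print(first, second)
```

The same counting argument applies to the full 128-bit digest; only the number of inputs needed to guarantee a repeat changes.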


 * This is splitting hairs. The space you've referred to as "infinite" is not so: the domain is finite, namely, all the files on all the computers in the world, a number far smaller than 2^128.  This is to catch suspected criminals and child pornographers dealing in finite amounts of child pornography and for safely eliminating common files found on most computers like Windows system files, which are also finite.  The topic would best be discussed over at MD5 or hash function since that's what this is about, not this software in particular.  No one claims that a hash function is injective, just that it is practical, in the sense that a collision occurring once every few trillion years is a fair tradeoff for a tool that rapidly and dependably cuts an examination workload by hours.  Criminals could hardly be bothered to try to produce MD5 collisions to hide their naughty files - the same effort would be much more effective encrypting or hiding it instead - anyone with such forward thinking has probably already picked colors for their prison cell long before their eventual arrest. Casascius♠ (talk) 02:11, 21 October 2010 (UTC)
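For a sense of scale, the chance of an accidental collision can be estimated with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1)/(2N)), where N = 2^128. This is only a rough sketch: it assumes digests behave like uniformly random values and the files are not crafted by an adversary, so it says nothing about deliberately constructed MD5 collisions. The file count of 10^12 is a made-up figure chosen for illustration, not an estimate from any source.

```python
import math


def accidental_collision_probability(num_files: int, hash_bits: int = 128) -> float:
    """Birthday-bound approximation of at least one accidental collision.

    Assumes digests are uniformly random over 2**hash_bits values and that
    no input was deliberately crafted to collide.
    """
    space = 2 ** hash_bits
    exponent = -num_files * (num_files - 1) / (2 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Hypothetical workload: one trillion distinct files.
print(accidental_collision_probability(10 ** 12))
```

Under those assumptions the result is vanishingly small, which is the practical point; the adversarial case is a separate question.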


 * I did not inject my opinions into the article itself; I merely called for citations of the various claims it makes. You decided to strip the tags without providing any citations, so I have marked the entire article as needing citation.  Sidestepping the mathematics a bit, the "statistical certainty" (with a laughable link to the entire subject of Statistics, as though that somehow explains what "statistical certainty" means) of this and similar methods is undoubtedly a matter of some debate in the courts.  Regardless, stating a claim requires support; waving away the need for support because you know better amounts to original research. --Kbolino (talk) 07:06, 27 April 2012 (UTC)

Definitions
Having only slight knowledge in this arena, I was perplexed by the terms 'good' and 'bad'. Do these terms mean 'valid' and 'corrupted' (as in the data itself) or 'ethical' and 'evil', respectively? LorenzoB (talk) 18:26, 15 September 2014 (UTC)

--- A "bad" file would be a file that is contraband - illegal to possess regardless of circumstances. Typically, that would be child pornography. Bad files allow an investigator to focus the examination of a computer hard drive.

A "good" file would be one the source of which is known. Normally, these would be files from operating systems and freshly installed program files from reputable vendors. There would be no reason for an investigator to review these files. By way of example, a fresh install of Windows 10 adds tens of thousands of files to a hard drive. Hash those files, store the hashes as known good files and, when examining a seized hard drive, ignore those tens of thousands of files.

The files not identified as good should be examined because they could contain evidence of a crime, which is the purpose of conducting a forensic examination.

Hashing can also be used to identify, in a practical sense, identical files (practical because, as pointed out by Kbolino, there is a non-zero probability of a false positive) and can be used to focus copyright and piracy investigations. HashKeeper was not intended for those kinds of investigations. Pndfam05 (talk) 17:17, 17 January 2017 (UTC)
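The known-good and known-bad triage described above can be sketched roughly as follows. This is not HashKeeper's actual implementation or file format; the function names, the `triage` categories, and the use of in-memory sets for the hash lists are all assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(paths, known_good, known_bad):
    """Sort files into 'good', 'bad', and 'unknown' by MD5 lookup.

    `known_good` and `known_bad` are hypothetical sets of hex digests.
    A match is a practical indicator rather than proof, since distinct
    files can share a digest.
    """
    result = {"good": [], "bad": [], "unknown": []}
    for path in paths:
        file_hash = md5_of_file(path)
        if file_hash in known_bad:
            result["bad"].append(path)      # flag for focused examination
        elif file_hash in known_good:
            result["good"].append(path)     # safe to skip
        else:
            result["unknown"].append(path)  # still needs manual review
    return result
```

Files landing in "unknown" are the ones an examiner would still have to review by hand; the lookup only narrows the workload.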