User:Idkk/Data landfill

Data Landfill
Data Mining is the extraction of information, by analysis, from large quantities of (often disparate) data. If the original data cannot be usefully analysed - because of its size, its complexity, its lack of organisation, or any other reason - then we have the opposite, which may be referred to as Data Landfill. Right from the outset there has to be governance in the storing of data, and in the planning of that storage, to avoid losing useful access to the information locked into that data.

Data may be difficult to analyse because of its size. This is less and less of a worry, as CPU speeds appear to be growing faster than the amount of data we have to process. A common estimate is that the total quantity of data in the world doubles every two years - a compound growth of roughly 41% per year - which is slower than the raw growth in CPU speeds.
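The comparison above can be made concrete with a small sketch. It converts a doubling period into an equivalent compound annual growth rate and compares it against the 58.5% per year figure for raw CPU speed quoted below; the function name is ours, not from the article.

```python
# Sketch: comparing the two growth rates discussed above.
# Assumption: "doubles every two years" means compound annual growth.

def annual_rate_from_doubling(years_to_double: float) -> float:
    """Compound annual growth rate implied by a doubling period."""
    return 2 ** (1 / years_to_double) - 1

data_rate = annual_rate_from_doubling(2)  # ~0.414, i.e. ~41.4% per year
cpu_rate = 0.585                          # Moore's Law paraphrase used in the text

print(f"Data grows ~{data_rate:.1%}/year; raw CPU grows ~{cpu_rate:.1%}/year")
print("Raw CPU outpaces raw data volume:", cpu_rate > data_rate)
```

So raw data volume on its own is not the problem; the difficulty, as the next section argues, comes from complexity growing faster than the data itself.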

[[File:Comparison_of_Data_Complexity_with_Raw_CPU_speed,_over_time.pdf|thumb|left|250px|Data Complexity and CPU Speed. Raw CPU 58.5%, Effective CPU 55%]]
Data may be difficult to analyse because of its complexity. There is no fixed definition of "complexity" in this context: we may be looking at Effective Measure Complexity, Computational Complexity, Algorithmic Information Complexity, Shannon Entropy, Kolmogorov Complexity, Crutchfield's "Topological Complexity", the Time Complexity of some useful analysis, or some other measure of indexing difficulty or sorting speed. A review of some aspects of data complexity measures was given by Sotoca, Sánchez and Mollineda in 2005. If, as a reasonable first estimate, we take Data Complexity to be related to the sortable interconnections between items of elementary data, then we can say that complexity grows at the rate O(n log n), where n is the number of elementary data items. Raw computing power may well grow as a consequence of Moore's Law, but useful computing power grows much more slowly than that (according, for example, to Wirth's Law). This has a large effect upon whether any given set of data can ever be analysed. Moore's Law can be paraphrased by saying that CPU speeds grow at about 58.5% per year: at that rate, CPU speeds catch up with complexity after about ten years. If, however, useful CPU speeds grow at only 53% per year (just five and a half percentage points less - a very optimistic view of software bloat) then useful CPU catches up with complexity only after more than sixty years.
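The catch-up figures above depend on starting assumptions the article does not state (how far behind the machine begins, and how large the dataset is). As a rough illustration only, here is one way such a crossover could be computed; the initial deficit of 100x and the starting size n0 are hypothetical values of ours, so the printed years will not match the article's figures exactly, but they show the same effect: a small drop in the annual growth rate greatly lengthens the catch-up time.

```python
import math

# Illustrative sketch, not the author's calculation. Suppose today's machine
# is `deficit` times too slow to analyse a dataset whose complexity grows as
# n * log2(n), while n doubles every two years. How many years until CPU
# growth at a given annual rate closes the gap? All starting values assumed.

def years_to_catch_up(annual_cpu_rate: float, deficit: float, n0: float = 1e9) -> int:
    def complexity(n: float) -> float:
        return n * math.log2(n)

    year, cpu = 0, 1.0
    while cpu < deficit * complexity(n0 * 2 ** (year / 2)) / complexity(n0):
        year += 1
        cpu *= 1 + annual_cpu_rate
        if year > 500:
            return -1  # never catches up at this rate
    return year

print("Raw CPU (58.5%/yr):", years_to_catch_up(0.585, deficit=100), "years")
print("Useful CPU (53%/yr):", years_to_catch_up(0.53, deficit=100), "years")
```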

Data may be difficult to analyse because of its lack of organisation. Organisation does not just mean the arrangement of the physical records of a data set: it includes a coherent definition of what the various parts of the data actually mean, of the quality and precision of the recorded information, and of the interrelations that should or could exist between different items. If data is simply dumped into storage without consideration of - and adherence to - these, then it rapidly becomes useless: it becomes Data Landfill.
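A minimal sketch of that governance point: data stored against an explicit schema (stating what each field means and what type it must have) stays analysable, while records that silently violate the schema drift toward landfill. The schema and records here are invented examples, not from the article.

```python
# Hypothetical schema: field name -> required Python type, with a comment
# recording what the field means (the "coherent definition" discussed above).
SCHEMA = {
    "sensor_id": str,    # unique sensor identifier
    "reading_c": float,  # temperature in degrees Celsius
    "taken_at": str,     # ISO-8601 timestamp of the reading
}

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing field: {f}" for f in SCHEMA if f not in record]
    problems += [
        f"wrong type for {f}: expected {t.__name__}"
        for f, t in SCHEMA.items()
        if f in record and not isinstance(record[f], t)
    ]
    return problems

good = {"sensor_id": "s-17", "reading_c": 21.5, "taken_at": "2011-06-01T12:00:00Z"}
bad = {"sensor_id": "s-17", "reading_c": "warm"}

print(validate(good))  # []
print(validate(bad))   # reports the missing field and the wrong type
```

Checking records on the way in, rather than after years of accumulation, is exactly the "governance from the outset" the article calls for.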

Recovery from Landfill
Outside of data processing we know that there can be useful extraction from waste and landfill, although at some expense - usually of human effort, and often at a cost to human health. Similarly, with computing effort, real information can be extracted from data storage which was previously of little use. This extraction can come from being able to process data faster, or from being able to reorganise large quantities of data in new ways (for example, as directed graphs rather than as relational tables) and to apply searching techniques other than the well-tried and well-known SQL methods.
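The reorganisation idea can be sketched briefly: the same facts held as relational rows can be recast as a directed graph and explored by traversal, answering in one pass a question (everything reachable from a node) that would need repeated self-joins in SQL. The table contents are invented for illustration.

```python
from collections import defaultdict, deque

# "Relational" form: (parent, child) rows, as they might sit in a table.
rows = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]

# Recast as a directed graph: adjacency lists keyed by parent.
graph = defaultdict(list)
for parent, child in rows:
    graph[parent].append(child)

def reachable(start: str) -> set[str]:
    """All nodes reachable from start - one traversal instead of N self-joins."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("a")))  # ['b', 'c', 'd', 'e']
```

In a relational store, the equivalent query needs either recursive SQL or one join per level of depth; the graph form makes the depth of the search irrelevant to how the question is phrased.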