User:Rocketshiporion/File Backup, Indexing & Deduplication Program

Storage format
I think the most important decision at this early stage is the storage format for the metadata. Once a storage format is in use it is hard to change, because of backward compatibility.

Ideally the storage format should be easy to access from different programming languages, easy to read manually, compact, and capable of high-performance searching in the data. I also think it should be easy to merge data from two sources and treat them as one data set. The raw amount of metadata is hard to estimate; in a typical home-use scenario for photos, documents and so on it can be estimated as:


 * Bytes per file observation on average (path, storage media name, file name, other metadata): 200
 * Number of files in each indexing operation: 10 000
 * Number of indexing operations per year: 50
 * Total amount of metadata: 200 × 10 000 × 50 = 100 MB/year
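Spelled out as a quick sanity check, using the figures from the list above:

```python
# Back-of-the-envelope check of the yearly metadata volume estimate.
bytes_per_observation = 200   # path, storage media name, file name, other metadata
files_per_operation = 10_000
operations_per_year = 50

total_bytes = bytes_per_observation * files_per_operation * operations_per_year
print(total_bytes / 1_000_000, "MB/year")  # → 100.0 MB/year
```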

If the program is used for all system files, on many computers in a company, every night, the total amount of data will be much larger. I think it will be necessary to remove some old metadata to keep the size of the metadata down.

One way to minimize the amount of metadata is to not store all metadata of every file every time it is indexed; instead, each record could reference earlier records by ID number and note only the things that have changed. The main problem with this is the increased complexity and the increased risk of bugs. Merging two datasets in particular becomes complicated, since the ID numbers for the records are not guaranteed to be unique across multiple independent datasets. For these reasons I think I prefer the simple solution of storing all metadata independently for each observation.

I think the system should be built in such a way that it is never necessary to load all the metadata into RAM.

The formats I have considered are:


 * A custom simple text format.
 * JSON
 * XML
 * SQLite

At the moment I prefer SQLite. The drawback is that it cannot be inspected with a text editor, but other tools allow inspection of the data. The advantages are that it is easier to search the data, and that it is easy to extend the file format with new tables and columns.

I have earlier made a partly implemented prototype using JSON, where each indexing operation generates a JSON file like this small example. The drawback is that after a while there will be many such files, and it will take time to scan through them when searching. A benefit is that the user can use Explorer or a similar file manager to remove unwanted metadata.
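A per-operation JSON file of the kind described could look something like this (the field names here are illustrative assumptions, not the prototype's actual format):

```python
import json

# Hypothetical sketch of one indexing operation's metadata file;
# all field names are illustrative, not the prototype's actual format.
observation = {
    "indexing_date": "2010-12-20T18:30:00",
    "device": "ExternalDisk1",
    "files": [
        {"fname": "/photos/2010/img_0001.jpg", "size": 2310441, "hash": "<sha1>"},
        {"fname": "/photos/2010/img_0002.jpg", "size": 1887002, "hash": "<sha1>"},
    ],
}
print(json.dumps(observation, indent=2))
```

With roughly four lines per file entry, indexing thousands of files per operation quickly produces files of tens of thousands of lines, which is the scanning-cost drawback mentioned above.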


 * Although I'm not familiar with JSON, I think the significant drawback (based on this example) is that the number of entries accumulating from an observation may make it quite cumbersome for the user to actually remove unwanted metadata using an explorer. If e.g. 5000 files are indexed in one observation, the resulting metadata file would perhaps be a megabyte or more in size, and run to around 50,000 lines. I would go with an SQL-based database table (see below), as it could be queried with a program such as OpenOffice Base.


 * I'm not sure what to use as the primary key, though. Rocketshiporion ♫ 01:51, 22 December 2010 (UTC)
 * What I meant by "removing unwanted metadata" was that if each indexing operation generates one file of metadata, then the user could remove all the metadata from one indexing operation at once. For example, if the user indexes and backs up the files each week, then after a year the user could remove all but one file per month, or similar.
 * Also, if the user wants to look at the contents of a selection of DVD-Rs and other media, the user could place the metadata files from them in a folder and ask the program to look for them there. I did not intend that the user should edit the individual metadata files. Of course, nothing prevents storing the metadata in an SQL database such as SQLite, one file for each indexing operation, but I think that in order to use the full power of SQL all data should be stored in one database (file). SQLite can open up to about 10 database files as one merged database, but 10 is too limited for this application.
 * I think your SQL example is too restricted; there is no place to store the date of the indexing operation and similar fields. I think at least one more table is needed for such "meta-metadata" (the data outside the files section in the JSON example).
 * The core of my SQL example (see Data model), the table fileobs, is rather similar to your SQL example, but I have added a number of tables for other functions that maybe make it too complex. --Gr8xoz (talk) 13:38, 23 December 2010 (UTC)
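For reference, the roughly-ten-file limit mentioned above is SQLite's default for attached databases (`SQLITE_MAX_ATTACHED` defaults to 10 and can be raised to at most 125 at compile time). A minimal sketch of querying two per-operation databases as one, assuming a simplified fileobs table:

```python
import sqlite3

# Sketch: treating two per-operation databases as one via ATTACH.
# Table and column names are illustrative, not the actual schema.
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE fileobs (fname TEXT, size INTEGER)")
main.execute("INSERT INTO fileobs VALUES ('a.jpg', 100)")

# Attach a second (here in-memory) database holding another operation's data.
main.execute("ATTACH DATABASE ':memory:' AS op2")
main.execute("CREATE TABLE op2.fileobs (fname TEXT, size INTEGER)")
main.execute("INSERT INTO op2.fileobs VALUES ('b.jpg', 200)")

rows = main.execute(
    "SELECT fname FROM fileobs "
    "UNION ALL SELECT fname FROM op2.fileobs ORDER BY fname"
).fetchall()
print(rows)  # → [('a.jpg',), ('b.jpg',)]
```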

Data model
The next thing to decide is what to store and how to organize the data. My suggestion is below. (I do not know if you know SQL, but the table definitions are rather straightforward.) This also gives an idea of what functionality I want to implement.

This is a rather complex data model, but I currently do not see how to simplify it without losing too much functionality; do you have any ideas? Much of the complexity comes from the ability to merge metadata that has been created and/or updated separately, but I think that is a useful feature; otherwise the metadata file would need to be moved from one computer to the next for each update. I think the prototype should begin with filling the core tables, file_content and fileobs. How do you like this compared to the JSON solution? Any other suggestions?
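As a rough sketch of the two core tables (the column names are those mentioned in the discussion — hash, size, tags, fname, aut_prio, man_prio, compression_ratio_est — but the full schema here is an assumption, not the actual data model):

```python
import sqlite3

# Sketch of the core tables; everything beyond the column names taken
# from the discussion is an illustrative assumption.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file_content (
    hash TEXT PRIMARY KEY,        -- content hash: identical files share one row
    size INTEGER,
    tags TEXT,
    aut_prio REAL,                -- automatically estimated backup priority
    man_prio REAL,                -- manually assigned backup priority
    compression_ratio_est REAL
);
CREATE TABLE fileobs (
    hash TEXT REFERENCES file_content(hash),
    fname TEXT,                   -- full path as seen on the indexed device
    size INTEGER,
    tags TEXT,
    obs_time TEXT                 -- when this observation was made
);
""")
db.execute("INSERT INTO file_content (hash, size) VALUES ('h1', 100)")
db.execute("INSERT INTO fileobs (hash, fname, size, obs_time) "
           "VALUES ('h1', '/photos/a.jpg', 100, '2011-01-02')")
n = db.execute(
    "SELECT COUNT(*) FROM fileobs JOIN file_content USING (hash)"
).fetchone()[0]
print(n)  # → 1
```

Splitting file_content from fileobs is what lets many observations (the same file seen on several devices or dates) share one content row for deduplication.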


 * Three suggestions currently:
 * For the table fileobs, it would be useful to add an additional field, the file type. Some users may want to back up e.g. PNGs and ODBs separately.
 * For the table devices, it may be useful to add the field FileSystem. It would permit sorting the devices based on which filesystem they are formatted with.
 * I'd combine the tables fileobs and file_content, as they have three fields in common (hash, size and tags), and the only fields unique to file_content are aut_prio, man_prio and compression_ratio_est.

Rocketshiporion ♫ 03:55, 2 January 2011 (UTC)


 * Response in order:
 * A good suggestion; an interesting question is how to find the file type in an OS-independent way. Not all OSes use the end of the filename (and that is stored in fname). I need to think more about this.
 * I do not know how to get the filesystem type in an OS-independent way, but otherwise a good idea.
 * I think it is a good idea, and I will probably begin without file_content and see if I run into any problems due to this non-normalized database.
 * --Gr8xoz (talk) 15:28, 4 January 2011 (UTC)

Backup of file content
There are many interesting features that can be implemented for storing the content of the files, such as backup over the network with minimal trust between the computers, encrypted backup, peer-to-peer backup, differential backup of files with similar content, and so on. I do not see this as the core functionality, so my ambition is to begin with a function that copies the files needing a new backup to a specified folder.

UI
The interaction with the user can be complex, so a good user interface is important. I see four types of possible UIs, two textual and two graphical. I think a textual user interface is needed. It is important that routine tasks can be scripted, so that users with limited computer skills can run them after someone has configured them. The textual user interface can either be a command-line interface, where everything is specified on the command line, or a scripting interface, where the actions and parameters are specified in some sort of text file. Templates for common tasks, with helpful comments, of course need to be supplied. One simple way of doing this is to write a Python library and then let the user write a very simple Python script to specify what to do. One simple example could look like this:
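A hedged sketch of what such a library plus user script might look like (the function name and record format are illustrative assumptions, not a real API):

```python
import os

# Hypothetical library side: index_directory and its record format are
# illustrative assumptions, not an existing API.
def index_directory(path, device_name):
    """Walk `path` and return one metadata record per regular file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            records.append({"fname": full,
                            "device": device_name,
                            "size": os.path.getsize(full)})
    return records

# The user's "very simple script" then reduces to a few calls like:
records = index_directory(".", device_name="MainDisk")
print(len(records), "files indexed")
```

A configured routine task is then just such a script saved once by whoever set the system up, which a less experienced user can re-run unchanged.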

GUI
In addition to the text-based user interface, I think a graphical user interface would often be useful, especially for more interactive work. This can be done as a web interface or as a normal GUI. Since this program is intended to run locally, I think a normal GUI is most appropriate, in order to avoid security issues and configuration problems with firewalls and so on. I think the overall design will look similar to this: http://www.digitalvolcano.co.uk/content/duplicate-cleaner/screenshots (a tabbed, wizard-like interface)

The program should maybe include analysis functions like this: http://windirstat.info/index.php

I think I will use wxPython as the GUI library. http://www.wxpython.org/screenshots.php The GUI can be implemented as a wrapper around the textual user interface, or as an integral part of the program. I am not sure which way is best. I think a wrapper around the textual user interface is a nicer design, but I am not sure which is easier to implement.
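One way to picture the wrapper approach: both interfaces only collect parameters and delegate to one shared library function, so the backup logic is never duplicated. A toy sketch (all names are illustrative; in the real program a wxPython panel with text controls would stand in for the fake GUI class):

```python
# Sketch of the "GUI as a wrapper" design: the textual interface and the
# GUI both delegate to the same library function.
def run_indexing(path, device_name):
    """Shared library entry point (stubbed for the sketch)."""
    return f"indexed {path} on {device_name}"

def cli_main(argv):
    """Textual interface: parameters come from the command line."""
    return run_indexing(argv[0], argv[1])

class IndexPanel:
    """GUI layer: collects the same parameters from widgets, then delegates.
    In the real program this would be a wxPython panel."""
    def __init__(self, path_field, device_field):
        self.path_field = path_field
        self.device_field = device_field

    def on_run_clicked(self):
        return run_indexing(self.path_field, self.device_field)

print(cli_main(["/photos", "MainDisk"]))  # → indexed /photos on MainDisk
```

Replacing the GUI later then only means writing a new thin layer over `run_indexing`-style entry points, which is the maintainability argument for the wrapper design.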


 * I think the wrapper would be easier to implement; you only need to make one interface (the scriptable textual interface), then wrap a GUI around it. Plus, it would be easier to change the GUI in future - making a new wrapper should be easier than creating a whole new GUI. Rocketshiporion ♫ 23:40, 21 December 2010 (UTC)

File selection
An important problem in both the textual and the graphical user interface is selecting files for different purposes, for example backup, deletion, or just listing them for the user. This is an interesting balance between ease of use, performance and expressive power. That is something I would like to discuss with you later.

Terminology
I use a number of words, e.g. indexing, metadata, device, file observation, file content, threat and so on; if this shall be used by more than me, it is important that these have a clear and precise meaning. If something is unclear, or you have suggestions for better terminology, I would appreciate it if you told me.

What do you think of the overall concept, storage format, user interface and so on?

Space colonization
I do not believe in colonization of Mars; free-space colonization makes much more sense. I think we will begin with colonies in orbit around the Earth or Moon and then move on to colonies in orbit around the Sun. I think raw material will, in the beginning, mainly be mined on the Moon, the moons of Mars and in the asteroid belt. Mars is a small, inhospitable planet that offers very few advantages over free-space colonization. Its gravity is strong enough that transportation to and from Mars is expensive. If you are interested, I could mail you some text I have written on this.

I still don't understand why the small diameter of the nuclear explosive is important; I would think it is the total volume and mass that matter. I do not understand why the propulsion unit needs to be 500 mm in order to contain a 120 mm nuclear explosive; are you measuring the nuclear explosive without the conventional explosive lenses? My understanding of the Orion propulsion system is that it is very inefficient for a small spacecraft.


 * When I say 120mm for the nuclear explosive, I mean just the plutonium pit. As for the Orion-type shuttle, it would in no way be anywhere as small as the Space Shuttle - I intended a craft with a much larger overall diameter, but still significantly smaller than a full-size Orion. It would be enormous compared to the Space Shuttle; and its only similarity is that it would shuttle materiel and personnel between Earth and its outposts. I would most certainly like to read the text you have written on free space colonization. Rocketshiporion ♫ 04:32, 3 January 2011 (UTC)


 * I think a 120 mm plutonium pit is larger than most pits; the pit in Fat Man was only 90 mm. I do not think a diameter of less than 500 mm would be hard: the W54 has a diameter of 270 mm and is 400 mm long. It has a mass of 23 kg and a yield of 250 tons of TNT at the most powerful setting; an experimental version, the XW54, was tested with a yield of 6000 tons of TNT.
 * The Orion base design "Interplanetary" (total ship mass 4 000 000 kg) planned to use 800 nuclear explosions, each equivalent to 140 tons of TNT, to reach LEO. The Orion design "Advanced Interplanetary" (total ship mass 10 000 000 kg) planned to use 800 nuclear explosions, each equivalent to 350 tons of TNT, to reach LEO.
 * The interstellar designs use 1 Mt TNT devices. The W59 is a 1 Mt device with a diameter smaller than 500 mm (it is 414 mm), but it is 1215 mm long and has a mass of 250 kg. The propulsion unit will maybe be somewhat larger due to the need for a directional explosion and reaction mass. I will e-mail you the text about space colonisation within a few days. --Gr8xoz (talk) 16:12, 4 January 2011 (UTC)

Nuclear proliferation
Of course, rogue states with active programs for the development of nuclear weapons are currently hard to stop. But I think that in the long run the general use of nuclear power, and of nuclear explosives especially, will affect nuclear proliferation. If Orion-style propulsion becomes big business, then it will be hard to argue that some countries should not use it, and it is very easy to weaponize the technology. It is also important to remember that rogue states are not constant; Iran was a much nicer country before 1978. What I am most afraid of is not a rogue state using a few nuclear bombs, but an escalating nuclear war: some rogue state uses a nuclear bomb, other countries retaliate, and a chain reaction starts, similar to the events that led to the First World War. This could threaten the survival of human civilisation. Some estimates of the chances of human civilisation surviving 100 years are as low as 50%, and nuclear war plays an important role in this estimate. I think that is way too pessimistic, but I think the risks are big enough to be taken very seriously.


 * I'm interested in Orion because I see it as the fastest way to get off our planet. But you're right about the possibility of a nuclear holocaust due to a political chain reaction - in my hurry to get to other parts of the Solar System, I had overlooked the possibility that the human species might use Orion to annihilate itself first, leaving no one left to actually use Orion to go anywhere! Now that I come to think of it, I can all too easily imagine a nuclear war between the US and China, North Korea, Iran, etc. wiping out a billion people. Even a war between Israel and Iran could kill a hundred million people. And then there's nuclear-armed India and Pakistan...
 * While farther off in the future, something like Antimatter-Catalyzed Nuclear Pulse Propulsion might be safer - it can't easily be used to destroy cities. Rocketshiporion ♫ 04:50, 3 January 2011 (UTC)