User:Thisismyusername31/Data sanitization/Denmum Peer Review

General info

 * Whose work are you reviewing?

Thisismyusername31


 * Link to draft you're reviewing
 * User:Thisismyusername31/sandbox
 * Link to the current version of the article (if it exists)

Evaluate the drafted changes
(Compose a detailed peer review here, considering each of the key aspects listed above if it is relevant. Consider the guiding questions, and check out the examples of what feedback looks like.)

Article Draft 1 (1/4)
Data sanitization is the process of permanently removing or hiding sensitive information when datasets are used for study or when information is transferred from one device to another. This technique is essential for extracting useful information from original databases without infringing on the private information they may contain. '''In recent decades, there has been an increasing use of database information in the generation of electronic tools, such as 5G mobile data. There has also been increasing usage of Internet of Things (IoT) technologies. (Do you have a source to back these two sentences up?)''' IoT technologies refer to smart devices equipped with sensors, cameras, recording devices, and other sensory tools that are linked directly to other devices through the internet. These devices can easily transfer information from one device to another; however, this ease of transfer also poses a major privacy challenge. Transferring information carries many risks because sensitive, raw data needs to be removed along the way. For example, devices like Alexa and Google Home need to be equipped with data sanitization tools that prevent the leakage of private data they may collect. (Sourcing) Data sanitization is also commonly referred to as Privacy Preserving Data Mining, or PPDM, as it aims to preserve important information while using algorithms to filter out sensitive details. ''' Currently, many models of data sanitization rely on heuristic methods that delete or add information to the original database in an effort to preserve the privacy of each subject. However, there have also been numerous new developments in PPDM that rely instead on machine learning and deep learning techniques. (sourcing) '''

''' Overall, I like the intro. It has an objective tone, and the content seems largely neutral. When you say things like "many ..." or "increasing use," make sure to have some sort of statistic or source to back that up. Otherwise, seems great! '''

Applications of Data Sanitization
Privacy Preserving Data Mining (PPDM) has a wide range of uses and is an integral step in the transfer or use of any large data set. It is also commonly linked to blockchain-based secure information sharing within supply chain management systems.


 * 5G data
 * Internet of Things (IoT) technologies, e.g., Alexa, Google Home, etc.
 * Healthcare industry, using large datasets
 * Supply chain industry, usage of blockchain and optimal key generation

Browser-backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage. Furthermore, ease of use is important for enterprises and workplaces that use cloud storage for communication and collaboration. (How does one know this? Need more information/elaboration or sourcing.)

Data sanitization is especially relevant for the medical field or large public organizations that need to use very large databases of sensitive data. It is these organizations that need to find efficient ways to hide sensitive data while maintaining functionality. ''' Sounds somewhat opinionated. Why those organizations specifically? More elaboration needed. '''

Blockchain is used to record and transfer information in a secure way, and data sanitization techniques are required to ensure that this data is transferred more securely and accurately. It is especially applicable to those working in supply chain management and may be useful for those looking to optimize the supply chain process. ''' The need to improve blockchain methods is becoming increasingly relevant as the global level of development increases and becomes more electronically dependent. (Need sourcing for statement and/or more elaboration). '''

''' Overall, this section seems good. Some extra elaboration and sourcing is necessary in order to improve section. '''

Risks Associated
Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If data sanitization methods are unsuccessful at removing all sensitive information, they pose the risk of leaking this information to attackers. ''' Numerous studies have been conducted to optimize ways of preserving sensitive information. (Very general statement, need to make more specific). ''' Some methods of data sanitization are highly sensitive to distinct points that lie far from the rest of the data. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data. Another method of data sanitization also removes outliers in data, but does so in a more general way: it detects the general trend of the data and discards anything that strays from it, and it is able to target anomalies even when they are inserted as a group. In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.
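As a rough illustration of the more general, trend-based outlier removal described above, here is a hypothetical sketch in Python (using NumPy; the z-score rule and threshold are illustrative assumptions, not taken from the cited methods):

    import numpy as np

    def sanitize_outliers(data: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
        """Discard points that stray too far from the general trend of the data."""
        mean = data.mean(axis=0)
        std = data.std(axis=0) + 1e-12            # avoid division by zero
        z_scores = np.abs((data - mean) / std)    # how far each point strays, per feature
        keep = (z_scores < z_threshold).all(axis=1)
        return data[keep]                         # suspicious points are removed

    # Example: normal readings plus one far-away, possibly poisoned, point
    readings = np.vstack([np.random.normal(0, 1, size=(100, 2)), [[25.0, 25.0]]])
    cleaned = sanitize_outliers(readings)         # the distant point is dropped

This simple rule captures the general trend cheaply, but a large enough group of coordinated anomalies can shift the mean and evade it, which is the trade-off between the two styles of method mentioned above.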

Furthermore, data sanitization methods may remove useful, non-sensitive information, which renders the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct the issue of the loss of original dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new algorithm for data sanitization called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method. Much emphasis is usually placed on protecting the privacy of users, so this method brings a new perspective that focuses on also protecting the integrity of the data. It has three main advantages: it optimizes the process of sanitization by cleaning only the item with the highest conflict count, it keeps the parts of the dataset with the highest utility, and it analyzes the conflict degree of the sensitive material. Research was conducted on the efficacy and usefulness of this technique, revealing the ways it can help maintain the integrity of the dataset. The technique first pinpoints the specific parts of the dataset that may be poisoned data and then uses algorithms to weigh the trade-off between the data's usefulness and the risk it poses in order to decide whether it should be removed. This is a new approach to data sanitization that takes the utility of the data into account before it is immediately discarded.
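To illustrate the general idea of conflict-count-guided sanitization, here is a hypothetical toy sketch in Python. It is not the published IMSICF algorithm, only the notion of picking a "victim" item by its conflict count and removing it from the transactions that support a sensitive itemset:

    from collections import Counter

    def conflict_counts(sensitive_itemsets):
        """Conflict degree of an item = how many sensitive itemsets it appears in."""
        counts = Counter()
        for itemset in sensitive_itemsets:
            counts.update(itemset)
        return counts

    def sanitize(transactions, sensitive_itemsets, max_support=0):
        counts = conflict_counts(sensitive_itemsets)
        for itemset in sensitive_itemsets:
            # victim = the item in this itemset with the highest conflict count
            victim = max(itemset, key=lambda item: counts[item])
            supporting = [t for t in transactions if itemset <= t]
            # delete the victim from supporting transactions until the itemset's
            # support drops to the allowed level; other items stay untouched
            for t in supporting[: max(0, len(supporting) - max_support)]:
                t.discard(victim)
        return transactions

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
    sanitize(transactions, sensitive_itemsets=[{"a", "b"}, {"a", "c"}])

Choosing the item that appears in the most sensitive itemsets means a single deletion can lower the support of several sensitive patterns at once, which is roughly how a conflict-first heuristic limits the damage done to non-sensitive data.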

''' I like this section a lot. Tone seems nice and sourcing is done. Maybe some more specifics in the first paragraph? (In terms of names) '''

Methods of Data Sanitization
An important goal of PPDM is to strike a balance between maintaining the privacy of users who have submitted the data and enabling developers to make full use of the dataset. Many measures of PPDM directly modify the dataset and create a new version that makes the original unrecoverable. These measures strictly erase any sensitive information and make it inaccessible to attackers.

One type of data sanitization is rule-based PPDM, which uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases. Transactional databases are the general term for data storage used to record transactions as organizations conduct their business; examples include shipping payments, credit card payments, and sales orders. One survey analyzes fifty-four different methods of data sanitization and presents four major findings about trends in the field.
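As a simplified, hypothetical sketch of what association rule hiding can look like on a small transactional database (Python; the item names are placeholders, and real rule-hiding algorithms are far more careful about side effects on non-sensitive rules):

    def support(transactions, itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(transactions, antecedent, consequent):
        # assumes the antecedent actually occurs somewhere in the data
        return support(transactions, antecedent | consequent) / support(transactions, antecedent)

    def hide_rule(transactions, antecedent, consequent, min_conf=0.5):
        """Push the confidence of a sensitive rule (antecedent -> consequent)
        below min_conf by deleting the consequent from supporting transactions."""
        for t in transactions:
            if confidence(transactions, antecedent, consequent) < min_conf:
                break
            if (antecedent | consequent) <= t:
                t -= consequent                  # sanitize this transaction in place
        return transactions

    # e.g. hide the sensitive rule {diagnosis} -> {medication} before sharing records
    records = [{"diagnosis", "medication"}, {"diagnosis", "medication"},
               {"diagnosis"}, {"medication"}]
    hide_rule(records, {"diagnosis"}, {"medication"})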

Certain newer methods of data sanitization rely on machine learning and deep learning. There are various weaknesses in the current use of data sanitization: many methods are not intricate or detailed enough to protect against more specific data attacks. This effort to maintain privacy while mining important data is referred to as privacy-preserving data mining. Machine learning develops methods that are more adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify data sanitization methods and run these protective measures in a more efficient and less time-consuming way.
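To make the machine learning angle concrete, here is a hypothetical sketch (Python with scikit-learn; the data, the choice of model, and the contamination rate are all illustrative assumptions) of using a learned anomaly detector to screen records before a dataset is released:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Fictional feature matrix: each row is a record that may or may not be poisoned.
    rng = np.random.default_rng(0)
    records = rng.normal(0, 1, size=(500, 4))
    records[:5] += 10.0                          # a small group of injected anomalies

    detector = IsolationForest(contamination=0.05, random_state=0)
    labels = detector.fit_predict(records)       # 1 = looks normal, -1 = flagged

    sanitized = records[labels == 1]             # drop the flagged records before sharing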

There have also been hybrid models that utilize both rule-based and machine learning or deep learning methods to achieve a balance between the two techniques.

''' Good section. Similar to before, some sourcing would be good. '''

Data sanitization
Data sanitization involves the secure and permanent erasure of sensitive data from datasets and devices to guarantee that no residual data can be recovered, even through extensive forensic analysis. Data sanitization has a wide range of applications but is mainly used for clearing out old personal electronic devices or for the sharing and use of large datasets that contain sensitive information. The main strategies for erasing personal data from devices are physical destruction, cryptographic erasure, and data erasure. Data sanitization methods are also applied to the cleaning of sensitive data, such as through heuristic-based methods, machine learning-based methods, and k-source anonymity.

This erasure is necessary because an increasing amount of data is moving to online storage, which poses a privacy risk if an old device is resold to another individual. The importance of data sanitization has risen in recent years as private information is increasingly stored in an electronic format and larger, more complex datasets are being utilized to distribute private information. Electronic storage has expanded and enabled more private data to be stored, and therefore requires more advanced and thorough data sanitization techniques to ensure that no data is left on a device once it is no longer in use. Technological tools that enable the transfer of large amounts of data also allow more private data to be shared. Especially with the increasing popularity of cloud-based information sharing and storage, data sanitization methods that ensure all shared data is cleaned have become a major concern.

''' Great introduction, clear, concise and maintains an objective tone. '''

Clearing devices
The main use of data sanitization is for the complete clearing of devices and destruction of all sensitive data once the device is no longer in use[Source]. This is an important stage in Information Lifecycle Management (ILM), an approach for ensuring privacy and data management throughout the usage of an electronic device, as it ensures that all data is completely destroyed and unrecoverable when devices reach the end of their lifecycle.[Source]

There are three main methods of data sanitization for complete erasure of data: physical destruction, cryptographic erasure, and data erasure[Source]. All three erasure methods aim to ensure that deleted data cannot be accessed even through advanced forensic methods, which maintains the privacy of individuals’ data even after the mobile device is no longer in use[Source].

Physical destruction

Physical erasure involves the manual destruction of stored data. This method uses mechanical shredders to shred devices, such as phones, computers, hard drives, and printers, into small individual pieces, or degaussers to magnetically destroy the data they hold.

Degaussing is most commonly used on magnetic storage media such as hard disk drives (HDDs), and involves the utilization of high-energy magnetic fields to permanently disrupt the functionality and memory storage of the device. When data is exposed to this strong magnetic field, any memory storage is neutralized and cannot be recovered or used again. Degaussing is not effective on solid-state drives (SSDs), which do not store data magnetically.

Physical destruction often ensures that data is completely erased and cannot be used again. However, the physical byproducts of mechanical shredding can be damaging to the environment. Furthermore, once a device is physically destroyed, it can no longer be resold or used again.

Cryptographic erasure

Cryptographic erasure involves the destruction of the secure key, or passphrase, that is used to protect stored information. Data encryption involves the development of a secure key that allows only authorized parties to gain access to the data that is stored. The permanent erasure of this key ensures that the private data stored can no longer be accessed. Cryptographic erasure is commonly installed by manufacturers of the device itself, as encryption software is often built into the device. Encryption with key erasure involves encrypting all sensitive material in a way that requires a secure key to decrypt the information when it needs to be used. When the information needs to be deleted, the secure key can be erased. This provides greater ease of use than other software methods because it involves a single deletion of the secure key rather than of each individual file.
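A minimal sketch of the idea in Python, assuming the third-party cryptography package (the key handling here is purely illustrative; real deployments typically rely on hardware-backed, self-encrypting drives):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()                      # secure key protecting the stored data
    cipher = Fernet(key)

    stored = cipher.encrypt(b"sensitive record")     # only the ciphertext is ever persisted

    # Normal use: the key decrypts the data on demand.
    assert cipher.decrypt(stored) == b"sensitive record"

    # Cryptographic erasure: permanently destroy the key; the ciphertext that
    # remains on the device can no longer be read by anyone.
    del cipher
    key = None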

Cryptographic erasure is often used for data storage that does not contain as much private information, since errors can occur due to manufacturing failures or human error during the process of key destruction, which makes the outcome of the erasure less certain. This method allows data to continue to be stored on the device and does not require that the device be completely erased. This way, the device can be resold to another individual or company, since the physical integrity of the device itself is maintained.

Data erasure

The process of data erasure involves masking all information at the byte level through the insertion of random 0s and 1s on all sectors of the electronic equipment that is no longer in use[Source]. This software-based method ensures that all data previously stored is completely hidden and unrecoverable, which ensures full data sanitization. The efficacy and accuracy of this sanitization method can also be analyzed through auditable reports[Source].

This method ensures complete sanitization while also maintaining the physical integrity of the electronic equipment so that the technology can be resold or reused. This ability to recycle technological devices makes data erasure a more environmentally sound version of data sanitization. This method is also the most accurate and comprehensive, since the efficacy of the data masking can be tested afterwards to ensure complete deletion. However, data erasure through software-based mechanisms requires more time than other methods.
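For illustration, a hypothetical sketch of software-based erasure for a single file in Python (real data erasure tools operate on whole drives, account for SSD wear-levelling, and produce the auditable reports mentioned above):

    import os

    def overwrite_and_delete(path: str, passes: int = 3) -> None:
        """Overwrite a file's contents with random bytes before deleting it."""
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))   # random 0s and 1s over every byte
                f.flush()
                os.fsync(f.fileno())        # push the overwrite to the physical disk
        os.remove(path)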

''' Really interesting and insightful pieces! This section is clear, concise, maintains an objective tone, is sourced, and has good flow. '''

Necessity of data sanitization
There has been increased usage of mobile devices, Internet of Things (IoT) technologies, cloud-based storage systems, portable electronic devices, and various other electronic methods to store sensitive information; therefore, implementing effective erasure methods once a device is no longer in use has become crucial to protect sensitive data[Source]. Due to the increased usage of electronic devices in general and the increased storage of private information on these devices, the need for data sanitization has become much more urgent in recent years[Source].

''' Good! '''

Applications of data sanitization
Data sanitization methods are also implemented for privacy preserving data mining, association rule hiding, and blockchain-based secure information sharing. These methods involve the transfer and analysis of large datasets that contain private information which needs to be sanitized before being made available online, so that sensitive material is not left vulnerable. Data sanitization is used to ensure that privacy is maintained in the dataset through the clearing of any sensitive information prior to its use.

Privacy preserving data mining
Privacy Preserving Data Mining (PPDM) has a wide range of uses and is an integral step in the transfer or use of any large data set. It is also commonly linked to blockchain-based secure information sharing within supply chain management systems.


 * 5G data[Source]
 * Internet of Things (IoT) technologies, e.g., Alexa, Google Home, etc.[Source]
 * Healthcare industry, using large datasets[Source]
 * Supply chain industry, usage of blockchain and optimal key generation[Source]

Privacy preserving data mining and data sanitization work in tandem to clean large datasets containing sensitive information so that they can be utilized by individuals or companies for analysis. The aim of privacy preserving data mining is to ensure that private information cannot be leaked or accessed by attackers and that sensitive data is not traceable to the individuals who submitted it[Source]. Privacy preserving data mining aims to maintain this level of privacy for individuals while also maintaining the integrity and functionality of the original dataset[Source]. In order for the dataset to be used, the necessary aspects of the original data need to be preserved during the process of data sanitization. This balance between privacy and utility has been the primary goal of data sanitization methods.

Certain models of data sanitization delete or add information to the original database in an effort to preserve the privacy of each subject. These heuristic-based algorithms are becoming more popular, especially in the field of association rule mining. Heuristic methods involve specific algorithms that use pattern hiding, rule hiding, and sequence hiding to keep specific information hidden. This type of data hiding can be used to cover wide patterns in data, but it is not as effective for protecting specific information. Heuristic-based methods are not as well suited to sanitizing large datasets; however, recent developments in the field have analyzed ways to tackle this problem. An example is the MR-OVnTSA approach, a heuristics-based sensitive pattern hiding approach for big data, introduced by Shivani Sharma and Durga Toshniwal. This approach uses a heuristics-based method called the 'MapReduce Based Optimum Victim Item and Transaction Selection Approach', also called MR-OVnTSA, that aims to reduce the loss of important data while removing and hiding sensitive information. It takes advantage of algorithms that compare steps and optimize sanitization. (Sourcing?)

An important goal of PPDM is to strike a balance between maintaining the privacy of users who have submitted the data and enabling developers to make full use of the dataset. Many measures of PPDM directly modify the dataset and create a new version that makes the original unrecoverable. These measures strictly erase any sensitive information and make it inaccessible to attackers.

One type of data sanitization is rule-based PPDM, which uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases. Transactional databases are the general term for data storage used to record transactions as organizations conduct their business; examples include shipping payments, credit card payments, and sales orders. One survey analyzes fifty-four different methods of data sanitization and presents four major findings about trends in the field.

Certain newer methods of data sanitization rely on machine learning and deep learning. There are various weaknesses in the current use of data sanitization: many methods are not intricate or detailed enough to protect against more specific data attacks. This effort to maintain privacy while mining important data is referred to as privacy-preserving data mining. Machine learning develops methods that are more adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify data sanitization methods and run these protective measures in a more efficient and less time-consuming way.

There have also been hybrid models that utilize both rule-based and machine learning or deep learning methods to achieve a balance between the two techniques.

Blockchain-based secure information sharing

Browser-backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage[Source]. Furthermore, ease of use is important for enterprises and workplaces that use cloud storage for communication and collaboration[Source].

Blockchain is used to record and transfer information in a secure way, and data sanitization techniques are required to ensure that this data is transferred more securely and accurately. It is especially applicable to those working in supply chain management and may be useful for those looking to optimize the supply chain process. The need to improve blockchain methods is becoming increasingly relevant as the global level of development increases and becomes more electronically dependent.

''' Good section, concise, clear and objective tone. Would recommend sourcing the paragraph describing the MR-OVnTSA approach but otherwise, great job! '''

Risks posed by inadequate sanitization
Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If data sanitization methods are unsuccessful at removing all sensitive information, they pose the risk of leaking this information to attackers. Numerous studies have been conducted to optimize ways of preserving sensitive information. Some methods of data sanitization are highly sensitive to distinct points that lie far from the rest of the data. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data. Another method of data sanitization also removes outliers in data, but does so in a more general way: it detects the general trend of the data and discards anything that strays from it, and it is able to target anomalies even when they are inserted as a group. In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.

Furthermore, data sanitization methods may remove useful, non-sensitive information, which renders the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct the issue of the loss of original dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new algorithm for data sanitization called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method. Much emphasis is usually placed on protecting the privacy of users, so this method brings a new perspective that focuses on also protecting the integrity of the data. It has three main advantages: it optimizes the process of sanitization by cleaning only the item with the highest conflict count, it keeps the parts of the dataset with the highest utility, and it analyzes the conflict degree of the sensitive material. Research was conducted on the efficacy and usefulness of this technique, revealing the ways it can help maintain the integrity of the dataset. The technique first pinpoints the specific parts of the dataset that may be poisoned data and then uses algorithms to weigh the trade-off between the data's usefulness and the risk it poses in order to decide whether it should be removed. This is a new approach to data sanitization that takes the utility of the data into account before it is immediately discarded.

''' Great section on the risks! Would recommend sourcing the last paragraph on the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method but otherwise, it's concise, clear and objective! '''