By Priyank Jain, Manasi Gyanchandani, and Nilay Khare
Department of Computer Sciences, Maulana Azad National Institute of Technology, India
Image Attribute: 3D visualization of data by Elif Ayiter Creative Commons
Privacy and security are important issues for big data. A big data security model is not recommended for complex applications, which causes it to be disabled by default. However, in its absence, data can easily be compromised. This section therefore focuses on privacy and security issues.
Privacy - Information privacy is the privilege to have some control over how personal information is collected and used. It is the capacity of an individual or group to stop information about themselves from becoming known to people other than those they choose to give it to. One serious user privacy issue is the identification of personal information during transmission over the Internet.
Security - Security is the practice of defending information and information assets, through the use of technology, processes, and training, from unauthorized access, disclosure, disruption, modification, inspection, recording, and destruction.
Privacy vs. security - Data privacy is focused on the use and governance of individual data—things like putting policies in place to ensure that consumers’ personal information is collected, shared, and utilized in appropriate ways. Security concentrates more on protecting data from malicious attacks and the misuse of stolen data for profit. While security is fundamental for protecting data, it is not sufficient for addressing privacy. Table 1 focuses on additional differences between privacy and security.
Privacy requirements in big data
Big data analytics attract various organizations, yet a hefty portion of them decide not to utilize these services because of the absence of standard security and privacy protection tools. This section analyzes possible strategies for upgrading big data platforms with privacy protection capabilities, namely the foundations and development strategies of a framework that supports:
1. The specification of privacy policies managing the access to data stored into target big data platforms,
2. The generation of productive enforcement monitors for these policies, and
3. The integration of the generated monitors into the target analytics platforms. Enforcement techniques proposed for traditional DBMSs appear inadequate for the big data context due to the strict execution requirements needed to handle large data volumes, the heterogeneity of the data, and the speed at which data must be analyzed.
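To make point (2) concrete, an enforcement monitor can be reduced, at its simplest, to a filter applied to every record before it reaches an analytics job. The sketch below is a toy illustration under assumed names (`POLICY`, `enforce` and the consumer labels are hypothetical, not part of the authors' framework):

```python
# Minimal sketch of a policy enforcement monitor (hypothetical names, not
# the authors' framework): a policy lists the fields a data consumer may
# see, and the monitor redacts everything else before releasing a record.

POLICY = {
    "analytics": {"age", "country"},          # analytics jobs: no direct identifiers
    "billing": {"name", "email", "country"},  # billing: needs contact details
}

def enforce(record, consumer):
    """Return a copy of the record restricted to the consumer's allowed fields."""
    allowed = POLICY.get(consumer, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"name": "Alice", "email": "a@example.com", "age": 30, "country": "DE"}
print(enforce(record, "analytics"))  # {'age': 30, 'country': 'DE'}
```

A real monitor would additionally have to run inside the platform's data path and scale with the volume and velocity constraints noted above; this sketch only shows the policy-to-filter idea.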
Businesses and government agencies generate and continuously collect large amounts of data. The current increased focus on substantial sums of data will undoubtedly create opportunities to understand the processing of such data over numerous varying domains. But the potential of big data comes with a price: the users’ privacy is frequently at risk. Ensuring conformance to privacy terms and regulations is constrained in current big data analytics and mining practices. Developers should be able to verify that their applications conform to privacy agreements and that sensitive information is kept private regardless of changes in the applications and/or privacy regulations. To address these challenges, we identify a need for new contributions in the areas of formal methods and testing procedures, namely new paradigms for privacy conformance testing in the four areas of the ETL (Extract, Transform, and Load) process shown in the figure below. [3, 4]
Figure Attribute: Big data architecture and testing areas; new paradigms for privacy conformance testing in the four areas of the ETL (Extract, Transform, and Load) process are shown above. / Source: Authors
1. Pre‐Hadoop process validation This step validates the data loading process. At this step, the privacy specifications characterize the sensitive pieces of data that can uniquely identify a user or an entity. Privacy terms can likewise indicate which pieces of data can be stored and for how long. Schema restrictions can take place at this step as well.
2. Map‐reduce process validation This process transforms big data assets so they can effectively respond to a query. Privacy terms can specify the minimum number of returned records required to cover individual values, in addition to constraints on data sharing between various processes.
3. ETL process validation Similar to step (2), warehousing logic should be verified at this step for compliance with privacy terms. Some data values may be aggregated anonymously or excluded from the warehouse if they carry a high probability of identifying individuals.
4. Reports testing Reports are another form of queries, conceivably with higher visibility and a wider audience. Privacy terms that characterize ‘purpose’ are fundamental for checking that sensitive data are not reported except for specified uses.
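The minimum-result-size rule from step (2) can be sketched as a guard on query output. This is a simplified illustration of the idea, not the paper's mechanism; the threshold and function names are assumptions:

```python
# Sketch of the step-(2) rule: a query result is released only if it
# aggregates at least MIN_RECORDS individuals; otherwise it is suppressed,
# since a tiny result set could single out one person.
MIN_RECORDS = 5

def safe_count(rows, predicate):
    """Return a count only when it covers enough individuals; else suppress."""
    n = sum(1 for r in rows if predicate(r))
    return n if n >= MIN_RECORDS else None  # None = suppressed

rows = [{"city": "Bhopal", "salary": s} for s in (30, 35, 40, 45, 50, 90)]
print(safe_count(rows, lambda r: r["salary"] > 25))  # 6 -> released
print(safe_count(rows, lambda r: r["salary"] > 80))  # None (only 1 match, suppressed)
```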
Big data privacy in data generation phase
Data generation can be classified into active data generation and passive data generation. By active data generation, we mean that the data owner gives the data to a third party, while passive data generation refers to circumstances in which the data are produced by the data owner’s online actions (e.g., browsing) and the data owner may not be aware that the data are being gathered by a third party. The risk of privacy violation during data generation can be minimized either by restricting access or by falsifying data.
1. Access restriction If the data owner thinks that the data may reveal sensitive information that is not supposed to be shared, the owner can refuse to provide such data. If the data are generated passively, a few measures can be taken to ensure privacy, such as anti-tracking extensions, advertisement or script blockers, and encryption tools.
2. Falsifying data In some circumstances, it is unrealistic to prevent access to sensitive data. In that case, the data can be distorted with certain tools before a third party obtains them. If the data are distorted, the true information cannot easily be revealed. The following techniques can be used by the data owner to falsify the data:
A sockpuppet tool hides an individual’s online identity through deception. By using multiple sockpuppets, the data belonging to one specific individual appear to belong to various people. In that way, the data collector does not have enough knowledge to relate the different sockpuppets to one individual.
Certain security tools, such as Mask Me, can be used to mask an individual’s identity. This is especially useful when the data owner needs to provide credit card details during online shopping.
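In the same spirit as such masking tools, the idea of distorting data before a third party sees it can be illustrated with a trivial masker (illustrative only; the function name is an assumption, and real tools substitute working proxy values rather than stars):

```python
def mask_card(number):
    """Replace all but the last four digits of a card number with '*',
    so the stored or transmitted value no longer reveals the real number."""
    digits = number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_card("4111 1111 1111 1234"))  # ************1234
```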
Big data privacy in data storage phase
Storing high-volume data is not a major challenge due to advances in data storage technologies, for example, the boom in cloud computing. However, if the big data storage system is compromised, it can be exceptionally destructive, as individuals’ personal information can be disclosed. In a distributed environment, an application may need several datasets from various data centers and therefore confronts the challenge of privacy protection.
The conventional security mechanisms to protect data can be divided into four categories: file-level data security schemes, database-level data security schemes, media-level security schemes, and application-level encryption schemes. Responding to the 3V nature of big data analytics, the storage infrastructure ought to be scalable and able to be configured dynamically to accommodate various applications. One promising technology for addressing these requirements is storage virtualization, enabled by the emerging cloud computing paradigm. Storage virtualization is a process in which multiple network storage devices are combined into what appears to be a single storage device. SecCloud is one model for data security in the cloud that jointly considers both data storage security and computation auditing security. However, the privacy of data stored in the cloud has received only limited discussion.
Approaches to privacy-preserving storage on the cloud
When data are stored on the cloud, data security predominantly has three dimensions: confidentiality, integrity, and availability. The first two are directly related to privacy of the data, i.e., if data confidentiality or integrity is breached, it will have a direct effect on users’ privacy. Availability of information refers to ensuring that authorized parties are able to access the information when needed. A basic requirement for the big data storage system is to protect the privacy of an individual, and there are some existing mechanisms to fulfill that requirement. For example, a sender can encrypt his data using public key encryption (PKE) in such a manner that only the valid recipient can decrypt the data. The approaches to safeguard the privacy of the user when data are stored in the cloud are as follows:
Attribute-based encryption - Access control is based on the attributes of a user, rather than the user’s identity granting complete access to all resources.
Homomorphic encryption - Allows computations to be carried out on ciphertext; it can be deployed in IBE or ABE scheme settings, where updating the ciphertext receiver is possible.
Storage path encryption - It secures storage of big data on clouds.
Usage of hybrid clouds - A hybrid cloud is a cloud computing environment that utilizes a blend of on-premises private cloud and third-party public cloud services, with orchestration between the two platforms.
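To make the homomorphic-encryption entry concrete: textbook RSA happens to be multiplicatively homomorphic, meaning the product of two ciphertexts decrypts to the product of the plaintexts, so a server can combine encrypted values without reading them. A deliberately insecure small-number sketch (educational only, not a scheme from the paper):

```python
# Textbook RSA with toy parameters (INSECURE, for illustration only):
# n = p*q and e*d ≡ 1 (mod (p-1)(q-1)). Multiplying ciphertexts
# multiplies plaintexts, so a cloud can compute on data it cannot read.
p, q = 61, 53
n = p * q                  # 3233
e, d = 17, 2753            # 17 * 2753 ≡ 1 (mod 3120)

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

m1, m2 = 7, 12
c_prod = (enc(m1) * enc(m2)) % n   # combined without ever seeing m1, m2
print(dec(c_prod))                 # 84 == m1 * m2
```

Fully homomorphic schemes extend this idea to both addition and multiplication, at a much higher computational cost.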
Integrity verification of big data storage
When cloud computing is used for big data storage, the data owner loses control over the data. The outsourced data are at risk, as the cloud server may not be completely trusted; therefore, data integrity verification is of critical importance. The data owner should be firmly convinced that the cloud is storing the data properly according to the service-level contract. One way to ensure privacy for the cloud user is to provide a mechanism that allows the data owner to verify that the data stored on the cloud are intact [13, 14]. The integrity of data storage in traditional systems can be verified in a number of ways, e.g., Reed-Solomon codes, checksums, trapdoor hash functions, message authentication codes (MACs), and digital signatures; different integrity verification schemes are compared in [13, 15]. A straightforward approach to verifying the integrity of data stored on the cloud is to retrieve all the data from the cloud; more practical schemes verify the integrity of the data without having to retrieve them [14, 15]. In such an integrity verification scheme, the cloud server can provide valid evidence of integrity only when all the data are intact. It is highly recommended that integrity verification be conducted regularly to provide the highest level of data protection.
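Of the mechanisms listed above, a MAC check is the simplest to sketch: the owner keeps a secret key, records a tag before outsourcing the data, and later recomputes it. Note this is the naive retrieve-everything approach the text warns about; real public-auditing schemes avoid downloading the data. All names below are illustrative:

```python
import hmac, hashlib

key = b"owner-secret-key"  # kept by the data owner, never sent to the cloud

def tag(data: bytes) -> bytes:
    """MAC computed before outsourcing; the owner stores it for later checks."""
    return hmac.new(key, data, hashlib.sha256).digest()

original = b"outsourced dataset v1"
t = tag(original)

# Later: fetch the data back and check it was not modified by the server.
fetched = b"outsourced dataset v1"
print(hmac.compare_digest(t, tag(fetched)))        # True  -> intact
print(hmac.compare_digest(t, tag(b"tampered!")))   # False -> corrupted
```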
Big data privacy preserving in data processing
The big data processing paradigm categorizes systems into batch, stream, graph, and machine learning processing [16, 17]. For privacy protection in the data processing part, the work can be divided into two phases. In the first phase, the goal is to safeguard information from unsolicited disclosure, since the collected data might contain sensitive information about the data owner. In the second phase, the aim is to extract meaningful information from the data without violating privacy.
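As a minimal illustration of the second phase, quasi-identifiers can be coarsened before analysis so that records no longer single out individuals while aggregate patterns survive. This generalization step is a generic technique, not an algorithm named in the paper, and the field names are assumptions:

```python
def generalize(record):
    """Coarsen quasi-identifiers: exact age -> 10-year band, ZIP -> prefix.
    The aggregate signal survives; the identifying detail does not."""
    out = dict(record)
    band = (record["age"] // 10) * 10
    out["age"] = f"{band}-{band + 9}"
    out["zip"] = record["zip"][:3] + "**"
    return out

print(generalize({"age": 34, "zip": "46201", "diagnosis": "flu"}))
# {'age': '30-39', 'zip': '462**', 'diagnosis': 'flu'}
```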
This article is an excerpt from the published paper “Big data privacy: a technological perspective and review” by Priyank Jain, Manasi Gyanchandani and Nilay Khare, Journal of Big Data 2016;3:25, DOI: 10.1186/s40537-016-0059-y / Creative Commons Attribution 4.0 International License.
1. Porambage P, et al. The quest for privacy in the internet of things. IEEE Cloud Comp. 2016;3(2):36–45.
2. Jing Q, et al. Security of the internet of things: perspectives and challenges. Wirel Netw. 2014;20(8):2481–501.
3. Han J, Ishii M, Makino H. A hadoop performance model for multi-rack clusters. In: IEEE 5th international conference on computer science and information technology (CSIT). 2013. p. 265–74.
4. Gudipati M, Rao S, Mohan ND, Gajja NK. Big data: testing approach to overcome quality challenges. Data Eng. 2012:23–31.
5. Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information security in big data: privacy and data mining. IEEE Access. 2014;2:1149–76.
6. Liu S. Exploring the future of computing. IT Prof. 2011;15(1):2–3.
7. Sokolova M, Matwin S. Personal privacy protection in time of big data. Berlin: Springer; 2015.
8. Cheng H, Rong C, Hwang K, Wang W, Li Y. Secure big data storage and sharing scheme for cloud tenants. China Commun. 2015;12(6):106–15.
9. Mell P, Grance T. The NIST definition of cloud computing. Natl Inst Stand Technol. 2009;53(6):50.
10. Wei L, Zhu H, Cao Z, Dong X, Jia W, Chen Y, Vasilakos AV. Security and privacy for storage and computation in cloud computing. Inf Sci. 2014;258:371–86.
11. Xiao Z, Xiao Y. Security and privacy in cloud computing. IEEE Commun Surv Tutor. 2013;15(2):843–59.
12. Mehmood A, Natgunanathan I, Xiang Y, Hua G, Guo S. Protection of big data privacy. IEEE Access. 2016.
13. Wang C, Wang Q, Ren K, Lou W. Privacy-preserving public auditing for data storage security in cloud computing. In: Proc. of IEEE INFOCOM. 2010. p. 1–9.
14. Liu C, Ranjan R, Zhang X, Yang C, Georgakopoulos D, Chen J. Public auditing for big data storage in cloud computing—a survey. In: Proc. of IEEE int. conf. on computational science and engineering. 2013. p. 1128–35.
15. Liu C, Chen J, Yang LT, Zhang X, Yang C, Ranjan R, Rao K. Authorized public auditing of dynamic big data storage on cloud with efficient verifiable fine-grained updates. IEEE Trans Parallel Distrib Syst. 2014;25(9):2234–44.
16. Xu K, et al. Privacy-preserving machine learning algorithms for big data systems. In: IEEE 35th international conference on distributed computing systems (ICDCS). 2015.
17. Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV. Parallel processing systems for big data: a survey. Proc IEEE. 2016.