By Pwint Phyu Khine and Wang Zhao Shun
Many different kinds of organizations are now applying and implementing big data in various types of information systems based on their organizational needs. Information systems emerged according to the requirements of organizations, which are based on what organizations do, how they do it, and their organizational goals. According to Mintzberg, five different kinds of organization are classified based on the organization's structure, shape and management: (1) Entrepreneurial structure―a small startup firm, (2) Machine bureaucracy―a medium-sized manufacturing firm with a definite structure, (3) Divisionalized bureaucracy―a multinational organization that produces different kinds of products controlled by the central headquarters, (4) Professional bureaucracy―an organization that relies on the efficiency of individuals, such as law firms and universities, and (5) Adhocracy, such as a consulting firm. Different kinds of information systems are required based on the work the target organization does.
The information systems an organization requires, and the nature of the problems within them, reflect the type of organizational structure. Systems are structured procedures for the regulation of the organization, limited by the organization's boundary. These boundaries express the relationship between systems and their environment (the organization). Information systems collect and redistribute data within the internal operations of the organization and its environment using the three simplest basic procedures: inputting data, processing it, and outputting information. Linking the organization and its systems are "business processes"―logically related tasks with formal rules to accomplish a specific piece of work―which must be coordinated throughout the organization hierarchy [1]. These organizational principles hold regardless of whether old or newly evolving data methodologies are used.
Relationship between Organization and Information Systems
The relationship between organization and information systems is described as socio-technical. The socio-technical model suggests that all of these components―organizational structure, people, job tasks and Information Technology (IT)―must be changed simultaneously to achieve the objectives of the target organization and its information systems [1]. Sometimes these changes can result in changed business goals, relationships with people, and business processes for the target organization, blurring organizational boundaries and causing the flattening of the organization [1] [2]. Big data transforms traditionally siloed information systems in organizations into digital nervous systems, with information flowing in and out of related organizational systems. Organizational resistance to change must be considered in every implementation of information systems. The most common reason for the failure of large projects is not the failure of the technology, but organizational and political resistance to change [1]. Big data projects need to avoid this kind of mistake and be implemented not only from an information system perspective but also from an organizational perspective.
Implementing Big Data Systems in Organizations
The work in [3] provides a layered view of the big data system. To manage the complexity of a big data system, it can be decomposed into a layered structure according to a conceptual hierarchy. The layers are the "Infrastructure layer" with raw ICT resources; the "Computing layer", which encapsulates various data tools into a middleware layer that runs over the raw ICT resources; and the "Application layer", which exploits the interface provided by the programming models to implement various data analysis functions and develop field-related applications for different organizations.

Different scholars are considering the system development life cycle of big data projects. Based on IBM's three phases for building big data projects, the work in [4] proposed a holistic view for implementing big data projects.
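As a rough illustration (not taken from [3]), the layered decomposition can be sketched as plain composition, where each layer only talks to the interface of the layer below it. The class and method names here are invented for illustration:

```python
class InfrastructureLayer:
    # Raw ICT resources: simulated here as simple key/value block storage.
    def __init__(self):
        self.blocks = {}

    def store(self, key, value):
        self.blocks[key] = value

    def read(self, key):
        return self.blocks[key]


class ComputingLayer:
    # Middleware running over the raw resources (e.g. a MapReduce-style engine).
    def __init__(self, infra):
        self.infra = infra

    def run(self, func, key):
        # Apply a user-supplied analysis function to stored data.
        return func(self.infra.read(key))


class ApplicationLayer:
    # Field-related analysis functions built on the computing layer's interface.
    def __init__(self, computing):
        self.computing = computing

    def word_count(self, key):
        return self.computing.run(lambda text: len(text.split()), key)


infra = InfrastructureLayer()
infra.store("doc1", "big data for organizations")
app = ApplicationLayer(ComputingLayer(infra))
print(app.word_count("doc1"))  # 4
```

The point of the hierarchy is that the application layer never touches raw ICT resources directly; swapping the computing layer (say, batch for stream processing) leaves the application code unchanged.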
Phase 1. Planning: involves global strategy elaboration; the key idea is that the most important consideration is not technology but business objectives.
Phase 2. Implementation: this phase is divided into
1) data collecting from major big data sources,
2) data preprocessing: cleaning the data for validity, integrating different data types and sources, transformation (mapping data elements from source to destination systems), and reducing data into a smaller structure (sometimes with data discretization as a part of it),
3) smart data analysis, i.e., using advanced analytics to extract value from a huge set of data, applying advanced algorithms to perform complex analytics on either structured or unstructured data,
4) representation and visualization for guiding the analysis process and presenting the results in a meaningful way.
Phase 3. Post-implementation: this phase involves
1) an actionable and timely insight extraction stage, based on the nature of the organization and the value it is seeking, which decides the success or failure of the big data project,
2) an evaluation stage that assesses the big data project; diverse data inputs, their quality, and the expected results all need to be considered.

Based on this big data project life cycle, an organization can develop its own big data projects. The best way to implement big data projects is to use technologies from both before and after the big data era, e.g. both Hadoop and a data warehouse, because they complement each other. The US government considers "all contents as data" when implementing big data projects. In the digital era, data has the power to change the world and needs careful implementation.
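The preprocessing steps of Phase 2 (cleaning, integration, transformation with source-to-destination field mapping, and discretization) can be sketched as a small pipeline. This is a minimal illustration under invented data and field names, not a prescription from [4]:

```python
def clean(records):
    # Data cleaning: keep only records with a valid, non-negative amount.
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]

def integrate(*sources):
    # Data integration: merge records arriving from different source systems.
    return [r for source in sources for r in source]

def transform(record, field_map):
    # Transformation: map source field names onto the destination schema.
    return {dest: record[src] for src, dest in field_map.items()}

def discretize(amount):
    # Discretization: reduce a continuous value to a coarse category.
    return "high" if amount >= 100 else "low"

# Two hypothetical source systems with dirty records.
crm = [{"amount": 150}, {"amount": None}]
web = [{"amount": 20}, {"amount": -5}]

cleaned = clean(integrate(crm, web))
rows = [transform(r, {"amount": "value"}) for r in cleaned]
labels = [discretize(r["value"]) for r in rows]
print(labels)  # ['high', 'low']
```

Each step mirrors one preprocessing activity named in the phase list; in a real project these would be distributed jobs rather than in-memory list operations.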
Big Data Core Techniques for Organizations
There are generally two types of processing in big data, batch processing and real-time processing, chosen based on the domain nature of the organization. The foundation of big data technology is the MapReduce model [5] by Google, built for processing batch workloads of its user data. It is based on a scale-out model over clusters of commodity servers. Later, real-time processing models such as Twitter's Storm, Yahoo's S4, etc. appeared because of the near-real-time, real-time and stream processing requirements of organizations.
The core of the MapReduce model is the power of the "divide and conquer" method: jobs are distributed over clusters of commodity servers in two steps (Map and Reduce) [5]. Jobs are divided and distributed over the clusters, and the completed jobs (intermediate results) from the Map phase are sent to the Reduce phase to perform the required operations. In the MapReduce paradigm, the Map function performs filtering and sorting, and the Reduce function carries out grouping and aggregation operations. There are many implementations of the MapReduce algorithm, both open source and proprietary. Among the open source frameworks, the most prominent is "Hadoop", with two main components―the "MapReduce Engine" and the "Hadoop Distributed File System (HDFS)". In an HDFS cluster, files are broken into blocks that are stored in the DataNodes; the NameNode maintains metadata for these file blocks and keeps track of DataNode operations [6]. MapReduce provides scalability through distributed execution and reliability by reassigning failed jobs [7]. Beyond the MapReduce Engine and HDFS, Hadoop has a wide ecosystem of tools, such as Hive for warehousing, Pig for queries, YARN for resource management, Sqoop for data transfer, Zookeeper for coordination, and many others. The Hadoop ecosystem will continue to grow as new big data systems appear according to the needs of different organizations.
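The division of labor between Map (filtering and sorting) and Reduce (grouping and aggregation) can be shown with the classic word-count example. This is a minimal single-process sketch of the paradigm from [5]; a real Hadoop job would run the same two functions distributed across a cluster, with the framework handling the shuffle:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle/sort: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all counts observed for one word.
    return (key, sum(values))

documents = ["big data for organizations", "big data systems"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # 2
```

Because each Map call sees only one document and each Reduce call sees only one key's values, both phases parallelize naturally over commodity servers, which is exactly the scale-out property the model is built on.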
Organizations with an interactive nature and strict response-time requirements need real-time processing.
Although MapReduce is the dominant batch processing model, real-time processing models are still competing with each other, each with its own competitive advantages.
"Storm" is a prominent big data technology for real-time processing; its most famous user is Twitter. Unlike MapReduce, Storm uses a topology, which is a graph of spouts and bolts connected by stream groupings. Storm consumes data streams, which are unbounded sequences of tuples, splits the consumed streams, and processes the split data streams. The processed data stream is consumed again, and this process repeats until the operation is halted by the user. A spout acts as a source of streams in a topology, and a bolt consumes streams and produces new streams; both execute in parallel [8].

There are other real-time processing tools for big data, such as Yahoo's S4 (Simple Scalable Streaming System), which is based on a combination of the actor model and the MapReduce model. S4 works with Processing Elements (PEs) that consume keyed data events. Messages are transmitted between PEs in the form of data events. Each PE's state is inaccessible to other PEs, and event emission and consumption is the only mode of interaction between PEs. Processing Nodes (PNs) are the logical hosts of PEs and are responsible for listening to events, executing operations on the incoming events, dispatching events with the assistance of the communication layer, and emitting output events [9]. There is no clear winner among stream processing models, and organizations can use whichever data model is consistent with their work.
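The spout-and-bolt structure of a Storm topology can be sketched with ordinary generators: a spout sources tuples, and each bolt consumes one stream and produces another. This is only a single-process analogy with invented class names; real Storm runs each component in parallel across a cluster and wires them with stream groupings:

```python
from collections import Counter

class SentenceSpout:
    # Spout: acts as a source of the stream (here, a fixed list of sentences).
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    # Bolt: consumes the sentence stream and emits a new stream of words.
    def process(self, stream):
        for sentence in stream:
            for word in sentence.split():
                yield word

class CountBolt:
    # Terminal bolt: consumes the word stream and keeps running counts,
    # updating state as each tuple arrives rather than waiting for a batch.
    def __init__(self):
        self.counts = Counter()

    def process(self, stream):
        for word in stream:
            self.counts[word] += 1

spout = SentenceSpout(["storm processes streams", "streams of tuples"])
counter = CountBolt()
counter.process(SplitBolt().process(spout.emit()))
print(counter.counts["streams"])  # 2
```

The contrast with the batch model is that results here are updated per tuple as the stream flows through the topology, instead of being produced once after a whole dataset has been reduced.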
Regardless of batch or real-time processing, there are many open source and proprietary software frameworks for big data. Open source big data frameworks include Hadoop, HPCC (High-Performance Computing Cluster), etc. [6]. Many other big data tools, such as IBM BigInsights, Accumulo, Microsoft Azure, etc., have been successfully used in many business areas of different organizations. Big data tools and libraries are now also available in other languages, such as Python and R, for many different kinds of organizations.
About the Authors:
Pwint Phyu Khine, School of Information and Communication Engineering, University of Science and Technology Beijing (USTB), Beijing, China.
Wang Zhao Shun, Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China.
Publication Details:
Khine, P. and Shun, W. (2017) Big Data for Organizations: A Review. Journal of Computer and Communications, 5, 40-48. doi: 10.4236/jcc.2017.53005.
Copyright © 2017 Pwint Phyu Khine, Wang Zhao Shun et al. This article is an excerpt from an open access article distributed under the Creative Commons Attribution License.
References:
[1] Laudon, K.C. and Laudon, J.P. (2012) Management Information Systems: Managing the Digital Firm. 13th Edition, Pearson Education, US.
[2] Manyika, J., et al. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity. San Francisco, McKinsey Global Institute, CA, USA.
[3] Hu, H., Wen, Y.G., Chua, T.-S. and Li, X.L. (2014) Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-687. https://doi.org/10.1109/ACCESS.2014.2332453
[4] Mousanif, H., Sabah, H., Douiji, Y. and Sayad, Y.O. (2014) From Big Data to Big Projects: A Step-by-Step Roadmap. International Conference on Future Internet of Things and Cloud, 373-378.
[5] Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51, 107-113. https://doi.org/10.1145/1327452.1327492
[6] Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. International Conference on Collaboration Technologies and Systems (CTS), 42-47.
[7] Grolinger, K., Hayes, M., Higashino, W.A., L’Heureux, A., Allison, D.S. and Capretz, M.A.M. (2014) Challenges of MapReduce in Big Data. IEEE 10th World Congress on Services, 182-189.
[8] Storm Project. http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
[9] Neumeyer, L., Robbins, B., Nair, A. and Kesari, A. (2010) S4: Distributed Stream Computing Platform. 2010 IEEE International Conference on Data Mining Workshops (ICDMW). https://doi.org/10.1109/ICDMW.2010.172