Big data refers to very large volumes of data, both structured and unstructured. It is generated constantly by everything around us: every digital process and social media exchange produces it, and systems, sensors and mobile devices transmit it. Big data arrives from many sources at high velocity, in high volume and in great variety.
Big data is commonly defined by three V’s.
Volume – Data is growing at an ever-increasing rate, and it is more than just text files; it arrives in many different formats. It is now common for organizations to hold terabytes or even petabytes of data. Big data systems need to process high volumes of low-density, unstructured data – that is, data of unknown value, such as Twitter feeds, Facebook posts, web page clickstreams, mobile app activity, network traffic and sensor readings.
Velocity – the speed at which data is generated, driven largely by activity on social media sites. There was a time when yesterday’s data felt recent, like a newspaper; news channels and radio changed how quickly we receive news, and social media has raised the bar again – even a message a few minutes old may no longer interest users. Many internet-enabled applications operate in real time or near real time. For example, e-commerce applications use a consumer’s mobile device location and personal preferences to make time-sensitive marketing offers, and we can watch news live on a mobile device as it happens.
Variety – Data no longer arrives only as text files. It comes as video, SMS, social media posts, audio, web logs, GPS coordinates, sensor readings and more. We can no longer control the structure of incoming data; it arrives structured, unstructured, or a mix of both. Once understood, unstructured data has the same requirements as structured data, such as summarization, lineage, auditability and privacy. Further complexity arises when data from a known source changes format without notice.
Many applications and frameworks have been introduced to deal with big data; Hadoop is one of the most widely used.
Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It enables distributed processing of large structured, semi-structured and unstructured data sets across clusters of servers, and it is designed to scale up from a single server to thousands of machines with a very high degree of fault tolerance. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. A file must be loaded into HDFS before Hadoop can process it; on loading, it is divided into chunks called blocks, each 64 MB in size by default. Each block is then replicated two more times so that data survives node failures.
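The block-splitting and replication described above can be sketched with some simple arithmetic. This is an illustrative calculation only, using the 64 MB default block size and 3x total replication from the text; the class and method names are made up for this example, not part of Hadoop’s API.

```java
// Sketch of how HDFS splits a file into fixed-size blocks and replicates them.
// BLOCK_SIZE and REPLICATION match the defaults described above; everything
// else here is illustrative, not actual Hadoop code.
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default block size
    static final int REPLICATION = 3;                 // original block + 2 replicas

    // Number of blocks a file occupies (the last block may be partially filled).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Total raw cluster storage consumed once every block is replicated.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long file = 200L * 1024 * 1024; // a 200 MB file
        // 200 MB / 64 MB = 3 full blocks plus 1 partial block
        System.out.println(blockCount(file) + " blocks"); // prints "4 blocks"
    }
}
```

Note that a 200 MB file still occupies four block slots even though the last block is only partially full, and with three-way replication it consumes roughly 600 MB of raw cluster storage.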
MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters in a reliable, fault-tolerant manner. Instead of processing one big file, the input data is divided into small pieces that are processed simultaneously. The framework takes care of scheduling tasks, monitoring them and re-executing any that fail.
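The map/shuffle/reduce phases described above can be illustrated with a word count, the classic MapReduce example. The sketch below is plain Java using streams, not the actual Hadoop API: it shows the model (map each record to key–value pairs, group by key, reduce each group), whereas real Hadoop jobs run these same phases distributed across a cluster.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the MapReduce model (word count), not Hadoop code.
// Hadoop distributes the same three phases across many machines.
public class MapReduceSketch {
    // Map phase: each input line is turned into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce phases: group the pairs by key, then sum each group.
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                    .flatMap(MapReduceSketch::map)
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                             Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(List.of("big data", "big clusters"));
        System.out.println(counts.get("big")); // prints 2
    }
}
```

Because each (word, 1) pair is independent, the map phase can run on many machines at once, and the reduce phase only needs all pairs for a given key to end up on the same machine – which is exactly what Hadoop’s shuffle step arranges.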
A number of tools in the Hadoop ecosystem build on these foundations:
- Hive – turns SQL into MapReduce code
- Pig – an alternative that lets you write code in a simple scripting language, which is then turned into actual MapReduce jobs and run on the cluster
- Impala – very fast, interactive SQL queries that access data directly on HDFS
- HBase – a real-time database built on top of Hadoop
- Sqoop – pulls data from relational databases such as SQL Server into HDFS as delimited files, ready to be processed on the cluster
- Flume – streams data from multiple sources into Hadoop for analysis
- Hue – a graphical front-end for the Hadoop ecosystem
- Mahout – a machine learning library
- Oozie – a Java web application used to schedule Hadoop jobs