Let us first understand what Big Data actually is.
Big data refers to huge amounts of data, much of it generated in real time. Real-time data is hard to maintain and examine because large amounts of new data are generated every minute. That is why big data brings some problems:
- VOLUME: Volume refers to the large amount of data, which cannot reside on a single storage device.
- VELOCITY: Velocity is the speed at which data continuously flows in and out. Reading or writing a large volume of data through a single storage device is slow, so we need a way to keep the data flowing at high speed.
- COST: Storing a large amount of data on one device means buying storage devices of very high capacity, which is very costly. This is a major challenge of big data.
There is one way to overcome the three problems above: distributed storage. This means we connect multiple storage devices together, split our data, and store the pieces across these devices.
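The idea of splitting data and spreading it over several devices can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the node names, the tiny block size, and the round-robin placement are made up for the example, and real systems such as Hadoop use much larger blocks and also keep replicas.

```python
from collections import defaultdict


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(blocks: list[bytes], nodes: list[str]) -> dict:
    """Assign blocks to nodes round-robin (an assumed, simplistic policy).

    Returns a mapping {node: [(block_index, block), ...]}.
    """
    placement = defaultdict(list)
    for index, block in enumerate(blocks):
        node = nodes[index % len(nodes)]
        placement[node].append((index, block))
    return dict(placement)


def reassemble(placement: dict) -> bytes:
    """Collect blocks from every node and join them in index order."""
    indexed = [pair for stored in placement.values() for pair in stored]
    return b"".join(block for _, block in sorted(indexed))


# Hypothetical example: 10 bytes, 4-byte blocks, two storage nodes.
data = b"abcdefghij"
nodes = ["node-a", "node-b"]
blocks = split_into_blocks(data, block_size=4)
placement = distribute(blocks, nodes)

for node, stored in placement.items():
    print(node, stored)
```

No single node holds the whole file, yet `reassemble(placement)` returns the original bytes, which is the basic property a distributed store has to provide.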
Many tools provide distributed storage clusters as a service, such as Hadoop, Ceph, GlusterFS, and Amazon S3. Many well-known large companies use distributed storage with the help of Hadoop, for example:
- Facebook: The world's most popular social networking site uses Hadoop for big data analytics. It has two main clusters: a 1,100-node cluster with 8,800 cores and 12 petabytes of storage, and a 300-node cluster with 2,400 cores and 3 petabytes of storage.
- LinkedIn: A fast-growing professional networking site that helps professionals connect and showcase their skills. It uses the following hardware:
  - 800 Westmere-based HP SL 170x machines with 24 GB RAM and six 2 TB hard disks.
  - 1,900 Westmere-based SuperMicro X8DTT-H machines with 24 GB RAM and six 2 TB hard disks.
  - 1,400 Sandy Bridge-based SuperMicro machines with 32 GB RAM and six 2 TB hard disks.
- Spotify: Spotify is a company that provides music and video streaming services. It uses Apache Hadoop for reporting, analysis, content generation, music recommendations, and data aggregation. It has a 1,650-node cluster with 43,000 cores, 70 TB of RAM, and 65 PB of storage.
- Twitter: Twitter is one of the most popular social networking companies. It provides online news and social networking services in the form of tweets (Twitter messages). It is among the top companies using Apache Hadoop to store and process tweets and other data.
- Yahoo: Yahoo! is an internet services company. It offers web services such as the Yahoo! search engine, Yahoo! Mail, and Yahoo! Directory. It is one of the top organizations using Apache Hadoop and Pig, running them on more than 40,000 computers for ad systems, web search, and scaling tests. Its largest cluster has 4,500 nodes, each with 16 GB RAM and four 1 TB disks.
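To get a rough sense of scale, the raw disk capacity of the LinkedIn hardware listed above can be added up. This is a back-of-the-envelope sketch that assumes each entry is a count of machines carrying six 2 TB disks; it ignores replication and file system overhead, so usable capacity would be lower.

```python
# (number of machines, disks per machine, TB per disk), taken from the list above
LINKEDIN_HARDWARE = {
    "HP SL 170x (Westmere)": (800, 6, 2),
    "SuperMicro X8DTT-H (Westmere)": (1900, 6, 2),
    "SuperMicro (Sandy Bridge)": (1400, 6, 2),
}


def raw_capacity_tb(hardware: dict) -> int:
    """Sum machines * disks * TB-per-disk over all hardware groups."""
    return sum(machines * disks * tb for machines, disks, tb in hardware.values())


total_tb = raw_capacity_tb(LINKEDIN_HARDWARE)
print(f"Raw capacity: {total_tb} TB (about {total_tb / 1000} PB)")
```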
In the same way, many other large companies use these tools for big data analytics.