By Wu Jiang
In September 2014, Alibaba (a Chinese internet giant) was officially listed on the New York Stock Exchange (NYSE: BABA). Its initial public offering (IPO), the largest in history, also marked the internet's evolution into a new era: a big data era that belongs to China's domestic internet enterprises.
The past and present of big data
Big data, or mass data, refers to data sets so large that they cannot be captured, managed and processed into information that humans can interpret within a reasonable amount of time. Compared with analyzing many small, independent data sets one by one, analyzing the same total amount of data as a single combined set yields additional information and reveals relationships within the data. This approach can be applied to forecasting business trends, judging the quality of research, preventing the spread of disease, fighting crime, predicting real-time traffic conditions and more.
Though it may seem remote, big data is in fact closely tied to our daily lives. For example, by analyzing the behaviour of its user population, Douban Music (a service of a Chinese social network) can infer which songs a particular user likes best, and can even infer the user's favourite movies. By pooling and analyzing sales data from its retail stores, Adidas can learn exactly how consumer preferences for its products vary across regional cultures, and so stock its inventory more sensibly. A Chinese dating and matchmaking website is trying to introduce a system that recognizes facial resemblance: from its user data, the company can work out which facial features its users find most attractive and then offer services built on that popularity. Taobao (the biggest C2C shopping website in mainland China) can predict which goods each consumer is likely to be interested in and generate individualized recommendations for each user; this is what most people see in the commodity recommendations in its sidebar. By analyzing classified commodity information with large database models, Taobao can also answer interesting questions that most people would find hard, such as: what is the favourite T-shirt colour among 18-year-olds, and how do people in South and North China differ in their preferences for sports beverages?
Analyzing the behaviour of a handful of users yields little value, but when the analysis is carried out on a large enough scale, the trends it reveals support valuable predictions, particularly for business decisions. Take the well-known NongFu Spring (a Chinese producer of drinking water) as an example. In the past, the company wanted market data to help it answer questions such as: How should the bottles be stacked to promote sales? Which age group spends the most time in front of the display? How much do they buy each time? How does purchasing behaviour change with the temperature? How does a competitor's new packaging affect the company's own sales? Simple as they seem, these questions are hard to answer convincingly.
Answering these questions requires collecting a great deal of data. NongFu Spring's salesmen visit local supermarkets and take ten pictures at each one: how the bottles are stacked, how their position has changed, how high the stack is, and so on. Each salesman covers 15 locations a day and uploads 150 pictures, producing about 10 MB of data, which is not a large figure. But with 10,000 salesmen across China, the total comes to about 100 GB a day, or 3 TB each month. Simple as this data may seem, it could not be analyzed without the support of big data technology.
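The figures in this example can be checked with a short back-of-the-envelope sketch. The decimal units (1 GB = 1,000 MB) and the 30-day month are assumptions made here for illustration; the article itself gives only the rounded totals.

```python
# Back-of-the-envelope check of the data volumes quoted in the example.
# Assumptions (not stated in the article): decimal units and a 30-day month.

PICTURES_PER_PLACE = 10        # pictures taken at each location
PLACES_PER_DAY = 15            # locations one salesman visits per day
MB_PER_SALESMAN_PER_DAY = 10   # approximate daily upload per salesman
SALESMEN = 10_000
DAYS_PER_MONTH = 30            # assumed

pictures_per_day = PICTURES_PER_PLACE * PLACES_PER_DAY
daily_total_gb = MB_PER_SALESMAN_PER_DAY * SALESMEN / 1_000
monthly_total_tb = daily_total_gb * DAYS_PER_MONTH / 1_000

print(f"{pictures_per_day} pictures per salesman per day")
print(f"{daily_total_gb:.0f} GB per day across all salesmen")
print(f"{monthly_total_tb:.0f} TB per month")
```

A modest 10 MB per person becomes terabytes per month once it is multiplied by the size of the sales force, which is the point of the example.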
Someone at Google once pointed out: “What really matters is not what we can do, but at what scale we can do it.”
If you only need to analyze 100 lines of data a day, a few sheets of paper and a pen will do. If you want to analyze 100,000 lines, a single modern computer and a purpose-written program are enough. But once the data reaches 1,000,000,000 lines (1 TB), even a powerful server will struggle to meet your needs, especially when you want real-time or near-real-time processing. This is why the field of computing and numerical calculation has seen the rise of distributed computing: a cluster of computers is connected over a network, a job that requires massive calculation is divided into small pieces, each computer in the network processes its own piece, and the uploaded results are combined into a final conclusion. To make full use of distributed computing, however, several problems have to be solved. How should the data be divided? How can the workload be balanced across the computers? How can the individual results be combined efficiently into a final answer? Many computing models and concepts have been designed to solve these problems in both hardware and software. Among the most representative are cloud computing, MapReduce (Hadoop) and virtualization. Yet this may be only the beginning of the computing tide. As Jack Ma has said: “We are moving from an era of information science and technology to an era of data science and technology.”
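The divide, process and combine pattern described here can be illustrated with a minimal single-machine sketch. It is not how Hadoop or any particular framework is implemented: worker threads stand in for the networked computers of a real cluster, and the function names and the T-shirt sales data are hypothetical, chosen only for illustration.

```python
# A single-machine sketch of the split / process / combine pattern.
# Worker threads stand in for the computers of a real cluster; the
# function names and the toy sales data are hypothetical.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_chunks):
    """Divide the records into n_chunks pieces whose sizes differ by at most one."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_colours(chunk):
    """Work done by one 'computer': count the colours in its own chunk."""
    return Counter(chunk)


def combine(partial_counts):
    """Merge the partial results into one final answer."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def distributed_count(records, n_workers=4):
    """Split the records, count each chunk in parallel, then merge."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_colours, chunks))
    return combine(partials)


# Toy data: the colour of each T-shirt sold.
sales = ["white", "black", "white", "red", "black", "white", "blue"]
result = distributed_count(sales, n_workers=3)
print(result.most_common(1))  # the best-selling colour and its count
```

Each of the three questions maps onto one function: `split` decides how the data is divided (here in near-equal pieces, a crude form of load balancing), `count_colours` is the work each machine does on its own piece, and `combine` merges the partial results. Real systems also have to deal with network transfer and machine failures, which this sketch leaves out.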
Mass data and the new occupations of the Internet
To do well with mass data, the first and most vital step is to obtain massive amounts of valuable data, and this is an advantage that most of China's domestic internet enterprises have. China has a large population, a dynamic economy and an enormous number of internet users, and the richness of user-behaviour data depends directly on the richness of these user data resources. Taobao has 300 million registered users, and Tencent's registered users already exceed 1 billion. All this user data is a veritable goldmine.
A new generation of technology is bound to create demand for a new generation of technical talent. In the era of big data, data scientist and data engineer have become two of the hottest occupations in Silicon Valley. Compared with traditional software engineers, data scientists stand between mathematics (statistics) and computer science: their work covers software design and development as well as data modelling and statistical analysis, and they are able to turn data-processing models into workable software solutions. China's domestic internet enterprises likewise attach great importance to building up a reserve of talent in data science, and in the foreseeable future data science practitioners are sure to be in high demand in the job market.