Huge potential benefits may be waiting in the data sitting on your servers, and these data can serve many different purposes. Better data allow better decisions, of course. For instance, banks, insurance firms, and telecom companies already own a large amount of data about their customers. These resources are useful for building a more personal relationship with each customer.
Some organizations already use data from agricultural fields to build complex and customized models based on an extensive number of input variables (e.g., soil characteristics, weather, and plant types) in order to improve crop yields. Airline companies and large hotel chains use dynamic pricing models to improve their yield management. Data are increasingly being referred to as the new “gold mine” of the 21st century.
A few factors underlie the rising prominence of data (and, therefore, data analysis).
Huge volumes of data
Data acquisition has never been easier, thanks to sensors in manufacturing plants and connected objects, as well as data from internet use and web clicks, from credit cards, loyalty cards, and customer relationship management (CRM) databases, and from satellite images, to name just a few examples. Data can easily be stored at costs that are lower than ever before; huge storage capacity is now available on the cloud and elsewhere. The amount of data being collected is not only huge, it is growing very fast—exponentially, in fact.
Unprecedented velocity
Connected devices, like our smartphones, provide data in almost real time, and these data can be processed quickly. It’s now possible to react to any change, almost immediately.
Incredible variety
The data collected aren’t restricted to billing information; every source of data is potentially valuable for a business. Not only is numeric data being collected on a massive scale, but so is unstructured data such as videos and pictures, in a wide variety of situations.
But the explosion of data available to us is prompting every business to wrestle with an extremely complicated problem:
How can we create value from these resources?
Simple methods, such as counting words used in queries submitted to company websites, do provide insight as to the general mood and trend of your customers. Simple statistical correlations are often used by web vendors to suggest a purchase just after buying a product on the web. Very simple descriptive statistics are also useful.
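As a toy illustration of the word-counting idea, here is a minimal Python sketch; the queries are invented for the example, and a real deployment would also filter stop words:

```python
from collections import Counter
import re

# Hypothetical customer queries submitted to a company website
queries = [
    "my delivery is late",
    "late refund request",
    "refund for delayed delivery",
    "how do I track my delivery",
]

# Count word frequencies across all queries: a crude but useful
# indicator of what customers are currently concerned about
words = re.findall(r"[a-z]+", " ".join(queries).lower())
counts = Counter(words)
print(counts.most_common(3))
```

Even this simple tally surfaces "delivery", "late", and "refund" as the dominant themes in the sample queries.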
Imagine what could be achieved with advanced regression models or powerful multivariate statistical techniques, which can be applied easily with statistical software packages like Minitab.
Let’s consider an example of how one company benefited from analyzing a very large database.
In the airline industry, many steps (such as security and safety checks, and cleaning the cabin) are needed before a plane can depart. Because delays negatively affect customer perceptions and productivity, airline companies routinely collect a very large amount of data related to flight delays and times required to perform tasks before departure. Sometimes this information is automatically collected, and sometimes it’s manually recorded.
A major airline company intended to use these data to identify the crucial milestones among a very large number of preparation steps, and to determine which ones often triggered delays in departure times. The company used Minitab’s stepwise regression analysis to quickly focus on the few variables that played a major role among a large number of potential inputs. Many variables turned out to be statistically significant, but two of them (X6 and X10) clearly seemed to make a major contribution.
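Minitab performs stepwise regression interactively. As a rough analogue, the same idea can be sketched in Python with scikit-learn's forward stepwise selector, on simulated data in which only two of twelve milestone durations (columns 5 and 9, standing in for X6 and X10) actually drive the delay:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)

# Simulated pre-departure milestone durations: 12 candidate predictors,
# of which only columns 5 and 9 (our stand-ins for X6 and X10)
# actually influence the departure delay y
n = 500
X = rng.normal(size=(n, 12))
y = 4.0 * X[:, 5] + 3.0 * X[:, 9] + rng.normal(scale=0.5, size=n)

# Forward stepwise selection: add predictors one at a time,
# keeping those that most improve the cross-validated fit
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(np.flatnonzero(selector.get_support()))
```

On this synthetic data, the selector recovers columns 5 and 9, mirroring how stepwise regression isolated X6 and X10 for the airline.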
When huge databases are used, statistical analyses can become overly sensitive and detect even very small differences, because of the large sample size and the resulting power of the analysis. P-values then tend to fall below the usual 0.05 threshold for a large number of predictors, even when the practical effects are negligible.
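A quick simulation makes this large-sample effect concrete: with hundreds of thousands of observations per group, even a practically negligible difference of 0.02 units (against a standard deviation of 1) produces a "significant" p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two groups whose true means differ by a practically negligible 0.02 units
a = rng.normal(loc=0.00, scale=1.0, size=200_000)
b = rng.normal(loc=0.02, scale=1.0, size=200_000)

# With 200,000 observations per group, the two-sample t-test
# flags even this tiny shift as statistically significant
t, p = stats.ttest_ind(a, b)
print(f"p = {p:.3g}")
```

This is why, with very large databases, it pays to look beyond p-values at measures of practical contribution, as the airline did next.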
However, in Minitab, if you click Results in the regression dialog box and select Expanded tables, the contribution from each variable is displayed. When considered together, X6 and X10 contributed more than 80 percent of the overall variability (with by far the largest F values). The contributions from the remaining factors were much smaller. The airline then ran a residual analysis to cross-validate the final model.
In addition, a multivariate technique called principal component analysis (PCA) was performed in Minitab to describe the relationships between the most important predictors and the response. Milestones were expected to be strongly correlated with their subsequent steps.
The graph above is a loading plot from the principal component analysis. It groups variables visually according to their statistical correlations: lines that point in the same direction and lie close to one another indicate variables that are strongly related.
A group of nine variables turned out to be strongly correlated to the most important inputs (X6 and X10) and to the final delay times (Y). Delays at the X6 stage obviously affected the X7 and X8 stages (subsequent operations), and delays from X10 affected the subsequent X11 and X12 operations.
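The grouping described above can be reproduced in miniature with a PCA on simulated milestone times, where two follow-on steps are generated to lag each key milestone. The data and structure here are invented for illustration, using six variables rather than the airline's nine:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Simulated milestone times: X7 and X8 lag behind X6, while X11 and X12
# lag behind X10, so each trio is strongly correlated
n = 300
x6 = rng.normal(size=n)
x10 = rng.normal(size=n)
data = np.column_stack([
    x6,
    x6 + 0.2 * rng.normal(size=n),   # X7 follows X6
    x6 + 0.2 * rng.normal(size=n),   # X8 follows X6
    x10,
    x10 + 0.2 * rng.normal(size=n),  # X11 follows X10
    x10 + 0.2 * rng.normal(size=n),  # X12 follows X10
])

pca = PCA(n_components=2)
pca.fit(data)

# Loadings: on a loading plot, correlated variables point in
# similar directions, so each trio forms a tight cluster
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
for name, row in zip(["X6", "X7", "X8", "X10", "X11", "X12"], loadings):
    print(name, np.round(row, 2))
```

Plotting these loading vectors would show the X6 group and the X10 group as two distinct, roughly orthogonal bundles of lines, just as in the airline's loading plot.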
Conclusion
This analysis provided simple rules that this airline’s crews could follow in order to avoid delays, making passengers’ next flight more pleasant. The airline can repeat this analysis periodically to search for the next most important causes of delays. Such an approach can propel innovation and help organizations replace traditional and intuitive decision-making methods with data-driven ones.
What’s more, using data to improve operations isn’t restricted to the corporate world. Increasingly, public administrations and nongovernmental organizations are making large, open databases easily accessible to communities and to virtually anyone.