The promise and the pitfalls of harnessing big data

Among the useful possibilities lurk some spurious correlations

Maths, stats and econometrics are suddenly sexy subjects – and well paid. Careercast says the best job of 2016 is a data scientist, paying $170,000 a year. It combines information technology, statistical analysis and interdisciplinary skills to interpret trends from data.

Google has shown there's money in mining big data. It became the most valuable company in the world by facilitating a billion data searches a day – and sticking a personalised ad on the results.

And at UNSW, the schools of mathematics and statistics, computer science and engineering, and UNSW Business School will join forces to offer an interdisciplinary degree – a bachelor of data science and decisions – in 2017. It's an expanding field.

"The amount of created data is growing exponentially and is already exceeding the amount of available storage, which creates challenges and also great opportunities," says Valentyn Panchenko, an associate professor of economics at UNSW Business School, and convenor of next month's UNSW Business School Roundtable: Big Data in Business and Research.

"Biggish data has been around for a long time, such as retail scanning data, loyalty programs, bank, tax and health records, but what is different now is the proliferation from new data sources: internet searches, Facebook, Twitter, wearable measurement apps and phone location data. Linking these various data sources becomes possible and opens up new opportunities," Panchenko says.

Information explosion

An awareness of big data dates back to the 1940s when the term 'information explosion' first appeared, but it has been in this century, with the rapid advances in computing power and digital technologies, that it has come into the business mainstream.

There was early optimism that simply matching up the sheer volume of data would throw up explanations in a new era of understanding the world.

"Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all," claimed Wired magazine's Chris Anderson in 2008.

Another notable moment was The New York Times report in 2012 of how the Target department store worked out that a teenager was pregnant before she had told her family. Based on an analysis of her shopping, set against purchasing trends derived from big data, Target sent her maternity catalogues, which prompted her father to complain to the company.

He later famously apologised: "It turns out there's been some activities in my house I haven't been completely aware of. She's due in August."

The report showed how mathematicians, statisticians, predictive analysts, economists, psychologists and neurologists were mining big data on shopping patterns – and exploiting insights such as that habits, rather than conscious decision-making, shape 45% of the choices we make daily.

Google told Nature magazine in 2008 it could predict 'flu trends by processing hundreds of billions of web searches about 'flu across five years. The predictions were initially successful, but in 2013 they were wildly wrong – double the real rate as reported by doctors.

There followed a backlash and warnings about the traps in big data analysis. The Economist labelled this the classic hype cycle, in which a technology's early proponents make grandiose claims but under-deliver, yet the technology eventually transforms the world, as did the internet.

Spurious correlations

The technology is maturing but Panchenko notes there are big data issues with storage/access, privacy, analysis and interpretation.

"The key methodological challenge is how to combine traditional fundamental models, which reflect causal relationships and typically use well-structured data, with methods currently used for big data, such as machine-learning methods, which are great for finding correlation patterns," Panchenko says.

Economists and statisticians know big numbers can throw up some spurious correlations, as a finding posted on The Economist website hilariously illustrates: there is a 95% correlation between per capita cheese consumption and people being strangled in their bedsheets. Yet causal relations, where a change in one variable causes a change in the other, cannot be found in big data without an explanation that can be tested.
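The effect is easy to reproduce. The Python sketch below is an illustration only, not the method behind the cheese example: it generates series of pure random noise (the series counts, the 10 observations per series and the seed are all arbitrary assumptions) and reports the strongest correlation found between any two of them. The more unrelated series there are to compare, the more likely it is that some pair lines up by chance.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Largest |correlation| among all pairs of independent random series."""
    rng = random.Random(seed)
    # Every series is independent noise, so any correlation is spurious.
    series = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]
    return max(abs(pearson(a, b))
               for a, b in itertools.combinations(series, 2))


if __name__ == "__main__":
    # Ten observations per series, e.g. ten years of annual figures.
    for n_series in (5, 50, 200):
        best = strongest_chance_correlation(n_series, n_points=10)
        print(f"{n_series:3d} unrelated series -> strongest |r| = {best:.2f}")
```

Because the same seed is reused, each larger run contains the smaller run's series, so the strongest correlation can only rise as more series are added.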

Management consultants are busy using big data to drum up business. A recent PwC report claimed one-third of Australian businesses are embracing data – compared with one-half internationally. 

The Australian Bureau of Statistics is trying to make more data available for researchers, and government departments are interested in using big data to improve efficiencies, as envisaged by the 2013 Australian Public Service Big Data Strategy.

According to Stuart Black, the data analytics expert at management consultant Deloitte, it is still early days.

"My role is to work with 600 Deloitte partners to put data into their offerings to improve company efficiency, such as with the Australian Food and Grocery Council's index we developed based on the demand for Chep pallets that predicts retail trade trends three months in advance. In the lead-up to Christmas you can see what will be in short supply because of what has been shipped around Australia on Chep pallets," Black says.

Predictive problems

The theoretical problems around using past big data to predict the future concern econometrician Denzil Fiebig, an economics professor at UNSW Business School, who will be speaking at the big data roundtable.

"Companies can do market segment studies to a fine level, and big data is fantastic for that," Fiebig says.

"Yet data mining and machine algorithms learning about interesting patterns in data that happened in the past can strike problems answering 'what if' questions. What if I change some part of the business, such as the way I deal with my customer base, prices, incentives or the way patients deal with the health system, such as by charging them?

"It is much more difficult to infer what will happen then. With machine learning, answers do not fall straight out of the data. Just because you have more data, it does not mean it is better or more suited to answering 'what if' questions."

The US National Research Council agrees, saying that rigorous quantitative and statistical methods are needed to meet the demands of big data.
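Fiebig's 'what if' problem can be illustrated with a small simulation. The Python sketch below is a hypothetical example, not drawn from his research: the hidden `loyalty` variable, the assumed true effect of 1.0 and every other number are invented for illustration. In the simulated historical data, loyal customers both receive bigger discounts and spend more, so a pattern learned from the past overstates what changing the discount would actually do. Only when discounts are assigned at random does the estimate match the assumed effect.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of a unit of discount on spending


def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov / var_x


def estimated_effect(n_customers, randomised, seed=0):
    """Slope of spending on discount in simulated customer data."""
    rng = random.Random(seed)
    discounts, spending = [], []
    for _ in range(n_customers):
        loyalty = rng.gauss(0, 1)  # hidden driver, never recorded
        if randomised:
            # Experiment: discounts handed out regardless of loyalty.
            discount = rng.gauss(0, 1)
        else:
            # Past practice: loyal customers tended to get bigger discounts.
            discount = loyalty + rng.gauss(0, 1)
        spend = TRUE_EFFECT * discount + 3.0 * loyalty + rng.gauss(0, 1)
        discounts.append(discount)
        spending.append(spend)
    return ols_slope(discounts, spending)


if __name__ == "__main__":
    print(f"historical data: {estimated_effect(20_000, randomised=False):.2f}")
    print(f"randomised test: {estimated_effect(20_000, randomised=True):.2f}")
    print(f"true effect:     {TRUE_EFFECT:.2f}")
```

Under these assumptions the slope in the historical data comes out at around 2.5, while the randomised version recovers a value close to the assumed 1.0. Adding more historical customers would not close the gap, which is the point of the warning that more data is not necessarily better suited to 'what if' questions.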

Big data and health

There is a need for standardised measures across data, data integration and reproducible big data research, according to Harvard professor John Quackenbush, speaking at the opening of the UNSW Centre for Big Data Research in Health (CBDRH).

"Everyone talks about evidence-based medicine but most medicine is anecdotal," Quackenbush says. 

"Non-responsive drugs to cancer are prevalent and expensive. The solution is to identify biomarkers that we can associate with response to therapy so as to get the right patient to the right treatment at the right time."

Quackenbush says big data analysis can now compare a healthy state to a disease state – using the model of a complex network that changes. It can involve huge numbers of gene and biological variations: a study of 250,000 individuals discovered 700 genetic variants that explained only one-fifth of the heritability of height.

He sees these big data sets as ripe for integration, from genetics to GPS to general hospitals and X-ray-like images. Falling costs, of gene analysis for example, and rising computer power make collecting, storing, analysing and using big data more feasible.

The Scottish health system is pioneering such a program to collect and integrate big health data to devise cheap and effective treatments.

"If my relative had bowel cancer I would want to know what sort, and whether it responds to therapies," says Andrew Morris, a professor at Edinburgh University, speaking at the CBDRH.

Big data research is looking at 4700 tumours from 21 types of cancer. There appear to be hundreds if not thousands of types of bowel cancer and Morris foresees cooperation to understand the complexities of disease – an international coalition to share genomic and clinical data to devise effective treatments. 

The UNSW Business School Roundtable: Big Data in Business and Research is supported by BusinessThink.