We all thought having more data was better. We were wrong. – Recode

An interesting set of arguments against using big data in all circumstances, and for the value of small, focused data sets:

For years, the mantra in the world of business software and enterprise IT has been “data is the new gold.” The idea was that companies of nearly every shape and size, across every industry imaginable, were essentially sitting on top of buried treasure that was just waiting to be tapped into. All they needed to do was to dig into the correct vein of their business data trove and they would be able to unleash valuable insights that could unlock hidden business opportunities, new sources of revenue, better efficiencies and much more.

Big software companies like IBM, Oracle, SAP and many more all touted these visions of data grandeur, and turned the concept of big data analytics, or just Big Data, into everyday business nomenclature.

Even now, analytics is also playing an important role in the Internet of Things, on both the commercial and industrial side, as well as on the consumer side. On the industrial side, companies are working to mine various datastreams for insights into how to improve their processes, while consumer-focused analytics show up in things like health and fitness data linked to wearables, and will soon be a part of assisted and autonomous driving systems in our cars.

Of course, the everyday reality of these grand ideas hasn’t always lived up to the hype. While there certainly have been many great success stories of companies reducing their costs or figuring out new business models, there are probably an equal (though unreported) number of companies that tried to find the gold in their data — and spent a lot of money doing so — but came up relatively empty.

The truth is, analytics is hard, and there’s no guarantee that analyzing huge chunks of data is going to translate into meaningful insights. Challenges may arise from applying the wrong tools to a given job, not analyzing the right data, or not even really knowing exactly what to look for in the first place. Regardless, it’s becoming clear to many organizations that a decade or more into the “big data” revolution, not everyone is hitting it rich.

Part of the problem is that some of the efforts are simply too big — at several different levels. Sometimes the goals are too grandiose, sometimes the datasets are too large, and sometimes the valuable insights are buried beneath a mound of numbers or other data that just really isn’t that useful. Implicit in the phrase “big data,” as well as the concept of data as gold, is that more is better. But in the case of analytics, a legitimate question worth considering: Is more data really better?

In the world of IoT, for example, many organizations are realizing that doing what I call “little data analytics” is actually much more useful. Instead of trying to mine through large datasets, these organizations are focusing their efforts on a simple stream of sensor-based data or other straightforward data collection work. For the untold number of situations across a range of industries where these kinds of efforts haven’t been done before, the results can be surprisingly useful. In some instances, these projects create nothing more than a single insight into a given process for which companies can quickly adjust — a “one and done” type of effort — but ongoing monitoring of these processes can ensure that the adjustments continue to run efficiently.

Of course, it’s easy to understand why nobody really wants to talk about little data. It’s not exactly a sexy, attention-grabbing topic, and working with it requires much less sophisticated tools — think Excel spreadsheet (or the equivalent) on a PC, for example. The analytical insights from these “little data” efforts are also likely to be relatively simple. However, that doesn’t mean they are less practical and valuable to an organization. In fact, building up a collection of these little data analytics could prove to be exactly what many organizations need. Plus, they’re the kind of results that can help justify the expenses necessary for companies to start investing in IoT efforts.
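
To make the point concrete, here is a minimal Python sketch of the kind of check described above. The file name, column names and drift threshold are hypothetical, and a spreadsheet could do the same job:

```python
# Minimal "little data" check on a single sensor stream.
# Assumptions (hypothetical): a CSV called sensor_readings.csv with
# columns "timestamp" and "temperature_c"; a 2-degree drift threshold.
import pandas as pd

readings = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

baseline = readings["temperature_c"].head(100).mean()    # early readings as the baseline
rolling = readings["temperature_c"].rolling(window=20).mean()

drifted = readings[(rolling - baseline).abs() > 2.0]     # sustained departure from baseline
if drifted.empty:
    print("Process is running within tolerance.")
else:
    print("Process drifted from baseline at", drifted["timestamp"].iloc[0])
```

Nothing here requires a data lake or a specialized analytics platform; one stream and one threshold can surface the single actionable insight the article describes.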

To be fair, not all applications are really suited for little data analytics. Monitoring the real-time performance of a jet engine or even a moving car involves a staggering amount of data that’s going to continue to require the most advanced computing and big data analytics tools available.

But to get more real-world traction for IoT-based efforts, companies may want to change their approach to data analytics efforts and start thinking small.

Source: We all thought having more data was better. We were wrong. – Recode


How the parties collect your personal info — and why Trudeau doesn’t seem to mind: Delacourt

Great piece by Delacourt:

Numbers are definitely in fashion in the new Liberal government at the moment — and not just because the budget is landing next week.

A first-ever session on “behavioural economics” for public servants was filled to capacity last week, according to a Hill Times report. “Combining economics with behavioural psychology,” said PCO spokesperson Raymond Rivet, “this new tool can help governments make services more client-focused, increase uptake of programs, and improve regulatory compliance.”

Better government through behavioural economics — the idea was popularized by the 2009 book Nudge and almost immediately adopted through the establishment of a “nudge unit” by the British government in 2010. Justin Trudeau’s government is already borrowing the concept of “deliverology” from the Brits, so the ‘nudge’ was never going to be far behind. President Barack Obama, Trudeau’s new best friend, also has taken steps to introduce nudge theory to the U.S. government in recent years.

But the real motivation for data-based governance in the Trudeau government may have come from a source much closer to home — the recent election, specifically the Liberals’ extensive use of big data to win 184 seats last fall. Make no mistake: Trudeau’s Liberals may have won the election by promising intangibles like ‘hope’ and ‘change’, but they sealed the deal with a sophisticated data campaign and ground war.

So now that the Liberals have seen how mastery of the numbers can help win elections, we probably shouldn’t be too surprised that they see those same skills as useful for governing as well. Big-data politics is here to stay.

What’s missing from that equation, however — at least on the political side — is privacy protection. Late last week, while everyone’s attention was fixated on Washington, federal Privacy Commissioner Daniel Therrien reminded a Commons committee that all the political parties are amassing data on voters without any laws to guard citizens’ privacy.

“While the Privacy Act is probably not the best instrument to do this, Parliament should also consider regulating the collection, use and disclosure of personal information by political parties,” Therrien told the Commons committee on access to information and privacy.

A little more than a year ago, it seemed that a new Liberal government could be expected to agree with the privacy commissioner.

Recall last year’s conference on “digital governance” in Ottawa; on stage for one panel discussion were key strategists for the three main parties — Tim Powers for the Conservatives, Brad Lavigne for the New Democrats and Gerald Butts for the Liberals. Mr. Butts is, of course, now Trudeau’s principal secretary.

Fielding questions from the audience, the three were asked whether political databases should be subject to Canadian privacy laws. Powers and Lavigne demurred; only Butts seemed to be saying ‘yes’.

Here’s his lengthy quote, which appeared a few weeks later in an iPolitics column by Chris Waddell:

“Let’s not kid ourselves, political parties are public institutions of a sort. They are granted within national or sub-national legislation special status on a whole variety of fronts, whether they be the charitable deduction, the exemption from access to information — all those sorts of things,” Butts told the conference.

“We have created a whole body of law … or maybe we haven’t. Maybe we have just created a hole in our two bodies of law that allow political parties to exist out there in the ether. I think that is increasingly a problem and it is difficult for me to envision a future where it exists for much longer.”

That was a year ago. And unless I missed it, there’s nothing in any of Trudeau’s mandate letters to ministers about new privacy laws for political parties. And without giving away too much about the new chapters of my soon-to-be-re-released book on political marketing, I didn’t get the impression during our recent interview that Prime Minister Trudeau was greatly troubled by the collision between privacy protection and political databases.

It seems odd to me that citizens can get (often appropriately) worked up about “intrusive” government measures, whether it’s the census or the C-51 anti-terrorism law, and yet be mostly indifferent to what the chief electoral officer has called the “Wild West” of political data collection.

Even Conservatives who resented the gun registry didn’t seem to mind that their own party was keeping track of gun owners in its database, so that it could send them specially targeted fundraising messages from time to time. That’s just behavioural economics, applied to the political arena.

So far, British Columbia is the only province to take steps to put political databases in line with privacy protection. The provincial chief of elections in B.C., Keith Archer, notified political parties that they would not get access to the voters’ list — the raw material of any political database — if they failed to comply with privacy laws.

That step could — and should — be implemented in Ottawa, too. We’re in the era of big-data politics and behavioural-insight governance, and Canadians are entitled to some accountability about the data the governing party is collecting and using on them.

Not so long ago, one of Trudeau’s most senior advisers agreed with that idea. Maybe all it takes is a little nudge.

Source: How the parties collect your personal info — and why Trudeau doesn’t seem to mind – iPolitics

Social Assistance Receipt Among Refugee Claimants in Canada: Evidence from Linked Administrative Data Files

A good illustration of the benefits of linking administrative and economic data for evidence-based policy making. The analysis is a bit dry, but it essentially shows that the share of claimants receiving social assistance declines over time yet remains well above the Canadian average:

Focusing on the middle estimate [which excluded non-linked files], the receipt of SA in year t+1 among the 2005-to-2010 claimant cohorts generally ranged between 80% and 90% across family types, with rates highest among lone mothers and couples with more than two children. Similarly, the incidence of SA receipt generally ranged from about 80% to 90% across families in which the oldest member was between 19 to 24 and 55 to 64 years of age. Across provinces, the incidence of SA receipt in year t+1 was generally highest in Quebec, at over 85%, and lowest in Alberta, at under 60%.

SA receipt varied considerably across country of citizenship. Refugee claimants from countries such as Afghanistan, Colombia, the Democratic Republic of Congo, Eritrea, and Somalia all had relatively high SA rates (close to or above 90%) throughout most of the study period, while rates were lower among refugee claimants from Bangladesh, Haiti, India, and Jamaica (generally below 80%).

The rates of SA receipt tended to decline sharply in the years following the start of the refugee claim. Between years t+1 and t+2, rates fell by about 20 percentage points among most claimant cohorts, declining a further 15 percentage points between t+2 and t+3, and 10 percentage points between t+3 and t+4. By t+4, between 25% and 40% of refugee claimants received SA. However, it is important to recall that these figures pertain to the diminishing group of refugee claimants whose claims remained open up to that year. These figures are also well above the Canadian average of about 8%.

Among refugee claimant families that received SA in year t+1, the average total family income typically ranged from about $19,000 to $22,000, with SA benefits accounting for $8,000 to $11,000—or about 40% to 48%—of that total.

In aggregate terms, SA income paid to all recipients in Canada totaled $10 billion to $13 billion in most years. Given their relatively small size as a group, the dollar amount of SA paid to refugee claimant families amounted to between 1.9% and 4.4% of that total, depending on the year and on the treatment of unlinked cases.
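
Taking the quoted figures at face value, that share works out to roughly

$$0.019 \times \$10\text{B} \approx \$0.2\text{B} \quad\text{to}\quad 0.044 \times \$13\text{B} \approx \$0.6\text{B},$$

that is, on the order of a few hundred million dollars a year against a total social assistance bill of $10 billion to $13 billion.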

Source: Social Assistance Receipt Among Refugee Claimants in Canada: Evidence from Linked Administrative Data Files

Turning to Big, Big Data to See What Ails the World

Good examples of how big data can help identify the most important issues, and of the consequent shift in focus from death to disability:

The disconnect between what we think causes the most suffering and what actually does persists today. It is partly a function of success. Diarrhea, pneumonia and childbirth deaths have greatly declined, and deaths from malaria and AIDS have fallen, although far less dramatically. (The charts here show the stunning improvement in health around the world. And here are similar charts tracking progress in hunger, poverty and violence — a big picture that’s an important counterpoint to the constant barrage of negative world news.) This success is partly due to changes made because of the first Global Burden reports.

The downside is that longer lives mean people are living long enough to develop diabetes and Alzheimer’s. “What decline we’re seeing from communicable diseases, we’re seeing a compensatory increase from diabetes,” Murray said. And neurological diseases such as Alzheimer’s now account for twice as many years lived with disability as cardiovascular and circulatory diseases together, Smith writes.

This is not simply because people are living longer. It’s also a function of worsening diet everywhere, as poor societies adopt the processed foods found in rich ones.

The most surprising information, though, came not in measuring deaths, but disability. “Major depression caused more total health loss in 2010 than tuberculosis,” Smith writes. Neck pain caused more health loss than any kind of cancer, and osteoarthritis caused more than natural disasters. For other findings that may surprise you, see the quiz.
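
For context on what “health loss” and “years lived with disability” mean in these studies: the Global Burden of Disease work summarizes mortality and morbidity in a single unit, the disability-adjusted life year. Stated here in its standard simplified form, not as anything specific to Smith’s article:

$$\text{DALY} = \underbrace{N \times L}_{\text{years of life lost (YLL)}} \;+\; \underbrace{P \times DW}_{\text{years lived with disability (YLD)}}$$

where $N$ is the number of deaths, $L$ the standard life expectancy at the age of death, $P$ the number of prevalent cases, and $DW$ a disability weight between 0 (full health) and 1 (equivalent to death). That accounting is why a common, rarely fatal condition like depression or neck pain can outrank tuberculosis or cancer in total health loss.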

The report is a giant compilation of “who knew?”

Based on this information, countries and international organizations have been able to change how they spend their health resources, and some ambitious countries have done their own national Burden of Disease studies.

Iran, writes Smith, found that traffic injury was its leading preventable cause of health loss in 2003, and put money into building new roads and retraining police. It also targeted two other big problems its study found: suicide and heart disease.

Australia, responding to the high impact of depression, began offering cost-free short-term depression therapy.

Mexico was one of the countries making the most use of Global Burden of Disease data, after Julio Frenk became health minister in 2000. Frenk had been Murray’s boss at the W.H.O., and a participant in Murray’s work. He found that Mexico’s health system was targeting the communicable diseases that predominated in 1950, not what currently ailed Mexicans. In response, Frenk established universal health insurance (before that, 50 million were uninsured) and set coverage according to the burden of disease.

The program covered emergency care for car accidents, treatment of mental illness, cataracts, and breast and cervical cancer — all of which had been uncovered, even for people with insurance. “You want to cover those interactions that give you the highest gain,” he said.

Murray and company have now branched out beyond diagnosis to measuring treatment: How many people really have access to programs like anti-malaria bed nets or contraception? How much is being spent and what does it buy? Where are the most useful points of intervention? Meanwhile, data from the Global Burden reports is seeping further into health policy decisions around the world — data that saves suffering and money and lives.

via Turning to Big, Big Data to See What Ails the World – NYTimes.com.

Research based on social media data can contain hidden biases that ‘misrepresent real world,’ critics say

Good article on some of the limits in using social media for research, as compared to IRL (In Real Life):

One is ensuring a representative sample, a problem that is sometimes, but not always, solved by ever greater numbers. Another is that few studies try to “disentangle the human from the platform,” to distinguish the user’s motives from what the media are enabling and encouraging him to do.

Another is that data can be distorted by processes not designed primarily for research. Google, for example, stores only the search terms used after auto-completion, not the text the user actually typed. Another is simply that many social media are largely populated by non-human robots, which mimic the behaviour of real people.

Even the cultural preference in academia for “positive results” can conceal the prevalence of null findings, the authors write.

“The biases and issues highlighted above will not affect all research in the same way,” the authors write. “[But] they share in common the need for increased awareness of what is actually being analyzed when working with social media data.”

Research based on social media data can contain hidden biases that ‘misrepresent real world,’ critics say

9 Ugly Lessons About Sex From Big Data | TIME

Interesting example of big data and some reminders that we are not yet living in a post-racial society:

5. According to Rudder’s research, Asian men are the least desirable racial group to women… On OkCupid, users can rate each other on a 1 to 5 scale. While Asian women are more likely to give Asian men higher ratings, women of other races—black, Latina, white—give Asian men a rating between 1 and 2 stars less than what they usually rate men. Black and Latin men face similar discrimination from women of different respective races, while white men’s ratings remain mostly high among women of all races.

6. …And black women are the least desirable racial group to men. Pretty much the same story. Asian, Latin and white men tend to give black women 1 to 1.5 stars less, while black men’s ratings of black women are more consistent with their ratings of all races of women. But women who are Asian and Latina receive higher ratings from all men—in some cases, even more so than white women.

8. Your Facebook Likes can reveal your gender, race, sexuality and political views. A group of UK researchers found that based on someone’s Facebook Likes alone, they can tell if a user is gay or straight with 88% accuracy; lesbian or straight, 75%; white or black, 95%; man or woman, 93%; Democrat or Republican, 85%.
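
A rough sense of how such predictions work: the researchers’ approach (reportedly dimensionality reduction over the user-by-Like matrix followed by regression) boils down to combining many weak signals. Below is a hedged, simplified sketch on synthetic data; the Likes matrix and trait labels are randomly generated, so the printed number means nothing. It only illustrates the shape of the method:

```python
# Illustrative only: predicting a binary trait from "Likes" with
# logistic regression on synthetic data. The real study reportedly
# used SVD of the user-by-Like matrix followed by regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_users, n_likes = 5000, 300

likes = rng.integers(0, 2, size=(n_users, n_likes))   # 1 = user liked page j
weights = rng.normal(size=n_likes)                     # hidden relationship to the trait
trait = (likes @ weights + rng.normal(size=n_users) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(likes, trait, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC on held-out users: {auc:.2f}")
```

No single Like is revealing on its own; thousands of them combined are, which is the whole point of the study.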

9 Ugly Lessons About Sex From Big Data | TIME.

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out

[Image: “Apple Slow iPhones”]

A good illustration of the limits of big data and the risks of confusing correlation with causation. But big data and correlation can help us ask more informed questions:

The important distinction is of intent. In the benign explanation, a slowdown of old phones is not a specific goal, but merely a side effect of optimizing the operating system for newer hardware. Data on search frequency would not allow us to infer intent. No matter how suggestive, this data alone doesn’t allow you to determine conclusively whether my phone is actually slower and, if so, why.

In this way, the whole exercise perfectly encapsulates the advantages and limitations of “big data.” First, 20 years ago, determining whether many people experienced a slowdown would have required an expensive survey to sample just a few hundred consumers. Now, data from Google Trends, if used correctly, allows us to see what hundreds of millions of users are searching for, and, in theory, what they are feeling or thinking. Twitter, Instagram and Facebook all create what is evocatively called the “digital exhaust,” allowing us to uncover macro patterns like this one.

Second, these new kinds of data create an intimacy between the individual and the collective. Even for our most idiosyncratic feelings, such data can help us see that we aren’t alone. In minutes, I could see that many shared my frustration. Even if you’ve never gathered the data yourself, you’ve probably sensed something similar when Google’s autocomplete feature automatically suggests the next few words you are going to type: “Oh, lots of people want to know that, too?”

Finally, we see a big limitation: This data reveals only correlations, not conclusions. We are left with at least two different interpretations of the sudden spike in “iPhone slow” queries, one conspiratorial and one benign. It is tempting to say, “See, this is why big data is useless.” But that is too trite. Correlations are what motivate us to look further. If all that big data does – and it surely does more – is to point out interesting correlations whose fundamental reasons we unpack in other ways, that already has immense value.

And if those correlations allow conspiracy theorists to become that much more smug, that’s a small price to pay.
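
For the curious, here is a hedged sketch of the kind of correlation check the column describes: comparing “iphone slow” search interest in release months against all other months. The file name, column names and release dates are placeholders, not the professor’s actual data or method:

```python
# Sketch: does search interest in "iphone slow" spike around releases?
# Assumes a CSV exported from Google Trends with columns "month"
# (YYYY-MM) and "interest" (0-100); the release dates are illustrative.
import pandas as pd

trends = pd.read_csv("iphone_slow_trends.csv")
trends["month"] = pd.to_datetime(trends["month"]).dt.to_period("M")

releases = pd.PeriodIndex(["2013-09", "2014-09", "2015-09"], freq="M")  # example dates
near_release = trends["month"].isin(releases)

print("Mean interest in release months:   ", trends.loc[near_release, "interest"].mean())
print("Mean interest in all other months: ", trends.loc[~near_release, "interest"].mean())
# A large gap here is only a correlation, not proof of intent: an OS
# optimized for new hardware and a deliberate slowdown would both
# produce the same spike.
```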

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out