The Big Data Question: To Share or Not To Share
Your business's unique data may be your enterprise's greatest asset. Should you share it with partners and vendors, or should you keep it proprietary?
Between the disclosures this year about Facebook's lax data sharing policies and the European Union's GDPR (General Data Protection Regulation), a lot of people are talking about data privacy and consumer rights. How much data should you share as a consumer with companies like Facebook or Google?
But what about businesses?
Enterprise organizations may be dealing with their own data privacy dilemma -- should they share their corporate data with partners or with vendors or with some other organization? If so, what data is OK to share, and what should they keep as private and proprietary? After all, data is the new oil. Amazon, Facebook, and Google have all built multi-billion dollar companies by collecting and leveraging data.
Although it is one of the top assets a company may have, there may be compelling reasons to share data, too. For instance, leading edge cancer centers could potentially speed up and advance society's effort to cure cancer if they shared the data that each of them collected. But sharing it with a competitor could also erode their own competitive edge in the market.
Organizations may also be considering participation in a vendor program such as one under development at SAP called Data Intelligence that will anonymize enterprise customer data and allow those customers to benchmark themselves against the rest of the market.
"People are realizing that the data they have has some value, either for internal purposes or selling to a data partner, and that is leading to more awareness of how they can share data anonymously," Mike Flannagan of SAP told information in an interview earlier this year. He said that different companies are at different levels of maturity in terms of how they think about their data.
Even if you share data that has been anonymized in order to train an algorithm, the question remains whether you are giving away your competitive edge when you share your anonymized data assets. Organizations need to be careful.
"Data is extremely valuable," said Ali Ghodsi, co-founder and CEO of Databricks (the big data platform with its origins offering hosted Spark) and an adjunct professor at the University of California, Berkeley. In Ghodsi's experience, organizations don't want to share their data, but they are willing to sell access to it. For instance, organizations might sell limited access to particular data sets for a finite period of time.
Data aggregators are companies that will create data sets to sell by scraping the web, Ghodsi said.
Then there are older companies that may have years or decades of data that have not been exposed yet to applied AI and machine learning, Ghodsi said, and those companies may hope to use those gigantic data sets to catch up and gain a competitive edge. For instance, any retailer with a loyalty card may have aggregated data over 10 or 20 years.
In Ghodsi's experience, organizations want more data, but they are unwilling to share it, sometimes even within their own organizations. In many organizations, IT controls access to the data and may not always be willing to say yes to all the requests from data scientists in the line-of-business areas. That's among the topics in a December 2017 paper co-authored by Ghodsi and other researchers from UC Berkeley titled "A Berkeley View of Systems Challenges for AI." Ghodsi said that the group is researching ways to incentivize companies to share more of their data. One way is to share the model itself rather than the data -- the machine learning model is a very compact summary of all the data.
"Let's say I have a massive data set of all the cancers in the world," Ghodsi said. "I can create a machine learning model. It can predict the likelihood of cancer for your lungs, their state of health, what's the risk of cancer. But I still haven't shared all the X-Ray data that I have, and I'm not going to share that with you."
That kind of sharing is happening now, Ghodsi said. Google has published many of its models for classifying images.
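The idea of publishing a model instead of the data behind it can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the records are invented, the model is a simple least-squares line rather than an image classifier, and the function names `fit_line` and `predict` are hypothetical.

```python
import json

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

# Hypothetical private records held by the data owner; these are never shared.
private_x = [1.0, 2.0, 3.0, 4.0, 5.0]
private_y = [2.1, 3.9, 6.2, 7.8, 10.1]

# The owner publishes only this compact summary: two numbers as JSON.
shared_model = json.dumps(fit_line(private_x, private_y))

def predict(model_json, x):
    """The recipient predicts from the shared model without seeing any raw record."""
    model = json.loads(model_json)
    return model["slope"] * x + model["intercept"]

print(predict(shared_model, 6.0))
```

The recipient gets the predictive value of five records while holding only a slope and an intercept, which is the sense in which a model acts as a compact summary of the data.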
Another method is transfer learning, which Ghodsi said is enabled in Databricks. It works by combining an existing model with a new model, so that the knowledge captured in the existing model can be reused while the new data adds value on top of it, Ghodsi said.
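One common way to combine an existing model with a new one is to freeze the existing model as a feature extractor and train only a small new layer on the new data. The Python sketch below assumes that setup. It is not the Databricks implementation: the "pretrained" weights, the sample data, and the function names are all made up for illustration.

```python
import math

# Stand-in for an existing, already-trained model: a frozen feature extractor.
# These weights are invented for illustration.
PRETRAINED_WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]

def extract_features(x):
    """Apply the frozen layer (tanh activation); its weights are never updated."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in PRETRAINED_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=500, lr=0.5):
    """Fit only the new logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    feats = [extract_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias

def predict(head, x):
    """Probability of the positive class from frozen features plus the new head."""
    weights, bias = head
    f = extract_features(x)
    return sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)

# Hypothetical new data set with binary labels.
new_samples = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]
new_labels = [1, 1, 0, 0]
head = train_head(new_samples, new_labels)
```

Only the head's three parameters change during training, so a small new data set can be enough to adapt a model that was originally built from someone else's much larger one.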
Another way to share the value of data for research while at the same time preserving your private access to that data is via federated machine learning. This is among the techniques used by Owkin, a startup that is helping cancer research centers accelerate the benefits of their research.
"In federated learning you leave the data on the edge devices," said Friederike Schuur, a data scientist at Cloudera Fast Forward Labs, in an interview with information. A Google blog post explains how it works: "your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud."
In this way, organizations could contribute to the community's research effort, but not give away their data in the process.
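The download, improve, summarize, and average cycle described in the Google post can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Google's or Owkin's implementation: the model is a single weight `w` in `y = w * x`, the three "hospital" data sets are invented, updates are averaged without weighting by data set size, and the encrypted transport is left out entirely.

```python
def local_update(global_w, data, lr=0.05, steps=20):
    """Train on one participant's private data and return only the weight change."""
    w = global_w
    for _ in range(steps):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w - global_w

def federated_round(global_w, participants):
    """Average the participants' updates and apply the result to the shared model."""
    updates = [local_update(global_w, data) for data in participants]
    return global_w + sum(updates) / len(updates)

# Hypothetical private data sets, each roughly following y = 3x.
hospital_a = [(1.0, 3.1), (2.0, 5.9)]
hospital_b = [(1.5, 4.4), (3.0, 9.2)]
hospital_c = [(0.5, 1.6), (2.5, 7.4)]

w = 0.0
for _ in range(10):
    w = federated_round(w, [hospital_a, hospital_b, hospital_c])

print(round(w, 2))
```

The coordinating code in `federated_round` only ever sees the numeric updates returned by `local_update`; the `(x, y)` records stay inside each participant's own function call, which mirrors how the raw data stays on the device or in the research center.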
Innovations such as transfer learning and federated learning could help a great deal in easing concerns healthcare companies have about sharing their data. Matthew Carroll, co-founder and CEO of data governance company Immuta, said he has seen a lot of concern from healthcare companies when it comes to data sharing.
"They are afraid to give their data to everyone else," he said. "They understand that it is untapped wealth, future revenue."
For startups, that fear may translate into other consequences as well. For instance, will investment firms offer funding to companies that share their data if the value is considered to be in the data itself?
Each company will need to make its own careful decision about what to share and how to share it, Schuur said. "If it's cancer research, we should have more data sharing."
But organizations should absolutely be careful about what they share and how they share it.