In recent years, machine learning has beaten two champions on the quiz show Jeopardy! and defeated the world’s top-ranked player of Go, one of the most complex strategy games humankind has ever devised. Its power and reach are hard to doubt, but it’s not all about playing games. Machine learning is fundamentally changing the way we approach computing, and it can pay off big time for your business.
Successfully programming a computer to complete a task used to depend on the developer providing exact, unambiguous instructions. As problems increased in complexity, so did the time and effort needed to write those instructions. In contrast, machine learning today uses algorithms that enable computers to learn autonomously from data and information. Take the use of IBM Watson Visual Recognition to combat drought in California. The cognitive capabilities of the solution made it possible to process, clarify and fuse vast amounts of satellite and aerial imagery of California with other data sets. As a result, the system taught itself to identify elements such as swimming pools, enabling it to generate useful recommendations for water conservation – for example, advising pool owners to drain or fill their pools less frequently.
By giving computers the ability to learn how to solve problems for themselves, machine learning is truly transformative, driving innovation and uncovering insights that are far beyond current human capabilities. However, although many organizations are excited about machine learning, so far, very few have actually embraced it and incorporated it into everyday operations. Today, we’re going to analyze what the obstacles are, and learn what you can do for your business to ensure success with machine learning.
Common pitfalls on the data journey
You might assume that building machine learning models is fraught with complexity and difficulty, but that once the models are in place, machine learning is straightforward. In reality, though, the models are often considered the easy part; most of the big challenges with leveraging machine learning revolve around data.
First, you need to ensure simplified and scalable access to the right data sets. In a previous blog post, we discussed some of the typical challenges organizations are facing around turning data lakes into business value. In particular, a lack of effective data governance and limited findability are preventing knowledge workers from using data lakes to their full potential. Considering the huge range of data types and formats that a data lake may store, it is no surprise that finding the data you need can be a major obstacle.
Even once you’ve identified the right data, you still need to cleanse it to be confident of its quality. Since data is an evolving asset, keeping track of how its quality changes over time can also be tricky. Next, you must carry out feature engineering to transform the data into a structure that best represents the underlying problem to the machine learning models. This is a difficult and expensive process, but you cannot start training your machine learning model until it is complete.
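To make these steps a little more concrete, here is a minimal sketch in Python of cleansing a handful of records, scoring their quality, and engineering features from them. Everything in it is an assumption made for illustration: the records are made up, and the field names (region, pool_area_m2, monthly_water_l) and helper functions are hypothetical rather than part of any IBM product.

```python
# Made-up raw records; the field names are illustrative assumptions.
RAW = [
    {"region": " North ", "pool_area_m2": "32.0", "monthly_water_l": "18000"},
    {"region": "south", "pool_area_m2": "", "monthly_water_l": "9500"},
    {"region": "NORTH", "pool_area_m2": "48.0", "monthly_water_l": "n/a"},
    {"region": "South", "pool_area_m2": "20.0", "monthly_water_l": "12000"},
]


def cleanse(records):
    """Normalize text fields and drop rows whose numeric fields cannot be parsed."""
    clean = []
    for row in records:
        try:
            clean.append({
                "region": row["region"].strip().lower(),
                "pool_area_m2": float(row["pool_area_m2"]),
                "monthly_water_l": float(row["monthly_water_l"]),
            })
        except ValueError:
            # Empty or non-numeric value: skip the row instead of guessing.
            continue
    return clean


def quality_score(raw, clean):
    """Share of raw rows that survived cleansing; log it over time to spot drift."""
    return len(clean) / len(raw) if raw else 0.0


def engineer_features(clean):
    """Turn cleansed rows into numeric feature vectors plus a list of targets."""
    regions = sorted({row["region"] for row in clean})
    areas = [row["pool_area_m2"] for row in clean]
    lo, hi = min(areas), max(areas)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all areas are equal
    features, targets = [], []
    for row in clean:
        # One-hot encode the region, then min-max scale the pool area to [0, 1].
        one_hot = [1.0 if row["region"] == r else 0.0 for r in regions]
        scaled_area = (row["pool_area_m2"] - lo) / span
        features.append(one_hot + [scaled_area])
        targets.append(row["monthly_water_l"])
    return regions, features, targets


clean_rows = cleanse(RAW)
score = quality_score(RAW, clean_rows)
region_names, X, y = engineer_features(clean_rows)
```

Only once X and y exist in this numeric form could a model be trained on them, which is why the preparation work tends to dominate the schedule. Real pipelines would usually repair or impute bad values rather than simply dropping rows.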
Let’s say you’ve navigated all these challenges, and have one or more models in production. That isn’t the end of the story. If you’re going to make decisions based on the results produced by these models, you need to know that they are working properly and are being used by the right people, in the right ways. Most organizations lack this insight, putting them in a risky position and restricting the usefulness of machine learning.
So, what can you do?
Whether you are just starting to build machine learning models, or have stumbled along the way, IBM Data Catalog (currently in beta) could offer the breakthrough you need.
Here are some of the ways it could help:
If you already have a data lake or warehouse, don’t panic: we’re not proposing moving that data anywhere else. IBM Data Catalog will provide an effective management layer on top of your existing data infrastructure that can index all of your data into a single metadata catalog (read this blog post for more information). This catalog can help you find the data you need to build and train models, and keep track of data quality.
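The idea of a metadata layer that sits on top of data you leave in place can be sketched in a few lines of Python. This is not IBM Data Catalog's API. The class names, tags, quality scores and storage locations below are all hypothetical, chosen only to show how a single index can make data findable and keep a quality history without moving the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data set; the data itself stays where it is."""
    name: str
    location: str      # hypothetical pointer into the existing lake or warehouse
    data_format: str
    tags: set = field(default_factory=set)
    quality_history: list = field(default_factory=list)


class MetadataCatalog:
    """A toy in-memory index over catalog entries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, *tags):
        """Return the names of entries that carry every requested tag."""
        wanted = set(tags)
        return sorted(e.name for e in self._entries.values() if wanted <= e.tags)

    def record_quality(self, name, score):
        self._entries[name].quality_history.append(score)

    def latest_quality(self, name):
        history = self._entries[name].quality_history
        return history[-1] if history else None


catalog = MetadataCatalog()
catalog.register(CatalogEntry(
    name="aerial_imagery_2016",
    location="s3://example-lake/imagery/2016/",  # made-up path
    data_format="geotiff",
    tags={"imagery", "california"},
))
catalog.register(CatalogEntry(
    name="water_usage",
    location="warehouse.example/water_usage",    # made-up path
    data_format="table",
    tags={"water", "california"},
))
catalog.record_quality("water_usage", 0.92)
catalog.record_quality("water_usage", 0.88)

matches = catalog.search("california")
```

Because the catalog stores only pointers and descriptions, registering a data set costs almost nothing, and a falling quality history gives an early hint that a source needs attention.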
Once you have found the data you need, IBM Data Catalog provides seamless integration with data exploration and model development tools such as IBM Data Science Experience and IBM Watson Machine Learning. As a result, you can manage the entire model development, training and publication process from end to end within a single coherent environment.
As your machine learning initiatives become more successful, the models you create, as well as the data sets you use to train and test them, will proliferate. IBM Data Catalog can help you keep this growing set of assets well organized, categorized and controlled, so your data scientists and data engineers can understand and reuse each asset appropriately, no matter how many hundreds of data sets and models you create.
Using sophisticated monitoring tools, Data Catalog can track the lineage and usage of each data asset. By providing insight into who is using your machine learning models and data sets, the solution can help data stewards enforce information security policies and prevent violations, while also revealing which models are being used most often by the business.
This increased visibility can help users make data-driven decisions about where to direct their efforts for the best outcomes. Let’s say you’ve developed two machine learning models in parallel; you can quickly discover which is delivering greater value and, if appropriate, focus your energy on that one. In the ‘fail fast, fail often’ world of machine learning, knowing when to shift your resources away from a model is critical.
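As a rough illustration of what usage tracking makes possible, the Python sketch below works from a made-up access log. The user names, model names, permission table and helper functions are all hypothetical, and usage counts are used here only as a simple proxy for the value a model delivers. It does not reflect how IBM Data Catalog implements monitoring.

```python
from collections import Counter, defaultdict

# Made-up access log of (user, asset) pairs.
ACCESS_LOG = [
    ("ana", "churn_model_a"),
    ("ben", "churn_model_a"),
    ("ana", "churn_model_b"),
    ("cy", "churn_model_a"),
    ("ana", "churn_model_a"),
]

# Hypothetical policy: which users may use which asset.
ALLOWED = {
    "churn_model_a": {"ana", "ben"},
    "churn_model_b": {"ana"},
}


def usage_counts(log):
    """Count how often each asset was used."""
    return Counter(asset for _, asset in log)


def users_per_asset(log):
    """Collect the distinct users of each asset."""
    users = defaultdict(set)
    for user, asset in log:
        users[asset].add(user)
    return dict(users)


def unauthorized_access(log, allowed):
    """List accesses by users who are not on the asset's allow list."""
    return [(user, asset) for user, asset in log
            if user not in allowed.get(asset, set())]


counts = usage_counts(ACCESS_LOG)
most_used, uses = counts.most_common(1)[0]
violations = unauthorized_access(ACCESS_LOG, ALLOWED)
```

With this sample log, one of the two parallel models clearly attracts most of the usage, and one access falls outside the policy. In practice you would weigh usage alongside accuracy and business impact before retiring a model.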
Take the plunge
Big corporations have come to regard their data as intellectual property, representing a competitive advantage that they are not willing to share. But data is only worth something if you do something with it, and many companies simply aren’t using it to its full potential.
Right now, the most promising approach is to use your data to build machine learning models, so you don’t get left behind. If you choose the right tools, your first (or next) foray into the world of machine learning doesn’t need to be fraught with risk, and can potentially pay back your investment in spades.
Some traditionalists might shy away from the idea of putting the fate of their companies in the hands of a machine, no matter how smart it seems to be. But by shedding light on the “black box” of machine learning models and giving you control of your data, IBM Data Catalog can help you get better results.
For organizations that are at the very beginning of their data journey, the way forward may seem daunting, and it might look like there is too much ground to make up. But the advantage these organizations hold is that they are starting with a clean slate, and can set out with a clear target of being data-driven. By equipping themselves with the right solutions to support machine learning—such as IBM Data Catalog—these enterprises have the opportunity to catch up and even leapfrog over competitors.