Why Machine Learning Initiatives Fail in Media
Machine learning (ML) is a technological breakthrough that is changing the way we live. It has many practical applications, including the ability to predict future outcomes, optimize and automate processes, support decision-making, and generate content. In many tasks, machine learning algorithms can be trained to achieve human-level or better performance, and in many fields this is already the reality.
What about machine learning initiatives in the media industry? There are many ways it can be useful, including the examples outlined below:
- Content optimization (analyze user behavior and provide analytics indicating which kinds of content are best to produce)
- User experience boosting (smart search, smart feed, recommendation systems)
- Marketing optimization (provide special offers to qualifying clients)
- Content generation (generate scripts, videos, audio, etc.; automated pre- and post-production)
- Security (fraud detection, malware detection)
Since machine learning has so much potential, why are adoption rates so low? Although AI has been widely discussed only in recent years, many of its core approaches were invented over the past 50 years. So why is it just now gaining renewed popularity and focus? Consider that the adoption of machine learning requires three key ingredients:
- Volume and Variety of Data
- Skilled engineers
- Computing resources
In this short article we will look at each of these three ingredients. For media companies, the ability to harness their data, attract and/or hire the right talent, and access the appropriate computing resources will be critical to the successful implementation of the machine learning mandates being driven at the C-level.
Harnessing the Data
In most cases, efficient algorithms require great volumes and variety of data, on the order of terabytes or even petabytes. It is important that the data represent the system from different angles. For example, if you are looking for fraud, your dataset should contain examples both with and without fraudulent actions. And if you have only two classes, it is better to have a roughly even split between them.
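As a minimal sketch of what such a check might look like, assuming the transactions live in a pandas DataFrame with a hypothetical is_fraud label column, you can inspect the class split and downsample the majority class:

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, label: str = "is_fraud",
                    random_state: int = 42) -> pd.DataFrame:
    """Downsample the majority class so both classes are roughly 50/50.

    Assumes a binary label column; the name `is_fraud` is hypothetical.
    """
    counts = df[label].value_counts()
    print("Class distribution before balancing:\n", counts)

    minority_label = counts.idxmin()
    minority = df[df[label] == minority_label]
    majority = df[df[label] != minority_label].sample(
        n=len(minority), random_state=random_state
    )
    # Shuffle the combined, balanced dataset before returning it.
    return pd.concat([minority, majority]).sample(frac=1, random_state=random_state)


# Usage (assuming `transactions` is a DataFrame with an `is_fraud` column):
# balanced = balance_classes(transactions)
```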
Your data should also cover all (or the majority) of the possible values. For example, if an entertainment company is trying to predict whether a film will be interesting to a user, the team will have to carefully consider the types and volume of user profiles and datasets needed to build a statistically significant prediction model. Too little data, or the wrong kind of data, can lead to worthless predictions and result in poor business decisions.
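As a rough illustration, assuming the user profiles sit in a pandas DataFrame with hypothetical columns such as age_group and favorite_genre, a quick coverage report can reveal segments that are missing or underrepresented:

```python
import pandas as pd


def coverage_report(df: pd.DataFrame, columns: list[str]) -> None:
    """Print how well each categorical feature is covered in the dataset.

    Column names here are hypothetical; adapt them to your own schema.
    """
    for col in columns:
        share = df[col].value_counts(normalize=True).round(3)
        print(f"\nCoverage of '{col}':")
        print(share.to_string())


# Usage (assuming `profiles` holds the user data):
# coverage_report(profiles, ["age_group", "favorite_genre", "device_type"])
```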
For example, if you have a temperature sensor that is shaded from the sun from 10 am to 2 pm and from 5 pm to 9 pm, but exposed to direct sunlight in between, the data will suggest that the warmest time of the day is between 2 pm and 5 pm, and that the temperature in this period is much higher than it actually is. Understanding the context of the data is just as important as ensuring that enough data is being captured.
Attracting the Right Talent
If your data is not ready for processing, skilled employees might be able to solve the problem. But in the case of machine learning, which is a relatively new field, there is a shortage of people who understand the work that needs to be done.
Not every machine learning problem can be solved with a universal solution, and even projects that seem similar may each require their own unique approach. Data science projects have many parts that require input from specialists in different areas:
- Data engineers build the pipelines that transform huge amounts of data and deliver it to data scientists.
- Data scientists process the data, extract insights, and build the machine learning models.
- DataOps specialists create and maintain the supporting infrastructure, such as databases and services for processing and transforming the data.
- Software engineers wrap the ML model in a user interface so that users can interact with the system (a simplified example follows below).
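As a minimal sketch of that last step, here is what wrapping a trained model in a small web service might look like, assuming a scikit-learn model saved to a hypothetical model.joblib file and FastAPI as the web framework; the feature names are also illustrative:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to a trained scikit-learn model


class Features(BaseModel):
    # Hypothetical feature names; replace with your own schema.
    watch_time_minutes: float
    genres_watched: int
    account_age_days: int


@app.post("/predict")
def predict(features: Features) -> dict:
    """Return the model's prediction for a single user profile."""
    row = [[features.watch_time_minutes,
            features.genres_watched,
            features.account_age_days]]
    prediction = model.predict(row)[0]
    return {"prediction": float(prediction)}
```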
If you’re using ML, you’ll need data science specialists on hand to support and maintain the system. Keep in mind that most models have an expiration date: users and circumstances change over time, which can lead to a gradual decline in model performance. Data science specialists also need soft skills, since they’re required to communicate with customers on a regular basis, and ideally they will have a deep understanding of business processes, ROI, and the market as a whole.
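A minimal sketch of this kind of maintenance, assuming you log each prediction alongside the outcome that eventually becomes known, is to track a rolling accuracy metric and flag the model when it drops below a threshold (the window size and threshold here are illustrative):

```python
from collections import deque


class PerformanceMonitor:
    """Track rolling accuracy of a deployed model and flag degradation.

    The window size and alert threshold are illustrative values.
    """

    def __init__(self, window: int = 1000, threshold: float = 0.80):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual) -> None:
        """Store whether the latest prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def needs_retraining(self) -> bool:
        """Return True once rolling accuracy falls below the threshold."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough observations yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold


# Usage:
# monitor = PerformanceMonitor()
# monitor.record(predicted_label, true_label)
# if monitor.needs_retraining():
#     print("Model performance has degraded; schedule retraining.")
```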
Accessing Compute Power
There are many effective methods for data cleansing and for handling imbalanced data, but in order to apply them at scale we need powerful infrastructure. Contrary to popular belief, the important aspects of that infrastructure rely less on supercomputers and more on smart strategies: with the right approach, a group of ordinary computers can work together to produce optimal results.
The problem is that systems like these are hard to configure and manage. Network architecture is a major pitfall in itself, and distributed setups also require special algorithms and security protocols.
Some of these problems can be solved by cloud computing, since cloud providers offer ready-to-use resources that can be managed as code (configuration files) or through a UI. The cloud also offers useful integrations and tools for managing big data.
Hadoop and Spark are useful for processing big data across a cluster of computers. HDFS (Hadoop Distributed File System) can store large amounts of data, while NoSQL databases provide fast read-write operations on one or more machines. All of these tools are also available as managed services, which is yet another benefit of a cloud environment.
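As a brief sketch of what this looks like in practice, assuming a running Spark cluster and a hypothetical HDFS path of viewing-event data stored as Parquet, PySpark lets you aggregate very large volumes of records with a few lines of code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to the cluster (the configuration is environment-specific).
spark = SparkSession.builder.appName("viewing-stats").getOrCreate()

# Hypothetical HDFS path and column names; adapt to your own data layout.
events = spark.read.parquet("hdfs:///data/viewing_events/")

# Aggregate total watch time per title across the whole cluster.
watch_time = (
    events
    .groupBy("title_id")
    .agg(F.sum("seconds_watched").alias("total_seconds_watched"))
    .orderBy(F.desc("total_seconds_watched"))
)

watch_time.show(10)
```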
Whenever you have large volumes of streaming data that must be processed in near real time, as is often the case in media, there is inevitable complexity that must be factored into the equation. It can be hard to process high-velocity data (like an online cinema or social network that generates more than 1 GB of information per day). In scenarios like these, where data arrives faster than we are able to process it, we need special messaging software that can act as a queue. Apache Pulsar is a good example of such a system.
When Apache Pulsar is deployed, data is written to a queue and gets processed when your system has the capacity to do so. If a processing unit fails, the message goes unacknowledged and Pulsar keeps redelivering the data until it can be processed. It is now possible to minimize complexity by leveraging a hosted solution for these mission-critical workloads; Pandio is an interesting example. Using a service like Pandio's can result in significant savings, as enterprises avoid the costly investments in infrastructure and human capital needed to run distributed messaging efficiently at scale. In addition, a hosted solution provides the ability to scale up or down quickly according to the requirements of the business.
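As a minimal sketch using the official pulsar-client Python library, assuming a broker at a hypothetical pulsar://localhost:6650 address, a producer writes events to a topic and a consumer acknowledges each message only after successful processing, so unprocessed messages are redelivered:

```python
import pulsar

# Hypothetical broker address; in production this points to your cluster.
client = pulsar.Client("pulsar://localhost:6650")

# Producer side: write an event to a topic.
producer = client.create_producer("viewing-events")
producer.send(b'{"user_id": 42, "title_id": 7, "seconds_watched": 1800}')

# Consumer side: process events at whatever pace the system can sustain.
consumer = client.subscribe("viewing-events", subscription_name="analytics")
msg = consumer.receive()
try:
    process(msg.data())          # `process` is a hypothetical handler function
    consumer.acknowledge(msg)    # only acknowledged messages are removed from the queue
except Exception:
    consumer.negative_acknowledge(msg)  # failed messages will be redelivered

client.close()
```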
Conclusion
Media is a great arena for deploying machine learning, but keep in mind that there are fundamental building blocks that need to be put in place before machine learning can be successfully operationalized. Before applying machine learning, you need to ensure that you collect and store data in a way your data scientists and data engineers can actually use. Building and/or leveraging machine learning algorithms requires experienced data engineers and data scientists to prepare and format the data as well as to optimize and iterate on the models. And finally, machine learning requires massive amounts of data to be ingested, computed, and iterated upon, which calls for sufficient hardware and software infrastructure, whether on-prem or in the cloud. Hosted services such as Pandio can also help you operationalize your machine learning initiatives without worrying about scalability or the complexities of data ingestion and management.