Ollie Glass

Managing data science projects

It can be hard to know how to run data science projects and what to expect, especially if you don’t have a chief data officer in your team. Because I'm often the first data scientist my clients have worked with, I find it helpful to share a framework for how to manage a data project in addition to discussing how data science fits with their business in general. There are four stages: discovery, research, production and ongoing operation. I’ll give examples of the outcomes and timelines of each stage on a small three-month project.

Complete data science projects can take months. The work is complex, demands careful attention and requires gathering data to build an understanding of the project's goal. A large company like eBay or Stitch Fix might dedicate a team of twenty data scientists to a single area for many years, but a smaller project could go from idea to implementation in three to six months.

Discovery

Start by proposing ideas and sanity checking them. While it's true that amazing things can be done with data science, much of the coverage is hyperbolic and doesn't help you think clearly about what's possible. Involve a data scientist from the start to test the feasibility of ideas and get a sense of the costs, time and chance of success. They can also suggest different ideas or approaches to a business challenge you might not have considered.

Data science is inherently risky. Even if a similar company with similar data has shipped a similar feature, that’s no guarantee it’ll work for you or work as well. That said, some features are a natural fit for data science. Say your product lets users send email campaigns. You have records of email send times and open rates, and users manually enter the time to send emails. You can improve their campaign outcomes with basic data-driven decision support. It’s a smart, low-risk feature to build.
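
As a minimal sketch of that kind of decision support, assuming a hypothetical log of past sends with a timestamp and an opened flag, you could simply recommend the hour of day with the best historical open rate:

```python
import pandas as pd

# Hypothetical columns: "sent_at" (timestamp) and "opened" (1 if opened, else 0).
sends = pd.read_csv("email_sends.csv", parse_dates=["sent_at"])

# Open rate for each hour of the day, ignoring hours with too few
# sends to give a stable estimate.
by_hour = sends.groupby(sends["sent_at"].dt.hour)["opened"].agg(["mean", "count"])
reliable = by_hour[by_hour["count"] >= 100]

best_hour = reliable["mean"].idxmax()
print(f"Suggested send hour: {best_hour}:00 "
      f"(historical open rate {reliable['mean'].max():.1%})")
```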

Data science usually involves creating models that make predictions. Models might predict the best time to send a marketing email, or which related products to show to a user. It’s important to work out how accurate a predictive model needs to be for a product feature to be viable. Models will never be entirely accurate, but they can be good enough. If you can build some flexibility into a feature you may be able to find ways to make it work even with less than ideal accuracy.
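
One way to keep that question concrete is to agree a minimum bar with the product team and encode it directly in your evaluation. A minimal sketch, with the threshold and labels purely illustrative:

```python
from sklearn.metrics import accuracy_score

VIABILITY_THRESHOLD = 0.75  # assumed minimum bar agreed with the product team

# Labels and predictions from a held-out test set (illustrative values).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
if accuracy >= VIABILITY_THRESHOLD:
    print(f"Viable: {accuracy:.1%} clears the {VIABILITY_THRESHOLD:.0%} bar")
else:
    print(f"Not viable yet: {accuracy:.1%}; keep researching or add flexibility")
```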

Outcomes: A shortlist of possible features, and a single feature to focus on. A sense of what the finished feature would look like if it was working well.

Risks: Discovering it’s not possible to build your feature, or that it would be prohibitively expensive.

Time: From a few days to a few weeks, usually a week.

Research

Research is required to work out how to create the model and meet the accuracy goal. This involves considering what the feature does, for example, predicting sale prices from photos of shoes, and finding ways to express that as a scientific or machine learning problem. Scientific problems can be tackled in different ways, so you might start by shortlisting methods and considering the tradeoffs between them.

Naturally, these methods all require data. How much data you need depends on the methods being used, with requirements ranging from tens or low hundreds to hundreds of millions of records. Having more data available will usually improve results, so if you don’t already have much data collected it can be worth finding new ways to capture it. Approaches include adding “data exhausts” to existing processes in your product, buying third-party data sets, adding different but related proxy data to your own, or creating new synthetic data.

Sometimes it can help to add more detail to data, especially if it wasn’t captured with data science or machine learning in mind. Running a data enrichment process with your team is good for building small data sets or for enrichment that requires expert knowledge. Third party services like Amazon’s Mechanical Turk or Clickworker can be used for larger volumes.

When the data is ready and the approach is clear, it’s best to start by trying a simple modelling technique and measuring the results. Research needs to be communicated to prevent data science work from becoming siloed or opaque, and this can be challenging. Modelling techniques may be complex and hard to describe. Instead of a single measure of accuracy, there may be several factors which have to be traded off against each other, and stakeholders may need to work with new concepts in order to understand and weigh in on the decisions being made.
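
For example, a first pass might be nothing more than a logistic regression baseline in scikit-learn, giving you a measured accuracy to improve on. A sketch for the email example, with the file and feature names assumed for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical features and label for the send-time example.
data = pd.read_csv("campaign_history.csv")
X = data[["send_hour", "day_of_week", "list_size"]]
y = data["opened"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The simplest credible baseline; measure it before reaching for anything fancier.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.1%}")
```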

As with any project, it helps to communicate the work in different ways to different audiences in your company. General team and company-wide presentations and discussions can build buy-in and engagement in the project, and don’t usually need to go into all of the details. Sharing the specifics with key stakeholders is usually invaluable for surfacing business insights and requirements that can guide the direction of the research.

The scientific method involves rounds of trial and error, revision and improvement. Progress can be uneven, alternating between plateaus, where new approaches are tested with no apparent change, and rapid improvements when a new method bears fruit. Over time there should be a gradual increase in accuracy and performance as the approach is refined, but it’s important to know that there’s no guarantee research will deliver the results required in a reasonable timeframe or at a reasonable cost. These risks should have been raised and considered in the discovery stage.

You might like to set a total research budget and stop if it’s reached or if interim results are good enough. Usually it’s best to take on projects where you’re at least reasonably confident the research will bear fruit, or where the risks and costs are clearly understood if you’re choosing to invest in new experiments.

Outcomes: A data science methodology that produces the outputs you need at an acceptable accuracy level. You and your team understand the general approach and have agreed any important trade-offs.

Risks: Not achieving a minimum accuracy goal within the research time or budget. Discovering an approach that can’t be scaled to a production system.

Time: From a few weeks to a few months, usually a month.

Production

We’re ready to move the data preparation and modelling techniques into production. This is the most straightforward part of the process. Software engineering is not trivial or without its own risks, but the work of building and shipping features is not unique to data science, and the risks and requirements are well known. Most of the challenges you’ll meet here have well-known solutions, whether that’s working with large amounts of data, meeting low-latency requirements, or managing complex data transformations.

That said, there’s no single common approach for moving data science from research into production, nor is there a widely adopted production framework. Ruby on Rails transformed web development and became something close to an industry standard; React and Redux are doing something similar for front-end development. But data science has only a few proprietary third-party services, nothing standard or open source.

Most projects require similar technical components: some kind of extract, transform and load (ETL) data pipeline that moves data from your product to a data science service and prepares it for use, model building and validation, deployment of models to a live system, and ongoing monitoring. To prevent these components from becoming tangled or overly complicated, they should be carefully designed and integrated with your product and deployment environment.
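
A minimal sketch of how those components might hang together, with every name hypothetical; a real system would split these across services and schedules:

```python
def extract_and_transform(source_uri: str):
    """ETL: pull raw product data and prepare features for modelling."""
    ...

def train_and_validate(features):
    """Fit a model and check it clears the agreed accuracy bar."""
    ...

def deploy(model):
    """Publish the validated model to the live prediction service."""
    ...

def monitor():
    """Track live accuracy and latency; alert if they drift out of range."""
    ...

if __name__ == "__main__":
    features = extract_and_transform("s3://product-events/")
    model = train_and_validate(features)
    deploy(model)
    monitor()
```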

There are excellent libraries like Keras, TensorFlow, pandas, scikit-learn and NumPy for the most common and critical tasks, used and sponsored by companies like Bloomberg, Spotify and Booking.com. Dedicated “big data” tools like Hadoop and Spark are not needed on most projects. Instead, AWS and Google Cloud provide excellent services for working with data at scale and are straightforward to integrate with most existing technology stacks. Your engineering team can ensure the chosen services work well with your other systems.

Outcomes: A data product feature deployed to production.

Risks: Usual engineering risks of time and budget overruns. Producing complex software that’s difficult to maintain.

Time: About a month.

Operation

It’s important to monitor that features are operating properly in production. Uptime and general status of a data science feature can usually be overseen with the monitoring services used in other parts of a technology stack. To ensure a fast and straightforward technical handover, involve the engineers in the design and build of the system, and DevOps or SRE teams in the design and operation of monitoring.

To be certain that a system is working well in a live setting, it might be necessary to monitor model accuracy on live data. If performance goes outside an expected range, you may want alerts to fire and a fallback system to kick in.
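
A sketch of that kind of check, assuming you can periodically compare recent predictions against observed outcomes; the threshold, alert and fallback hooks are all illustrative:

```python
from sklearn.metrics import accuracy_score

MINIMUM_LIVE_ACCURACY = 0.70  # assumed lower bound of the expected range

def check_live_accuracy(recent_true, recent_pred, alert, use_fallback):
    """Compare recent predictions with observed outcomes and react."""
    accuracy = accuracy_score(recent_true, recent_pred)
    if accuracy < MINIMUM_LIVE_ACCURACY:
        alert(f"Live accuracy {accuracy:.1%} below expected range")
        use_fallback()  # e.g. revert to a sensible default send time
    return accuracy

# Example wiring: log-based alerting and a no-op fallback.
check_live_accuracy([1, 0, 1, 1], [1, 1, 0, 1], print, lambda: None)
```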

Some machine learning systems will require updating when the underlying data changes. A process for monitoring and updating models can be incorporated into the design, build and ongoing operation of the system. It may be possible to automatically retrain models with only minimal oversight or intervention, and this task can be handed over to a general developer or analyst. A dedicated data scientist may be necessary to monitor and adjust more complex or bespoke models.
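
Automated retraining can be as simple as a scheduled job that refits on recent data and only promotes the new model if it still clears the bar. A sketch, where every callable is a hypothetical hook into your own pipeline:

```python
from sklearn.model_selection import train_test_split

def scheduled_retrain(load_recent_data, train, evaluate, deploy,
                      minimum_accuracy=0.75):
    """Refit on fresh data; promote the candidate only if it clears the bar."""
    X, y = load_recent_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    candidate = train(X_train, y_train)
    if evaluate(candidate, X_test, y_test) >= minimum_accuracy:
        deploy(candidate)  # replace the live model
    # otherwise keep the current model and flag it for human review
```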

It’s vital that the new feature is understood and adopted by its stakeholders including product users and internal teams. Bringing data science into a company is a transformational process, and new ways of working and thinking have to take root for it to be successful. The shared understanding and ownership built by involving the wider team in the research efforts will pay dividends here.

Outcomes: Ownership is transferred to the in-house technical team. Adoption of the new feature and any new ways of working it requires.

Risks: Failing to maintain the feature or make full use of it. Organizational rejection of the new feature.

Time: At least a week of dedicated handover time, but also ongoing throughout the project, especially in the earlier stages.

Summing up

Projects can seem daunting at first, but dividing work into stages with clear goals and processes will make it easier to manage. The discovery and research stages are all about finding good ideas to try and learning what it’s possible to achieve. If data science isn’t going to work for you, find out as fast as possible, before committing to building and managing new technology.

If research shows you have a viable feature, then the key tasks become making decisions with the team about the best trade-offs and getting it to an optimal level of performance. Ongoing data science work may be required to support it, but much of the delivery and ongoing operational work is similar to any other engineering project.