Ollie Glass

Managing data science projects

It can be hard to know how to run data science projects and what to expect, especially if you don’t have a chief data officer in your team. Because I'm often the first data scientist my clients have worked with, I find it helpful to share a framework for how to manage a data project in addition to discussing how data science fits with their business in general. There are four stages: discovery, research, production and ongoing operation. I’ll give examples of the outcomes and timelines of each stage on a small three-month project.

Complete data science projects can take months. The work is complex, demands careful attention and requires gathering data to build an understanding of the project's goal. A large company like eBay or Stitch Fix might dedicate a team of twenty data scientists to a single area for many years, but a smaller project could go from idea to implementation in three to six months.

Discovery

Start by proposing ideas and sanity checking them. While it's true that amazing things can be done with data science, much of the coverage is hyperbolic and doesn't help you think clearly about what's possible. Involve a data scientist from the start to test the feasibility of ideas and get a sense of the costs, time and chance of success. They can also suggest different ideas or approaches to a business challenge you might not have considered.

Data science is inherently risky. Even if a similar company with similar data has shipped a similar feature, that’s no guarantee it’ll work for you or work as well. That said, some features are a natural fit for data science. Say your product lets users send email campaigns. You have records of email send times and open rates, and users manually enter the time to send emails. You can improve their campaign outcomes with basic data-driven decision support. It’s a smart, low-risk feature to build.
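
As a minimal sketch of that kind of decision support, assuming a hypothetical log of past sends with a timestamp and an opened flag, you could simply recommend the hour of day with the best historical open rate:

```python
import pandas as pd

# Hypothetical columns: "sent_at" (timestamp) and "opened" (1 if opened, else 0).
sends = pd.read_csv("email_sends.csv", parse_dates=["sent_at"])

# Open rate for each hour of the day, ignoring hours with too few
# sends to give a stable estimate.
by_hour = sends.groupby(sends["sent_at"].dt.hour)["opened"].agg(["mean", "count"])
reliable = by_hour[by_hour["count"] >= 100]

best_hour = reliable["mean"].idxmax()
print(f"Suggested send hour: {best_hour}:00 "
      f"(historical open rate {reliable['mean'].max():.1%})")
```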

Data science usually involves creating models that make predictions. Models might predict the best time to send a marketing email, or which related products to show to a user. It’s important to work out how accurate a predictive model needs to be for a product feature to be viable. Models will never be entirely accurate, but they can be good enough. If you can build some flexibility into a feature you may be able to find ways to make it work even with less than ideal accuracy.
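
One way to keep that question concrete is to agree a minimum bar with the product team and encode it directly in your evaluation. A minimal sketch, with the threshold and labels purely illustrative:

```python
from sklearn.metrics import accuracy_score

VIABILITY_THRESHOLD = 0.75  # assumed minimum bar agreed with the product team

# Labels and predictions from a held-out test set (illustrative values).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
if accuracy >= VIABILITY_THRESHOLD:
    print(f"Viable: {accuracy:.1%} clears the {VIABILITY_THRESHOLD:.0%} bar")
else:
    print(f"Not viable yet: {accuracy:.1%}; keep researching or add flexibility")
```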

Outcomes: A shortlist of possible features, and a single feature to focus on. A sense of what the finished feature would look like if it was working well.

Risks: Discovering it’s not possible to build your feature, or that it would be prohibitively expensive.

Time: From a few days to a few weeks, usually a week.

Research

Research is required to work out how to create the model and meet the accuracy goal. This involves considering what the feature does, for example, predicting sale prices from photos of shoes, and finding ways to express that as a scientific or machine learning problem. Scientific problems can be tackled in different ways, so you might start by shortlisting methods and considering the tradeoffs between them.

Naturally, these methods all require data. How much data you need depends on the methods being used, with requirements ranging from tens or low hundreds to hundreds of millions of records. Having more data available will usually improve results, so if you don’t already have much data collected it can be worth finding new ways to capture it. Approaches include adding “data exhausts” to existing processes in your product, buying third-party data sets, adding different but related proxy data to your own, or creating new synthetic data.

Sometimes it can help to add more detail to data, especially if it wasn’t captured with data science or machine learning in mind. Running a data enrichment process with your team is good for building small data sets or for enrichment that requires expert knowledge. Third party services like Amazon’s Mechanical Turk or Clickworker can be used for larger volumes.

When the data is ready and the approach is clear, it’s best to start by trying a simple modelling technique and measuring the results. Research needs to be communicated to prevent data science work from becoming siloed or opaque, and this can be challenging. Modelling techniques may be complex and hard to describe. Instead of a single measure of accuracy, there may be several factors which have to be traded off against each other, and stakeholders may need to work with new concepts in order to understand and weigh in on the decisions being made.
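
For example, a first pass might be nothing more than a logistic regression baseline in scikit-learn, giving you a measured accuracy to improve on. A sketch for the email example, with the file and feature names assumed for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical features and label for the send-time example.
data = pd.read_csv("campaign_history.csv")
X = data[["send_hour", "day_of_week", "list_size"]]
y = data["opened"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The simplest credible baseline; measure it before reaching for anything fancier.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.1%}")
```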

As with any project, it helps to communicate the work in different ways to different audiences in your company. General team and company-wide presentations and discussions can build buy-in and engagement in the project, and don’t usually need to go into all of the details. Sharing the specifics with key stakeholders is usually invaluable for surfacing business insights and requirements that can guide the direction of the research.

The scientific method involves rounds of trial and error, revision and improvement. Progress can be uneven, alternating between plateaus, where new approaches are tested with no apparent change, and rapid improvements when a new method bears fruit. Over time there should be a gradual increase in accuracy and performance as the approach is refined, but it’s important to know that there’s no guarantee research will deliver the results required in a reasonable timeframe or at a reasonable cost. These risks should have been raised and considered in the discovery stage.

You might like to set a total research budget and stop if it’s reached or if interim results are good enough. Usually it’s best to take on projects where you’re at least reasonably confident the research will bear fruit, or where the risks and costs are clearly understood if you’re choosing to invest in new experiments.

Outcomes: A data science methodology that produces the outputs you need at an acceptable accuracy level. You and your team understand the general approach and have agreed any important trade-offs.

Risks: Not achieving a minimum accuracy goal within the research time or budget. Discovering an approach that can’t be scaled to a production system.

Time: From a few weeks to a few months, usually a month.

Production

We’re ready to move the data preparation and modelling techniques into production. This is the most straightforward part of the process. Software engineering is not trivial or without its own risks, but the work of building and shipping features is not unique to data science, and the risks and requirements are well known. Most of the challenges you’ll meet here have well-known solutions, whether that’s working with large amounts of data, meeting low-latency requirements, or managing complex data transformations.

That said, there’s no single common approach for moving data science from research into production, nor is there a widely adopted production framework. Ruby on Rails transformed web development and became something close to an industry standard; React and Redux are doing something similar for front-end development. But data science has only a few proprietary third-party services, nothing standard or open source.

Most projects require similar technical components: some kind of extract, transform and load (ETL) data pipeline that moves data from your product to a data science service and prepares it for use, model building and validation, deployment of models to a live system, and ongoing monitoring. To prevent these components from becoming tangled or overly complicated, they should be carefully designed and integrated with your product and deployment environment.
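
A minimal sketch of how those components might hang together, with every name hypothetical; a real system would split these across services and schedules:

```python
def extract_and_transform(source_uri: str):
    """ETL: pull raw product data and prepare features for modelling."""
    ...

def train_and_validate(features):
    """Fit a model and check it clears the agreed accuracy bar."""
    ...

def deploy(model):
    """Publish the validated model to the live prediction service."""
    ...

def monitor():
    """Track live accuracy and latency; alert if they drift out of range."""
    ...

if __name__ == "__main__":
    features = extract_and_transform("s3://product-events/")
    model = train_and_validate(features)
    deploy(model)
    monitor()
```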

There are excellent libraries like Keras, TensorFlow, pandas, scikit-learn and NumPy for the most common and critical tasks, used and sponsored by companies like Bloomberg, Spotify and Booking.com. Dedicated “big data” tools like Hadoop and Spark are not needed on most projects. Instead, AWS and Google Cloud provide excellent services for working with data at scale and are straightforward to integrate with most existing technology stacks. Your engineering team can ensure the chosen services work well with your other systems.

Outcomes: A data product feature deployed to production.

Risks: Usual engineering risks of time and budget overruns. Producing complex software that’s difficult to maintain.

Time: About a month.

Operation

It’s important to monitor that features are operating properly in production. Uptime and general status of a data science feature can usually be overseen with the monitoring services used in other parts of a technology stack. To ensure a fast and straightforward technical handover, involve the engineers in the design and build of the system, and DevOps or SRE teams in the design and operation of monitoring.

To be certain that a system is working well in a live setting, it might be necessary to monitor model accuracy on live data. If performance goes outside an expected range, you may want alerts to fire and a fallback system to kick in.
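
A sketch of that kind of check, assuming you can periodically compare recent predictions against observed outcomes; the threshold, alert and fallback hooks are all illustrative:

```python
from sklearn.metrics import accuracy_score

MINIMUM_LIVE_ACCURACY = 0.70  # assumed lower bound of the expected range

def check_live_accuracy(recent_true, recent_pred, alert, use_fallback):
    """Compare recent predictions with observed outcomes and react."""
    accuracy = accuracy_score(recent_true, recent_pred)
    if accuracy < MINIMUM_LIVE_ACCURACY:
        alert(f"Live accuracy {accuracy:.1%} below expected range")
        use_fallback()  # e.g. revert to a sensible default send time
    return accuracy

# Example wiring: log-based alerting and a no-op fallback.
check_live_accuracy([1, 0, 1, 1], [1, 1, 0, 1], print, lambda: None)
```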

Some machine learning systems will require updating when the underlying data changes. A process for monitoring and updating models can be incorporated into the design, build and ongoing operation of the system. It may be possible to automatically retrain models with only minimal oversight or intervention, and this task can be handed over to a general developer or analyst. A dedicated data scientist may be necessary to monitor and adjust more complex or bespoke models.
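
Automated retraining can be as simple as a scheduled job that refits on recent data and only promotes the new model if it still clears the bar. A sketch, where every callable is a hypothetical hook into your own pipeline:

```python
from sklearn.model_selection import train_test_split

def scheduled_retrain(load_recent_data, train, evaluate, deploy,
                      minimum_accuracy=0.75):
    """Refit on fresh data; promote the candidate only if it clears the bar."""
    X, y = load_recent_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    candidate = train(X_train, y_train)
    if evaluate(candidate, X_test, y_test) >= minimum_accuracy:
        deploy(candidate)  # replace the live model
    # otherwise keep the current model and flag it for human review
```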

It’s vital that the new feature is understood and adopted by its stakeholders including product users and internal teams. Bringing data science into a company is a transformational process, and new ways of working and thinking have to take root for it to be successful. The shared understanding and ownership built by involving the wider team in the research efforts will pay dividends here.

Outcomes: Ownership is transferred to the in-house technical team. Adoption of the new feature and any new ways of working it requires.

Risks: Failing to maintain the feature or make full use of it. Organizational rejection of the new feature.

Time: At least a week of dedicated handover time, but also ongoing throughout the project, especially in the earlier stages.

Summing up

Projects can seem daunting at first, but dividing work into stages with clear goals and processes will make it easier to manage. The discovery and research stages are all about finding good ideas to try and learning what it’s possible to achieve. If data science isn’t going to work for you, find out as fast as possible, before committing to building and managing new technology.

If research shows you have a viable feature, then the key tasks become making decisions with the team about the best trade-offs and getting it to an optimal level of performance. Ongoing data science work may be required to support it, but much of the delivery and ongoing operational work is similar to any other engineering project.