TNEI shares some “dos and don’ts” for those working in the power sector to bear in mind when getting to grips with good data science – whether conducting it yourself or commissioning others to do so.
We live in an era of data, where the sheer amount being captured and stored in all areas of life is exploding. As the challenges faced across all industries grow in both number and complexity, so does the need for increasingly sophisticated use of this data.
The electrical power sector is no exception: far greater volumes of data are likely to become available, particularly at the lowest voltage levels, including more granular network monitoring, weather monitoring and the smart metering of individual customers (subject to privacy rules). This data will surely be an immense resource in embracing the challenge of delivering a low-carbon, secure and affordable energy system with pro-active consumers.
TNEI believes that, in order to realise the potential benefits of this data, the power sector must embrace the field of data science while ensuring that the analysis conducted follows good practice to produce truly helpful results. The field of data science can serve as an incredibly valuable tool, drawing out carefully derived insights and translating business questions into questions answerable using currently available or forthcoming data.
1. Do use a multidisciplinary team with expertise in statistics, computer science, and power systems.
As highlighted in Figure 1 above, data science brings together a mix of three components: programming skills – including visualisation techniques; expertise in statistical reasoning and inference; and domain knowledge.
Good data science requires strength in all three of these distinct skill sets, and problems are likely to arise if any one of them is missing or neglected by a data scientist or team. Analytical teams should ensure that domain expertise informs every stage of their analysis, and software developers should avoid applying off-the-peg and/or black-box methods without scrutiny from an experienced statistician.
2. Do make sure you have clear objectives before scoping out a Machine Learning tool, leveraging extensive domain expertise.
The ability to do this well is one of the main benefits of integrating the components outlined above. It is easy for an analytical team to spend almost all of its time solving a technical problem, at the expense of thinking deeply about what exactly the problem to be solved is. The problem should be refined from objectives expressed in natural language down to one expressed purely mathematically, in the context of a fully specified probabilistic model.
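As a small illustration of this refinement, the vague question “will this feeder overload?” can be worked down to a fully specified probabilistic statement. The sketch below – with purely hypothetical figures and an assumed normal model for daily peak demand – computes P(peak demand > capacity) using only the Python standard library:

```python
from statistics import NormalDist

# Hypothetical model: daily peak demand (MW) on a feeder, assumed normal.
# In practice the distribution and its parameters would be fitted from data.
peak_model = NormalDist(mu=4.2, sigma=0.6)
feeder_capacity_mw = 5.5

# The natural-language objective "avoid overloads" becomes the
# mathematical question P(peak demand > feeder capacity).
overload_prob = 1 - peak_model.cdf(feeder_capacity_mw)
print(f"P(peak > capacity) = {overload_prob:.4f}")
```

Once the question is stated this way, it is clear what data would sharpen the answer and what assumptions (here, normality) need testing.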
3. Don’t build algorithms from scratch if you can use an open source Python or R package.
Spending most of your time solving problems that others have already solved is poor business value. Using libraries will drastically improve development efficiency and support a neat, modular design.
Typical development tools such as Python and R supply versatile libraries with well-established support bases, through comprehensive documentation and online forums such as Stack Overflow. However, a balance needs to be struck to ensure that each process is well understood and that black boxes are treated with caution.
4. Do consider a wide range of possible Machine Learning and probabilistic programming algorithms, and use the right one for the problem and the available data.
Unfortunately, it is rare for any single algorithm to be the best at solving a given type of problem. Selection must be tailored to the situation, weighing factors such as scalability, speed and the amount of data required for accurate results – check that you have enough.
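A hedged sketch of what this tailoring can look like in practice: rather than assuming one forecasting method is best, compare candidates on a held-out period. The two models and the demand figures below are deliberately simple and purely illustrative:

```python
# Hypothetical daily peak demand series (MW)
demand = [100, 104, 101, 107, 110, 108, 113, 115, 112, 118]
train, holdout = demand[:7], demand[7:]

def persistence(history, horizon):
    # Naive candidate: repeat the last observed value
    return [history[-1]] * horizon

def moving_average(history, horizon, window=3):
    # Alternative candidate: repeat the mean of the last few observations
    level = sum(history[-window:]) / window
    return [level] * horizon

def mae(actual, forecast):
    # Mean absolute error over the held-out period
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

mae_persistence = mae(holdout, persistence(train, len(holdout)))
mae_moving_avg = mae(holdout, moving_average(train, len(holdout)))
print(f"persistence MAE: {mae_persistence:.2f}")
print(f"moving average MAE: {mae_moving_avg:.2f}")
```

With real data the candidate list would be longer and the comparison cross-validated, but the discipline is the same: let the problem and the data pick the algorithm.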
5. Don’t overfit or underfit your model.
Given a finite amount of data, a model of any type has an optimal level of complexity. Building a model that’s too complex is known as over-fitting and means that the model-fitting process inadvertently ‘learns’ from noise in the data as though it were a genuine pattern. Underfitting means building a model that’s unnecessarily simple: a more complex one would deliver more accurate predictions, provided there is enough data to fit it properly. Hold back a proportion of your data as a test set to check for both.
6. Don’t use models that haven’t been validated.
Ensure models are rigorously validated, and that the tests applied confirm that your model performs well with regard to the specific aspects that are salient to your application, e.g. if predicting one-in-ten-year demands, validating against mean values is of limited use.
Be vigilant in checking whether any analytical tools or models that you use are based on assumed relationships that haven’t been tested. Be careful in your interpretation of confidence intervals – they can be misleading if the model’s assumptions do not hold.
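As a sketch of matching the validation metric to the application: if what matters is high demand, score candidate predictions with a quantile (“pinball”) loss rather than a mean-centred error. All figures below are hypothetical:

```python
def pinball_loss(actual, predicted, q):
    # Quantile loss: under-prediction is penalised more heavily when q is high
    total = 0.0
    for a, p in zip(actual, predicted):
        total += q * (a - p) if a >= p else (1 - q) * (p - a)
    return total / len(actual)

# Hypothetical observed daily peak demands (MW)
demands = [95, 102, 99, 130, 101, 98, 125, 100]

q = 0.9  # the application cares about the 90th percentile of demand
mean_pred = [sum(demands) / len(demands)] * len(demands)  # mean-centred guess
high_pred = [128] * len(demands)  # a deliberately high-quantile estimate

loss_mean = pinball_loss(demands, mean_pred, q)
loss_high = pinball_loss(demands, high_pred, q)
print(f"mean-style prediction loss: {loss_mean:.3f}")
print(f"high-quantile prediction loss: {loss_high:.3f}")
```

Under this metric the high estimate wins, even though it would look poor against a mean-squared-error score – the choice of validation metric changes the answer.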
7. Don’t assume that data removes the need for predictive models.
Power system professionals new to data science might believe that the existence of data – e.g. monitored LV network voltages and thermal utilisations over a period of months – eliminates the need for modelling power flows in that network.
Conversely, they may believe that a power flow model fitted to a single data set is sufficient. In fact, data and modelling always complement each other: combining more data with a good model yields better results than either could deliver alone.
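One way to see this complementarity is a precision-weighted (conjugate normal) combination of a physics-based model estimate with monitored data: each source pulls the answer in proportion to how much it is trusted. Every number below is purely illustrative:

```python
# Physics-based estimate of a busbar voltage (per unit) and its uncertainty
model_estimate, model_sd = 0.95, 0.04
# Noisy monitored readings of the same quantity (hypothetical)
measurements = [0.91, 0.93, 0.92]
meas_sd = 0.03  # assumed per-reading sensor noise

n = len(measurements)
meas_mean = sum(measurements) / n

# Weight each source by its precision (inverse variance)
w_model = 1 / model_sd ** 2
w_data = n / meas_sd ** 2
combined = (w_model * model_estimate + w_data * meas_mean) / (w_model + w_data)
print(f"model alone: {model_estimate}, data alone: {meas_mean:.4f}, "
      f"combined: {combined:.4f}")
```

The combined estimate sits between the two sources, and as more monitoring data arrive its weight grows – neither the model nor the data is redundant.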
8. Do present results smartly for a technically diverse audience.
Generating new and effective visualisations of data can sometimes be difficult, but help is available – look for inspiration! A variety of open-source libraries allow users to publish interactive data visualisations online. A dynamic visualisation can be far more powerful than a static, overloaded image, improving the reader’s experience by letting them filter, zoom and otherwise explore the data themselves.
We have only touched on our selected dos and don’ts here, and why they are so important – it would have been easy to fill many pages discussing each one. Nonetheless, we hope they provide food for thought and have sparked your curiosity about tackling your business or organisation’s data science needs in the best possible way.