
How to increase citations, ease reviews and facilitate collaboration for ML in applied science

Author(s)

Jesper Dramsch

SSI fellow

Posted on 13 February 2023

Posted by d.barclay on 13 February 2023 - 10:00am

Image: A computer screen with coding on it (by StockSnap from Pixabay)

By SSI Fellow Jesper Dramsch.

This article was first posted on dramsch.net

Are you a scientist applying ML?

I wrote a tutorial with ready-to-use notebooks to make your life easier!

Let's focus on 3 aspects:

More Citations
Easier Review
Better Collaboration

This was a EuroSciPy tutorial in 2022 (notebooks available here).

Model evaluation

In science, we want to describe the world.

Overfitting gets in the way of this.

With real-world data, there are many ways to overfit, even if we use a random split and have a validation and test set!

A machine learning model that isn't evaluated correctly is not a scientific result.

This leads to desk rejections, tons of extra work, or in the worst case, retractions and being the "bad example".

Be especially careful with:

Time Data
Spatial Data
Spatiotemporal Data
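To make the splitting problem concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit` and `GroupKFold`. The toy data and the `sites` grouping are made up for illustration; the tutorial notebook may take a different route.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical data: 12 time-ordered samples from 4 measurement sites.
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)
sites = np.repeat([0, 1, 2, 3], 3)

# Time data: every training sample comes before every test sample,
# so the model never sees the future during training.
time_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in time_folds:
    print("time  train:", train_idx, "test:", test_idx)

# Spatial data: all samples from one site stay on the same side of the split,
# so the model is evaluated on sites it has never seen.
site_folds = list(GroupKFold(n_splits=4).split(X, y, groups=sites))
for train_idx, test_idx in site_folds:
    print("space test sites:", np.unique(sites[test_idx]))
```

A plain random split would mix neighbouring time steps or sites into both sets, which is one of the ways correlated samples leak into the evaluation.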

Here's the Notebook

Benchmarking

Compare your models using the right metrics and benchmarks.

Here are great examples:

DummyClassifiers
Benchmark Datasets
Domain Methods
Linear Models
Random Forests

Always ground your model in the reality of science!

Metrics on their own don't paint a full picture.

Use benchmarks to tell a story of "how well your model should be doing" and disarm comments by Reviewer 2 before they're even written.
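Here is a rough sketch of what such a comparison can look like with scikit-learn. The imbalanced synthetic dataset and the choice of models are assumptions for illustration only, not the setup from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with roughly 90% of samples in one class.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "linear": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model with two metrics to see how much the metric choice matters.
scores = {}
for name, model in models.items():
    scores[name] = {
        metric: cross_val_score(model, X, y, cv=5, scoring=metric).mean()
        for metric in ("accuracy", "balanced_accuracy")
    }
    print(name, scores[name])
```

On data like this, the dummy model already reaches high accuracy just by predicting the majority class, while its balanced accuracy stays at chance level. That is the kind of context a single headline number hides.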

Here's the Notebook

Model sharing

Sharing models is great for reproducibility and collaboration.

Export your models and fix the random seed for paper submissions.
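As a minimal sketch, fixing the seed and exporting a scikit-learn model with joblib could look like this. The seed value, the synthetic data and the file name are placeholders, not recommendations.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

SEED = 42  # arbitrary value; what matters is that it is fixed and reported

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=SEED)
model = RandomForestRegressor(n_estimators=50, random_state=SEED).fit(X, y)

# Export the fitted model to disk and load it again.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"  # hypothetical file name
joblib.dump(model, model_path)
restored = joblib.load(model_path)
same_after_reload = np.allclose(model.predict(X), restored.predict(X))

# Retraining with the same seed gives the same model.
refit = RandomForestRegressor(n_estimators=50, random_state=SEED).fit(X, y)
same_after_refit = np.allclose(model.predict(X), refit.predict(X))

print(same_after_reload, same_after_refit)
```

Exported files like this are tied to the library versions that wrote them, which is one more reason to publish the exact dependencies alongside the model.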

Share your dependencies in a requirements.txt or env.yml so other researchers can use and cite your work!

Good code is easy to use and cite!

Use these libraries:

flake8 for linting
black for formatting

Write docstrings for docs! (VS Code has a fantastic extension called autoDocstring)

Provide a Docker container for ultimate reproducibility.

Your peers will thank you.

Here's the Notebook

Testing

I know code testing in science is hard.

Here are ways that make it incredibly easy:

Doctests for small examples
Data Tests for important samples
Deterministic tests for methods
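A small sketch of the first and last idea, using only the standard library's `doctest` module and NumPy. The `normalise` function is a made-up example, not code from the notebook.

```python
import doctest

import numpy as np


def normalise(values):
    """Scale values linearly to the range [0, 1].

    >>> normalise([2.0, 4.0, 6.0]).tolist()
    [0.0, 0.5, 1.0]
    """
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def test_normalise_is_deterministic():
    # A fixed seed makes the test input identical on every run.
    data = np.random.default_rng(0).normal(size=100)
    first = normalise(data)
    second = normalise(data)
    assert np.array_equal(first, second)
    assert first.min() == 0.0 and first.max() == 1.0


# Run the docstring example programmatically so this snippet is self-contained.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for example in finder.find(normalise, "normalise", module=False,
                           globs={"normalise": normalise}):
    results = runner.run(example)

test_normalise_is_deterministic()
print(results)
```

In a real project you would not run these by hand: `python -m doctest your_file.py` or pytest with `--doctest-modules` picks up the docstring examples, and pytest collects the `test_` function.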

You can make your own life and that of collaborators 1000 times easier!

Use Input Validation.

Pandera is a nice little tool that lets you define what your input data should look like. Think:

Data Ranges
Data Types
Category Names

It's honestly a game-changer!

Here's the Notebook

Interpretability

This is a great communication tool for papers and meetings with domain scientists!

No one cares about your mean squared error!

How does the prediction depend on changing your input values?!

What features are important?!

Here's the Notebook

Ablation studies

You know it. I know it.

Data science is trying a lot to find what works. It's iterative!

Use ablation studies to switch off components in your solution to evaluate the effect on the final score!
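One simple form of this is a feature ablation: retrain without one input at a time and record the change in score. The sketch below assumes made-up feature names and a Ridge model purely for illustration; the same loop works for switching off preprocessing steps or model components.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
n_samples = 300
features = {
    "temperature": rng.normal(size=n_samples),
    "pressure": rng.normal(size=n_samples),
    "noise": rng.normal(size=n_samples),
}
y = (
    2.0 * features["temperature"]
    + 1.0 * features["pressure"]
    + rng.normal(scale=0.1, size=n_samples)
)


def score(names):
    """Mean cross-validated R^2 of a Ridge model using only the named features."""
    X = np.column_stack([features[name] for name in names])
    return cross_val_score(Ridge(), X, y, cv=5, scoring="r2").mean()


all_names = list(features)
full_score = score(all_names)

# Switch off one feature at a time and record how much the score drops.
ablation = {
    removed: full_score - score([name for name in all_names if name != removed])
    for removed in all_names
}

for removed, drop in ablation.items():
    print(f"without {removed}: R^2 drops by {drop:.3f}")
```

A small table of these drops shows a reader which parts of the solution actually carry the result.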

This level of care is great in a paper!

Here's the Notebook

Conclusion

We looked at 6 ready-to-use notebooks to make your life easier.

This resource is for you to steal and make better science.

Each tool makes it more likely for:

Your results to go through review
Others to use and cite your stuff
The code fairy to smile upon you