We are delighted to welcome you to our in-person event in March! Leeds Data Science are pleased to present Daniel Burkhardt Cerigo on building a minimal data science platform and Russ Hyde on the importance of code quality in data science.
Schedule
18:00–18:45h: Refreshments
18:45–18:50h: Welcome
18:50–19:20h: Daniel Burkhardt Cerigo
19:30–20:00h: Russ Hyde
Daniel Burkhardt Cerigo, Founding Data Scientist @ datavaluepeople
Building a minimal data science platform
Data science teams often have to support a diverse range of projects and requests from the rest of their organisation. Establishing a general platform that can power these projects is vital to: 1. avoid severely limiting your team (running pipeline jobs on their laptops!?), 2. avoid wasting time repeatedly replicating cloud infrastructure, and 3. avoid sprawling infrastructure that becomes unmaintainable.
In this talk I’ll explain the high-level concepts behind the components required for a functioning cloud infrastructure platform to power a general data science team within a business, and give specific tech-stack examples of each. Some of the components we’ll cover are: mono-repos, containerisation, build systems, orchestration tools, compute engines, data storage, and infrastructure-as-code.
I’ll also be trying to draw out as much knowledge and expertise as possible from attendees: their experiences building platforms for data science applications, the lessons they’ve learnt, other tech options, etc. So do come along to gain some new ideas and share what you already know!
By the end of this talk you will have enough of an orientation, and references to the details, to be able to build your own data science platform tailored to your (team’s) needs and preferences.
Speaker bio: Daniel Burkhardt Cerigo is a full-stack machine learning scientist and engineer. He has been building end-to-end machine learning systems, leading data science teams, consulting, and educating on applied ML for 6+ years. He is currently doing research on protein structure prediction using ML. He founded datavaluepeople in 2020, writes open source Python packages in the ML space, is Kaggle ML-competition ranked, and likes gyoza and snowboarding.
Russ Hyde, Data Scientist @ Jumping Rivers Ltd
Does code quality even matter in data science?
It depends!
If you need to quickly summarise some data for an ad-hoc request, then knock out the code in whatever manner gets the job done.
But what happens when you start getting a lot of similar requests, or you are working on a more substantial project, or you are collaborating within a larger team? Now, productivity should be viewed ‘across the team’ and ‘across all projects’. What can you do to help yourself and your colleagues, and what tools exist to help?
Code quality concerns those aspects of software that make it easier to work with, easier to explain to others and easier to maintain or extend.
In this talk, I’ll take you through the source code for an evolving analysis project. We’ll discuss how to (and how not to) modularise code. Along the way, we’ll talk about actions and calculations, body-tweaking, duplicate stomping and a few tools that help automate the boring low-level stuff that teams sometimes disagree about.
Speaker bio: Russ Hyde is a co-organiser of Leeds Data Science and a data scientist at Jumping Rivers specialising in R and Shiny development. In a former life he was a cell biologist and then a cancer bioinformatician. He has contributed to several CRAN packages and runs online book clubs about Shiny development for the R-for-data-science community.
News and Announcements
Have a news item or announcement you’d like to make about upcoming data events or job opportunities in Leeds? Comment below or contact us directly and we’ll do our best to circulate this information at the end of the session.
Please contact the organisers if you would like to volunteer as a speaker for future events.