Introduction to Python for Data Science, Part 1
Speaker: Andrew Collier
Time: Oct 10 (Wed), 09:00
This is the first half of a two-session tutorial.
Python is a popular platform for doing Data Science. The two dominant libraries, pandas and sklearn, provide extensive functionality for data preparation, data manipulation and Machine Learning. This workshop will provide an introduction to using these libraries.
Specifically, we’ll cover the following topics:
- What is Data Science?
- Grabbing data from various sources
- Working with
- Dealing with funky data (missing data and outliers)
- Overview of Machine Learning
- Keeping it simple using Nearest Neighbours
- Capturing a trend:
- Predicting categories:
- Binary outcomes:
- Pipeline to streamline your workflow
- Cross Validation
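To give a flavour of how a few of these topics fit together, here is a minimal sketch that handles missing data, fits a Nearest Neighbours model, wraps the steps in a Pipeline and scores it with Cross Validation. It is not taken from the workshop material: the iris data set, the 5% rate of artificially removed values, the median imputation and the choice of 5 neighbours and 5 folds are all assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in data set (an arbitrary choice for illustration).
X, y = load_iris(return_X_y=True)

# Blank out roughly 5% of the values to simulate missing data (assumed rate).
rng = np.random.default_rng(42)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

# Chain the preparation steps and the model so they are always applied together.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill gaps with column medians
    ("scale", StandardScaler()),                    # put features on a common scale
    ("knn", KNeighborsClassifier(n_neighbors=5)),   # classify by the 5 nearest points
])

# 5-fold cross validation: each fold refits the whole pipeline on its training part.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```

Putting the imputer and scaler inside the Pipeline means they are fitted only on the training portion of each fold, which avoids leaking information from the held-out data into the preparation steps.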
The workshop will be intensely hands-on, so you will definitely need a laptop. Instructions for getting everything set up will be provided prior to the workshop.
No prior knowledge of Data Science or Machine Learning is assumed, although it will be helpful if you have worked with a spreadsheet before and are moderately competent with basic Python.
We will work with a diverse selection of data sets and perform a variety of analyses. Along the way we’ll build and submit an entry to a Kaggle competition. By the end of the day you will be functionally competent to venture forth on your own Data Science projects.
Please ensure that you have the following installed and tested:
- Python 3
- Modules: numpy, pandas, scipy, matplotlib and sklearn (the package is installed under the name scikit-learn).
Two easy ways to get all of the above are:
- install Anaconda or
- use the datawookie/jupyterhub Docker image.
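One way to test the installation is with a short script like the sketch below. It is not part of the official setup instructions; the helper name check_environment is hypothetical, and the script only checks that each module can be imported, not that it is a particular version.

```python
import importlib
import sys

# Module names as they are imported (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "pandas", "scipy", "matplotlib", "sklearn"]


def check_environment(modules=REQUIRED):
    """Map each module name to its version string, or None if it cannot be imported."""
    report = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


print("Python 3:", "OK" if sys.version_info.major >= 3 else "MISSING")
for name, version in check_environment().items():
    print(f"{name}: {'MISSING' if version is None else version}")
```

Any module reported as MISSING needs to be installed before the workshop.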