Introduction to Python for Data Science, Part 2

Speaker: Andrew Collier

Track: PyData

Type: Tutorial

Room: Cedarwood

Time: Oct 10 (Wed), 13:30

Duration: 4:00

This is the second half of the 2-session tutorial.

Python is a popular platform for doing Data Science. The two dominant libraries, pandas and scikit-learn (imported as sklearn), provide extensive functionality for data preparation, data manipulation and Machine Learning. This workshop will provide an introduction to using these libraries.

Specifically, we’ll cover the following topics:

  • What is Data Science?
  • Grabbing data from various sources
  • Working with Series and DataFrame objects
  • Dealing with funky data (missing data and outliers)
  • Overview of Machine Learning
  • Keeping it simple using Nearest Neighbours
  • Capturing a trend: LinearRegression
  • Predicting categories: DecisionTreeClassifier
  • Binary outcomes: LogisticRegression
  • Using Pipeline to streamline your workflow
  • Cross Validation
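
To give a flavour of the last few topics, here is a minimal sketch of a Pipeline evaluated with cross validation. It is illustrative only and not taken from the workshop materials: the choice of the iris data set bundled with scikit-learn, the step names "scale" and "knn", and the settings n_neighbors=5 and cv=5 are all assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the small iris data set that ships with scikit-learn,
# chosen here only because it needs no download.
X, y = load_iris(return_X_y=True)

# Chain preprocessing and a Nearest Neighbours classifier into one estimator.
# The step names "scale" and "knn" are arbitrary labels for this sketch.
model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# 5-fold cross validation: the whole pipeline is refitted on each training
# split, so the scaler never sees the held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print("Mean accuracy:", scores.mean())
```

Wrapping the scaler and the classifier in a single Pipeline means the same preprocessing is applied consistently inside every fold, which is one reason the two topics are often taught together.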

The workshop will be intensely hands-on, so you will definitely need a laptop. Instructions for getting everything set up will be provided prior to the workshop.

No prior knowledge of Data Science or Machine Learning is assumed, although it will be helpful if you have worked with a spreadsheet before and are moderately competent with basic Python.

We will work with a diverse selection of data sets and perform a variety of analyses. Along the way we’ll build and submit an entry to a Kaggle competition. By the end of the day you will be functionally competent to venture forth on your own Data Science projects.