I’ve officially entered my first data contest! After spending countless hours cleaning and processing the data and then running a wide variety of models, I was still very nervous about actually posting my work so publicly. I know it’s silly, because there are thousands of entries and I knew I wouldn’t land at the top. After all, this was the first machine learning project I have undertaken. So often I want to fast forward in my data science journey and already be an expert in the field, but I am learning to embrace where I am in the process. We all know data science is a continual learning process.
Z, t, p, f, chi-square… when it comes to hypothesis testing, we have many different test statistics from which to choose. If you’ve taken a high school level statistics course, you have most likely been exposed to z-tests and t-tests. These are heavily covered in the AP Statistics curriculum, but in case you forgot - here’s a brief rundown of how they work. A z-test is used when we are testing a hypothesis about a population mean and we know the population standard deviation. We can also use z-tests when we are testing population proportions. Most of the time, we don’t have access to the population standard deviation, so we use a t-test instead. There are actually two t-tests we can use when comparing two means. The first, and the one most commonly taught, is Student’s t-test. There’s a great story about why it’s called Student’s; you can watch a video about it here. The other is Welch’s t-test, which is used when the two means being compared come from groups with different variances (Student’s t-test assumes that the two groups have the same variance).
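To make the difference between the two t-tests concrete, here is a minimal sketch that computes both test statistics and their degrees of freedom using only Python's standard library. The function names `student_t` and `welch_t` and the sample numbers are made up for illustration; they are not from any particular dataset or library.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic and degrees of freedom.

    Assumes both groups share the same population variance.
    """
    na, nb = len(a), len(b)
    # Pooled variance: a weighted average of the two sample variances.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume the two groups have equal variances.
    """
    na, nb = len(a), len(b)
    # Each group's contribution to the variance of the difference in means.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical samples: group_b is much more spread out than group_a.
group_a = [10, 12, 11, 13, 12]
group_b = [8, 15, 5, 18, 9, 14, 3, 20]

print("Student's:", student_t(group_a, group_b))
print("Welch's:  ", welch_t(group_a, group_b))
```

When the two groups have very different spreads, Welch's version typically reports fewer degrees of freedom than Student's, which makes the test more cautious. When the two groups are the same size, the two t statistics come out identical and only the degrees of freedom can differ.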
While working on a project that involved building a model to predict prices from King County housing data, I became intrigued during the data cleaning process. Since I live in King County, I was interested in the data because I see the housing costs around me. When examining the data, I found that the column indicating whether or not a property was waterfront had many NaN entries. Knowing King County, when I checked to see how many properties were listed as waterfront, the number seemed a bit low given how much water exists in this county. I tried a couple of different approaches to figure out how to deal with these unknown values. The first was to look at the houses on a map and see if it was easy to designate waterfront properties based on their latitude and longitude. I used geomapping to display both the waterfront and non-waterfront properties. The red is for non-waterfront properties, the blue is for waterfront, and the yellow is for properties where waterfront is a NaN entry. As you can see, the yellow is very well dispersed amongst the red and blue. Along with that, it is difficult to pick out yellow areas that should be blue, because there are also red houses near them. The second was to look at how much difference in cost there was between waterfront and non-waterfront properties. From that, I determined that if I changed the NaN entries to the mode (0), it wouldn’t change the data too drastically.
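Filling the missing waterfront values with the mode can be sketched in a few lines. This example uses only Python's standard library, and the rows and prices are invented stand-ins rather than actual King County records. With pandas, the same idea is usually written with `Series.mode` and `Series.fillna`.

```python
from math import isnan
from statistics import mean, mode

# Hypothetical rows standing in for the housing data; all values are made up.
homes = [
    {"price": 450_000, "waterfront": 0.0},
    {"price": 1_600_000, "waterfront": 1.0},
    {"price": 520_000, "waterfront": float("nan")},
    {"price": 380_000, "waterfront": 0.0},
    {"price": 610_000, "waterfront": float("nan")},
    {"price": 495_000, "waterfront": 0.0},
]


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and isnan(value))


# The mode is computed from the known values only.
known = [h["waterfront"] for h in homes if not is_missing(h["waterfront"])]
fill_value = mode(known)

# Build new rows, replacing each missing flag with the mode.
filled = [
    {**h, "waterfront": fill_value if is_missing(h["waterfront"]) else h["waterfront"]}
    for h in homes
]


def mean_price(rows, flag):
    """Average price of the rows whose waterfront flag equals `flag`."""
    return mean(h["price"] for h in rows if h["waterfront"] == flag)


# Compare the non-waterfront average before and after filling.
before = mean_price(homes, 0.0)   # NaN rows never equal 0.0, so they are skipped
after = mean_price(filled, 0.0)
print(f"fill value: {fill_value}, before: {before:,.0f}, after: {after:,.0f}")
```

Comparing a group average before and after the fill is one quick way to check whether the imputation shifts the data much. It does not tell you whether the filled values are actually correct.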
My journey into data science began with this question: What is the cursive writing of high school math? I’ve been teaching high school math for the past 13 years. I’ve taught Geometry, Algebra 2, Precalculus, AP Calculus, AP Statistics, and some math electives. The more I’ve grown as an educator, the more I see that the system we’ve created in education is fiercely broken. The whole point of one institution is to get into another institution. Go from a good elementary school to a good middle school to a good high school to a good college, then finally “the real world”. High school has become the business of being college preparatory without really questioning whether it should be. Should every student need a college degree in order to have a successful life? What are we teaching in high school that is actually obsolete (like cursive writing, which has been removed from elementary school teaching)? Every math teacher knows that they will be asked the question “When will I use this in real life?” - and most of the time the response is some vague appeal to being able to think critically and problem solve.