The Challenges

Short description

This dataset contains house sales records in Brooklyn, New York City, from 2003 to 2017. It combines NYC Rolling Sales data from the NYC Department of Finance with the PLUTO dataset for roughly 30,000 properties, merged into a single .csv file. Candidates are required to predict the property sale prices using only the data provided. The plot below illustrates the sale price per unit area across locations and years.

With over 100 features, finding those most strongly correlated with the sale price can be quite a challenge. For some covariates, such as residential area, the influence on sale price is fairly intuitive; others, such as police forces, require more investigation. Performing dimensionality reduction, for example with Principal Component Analysis (PCA), is therefore a recommended way of starting your analysis. Additionally, given the volatility of property prices, implementing appropriate anomaly detection methods would also help improve your predictions. We will run your code against test cases, and you will be awarded marks for both the accuracy and the interpretability of your model. Candidates are free to choose any modelling method they feel most comfortable with, as long as they provide sensible reasons for their choice.
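As a rough illustration, the recommended PCA starting point can be sketched with plain NumPy. The matrix `X` below is a small synthetic stand-in, not the actual housing data; with the real .csv you would substitute the numeric feature columns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the housing feature matrix:
# 500 properties x 8 numeric features (the real data has 100+).
X = rng.normal(size=(500, 8))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=500)  # one deliberately correlated pair

# Centre and scale each feature, then run PCA via the SVD.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

explained = s**2 / np.sum(s**2)  # fraction of variance per component
scores = Xs @ Vt.T               # data projected onto the components
print(np.round(explained, 3))
```

Components with a large explained-variance fraction are the natural candidates to carry into a regression, replacing the raw correlated features.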

AI Hack: GWAS Challenge

Harrison Zhu and Thomas Wong   

Short description

Our data contains study individuals with genotype information across 861,473 single nucleotide polymorphisms (SNPs). The individuals were surveyed over 5 years, in a case-control study of severe angiographic coronary artery disease (CAD) status among individuals of European ancestry. We provide a pre-processed dataset in the form of .csv files for:
  • SNP matrices for all 22 chromosomes and all patients 
  • Clinical matrix for all patients, containing: 
      ◦ High-density lipoprotein (HDL) 
      ◦ Low-density lipoprotein (LDL) 
      ◦ Triglycerides (TG) 
      ◦ Coronary artery disease (CAD) 

Building a linear model 

$$HDL_{i} = \alpha_{i} + x_{SNP_{k}}\beta_{i,1} + x_{LDL}\beta_{i, 2} + x_{CAD}\beta_{i, 3} + x_{SEX}\beta_{i, 4} + x_{TG}\beta_{i, 5} + \sum_{j=1}^{9}PC_{j}\beta_{i, 5+j} + \epsilon_{i},$$ with $\epsilon_{i}\stackrel{i.i.d.}{\sim} F$ for some probability distribution $F$. With HDL as the response, adding one SNP and nine principal components at a time, we obtain the results below, summarised by a Manhattan plot with $F$ taken to be Gaussian.
Main Task: Study which SNPs are likely to be significantly associated with changes in measures such as triglycerides, low-density lipoprotein (LDL) cholesterol and high-density lipoprotein (HDL) cholesterol, and which SNPs appear most strongly linked to cardiovascular disease. In particular, use linear and logistic regression. Be careful with how you interpret the p-values (apply a Bonferroni correction).
Bonus: Other methods are also possible and it is up to you to explore these options.
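A minimal sketch of the per-SNP linear regression with a Bonferroni-adjusted threshold, using entirely synthetic genotypes (all sizes, variable names and effect sizes below are illustrative, and the nine principal components are omitted for brevity):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 50  # individuals, SNPs (toy sizes; the real data has ~8.6e5 SNPs)

snps = rng.integers(0, 3, size=(n, m)).astype(float)  # genotypes coded 0/1/2
covars = rng.normal(size=(n, 4))                      # stand-ins for LDL, CAD, SEX, TG
hdl = 0.8 * snps[:, 0] + covars @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(size=n)

pvals = []
for k in range(m):
    # One regression per SNP: intercept, SNP k, clinical covariates.
    X = np.column_stack([np.ones(n), snps[:, k], covars])
    beta, *_ = np.linalg.lstsq(X, hdl, rcond=None)
    resid = hdl - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    z = beta[1] / se
    # Normal approximation to the t-distribution (fine for n >> p).
    pvals.append(math.erfc(abs(z) / math.sqrt(2)))

alpha = 0.05 / m  # Bonferroni-adjusted significance threshold
hits = [k for k, p in enumerate(pvals) if p < alpha]
print(hits)
```

The key point is the division of the nominal level 0.05 by the number of tests: without it, testing hundreds of thousands of SNPs at 0.05 would flood the Manhattan plot with false positives.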

Short description 

This dataset is rooted in a 10-question questionnaire, issued by the US Census Bureau, that every American citizen should have answered. The answers have been organised in several ways by the US Census Bureau and are hosted in several formats here.

We are using the 2012-2016 detailed tables – Block Groups – California dataset, which the US Census Bureau hosts only in geodatabase format. The geodatabase file contains both the geography and the metadata of all the block groups in a given state.

A block group is a collection of several blocks, which are small areas usually bounded by some geographic entity such as a road or a river. A block group typically has a population of around 1,000 people, although this can vary considerably. The geographic data was unfortunately omitted when we exported the metadata to .csv files. The US census questionnaire asks for sex, age, gender, annual income, civil status, education, employment status and a few other items. The US Census Bureau has then restructured these answers into anonymous features describing the averages of some answers and the counts of people fitting certain characteristics.

With over 7,500 features there are a lot of variables to consider, and we want you to identify interesting correlations within the dataset that you think could be valuable to the global community.
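One simple way to start hunting for such correlations is a brute-force pairwise scan. The sketch below uses a small synthetic matrix in place of the real 7,500-feature census table (the planted correlation and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 20  # block groups x features (toy; the real data has 7,500+)

X = rng.normal(size=(n, p))
X[:, 5] = 0.7 * X[:, 2] + 0.3 * rng.normal(size=n)  # plant one strong correlation

# Pairwise Pearson correlations; scan the upper triangle for the strongest pairs.
corr = np.corrcoef(X, rowvar=False)
iu = np.triu_indices(p, k=1)
order = np.argsort(-np.abs(corr[iu]))
top = [(int(iu[0][i]), int(iu[1][i]), corr[iu][i]) for i in order[:3]]
print(top)
```

With 7,500 features there are roughly 28 million pairs, so in practice you would screen in chunks, or restrict to correlations against a handful of target variables of interest.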

Short description  

The dataset contains information on road accidents in the UK over a specific period. There are around 300K records with 65 features, most of which are categorical. However, some features have more than 30% missing values, which poses a problem for the modelling procedure.

Candidates are required to predict the number of casualties based on the other covariates. They may drop any features with large amounts of missing values, but such drops will affect accuracy.

Our suggestion for the missing values is to impute them using whatever methods the candidates consider appropriate. One option is a random forest, although the size of this dataset makes fitting a random forest imputer computationally demanding.
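A much cheaper baseline than a random forest imputer is per-column median imputation, sketched below on a synthetic matrix (the data, missingness rate and column choice are illustrative, not taken from the accident dataset):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy numeric feature matrix with ~30% missing values in one column.
X = rng.normal(loc=5.0, size=(1000, 4))
mask = rng.random(1000) < 0.3
X[mask, 2] = np.nan

# Per-column median imputation: a single pass over the data, which is
# cheap enough for 300K rows, unlike fitting a random forest imputer.
medians = np.nanmedian(X, axis=0)
filled = np.where(np.isnan(X), medians, X)
```

For the many categorical features, the analogous baseline is filling with the column mode, or adding an explicit "missing" category so the model can learn whether missingness itself is informative.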

As a baseline, we fitted a Poisson regression, penalised with both $L^{1}$ and $L^{2}$ terms, to 5,000 data points. The graphs below compare the actual outputs and the predictions for the first 200 points.
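An $L^{1}$- and $L^{2}$-penalised Poisson regression of this kind can be sketched from scratch with proximal gradient descent. Everything below (data, coefficients, penalty strengths, step size) is synthetic and illustrative, not the baseline actually used:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5000, 6
X = np.column_stack([np.ones(n), rng.normal(scale=0.5, size=(n, p - 1))])
beta_true = np.array([1.0, 0.4, -0.3, 0.0, 0.0, 0.2])
y = rng.poisson(np.exp(X @ beta_true))  # Poisson counts with log link

l1, l2, lr = 1e-3, 1e-3, 0.05
beta = np.zeros(p)
for _ in range(2000):
    mu = np.exp(X @ beta)
    # Gradient of the Poisson negative log-likelihood plus the L2 penalty...
    grad = X.T @ (mu - y) / n + l2 * beta
    step = beta - lr * grad
    # ...then the soft-thresholding proximal step handles the L1 penalty.
    beta = np.sign(step) * np.maximum(np.abs(step) - lr * l1, 0.0)

print(np.round(beta, 2))
```

The $L^{1}$ term shrinks uninformative coefficients towards exactly zero (helpful with 65 features of mixed quality), while the $L^{2}$ term stabilises the fit when covariates are correlated.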

The rules

AI Hack is organised by ICDSS, MathSoc and MLSoc. Imperial College London © AIHack 2018, Icon Copyright
Web design by Harrison Zhu & Enyi Shang