Does this variable appear to be important for the task at hand, and why?
The dataset for this coursework is included in the UCI machine learning repository, and was used in an actual research study aiming to understand and predict which credit card holders default on their debt. Note that the dataset you will be provided with is not identical to the one in the UCI repository: it has been processed to correct specific errors. As with any real-world dataset, however, you should not expect the data to be perfect. Identifying any issues (limitations) with the data and attempting to correct them is part of the assessment.

Dataset description: The data is a sample of 30,000 credit card holders from an important bank in Taiwan, so all amounts are in New Taiwan dollars (NT). The data was collected in October 2006. In the list of variables below, we first give the name of each variable as it appears in the data frame and then its description.
• LIMIT_BAL: Amount of credit, which includes both the individual consumer credit and his/her family (supplementary) credit.
• EDUCATION: Categorical variable representing education: 1 = graduate school; 2 = university; 3 = high school; 4 = other/unknown.
• MARRIAGE: Marital status of the credit card holder. Categorical variable taking values: 1 = married; 2 = single; 3 = unknown.
• AGE: Age of the credit card holder.
• PAY_1, …, PAY_6: Repayment status over the last six months. Specifically, PAY_1 corresponds to repayment status in September, PAY_2 to repayment status in August, etc. A value of zero means that the credit card holder has repaid their credit card fully. A value of 1 means that there is a payment delay of one month; 2 means a payment delay of two months, etc.
• BILL_AMT1, …, BILL_AMT6: Bill statements over the past six months: BILL_AMT1 corresponds to September 2005, BILL_AMT2 to August 2005, etc., up to BILL_AMT6, which corresponds to April 2005.
• PAY_AMT1, …, PAY_AMT6: Amount of previous payments over the past six months: PAY_AMT1 corresponds to September 2005, PAY_AMT2 to August 2005, etc., up to PAY_AMT6, which corresponds to April 2005.
• default: Binary response (class) variable. It indicates whether the credit card holder defaulted on the next monthly payment (default = 1) or paid on time (default = 0).

Task description: The focus of this coursework is on exploratory data analysis (data understanding). Using appropriate visualisation methods and statistical measures covered in the first part of the course (the meaning of this is explained precisely at the top of this brief), develop general and specific insights from the data which are relevant to the classification problem at hand. Your report should discuss all the variables contained in the dataset, and for each variable your answer should address the questions:
• Does this variable appear to be important for the task at hand, and why? Support your claims with appropriate visualisations that document whether and how important each variable is.
• Are different variables related, and which variables convey information similar to that provided in other variable(s)?

You should also report key findings related to issues of data quality, such as incorrect observations, outliers, and unexpected findings. Note that this is not an exhaustive list of questions.
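As a rough illustration of the three kinds of question above (importance for the class variable, relatedness between variables, and data quality), the following Python sketch computes a default rate per value of a variable, a Pearson correlation between two bill columns, and a list of values that fall outside the documented codes. It is only a sketch under stated assumptions: the rows are invented toy values and are not taken from the coursework data, the column names assume underscores (e.g. BILL_AMT1), and the helper names `default_rate_by`, `pearson` and `undocumented_values` are hypothetical. It uses plain Python so that it runs without any libraries; it does not replace the visualisations the brief asks for.

```python
from collections import defaultdict
from math import sqrt

# Toy rows invented purely for illustration; NOT taken from the coursework data.
COLUMNS = ("EDUCATION", "MARRIAGE", "PAY_1", "BILL_AMT1", "BILL_AMT2", "default")
VALUES = [
    (1, 1, 0, 1000, 900, 0),
    (1, 2, 0, 2000, 2100, 0),
    (2, 1, 2, 3000, 2900, 1),
    (2, 2, 1, 4000, 4200, 1),
    (3, 1, 0, 5000, 5100, 0),
    (3, 2, 3, 6000, 5800, 1),
    (0, 2, 0, 7000, 7100, 0),  # EDUCATION = 0 is not a documented code
    (2, 0, 2, 8000, 7900, 1),  # MARRIAGE = 0 is not a documented code
]
rows = [dict(zip(COLUMNS, values)) for values in VALUES]

# Codes listed in the dataset description for the categorical variables.
ALLOWED = {"EDUCATION": {1, 2, 3, 4}, "MARRIAGE": {1, 2, 3}, "default": {0, 1}}


def default_rate_by(rows, column):
    """Map each value of `column` to (share of defaulters, group size)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["default"])
    return {
        value: (sum(flags) / len(flags), len(flags))
        for value, flags in sorted(groups.items())
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def undocumented_values(rows, allowed):
    """Map each column to the sorted values that are not documented codes."""
    found = {}
    for column, codes in allowed.items():
        extra = sorted({row[column] for row in rows} - codes)
        if extra:
            found[column] = extra
    return found


# Importance: does the default rate differ across repayment statuses?
print(default_rate_by(rows, "PAY_1"))

# Relatedness: do consecutive bill statements carry similar information?
bill1 = [row["BILL_AMT1"] for row in rows]
bill2 = [row["BILL_AMT2"] for row in rows]
print(round(pearson(bill1, bill2), 3))

# Data quality: which categorical values fall outside the documented codes?
print(undocumented_values(rows, ALLOWED))
```

A large difference in default rate between groups suggests the variable may be informative, but small groups (the second element of each tuple) should be read with caution. A correlation close to 1 between two columns suggests they convey largely the same information.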