Complete the find_dist function.
Learning Goal: I’m working on a Python question and need an explanation and answer to help me learn.

Will provide all the files once the question is accepted.

In this homework you will:
Learn how to install new Python modules
Build up a complex analysis code by building smaller functions first
Perform some basic data exploration

Background

Python modules

Python has a lot of built-in functionality right out of the box: basic data structures like lists, sets, and dictionaries, functions that help use those data structures (like len), etc. There are a vast number of Python modules that provide additional functionality too. This functionality is not built in (not everyone needs it), but Python comes with tools to make it easy to use those modules.

Installing a module

To install a module, you can use Python’s built-in package manager, pip. We will provide instructions for installing modules from the command line. These instructions will work with Python3 installations on Scholar, ECE Grid, most Linux distributions, and Mac OS. There are a number of different ways to get Python3 on Windows, so you will have to look at the documentation for your version to determine how to install a new module.

Modules can be installed globally (so everyone on a machine has access to them) or locally (so only you have access to them). Installing modules globally requires root access to the machine (or other specially-set permissions), so we will provide instructions for installing modules locally.

To install a module named scipy, you can use the following command:

    python3 -m pip install --user scipy

This command installs the latest version of the module.
You can also install an exact version of the module by adding == and the version number after the module’s name:

    python3 -m pip install --user scipy==2.0.8    # install the scipy module with version number 2.0.8

If scipy is already installed on your system but you want to upgrade to a new version of it, you can use the command:

    python3 -m pip install --user --upgrade scipy

There are also other package managers. For example, if you are using conda to manage your Python environment, you can do the same things with the following commands:

    conda install scipy          # install the latest version of the scipy module
    conda install scipy=2.0.8    # install scipy version 2.0.8
    conda update scipy           # update the scipy module

Note: To specify a version number, pip uses == while conda uses =.

To complete this homework, you will need to install the following modules with the specified versions or higher:
numpy==1.14.0 : this is a module that provides array and matrix classes, and many mathematical operations on those classes. It is the foundation of many of the modules that are used in data science.
scipy==1.0.0 : this module provides many other useful functions for data analysis, including functions for dealing with probability distributions.
matplotlib==2.1.2 : a basic plotting/visualization library.

Note 1: You are welcome to use different versions of the above modules, since many of them may already be installed in your environment. Problems may occur if you use a different version, especially one older than those listed above, so please make sure your version is at least as large as the version number listed.

Note 2: If you encounter the error message ModuleNotFoundError: No module named 'tkinter', then in your code replace the line:

    import matplotlib.pyplot as plt

with the following lines:

    import matplotlib
    matplotlib.use('agg')
    import matplotlib.pyplot as plt
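
To confirm that the installed modules meet the minimum versions listed above, a small check like the following can help. This is only a sketch: the helper names (parse_version, meets_minimum, report_versions) are made up for illustration, and it compares only the first three numeric parts of a version string, which is an assumption that suits simple versions like 1.14.0.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed in this handout.
MINIMUM_VERSIONS = {"numpy": "1.14.0", "scipy": "1.0.0", "matplotlib": "2.1.2"}


def parse_version(text, width=3):
    """Turn '1.14.0' into (1, 14, 0), padding missing parts with zeros."""
    parts = []
    for piece in text.split(".")[:width]:
        if not piece.isdigit():
            break  # stop at suffixes such as '0rc1'
        parts.append(int(piece))
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def report_versions(minimums=MINIMUM_VERSIONS):
    """Return {module name: status string} for each required module."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, required) else "too old"
        report[name] = f"{installed} ({status}, need >= {required})"
    return report


if __name__ == "__main__":
    for name, status in report_versions().items():
        print(f"{name}: {status}")
```

Comparing the parts as integers matters here: as plain strings, '1.9.3' would sort after '1.14.0', even though 1.9 is the older release.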

Using a module

The functions in a module are in that module’s namespace. To make sure that the function names do not collide with functions in other modules (or Python’s built-in functions), the functions need to be accessed through a prefix. To load a module, you have to tell Python (a) which module to load; and (b) what prefix to use when accessing the functions of that module. For example, the following code:

    import numpy as np

tells Python you want to use the module numpy, and that you want to access the functions of numpy using the prefix np. For example, the following code will read in a list of numbers stored in a text file and give you back a NumPy array with those numbers in it:

    data = np.loadtxt('input.txt')

You can also use the import keyword to bring in functions from other files (think of these like #include directives in C). The following command will import function1 from a file called myfile.py:

    from myfile import function1

You can also import all functions defined in another file, like helper.py for example, using the asterisk operator:

    from helper import *

Incremental Development

In this homework, we will ask you to write a fairly complex piece of code: finding the number of histogram bins that results in the lowest error for a given data set. When you need to write complex code like this, your goal should be to break the problem down into smaller pieces.
Write functions that solve each of the smaller pieces, then figure out how to connect those functions together (some of them might call other functions you wrote) to solve the overall problem.

This approach makes it much easier to write complex code, both because you do not have to solve the problem all in one go, and because it makes it easier to test your code: you can test each of your smaller pieces individually to make sure that they work properly.

In this homework, we will walk you through one particular way you can break down the problem (and, in fact, we want you to solve the problem in this way; we will test the individual pieces for partial credit).

Store returned values from a function

To get the coordinates of the data points in a probability plot (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html), you can store the function’s return values using the following command:

    val1, val2 = stats.probplot(data, dist='uniform', plot=plt)

This stores the two return values from probplot in val1 and val2 respectively. Only val1 (per the documentation) is related to the coordinates.

Instructions

Problem 1: Histogram Bin Width Optimization

In this problem, you will implement histogram bin width optimization for a data set. You will use the histogram function in matplotlib.pyplot, accessed as matplotlib.pyplot.hist, or as plt.hist if you import matplotlib.pyplot as plt. Please read the documentation for matplotlib.pyplot.hist. (Note in particular that the function returns a tuple of three elements: n, bins, and patches, but you only need n, so be sure to unpack the output accordingly.)

An example of tuple unpacking:

    List1 = ['String1', 'String2']
    str1, str2 = List1  # str1 == 'String1' and str2 == 'String2'
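
The same idea extends to a function that returns three values when only the first is needed. The sketch below uses three_part_result, a hypothetical stand-in that only mimics the (n, bins, patches) shape described above; it is not matplotlib itself, and the numbers in it are made up.

```python
def three_part_result():
    """Hypothetical stand-in that mimics the (n, bins, patches) shape."""
    counts = [3, 5, 2]                  # n: one count per bin
    edges = [0.0, 1.0, 2.0, 3.0]        # bins: one more edge than there are bins
    patches = ["bar0", "bar1", "bar2"]  # placeholder for the drawn bars
    return counts, edges, patches


# Unpack all three values, using _ for the ones that are not needed.
n, _, _ = three_part_result()
print(n)

# Equivalent: index the returned tuple directly.
n_again = three_part_result()[0]
```

Unpacking into a single name instead (n = three_part_result()) would bind n to the whole tuple, which is a common source of confusing errors later on.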

We have broken the problem down into smaller pieces for you. Problem1.py has four functions for you to fill in. Keep the signatures of these functions the same as you are filling them in; we will use these to assign partial credit.

norm_histogram takes a histogram of counts and creates a histogram of probabilities.
compute_j computes the value of J for a given histogram and bin width (check histogram slides for more info).
sweep_n computes the J value for each number of bins from min_bins to max_bins, and so returns a list of compute_j values. You will need to use the compute_j and matplotlib.pyplot.hist functions in your implementation. Note that sweep_n cares about the number of bins while J cares about the width of the bins, so make sure to do the conversion!
find_min is a generic function that takes a list of numbers and returns a tuple containing the average of the three smallest numbers in that list and the indexes of those three numbers.
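
A few of these pieces can be sketched in plain Python. This is not the reference solution. The signatures in Problem1.py are not shown here, so the parameter names below are assumptions, and bin_width is a hypothetical helper that illustrates the bins-to-width conversion for equal-width bins spanning the full data range. The sketch of find_min also assumes the indexes are listed from the smallest value upward.

```python
def norm_histogram(hist):
    """Convert a list of bin counts into a list of bin probabilities."""
    total = sum(hist)
    return [count / total for count in hist]


def bin_width(data, n_bins):
    """Hypothetical helper: width of each of n_bins equal bins over the data range."""
    return (max(data) - min(data)) / n_bins


def find_min(values):
    """Return (average of the three smallest values, their indexes).

    Indexes are listed from the smallest value upward (an assumption).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])[:3]
    average = sum(values[i] for i in order) / 3
    return average, order


print(norm_histogram([2, 3, 5]))    # [0.2, 0.3, 0.5]
print(bin_width([0, 2, 10], 5))     # 2.0
print(find_min([5, 1, 4, 2, 3]))    # (2.0, [1, 3, 4])
```

Sorting the indexes rather than the values keeps track of where each of the three smallest numbers came from, which is what the second element of the returned tuple needs.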

You can use input.txt, provided in the repository, as test data. To test each function individually, please refer to testbin.py. There are instructions provided in the file to test each portion of your code.

Within testbin.py, if:
norm_histogram runs correctly then the output (shown only up to 4 decimal places) will be [0.104 0.096 0.094 0.079 0.108 0.092 0.114 0.109 0.121 0.083]
compute_j works then the output will be -0.0101
sweep_n works then the output (shown only up to 4 points post decimal) should be [-0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.0101 -0.01 -0.0101 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.0099 -0.01 -0.01 -0.0099 -0.0099 -0.0099 -0.0098 -0.0099 -0.0098 -0.0099 -0.0099 -0.0099 -0.0098 -0.0098 -0.0098 -0.0098 -0.0097 -0.0098 -0.0098 -0.0098 -0.0098 -0.0098 -0.0097 -0.0097 -0.0097 -0.0097 -0.0097 -0.0096 -0.0097 -0.0097 -0.0096 -0.0096 -0.0096 -0.0095 -0.0096 -0.0097 -0.0095 -0.0096 -0.0095 -0.0095 -0.0096 -0.0095 -0.0094 -0.0095 -0.0095 -0.0095 -0.0094 -0.0093 -0.0094 -0.0093 -0.0095 -0.0094 -0.0093 -0.0094 -0.0094 -0.0094 -0.0093 -0.0094 -0.0092 -0.0094 -0.0093 -0.0092 -0.0093 -0.0092 -0.0091 -0.0092 -0.0093 -0.0093 -0.0092 -0.0093 -0.0091 -0.0092 -0.0092 -0.0092 -0.0091 -0.0091 -0.0091 -0.0091]
find_min executes then your output should be (-0.0101, [0, 7, 1])

Please note: you are not to truncate the values. We have only done so to keep the write-up brief. The output from find_min will be checked to four decimal places.
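
One way to compare your full-precision result against the shortened values above, without truncating your own output, is to round only at comparison time. The helper below is a hypothetical illustration; how the grader actually performs its four-decimal check is not stated here.

```python
def matches_to_4dp(actual, expected):
    """True if both values agree after rounding to four decimal places."""
    return round(actual, 4) == round(expected, 4)


# A full-precision value (made up for illustration) against the value shown above.
print(matches_to_4dp(-0.010093, -0.0101))   # True
print(matches_to_4dp(-0.0101, -0.0102))     # False
```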

If your functions all work, and you run the test code that is included in Problem1.py, you should produce the following output: (-0.0101, [0, 7, 1]). (Again, we truncated our value here only to keep this document brief.)

The expected outputs are shown at reduced precision; your results from running testbin.py may contain more digits.

The if __name__ == '__main__' line in Problem1.py is a useful way to write tests for your code: the code under it only runs if you run this file as the main script; if this file is imported from another script, this test code will not run.

Problem 2: Distributions

In this problem, we will draw on your understanding of probability plots to create a function that, for any dataset, reports which of the following distributions is the closest fit:
Gaussian (norm)
Exponential (expon)
Uniform (uniform)
Wald (wald)

Note: for this problem we will assume that the best-fit distribution is the one for which the sum of squared distances from the coordinates of the probplot to the identity line (X=Y) is minimized.

To complete this problem, do the following steps.
Complete the get_coordinates function. This function takes in an array of data and the name of a distribution. It then calculates the QQ plot by calling the stats.probplot function with the dataset and the named distribution. The stats.probplot function returns a bizarre data structure: a tuple of two tuples; we’re concerned with the two values in the first tuple of the returned tuple. More concretely, stats.probplot returns something with a structure like ((X, Y), (c, d, e)); you will need to return the elements in the positions of X and Y from the get_coordinates function (as a tuple like (X, Y)).
Complete the calculate_distance() function. This function takes in two floats and returns the calculated distance. The formula you need to use for this function is (in LaTeX form): $$\sqrt{\left(x - \frac{x+y}{2}\right)^2 + \left(y - \frac{x+y}{2}\right)^2}$$ This is the distance from the point (x, y) to its projection onto the identity line.
NOTE: If you can’t read LaTeX, you can copy and paste that formula into an online LaTeX compiler like QuickLaTeX.
Complete the find_dist function. This function takes in a list of sums of squared distances and a list of distributions. Your code must find the minimum value in the sum_err list and the distribution at the same index in dists. It returns a tuple that contains the distribution selected and the error calculated, for example ('norm', 9.87546).
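
The last two steps can be sketched in plain Python as follows. This is an illustration rather than the reference solution. The argument order of find_dist follows the description above but is otherwise an assumption, and sum_squared_distances is a hypothetical helper showing how the per-point distances might be combined into one error value per distribution. The input numbers are made up.

```python
import math


def calculate_distance(x, y):
    """Distance from the point (x, y) to its projection on the line X=Y."""
    mid = (x + y) / 2  # the projection of (x, y) is the point (mid, mid)
    return math.sqrt((x - mid) ** 2 + (y - mid) ** 2)


def sum_squared_distances(xs, ys):
    """Hypothetical helper: total squared distance over paired coordinates."""
    return sum(calculate_distance(x, y) ** 2 for x, y in zip(xs, ys))


def find_dist(sum_err, dists):
    """Return (distribution, error) for the smallest entry in sum_err."""
    best = min(range(len(sum_err)), key=lambda i: sum_err[i])
    return dists[best], sum_err[best]


# Made-up error values, one per candidate distribution.
errors = [12.5, 3.2, 40.0, 7.1]
names = ["norm", "expon", "uniform", "wald"]
print(find_dist(errors, names))   # ('expon', 3.2)
```

Since both terms under the square root equal $$\left(\frac{x-y}{2}\right)^2$$, the formula simplifies to $$\frac{|x - y|}{\sqrt{2}}$$, which is a handy way to check calculate_distance by hand.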
If your code is correct, you should get the following results for the files sample_norm.csv, sample_expon.csv, sample_uniform.csv, and sample_wald.csv respectively:
('norm', 96.90230310278383)
('expon', 155.95940064211737)
('uniform', 30.477151216719985)
('wald', 2366.701864399592)
NOTE: we will only check your values to 4 decimal places.

What to Submit

For Problem 1, please submit Problem1.py with all the appropriate functions filled in.
For Problem 2, please submit Problem2.py with all the appropriate functions filled in.
Requirements: Please make changes to the necessary files