#################
Outlier detection
#################

.. _diagnostics-preparation:

****************************************
Introduction to the diagnostics exercise
****************************************

For this exercise, you will need the following skills:

* reading bytes from a file;
* calculating a hash from the bytes.  See :ref:`reading-git-objects`;
* stripping and splitting strings with ``my_string.strip()`` and
  ``my_string.split()``;
* raising Errors with ``raise``;
* manipulating paths with ``os.path`` (see: :doc:`path_manipulation`).

Split into groups, then:

* Get the data for your group from the USB key(s);
* Tell me about your groups;
* "Fork" the repository for your group, to your github user:

  * Group 00 forks : https://github.com/psych-214-fall-2016/diagnostics-00
  * Group 01 forks : https://github.com/psych-214-fall-2016/diagnostics-01
  * Group 02 forks : https://github.com/psych-214-fall-2016/diagnostics-02

* Clone your forked repository.  For example::

      git clone https://github.com/matthew-brett/diagnostics-00

  where ``matthew-brett`` is your github user name, and ``00`` is your
  group;

* Change directory into this new directory, e.g.::

      cd diagnostics-00

* Unpack and copy the data from the USB key into your new ``data``
  directory.  The following instructions assume you are running from the
  terminal in OSX and Linux, or the git bash shell in Windows.

  * If you have already unpacked your ``group0x.tar.gz`` archive (where x
    can be 0, 1 or 2), then copy the ``*.nii`` files to your data
    directory with something like::

        cd data
        cp ~/Downloads/group00/* .

    where ``~/Downloads/group00`` is the directory you unpacked to;

  * If you haven't unpacked the archive yet, you can unpack the archive
    into your data directory with::

        cd data
        tar zxvf ~/Downloads/group00.tar.gz

  Your ``data`` directory should now contain 20 files with filenames
  starting with ``group0`` and ending with ``.nii``, another file called
  ``hash_list.txt``, and a file called ``data_hashes.txt`` that came with
  the repository when you cloned it;

* Now do ``git status``.  You will see that none of the files that you
  have just copied show up in git's listing of untracked files.  This is
  because I put a clever ``.gitignore`` file in the ``data`` directory,
  to tell git to ignore all files except the ``data_hashes.txt`` file.
  You can see the file by opening it in Atom with ``atom .gitignore``;
* If your terminal is currently running in the ``data`` subdirectory,
  change directory back to the ``diagnostics-00`` (etc) directory with
  ``cd ..``;
* Have a look at the ``hash_list.txt`` file, by opening it in Atom::

      atom data/hash_list.txt

  For each of the ``.nii`` files, ``hash_list.txt`` has a line with the
  SHA1 hash for that file and the filename, separated by a space;
* You want to be able to confirm that your data has not been overwritten
  or corrupted.  To do this, you need to calculate the current hash for
  each ``.nii`` file and compare it to the hash value in
  ``hash_list.txt``;
* See :ref:`reading-git-objects` for a reminder of how to read file
  contents and calculate the SHA1 hash for the contents;
* Now run ``python3 scripts/validate_data.py data``.  When you first run
  this file, it will fail;
* Edit ``scripts/validate_data.py`` in Atom to fix; the sketch after this
  list shows the kind of check you will need.
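To make the steps above concrete, here is a minimal sketch of the kind
of check that ``scripts/validate_data.py`` needs to do.  This is an
illustration, not the required solution: the function name is invented
for this sketch, and it assumes that the filenames in ``hash_list.txt``
are relative to the ``data`` directory.  It uses only the standard
library ``os.path`` and ``hashlib`` modules::

    import os.path
    import hashlib

    def validate_data(data_directory):
        """ Raise ValueError if any file fails its hash check """
        hash_fname = os.path.join(data_directory, 'hash_list.txt')
        with open(hash_fname, 'rt') as fobj:
            lines = fobj.readlines()
        for line in lines:
            # Each line is "<expected_sha1> <filename>".
            expected_hash, filename = line.strip().split()
            full_path = os.path.join(data_directory, filename)
            # Read the file contents as bytes and hash them.
            with open(full_path, 'rb') as fobj:
                contents = fobj.read()
            actual_hash = hashlib.sha1(contents).hexdigest()
            if actual_hash != expected_hash:
                raise ValueError('Hash does not match for ' + filename)

    validate_data('data')

Notice that the sketch uses the skills in the list at the top of this
page: reading bytes from a file, hashing, ``strip`` and ``split``,
``raise``, and ``os.path``.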
.. _outlier-detection-project:

*************************
Outlier detection project
*************************

You have three weeks to complete this exercise.  Your goal is to:

#. Fill out the script and any needed library code so that running
   ``scripts/find_outliers.py data`` on your data returns a list of
   outlier volumes for each scan (where there is an outlier).  There is a
   deliberately simple-minded sketch of one possible metric at the end of
   this page;
#. Add a text file giving a brief summary for each outlier scan, saying
   why you think the detected scans should be rejected as outliers, and
   your educated guess as to the cause of the difference between these
   scans and the rest of the scans in the run;
#. Do this by collaborating in your teams, using git and github.

Grading will be on:

* the quality of your outlier detection, as assessed by the improvement
  in the statistical testing for the experimental model after removing
  the outliers;
* the generality of your outlier detection, as assessed by the same
  improvement in statistical testing, but on another similar dataset;
* the quality of your code;
* the quality and transparency of your process, as shown by your
  interactions on github;
* the quality of your arguments about the scans rejected as outliers.

Your outlier detection script should be *reproducible*.  That means that
we, your graders, should be able to clone your repository, and then
follow simple instructions in order to reproduce your run of
``scripts/find_outliers.py data``, and get the same answer.  To make
this possible, fill out the ``README.md`` text file in your repository
to describe a few simple steps that we can take to set up on our own
machines and run your code.  Have a look at the current ``README.md``
file for a skeleton.  We should be able to perform these same steps to
get the same output as you from the outlier detection script.
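Finally, here is the simple-minded sketch of the shape that
``scripts/find_outliers.py`` might take, promised above.  The metric (a
z-score on the mean signal of each volume) and the threshold of 2 are
arbitrary choices for illustration, not a recommendation, and the sketch
assumes you have the ``numpy`` and ``nibabel`` packages installed::

    import sys
    import os.path
    import glob

    import numpy as np
    import nibabel as nib

    def find_outliers(data_directory):
        for fname in sorted(glob.glob(os.path.join(data_directory, '*.nii'))):
            img = nib.load(fname)
            data = img.get_fdata()  # 4D array; last axis is volume (time).
            # Mean signal over all voxels, one value per volume.
            means = np.mean(data, axis=(0, 1, 2))
            # Flag volumes more than 2 standard deviations from the mean.
            z_scores = (means - np.mean(means)) / np.std(means)
            outliers = np.where(np.abs(z_scores) > 2)[0]
            print(os.path.basename(fname), list(outliers))

    if __name__ == '__main__':
        find_outliers(sys.argv[1])

Your job is to replace this metric and threshold with a detection method
that you can defend in your summary text file.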