STAT 39000: Project 3 — Fall 2021
Thank yourself later and document now
Motivation: Documentation is one of the most critical parts of a project. There are many tools specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.
Context: This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
Scope: Python, documentation
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/apple/health/watch_dump.xml
Questions
In this project, we are going to use the most popular Python documentation generation tool, Sphinx, to generate documentation for the module we created in project (2). If you chose to skip project (2), the module, in its entirety, will be posted by this upcoming Monday at the latest. You do not need that module to complete this project, and your module from project (2) does not need to be perfect.
Last project was more challenging than intended. This project will provide a bit of a reprieve, and should (hopefully) be fun to mess around with.
project_02_module.py
"""This module is for project 2 for STAT 39000.
**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case.
**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form.
The following are some common serialization formats:
- JSON
- Bincode
- MessagePack
- YAML
- TOML
- Pickle
- BSON
- CBOR
- Parquet
- XML
- Protobuf
**JSON:** One of the more widespread serialization formats, JSON has the advantages that it is human readable, and has an excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory.
**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form.
Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor.
"""
import lxml
import lxml.etree
import msgpack  # used by from_msgpack and to_msgpack below
from datetime import datetime, date


def my_function(a, b):
    """
    >>> my_function(2, 3)
    6
    >>> my_function('a', 3)
    'aaa'
    >>> my_function(1, 3)
    4
    """
    return a * b


def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list:
    """
    Given an `lxml.etree` object and a `datetime.date` object, return a list of records
    with the startDate equal to `for_date`.

    Args:
        tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object.
        for_date (datetime.date): The date that the startDate of each returned record must match.

    Raises:
        TypeError: If `tree` is not an `lxml.etree` object.
        TypeError: If `for_date` is not a `datetime.date` object.

    Returns:
        list: A list of records with the startDate equal to `for_date`.
    """
    if not isinstance(tree, lxml.etree._ElementTree):
        raise TypeError('tree must be an lxml.etree')

    if not isinstance(for_date, date):
        raise TypeError('for_date must be a datetime.date')

    results = []
    for record in tree.xpath('/HealthData/Record'):
        if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date():
            results.append(record)

    return results


def from_msgpack(file: str) -> lxml.etree._Element:
    """
    Given the absolute path of a msgpack file, return the deserialized `lxml.Element` object.

    Args:
        file (str): The absolute path of the msgpack file to deserialize.

    Raises:
        TypeError: If `file` is not a `str`.

    Returns:
        lxml.Element: The deserialized `lxml.Element` object.
    """
    if not isinstance(file, str):
        raise TypeError('file must be a str')

    with open(file, 'rb') as f:
        d = msgpack.load(f)

    # Rebuild a Record element, storing every value as a string attribute.
    e = lxml.etree.Element('Record')
    for key, value in d.items():
        e.attrib[key] = str(value)

    return e


def to_msgpack(element: lxml.etree._Element, file: str) -> None:
    """
    Given an `lxml.Element` object and a file path, serialize the `lxml.Element` object to
    a msgpack file at the given file path.

    Args:
        element (lxml.Element): The element to serialize.
        file (str): The absolute path of the msgpack file to save.

    Raises:
        TypeError: If `file` is not a `str`.
        TypeError: If `element` is not an `lxml.Element`.
        ValueError: If `element` is missing any of `type`, `sourceVersion`, `unit`, or `value`.

    Returns:
        None: None
    """
    if not isinstance(file, str):
        raise TypeError('file must be a str')

    if not isinstance(element, lxml.etree._Element):
        raise TypeError('element must be an lxml.Element')

    # Test if `type`, `sourceVersion`, `unit`, and `value` are present in the element.
    d = dict(element.attrib)
    if not d.get('type') or not d.get('sourceVersion') or not d.get('unit') or not d.get('value'):
        raise ValueError('element must have all of the following keys: type, sourceVersion, unit, and value')

    # Remove "other" keys from the dict
    keys_to_remove = []
    for key in d.keys():
        if key not in ['type', 'sourceVersion', 'unit', 'value']:
            keys_to_remove.append(key)

    for key in keys_to_remove:
        del d[key]

    with open(file, 'wb') as f:
        msgpack.dump(d, f)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
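To see the date-matching logic of get_records_for_date in isolation, the following is a minimal sketch that swaps lxml for the standard library's xml.etree.ElementTree and uses a few made-up records instead of the real watch_dump.xml. The sample values, the records_for_date name, and the explicit %H:%M:%S time format (in place of the locale-dependent %X) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Made-up records that only mimic the attribute names used by the module above.
SAMPLE_XML = """
<HealthData>
  <Record type="HeartRate" startDate="2019-01-01 08:15:00 -0500" value="61"/>
  <Record type="HeartRate" startDate="2019-01-02 09:30:00 -0500" value="72"/>
  <Record type="StepCount" startDate="2019-01-01 23:59:59 -0500" value="40"/>
</HealthData>
"""


def records_for_date(root, for_date):
    """Return the Record elements whose startDate falls on for_date."""
    matches = []
    for record in root.findall("Record"):
        # Parse the timestamp, then compare only its calendar date.
        start = datetime.strptime(record.get("startDate"), "%Y-%m-%d %H:%M:%S %z")
        if start.date() == for_date:
            matches.append(record)
    return matches


root = ET.fromstring(SAMPLE_XML)
matches = records_for_date(root, date(2019, 1, 1))
print([m.get("type") for m in matches])
```

Because the comparison uses the date in the record's own UTC offset, a record stamped 23:59:59 -0500 still counts as January 1.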
Question 1
Please use Firefox for this project. If you choose to use Chrome, the appearance of the documentation will be horrible. If you choose to use Chrome anyway, it is recommended that you temporarily change a setting in Chrome for this project: type chrome://flags where you would normally put the URL, then search for "samesite". For "SameSite by default cookies", change from "Default" to "Disabled", and restart the browser.
- Create a new folder in your $HOME directory called project3.
- Create a new Jupyter notebook in that folder called project3.ipynb, based on the normal project template. The majority of this notebook will just contain a single bash cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will end up being the PDF of the documentation’s HTML page.
- Copy and paste the code from project (2)'s firstname-lastname-project02.py module into the $HOME/project3 directory. You can rename this file to firstname_lastname_project03.py.
- In a bash cell in your Jupyter notebook, make sure you cd into the project3 folder, and run the following command:
  python -m sphinx.cmd.quickstart ./docs -q -p project3 -a "Kevin Amstutz" -v 1.0.0 --sep
  Please replace "Kevin Amstutz" with your own name.
What does each of these arguments do? Check out this page of the official documentation.
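As a quick reference, the same command can be written out as an annotated argument list. This is only a sketch for reading the flags; the "Your Name" author value is a placeholder, and the official documentation remains the authoritative description of each option.

```python
import shlex

# The sphinx-quickstart invocation from above, one argument per line.
quickstart_args = [
    "python", "-m", "sphinx.cmd.quickstart",
    "./docs",            # directory in which the documentation project is created
    "-q",                # quiet mode: skip the interactive prompts
    "-p", "project3",    # project name
    "-a", "Your Name",   # author name (placeholder, use your own)
    "-v", "1.0.0",       # project version
    "--sep",             # keep separate source and build directories
]

# Join the list back into a single shell-safe command string.
command = shlex.join(quickstart_args)
print(command)
```

The printed string can be pasted into a bash cell, or the list can be passed to subprocess.run directly.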
You should be left with a newly created docs folder within your project3 folder. Your structure should look something like the following.
project03 (1)
├── 39000_f2021_project03_solutions.ipynb (2)
├── docs (3)
│   ├── build (4)
│   ├── make.bat
│   ├── Makefile (5)
│   └── source (6)
│       ├── conf.py (7)
│       ├── index.rst (8)
│       ├── _static
│       └── _templates
└── kevin_amstutz_project03.py (9)

5 directories, 6 files
(1) Our module folder (named project03).
(2) Your project notebook (probably named something like firstname_lastname_project03.ipynb).
(3) Your documentation folder.
(4) Your empty build folder, where generated documentation will be stored.
(5) The Makefile used to run the commands that generate your documentation. Make the following changes:
(6) Your source folder. This folder contains all hand-typed documentation.
(7) Your conf.py file. This file contains the configuration for your documentation. Make the following changes:
(8) Your index.rst file. This file (and all files ending in .rst) is written in reStructuredText, a Markdown-like syntax.
(9) Your module. This is the module containing the code from the previous project, with nice, clean docstrings.
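The specific conf.py changes referred to in item 7 are not reproduced above. One adjustment that is commonly needed when the module to document lives outside docs/source is to add its folder to sys.path. The snippet below is offered only as an assumption about what such a change might look like, not as the handout's required edit. It relies on Sphinx executing conf.py with docs/source as the working directory, so the relative path "../.." resolves to the project3 folder.

```python
import os
import sys

# Assumption: conf.py lives in docs/source, so the module sits two levels up.
# Sphinx runs conf.py with its own directory as the current directory,
# which is why a relative path can be made absolute here.
module_dir = os.path.abspath(os.path.join("..", ".."))

# Put the folder first on the import path so the module can be imported.
if module_dir not in sys.path:
    sys.path.insert(0, module_dir)
```

Without a path entry like this, an extension that imports your module at build time would have no way to find it.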
Finally, with the modifications above made, run the following commands in a bash cell in your Jupyter notebook to generate your documentation.
cd $HOME/project3/docs
make html
Once complete, your module folder's structure should look something like the following.
project03
├── 39000_f2021_project03_solutions.ipynb
├── docs
│   ├── build
│   │   ├── doctrees
│   │   │   ├── environment.pickle
│   │   │   └── index.doctree
│   │   └── html
│   │       ├── genindex.html
│   │       ├── index.html
│   │       ├── objects.inv
│   │       ├── search.html
│   │       ├── searchindex.js
│   │       ├── _sources
│   │       │   └── index.rst.txt
│   │       └── _static
│   │           ├── alabaster.css
│   │           ├── basic.css
│   │           ├── custom.css
│   │           ├── doctools.js
│   │           ├── documentation_options.js
│   │           ├── file.png
│   │           ├── jquery-3.5.1.js
│   │           ├── jquery.js
│   │           ├── language_data.js
│   │           ├── minus.png
│   │           ├── plus.png
│   │           ├── pygments.css
│   │           ├── searchtools.js
│   │           ├── underscore-1.13.1.js
│   │           └── underscore.js
│   ├── make.bat
│   ├── Makefile
│   └── source
│       ├── conf.py
│       ├── index.rst
│       ├── _static
│       └── _templates
└── kevin_amstutz_project03.py

9 directories, 29 files
In the left-hand pane in the Jupyter Lab interface, navigate to $HOME/project3/docs/build/html/, right click on the index.html file, and choose Open in New Browser Tab. You should now be able to see your documentation in a new tab.
Make sure you are able to generate the documentation before you proceed; otherwise, you will not be able to continue to modify, regenerate, and view your documentation.
- Code used to solve this problem (in 2 Jupyter bash cells).
Question 2
One of the most important documents in any package or project is the README.md file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repository's contents. This file contains things like instructions on how to install the package, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like numpy, pytorch, or any other repository you’ve come across that you believe does a good job explaining the project.
In the docs/source folder, create a new file called README.rst. Choose 3-5 of the following "types" of reStructuredText from this webpage, and create a fake README. The content can be Lorem Ipsum type of content as long as it demonstrates 3-5 of the types of reStructuredText.
- Inline markup
- Lists and quote-like blocks
- Literal blocks
- Doctest blocks
- Tables
- Hyperlinks
- Sections
- Field lists
- Roles
- Images
- Footnotes
- Citations
- Etc.
Make sure to include at least 1 section. This counts as 1 of your 3-5.
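For a sense of what such a file could contain, here is a minimal sketch that writes a placeholder README.rst from Python. The project name, wording, and the write_readme helper are made up for illustration. The content touches on sections, inline markup, an enumerated list, a literal block, and a hyperlink.

```python
import textwrap
from pathlib import Path

# Placeholder content; every name and sentence here is invented.
README_RST = textwrap.dedent("""\
    My Fake Project
    ===============

    This project is *entirely* made up and only demonstrates ``reStructuredText``.

    Installation
    ------------

    #. Lorem ipsum dolor sit amet.
    #. Consectetur adipiscing elit.

    Usage
    -----

    A literal block follows the double colon::

        python my_fake_script.py --help

    See the `Sphinx documentation <https://www.sphinx-doc.org/>`_ for more.
    """)


def write_readme(folder):
    """Write the placeholder README.rst into folder and return its path."""
    path = Path(folder) / "README.rst"
    path.write_text(README_RST, encoding="utf-8")
    return path
```

Calling write_readme("docs/source") from the project3 folder would create the file. Note that each section underline is at least as long as its title, which reStructuredText requires.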
Once complete, add a reference to your README to the index.rst file. To add a reference to your README.rst file, open the index.rst file in an editor and add "README" as follows.
.. project3 documentation master file, created by
   sphinx-quickstart on Wed Sep 1 09:38:12 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to project3's documentation!
====================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   README

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Make sure "README" is aligned with ":caption:"; it should be 3 spaces from the left before the "R" in "README".
In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Pretty great!
- Code used to solve this problem.
- Screenshot or PDF labeled "question02_results".
Question 3
The pdoc package was specifically designed to generate documentation for Python modules using the docstrings in the module. As you may have noticed, this is not "native" to Sphinx.
Sphinx has extensions. One such extension is the autodoc extension. This extension provides the same sort of functionality that pdoc provides natively.
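Tools of this kind can work because docstrings stay attached to the objects in an imported module. The sketch below uses the standard library's inspect module to collect signatures and docstrings. It is a rough illustration of the idea, not a description of how autodoc or pdoc are implemented, and the greet and summarize_docstrings names are made up.

```python
import inspect


def greet(name: str) -> str:
    """Return a short greeting for ``name``."""
    return f"Hello, {name}!"


def summarize_docstrings(namespace):
    """Map each documented function's name to its signature and cleaned docstring."""
    summary = {}
    for name, obj in namespace.items():
        # Skip anything that is not a plain function or has no docstring.
        if inspect.isfunction(obj) and inspect.getdoc(obj):
            summary[name] = (str(inspect.signature(obj)), inspect.getdoc(obj))
    return summary


print(summarize_docstrings({"greet": greet}))
```

Passing vars(some_module) instead of a hand-built dict would summarize every documented function in that module.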
To use this extension, modify the conf.py file in the docs/source folder.
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc'
]
Next, update your index.rst file so autodoc knows which modules to extract data from.
.. project3 documentation master file, created by
   sphinx-quickstart on Wed Sep 1 09:38:12 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to project3's documentation!
====================================

.. automodule:: firstname_lastname_project03
   :members:

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   README

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Not too bad!
- Code used to solve this problem.
- Output from running the code.
Question 4
Okay, while the documentation looks pretty good, clearly, Sphinx does not recognize Google style docstrings. As you may have guessed, there is an extension for that.
Add the napoleon extension to your conf.py file.
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon'
]
In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Much better!
- Code used to solve this problem.
- Output from running the code.
Question 5
To make it explicitly clear what files to submit for this project:
At this stage, you should have a pretty nice set of documentation, with really nice in-code documentation in the form of docstrings. However, there is still another "thing" to add to your docstrings that can take them to the next level.
doctest is a standard library tool that allows you to include code, with expected output, inside your docstring. Not only can this be nice for the user to see, but both pdoc and Sphinx apply special formatting to such additions to a docstring.
Write a super simple function. It could be as simple as adding a couple of digits and returning a value. The following is an example. Come up with your own function with at least 1 passing test and 1 failing test (like the example).
def add(value1, value2):
    """Function to add two values.

    The first example below will pass (because 1+1 is 2), the second will fail (because 1+2 is not 5)

    >>> add(1, 1)
    2
    >>> add(1, 2)
    5
    """
    return value1 + value2
Where ">>>" represents the Python REPL and code demonstrating how you would use the function, and the line immediately following is the expected output.
Make sure your function actually does something so you can test to see if it is working as intended or not.
To use doctest, add the following to the bottom of your firstname_lastname_project03.py file.
if __name__ == '__main__':
    import doctest
    doctest.testmod()
Now, in a new bash cell in your notebook, run the following command.
python kevin_amstutz_project03.py -v
This will actually run your example code in the docstring and compare the output to the expected result! Very cool. We will learn more about this in the next couple of projects.
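The same check can also be driven from Python rather than the command line. The following is a minimal sketch using doctest's parser and runner classes to count how many examples in a single docstring fail. The count_doctest_results helper is a made-up name, and the add function repeats the example above with a shortened docstring.

```python
import doctest


def add(value1, value2):
    """Add two values.

    >>> add(1, 1)
    2
    >>> add(1, 2)
    5
    """
    return value1 + value2


def count_doctest_results(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Give the examples a namespace containing only the function under test.
    test = parser.get_doctest(func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are needed here.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


failed, attempted = count_doctest_results(add)
print(f"{failed} of {attempted} examples failed")
```

Here one of the two examples fails, matching the deliberately wrong expected output of 5.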
When including the |
Now, regenerate your documentation again and check it out. Notice how the lines in the docstring are neatly formatted? Pretty great.
Okay, last but not least, check out the themes here, choose one of the themes listed, regenerate your documentation, and save the webpage to a PDF for submission. Note that each theme may have slightly different requirements on how to "activate" it. For example, to use the "Readable" theme, you must add the following to your conf.py file.
import sphinx_readable_theme
html_theme = 'readable'
html_theme_path = [sphinx_readable_theme.get_html_theme_path()]
You can change a theme by changing the value of |
If a theme doesn’t work, just select a different theme.
Unlike
|
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.