STAT 39000: Project 2 — Fall 2021

The (art?) of a docstring

Motivation: Documentation is one of the most critical parts of a project. There are so many tools that are specifically designed to help document a project, and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx, or pdoc.

Context: This is the first project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives

Use Sphinx to document a set of Python code.
Use pdoc to document a set of Python code.
Write and use code that serializes and deserializes data.
Learn the pros and cons of various serialization formats.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/apple/health/watch_dump.xml

Questions

The topics of this semester are outlined in the 39000 home page. In addition to those topics, there will be a slight emphasis on topics related to working with APIs. Each project this semester will continue to be data-driven, and will be based on the provided dataset(s). The dataset listed for this project will be one that is revisited throughout the semester, as we will be slowly building out functions, modules, tests, documentation, etc, that will come together towards the end of the semester. Of course, all projects that expect any sort of previous work will provide you with previous work in case you chose to skip any given project.

In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data — a common component to many data science and computer science projects, and a key topics to understand when working with APIs.

For the sake of clarity, this project will have more deliverables than the "standard" .ipynb notebook, .py file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.

Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3.

Question 1

Let’s start by navigating to ondemand.brown.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the .ipynb file). Jupyter Lab is actually much more useful. You can open terminals on Brown (the cluster), as well as open a an editor for .R files, .py files, or any other text-based file.

Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2021-s2022" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called untitled.py. Rename this file to firstname-lastname-project02.py (where firstname and lastname are your first and last name, respectively).

Make sure you are in your $HOME directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file.

Read the "3.8.2 Modules" section of Google’s Python Style Guide. Each individual .py file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?

Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the serialized data looks like (if you tried to open it in a text editor, for example), or how it is read.

Save your module.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit

Code used to solve this problem.
Output from running the code.

Question 2

Now, in Jupyter Lab, open a new notebook using the "f2021-s2022" kernel (using the course notebook template).

You can have both the Python file and the notebook open in separate Jupyter Lab tabs for easier navigation.

Fill in a code cell for question 1 with a Python comment.

# See firstname-lastname-project02.py

For this question, read the pdoc section, and run a bash command to generate the documentation for your module that you created in the previous question, firstname-lastname-project02.py. So, everywhere in the example in the pdoc section where you see "mymodule.py" replace it with your module’s name — firstname-lastname-project02.py.

Use the -o flag to specify the output directory — I would suggest making it somewhere in your $HOME directory to avoid permissions issues.

Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called firstname-lastname-project02.html. To view this file in your browser, right click on the file, and select Open in New Browser Tab. A new browser tab should open with your freshly made documentation. Pretty cool!

Ignore the index.html file — we are looking for the firstname-lastname-project02.html file.

You may have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.

At this stage, you have the ability to create a PDF based on the generated webpage (but you do not yet need to do so). To do so, click on File Print… Destination Save to PDF. This may vary slightly from browser to browser, but it should be fairly straightforward.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit

Code used to solve this problem.
Output from running the code.

Question 3

When I refer to "watch data" I just mean the dataset for this project.

Write a function to called get_records_for_date that accepts an lxml etree (of our watch data, via etree.parse), and a datetime.date, and returns a list of Record Elements, for a given date. Raise a TypeError if the date is not a datetime.date, or if the etree is not an lxml.etree.

Use the Google Python Style Guide’s "Functions and Methods" section to write the docstring for this function. Be sure to include type annotations for the parameters and return value.

Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.

Use the -d flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?

The following code should help get you started.

import lxml.etree
from datetime import datetime, date

# read in the watch data
tree = etree.parse('/depot/datamine/data/apple/health/watch_dump.xml')

def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
    # docstring goes here

    # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not

    # test if `for_date` is a `datetime.date`, and raise TypeError if not

    # loop through the records in the watch data using the xpath expression `/HealthData/Record`
        # how to see a record, in case you want to
        print(lxml.etree.tostring(record))

        # test if the record's `startDate` is the same as `for_date`, and append to a list if it is

    # return the list of records

# how to test this function
chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)

The following is some code that will be helpful to test the types.

from datetime import datetime, date

isinstance(some_date_object, date) # test if some_date_object is a date
isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree

To loop through records, you can use the xpath method.

for record in tree.xpath('/HealthData/Record'):
    # do something with record

Add this function to your firstname-lastname-project02.py file, and if you want, regenerate your new documentation that includes your new function.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit

Code used to solve this problem.
Output from running the code.

Question 4

Great! Now, write a function called to_msgpack, that accepts an lxml Element, and an absolute path to the desired output file, checks to make sure it contains the following keys: type, sourceVersion, unit, and value, and encodes/serializes, then saves the result to the specified file.

The following code should help get you started.

import msgpack

def to_msgpack(element: lxml.etree._Element, file: str) -> None:
    # docstring goes here

    # test if `file` is a `str`, and raise TypeError if not

    # test if `element` is a `lxml.etree._Element`, and raise TypeError if not

    # convert `element.attrib` into a dict

    # test if the dict contains the keys `type`, `sourceVersion`, `unit`, and `value`, and raise ValueError if not

    # remove "other" non-type/sourceVersion/unit/value keys from the dict

    # use msgpack library to serialize the dict to a msgpack file

# how to use this function
chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)
to_msgpack(my_records[0], '$HOME/my_records.msgpack')

to_msgpack(my_records[0], '$HOME/my_records.msgpack') may not work, depending on how you set up your function, you may need to use an absolute path like to_msgpack(my_records[0], '/home/kamstut/my_records.msgpack').

Then, write a function called from_msgpack, that accepts an absolute path to a serialized file, and returns an lxml Element.

The following code should help get you started.

def from_msgpack(file: str) -> lxml.etree._Element:
    # docstring goes here

    # test if `file` is a `str`, and raise TypeError if not

    # deserialize the msgpack file into a dict

    # create new "Record" element
    e = etree.Element('Record')

    # loop through keys and values in the dict
    # and set the attributes of the new "Record" element
    # NOTE: This assumed the dict is called "d"
    for key, value in d.items():
        e.attrib[key] = str(value)

    # return the new "Record" element

# how to use this function
print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack')))

print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack'))) may not work, depending on how you set up your function, you may need to use an absolute path like print(lxml.etree.tostring(from_msgpack('/home/kamstut/my_records.msgpack'))).

Add these functions to your firstname-lastname-project02.py file, and regenerate your documentation. You should see some great looking documentation with your new functions.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit

Code used to solve this problem.
Output from running the code.

Question 5

This was hopefully a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.

Finally, investigate the official pdoc documentation, and make at least 2 changes/customizations to your module. Some examples are below — feel free to get creative and do something with pdoc outside of this list of options:

Modify the module so you do not need to pass the -d flag in order to let pdoc know that you are using Google-style docstrings.
Change the logo of the documentation to your own logo (or any logo you’d like).
Add some math formulas and change the output accordingly.
Edit and customize pdoc’s jinja2 template (or CSS).

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit

Code used to solve this problem.
Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.