STAT 39000: Project 2 — Fall 2021
The (art?) of a docstring
Motivation: Documentation is one of the most critical parts of a project. There are many tools specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.
Context: This is the first project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
Scope: Python, documentation
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/apple/health/watch_dump.xml
Questions
The topics of this semester are outlined in the 39000 home page. In addition to those topics, there will be a slight emphasis on topics related to working with APIs. Each project this semester will continue to be data-driven, and will be based on the provided dataset(s). The dataset listed for this project will be one that is revisited throughout the semester, as we will be slowly building out functions, modules, tests, documentation, etc., that will come together towards the end of the semester. Of course, all projects that expect any sort of previous work will provide you with that previous work in case you choose to skip any given project.
In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about serialization and deserialization of data, a common component of many data science and computer science projects and a key topic to understand when working with APIs.
For the sake of clarity, this project will have more deliverables than the "standard" .ipynb notebook, .py file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.
Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3.
Question 1
Let’s start by navigating to ondemand.brown.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the .ipynb file). Jupyter Lab is actually much more useful. You can open terminals on Brown (the cluster), as well as open an editor for .R files, .py files, or any other text-based file.
Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2021-s2022" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called untitled.py. Rename this file to firstname-lastname-project02.py (where firstname and lastname are your first and last name, respectively).
Make sure you are in your |
Read the "3.8.2 Modules" section of Google’s Python Style Guide. Each individual .py file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?
Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the serialized data looks like (if you tried to open it in a text editor, for example), or how it is read.
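As a rough illustration of the two terms, the sketch below places a module-level docstring at the top of a file and then round-trips a small, made-up record through two formats from the Python standard library (json and pickle). The record and its field values are hypothetical and the docstring text is only a placeholder, not a model answer.

```python
"""Round-trip a small record through two serialization formats.

A module-level docstring like this one sits at the very top of the file,
before any imports: a one-line summary, a blank line, then more detail.
"""

import json
import pickle

# A hypothetical record, loosely shaped like one entry of health data.
record = {"type": "HeartRate", "unit": "count/min", "value": 72}

# Serialization: in-memory object -> a form that can be stored or sent.
as_json = json.dumps(record)      # a str
as_pickle = pickle.dumps(record)  # a bytes object

# Deserialization: stored form -> an equivalent in-memory object.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)

print(type(as_json).__name__, type(as_pickle).__name__)
print(from_json == record, from_pickle == record)
```

Printing as_json and as_pickle side by side is a quick way to see how differently the two serialized forms look.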
Save your module.
Relevant topics: pdoc, Sphinx, Docstrings & Comments
- Code used to solve this problem.
- Output from running the code.
Question 2
Now, in Jupyter Lab, open a new notebook using the "f2021-s2022" kernel (using the course notebook template).
You can have both the Python file and the notebook open in separate Jupyter Lab tabs for easier navigation.
Fill in a code cell for question 1 with a Python comment.
# See firstname-lastname-project02.py
For this question, read the pdoc section, and run a bash command to generate the documentation for the module that you created in the previous question, firstname-lastname-project02.py. Everywhere in the pdoc section's example where you see "mymodule.py", replace it with your module’s name: firstname-lastname-project02.py.
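The exact command lives in the pdoc section referenced above. As a hedged sketch only, the helper below assembles a pdoc command line from Python. The function name build_pdoc_command and the output directory "docs" are hypothetical. The -o flag (output directory) and the -d flag (docstring format, used later in Question 3) are taken to behave as pdoc's command-line help describes.

```python
import shlex
from typing import List, Optional


def build_pdoc_command(module_path: str, output_dir: str,
                       docformat: Optional[str] = None) -> List[str]:
    """Assemble a pdoc command line as a list of arguments."""
    command = ["pdoc", "-o", output_dir]
    if docformat is not None:
        # Only add -d when a docstring style was requested.
        command += ["-d", docformat]
    command.append(module_path)
    return command


# "docs" is a hypothetical output directory name.
command = build_pdoc_command("firstname-lastname-project02.py", "docs")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run(command, check=True) from a notebook cell.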
Use the |
Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called firstname-lastname-project02.html. To view this file in your browser, right click on the file, and select Open in New Browser Tab. A new browser tab should open with your freshly made documentation. Pretty cool!
Ignore the |
You may have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.
At this stage, you have the ability to create a PDF based on the generated webpage (but you do not yet need to do so). To do so, click on . This may vary slightly from browser to browser, but it should be fairly straightforward.
Relevant topics: pdoc, Sphinx, Docstrings & Comments
- Code used to solve this problem.
- Output from running the code.
Question 3
When I refer to "watch data" I just mean the dataset for this project.
Write a function called get_records_for_date that accepts an lxml etree (of our watch data, via etree.parse) and a datetime.date, and returns a list of Record Elements for the given date. Raise a TypeError if the date is not a datetime.date, or if the etree is not an lxml.etree.
Use the Google Python Style Guide’s "Functions and Methods" section to write the docstring for this function. Be sure to include type annotations for the parameters and return value.
Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.
Use the -d flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?
The following code should help get you started.
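As one possible starting point (a sketch, not the course's starter code), the example below shows a Google-style docstring with type annotations on a function that loops over Record elements and keeps those from one date. To stay self-contained it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, a tiny made-up XML sample, and a hypothetical function name, records_on. The startDate format is an assumption about the export file, and type validation is left out here.

```python
import datetime
import xml.etree.ElementTree as ET
from typing import List

# Made-up sample, assumed to resemble the structure of the watch data.
SAMPLE = """<HealthData>
  <Record type="HeartRate" unit="count/min" value="72"
          startDate="2019-08-13 08:02:23 -0500"/>
  <Record type="HeartRate" unit="count/min" value="75"
          startDate="2019-08-14 09:15:00 -0500"/>
</HealthData>"""


def records_on(tree: ET.ElementTree, for_date: datetime.date) -> List[ET.Element]:
    """Collect the Record elements that start on a given date.

    Args:
        tree: Parsed XML document containing Record elements.
        for_date: Calendar date to match against each record's startDate.

    Returns:
        The matching Record elements, in document order.
    """
    matches = []
    for record in tree.iter("Record"):
        # Assumes startDate looks like "YYYY-MM-DD HH:MM:SS +ZZZZ".
        started = datetime.datetime.strptime(
            record.get("startDate"), "%Y-%m-%d %H:%M:%S %z"
        )
        if started.date() == for_date:
            matches.append(record)
    return matches


tree = ET.ElementTree(ET.fromstring(SAMPLE))
found = records_on(tree, datetime.date(2019, 8, 13))
print([record.get("value") for record in found])
```

lxml's etree mirrors much of this interface, so the same loop shape should carry over once the tree comes from lxml's etree.parse.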
The following is some code that will be helpful to test the types.
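A minimal sketch of such checks is shown below, again using the standard library's xml.etree.ElementTree as a stand-in so the example runs without lxml. The helper name check_types is hypothetical. With lxml, the object returned by etree.parse should be an lxml.etree._ElementTree, so that is the class to test against there.

```python
import datetime
import xml.etree.ElementTree as ET


def check_types(tree, for_date) -> None:
    """Raise TypeError unless the arguments have the expected types."""
    if not isinstance(tree, ET.ElementTree):
        raise TypeError(
            f"tree must be an ElementTree, got {type(tree).__name__}"
        )
    if not isinstance(for_date, datetime.date):
        raise TypeError(
            f"for_date must be a datetime.date, got {type(for_date).__name__}"
        )


tree = ET.ElementTree(ET.fromstring("<HealthData/>"))
check_types(tree, datetime.date(2019, 8, 13))  # passes silently

try:
    check_types(tree, "2019-08-13")  # a str, not a date
except TypeError as error:
    print(error)

# datetime.datetime subclasses datetime.date, so isinstance accepts it too.
print(isinstance(datetime.datetime(2019, 8, 13), datetime.date))
```

Whether a datetime.datetime should count as a valid date is a design choice. If it should not, compare type(for_date) against datetime.date directly instead of using isinstance.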
To loop through records, you can use the
|
Add this function to your firstname-lastname-project02.py file, and if you want, regenerate your new documentation that includes your new function.
Relevant topics: pdoc, Sphinx, Docstrings & Comments
- Code used to solve this problem.
- Output from running the code.
Question 4
Great! Now, write a function called to_msgpack that accepts an lxml Element and an absolute path to the desired output file, checks that the Element contains the keys type, sourceVersion, unit, and value, then encodes/serializes it and saves the result to the specified file.
The following code should help get you started.
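One way to approach the first half of this task is sketched below: check for the required keys, then flatten the element's attributes into a plain dict that a serializer can handle. This is an illustration rather than the course's starter code. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the helper name element_to_dict, the sample attribute values, and the choice of KeyError for a missing key are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict

REQUIRED_KEYS = ("type", "sourceVersion", "unit", "value")


def element_to_dict(element: ET.Element) -> Dict[str, str]:
    """Return the element's attributes as a plain dict, after checking keys."""
    missing = [key for key in REQUIRED_KEYS if key not in element.attrib]
    if missing:
        # KeyError is an assumed choice; the task does not name an exception.
        raise KeyError(f"element is missing required keys: {missing}")
    return dict(element.attrib)


# Made-up Record with hypothetical attribute values.
record = ET.fromstring(
    '<Record type="HeartRate" sourceVersion="5.1" unit="count/min" value="72"/>'
)
payload = element_to_dict(record)
print(payload)

# With the msgpack package installed, msgpack.packb(payload) would return
# bytes that can be written to a file opened in "wb" mode.
```

Converting to a plain dict first keeps the serialization step independent of the XML library in use.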
Then, write a function called from_msgpack that accepts an absolute path to a serialized file, and returns an lxml Element.
The following code should help get you started.
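For the reverse direction, a hedged sketch: once the file's bytes have been decoded back into a dict (for example with msgpack.unpackb), a new element can be built from that dict. The example again uses xml.etree.ElementTree as a stand-in for lxml. The helper name dict_to_element, the default tag "Record", and the sample values are assumptions, and the decoding step itself is represented by a literal dict.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def dict_to_element(attributes: Dict[str, str], tag: str = "Record") -> ET.Element:
    """Build a new element whose attributes come from a decoded dict."""
    return ET.Element(tag, attrib=dict(attributes))


# Stand-in for the dict that deserializing the saved file would return.
decoded = {
    "type": "HeartRate",
    "sourceVersion": "5.1",
    "unit": "count/min",
    "value": "72",
}
element = dict_to_element(decoded)
print(ET.tostring(element, encoding="unicode"))
```

Note that only the attributes survive this round trip. Anything else the original element carried (child elements, text) would need to be serialized explicitly.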
Add these functions to your firstname-lastname-project02.py file, and regenerate your documentation. You should see some great looking documentation with your new functions.
Relevant topics: pdoc, Sphinx, Docstrings & Comments
- Code used to solve this problem.
- Output from running the code.
Question 5
This was hopefully a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.
Finally, investigate the official pdoc documentation, and make at least 2 changes/customizations to your module. Some examples are below — feel free to get creative and do something with pdoc outside of this list of options:
- Modify the module so you do not need to pass the -d flag in order to let pdoc know that you are using Google-style docstrings.
- Change the logo of the documentation to your own logo (or any logo you’d like).
- Add some math formulas and change the output accordingly.
- Edit and customize pdoc’s jinja2 template (or CSS).
Relevant topics: pdoc, Sphinx, Docstrings & Comments
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.