When I work with unfamiliar YAML files specifying deployment manifests, product metadata, serialized records, and so on, I want to quickly get a sense of a few things:
- What is the set of keys in this data structure?
- If the structure (nested keys) of the document changed over time, what is a quick summary of the changes?
structure_digest
Given the following long YAML file, I don’t really want to read through all of it to learn what keys and paths are available in it:
{
"receipt": "Oz-Ware Purchase Invoice",
"date": "2012-08-06",
"customer": {
"given": "Dorothy",
"family": "Gale"
},
"items": [
{
"part_no": "A4786",
"descrip": "Water Bucket (Filled)",
"price": 1.47,
"quantity": 4
},
{
"part_no": "E1628",
"descrip": "High Heeled \"Ruby\" Slippers",
"size": 8,
"price": 100.27
},
…(many many more items )
]
}
Let's remove the values and focus on structure, summarizing all array entries as one:
>structure_digest order1.yml order2.yml ...
.customer.family
.customer.given
.date
.items[].descrip
.items[].part_no
.items[].price
.items[].quantity
.items[].size
.receipt
This summary shows the basic structure of the file and, in particular, removes the noise of many items having very similar content and keys.
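To make the idea concrete, here is a minimal Python sketch of how such a digest could be computed. It is not the tool's actual implementation: the function name `digest_paths` is hypothetical, the sample is a shortened version of the order above, and it reads JSON with the standard library (YAML input would need a separate parser).

```python
import json

def digest_paths(node, prefix=""):
    """Return the set of key paths in a parsed document, ignoring values."""
    if isinstance(node, dict) and node:
        paths = set()
        for key, value in node.items():
            paths |= digest_paths(value, f"{prefix}.{key}")
        return paths
    if isinstance(node, list):
        # Every array entry maps to the same "[]" segment, so similar
        # items collapse into one set of paths.
        paths = {f"{prefix}[]"} if not node else set()
        for item in node:
            paths |= digest_paths(item, f"{prefix}[]")
        return paths
    # Scalar (or empty mapping): keep the path, drop the value.
    return {prefix}

# Shortened, illustrative version of the order document.
order = json.loads("""
{
  "receipt": "Oz-Ware Purchase Invoice",
  "customer": {"given": "Dorothy", "family": "Gale"},
  "items": [
    {"part_no": "A4786", "price": 1.47, "quantity": 4},
    {"part_no": "E1628", "size": 8, "price": 100.27}
  ]
}
""")

for path in sorted(digest_paths(order)):
    print(path)
```

Because the result is a set, a key that appears in only some array entries (such as `size` or `quantity`) still shows up exactly once.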
Usage
Usage: structure_digest [options] File1[, File2, ...]
-t, --tree replace repeated suffixes with indents
Use cases: Web APIs
Discogs.com provides a rich API of music records. Fetching a page of Pink Floyd’s releases returns a hefty 15K of minified JSON:
>curl -s http://api.discogs.com/artists/45467/releases > pink-floyd.json
>wc -c pink-floyd.json
15186 pink-floyd.json
>head pink-floyd.json
{"pagination": {"per_page": 50, "items": 1330, "page": 1, "urls": {"last":
"http://api.discogs.com/artists/45467/releases?per_page=50&page=27", "next":
"http://api.discogs.com/artists/45467/releases?per_page=50&page=2"}, "pages":
27}, "releases": [{"thumb":
"http://api.discogs.com/image/R-150-1090924-1191680758.jpeg", "artist": "Pink
Floyd, The*", "main_release": 1090924, "title": "Apples And Oran…
That is hundreds of lines in my terminal. But we can quickly understand this document now:
>structure_digest --tree pink-floyd.json
.pagination
  .items
  .page
  .pages
  .per_page
  .urls
    .last
    .next
.releases[]
  .artist
  .format
  .id
  .label
  .main_release
  .resource_url
  .role
  .status
  .thumb
  .title
  .type
  .year
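The `--tree` option replaces the leading path segments that consecutive entries share with indentation. A rough Python sketch of that rendering follows. The name `render_tree`, the two-space indent, and the assumption that keys never contain a literal `.` are mine, not taken from the tool.

```python
def render_tree(paths, indent="  "):
    """Render flat paths as an indented tree, one segment per line."""
    lines = []
    previous = []
    for path in sorted(paths):
        # ".releases[].artist" -> [".releases[]", ".artist"]
        segments = ["." + part for part in path.split(".")[1:]]
        # Count the leading segments shared with the previous path.
        shared = 0
        while (shared < len(segments) and shared < len(previous)
               and segments[shared] == previous[shared]):
            shared += 1
        # Emit only the segments that are new, indented by their depth.
        for depth in range(shared, len(segments)):
            lines.append(indent * depth + segments[depth])
        previous = segments
    return lines

# Illustrative subset of the paths from the Discogs response.
sample = [
    ".pagination.items",
    ".pagination.page",
    ".pagination.urls.last",
    ".pagination.urls.next",
    ".releases[].artist",
    ".releases[].title",
]
print("\n".join(render_tree(sample)))
```

Sorting first keeps all paths with a common prefix next to each other, so each shared segment is printed once.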
Use cases: Configuration files
A BOSH manifest specifies a cloud deployment. It’s used by Cloud Foundry and its configuration is rich. Let's abstract its example manifest and find the fields that configure a BOSH “job”:
>structure_digest bosh_example.yml | grep -E "^\.jobs"
.jobs[].instances
.jobs[].name
.jobs[].networks[].name
.jobs[].networks[].static_ips[]
.jobs[].persistent_disk
.jobs[].resource_pool
.jobs[].template
Pretty neat.
Finding structure changes with diff
If you have two versions of a data format and an example of each, here’s a quick way to see what changed:
>diff <(structure_digest old.json) <(structure_digest new.json)
2,4c2,3
< .pagination.page
< .pagination.pages
< .pagination.per_page
---
> .pagination.limit
> .pagination.offset
This is great: we can tell that the API switched from page-based pagination to offsets and limits.
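The same comparison can be done on two sets of paths instead of through `diff`. The sketch below is only an illustration: `structure_changes` is a hypothetical helper, and the two path lists are reconstructed from the diff output above, with the unchanged `.pagination.items` entry being an assumption.

```python
def structure_changes(old_paths, new_paths):
    """Return (removed, added) paths between two structure digests."""
    old_set, new_set = set(old_paths), set(new_paths)
    removed = sorted(old_set - new_set)  # only in the old version
    added = sorted(new_set - old_set)    # only in the new version
    return removed, added

# Assumed digests for the old and new versions of the response.
old = [
    ".pagination.items",
    ".pagination.page",
    ".pagination.pages",
    ".pagination.per_page",
]
new = [
    ".pagination.items",
    ".pagination.limit",
    ".pagination.offset",
]

removed, added = structure_changes(old, new)
for path in removed:
    print("<", path)
for path in added:
    print(">", path)
```

Unlike `diff`, set difference ignores ordering, so it reports each changed path once no matter where it appears in the digest.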
Learn more & respond
The project is on GitHub. Please follow it there for new features and changes.
What do you think of this tool? Do you love it? Do you hate it? Let me know in the comments.