HDX disaggregation wizard
LiQuer is a small server-side framework that can be quite helpful when building data-oriented web applications.
One such application is the HDX disaggregation wizard. It is a tool solving a simple task: splitting (disaggregating) a single data sheet (csv or xlsx) into multiple sheets. The sheet is split by the values in a specified column (or multiple columns).
Since this is a fairly generic task (related to group by), the functionality is built into liquer.ext.lq_pandas.
The core of this feature is the eq command, which filters a dataframe by specific values in a column: e.g. eq-a-123 keeps only the rows where column a is 123. The eq (equal) command accepts multiple column-value pairs, e.g. eq-a-123-b-234.
In HDX the convention is to use the first row for tags. To support this convention, the teq command (tag equal) always keeps the first row of the dataframe. The disaggregation service supports both tagged and untagged data, using either eq or teq for filtering, depending on the user input.
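To illustrate how such a filter works, here is a minimal sketch of an eq-like command (a hypothetical re-implementation, not the actual lq_pandas code; the import location of the command decorator is assumed):

from liquer import command

@command
def eq_sketch(df, *column_values):
    """Keep only the rows where each given column equals the given value.
    Arguments come in pairs: column1, value1, column2, value2, ...
    """
    for column, value in zip(column_values[::2], column_values[1::2]):
        # query arguments arrive as strings, so compare string representations
        df = df[df[column].astype(str) == str(value)]
    return df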
The complete flow is simple:
* fetch the data (command df_from),
* find the unique values in a column (or multiple columns) and use them to create a list (table) of queries (command split_df),
* the queries use eq (or teq) to filter the dataframe by value(s).
So, use a query like df_from-URL/split_df-COLUMN and you will get a table with queries like df_from-URL/eq-COLUMN-VALUE1, df_from-URL/eq-COLUMN-VALUE2.
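The same flow can be driven from Python (a sketch; evaluate and encode_token are assumed to be importable as shown, and importing liquer.ext.lq_pandas registers the pandas commands):

from liquer import evaluate
from liquer.parser import encode_token
import liquer.ext.lq_pandas  # registers df_from, eq, split_df, ...

url = "https://example.com/data.csv"  # hypothetical data source
# encode_token escapes characters like ':' and '/' so the URL can be embedded in a query
queries = evaluate(f"df_from-{encode_token(url)}/split_df-a").get()  # one query per unique value of column a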
A little detail regarding the split function: there are actually four versions of it, depending on whether it is used on a tagged or an untagged document and whether it performs a quick (query-only) split or a full split. The quick versions only provide the raw LiQuer queries (not complete URLs). The full splits (split_df for untagged and tsplit_df for tagged data) execute all the split queries, which might be slow. As a side effect, the results are cached (depending on the configuration; the example uses FileCache('cache')).
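The cache is configured once at startup, e.g. like this (a sketch using the FileCache mentioned above):

from liquer.cache import FileCache, set_cache

# store query results as files in the "cache" directory,
# so repeated split queries are served without recomputation
set_cache(FileCache("cache"))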
The complete user interface is in a single html file hdx_wizard.html, served by the Flask server.
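A minimal server setup could look roughly like this (a sketch assuming liquer.blueprint exposes the LiQuer service as a Flask blueprint named app and that hdx_wizard.html is in the Flask static folder):

from flask import Flask
import liquer.blueprint as bp
import liquer.ext.lq_pandas  # registers the pandas commands

app = Flask(__name__)
# mount the LiQuer web service (queries, /api/build, ...) under /liquer
app.register_blueprint(bp.app, url_prefix="/liquer")

@app.route("/")
def index():
    return app.send_static_file("hdx_wizard.html")

if __name__ == "__main__":
    app.run()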
Inside the user interface, the LiQuer service is called multiple times, e.g. to get previews or metadata:
* Data previews use the ability of LiQuer lq_pandas to convert dataframes to json, which can easily be read into javascript in the browser. The first preview uses the head_df command to display only a restricted part of the dataframe (the head).
* The columns_info command is used to get the list of columns and eventual tags.
* The /api/build service is used to build valid queries from javascript lists. This could be implemented directly in javascript; the build service is a way to remotely call liquer.parser.encode (illustrated below).
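On the Python side, encode turns a list of commands (each command being a list of string tokens) into a query string; a small illustration, with the exact call signature assumed:

from liquer.parser import encode

# each inner list is one command followed by its arguments
query = encode([["df_from", "URL"], ["eq", "a", "123"]])
# query should now be the string "df_from-URL/eq-a-123"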
Integration of libhxl (example of a custom state type)
Pandas is great, but there are other good libraries too, e.g. tabulate. If you want to use another data type (tabular or other), it will typically require (besides some useful commands) defining how that data can be serialized. This is done by implementing a state type. A state type takes care of several things associated with state handling, but its most important role is serialization and deserialization.
One excellent library for working with humanitarian data is libhxl. Libhxl plays a somewhat similar role to pandas: it reads, writes and manipulates tabular data - but it also understands HXL, which pandas doesn't - hence the liquer.ext.lq_hxl module.
In order to allow libhxl objects to be used in LiQuer, we need to define a state type: HxlStateType.
import hxl
from io import BytesIO  # needed by from_bytes below
from liquer.state_types import StateType, register_state_type, mimetype_from_extension

class HxlStateType(StateType):
    def identifier(self):
        "Define a unique string identifier for the state type"
        return "hxl_dataset"
The identifier is important e.g. for caching, where it is stored as a part of the metadata and tells which StateType should be used for deserialization.
    def default_extension(self):
        "Default file extension for the state type"
        return "csv"

    def is_type_of(self, data):
        "Check if data is of this state type"
        return isinstance(data, hxl.model.Dataset)
The default extension is used when the extension is not specified otherwise - for example if the query does not end with a filename.
The as_bytes and from_bytes methods are the two most important ones: they take care of serialization and deserialization. State data can be serialized into multiple formats (e.g. csv, html, json...), therefore as_bytes optionally accepts a file extension and returns, besides the bytes, the mimetype as well. The mimetype (when queried through the LiQuer server) becomes a part of the web service response. Note that serialization and deserialization do not necessarily need to support the same formats: e.g. html is quite nice to support in serialization, but it is too unspecific for deserialization.
    def as_bytes(self, data, extension=None):
        """Serialize data as bytes
        File extension may be provided and influence the serialization format.
        """
        if extension is None:
            extension = self.default_extension()
        assert self.is_type_of(data)
        mimetype = mimetype_from_extension(extension)
        if extension == "csv":
            output = "".join(data.gen_csv(show_headers=True, show_tags=True))
            return output.encode("utf-8"), mimetype
        elif extension == "json":
            output = "".join(data.gen_json(show_headers=True, show_tags=True))
            return output.encode("utf-8"), mimetype
        else:
            raise Exception(
                f"Serialization: file extension {extension} is not supported by HXL dataset type.")
    def from_bytes(self, b: bytes, extension=None):
        """De-serialize data from bytes
        File extension may be provided and influence the serialization format.
        """
        if extension is None:
            extension = self.default_extension()
        f = BytesIO()
        f.write(b)
        f.seek(0)
        if extension == "csv":
            return hxl.data(f)
        raise Exception(
            f"Deserialization: file extension {extension} is not supported by HXL dataset type.")
Sometimes a deep copy of the state data is needed - e.g. to assure that the data in the cache will not become unintentionally modified. That's why the state type should define a copy method. Since a libhxl dataset is immutable (?), it is OK to return just the data without making a copy.
    def copy(self, data):
        """Make a deep copy of the data"""
        return data
Once the state type class is defined, a state type instance is created and registered:
HXL_DATASET_STATE_TYPE = HxlStateType()
register_state_type(hxl.Dataset, HXL_DATASET_STATE_TYPE)
register_state_type(hxl.io.HXLReader, HXL_DATASET_STATE_TYPE)
This is (currently) done for all relevant types. State types are registered in a global StateTypesRegistry object, which is responsible for registering and finding a state type instance for any state data.
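A quick round-trip check of the state type defined above might look like this (the data URL is hypothetical):

dataset = hxl.data("https://example.com/data-with-hxl-tags.csv")  # hypothetical HXL-tagged csv
b, mimetype = HXL_DATASET_STATE_TYPE.as_bytes(dataset, extension="csv")
restored = HXL_DATASET_STATE_TYPE.from_bytes(b, extension="csv")
assert HXL_DATASET_STATE_TYPE.is_type_of(restored)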
For more details see the liquer.ext.lq_hxl module.
Actually, a state type does not have to define serialization and/or deserialization. There are objects that either can't be reliably serialized (e.g. a matplotlib figure, as of the time of writing) or whose serialization is otherwise undesirable. Such state types are perfectly legal - they just can be neither cached nor served by the LiQuer web server. However, they can appear inside the query: e.g. if a matplotlib figure is followed by an image-creation command, the image can be both served and cached.
Reports and visualizations
With the help of LiQuer, it is very easy to create reusable visualizations with multiple views as well as documents viewable offline or suitable for printing. There are multiple markups suitable for creating reports and visualizations, but probably the easiest and most flexible are HTML documents. In LiQuer, html can easily be created by returning html text from a command.
The creation of the text is simplified by the evaluate_template function, which processes a string (a template) containing LiQuer queries and replaces those queries with their results.
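In a template, queries are wrapped between $ characters (as in the report command below); a tiny sketch, assuming evaluate_template is exported from the top-level liquer package:

from liquer import evaluate_template

# each $...$ section is evaluated as a LiQuer query and replaced by its result;
# URL stands for a properly encoded data location
html = evaluate_template("<p>Preview: $df_from-URL/head_df$</p>")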
The report example processes data from the Global Food Prices Database (WFP), which contains monthly prices for various commodities. To adapt the data to our needs, we need a couple of extra commands:
Month and year are in two separate columns, mp_year and mp_month. For charts we need dates in the YYYY-MM-DD format, which we achieve with the following command:
@command
def datemy(df, y="mp_year", m="mp_month", target="date"):
    # combine the year and month columns into an ISO date string, e.g. 2017-05-01
    df.loc[:, target] = ["%04d-%02d-01" % (int(year), int(month)) for year, month in zip(df[y], df[m])]
    return df
To compute statistics, it's handy to use pandas groupby. As an example we show the count of groups, which is used in the report to show the number of observed prices in various markets:
@command
def count(df, *groupby_columns):
    # add a helper column, then count the rows in each group
    df.loc[:, "count"] = 1
    return df.groupby(list(groupby_columns)).count().reset_index().loc[:, list(groupby_columns) + ["count"]]
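For instance, a chain like df_from-URL/count-adm1_name yields one row per adm1_name value together with its count; this is exactly how count is used in the report below.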
An example of a custom filter is the greater-or-equal command geq, used in the report to cut away the years before a start year:
@command
def geq(df, column, value: float):
    # keep only the rows where the column value is greater than or equal to the threshold
    index = df.loc[:, column] >= value
    return df.loc[index, :]
This is somewhat similar to the eq command from the pandas support module liquer.ext.lq_pandas, but it only supports numerical values, while the eq command is somewhat more general.
Pandas dataframes support the quite flexible to_html method for converting dataframes to the html format. The report uses the popular css framework bootstrap for styling, so to display the tables nicely we just need to add some bootstrap css classes. The command also prepends a link to the dataframe itself, created by the link command. This tends to be very useful in practice, allowing to conveniently import the underlying raw data into a spreadsheet.
@command
def table(state):
    df = state.get()
    # prepend a csv download link pointing to the dataframe behind this table
    html = evaluate_template(f"""<a href="${state.query}/link-url-csv$">(data)</a> """)
    return html + df.to_html(index=False, classes="table table-striped")
The core of the report is the report command. It can be applied to any dataframe containing suitable fields. This allows a large degree of flexibility - arbitrary filters can be inserted into the command chain before the report. For example, the current report can be restricted to specific markets, time periods or commodities without any additional code, just by modifying the URL.
The report embeds a possibility to remove data before a from_year. This could in principle be done by inserting a geq command before the report (which would work fine). Passing from_year as an argument has the advantage that the start year can become a part of the report (e.g. it can be used in the title).
The main part of the report is a single template, evaluated with evaluate_template. Note that the LiQuer template is combined with string interpolation by python f-strings (PEP 498), which is a very powerful combination.
from liquer import command, evaluate, evaluate_template  # assumed top-level exports
from liquer.parser import encode_token

@command
def report(state, from_year=2017, linktype=None):
    state = state.with_caching(False)

    def makelink(url):
        "Create a link of the requested type: plain url, LiQuer proxy or dataurl"
        if linktype is None:
            return url
        extension = url.split(".")[-1]
        return evaluate(f"fetch-{encode_token(url)}/link-{linktype}-{extension}").get()

    try:
        source = state.sources[0]
    except Exception:
        source = "???"
    LiQuer = '<a href="https://github.com/orest-d/liquer"> LiQuer </a>'
    df = state.get()
    try:
        title = ",".join(sorted(df.adm0_name.unique())) + f" since {from_year}"
    except Exception:
        title = "report"
    return state.with_filename("report.html").with_data(evaluate_template(f"""
<html>
<head>
<title>{title}</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="{makelink('https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css')}" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
</head>
<body>
<div class="p-3 mb-2 bg-success text-white fixed-top shadow-sm">
<a class="nav-link active" href="https://data.humdata.org"><img src="{makelink('https://centre.humdata.org/wp-content/uploads/hdx_new_logo_accronym2.png')}" style="height:30px;" alt="HDX"></a>
</div>
<div class="bg-light fixed-bottom border-top">
Generated with {LiQuer} <span class="float-right">© 2019 Orest Dubay</span>
</div>
<br/>
<br/>
<br/>
<br/>
<h1>{title}</h1>
<div class="container-fluid">
<div class="row">
Data originate from <a href="{source}"> {source} </a> were processed via a {LiQuer} service.
Only data after {from_year} are shown (<a href="${state.query}/datemy/geq-mp_year-{from_year}/link-url-csv$">data</a>),
complete data are <a href="${state.query}/datemy/link-url$"> here</a>.
</div>
<div class="row">
<div class="col-md-6" style="height:50%;">${state.query}/datemy/geq-mp_year-{from_year}/groupby_mean-mp_price-date-cm_name/plotly_chart-xys-date-mp_price-cm_name$</div>
<div class="col-md-6" style="height:50%;">${state.query}/datemy/geq-mp_year-{from_year}/count-adm1_name/plotly_chart-piexs-count-adm1_name$</div>
</div>
<div class="row">
<div class="col-md-6" style="height:50%;">
<h2>Average prices</h2>
${state.query}/datemy/geq-mp_year-{from_year}/groupby_mean-mp_price-cm_name/table$</div>
<div class="col-md-6" style="height:50%;">
<h2>Observations</h2>
${state.query}/datemy/geq-mp_year-{from_year}/count-adm1_name/table$</div>
</div>
</body>
</html>
"""))
Inside the report command, some more magic is used to handle links and external resources. Links are created by the nested function makelink, whose main purpose is to support three different regimes of working with links:
* links to the original sources (default),
* serving (proxying) resources through the LiQuer service, and
* dataurls.
Links to the original sources are useful if the report is used from a web service: the report size is then relatively small and thus the loading time is faster than with dataurls.
Proxying resources through the LiQuer service allows LiQuer to cache the resources. This may be useful on slower internet connections, when running the service without internet access, or behind a firewall.
The dataurl link type allows saving the report as a single html file. Such a report can be used e.g. for offline browsing, archiving or sending by e-mail. All the assets are embedded inside the html file, so the report will work even when the LiQuer service is not available. Note: the embedded LiQuer queries will of course not work offline, but if necessary, the data coming out of LiQuer can be turned into a dataurl with the link command; the type of the link can be controlled by the linktype state variable. Assuming linktype is not hardcoded (as it is in the table command), all query links in the report can be turned into dataurls like this:
filter-params/let-linktype-dataurl/report
This could of course lead to extremely large report files, so it should be used carefully.