
Commit 99aac68

Adding readme information and marking text as copyrighted
1 parent 33ecff3 commit 99aac68

24 files changed

+174
-0
lines changed

examples/README.md

Lines changed: 7 additions & 0 deletions
@@ -52,3 +52,10 @@ Here is a direct link to the file used in the examples:
 - [Executing SQL on Polars](./sql-on-polars.py)
 - [Executing SQL on Pandas](./sql-on-pandas.py)
 - [Executing SQL on cuDF](./sql-on-cudf.py)
+
+## TPC-H Examples
+
+Within the subdirectory `tpch` there are 22 examples that reproduce queries in
+the TPC-H specification. These include realistic data that can be generated at
+arbitrary scale and allow the user to see use cases for a variety of data frame
+operations.

examples/tpch/README.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataFusion Python Examples for TPC-H
+
+These examples reproduce the problems listed in the Transaction Processing
+Performance Council (TPC) TPC-H benchmark. Their purpose is to demonstrate how
+to use different aspects of DataFusion; they are not necessarily geared towards
+creating the most performant queries possible. Each example includes a
+description of the problem. Users who are familiar with SQL-style commands can
+compare the approaches in these examples with those listed in the specification.
+
+- https://www.tpc.org/tpch/
+
+The examples provided are based on version 2.18.0 of the TPC-H specification.
+
+## Data Setup
+
+To run these examples, you must first generate a dataset. The `dbgen` tool
+provided by TPC can create datasets of arbitrary scale. For testing it is
+typically sufficient to create a 1 gigabyte dataset. For convenience, this
+repository has a script which uses Docker to create this dataset. From the
+`benchmarks/tpch` directory execute the following script.
+
+```bash
+./tpch-gen.sh 1
+```
+
+The examples provided use Parquet files for the tables generated by `dbgen`.
+A Python script is provided to convert the text files from `dbgen` into the
+Parquet files expected by the examples. From the `examples/tpch` directory you
+can execute the following command to create the necessary Parquet files.
+
+```bash
+python convert_data_to_parquet.py
+```
+
+## Description of Examples
+
+For easier access, a description of the techniques demonstrated in each file
+is in the README.md file in the `examples` directory.
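
Note on the Data Setup step: the README describes converting the text files from `dbgen` into Parquet, but the diff does not show `convert_data_to_parquet.py` itself. The sketch below covers only the first half of such a conversion, reading `dbgen`-style text into rows, and uses the standard library only. It assumes the usual `dbgen` layout of `|`-delimited fields with a trailing `|` on each row; the function name `parse_tbl` and the sample values are hypothetical and are not taken from the repository.

```python
import csv
import io


def parse_tbl(text, column_names):
    """Parse '|'-delimited text (assumed dbgen layout) into a list of dicts."""
    rows = []
    # QUOTE_NONE keeps quote characters inside comment fields untouched.
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # A trailing '|' produces one extra empty field at the end of the row.
        if len(fields) == len(column_names) + 1 and fields[-1] == "":
            fields = fields[:-1]
        rows.append(dict(zip(column_names, fields)))
    return rows


# Made-up sample in the shape of the TPC-H `region` table.
SAMPLE = "0|AFRICA|sample comment|\n1|AMERICA|another comment|\n"
REGIONS = parse_tbl(SAMPLE, ["r_regionkey", "r_name", "r_comment"])
print(REGIONS[0]["r_name"])
```

All values come back as strings here; a real conversion would also need to cast each column to its schema type before writing Parquet.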

examples/tpch/q01_pricing_summary_report.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 1:
+
 The Pricing Summary Report Query provides a summary pricing report for all lineitems shipped as of
 a given date. The date is within 60 - 120 days of the greatest ship date contained in the database.
 The query lists totals for extended price, discounted extended price, discounted extended price
 plus tax, average quantity, average extended price, and average discount. These aggregates are
 grouped by RETURNFLAG and LINESTATUS, and listed in ascending order of RETURNFLAG and LINESTATUS.
 A count of the number of lineitems in each group is included.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 import pyarrow as pa
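
Note on Query 1: the aggregates named in the problem statement can be illustrated in plain Python. This is only a sketch of the arithmetic, not the DataFusion DataFrame code in the example file, which this diff does not show. The rows are made up, the ship date filter is left out, and the names `LINEITEMS` and `pricing_summary` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up rows: (returnflag, linestatus, quantity, extendedprice, discount, tax)
LINEITEMS = [
    ("A", "F", Decimal("10"), Decimal("100.00"), Decimal("0.10"), Decimal("0.05")),
    ("A", "F", Decimal("20"), Decimal("200.00"), Decimal("0.00"), Decimal("0.10")),
    ("N", "O", Decimal("5"), Decimal("50.00"), Decimal("0.20"), Decimal("0.00")),
]


def pricing_summary(rows):
    """Aggregate per (returnflag, linestatus), listed in ascending key order."""
    groups = defaultdict(list)
    for flag, status, qty, price, discount, tax in rows:
        groups[(flag, status)].append((qty, price, discount, tax))

    report = []
    for key in sorted(groups):
        items = groups[key]
        count = len(items)
        report.append({
            "returnflag": key[0],
            "linestatus": key[1],
            # Total extended price.
            "sum_base_price": sum(p for _, p, _, _ in items),
            # Total discounted extended price.
            "sum_disc_price": sum(p * (1 - d) for _, p, d, _ in items),
            # Total discounted extended price plus tax.
            "sum_charge": sum(p * (1 - d) * (1 + t) for _, p, d, t in items),
            "avg_qty": sum(q for q, _, _, _ in items) / count,
            "avg_price": sum(p for _, p, _, _ in items) / count,
            "avg_disc": sum(d for _, _, d, _ in items) / count,
            "count_order": count,
        })
    return report


REPORT = pricing_summary(LINEITEMS)
for line in REPORT:
    print(line["returnflag"], line["linestatus"], line["sum_charge"], line["count_order"])
```

`Decimal` is used so that the monetary sums stay exact in this small example.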

examples/tpch/q02_minimum_cost_supplier.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 2:
+
 The Minimum Cost Supplier Query finds, in a given region, for each part of a certain type and size,
 the supplier who can supply it at minimum cost. If several suppliers in that region offer the
 desired part type and size at the same (minimum) cost, the query lists the parts from suppliers with
 the 100 highest account balances. For each supplier, the query lists the supplier's account balance,
 name and nation; the part's number and manufacturer; the supplier's address, phone number and
 comment information.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 import datafusion

examples/tpch/q03_shipping_priority.py

Lines changed: 5 additions & 0 deletions
@@ -16,10 +16,15 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 3:
+
 The Shipping Priority Query retrieves the shipping priority and potential revenue, defined as the
 sum of l_extendedprice * (1-l_discount), of the orders having the largest revenue among those that
 had not been shipped as of a given date. Orders are listed in decreasing order of revenue. If more
 than 10 unshipped orders exist, only the 10 orders with the largest revenue are listed.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datafusion import SessionContext, col, lit, functions as F
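
Note on Query 3: the revenue definition and the "largest 10" cut-off from the problem statement can be sketched in plain Python as follows. The rows are made up, the "not shipped as of a given date" filter is omitted, and `top_orders_by_revenue` is a hypothetical helper rather than code from the example file.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up rows: (orderkey, extendedprice, discount)
ROWS = [
    (1, Decimal("100.00"), Decimal("0.10")),
    (1, Decimal("50.00"), Decimal("0.00")),
    (2, Decimal("300.00"), Decimal("0.50")),
    (3, Decimal("10.00"), Decimal("0.00")),
]


def top_orders_by_revenue(rows, limit=10):
    """Sum l_extendedprice * (1 - l_discount) per order, largest revenue first."""
    revenue = defaultdict(Decimal)  # Decimal() is zero
    for orderkey, price, discount in rows:
        revenue[orderkey] += price * (1 - discount)
    ranked = sorted(revenue.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


TOP_TWO = top_orders_by_revenue(ROWS, limit=2)
print(TOP_TWO)
```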

examples/tpch/q04_order_priority_checking.py

Lines changed: 5 additions & 0 deletions
@@ -16,9 +16,14 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 4:
+
 The Order Priority Checking Query counts the number of orders ordered in a given quarter of a given
 year in which at least one lineitem was received by the customer later than its committed date. The
 query lists the count of such orders for each order priority sorted in ascending priority order.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q05_local_supplier_volume.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 5:
+
 The Local Supplier Volume Query lists for each nation in a region the revenue volume that resulted
 from lineitem transactions in which the customer ordering parts and the supplier filling them were
 both within that nation. The query is run in order to determine whether to institute local
 distribution centers in a given region. The query considers only parts ordered in a given year. The
 query displays the nations and revenue volume in descending order by revenue. Revenue volume for all
 qualifying lineitems in a particular nation is defined as sum(l_extendedprice * (1 - l_discount)).
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q06_forecasting_revenue_change.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 6:
+
 The Forecasting Revenue Change Query considers all the lineitems shipped in a given year with
 discounts between DISCOUNT-0.01 and DISCOUNT+0.01. The query lists the amount by which the total
 revenue would have increased if these discounts had been eliminated for lineitems with l_quantity
 less than quantity. Note that the potential revenue increase is equal to the sum of
 [l_extendedprice * l_discount] for all lineitems with discounts and quantities in the qualifying
 range.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime
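
Note on Query 6: the three filters and the sum described in the problem statement can be written out in plain Python. This is a sketch of the arithmetic only. The rows and the parameter values passed below (year 1994, discount 0.06, quantity 24) are chosen for illustration, and `forecast_revenue_change` is a hypothetical name, not a function from the example file.

```python
from datetime import date
from decimal import Decimal

# Made-up rows: (shipdate, quantity, extendedprice, discount)
ROWS = [
    (date(1994, 3, 1), Decimal("10"), Decimal("1000.00"), Decimal("0.06")),  # qualifies
    (date(1994, 5, 1), Decimal("30"), Decimal("1000.00"), Decimal("0.06")),  # quantity too large
    (date(1995, 1, 1), Decimal("10"), Decimal("1000.00"), Decimal("0.06")),  # wrong year
    (date(1994, 7, 1), Decimal("5"), Decimal("200.00"), Decimal("0.07")),    # qualifies
    (date(1994, 7, 1), Decimal("5"), Decimal("200.00"), Decimal("0.08")),    # discount out of range
]


def forecast_revenue_change(rows, year, discount, quantity):
    """Sum l_extendedprice * l_discount over the qualifying lineitems."""
    low = discount - Decimal("0.01")
    high = discount + Decimal("0.01")
    return sum(
        (
            price * disc
            for shipdate, qty, price, disc in rows
            if shipdate.year == year and low <= disc <= high and qty < quantity
        ),
        Decimal("0"),
    )


INCREASE = forecast_revenue_change(ROWS, 1994, Decimal("0.06"), Decimal("24"))
print(INCREASE)
```

Using `Decimal` avoids the boundary surprises that binary floats can cause when comparing against `DISCOUNT-0.01` and `DISCOUNT+0.01`.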

examples/tpch/q07_volume_shipping.py

Lines changed: 5 additions & 0 deletions
@@ -16,11 +16,16 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 7:
+
 The Volume Shipping Query finds, for two given nations, the gross discounted revenues derived from
 lineitems in which parts were shipped from a supplier in either nation to a customer in the other
 nation during 1995 and 1996. The query lists the supplier nation, the customer nation, the year,
 and the revenue from shipments that took place in that year. The query orders the answer by
 Supplier nation, Customer nation, and year (all ascending).
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q08_market_share.py

Lines changed: 5 additions & 0 deletions
@@ -16,10 +16,15 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 8:
+
 The market share for a given nation within a given region is defined as the fraction of the
 revenue, the sum of [l_extendedprice * (1-l_discount)], from the products of a specified type in
 that region that was supplied by suppliers from the given nation. The query determines this for the
 years 1995 and 1996 presented in this order.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q09_product_type_profit_measure.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 9:
+
 The Product Type Profit Measure Query finds, for each nation and each year, the profit for all parts
 ordered in that year that contain a specified substring in their names and that were filled by a
 supplier in that nation. The profit is defined as the sum of
 [(l_extendedprice*(1-l_discount)) - (ps_supplycost * l_quantity)] for all lineitems describing
 parts in the specified line. The query lists the nations in ascending alphabetical order and, for
 each nation, the year and profit in descending order by year (most recent first).
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 import pyarrow as pa

examples/tpch/q10_returned_item_reporting.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 10:
+
 The Returned Item Reporting Query finds the top 20 customers, in terms of their effect on lost
 revenue for a given quarter, who have returned parts. The query considers only parts that were
 ordered in the specified quarter. The query lists the customer's name, address, nation, phone
 number, account balance, comment information and revenue lost. The customers are listed in
 descending order of lost revenue. Revenue lost is defined as
 sum(l_extendedprice*(1-l_discount)) for all qualifying lineitems.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q11_important_stock_identification.py

Lines changed: 5 additions & 0 deletions
@@ -16,10 +16,15 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 11:
+
 The Important Stock Identification Query finds, from scanning the available stock of suppliers
 in a given nation, all the parts that represent a significant percentage of the total value of
 all available parts. The query displays the part number and the value of those parts in
 descending order of value.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datafusion import SessionContext, WindowFrame, col, lit, functions as F

examples/tpch/q12_ship_mode_order_priority.py

Lines changed: 5 additions & 0 deletions
@@ -16,12 +16,17 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 12:
+
 The Shipping Modes and Order Priority Query counts, by ship mode, for lineitems actually received
 by customers in a given year, the number of lineitems belonging to orders for which the
 l_receiptdate exceeds the l_commitdate for two different specified ship modes. Only lineitems that
 were actually shipped before the l_commitdate are considered. The late lineitems are partitioned
 into two groups, those with priority URGENT or HIGH, and those with a priority other than URGENT or
 HIGH.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime

examples/tpch/q13_customer_distribution.py

Lines changed: 5 additions & 0 deletions
@@ -16,11 +16,16 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 13:
+
 This query determines the distribution of customers by the number of orders they have made,
 including customers who have no record of orders, past or present. It counts and reports how many
 customers have no orders, how many have 1, 2, 3, etc. A check is made to ensure that the orders
 counted do not fall into one of several special categories of orders. Special categories are
 identified in the order comment column by looking for a particular pattern.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datafusion import SessionContext, col, lit, functions as F
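
Note on Query 13: the two-level count in the problem statement (orders per customer, then customers per order count) can be sketched in plain Python. The problem statement only says "a particular pattern", so the regular expression below is a hypothetical stand-in, and the rows and the name `customer_distribution` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern standing in for the "special category" check.
SPECIAL = re.compile(r"special.*requests")

CUSTOMER_KEYS = [1, 2, 3, 4]

# Made-up rows: (orderkey, custkey, comment)
ORDERS = [
    (10, 1, "fast delivery"),
    (11, 1, "special handling requests"),       # excluded by the pattern
    (12, 2, "ok"),
    (13, 2, "fine"),
    (14, 3, "special packaging requests noted"),  # excluded by the pattern
]


def customer_distribution(customer_keys, orders):
    """Map each order count to the number of customers that have that count."""
    # Start every customer at zero so customers without orders are included.
    orders_per_customer = {key: 0 for key in customer_keys}
    for _orderkey, custkey, comment in orders:
        if SPECIAL.search(comment):
            continue  # skip orders in a special category
        if custkey in orders_per_customer:
            orders_per_customer[custkey] += 1
    return dict(Counter(orders_per_customer.values()))


DISTRIBUTION = customer_distribution(CUSTOMER_KEYS, ORDERS)
print(DISTRIBUTION)
```

Seeding the per-customer dictionary with zeros plays the role that an outer join would play in a relational formulation: customers with no qualifying orders still appear in the result.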

examples/tpch/q14_promotion_effect.py

Lines changed: 5 additions & 0 deletions
@@ -16,9 +16,14 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 14:
+
 The Promotion Effect Query determines what percentage of the revenue in a given year and month was
 derived from promotional parts. The query considers only parts actually shipped in that month and
 gives the percentage. Revenue is defined as (l_extendedprice * (1-l_discount)).
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime
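
Note on Query 14: the percentage in the problem statement is promotional revenue divided by total revenue, times 100. The plain Python sketch below shows that ratio on made-up rows. How a part is recognized as promotional is not stated in the problem text above, so the sketch assumes a precomputed boolean flag per row; the ship month filter is omitted and `promotion_effect` is a hypothetical name.

```python
from decimal import Decimal

# Made-up rows: (extendedprice, discount, is_promotional)
ROWS = [
    (Decimal("100.00"), Decimal("0.10"), True),
    (Decimal("300.00"), Decimal("0.00"), False),
    (Decimal("10.00"), Decimal("0.00"), True),
]


def promotion_effect(rows):
    """Return promotional revenue as a percentage of total revenue."""
    total = Decimal("0")
    promo = Decimal("0")
    for price, discount, is_promotional in rows:
        revenue = price * (1 - discount)  # l_extendedprice * (1 - l_discount)
        total += revenue
        if is_promotional:
            promo += revenue
    if total == 0:
        return None  # no revenue, so the percentage is undefined
    return 100 * promo / total


PERCENTAGE = promotion_effect(ROWS)
print(PERCENTAGE)
```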

examples/tpch/q15_top_supplier.py

Lines changed: 5 additions & 0 deletions
@@ -16,9 +16,14 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 15:
+
 The Top Supplier Query finds the supplier who contributed the most to the overall revenue for parts
 shipped during a given quarter of a given year. In case of a tie, the query lists all suppliers
 whose contribution was equal to the maximum, presented in supplier number order.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datetime import datetime
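
Note on Query 15: the tie rule in the problem statement (list every supplier whose contribution equals the maximum, in supplier number order) can be sketched in plain Python. The problem text above does not define revenue for this query, so the sketch assumes the same l_extendedprice * (1 - l_discount) definition used by the other queries. The rows are made up, the quarter filter is omitted, and `top_suppliers` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up rows: (suppkey, extendedprice, discount)
ROWS = [
    (7, Decimal("100.00"), Decimal("0.50")),
    (3, Decimal("20.00"), Decimal("0.00")),
    (3, Decimal("30.00"), Decimal("0.00")),
    (5, Decimal("40.00"), Decimal("0.00")),
]


def top_suppliers(rows):
    """Return (suppkey, revenue) for every supplier tied at the maximum revenue."""
    revenue = defaultdict(Decimal)  # Decimal() is zero
    for suppkey, price, discount in rows:
        revenue[suppkey] += price * (1 - discount)  # assumed revenue definition
    if not revenue:
        return []
    best = max(revenue.values())
    # Sorting the tuples orders the tied suppliers by supplier number.
    return sorted((key, value) for key, value in revenue.items() if value == best)


WINNERS = top_suppliers(ROWS)
print(WINNERS)
```

Exact ties are only meaningful with exact arithmetic, which is why the sketch uses `Decimal` rather than floats.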

examples/tpch/q16_part_supplier_relationship.py

Lines changed: 5 additions & 0 deletions
@@ -16,11 +16,16 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 16:
+
 The Parts/Supplier Relationship Query counts the number of suppliers who can supply parts that
 satisfy a particular customer's requirements. The customer is interested in parts of eight
 different sizes as long as they are not of a given type, not of a given brand, and not from a
 supplier who has had complaints registered at the Better Business Bureau. Results must be presented
 in descending count and ascending brand, type, and size.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 import pyarrow as pa

examples/tpch/q17_small_quantity_order.py

Lines changed: 5 additions & 0 deletions
@@ -16,11 +16,16 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 17:
+
 The Small-Quantity-Order Revenue Query considers parts of a given brand and with a given container
 type and determines the average lineitem quantity of such parts ordered for all orders (past and
 pending) in the 7-year database. What would be the average yearly gross (undiscounted) loss in
 revenue if orders for these parts with a quantity of less than 20% of this average were no longer
 taken?
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datafusion import SessionContext, WindowFrame, col, lit, functions as F

examples/tpch/q18_large_volume_customer.py

Lines changed: 5 additions & 0 deletions
@@ -16,9 +16,14 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 18:
+
 The Large Volume Customer Query finds a list of the top 100 customers who have ever placed large
 quantity orders. The query lists the customer name, customer key, the order key, date and total
 price and the quantity for the order.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 from datafusion import SessionContext, col, lit, functions as F

examples/tpch/q19_discounted_revenue.py

Lines changed: 5 additions & 0 deletions
@@ -16,9 +16,14 @@
 # under the License.
 
 """
+TPC-H Problem Statement Query 19:
+
 The Discounted Revenue query finds the gross discounted revenue for all orders for three different
 types of parts that were shipped by air and delivered in person. Parts are selected based on the
 combination of specific brands, a list of containers, and a range of sizes.
+
+The above problem statement text is copyrighted by the Transaction Processing Performance Council
+as part of their TPC Benchmark H Specification revision 2.18.0.
 """
 
 import pyarrow as pa
