Skip to content

Commit 1a98ee8

Browse files
authored
Merge pull request #53 from civitaspo/develop
v0.2.0
2 parents ba6df44 + f0f8b43 commit 1a98ee8

35 files changed

+2285
-1006
lines changed

.circleci/config.yml

Lines changed: 0 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,11 +11,6 @@ references:
1111
TERM: dumb
1212

1313
jobs:
14-
spotless:
15-
<<: *environment
16-
steps:
17-
- checkout
18-
- run: ./gradlew spotlessCheck
1914

2015
build:
2116
<<: *environment
@@ -36,6 +31,4 @@ workflows:
3631
merge-before:
3732
jobs:
3833
- build
39-
- spotless
40-
4134

.scalafmt.conf

Lines changed: 0 additions & 9 deletions
This file was deleted.

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,25 @@
1+
0.2.0 (2019-07-16)
2+
==================
3+
4+
* [New Feature] Add `athena.apas>` operator.
5+
* [Enhancement] Use scala 2.13.0
6+
* [New Feature] `athena.add_partition>` operator
7+
* [New Feature] `athena.drop_partition>` operator
8+
* [New Feature] `athena.drop_table>` operator
9+
* [Enhancement] Suppress aws-java-sdk log
10+
* [Note] Use aws-java-sdk-glue for catalog only operations.
11+
* [Breaking Change - `athena.query>`] `preview` option is `false` by default.
12+
* [Enhancement - `athena.ctas>`] Remove `;` from the query.
13+
* [Enhancement] Create wrappers for aws-java-sdk for the readability and the separation of responsibilities.
14+
* The change of STS has a possibility to break the backward compatibility of `assume role` behavior.
15+
* [Enhancement] Introduce region variable for `Aws` to resolve region according to `auth_method` option.
16+
* [New Feature] Add `workgroup` option.
17+
* [Breaking Change - `athena.query`] Remove the `output` option as the deprecation is notified from before.
18+
* [Deprecated - `athena.ctas>`] Make `select_query` deprecated.
19+
* [Note] Introduce `pro.civitaspo.digdag.plugin.athena.aws` package to divide dependencies about aws.
20+
* [Note] Use the Intellij formatter instead of spotless, so remove spotless from CI.
21+
22+
123
0.1.5 (2018-12-11)
224
==================
325

README.md

Lines changed: 79 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ _export:
1515
repositories:
1616
- https://jitpack.io
1717
dependencies:
18-
- pro.civitaspo:digdag-operator-athena:0.1.5
18+
- pro.civitaspo:digdag-operator-athena:0.2.0
1919
athena:
2020
auth_method: profile
2121

@@ -80,21 +80,83 @@ Define the below options on properties (which is indicated by `-c`, `--config`).
8080
- **region**: The AWS region to use for Athena service. (string, optional)
8181
- **endpoint**: The Amazon Athena endpoint address to use. (string, optional)
8282

83+
## Configuration for `athena.add_partition>` operator
84+
85+
### Options
86+
87+
- **database**: The name of the database. (string, required)
88+
- **table**: The name of the partitioned table. (string, required)
89+
- **location**: The location of the partition. If not specified, this operator generates like hive automatically. (string, default: auto generated like the below)
90+
- `${table location}/${partition key1}=${partition value1}/${partition key2}=${partition value2}/...`
91+
- **partition_kv**: key-value pairs for partitioning (string to string map, required)
92+
- **save_mode**: The mode to save the partition. (string, default = `"overwrite"`, available values are `"skip_if_exists"`, `"error_if_exists"`, `"overwrite"`)
93+
- **follow_location**: Skip to add a partition and drop the partition if the location does not exist. (boolean, default: `true`)
94+
- **catalog_id**: glue data catalog id if you use a catalog different from account/region default catalog. (string, optional)
95+
96+
### Output Parameters
97+
98+
Nothing
99+
100+
## Configuration for `athena.drop_partition>` operator
101+
102+
### Options
103+
104+
- **database**: The name of the database. (string, required)
105+
- **table**: The name of the partitioned table. (string, required)
106+
- **partition_kv**: key-value pairs for partitioning (string to string map, required)
107+
- **with_location**: Drop the partition with removing objects on S3 (boolean, default: `false`)
108+
- **ignore_if_not_exist**: Ignore if the partition does not exist. (boolean, default: `true`)
109+
- **catalog_id**: glue data catalog id if you use a catalog different from account/region default catalog. (string, optional)
110+
111+
### Output Parameters
112+
113+
Nothing
114+
115+
## Configuration for `athena.apas>` operator
116+
117+
`apas` means *Add a partition as select* that creates a partition the query result is stored.
118+
119+
### Options
120+
121+
- **athena.apas>**: The select SQL statements or file location (in local or Amazon S3) to be executed for a new table by [`Create Table As Select`]((https://aws.amazon.com/jp/about-aws/whats-new/2018/10/athena_ctas_support/)). You can use digdag's template engine like `${...}` in the SQL query. (string, required)
122+
- **database**: The name of the database that has the partitioned table. (string, required)
123+
- **table**: The name of the partitioned table. (string, required)
124+
- **workgroup**: The name of the workgroup in which the query is being started. (string, optional)
125+
- **partition_kv**: key-value pairs for partitioning (string to string map, required)
126+
- **location**: The location of the partition. If not specified, this operator generates like hive automatically. (string, default: auto generated like the below)
127+
- `${table location}/${partition key1}=${partition value1}/${partition key2}=${partition value2}/...`
128+
- **save_mode**: Specify the expected behavior. Available values are `"skip_if_exists"`, `"error_if_exists"`, `"ignore"`, `"overwrite"`. See the below explanation of the behaviour. (string, default: `"overwrite"`)
129+
- `"skip_if_exists"`: Skip processing if the partition or the location exists.
130+
- `"error_if_exists"`: Raise error if the partition or the location exists.
131+
- `"overwrite"`: Always recreate the partition and the location if exists. This operation is not atomic.
132+
- **bucketed_by**: An array list of buckets to bucket data. If omitted, Athena does not bucket your data in this query. (array of string, optional)
133+
- **bucket_count**: The number of buckets for bucketing your data. If omitted, Athena does not bucket your data. (integer, optional)
134+
- **additional_properties**: Additional properties for CTAS that is used `athena.apas>` internally. These are used for CTAS WITH clause without escaping. (string to string map, optional)
135+
- **ignore_schema_diff**: Ignore if the schema of the query result is different from tha table. (boolean, default: `false`)
136+
- **token_prefix**: Prefix for `ClientRequestToken` that a unique case-sensitive string used to ensure the request to create the query is idempotent (executes only once). On this plugin, the token is composed like `${token_prefix}-${session_uuid}-${hash value of query}-${radom string}`. (string, default: `"digdag-athena-apas"`)
137+
- **timeout**: Specify timeout period. (`DurationParam`, default: `"10m"`)
138+
- **catalog_id**: glue data catalog id if you use a catalog different from account/region default catalog. (string, optional)
139+
140+
### Output Parameters
141+
142+
Nothing
143+
83144
## Configuration for `athena.query>` operator
84145

85146
### Options
86147

87148
- **athena.query>**: The SQL query statements or file location (in local or Amazon S3) to be executed. You can use digdag's template engine like `${...}` in the SQL query. (string, required)
88149
- **token_prefix**: Prefix for `ClientRequestToken` that a unique case-sensitive string used to ensure the request to create the query is idempotent (executes only once). On this plugin, the token is composed like `${token_prefix}-${session_uuid}-${hash value of query}-${random string}`. (string, default: `"digdag-athena"`)
89150
- **database**: The name of the database. (string, optional)
90-
- **output**: The location in Amazon S3 where your query results are stored, such as `"s3://path/to/query/"`. For more information, see [Queries and Query Result Files](https://docs.aws.amazon.com/athena/latest/ug/querying.html). (string, default: `"s3://aws-athena-query-results-${AWS_ACCOUNT_ID}-<AWS_REGION>"`)
151+
- **workgroup**: The name of the workgroup in which the query is being started. (string, optional)
91152
- **timeout**: Specify timeout period. (`DurationParam`, default: `"10m"`)
92153
- **preview**: Call `athena.preview>` operator after run `athena.query>`. (boolean, default: `true`)
93154

94155
### Output Parameters
95156

96157
- **athena.last_query.id**: The unique identifier for each query execution. (string)
97158
- **athena.last_query.database**: The name of the database. (string)
159+
- **athena.last_query.workgroup**: The name of the workgroup in which the query is being started. (string)
98160
- **athena.last_query.query**: The SQL query statements which the query execution ran. (string)
99161
- **athena.last_query.output**: The location in Amazon S3 where your query results are stored. (string)
100162
- **athena.last_query.scan_bytes**: The number of bytes in the data that was queried. (long)
@@ -131,9 +193,10 @@ Define the below options on properties (which is indicated by `-c`, `--config`).
131193

132194
### Options
133195

134-
- **select_query**: The select SQL statements or file location (in local or Amazon S3) to be executed for a new table by [`Create Table As Select`]((https://aws.amazon.com/jp/about-aws/whats-new/2018/10/athena_ctas_support/)). You can use digdag's template engine like `${...}` in the SQL query. (string, required)
196+
- **athena.ctas>**: The select SQL statements or file location (in local or Amazon S3) to be executed for a new table by [`Create Table As Select`]((https://aws.amazon.com/jp/about-aws/whats-new/2018/10/athena_ctas_support/)). You can use digdag's template engine like `${...}` in the SQL query. (string, required)
135197
- **database**: The database name for query execution context. (string, optional)
136198
- **table**: The table name for the new table (string, default: `digdag_athena_ctas_${session_uuid.replaceAll("-", "")}_${random}`)
199+
- **workgroup**: The name of the workgroup in which the query is being started. (string, optional)
137200
- **output**: Output location for data created by CTAS (string, default: `"s3://aws-athena-query-results-${AWS_ACCOUNT_ID}-<AWS_REGION>/Unsaved/${YEAR}/${MONTH}/${DAY}/${athena_query_id}/"`)
138201
- **format**: The data format for the CTAS query results, such as `"orc"`, `"parquet"`, `"avro"`, `"json"`, or `"textfile"`. (string, default: `"parquet"`)
139202
- **compression**: The compression type to use for `"orc"` or `"parquet"`. (string, default: `"snappy"`)
@@ -144,7 +207,7 @@ Define the below options on properties (which is indicated by `-c`, `--config`).
144207
- **additional_properties**: Additional properties for CTAS. These are used for CTAS WITH clause without escaping. (string to string map, optional)
145208
- **table_mode**: Specify the expected behavior of CTAS results. Available values are `"default"`, `"empty"`, `"data_only"`. See the below explanation of the behaviour. (string, default: `"default"`)
146209
- `"default"`: Do not do any care. This option require the least IAM privileges for digdag, but the behaviour depends on Athena.
147-
- `"empty_table"`: Create a new empty table with the same schema as the select query results.
210+
- `"empty"`: Create a new empty table with the same schema as the select query results.
148211
- `"data_only"`: Create a new table with data by CTAS, but drop this after CTAS execution. The table created by CTAS is an external table, so the data is left even if the table is dropped.
149212
- **save_mode**: Specify the expected behavior of CTAS. Available values are `"none"`, `"error_if_exists"`, `"ignore"`, `"overwrite"`. See the below explanation of the behaviour. (string, default: `"overwrite"`)
150213
- `"none"`: Do not do any care. This option require the least IAM privileges for digdag, but the behaviour depends on Athena.
@@ -158,6 +221,18 @@ Define the below options on properties (which is indicated by `-c`, `--config`).
158221

159222
Nothing
160223

224+
## Configuration for `athena.drop_table>` operator
225+
226+
- **database**: The name of the database. (string, required)
227+
- **table**: The name of the partitioned table. (string, required)
228+
- **with_location**: Drop the partition with removing objects on S3 (boolean, default: `false`)
229+
- **ignore_if_not_exist**: Ignore if the partition does not exist. (boolean, default: `true`)
230+
- **catalog_id**: glue data catalog id if you use a catalog different from account/region default catalog. (string, optional)
231+
232+
### Output Parameters
233+
234+
Nothing
235+
161236
# Development
162237

163238
## Run an Example

build.gradle

Lines changed: 8 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,16 @@
11
plugins {
22
id 'scala'
33
id 'maven-publish'
4-
id 'com.github.johnrengelman.shadow' version '2.0.2'
5-
id "com.diffplug.gradle.spotless" version "3.13.0"
4+
id 'com.github.johnrengelman.shadow' version '5.1.0'
65
}
76

87
group = 'pro.civitaspo'
9-
version = '0.1.5'
8+
version = '0.2.0'
109

11-
def digdagVersion = '0.9.27'
12-
def awsSdkVersion = "1.11.372"
13-
def scalaSemanticVersion = "2.12.6"
14-
def depScalaVersion = "2.12"
10+
def digdagVersion = '0.9.37'
11+
def awsSdkVersion = "1.11.587"
12+
def scalaSemanticVersion = "2.13.0"
13+
def depScalaVersion = "2.13"
1514

1615
repositories {
1716
mavenCentral()
@@ -33,6 +32,8 @@ dependencies {
3332
compile group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: awsSdkVersion
3433
// https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sts
3534
compile group: 'com.amazonaws', name: 'aws-java-sdk-sts', version: awsSdkVersion
35+
// https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-glue
36+
compile group: 'com.amazonaws', name: 'aws-java-sdk-glue', version: awsSdkVersion
3637
}
3738

3839
shadowJar {
@@ -55,12 +56,6 @@ publishing {
5556
}
5657
}
5758

58-
spotless {
59-
scala {
60-
scalafmt('1.5.1').configFile('.scalafmt.conf')
61-
}
62-
}
63-
6459
sourceCompatibility = 1.8
6560
targetCompatibility = 1.8
6661

example/example.dig

Lines changed: 72 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ _export:
44
- file://${repos}
55
# - https://jitpack.io
66
dependencies:
7-
- pro.civitaspo:digdag-operator-athena:0.1.5
7+
- pro.civitaspo:digdag-operator-athena:0.2.0
88
athena:
99
auth_method: profile
1010
value: 5
@@ -15,10 +15,78 @@ _export:
1515
+step2:
1616
echo>: ${athena}
1717

18-
+stap3:
19-
athena.ctas>:
20-
select_query: template.sql
18+
+step3:
19+
athena.preview>: ${athena.last_query.id}
20+
21+
+stap4:
22+
athena.ctas>: template.sql
23+
database: ${database}
24+
table: hoge
25+
output: ${output}
26+
27+
+step5:
28+
echo>: ${athena}
29+
30+
+step6:
31+
athena.drop_table>:
32+
database: ${database}
33+
table: hoge
34+
with_location: true
35+
36+
+step7:
37+
athena.ctas>: select 1 as a, 2 as b, 3 as c union all select 4 as a, 5 as b, 6 as c
2138
database: ${database}
2239
table: hoge
2340
output: ${output}
41+
partitioned_by: [b, c]
2442

43+
+step8:
44+
athena.drop_partition>:
45+
database: ${database}
46+
table: hoge
47+
partition_kv:
48+
b: "5"
49+
c: "6"
50+
51+
+step9:
52+
athena.add_partition>:
53+
database: ${database}
54+
table: hoge
55+
partition_kv:
56+
b: "5"
57+
c: "6"
58+
59+
+step10:
60+
athena.add_partition>:
61+
database: ${database}
62+
table: hoge
63+
partition_kv:
64+
b: "5"
65+
c: "6"
66+
67+
+step11:
68+
athena.add_partition>:
69+
database: ${database}
70+
table: hoge
71+
partition_kv:
72+
b: "5"
73+
c: "6"
74+
location: ${output}/hoge/fuga/hogo/
75+
76+
+step12:
77+
athena.drop_partition>:
78+
database: ${database}
79+
table: hoge
80+
with_location: true
81+
partition_kv:
82+
b: "2"
83+
c: "3"
84+
85+
+step13:
86+
athena.apas>: select 5 as a
87+
database: ${database}
88+
table: hoge
89+
partition_kv:
90+
b: "9"
91+
c: "10"
92+
save_mode: overwrite

example/run.sh

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,8 @@ EXAMPLE_ROOT=$ROOT/example
55
LOCAL_MAVEN_REPO=$ROOT/build/repo
66
DATABASE="$1"
77
OUTPUT="$2"
8+
WORKGROUP="$3"
9+
PARAM_OPTION=""
810

911
if [ -z "$DATABASE" ]; then
1012
echo "[ERROR] Set database as the first argument."
@@ -14,6 +16,9 @@ if [ -z "$OUTPUT" ]; then
1416
echo "[ERROR] Set output s3 URI as the second argument."
1517
exit 1
1618
fi
19+
if [ -n "$WORKGROUP" ]; then
20+
PARAM_OPTION="-p workgroup=$WORKGROUP"
21+
fi
1722

1823
(
1924
cd $EXAMPLE_ROOT
@@ -22,5 +27,5 @@ fi
2227
rm -rfv .digdag
2328

2429
## run
25-
digdag run example.dig -c allow.properties -p repos=${LOCAL_MAVEN_REPO} -p output=${OUTPUT} -p database=${DATABASE} --no-save
30+
digdag run example.dig -c allow.properties -p repos=${LOCAL_MAVEN_REPO} -p output=${OUTPUT} -p database=${DATABASE} $PARAM_OPTION --no-save
2631
)
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
distributionBase=GRADLE_USER_HOME
22
distributionPath=wrapper/dists
3-
distributionUrl=https\://services.gradle.org/distributions/gradle-4.9-bin.zip
3+
distributionUrl=https\://services.gradle.org/distributions/gradle-5.5-bin.zip
44
zipStoreBase=GRADLE_USER_HOME
55
zipStorePath=wrapper/dists

0 commit comments

Comments
 (0)