GitHub - ssadhukhanv2/gh_archive_activity_downloader

Data Ingestion using AWS Lambda Function

Incremental Load with Bookmark

GH Archive is a public website that provides records of the public GitHub timeline. The records are available as files, and each file provides an hourly view of user activity across the globe.
Files are available at the URL https://data.gharchive.org/{file_name}. For example, https://data.gharchive.org/2015-01-01-15.json.gz provides the activity for 1/1/2015 at 3 PM UTC.
Our Lambda function incrementally reads the files from GH Archive and downloads them to a folder within an S3 bucket, while maintaining a bookmark of the last downloaded file.
If a bookmark is available, the function starts from the file after the bookmark and puts each file in the S3 location. It then saves the name of the last downloaded file in the 'bookmark' file. The bookmark file should contain the name of the last file that the Lambda function downloaded to the folder in the S3 bucket.
If no bookmark is available, the Lambda function starts downloading from the file after the baseline file.
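The file-name arithmetic behind the incremental load can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the helper names (`parse_file_name`, `format_file_name`, `pending_files`, `file_url`) are hypothetical, and it assumes a file only becomes available once its hour has fully elapsed.

```python
from datetime import datetime, timedelta

BASE_URL = "https://data.gharchive.org"
SUFFIX = ".json.gz"


def parse_file_name(file_name):
    """Turn a name such as '2022-11-08-0.json.gz' into a datetime for that hour."""
    stem = file_name[: -len(SUFFIX)]
    return datetime.strptime(stem, "%Y-%m-%d-%H")


def format_file_name(hour):
    """Inverse of parse_file_name; the hour is written without zero-padding."""
    return f"{hour:%Y-%m-%d}-{hour.hour}{SUFFIX}"


def file_url(file_name):
    """Build the download URL for a file name."""
    return f"{BASE_URL}/{file_name}"


def pending_files(bookmark, baseline, now):
    """List the files after the bookmark (or after the baseline if there is
    no bookmark) whose hour has ended before `now` (a naive UTC datetime).
    Waiting for the hour to end is an assumption of this sketch."""
    last = parse_file_name(bookmark if bookmark else baseline)
    current = last + timedelta(hours=1)
    names = []
    while current + timedelta(hours=1) <= now:
        names.append(format_file_name(current))
        current += timedelta(hours=1)
    return names


# No bookmark yet, so the run starts after the baseline file.
print(pending_files(None, "2022-11-08-0.json.gz", datetime(2022, 11, 8, 3, 30)))
# ['2022-11-08-1.json.gz', '2022-11-08-2.json.gz']
```

After each successful upload, the last name in this list would be written to the bookmark file, so the next run resumes from there.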

The bucket name, prefix, baseline and environment are supplied to the Lambda function as environment variables. When testing the code locally, the AWS profile must also be specified in an additional environment variable. Sample environment variables:

  ENVIRON=DEV; // Set to 'DEV' for testing locally, while deploying to AWS set to 'PROD' 
  BUCKET_NAME=gh-archive-files; // Name of the bucket to save the files
  BOOKMARK_FILE=bookmark; // Name of the bookmark file
  BASELINE_FILE=2022-11-08-0.json.gz; // Name of the baseline file
  FILE_PREFIX=sandbox; // The folder prefix within the bucket where we want to save the file
  PROFILE=iam-admin-de // name of the preconfigured aws profile that will be used for testing locally, for deploying in AWS this is not needed as we use Roles for lambda to access s3
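A minimal sketch of how these variables might be read in Python is shown below. The function name `load_config` and the default values are assumptions made for illustration; the repository's own code may read them differently.

```python
import os


def load_config(env=None):
    """Collect the settings from environment variables.

    `env` defaults to os.environ; passing a plain dict makes local testing easy.
    The defaults used here are assumptions of this sketch.
    """
    env = os.environ if env is None else env
    config = {
        "environ": env.get("ENVIRON", "DEV"),
        "bucket_name": env["BUCKET_NAME"],        # required
        "bookmark_file": env.get("BOOKMARK_FILE", "bookmark"),
        "baseline_file": env["BASELINE_FILE"],    # required
        "file_prefix": env.get("FILE_PREFIX", ""),
    }
    # PROFILE is only meaningful for local runs; in AWS the Lambda role is used.
    config["profile"] = env.get("PROFILE") if config["environ"] == "DEV" else None
    return config
```

Calling `load_config()` with no argument reads the real process environment, which is what the Lambda runtime populates from the function's configuration.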

Refer to the code for the Python implementation of this logic.
Create a Lambda function in the AWS Console, and set the timeout to 60 seconds and the memory to 512 MB. Zip the contents of gh_activity_downloader_lambda and upload the zip to Lambda. Verify that all files after the bookmark, or after the baseline if no bookmark is available, are downloaded correctly to the S3 bucket.
Create a rule in EventBridge to trigger this Lambda function every hour.

Create folder structure for AWS lambda

Install Python, venv, pip and PyCharm.
Create the project folder "gh_activity_downloader".

Create a Python virtual environment named "gh_activity_downloader_venv", activate it, and install the requests and boto3 packages in it.

  mkdir gh_activity_downloader
  cd .\gh_activity_downloader\
  python -m venv gh_activity_downloader_venv
  .\gh_activity_downloader_venv\Scripts\activate
  pip install requests
  pip install boto3

When code is deployed to AWS Lambda, the boto3 library is already available in the Python runtime, so it does not need to be packaged with the function. We therefore create a separate library folder, "gh_activity_downloader_lambda", and install only requests into it.
```
  pip install requests -t gh_activity_downloader_lambda
```
Create the two files:
- lambda_function.py -> Contains the logic for the Lambda function. Modify it to add the incremental-load logic, and add further files as required.
- lambda_validate.py -> Contains the logic to test the Lambda code locally.
Log in to the AWS Console and create a Lambda function.
Zip the contents (not the folder itself) of gh_activity_downloader_lambda into the zip file gh_activity_downloader.zip.
Upload the zip file to the Lambda function and test it.
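The zipping step can also be scripted. The sketch below uses the standard library's `shutil.make_archive`, which places the folder's contents at the root of the archive rather than nesting them under the folder name. The helper name `zip_folder_contents` is hypothetical, and the archive should be written outside the folder being zipped.

```python
import shutil


def zip_folder_contents(source_dir, zip_stem):
    """Create <zip_stem>.zip containing the contents of source_dir at its root.

    Returns the path of the archive that was written.
    """
    return shutil.make_archive(zip_stem, "zip", root_dir=source_dir)


# Example call from the project folder (not executed here):
# zip_folder_contents("gh_activity_downloader_lambda", "gh_activity_downloader")
```

With this layout, lambda_function.py sits at the top level of the archive, where Lambda looks for the handler module.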
Files

References

Commenting in Python: how to write comments in Python
Using Multiline String: how to define multiline strings in Python
Python f-strings: the advantages of f-strings over %-formatting and the .format() method
strptime(): parses timestamps held in strings into datetime objects, using format codes such as %Y, %m, %d and %H
timedelta(): used for calculating time differences, for example to add a delta such as one hour to a datetime object
strftime(): converts datetime objects to their string representation (see the format code cheat sheet for strftime)

  from datetime import datetime, timedelta

  baseline_file = '2022-11-07-0.json.gz'

  # baseline_file is a string, so split it on '.' and keep the first part
  date_part_str = baseline_file.split('.')[0]
  print(date_part_str)  # 2022-11-07-0

  # parse date_part_str to a datetime; %m is the month (%M would mean minutes)
  date_part_parsed_to_datetime = datetime.strptime(date_part_str, '%Y-%m-%d-%H')
  print(date_part_parsed_to_datetime)  # 2022-11-07 00:00:00

  # add a timedelta of 1 hour; the result is still a datetime
  date_time_incremented = date_part_parsed_to_datetime + timedelta(hours=1)
  print(date_time_incremented)  # 2022-11-07 01:00:00

  # convert the datetime back to a string
  # note: %H zero-pads the hour ('01'), while the file names above use an unpadded hour ('0')
  date_time_incremented_formatted_to_string = datetime.strftime(date_time_incremented, '%Y-%m-%d-%H')
  print(date_time_incremented_formatted_to_string)  # 2022-11-07-01
