ssadhukhanv2/gh_archive_activity_downloader

Incremental Load with Bookmark

  • GH Archive is a public website that provides records of the public GitHub timeline. The records are published as files, and each file covers one hour of public activity by users across the globe.

  • Files are available at https://data.gharchive.org/{file_name}. For example, https://data.gharchive.org/2015-01-01-15.json.gz provides the activity for 1/1/2015 at 3 PM UTC.

  • Our Lambda function incrementally reads the files from GH Archive and downloads them to a folder within an S3 bucket, while maintaining a bookmark of the last downloaded file.

  • If a bookmark is available, the function starts reading from the next file after the bookmark and puts it in the S3 location. It then saves the name of the last saved file in the 'bookmark' file. The bookmark file should hold the name of the last file that the Lambda function has downloaded into the folder of the S3 bucket.

  • In case the bookmark is not available, the Lambda function starts downloading from the file after the baseline file.

  • The bucket name, prefix, baseline and environment are supplied to the Lambda function as environment variables. When testing the code locally, we also need to specify the AWS profile in an additional environment variable. Sample environment variables:

      ENVIRON=DEV; // Set to 'DEV' for testing locally, while deploying to AWS set to 'PROD' 
      BUCKET_NAME=gh-archive-files; // Name of the bucket to save the files
      BOOKMARK_FILE=bookmark; // Name of the bookmark file
      BASELINE_FILE=2022-11-08-0.json.gz; // Name of the baseline file
      FILE_PREFIX=sandbox; // The folder prefix within the bucket where we want to save the file
      PROFILE=iam-admin-de // name of the preconfigured aws profile that will be used for testing locally, for deploying in AWS this is not needed as we use Roles for lambda to access s3  
    
  • Refer to the code for the Python implementation of this logic.

  • Create a Lambda function in the AWS Console, and set the timeout to 60 seconds and the memory to 512 MB. Zip the contents of gh_activity_downloader_lambda and upload the zip to Lambda. Verify that all the files after the bookmark, or after the baseline (in case the bookmark is not available), are downloaded correctly to the S3 bucket.

  • Create a rule in EventBridge to trigger this Lambda function every hour.
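
The configuration and the bookmark-or-baseline fallback described above could be sketched roughly as follows. This is only an illustration, not the repository's code: `DownloaderConfig`, `load_config`, `s3_key`, `last_downloaded_file` and `source_url` are hypothetical names, and joining the prefix and file name with `/` is an assumption about how the folder inside the bucket is addressed. Reading the bookmark object from S3 is left out; the sketch only takes its text (or `None` when it is missing).

```python
import os
from dataclasses import dataclass
from typing import Optional

GH_ARCHIVE_BASE_URL = "https://data.gharchive.org"


@dataclass
class DownloaderConfig:
    """Settings taken from the environment variables listed above."""
    environ: str
    bucket_name: str
    bookmark_file: str
    baseline_file: str
    file_prefix: str
    profile: Optional[str]  # only needed for local testing


def load_config(env=os.environ):
    """Build the config from a mapping of environment variables."""
    return DownloaderConfig(
        environ=env["ENVIRON"],
        bucket_name=env["BUCKET_NAME"],
        bookmark_file=env["BOOKMARK_FILE"],
        baseline_file=env["BASELINE_FILE"],
        file_prefix=env["FILE_PREFIX"],
        profile=env.get("PROFILE"),  # absent when running with a Lambda role
    )


def s3_key(config, file_name):
    """Assumed key layout: '<FILE_PREFIX>/<file name>'."""
    return f"{config.file_prefix}/{file_name}"


def last_downloaded_file(config, bookmark_contents):
    """Return the bookmarked file name, or the baseline if there is no bookmark.

    bookmark_contents is the text of the bookmark object, or None if it
    does not exist yet.
    """
    if bookmark_contents and bookmark_contents.strip():
        return bookmark_contents.strip()
    return config.baseline_file


def source_url(file_name):
    """URL of a GH Archive file, e.g. .../2015-01-01-15.json.gz."""
    return f"{GH_ARCHIVE_BASE_URL}/{file_name}"
```

Downloading then starts from the file that follows whatever `last_downloaded_file` returns, and the bookmark would be written back under `s3_key(config, config.bookmark_file)` after each saved file.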

Create folder structure for AWS Lambda

  • Install Python, venv, pip and PyCharm.

  • Create the project folder as "gh_activity_downloader".

  • Create a Python virtual environment named "gh_activity_downloader_venv", activate it, and install the requests and boto3 packages in it.

      mkdir gh_activity_downloader
      cd .\gh_activity_downloader\
      python -m venv gh_activity_downloader_venv
      .\gh_activity_downloader_venv\Scripts\activate
      pip install requests
      pip install boto3
    
  • When Lambda code is deployed in AWS, the boto3 library is already available, so it does not need to be packaged with the code. We therefore create a separate library folder, "gh_activity_downloader_lambda", and install only requests in it.

      pip install requests -t gh_activity_downloader_lambda
    
  • Create the two files:

    • lambda_function.py -> Holds the logic for the Lambda code. Modify the lambda_function code to add the logic for incremental load. Add additional files as required.
    • lambda_validate.py -> Holds the logic to test the Lambda code locally.
  • Log in to the AWS Console and create a Lambda function.

  • Zip the contents (not the folder) of gh_activity_downloader_lambda as the zip file gh_activity_downloader.zip and upload it as the Lambda function.

  • Upload .\gh_activity_downloader_lambda.zip and test the lambda

  • Files
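
The "zip the contents, not the folder" step can also be scripted. The sketch below uses the standard library's `shutil.make_archive` with `root_dir` set to the library folder, so that `lambda_function.py` ends up at the top level of the archive rather than under a `gh_activity_downloader_lambda/` directory. The helper name `zip_folder_contents` is hypothetical, and the zip is assumed to be written outside the folder being zipped.

```python
import shutil
from pathlib import Path


def zip_folder_contents(source_dir, zip_path):
    """Zip everything inside source_dir (not the folder itself).

    zip_path should point outside source_dir so the archive does not
    try to include itself. Returns the path of the created zip file.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(str(source))
    # make_archive appends '.zip' itself, so strip the suffix first.
    base_name = str(Path(zip_path).with_suffix(""))
    # root_dir makes archive member names relative to the folder's contents.
    return shutil.make_archive(base_name, "zip", root_dir=str(source))
```

For example, `zip_folder_contents("gh_activity_downloader_lambda", "gh_activity_downloader.zip")`, run from the project folder, would produce a zip whose top-level entries are the files and packages inside the library folder.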

References

  • Commenting in Python: How to write comments in Python

  • Using Multiline String: How to define multiline strings in Python

  • Python f-Strings: The advantages of using f-strings over %-formatting and the .format() method

  • strptime(): Parses timestamps held in strings into datetime objects using format codes such as %Y, %m, %d, %H, etc.

  • timedelta(): Used for calculating time differences, for example to add a delta such as one hour to a datetime object

  • strftime(): Converts datetime objects to their string representation. Formatter cheat sheet for strftime

      from datetime import datetime, timedelta
    
      baseline_file = '2022-11-07-0.json.gz'
    
      # baseline_file is a string, so it is split on '.' and the first part is stored in date_part_str
      date_part_str = baseline_file.split('.')[0]
      print(date_part_str)
    
      # date_part_str is parsed to a datetime using datetime.strptime()
      # note: %m is the month; %M would be read as minutes
      date_part_parsed_to_datetime = datetime.strptime(date_part_str, '%Y-%m-%d-%H')
      print(date_part_parsed_to_datetime)
    
      # add 1 hour of timedelta to date_part_parsed_to_datetime and store it in date_time_incremented, which is of type datetime
      date_time_incremented = date_part_parsed_to_datetime + timedelta(hours=1)
      print(date_time_incremented)
    
      # convert the datetime to a string and store it in date_time_incremented_formatted_to_string
      date_time_incremented_formatted_to_string = datetime.strftime(date_time_incremented, '%Y-%m-%d-%H')
      print(date_time_incremented_formatted_to_string)
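
One detail worth noting when turning the snippet above into file names: strftime's %H always pads the hour to two digits (for example "01"), while the GH Archive names quoted in this README use an unpadded hour (for example 2022-11-08-0.json.gz). The sketch below shows one possible way to roll a file name forward by an hour and to list the files still to be fetched. The function names are hypothetical, and stopping before the hour of `up_to` (on the assumption that the current hour's file is not complete yet) is a design choice made for this example, not something the README states.

```python
from datetime import datetime, timedelta

SUFFIX = ".json.gz"


def parse_file_name(file_name):
    """Return the hour that a name such as '2022-11-07-0.json.gz' refers to."""
    return datetime.strptime(file_name.split(".")[0], "%Y-%m-%d-%H")


def format_file_name(moment):
    """Build a file name with an unpadded hour, e.g. '2022-11-08-0.json.gz'."""
    return f"{moment:%Y-%m-%d}-{moment.hour}{SUFFIX}"


def next_file_name(file_name):
    """Name of the file one hour after file_name."""
    return format_file_name(parse_file_name(file_name) + timedelta(hours=1))


def pending_file_names(last_file_name, up_to):
    """Yield the names after last_file_name, stopping before the hour of up_to."""
    limit = up_to.replace(minute=0, second=0, microsecond=0)
    current = parse_file_name(last_file_name) + timedelta(hours=1)
    while current < limit:
        yield format_file_name(current)
        current += timedelta(hours=1)
```

Here `last_file_name` would be the bookmark contents or, if there is no bookmark, the baseline file name, and `up_to` would typically be the current UTC time.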
    
