mscoco url link invalid? BucketNotFoundException: 404 gs://images.cocodataset.org bucket does not exist. #108

@brando90

I tried running the command below but got an error:

(mds_env_gpu) brando9~/data/mds/mscoco $ gsutil -m rsync gs://images.cocodataset.org/train2017 train2017

BucketNotFoundException: 404 gs://images.cocodataset.org bucket does not exist.

What should I do?

Full attempt:

# 1. Download the 2017 train images and annotations from http://cocodataset.org/:
# You can use gsutil to download them to mscoco/:
mkdir -p $MDS_DATA_PATH/mscoco/
cd $MDS_DATA_PATH/mscoco/
mkdir -p train2017
# this seems to download all the files directly, so no zip file is needed
gsutil -m rsync gs://images.cocodataset.org/train2017 train2017
# todo: should have 118287 .jpg files? (note: no unzipping needed)
ls $MDS_DATA_PATH/mscoco/train2017 | grep -c '\.jpg$'
# download & extract annotations_trainval2017.zip (cp needs a destination)
gsutil -m cp gs://images.cocodataset.org/annotations/annotations_trainval2017.zip $MDS_DATA_PATH/mscoco/
unzip $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip -d $MDS_DATA_PATH/mscoco
# todo: should say 6?
ls $MDS_DATA_PATH/mscoco/annotations | grep -c '\.json$'

## Otherwise, you can download train2017.zip and annotations_trainval2017.zip and extract them into mscoco/. ETA ~36m.
#mkdir -p $MDS_DATA_PATH/mscoco
#wget http://images.cocodataset.org/zips/train2017.zip -O $MDS_DATA_PATH/mscoco/train2017.zip
#wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip
## both zips should be there, note: downloading zip takes some time
#ls $MDS_DATA_PATH/mscoco/
## Extract them into mscoco/ (I interpret that as extracting both there, which also matches what the gsutil command above appears to do)
## takes some time, but good progress display
#unzip $MDS_DATA_PATH/mscoco/train2017.zip -d $MDS_DATA_PATH/mscoco
#unzip $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip -d $MDS_DATA_PATH/mscoco
## two folders should be there, annotations and train2017 stuff
#ls $MDS_DATA_PATH/mscoco/
## check jpg imgs are there
#ls $MDS_DATA_PATH/mscoco/train2017
#ls $MDS_DATA_PATH/mscoco/train2017 | grep -c '\.jpg$'
## says: 118287 for a 2nd time
#ls $MDS_DATA_PATH/mscoco/annotations
#ls $MDS_DATA_PATH/mscoco/annotations | grep -c '\.json$'
## says: 6 for a 2nd time
## move them, since the google NL instructions say so. ref for moving a large number of files: https://stackoverflow.com/a/75034830/1601580 (thanks chatgpt!)
#ls $MDS_DATA_PATH/mscoco/train2017 | grep -c '\.jpg$'
#find $MDS_DATA_PATH/mscoco/train2017 -type f -print0 | xargs -0 mv -t $MDS_DATA_PATH/mscoco
#ls $MDS_DATA_PATH/mscoco | grep -c '\.jpg$'
## says: 118287 for both
#ls $MDS_DATA_PATH/mscoco/annotations/ | grep -c '\.json$'
#mv $MDS_DATA_PATH/mscoco/annotations/* $MDS_DATA_PATH/mscoco/
#ls $MDS_DATA_PATH/mscoco/ | grep -c '\.json$'
## says: 6 for both

# 2. Launch the conversion script:
python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
  --dataset=mscoco \
  --mscoco_data_root=$MDS_DATA_PATH/mscoco \
  --splits_root=$SPLITS \
  --records_root=$RECORDS

# 3. Expect the conversion to take about 4 hours.

# 4. Find the following outputs in $RECORDS/mscoco/:
# 80 tfrecords files named [0-79].tfrecords
ls $RECORDS/mscoco/ | grep -c '\.tfrecords$'
# dataset_spec.json (see note 1)
ls $RECORDS/mscoco/dataset_spec.json

related: brando90/pytorch-meta-dataset#20
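
Until the bucket question is answered, here is a minimal sketch of an HTTP-only route that avoids `gsutil` entirely. It assumes `wget` and `unzip` are installed and that the two HTTP URLs from the commented-out block above are still being served (they worked in that attempt, which reported 118287 .jpg files and 6 .json files). The function names `count_ext`, `fetch_zip`, and `download_mscoco` are hypothetical helpers, not part of meta-dataset.

```shell
# Sketch only: helper names are hypothetical; the URLs are the ones from the
# commented-out wget lines in the script above.

# Count regular files directly inside DIR whose name ends in ".EXT".
count_ext() {
  find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Download URL into DEST_DIR (resuming a partial file) and unzip it there.
fetch_zip() {
  url="$1"
  dest_dir="$2"
  mkdir -p "$dest_dir" &&
    wget -c -P "$dest_dir" "$url" &&
    unzip -q -o "$dest_dir/$(basename "$url")" -d "$dest_dir"
}

# Fetch both archives into ROOT and print the counts the script above expects.
download_mscoco() {
  root="$1"
  fetch_zip http://images.cocodataset.org/zips/train2017.zip "$root" &&
    fetch_zip http://images.cocodataset.org/annotations/annotations_trainval2017.zip "$root" &&
    echo "jpg files:  $(count_ext "$root/train2017" jpg) (expected 118287)" &&
    echo "json files: $(count_ext "$root/annotations" json) (expected 6)"
}
```

After sourcing these functions, `download_mscoco "$MDS_DATA_PATH/mscoco"` would run the whole download. Counting with `find -name '*.jpg'` rather than `grep -c .jpg` avoids matching names that merely contain "jpg" after some other character, since an unescaped `.` in a grep pattern matches any character.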
