How to load selected files of a Kaggle dataset that is too large from Kaggle into Colab

2023-09-06 04:48:19, by 再爱你一个暑假


If I want to switch from a Kaggle notebook to a Colab notebook, I can download the notebook from Kaggle and open it in Google Colab. The problem with this is that you would normally also need to download and upload the Kaggle dataset, which is quite an effort.

If you have a small dataset, or if you just need a smaller file of a dataset, you can put the data into the same folder structure that the Kaggle notebook expects. Thus, you need to create that structure in Google Colab, for example kaggle/input/, and upload the files there. That is not the issue.
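
For the small-file case, a minimal sketch of recreating that structure in Colab could look like this; the folder name my-dataset is only a placeholder, not something from the original dataset:

import os
from google.colab import files

# Recreate the input layout that the Kaggle notebook expects (hypothetical dataset folder name)
os.makedirs('/kaggle/input/my-dataset', exist_ok=True)

# Upload the small file(s) via the browser dialog and move them into place
uploaded = files.upload()
for name in uploaded:
    os.rename(name, f'/kaggle/input/my-dataset/{name}')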

If you have a large dataset, though, you can either:

mount your Google Drive and use the dataset / file from there

or download the dataset from Kaggle directly into Colab, following the Colab guide "Easiest way to download kaggle data in Google Colab" (see that link for more details):

Please follow the steps below to download and use kaggle data within Google Colab:

Go to your Kaggle account, scroll to the API section, and click "Expire API Token" to remove previous tokens.

Click on "Create New API Token" - it will download a kaggle.json file to your machine.

Go to your Google Colab project file and run the following commands:

   ! pip install -q kaggle

Choose the kaggle.json file that you downloaded

from google.colab import files
files.upload()

Make a directory named .kaggle in your home directory and copy the kaggle.json file there.

! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

Change the permissions of the file.

! chmod 600 ~/.kaggle/kaggle.json

That's all! You can check if everything is okay by running this command.

! kaggle datasets list

Download Data

   ! kaggle competitions download -c 'name-of-competition'

Or if you want to download datasets (taken from a comment):

! kaggle datasets download -d USERNAME/DATASET_NAME

You can get these dataset names (if unclear) from "Copy API command" in the three-dots drop-down next to the "New Notebook" button on the Kaggle dataset page.

And here comes the issue: This seems to work only on smaller datasets. I have tried it on

kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

and the API does not find it, probably because downloading 40 GB of data is simply restricted: 404 - Not Found.

In such a case, you can only download the needed file and use the mounted Google Drive, or you need to use Kaggle instead of Colab.

Is there a way to download into Colab only the 800 MB metadata.csv file of the 40 GB CORD-19 Kaggle dataset? Here is the link to the file's information page:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

I have now loaded the file into Google Drive, and I am curious whether that is already the best approach. It is quite a lot of effort when, in contrast, on Kaggle the whole dataset is already available, with no need to download it, and it loads quickly.

PS: After having downloaded the zip file from Kaggle to Colab, it needs to be extracted. Quoting the guide again:

Use unzip command to unzip the data:

For example, create a directory named train,

   ! mkdir train

and unzip the train data there,

   ! unzip train.zip -d train
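
If you prefer to stay in Python rather than shell commands, the same extraction can be done with the standard-library zipfile module (a sketch, assuming train.zip sits in the current working directory):

import os
import zipfile

# Same effect as "mkdir train" followed by "unzip train.zip -d train"
os.makedirs('train', exist_ok=True)
with zipfile.ZipFile('train.zip') as zf:
    zf.extractall('train')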

Update: I recommend mounting Google Drive

After having tried both ways (mounting Google Drive versus loading directly from Kaggle), I recommend mounting Google Drive if your setup allows it. The advantage is that the file needs to be uploaded only once: Google Colab and Google Drive are directly connected. Mounting Google Drive costs you the extra steps of downloading the file from Kaggle, unzipping it, uploading it to Google Drive, and getting and activating a token for each Python session to mount the Drive, but activating the token is done quickly. With Kaggle, you instead need to transfer the file from Kaggle to Google Colab at each session, which takes more time and traffic.
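
For reference, here is a minimal sketch of the per-session part of the Drive approach; the folder name CORD-19 inside My Drive is an assumption, adjust the path to wherever you uploaded the file:

from google.colab import drive
import pandas as pd

# Mount Google Drive; this prompts once per session for the authorization token
drive.mount('/content/drive')

# Read the previously uploaded file directly from Drive (the path is an assumption)
df = pd.read_csv('/content/drive/MyDrive/CORD-19/metadata.csv')
print(df.shape)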

Solution

You could write a script that downloads only certain files or the files one after the other:

import os

# Set your Kaggle API credentials (the values come from your kaggle.json token file)
os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME_HERE"
os.environ['KAGGLE_KEY'] = "YOUR_TOKEN_HERE"

# List all files in the dataset
!kaggle datasets files allen-institute-for-ai/CORD-19-research-challenge

# Download only the metadata.csv file from the dataset
!kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge -f metadata.csv
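
Depending on the file, the single-file download may arrive zipped (e.g. as metadata.csv.zip); that is not confirmed in the post, so treat the following as a hedged follow-up sketch that unzips if necessary and sanity-checks the result:

import os
import zipfile
import pandas as pd

# The -f download may arrive zipped (assumption); extract it if so
if os.path.exists('metadata.csv.zip'):
    with zipfile.ZipFile('metadata.csv.zip') as zf:
        zf.extractall('.')

# Quick sanity check that the file is readable
df = pd.read_csv('metadata.csv')
print(df.shape)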