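This post shows how to connect to Kaggle data using the pins package.
It introduces pins, an essential R package for working with Kaggle conveniently from R.
The released version can be installed from CRAN as follows.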
# Install the released version from CRAN:
install.packages("pins")
If you run into a bug, or want the latest version, you can install the development version directly from GitHub.
# install.packages("remotes")
remotes::install_github("rstudio/pins")
To use the Kaggle API, you need to sign up for Kaggle, download a user token, and register it. Clicking the token creation button downloads a JSON file; register that file as follows.
library(pins)
board_register_kaggle(token = "path/to/kaggle.json")
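Registration fails in confusing ways when the token path is wrong or the file is incomplete, so it can help to check the file first. The following is a minimal sketch, not part of pins: `check_kaggle_token()` is a hypothetical helper, and it assumes the downloaded JSON file carries `username` and `key` fields.

```r
# Hypothetical helper (not part of pins): verify that a Kaggle token file
# exists and contains the fields we assume it has ("username" and "key").
check_kaggle_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  token <- jsonlite::fromJSON(path)
  missing_fields <- setdiff(c("username", "key"), names(token))
  if (length(missing_fields) > 0) {
    stop("Token file is missing field(s): ",
         paste(missing_fields, collapse = ", "))
  }
  # Return the path invisibly so it can be passed straight on.
  invisible(path)
}

# Possible usage (needs a real token file, so it is left commented out):
# board_register_kaggle(token = check_kaggle_token("path/to/kaggle.json"))
```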
The Palmer penguins data can be uploaded to Kaggle as a private dataset as follows.
library(palmerpenguins)
# Note: this function was being updated and did not work at the time of writing.
pin(penguins, description = "The penguins data from R palmerpenguins", board = "kaggle")
Uploaded datasets can be found by searching for your own Kaggle ID.
pin_find("issactoast", board = "kaggle")
# A tibble: 3 x 4
name description type board
<chr> <chr> <chr> <chr>
1 issactoast/actuariallossestima… Actuarial loss predicti… files kagg…
2 issactoast/topic-info topic_info files kagg…
3 issactoast/topicmodeling20 topicmodeling20 files kagg…
Next, load the first of the private datasets listed above. This downloads the files to the local machine, so you can work with them without visiting the Kaggle site.
pin_get("actuariallossestimation", board = "kaggle")
[1] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/sample_submission.csv"
[2] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/test.csv/test.csv"
[3] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/train.csv/train.csv"
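Since the call above returns a character vector of file paths rather than a data frame, one more step is needed to load the data. The sketch below uses base R only; `read_pinned_csv()` is a hypothetical helper introduced for illustration, and it assumes the file of interest is a plain CSV.

```r
# Hypothetical helper: pick one file out of a character vector of paths
# (such as the one returned by pin_get() above) and read it as a data frame.
read_pinned_csv <- function(paths, file_name) {
  hit <- paths[basename(paths) == file_name]
  if (length(hit) == 0) {
    stop("No file named '", file_name, "' among the given paths")
  }
  # If several paths share the same file name, the first one is used.
  read.csv(hit[1], stringsAsFactors = FALSE)
}

# Possible usage (needs a registered Kaggle board, so it is commented out):
# paths <- pin_get("actuariallossestimation", board = "kaggle")
# train <- read_pinned_csv(paths, "train.csv")
```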
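You can also search the datasets registered on Kaggle for those whose name contains prediction, and view detailed information about any of them, as follows.
pin_find("prediction", board = "kaggle")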
# A tibble: 6 x 4
name description type board
<chr> <chr> <chr> <chr>
1 aaron7sun/stocknews Daily News for Stock Market… files kagg…
2 andrewmvd/divorce-predicti… Divorce Prediction files kagg…
3 andrewmvd/heart-failure-cl… Heart Failure Prediction files kagg…
4 anmolkumar/health-insuranc… Health Insurance Cross Sell… files kagg…
5 anmolkumar/house-price-pre… House Price Prediction Chal… files kagg…
6 avikasliwal/used-cars-pric… Used Cars Price Prediction files kagg…
pin_info("divorce-prediction", board = "kaggle")
# Source: kaggle<andrewmvd/divorce-prediction> []
# Description: Divorce Prediction
# Properties:
# id: 807599
# subtitle: Uncover what makes relationships last or break
# tags:
# - ref:
# - social science
# - psychology
# - tabular data
# - culture and humanities
# competitionCount:
# - 0
# - 1
# - 112
# - 0
# datasetCount:
# - 2146
# - 521
# - 585
# - 85
# description:
# - Social science is the collection of disciplines studying how humans interact with
# each other.
# - Psychology is the study of how we use our brains (or don't) to interact with others.
# Humans are complicated and maybe data science can help us understand ourselves.
# - .na.character
# - What is it to be human? What activities and patterns of behavior define us and
# our societies? This tag will help you tackle these questions.
# fullPath:
# - subject > people and society > social science
# - subject > people and society > social science > psychology
# - data type > tabular data
# - subject > culture and humanities
# isAutomatic:
# - no
# - no
# - no
# - no
# name:
# - social science
# - psychology
# - tabular data
# - culture and humanities
# scriptCount:
# - 315
# - 137
# - 707
# - 18
# totalCount:
# - 2461
# - 659
# - 1404
# - 103
# creatorName: Larxel
# creatorUrl: andrewmvd
# totalBytes: 4221
# url: https://www.kaggle.com/andrewmvd/divorce-prediction
# lastUpdated: '2020-07-30T20:27:19.613Z'
# downloadCount: 877
# isPrivate: no
# isReviewed: no
# isFeatured: no
# licenseName: Other (specified in description)
# ownerName: Larxel
# ownerRef: andrewmvd
# kernelCount: 7
# topicCount: 0
# viewCount: 8540
# voteCount: 53
# currentVersionNumber: 4
# usabilityRating: 1.0
Competition names are prefixed with c/. Let's search for data related to the Crowdflower Search Results Relevance competition.
pin_find("c/crowdflower", board = "kaggle")
# A tibble: 10 x 4
name description type board
<chr> <chr> <chr> <chr>
1 awsaf49/ecommerce-search-r… eCommerce Search Result Re… files kagg…
2 c/crowdflower-search-relev… Crowdflower Search Results… files kagg…
3 c/crowdflower-weather-twit… Partly Sunny with a Chance… files kagg…
4 crowdflower/first-gop-deba… First GOP Debate Twitter S… files kagg…
5 crowdflower/handwritten-na… Handwritten Names files kagg…
6 crowdflower/narrativity-in… Narrativity in Scientific … files kagg…
7 crowdflower/political-soci… Political Social Media Pos… files kagg…
8 crowdflower/twitter-airlin… Twitter US Airline Sentime… files kagg…
9 crowdflower/twitter-user-g… Twitter User Gender Classi… files kagg…
10 humancomp/worker-activity-… Workers Browser Activity i… files kagg…
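Because a search like this mixes competitions and ordinary datasets, it can be handy to separate them by the c/ prefix. The sketch below is only an illustration: `split_kaggle_pins()` is a hypothetical helper, and `found` is a hand-built stand-in for a search result that assumes nothing beyond a `name` column.

```r
# Hypothetical helper: split search results into competitions and datasets
# based on the "c/" prefix in the `name` column.
split_kaggle_pins <- function(found) {
  is_competition <- startsWith(found$name, "c/")
  list(
    competitions = found$name[is_competition],
    datasets     = found$name[!is_competition]
  )
}

# Stand-in for a search result; a real one would come from pin_find().
found <- data.frame(
  name = c("c/crowdflower-search-relevance", "andrewmvd/divorce-prediction"),
  stringsAsFactors = FALSE
)
groups <- split_kaggle_pins(found)
print(groups$competitions)
```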
Detailed information about a competition is also available through pin_info.
pin_info("c/crowdflower-search-relevance", board = "kaggle")
# Source: kaggle<c/crowdflower-search-relevance> []
# Description: Crowdflower Search Results Relevance
# Properties:
# id: 4407
# subtitle: .na.character
# tags:
# - ref:
# - tabular data
# - internet
# competitionCount:
# - 112
# - 18
# datasetCount:
# - 585
# - 5475
# description:
# - .na.character
# - An interconnected network of tubes that connects the entire world together. This
# tag covers a broad range of tags; anything from cryptocurrency to website analytics.
# fullPath:
# - data type > tabular data
# - subject > science and technology > internet
# isAutomatic:
# - no
# - no
# name:
# - tabular data
# - internet
# scriptCount:
# - 707
# - 616
# totalCount:
# - 1404
# - 6109
# creatorName: .na.character
# creatorUrl: .na.character
# totalBytes: .na.integer
# url: https://www.kaggle.com/c/crowdflower-search-relevance
# lastUpdated: .na.character
# downloadCount: .na.integer
# isPrivate: .na
# isReviewed: .na
# isFeatured: .na
# licenseName: .na.character
# ownerName: .na.character
# ownerRef: .na.character
# kernelCount: 0
# topicCount: .na.integer
# viewCount: .na.integer
# voteCount: .na.integer
# currentVersionNumber: .na.integer
# files:
# - ~
# versions:
# - ~
# usabilityRating: .na.real
# deadline: '2015-07-06T23:59:00Z'
# category: Featured
# reward: $20,000
# organizationName: Figure Eight
# organizationRef: crowdflower
# teamCount: 1324
# userHasEntered: yes
# userRank: .na
# mergerDeadline: '2015-06-29T23:59:00Z'
# newEntrantDeadline: '2015-06-29T23:59:00Z'
# enabledDate: '2015-05-11T20:56:45.417Z'
# maxDailySubmissions: 5
# maxTeamSize: .na
# evaluationMetric: QuadraticWeightedKappa
# awardsPoints: yes
# isKernelsSubmissionsOnly: no
# submissionsDisabled: no
Having confirmed that this is the right data, download it to the local machine.
pin_get("c/crowdflower-search-relevance", board = "kaggle")
[1] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/sampleSubmission.csv.zip"
[2] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/test.csv.zip"
[3] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/train.csv.zip"
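The competition files arrive as .zip archives, so they need one more step before they can be used as data frames. The following base R sketch reads a CSV straight out of an archive without extracting it to disk; `read_zipped_csv()` is a hypothetical helper, and it assumes each archive contains at least one .csv file.

```r
# Hypothetical helper: read the first .csv file found inside a zip archive.
read_zipped_csv <- function(zip_path) {
  # List the archive contents without extracting anything.
  inner <- utils::unzip(zip_path, list = TRUE)$Name
  csv_files <- inner[grepl("\\.csv$", inner)]
  if (length(csv_files) == 0) {
    stop("No .csv file found inside ", zip_path)
  }
  # unz() opens a connection to a single file inside the archive;
  # read.csv() opens and closes that connection for us.
  read.csv(unz(zip_path, csv_files[1]), stringsAsFactors = FALSE)
}

# Possible usage (needs a registered Kaggle board, so it is commented out):
# paths <- pin_get("c/crowdflower-search-relevance", board = "kaggle")
# train <- read_zipped_csv(paths[basename(paths) == "train.csv.zip"])
```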
Text and figures are licensed under Creative Commons Attribution CC BY-NC-ND 4.0. Source code is available at the source repository, unless otherwise noted. Figures reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".