Connecting to Kaggle Data with the pins Package

pins data

This post shows how to connect to Kaggle data using the pins package.

Issac Lee https://www.theissaclee.com/ko/ (슬기로운통계생활) https://www.youtube.com/c/statisticsplaybook
02-20-2021

This post introduces pins, an essential R package for working with Kaggle comfortably from R.

Installing the package

Install the released version from CRAN as follows.

# Install the released version from CRAN:
install.packages("pins")

If you run into a bug, or want the latest development version, you can install directly from GitHub.

# install.packages("remotes")
remotes::install_github("rstudio/pins")

Loading the package
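The loading step can be sketched as below. This is a minimal sketch rather than the post's original code: it guards the call so it also runs on a machine where pins is not installed, and it prints the installed version, since the functions used in this post reflect the pins API as of the post's date (02-20-2021).

```r
# Attach pins if it is installed; otherwise point back to the install step.
if (requireNamespace("pins", quietly = TRUE)) {
  library(pins)
  # Show which version is in use; the calls in this post were written
  # against the version available in early 2021.
  print(packageVersion("pins"))
} else {
  message("pins is not installed; run install.packages(\"pins\") first.")
}
```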

Registering a token

To use the Kaggle API, sign up for Kaggle, then download a user token and register it. Creating a token downloads a JSON file; register that file as follows.

board_register_kaggle(token = "path/to/kaggle.json")
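A wrong path is an easy mistake at this step, so it can help to check that the file exists before registering. The following is a minimal sketch under that assumption; the variable name `token_path` is hypothetical, and the path is the placeholder from the call above.

```r
# Placeholder path taken from the text; replace it with the real location
# of the downloaded kaggle.json.
token_path <- "path/to/kaggle.json"

if (!file.exists(token_path)) {
  # With the placeholder path this branch runs, and nothing is registered.
  message("Token file not found: ", token_path)
} else if (!requireNamespace("pins", quietly = TRUE)) {
  message("pins is not installed.")
} else {
  pins::board_register_kaggle(token = token_path)
}
```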

Registering and using your own data

You can upload the Palmer penguins data to Kaggle as a private dataset as follows.

library(palmerpenguins)

# This function is currently being updated and does not work at the moment.
pin(penguins, description = "The penguins data from R palmerpenguins", board = "kaggle")

You can search for your uploaded datasets by your Kaggle ID.

pin_find("issactoast", board = "kaggle")
# A tibble: 3 x 4
  name                            description              type  board
  <chr>                           <chr>                    <chr> <chr>
1 issactoast/actuariallossestima… Actuarial loss predicti… files kagg…
2 issactoast/topic-info           topic_info               files kagg…
3 issactoast/topicmodeling20      topicmodeling20          files kagg…

Next, load the first of the uploaded private datasets. This downloads the files to the local machine, so you can work with them without visiting the Kaggle site.

pin_get("actuariallossestimation", board = "kaggle")
[1] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/sample_submission.csv"
[2] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/test.csv/test.csv"    
[3] "/home/issac/.cache/pins/kaggle/issactoast/actuariallossestimation/train.csv/train.csv"  
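The result above is a character vector of file paths, so the files still have to be read into R. Because some of the paths are nested (for example `train.csv/train.csv`), matching on `basename()` is one way to pick a file. The sketch below uses made-up stand-in files in a temporary directory so it runs without a Kaggle account; in real use, `paths` would be the value returned by `pin_get()`.

```r
# Stand-in for the paths pin_get() returned; the files and their contents
# are made up so the example runs without a Kaggle account.
demo_dir <- file.path(tempdir(), "pins-demo")
dir.create(file.path(demo_dir, "train.csv"), recursive = TRUE,
           showWarnings = FALSE)
paths <- c(
  file.path(demo_dir, "sample_submission.csv"),
  file.path(demo_dir, "train.csv", "train.csv")
)
write.csv(data.frame(id = 1:3), paths[1], row.names = FALSE)
write.csv(data.frame(id = 1:3, y = c(10, 20, 30)), paths[2], row.names = FALSE)

# Pick the training file by its file name, then read it as a data frame.
train_path <- paths[basename(paths) == "train.csv"]
train <- read.csv(train_path)
str(train)
```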

Finding Kaggle datasets

You can search the datasets registered on Kaggle for those containing "prediction", and view detailed information about a dataset, as follows.

head(pin_find("prediction", board = "kaggle"))
# A tibble: 6 x 4
  name                        description                  type  board
  <chr>                       <chr>                        <chr> <chr>
1 aaron7sun/stocknews         Daily News for Stock Market… files kagg…
2 andrewmvd/divorce-predicti… Divorce Prediction           files kagg…
3 andrewmvd/heart-failure-cl… Heart Failure Prediction     files kagg…
4 anmolkumar/health-insuranc… Health Insurance Cross Sell… files kagg…
5 anmolkumar/house-price-pre… House Price Prediction Chal… files kagg…
6 avikasliwal/used-cars-pric… Used Cars Price Prediction   files kagg…

pin_info("divorce-prediction", board = "kaggle")
# Source: kaggle<andrewmvd/divorce-prediction> []
# Description: Divorce Prediction
# Properties:
#   id: 807599
#   subtitle: Uncover what makes relationships last or break
#   tags:
#   - ref:
#     - social science
#     - psychology
#     - tabular data
#     - culture and humanities
#     competitionCount:
#     - 0
#     - 1
#     - 112
#     - 0
#     datasetCount:
#     - 2146
#     - 521
#     - 585
#     - 85
#     description:
#     - Social science is the collection of disciplines studying how humans interact with
#       each other.
#     - Psychology is the study of how we use our brains (or don't) to interact with others.
#       Humans are complicated and maybe data science can help us understand ourselves.
#     - .na.character
#     - What is it to be human? What activities and patterns of behavior define us and
#       our societies? This tag will help you tackle these questions.
#     fullPath:
#     - subject > people and society > social science
#     - subject > people and society > social science > psychology
#     - data type > tabular data
#     - subject > culture and humanities
#     isAutomatic:
#     - no
#     - no
#     - no
#     - no
#     name:
#     - social science
#     - psychology
#     - tabular data
#     - culture and humanities
#     scriptCount:
#     - 315
#     - 137
#     - 707
#     - 18
#     totalCount:
#     - 2461
#     - 659
#     - 1404
#     - 103
#   creatorName: Larxel
#   creatorUrl: andrewmvd
#   totalBytes: 4221
#   url: https://www.kaggle.com/andrewmvd/divorce-prediction
#   lastUpdated: '2020-07-30T20:27:19.613Z'
#   downloadCount: 877
#   isPrivate: no
#   isReviewed: no
#   isFeatured: no
#   licenseName: Other (specified in description)
#   ownerName: Larxel
#   ownerRef: andrewmvd
#   kernelCount: 7
#   topicCount: 0
#   viewCount: 8540
#   voteCount: 53
#   currentVersionNumber: 4
#   usabilityRating: 1.0
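Since the search result is a table with `name` and `description` columns, it can be narrowed further with ordinary subsetting. The sketch below filters by owner prefix on a small stand-in data frame; the two rows are copied from the output above (one description is truncated exactly as printed), and the variable names `results` and `by_owner` are hypothetical.

```r
# Stand-in for the table returned by pin_find(); rows copied from the
# printed output above, with the first description truncated as printed.
results <- data.frame(
  name = c("aaron7sun/stocknews", "andrewmvd/divorce-prediction"),
  description = c("Daily News for Stock Market…", "Divorce Prediction"),
  stringsAsFactors = FALSE
)

# Keep only the datasets owned by a given Kaggle user.
by_owner <- results[grepl("^andrewmvd/", results$name), ]
by_owner$name
```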

Finding Kaggle competition data

Competition names are prefixed with c/. Let's search for data related to the Crowdflower Search Results Relevance competition.

pin_find("c/crowdflower", board = "kaggle")
# A tibble: 10 x 4
   name                        description                 type  board
   <chr>                       <chr>                       <chr> <chr>
 1 awsaf49/ecommerce-search-r… eCommerce Search Result Re… files kagg…
 2 c/crowdflower-search-relev… Crowdflower Search Results… files kagg…
 3 c/crowdflower-weather-twit… Partly Sunny with a Chance… files kagg…
 4 crowdflower/first-gop-deba… First GOP Debate Twitter S… files kagg…
 5 crowdflower/handwritten-na… Handwritten Names           files kagg…
 6 crowdflower/narrativity-in… Narrativity in Scientific … files kagg…
 7 crowdflower/political-soci… Political Social Media Pos… files kagg…
 8 crowdflower/twitter-airlin… Twitter US Airline Sentime… files kagg…
 9 crowdflower/twitter-user-g… Twitter User Gender Classi… files kagg…
10 humancomp/worker-activity-… Workers Browser Activity i… files kagg…

Detailed information about a competition is also available through pin_info.

pin_info("c/crowdflower-search-relevance", board = "kaggle")
# Source: kaggle<c/crowdflower-search-relevance> []
# Description: Crowdflower Search Results Relevance
# Properties:
#   id: 4407
#   subtitle: .na.character
#   tags:
#   - ref:
#     - tabular data
#     - internet
#     competitionCount:
#     - 112
#     - 18
#     datasetCount:
#     - 585
#     - 5475
#     description:
#     - .na.character
#     - An interconnected network of tubes that connects the entire world together. This
#       tag covers a broad range of tags; anything from cryptocurrency to website analytics.
#     fullPath:
#     - data type > tabular data
#     - subject > science and technology > internet
#     isAutomatic:
#     - no
#     - no
#     name:
#     - tabular data
#     - internet
#     scriptCount:
#     - 707
#     - 616
#     totalCount:
#     - 1404
#     - 6109
#   creatorName: .na.character
#   creatorUrl: .na.character
#   totalBytes: .na.integer
#   url: https://www.kaggle.com/c/crowdflower-search-relevance
#   lastUpdated: .na.character
#   downloadCount: .na.integer
#   isPrivate: .na
#   isReviewed: .na
#   isFeatured: .na
#   licenseName: .na.character
#   ownerName: .na.character
#   ownerRef: .na.character
#   kernelCount: 0
#   topicCount: .na.integer
#   viewCount: .na.integer
#   voteCount: .na.integer
#   currentVersionNumber: .na.integer
#   files:
#   - ~
#   versions:
#   - ~
#   usabilityRating: .na.real
#   deadline: '2015-07-06T23:59:00Z'
#   category: Featured
#   reward: $20,000
#   organizationName: Figure Eight
#   organizationRef: crowdflower
#   teamCount: 1324
#   userHasEntered: yes
#   userRank: .na
#   mergerDeadline: '2015-06-29T23:59:00Z'
#   newEntrantDeadline: '2015-06-29T23:59:00Z'
#   enabledDate: '2015-05-11T20:56:45.417Z'
#   maxDailySubmissions: 5
#   maxTeamSize: .na
#   evaluationMetric: QuadraticWeightedKappa
#   awardsPoints: yes
#   isKernelsSubmissionsOnly: no
#   submissionsDisabled: no

Having confirmed that this is the right data, download it to the local machine.

pin_get("c/crowdflower-search-relevance", board = "kaggle")
[1] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/sampleSubmission.csv.zip"
[2] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/test.csv.zip"            
[3] "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/train.csv.zip"           
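The competition files arrive as zip archives, so they need one more step before they can be used as data frames. The following is a minimal sketch, assuming each archive holds a single CSV file; the helper `read_zipped_csv` is a hypothetical name, not part of pins, and the path is copied from the output above.

```r
# Read the first file inside a zip archive as CSV without extracting it
# to disk. Assumes the archive contains a single CSV file.
read_zipped_csv <- function(zip_path) {
  inner <- utils::unzip(zip_path, list = TRUE)$Name[1]
  read.csv(unz(zip_path, inner))
}

# Path copied from the output above; it only exists after pin_get() has run
# on that machine, so the read is guarded.
zip_path <- "/home/issac/.cache/pins/kaggle/crowdflower-search-relevance/train.csv.zip"
if (file.exists(zip_path)) {
  train <- read_zipped_csv(zip_path)
  str(train)
}
```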

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-ND 4.0. Source code is available at , unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".