⛏ Subreddit text downloader
🔖 Introduction
Download all the text comments from a subreddit
🔗 Github: pistocop/subreddit-comments-dl
Reddit is a great place to gather a large number of user comments on specific topics.
That makes it attractive for NLP tasks, e.g. sentiment analysis on specific products or political topics.
For this reason I built this scraper/tool, which downloads text comments from specific subreddits.
ℹ Update 20 Feb 2021
I wrote a better article for Towards Data Science on Medium, you can read it here.
🚀 Usage
Basic usage to download submissions and their comments from the
subreddits AskReddit and News:
# Install the dependencies
pip install -r requirements.txt
# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>
# Download the News comments after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459200
# Build the dataset and check the results under `./dataset/` path
python src/dataset_builder.py
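The value passed to `--utc-after` in the News example is a Unix timestamp (seconds since 1 January 1970, UTC). A minimal sketch for computing one from a calendar date; the helper name `to_utc_timestamp` is hypothetical and not part of the repository:

```python
from datetime import datetime, timezone


def to_utc_timestamp(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp (in seconds) of midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# 1 January 2021, the date used in the News example
print(to_utc_timestamp(2021, 1, 1))  # 1609459200
```

Setting `tzinfo=timezone.utc` explicitly keeps the result independent of the local timezone of the machine running the snippet.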
ℹ Where can I get the Reddit parameters?
- Parameters are indicated with <...> in the previous script
- Official Reddit guide
- TLDR: read this Stack Overflow post
| Parameter name | Description | How to get it | Example of the value |
| --- | --- | --- | --- |
| reddit_id | The Client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
| reddit_secret | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
| reddit_username | The reddit account name | The name you use to log in | pistoSniffer |
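To avoid pasting the credentials into the shell every time, one option is to keep them in environment variables and assemble the command from there. This is only a sketch: the variable names `REDDIT_ID`, `REDDIT_SECRET` and `REDDIT_USERNAME` and the helper `build_download_command` are assumptions made for illustration, while the command-line flags are the ones shown in the usage section.

```python
import os
from typing import List, Mapping


def build_download_command(
    subreddit: str,
    batch_size: int,
    laps: int,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Build the downloader command line, reading credentials from `env`.

    REDDIT_ID, REDDIT_SECRET and REDDIT_USERNAME are hypothetical names;
    a KeyError is raised if one of them is missing.
    """
    return [
        "python", "src/subreddit_downloader.py", subreddit,
        "--batch-size", str(batch_size),
        "--laps", str(laps),
        "--reddit-id", env["REDDIT_ID"],
        "--reddit-secret", env["REDDIT_SECRET"],
        "--reddit-username", env["REDDIT_USERNAME"],
    ]


# Placeholder values, mirroring the <...> parameters of the usage section
example_env = {
    "REDDIT_ID": "<reddit_id>",
    "REDDIT_SECRET": "<reddit_secret>",
    "REDDIT_USERNAME": "<reddit_username>",
}
print(" ".join(build_download_command("AskReddit", 10, 3, example_env)))
```

With real values exported in the environment, the returned list could be passed to `subprocess.run(..., check=True)` from the repository root.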
📖 Glossary
- subreddit: a section of the Reddit website focused on a particular topic
- submission: a post that appears in a subreddit. When you open a subreddit page, the posts you see are submissions. Each submission has a tree of comments.
- comment: text written by a Reddit user under a submission inside a subreddit
- The main goal of this repository is to gather the comments belonging to a subreddit
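The glossary terms can be pictured as a small data model: a submission holds top-level comments, and each comment may hold replies. The sketch below is illustrative only; the `Comment` and `Submission` classes and `iter_comment_texts` are hypothetical names, not the data structures used by the repository or by praw.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Comment:
    body: str
    replies: List["Comment"] = field(default_factory=list)


@dataclass
class Submission:
    title: str
    comments: List[Comment] = field(default_factory=list)


def iter_comment_texts(submission: Submission) -> Iterator[str]:
    """Yield every comment body in the tree, depth-first, parents before replies."""
    stack = list(reversed(submission.comments))
    while stack:
        comment = stack.pop()
        yield comment.body
        # Push replies in reverse so the first reply is visited next
        stack.extend(reversed(comment.replies))


post = Submission(
    title="Example submission",
    comments=[
        Comment("A", replies=[Comment("A1"), Comment("A2")]),
        Comment("B"),
    ],
)
print(list(iter_comment_texts(post)))  # ['A', 'A1', 'A2', 'B']
```

Flattening the tree this way turns each submission into a plain list of texts, which is the shape most NLP pipelines expect.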
✍ Notes
- Under the hood the script uses pushshift to gather submission ids, and praw to collect the submissions' comments
- More info about the subreddit_downloader.py script is available under the --help command
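The two-stage flow from the notes (first collect submission ids, then fetch the comments of each submission) can be sketched with stand-in functions. Everything here is an assumption made for illustration: `collect_comments`, `fake_ids` and `fake_comments` are hypothetical, no real pushshift or praw call is made, and the way the actual script paginates may differ. The only detail taken from the usage section is that `--batch-size 10 --laps 3` corresponds to 30 submissions.

```python
from typing import Callable, Dict, List, Optional


def collect_comments(
    fetch_submission_ids: Callable[[int, Optional[str]], List[str]],
    fetch_comments: Callable[[str], List[str]],
    batch_size: int,
    laps: int,
) -> Dict[str, List[str]]:
    """Run `laps` rounds, each asking for up to `batch_size` submission ids."""
    comments_by_submission: Dict[str, List[str]] = {}
    last_seen: Optional[str] = None
    for _ in range(laps):
        ids = fetch_submission_ids(batch_size, last_seen)
        if not ids:
            break  # nothing left to download
        for submission_id in ids:
            comments_by_submission[submission_id] = fetch_comments(submission_id)
        last_seen = ids[-1]  # the next lap continues after this id
    return comments_by_submission


# Stand-ins for the id source (stage 1) and the comment source (stage 2)
FAKE_IDS = [f"s{i}" for i in range(50)]


def fake_ids(limit: int, after: Optional[str]) -> List[str]:
    start = 0 if after is None else FAKE_IDS.index(after) + 1
    return FAKE_IDS[start:start + limit]


def fake_comments(submission_id: str) -> List[str]:
    return [f"comment on {submission_id}"]


result = collect_comments(fake_ids, fake_comments, batch_size=10, laps=3)
print(len(result))  # 30
```

Passing the two sources in as callables keeps the loop independent of any particular API client, so either stage could be swapped without touching the batching logic.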
🙏 Technologies
Each adventure brings some new discoveries with it. These are the technologies I used:
- plac - a very handy pip package to manage script arguments. Discarded because it can’t handle Python 3 typing
- typer - from the author of FastAPI, a script arguments manager based on the Python 3 typing system
- praw - Python wrapper for the official Reddit API
- loguru - logging manager
- PushshiftAPI - unofficial reddit API
- PrettyErrors - prettifies Python exception output to make it legible
- PyCharm - why mention this IDE? Because after an “incident” with git from the CLI, it restored my deleted file thanks to its internal history ♥