Python Scripts
Before scraping or crawling public content from a channel or website, obtain permission from the copyright owners and comply with the site's terms of service.
Install Python from Python.org
- Check both "Install Launcher" and "Add Python to PATH"
- Click "Customize installation"; otherwise Python is installed into the hidden AppData folder, which is meant for application settings and data
- Check all but "install for all users" under Advanced Options
- Type python --version in Command Prompt to check the installation
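The same check can also be run from inside Python, which additionally shows which interpreter the PATH entry points to. This is a minimal sketch using only the standard library; the function name describe_python is made up for illustration.

```python
import shutil
import sys


def describe_python():
    """Return the running interpreter's version, its path, and what PATH resolves to."""
    return {
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "executable": sys.executable,
        # None suggests that "python" is not on PATH
        # (for example, "Add Python to PATH" was left unchecked).
        "python_on_path": shutil.which("python"),
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```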
Python script to extract video links and basic information of YouTube channels
- Osmosis YouTube Channel
- youtubescrapingos.
- channel ID obtained per StackOverflow
- go to the YouTube channel page
- view page source; check the box of "line wrap"
- search for "externalID"
- Armando Hasudungan YouTube Channel
- Script Youtubescraping.os
- Dr Yan Yu YouTube Channel
- enable YouTube Data API v3
- create credentials -> an API key (sufficient for public content; for private content, choose an OAuth client ID)
- install the client library from a local command line (such as Command Prompt on Windows): pip install google-api-python-client
- per ChatGPT, used the youtubescraping py file, but it only scraped about 50 records
- run the youtubescraping py file from the directory you navigated to
- the csv is saved under the Documents folder of user 13435 (not the same location as the Documents quick lookup); it contains both Shorts and regular videos
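A result of about 50 records is consistent with how the YouTube Data API pages its results: each list call returns at most 50 items plus a nextPageToken for the next page. The sketch below is not the original youtubescraping file. It is a minimal illustration, assuming the channel's videos are read from its "uploads" playlist, and the helper names (extract_channel_id, build_client, fetch_all_videos, save_rows) and the csv columns are hypothetical. The channel ID regex assumes the page source contains a key spelled externalId or externalID, so it ignores case.

```python
import csv
import re


def extract_channel_id(page_source):
    """Find the channel ID in a channel page's HTML source, or return None."""
    match = re.search(r'"externalId"\s*:\s*"([^"]+)"', page_source, re.IGNORECASE)
    return match.group(1) if match else None


def build_client(api_key):
    """Create a YouTube Data API v3 client (needs google-api-python-client)."""
    from googleapiclient.discovery import build  # imported lazily on purpose
    return build("youtube", "v3", developerKey=api_key)


def fetch_all_videos(youtube, channel_id):
    """Collect every item of the channel's uploads playlist, page by page."""
    channel = youtube.channels().list(part="contentDetails", id=channel_id).execute()
    uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    rows, page_token = [], None
    while True:
        page = youtube.playlistItems().list(
            part="snippet",
            playlistId=uploads,
            maxResults=50,         # largest page size the API accepts
            pageToken=page_token,  # None on the first request
        ).execute()
        for item in page.get("items", []):
            snippet = item["snippet"]
            video_id = snippet["resourceId"]["videoId"]
            rows.append({
                "title": snippet["title"],
                "published_at": snippet["publishedAt"],
                "url": f"https://www.youtube.com/watch?v={video_id}",
            })
        page_token = page.get("nextPageToken")
        if not page_token:  # no further pages
            return rows


def save_rows(rows, csv_path):
    """Write the collected rows to a csv file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "published_at", "url"])
        writer.writeheader()
        writer.writerows(rows)
```

Typical use would be rows = fetch_all_videos(build_client(API_KEY), channel_id) followed by save_rows(rows, "videos.csv"), where API_KEY is the key created in the credentials step and the file name is only an example.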
Bash commands
Directory navigation: cd directory
per Builtin in Oct 2022
- Default under Users/13435/Documents/Python_Scripts
cd Scripts/Python_Scripts
cd ..  # navigating up
python youtubescrapingos.py
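A script started this way resolves relative file names against the current directory, so the folder you cd into decides where a csv written with a relative name ends up. This small pathlib sketch makes that visible; the helper output_path and the file name videos.csv are hypothetical.

```python
from pathlib import Path


def output_path(filename, base_dir=None):
    """Return an absolute path for filename inside base_dir (default: current directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / filename).resolve()


if __name__ == "__main__":
    print("current directory:", Path.cwd())        # where cd left you
    print("one level up:", Path.cwd().parent)      # same target as cd ..
    print("csv would be written to:", output_path("videos.csv"))
```

Passing an explicit base_dir, such as a Documents folder, keeps the output location independent of where the script is launched from.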
Python script to extract website pages in a library
- First, check the website's terms of service page and its /robots.txt page to see what is permitted, per ChatGPT
- Install Beautiful Soup: pip install beautifulsoup4
- Then parse the pages under Osmosis.org/library to capture the information architecture of the MD and RN libraries, saving the result to a csv named "output"
- Osmosis.org allowed much more access than UpToDate.com
- UpToDate.com is comparatively less open; e.g. the Simplified Chinese version of Treatment of Psoriasis in Adults doesn't include the physiology part
- Merck Manuals - mostly disorders
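The steps above (check robots.txt, parse a page with Beautiful Soup, write a csv) can be sketched as follows. This is not the script that was actually run: the function names are hypothetical, the link extraction simply collects every anchor tag rather than the selectors a real library page would need, and downloading the page itself is left out.

```python
import csv
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every link in the page."""
    from bs4 import BeautifulSoup  # provided by: pip install beautifulsoup4
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]


def save_links(links, csv_path="output.csv"):
    """Write the (title, url) pairs to a csv file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "url"])
        writer.writerows(links)
```

A page would only be passed to extract_links after is_allowed returns True for its URL and the terms of service permit it.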
However, a dynamic website with lazy rendering may need Node.js scripts instead, and the selectors have to be identified first; this did not work out well enough.
Python virtual environment in VS Code
Per Jie Jenn in early 2023, a virtual environment is an isolated environment that lets you install Python packages and dependencies for a specific project without affecting other projects, which also makes the project easier to share.
- some use Bash as the terminal, PowerShell or Command Prompt also works
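A virtual environment is usually created from the terminal with python -m venv .venv, but the standard library's venv module can do the same from Python. The sketch below assumes the environment lives in a .venv folder inside the project; the function name create_project_env and that folder name are illustrative choices, not something the source prescribes.

```python
import venv
from pathlib import Path


def create_project_env(project_dir, env_name=".venv", with_pip=True):
    """Create an isolated environment inside project_dir and return its path."""
    env_path = Path(project_dir) / env_name
    venv.create(env_path, with_pip=with_pip)
    return env_path


if __name__ == "__main__":
    print("created:", create_project_env("."))
```

After creation, the environment is activated from the terminal: .venv\Scripts\activate in Command Prompt or PowerShell, or source .venv/bin/activate in Bash on Linux or macOS.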