# ScrapeGraphAI

ScrapeGraphAI is an AI-powered tool for web crawling and data scraping.

- https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/docs/chinese.md
- https://github.com/ScrapeGraphAI/ScrapegraphLib-Examples
- https://github.com/ScrapeGraphAI/ScrapegraphLib-Examples/blob/main/extras/authenticated_playwright.py

## Reference

https://www.aivi.fyi/aiagents/introduce-ScrapeGraphAI+LangChain+LangGraph

## Dependencies

```
# Core library, plus the browser binaries Playwright needs
pip install scrapegraphai
playwright install
pip install --upgrade duckduckgo-search

# Optional extras (quote the whole requirement so the shell leaves the brackets alone)
pip install 'scrapegraphai[other-language-models]'
pip install 'scrapegraphai[more-semantic-options]'
pip install 'scrapegraphai[more-browser-options]'

# Pull a local model with Ollama and confirm it is listed
ollama pull mistral-nemo
ollama list
```
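
Once the packages above are installed and `mistral-nemo` has been pulled, a scraping run could be sketched roughly as follows. This assumes the `SmartScraperGraph` class and the `ollama/<model>` config layout shown in the ScrapeGraphAI README, and an Ollama server on its default local address. `build_graph_config` and `scrape` are hypothetical helper names introduced here for illustration.

```python
from typing import Any, Dict


def build_graph_config(
    model: str = "ollama/mistral-nemo",
    base_url: str = "http://localhost:11434",  # assumption: default Ollama address
) -> Dict[str, Any]:
    """Build a ScrapeGraphAI config dict for a locally served Ollama model."""
    return {
        "llm": {
            "model": model,
            "temperature": 0,    # deterministic output suits extraction tasks
            "format": "json",    # ask Ollama for JSON-formatted responses
            "base_url": base_url,
        },
        "verbose": False,
        "headless": True,        # run the browser without a visible window
    }


def scrape(prompt: str, source: str) -> Any:
    """Run one prompt against one page and return whatever the graph extracts."""
    # Imported lazily so the config helper works even without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt=prompt,
        source=source,
        config=build_graph_config(),
    )
    return graph.run()
```

For example, `scrape("List the page title and all headings.", "https://example.com")` would send the page content to the local model; the prompt and URL are placeholders, and the call needs a running Ollama server.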
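## Tips

- Comment
  - Calling the API of a small-parameter model is much cheaper than calling gpt-4o.
  - Playwright plus plugins can solve some CAPTCHAs. Add an LLM on top and they are mostly no longer a problem.
  - This repo is essentially a traditional crawler wrapped in an AI shell: AI handles the data-parsing step in place of the old hard-coded rules. Anti-scraping measures still have to be dealt with through IP proxies (residential IP providers work best) plus Playwright, or ChromeDriver and Selenium attached to a running Chrome process.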
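
The proxy and attach-to-Chrome approaches from the last tip can be sketched with Playwright's sync API. This is a minimal illustration, not part of ScrapeGraphAI itself: the helper names, the proxy address, and the debugging port `9222` are assumptions, and the attach variant expects Chrome to have been started with `--remote-debugging-port=9222`.

```python
from typing import Dict, Optional


def proxy_settings(
    server: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, str]:
    """Build the proxy dict accepted by Playwright's launch()."""
    settings = {"server": server}
    if username is not None:
        settings["username"] = username
    if password is not None:
        settings["password"] = password
    return settings


def fetch_via_proxy(url: str, proxy: Dict[str, str]) -> str:
    """Load a page in headless Chromium through the proxy and return its HTML."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=proxy)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def fetch_via_running_chrome(
    url: str,
    cdp_endpoint: str = "http://localhost:9222",  # assumption: local debug port
) -> str:
    """Attach to an already running Chrome over CDP and return the page HTML."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_endpoint)
        # Reuse the existing context so cookies and logins carry over.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        try:
            page.goto(url)
            return page.content()
        finally:
            page.close()  # close only our tab, leave the user's Chrome running
```

A residential proxy would be passed as, for instance, `fetch_via_proxy("https://example.com", proxy_settings("http://proxy.example.com:8000", "user", "pass"))`, where the address and credentials are placeholders. Attaching to a real Chrome session keeps the existing browser profile, which is the reason the tip suggests it for sites with anti-bot checks.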