[{"data":1,"prerenderedAt":2173},["ShallowReactive",2],{"page-\u002Fthe-complete-guide-to-python-web-scraping\u002F":3,"content-navigation":2024},{"id":4,"title":5,"body":6,"description":16,"extension":2018,"meta":2019,"navigation":222,"path":2020,"seo":2021,"stem":2022,"__hash__":2023},"content\u002Fthe-complete-guide-to-python-web-scraping\u002Findex.md","The Complete Guide to Python Web Scraping",{"type":7,"value":8,"toc":1974},"minimark",[9,13,17,20,25,28,33,45,49,52,56,73,77,80,84,99,103,126,130,144,570,577,581,584,588,600,604,607,611,626,906,913,917,920,924,927,931,934,938,949,1250,1257,1261,1264,1268,1278,1282,1285,1289,1292,1299,1303,1306,1310,1321,1325,1328,1332,1338,1556,1563,1567,1570,1574,1585,1589,1592,1596,1607,1847,1850,1854,1857,1861,1867,1871,1878,1882,1885,1889,1933,1937,1946,1952,1961,1970],[10,11,5],"h1",{"id":12},"the-complete-guide-to-python-web-scraping",[14,15,16],"p",{},"Web scraping is the automated process of extracting structured data from websites, and Python has become the industry standard due to its readability, extensive library ecosystem, and strong community support. This guide walks beginners and general developers through a complete, ethical, and scalable scraping workflow.",[14,18,19],{},"You will learn everything from initial environment configuration to final data validation. By following these foundational practices, you will build maintainable scripts that respect site policies, avoid common pitfalls, and deliver clean, actionable datasets.",[21,22,24],"h2",{"id":23},"_1-preparing-your-development-workspace","1. Preparing Your Development Workspace",[14,26,27],{},"Before writing extraction logic, developers must establish an isolated, reproducible workspace. 
This prevents dependency conflicts and ensures consistent behavior across different machines.",[29,30,32],"h3",{"id":31},"installing-python-and-pip","Installing Python and pip",[14,34,35,36,40,41,44],{},"Start by downloading the latest stable release of Python from the official website. Verify the installation by running ",[37,38,39],"code",{},"python --version"," in your terminal. The ",[37,42,43],{},"pip"," package manager is included by default and will handle all third-party library installations.",[29,46,48],{"id":47},"virtual-environments-explained","Virtual environments explained",[14,50,51],{},"Virtual environments create isolated directories for each project. This ensures that library versions do not interfere with your system Python or other projects. Always activate your environment before installing packages.",[29,53,55],{"id":54},"core-library-installation","Core library installation",[14,57,58,59,62,63,66,67,72],{},"Once your environment is active, install the foundational tools. You will primarily need ",[37,60,61],{},"requests"," for network communication and ",[37,64,65],{},"beautifulsoup4"," for HTML parsing. For a step-by-step walkthrough of installing dependencies and configuring your development tools, refer to ",[68,69,71],"a",{"href":70},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002F","Setting Up Your Python Scraping Environment",".",[21,74,76],{"id":75},"_2-how-the-web-communicates-http-fundamentals","2. How the Web Communicates: HTTP Fundamentals",[14,78,79],{},"Successful scraping relies on mimicking legitimate browser behavior and interpreting server feedback correctly. Understanding the underlying protocol prevents blocked requests and malformed data.",[29,81,83],{"id":82},"request-methods-get-post-put","Request methods (GET, POST, PUT)",[14,85,86,87,90,91,94,95,98],{},"The ",[37,88,89],{},"GET"," method retrieves data without modifying server state, making it ideal for scraping. 
",[37,92,93],{},"POST"," sends data to the server, often used for search forms or login submissions. ",[37,96,97],{},"PUT"," updates existing resources and is rarely needed for standard extraction tasks.",[29,100,102],{"id":101},"status-codes-and-headers","Status codes and headers",[14,104,105,106,109,110,113,114,117,118,121,122,125],{},"HTTP status codes indicate request outcomes. ",[37,107,108],{},"200"," means success, while ",[37,111,112],{},"403"," signals access denial and ",[37,115,116],{},"429"," indicates rate limiting. Headers like ",[37,119,120],{},"User-Agent"," and ",[37,123,124],{},"Accept-Language"," identify your client to the server. Omitting them often triggers anti-bot filters.",[29,127,129],{"id":128},"rate-limiting-and-retry-strategies","Rate limiting and retry strategies",[14,131,132,133,135,136,139,140,143],{},"Servers enforce request limits to maintain performance. Implement exponential backoff strategies when encountering ",[37,134,116],{}," or ",[37,137,138],{},"503"," responses. 
Always include a ",[37,141,142],{},"time.sleep()"," delay between requests to distribute load evenly.",[145,146,151],"pre",{"className":147,"code":148,"language":149,"meta":150,"style":150},"language-python shiki shiki-themes material-theme-lighter github-light github-dark","import requests\nimport time\nfrom requests.adapters import HTTPAdapter\nfrom urllib3.util.retry import Retry\n\ndef fetch_page(url: str) -> requests.Response:\n session = requests.Session()\n retry_strategy = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])\n session.mount(\"https:\u002F\u002F\", HTTPAdapter(max_retries=retry_strategy))\n \n headers = {\"User-Agent\": \"Mozilla\u002F5.0 (Windows NT 10.0; Win64; x64)\"}\n \n try:\n response = session.get(url, headers=headers, timeout=10)\n response.raise_for_status()\n return response\n except requests.exceptions.RequestException as e:\n print(f\"Request failed: {e}\")\n raise\n","python","",[37,152,153,166,174,194,217,224,266,287,356,397,403,433,438,446,488,501,510,536,564],{"__ignoreMap":150},[154,155,158,162],"span",{"class":156,"line":157},"line",1,[154,159,161],{"class":160},"sVHd0","import",[154,163,165],{"class":164},"su5hD"," requests\n",[154,167,169,171],{"class":156,"line":168},2,[154,170,161],{"class":160},[154,172,173],{"class":164}," time\n",[154,175,177,180,183,186,189,191],{"class":156,"line":176},3,[154,178,179],{"class":160},"from",[154,181,182],{"class":164}," requests",[154,184,72],{"class":185},"sP7_E",[154,187,188],{"class":164},"adapters ",[154,190,161],{"class":160},[154,192,193],{"class":164}," HTTPAdapter\n",[154,195,197,199,202,204,207,209,212,214],{"class":156,"line":196},4,[154,198,179],{"class":160},[154,200,201],{"class":164}," urllib3",[154,203,72],{"class":185},[154,205,206],{"class":164},"util",[154,208,72],{"class":185},[154,210,211],{"class":164},"retry ",[154,213,161],{"class":160},[154,215,216],{"class":164}," 
Retry\n",[154,218,220],{"class":156,"line":219},5,[154,221,223],{"emptyLinePlaceholder":222},true,"\n",[154,225,227,231,235,238,242,245,249,252,255,257,259,263],{"class":156,"line":226},6,[154,228,230],{"class":229},"sbsja","def",[154,232,234],{"class":233},"sGLFI"," fetch_page",[154,236,237],{"class":185},"(",[154,239,241],{"class":240},"sFwrP","url",[154,243,244],{"class":185},":",[154,246,248],{"class":247},"sZMiF"," str",[154,250,251],{"class":185},")",[154,253,254],{"class":185}," ->",[154,256,182],{"class":164},[154,258,72],{"class":185},[154,260,262],{"class":261},"skxfh","Response",[154,264,265],{"class":185},":\n",[154,267,269,272,276,278,280,284],{"class":156,"line":268},7,[154,270,271],{"class":164}," session ",[154,273,275],{"class":274},"smGrS","=",[154,277,182],{"class":164},[154,279,72],{"class":185},[154,281,283],{"class":282},"slqww","Session",[154,285,286],{"class":185},"()\n",[154,288,290,293,295,298,300,304,306,310,313,316,318,321,323,326,328,331,333,335,338,340,343,345,348,350,353],{"class":156,"line":289},8,[154,291,292],{"class":164}," retry_strategy ",[154,294,275],{"class":274},[154,296,297],{"class":282}," Retry",[154,299,237],{"class":185},[154,301,303],{"class":302},"s99_P","total",[154,305,275],{"class":274},[154,307,309],{"class":308},"srdBf","3",[154,311,312],{"class":185},",",[154,314,315],{"class":302}," backoff_factor",[154,317,275],{"class":274},[154,319,320],{"class":308},"1",[154,322,312],{"class":185},[154,324,325],{"class":302}," status_forcelist",[154,327,275],{"class":274},[154,329,330],{"class":185},"[",[154,332,116],{"class":308},[154,334,312],{"class":185},[154,336,337],{"class":308}," 500",[154,339,312],{"class":185},[154,341,342],{"class":308}," 502",[154,344,312],{"class":185},[154,346,347],{"class":308}," 503",[154,349,312],{"class":185},[154,351,352],{"class":308}," 
504",[154,354,355],{"class":185},"])\n",[154,357,359,362,364,367,369,373,377,379,381,384,386,389,391,394],{"class":156,"line":358},9,[154,360,361],{"class":164}," session",[154,363,72],{"class":185},[154,365,366],{"class":282},"mount",[154,368,237],{"class":185},[154,370,372],{"class":371},"sjJ54","\"",[154,374,376],{"class":375},"s_sjI","https:\u002F\u002F",[154,378,372],{"class":371},[154,380,312],{"class":185},[154,382,383],{"class":282}," HTTPAdapter",[154,385,237],{"class":185},[154,387,388],{"class":302},"max_retries",[154,390,275],{"class":274},[154,392,393],{"class":282},"retry_strategy",[154,395,396],{"class":185},"))\n",[154,398,400],{"class":156,"line":399},10,[154,401,402],{"class":164}," \n",[154,404,406,409,411,414,416,418,420,422,425,428,430],{"class":156,"line":405},11,[154,407,408],{"class":164}," headers ",[154,410,275],{"class":274},[154,412,413],{"class":185}," {",[154,415,372],{"class":371},[154,417,120],{"class":375},[154,419,372],{"class":371},[154,421,244],{"class":185},[154,423,424],{"class":371}," \"",[154,426,427],{"class":375},"Mozilla\u002F5.0 (Windows NT 10.0; Win64; x64)",[154,429,372],{"class":371},[154,431,432],{"class":185},"}\n",[154,434,436],{"class":156,"line":435},12,[154,437,402],{"class":164},[154,439,441,444],{"class":156,"line":440},13,[154,442,443],{"class":160}," try",[154,445,265],{"class":185},[154,447,449,452,454,456,458,461,463,465,467,470,472,475,477,480,482,485],{"class":156,"line":448},14,[154,450,451],{"class":164}," response ",[154,453,275],{"class":274},[154,455,361],{"class":164},[154,457,72],{"class":185},[154,459,460],{"class":282},"get",[154,462,237],{"class":185},[154,464,241],{"class":282},[154,466,312],{"class":185},[154,468,469],{"class":302}," headers",[154,471,275],{"class":274},[154,473,474],{"class":282},"headers",[154,476,312],{"class":185},[154,478,479],{"class":302}," 
timeout",[154,481,275],{"class":274},[154,483,484],{"class":308},"10",[154,486,487],{"class":185},")\n",[154,489,491,494,496,499],{"class":156,"line":490},15,[154,492,493],{"class":164}," response",[154,495,72],{"class":185},[154,497,498],{"class":282},"raise_for_status",[154,500,286],{"class":185},[154,502,504,507],{"class":156,"line":503},16,[154,505,506],{"class":160}," return",[154,508,509],{"class":164}," response\n",[154,511,513,516,518,520,523,525,528,531,534],{"class":156,"line":512},17,[154,514,515],{"class":160}," except",[154,517,182],{"class":164},[154,519,72],{"class":185},[154,521,522],{"class":261},"exceptions",[154,524,72],{"class":185},[154,526,527],{"class":261},"RequestException",[154,529,530],{"class":160}," as",[154,532,533],{"class":164}," e",[154,535,265],{"class":185},[154,537,539,543,545,548,551,554,557,560,562],{"class":156,"line":538},18,[154,540,542],{"class":541},"sptTA"," print",[154,544,237],{"class":185},[154,546,547],{"class":229},"f",[154,549,550],{"class":375},"\"Request failed: ",[154,552,553],{"class":308},"{",[154,555,556],{"class":282},"e",[154,558,559],{"class":308},"}",[154,561,372],{"class":375},[154,563,487],{"class":185},[154,565,567],{"class":156,"line":566},19,[154,568,569],{"class":160}," raise\n",[14,571,572,573,72],{},"A deep dive into the mechanics of client-server communication is available in ",[68,574,576],{"href":575},"\u002Fthe-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002F","Understanding HTTP Requests and Responses",[21,578,580],{"id":579},"_3-fetching-and-parsing-web-content","3. Fetching and Parsing Web Content",[14,582,583],{},"Once a page is downloaded, the raw HTML must be transformed into a navigable structure. 
This allows your script to query specific elements efficiently without overcomplicating your logic.",[29,585,587],{"id":586},"using-the-requests-library","Using the Requests library",[14,589,86,590,592,593,595,596,599],{},[37,591,61],{}," library handles connection pooling, SSL verification, and automatic decoding. It returns a ",[37,594,262],{}," object containing the raw HTML string in the ",[37,597,598],{},".text"," attribute.",[29,601,603],{"id":602},"dom-tree-structure","DOM tree structure",[14,605,606],{},"The Document Object Model (DOM) represents HTML as a hierarchical tree of nodes. Each tag becomes a parent, child, or sibling element. Parsers traverse this tree to locate target data points.",[29,608,610],{"id":609},"selecting-elements-by-tag-class-and-id","Selecting elements by tag, class, and ID",[14,612,613,614,617,618,621,622,625],{},"CSS selectors provide a concise syntax for targeting nodes. Use ",[37,615,616],{},"#id"," for unique elements, ",[37,619,620],{},".class"," for grouped items, and ",[37,623,624],{},"tag"," for structural containers. 
Combine them for precise extraction paths.",[145,627,629],{"className":147,"code":628,"language":149,"meta":150,"style":150},"from bs4 import BeautifulSoup\n\ndef extract_product_data(html_content: str) -> list[dict]:\n soup = BeautifulSoup(html_content, \"html.parser\")\n products = []\n \n for item in soup.select(\"div.product-card\"):\n name_tag = item.select_one(\"h2.product-title\")\n price_tag = item.select_one(\"span.price\")\n \n if name_tag and price_tag:\n products.append({\n \"name\": name_tag.get_text(strip=True),\n \"price\": price_tag.get_text(strip=True)\n })\n \n return products\n",[37,630,631,643,647,678,703,713,717,748,774,798,802,817,830,863,890,895,899],{"__ignoreMap":150},[154,632,633,635,638,640],{"class":156,"line":157},[154,634,179],{"class":160},[154,636,637],{"class":164}," bs4 ",[154,639,161],{"class":160},[154,641,642],{"class":164}," BeautifulSoup\n",[154,644,645],{"class":156,"line":168},[154,646,223],{"emptyLinePlaceholder":222},[154,648,649,651,654,656,659,661,663,665,667,670,672,675],{"class":156,"line":176},[154,650,230],{"class":229},[154,652,653],{"class":233}," extract_product_data",[154,655,237],{"class":185},[154,657,658],{"class":240},"html_content",[154,660,244],{"class":185},[154,662,248],{"class":247},[154,664,251],{"class":185},[154,666,254],{"class":185},[154,668,669],{"class":164}," list",[154,671,330],{"class":185},[154,673,674],{"class":247},"dict",[154,676,677],{"class":185},"]:\n",[154,679,680,683,685,688,690,692,694,696,699,701],{"class":156,"line":196},[154,681,682],{"class":164}," soup ",[154,684,275],{"class":274},[154,686,687],{"class":282}," BeautifulSoup",[154,689,237],{"class":185},[154,691,658],{"class":282},[154,693,312],{"class":185},[154,695,424],{"class":371},[154,697,698],{"class":375},"html.parser",[154,700,372],{"class":371},[154,702,487],{"class":185},[154,704,705,708,710],{"class":156,"line":219},[154,706,707],{"class":164}," products ",[154,709,275],{"class":274},[154,711,712],{"class":185}," 
[]\n",[154,714,715],{"class":156,"line":226},[154,716,402],{"class":164},[154,718,719,722,725,728,731,733,736,738,740,743,745],{"class":156,"line":268},[154,720,721],{"class":160}," for",[154,723,724],{"class":164}," item ",[154,726,727],{"class":160},"in",[154,729,730],{"class":164}," soup",[154,732,72],{"class":185},[154,734,735],{"class":282},"select",[154,737,237],{"class":185},[154,739,372],{"class":371},[154,741,742],{"class":375},"div.product-card",[154,744,372],{"class":371},[154,746,747],{"class":185},"):\n",[154,749,750,753,755,758,760,763,765,767,770,772],{"class":156,"line":289},[154,751,752],{"class":164}," name_tag ",[154,754,275],{"class":274},[154,756,757],{"class":164}," item",[154,759,72],{"class":185},[154,761,762],{"class":282},"select_one",[154,764,237],{"class":185},[154,766,372],{"class":371},[154,768,769],{"class":375},"h2.product-title",[154,771,372],{"class":371},[154,773,487],{"class":185},[154,775,776,779,781,783,785,787,789,791,794,796],{"class":156,"line":358},[154,777,778],{"class":164}," price_tag ",[154,780,275],{"class":274},[154,782,757],{"class":164},[154,784,72],{"class":185},[154,786,762],{"class":282},[154,788,237],{"class":185},[154,790,372],{"class":371},[154,792,793],{"class":375},"span.price",[154,795,372],{"class":371},[154,797,487],{"class":185},[154,799,800],{"class":156,"line":399},[154,801,402],{"class":164},[154,803,804,807,809,812,815],{"class":156,"line":405},[154,805,806],{"class":160}," if",[154,808,752],{"class":164},[154,810,811],{"class":274},"and",[154,813,814],{"class":164}," price_tag",[154,816,265],{"class":185},[154,818,819,822,824,827],{"class":156,"line":435},[154,820,821],{"class":164}," 
products",[154,823,72],{"class":185},[154,825,826],{"class":282},"append",[154,828,829],{"class":185},"({\n",[154,831,832,834,837,839,841,844,846,849,851,854,856,860],{"class":156,"line":440},[154,833,424],{"class":371},[154,835,836],{"class":375},"name",[154,838,372],{"class":371},[154,840,244],{"class":185},[154,842,843],{"class":282}," name_tag",[154,845,72],{"class":185},[154,847,848],{"class":282},"get_text",[154,850,237],{"class":185},[154,852,853],{"class":302},"strip",[154,855,275],{"class":274},[154,857,859],{"class":858},"s39Yj","True",[154,861,862],{"class":185},"),\n",[154,864,865,867,870,872,874,876,878,880,882,884,886,888],{"class":156,"line":448},[154,866,424],{"class":371},[154,868,869],{"class":375},"price",[154,871,372],{"class":371},[154,873,244],{"class":185},[154,875,814],{"class":282},[154,877,72],{"class":185},[154,879,848],{"class":282},[154,881,237],{"class":185},[154,883,853],{"class":302},[154,885,275],{"class":274},[154,887,859],{"class":858},[154,889,487],{"class":185},[154,891,892],{"class":156,"line":490},[154,893,894],{"class":185}," })\n",[154,896,897],{"class":156,"line":503},[154,898,402],{"class":164},[154,900,901,903],{"class":156,"line":512},[154,902,506],{"class":160},[154,904,905],{"class":164}," products\n",[14,907,908,909,72],{},"For comprehensive syntax examples and CSS selector strategies, see ",[68,910,912],{"href":911},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002F","Parsing HTML with BeautifulSoup",[21,914,916],{"id":915},"_4-advanced-text-extraction-techniques","4. Advanced Text Extraction Techniques",[14,918,919],{},"Not all valuable data resides in clean HTML tags. Sometimes, information is embedded in raw strings, JavaScript variables, or poorly formatted markup.",[29,921,923],{"id":922},"pattern-matching-basics","Pattern matching basics",[14,925,926],{},"Regular expressions (regex) allow you to define search patterns using special character sequences. 
They excel at extracting consistent formats like dates, IDs, or contact details from unstructured text.",[29,928,930],{"id":929},"regex-vs-dom-parsing","Regex vs. DOM parsing",[14,932,933],{},"DOM parsing is safer for structural data. Regex should only supplement parsing when dealing with inline scripts, meta tags, or malformed HTML. Overusing regex on complex markup leads to fragile code.",[29,935,937],{"id":936},"handling-unstructured-or-embedded-text","Handling unstructured or embedded text",[14,939,940,941,944,945,948],{},"Use the ",[37,942,943],{},"re"," module to compile patterns once and reuse them across calls. Prefer non-greedy quantifiers (",[37,946,947],{},".*?",") when a pattern could otherwise match past the intended delimiter. Validate matches before storing them.",[145,950,952],{"className":147,"code":951,"language":149,"meta":150,"style":150},"import re\n\ndef extract_contact_info(text: str) -> dict:\n email_pattern = re.compile(r\"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+\")\n phone_pattern = re.compile(r\"\\b(?:\\+?\\d{1,3}[-.\\s]?)?\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}\\b\")\n \n emails = email_pattern.findall(text)\n phones = phone_pattern.findall(text)\n \n return {\"emails\": list(set(emails)), \"phones\": list(set(phones))}\n",[37,953,954,961,965,990,1054,1150,1154,1175,1195,1199],{"__ignoreMap":150},[154,955,956,958],{"class":156,"line":157},[154,957,161],{"class":160},[154,959,960],{"class":164}," re\n",[154,962,963],{"class":156,"line":168},[154,964,223],{"emptyLinePlaceholder":222},[154,966,967,969,972,974,977,979,981,983,985,988],{"class":156,"line":176},[154,968,230],{"class":229},[154,970,971],{"class":233}," extract_contact_info",[154,973,237],{"class":185},[154,975,976],{"class":240},"text",[154,978,244],{"class":185},[154,980,248],{"class":247},[154,982,251],{"class":185},[154,984,254],{"class":185},[154,986,987],{"class":247}," 
dict",[154,989,265],{"class":185},[154,991,992,995,997,1000,1002,1005,1007,1010,1012,1014,1018,1021,1024,1028,1030,1033,1035,1037,1041,1043,1046,1048,1050,1052],{"class":156,"line":196},[154,993,994],{"class":164}," email_pattern ",[154,996,275],{"class":274},[154,998,999],{"class":164}," re",[154,1001,72],{"class":185},[154,1003,1004],{"class":282},"compile",[154,1006,237],{"class":185},[154,1008,1009],{"class":229},"r",[154,1011,372],{"class":371},[154,1013,330],{"class":858},[154,1015,1017],{"class":1016},"stzsN","a-zA-Z0-9_.+-",[154,1019,1020],{"class":858},"]",[154,1022,1023],{"class":274},"+",[154,1025,1027],{"class":1026},"sQRbd","@",[154,1029,330],{"class":858},[154,1031,1032],{"class":1016},"a-zA-Z0-9-",[154,1034,1020],{"class":858},[154,1036,1023],{"class":274},[154,1038,1040],{"class":1039},"sjYin","\\.",[154,1042,330],{"class":858},[154,1044,1045],{"class":1016},"a-zA-Z0-9-.",[154,1047,1020],{"class":858},[154,1049,1023],{"class":274},[154,1051,372],{"class":371},[154,1053,487],{"class":185},[154,1055,1056,1059,1061,1063,1065,1067,1069,1071,1073,1076,1079,1082,1085,1088,1091,1093,1096,1098,1100,1102,1104,1107,1109,1111,1114,1117,1119,1121,1123,1125,1127,1129,1131,1133,1135,1137,1139,1141,1144,1146,1148],{"class":156,"line":219},[154,1057,1058],{"class":164}," phone_pattern 
",[154,1060,275],{"class":274},[154,1062,999],{"class":164},[154,1064,72],{"class":185},[154,1066,1004],{"class":282},[154,1068,237],{"class":185},[154,1070,1009],{"class":229},[154,1072,372],{"class":371},[154,1074,1075],{"class":1016},"\\b",[154,1077,1078],{"class":858},"(?:",[154,1080,1081],{"class":1039},"\\+",[154,1083,1084],{"class":274},"?",[154,1086,1087],{"class":1016},"\\d",[154,1089,1090],{"class":274},"{1,3}",[154,1092,330],{"class":858},[154,1094,1095],{"class":1016},"-.\\s",[154,1097,1020],{"class":858},[154,1099,1084],{"class":274},[154,1101,251],{"class":858},[154,1103,1084],{"class":274},[154,1105,1106],{"class":1039},"\\(",[154,1108,1084],{"class":274},[154,1110,1087],{"class":1016},[154,1112,1113],{"class":274},"{3}",[154,1115,1116],{"class":1039},"\\)",[154,1118,1084],{"class":274},[154,1120,330],{"class":858},[154,1122,1095],{"class":1016},[154,1124,1020],{"class":858},[154,1126,1084],{"class":274},[154,1128,1087],{"class":1016},[154,1130,1113],{"class":274},[154,1132,330],{"class":858},[154,1134,1095],{"class":1016},[154,1136,1020],{"class":858},[154,1138,1084],{"class":274},[154,1140,1087],{"class":1016},[154,1142,1143],{"class":274},"{4}",[154,1145,1075],{"class":1016},[154,1147,372],{"class":371},[154,1149,487],{"class":185},[154,1151,1152],{"class":156,"line":226},[154,1153,402],{"class":164},[154,1155,1156,1159,1161,1164,1166,1169,1171,1173],{"class":156,"line":268},[154,1157,1158],{"class":164}," emails ",[154,1160,275],{"class":274},[154,1162,1163],{"class":164}," email_pattern",[154,1165,72],{"class":185},[154,1167,1168],{"class":282},"findall",[154,1170,237],{"class":185},[154,1172,976],{"class":282},[154,1174,487],{"class":185},[154,1176,1177,1180,1182,1185,1187,1189,1191,1193],{"class":156,"line":289},[154,1178,1179],{"class":164}," phones ",[154,1181,275],{"class":274},[154,1183,1184],{"class":164}," 
phone_pattern",[154,1186,72],{"class":185},[154,1188,1168],{"class":282},[154,1190,237],{"class":185},[154,1192,976],{"class":282},[154,1194,487],{"class":185},[154,1196,1197],{"class":156,"line":358},[154,1198,402],{"class":164},[154,1200,1201,1203,1205,1207,1210,1212,1214,1216,1218,1221,1223,1225,1228,1230,1233,1235,1237,1239,1241,1243,1245,1247],{"class":156,"line":399},[154,1202,506],{"class":160},[154,1204,413],{"class":185},[154,1206,372],{"class":371},[154,1208,1209],{"class":375},"emails",[154,1211,372],{"class":371},[154,1213,244],{"class":185},[154,1215,669],{"class":247},[154,1217,237],{"class":185},[154,1219,1220],{"class":247},"set",[154,1222,237],{"class":185},[154,1224,1209],{"class":282},[154,1226,1227],{"class":185},")),",[154,1229,424],{"class":371},[154,1231,1232],{"class":375},"phones",[154,1234,372],{"class":371},[154,1236,244],{"class":185},[154,1238,669],{"class":247},[154,1240,237],{"class":185},[154,1242,1220],{"class":247},[154,1244,237],{"class":185},[154,1246,1232],{"class":282},[154,1248,1249],{"class":185},"))}\n",[14,1251,1252,1253,72],{},"Mastering these techniques is covered extensively in ",[68,1254,1256],{"href":1255},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002F","Extracting Data with Regular Expressions",[21,1258,1260],{"id":1259},"_5-scaling-across-multiple-pages","5. Scaling Across Multiple Pages",[14,1262,1263],{},"Real-world datasets rarely fit on a single page. Scrapers must programmatically navigate through paginated lists, query string offsets, or simulate user scrolling.",[29,1265,1267],{"id":1266},"url-parameter-manipulation","URL parameter manipulation",[14,1269,1270,1271,135,1274,1277],{},"Many sites use query parameters like ",[37,1272,1273],{},"?page=2",[37,1275,1276],{},"?offset=50"," for pagination. 
Extract the base URL and increment these values in a loop until a page returns no new records.",[29,1279,1281],{"id":1280},"detecting-next-page-tokens","Detecting next-page tokens",[14,1283,1284],{},"Some platforms use opaque tokens or cursor-based pagination. Inspect network traffic to locate these values in API responses or hidden form fields. Send each token with the following request so traversal resumes where the previous page ended.",[29,1286,1288],{"id":1287},"scroll-based-content-loading","Scroll-based content loading",[14,1290,1291],{},"Infinite scroll triggers JavaScript to fetch additional data dynamically. Identify the background API endpoints using browser developer tools. Calling these endpoints directly is usually faster and more reliable than simulating scroll events.",[14,1293,1294,1295,72],{},"Strategies for automating multi-page traversal while maintaining request efficiency are detailed in ",[68,1296,1298],{"href":1297},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002F","Handling Pagination and Infinite Scroll",[21,1300,1302],{"id":1301},"_6-maintaining-state-and-authentication","6. Maintaining State and Authentication",[14,1304,1305],{},"Many target sites require user authentication or track browsing state across multiple requests. Proper state management prevents session drops and redundant logins.",[29,1307,1309],{"id":1308},"session-objects-vs-standalone-requests","Session objects vs. standalone requests",[14,1311,1312,1313,1316,1317,1320],{},"Standalone ",[37,1314,1315],{},"requests.get()"," calls open a new connection each time and share no cookies. ",[37,1318,1319],{},"requests.Session()"," persists cookies and default headers and reuses connections across requests, which reduces overhead and more closely resembles real browser behavior.",[29,1322,1324],{"id":1323},"cookie-persistence","Cookie persistence",[14,1326,1327],{},"Cookies store session identifiers, preferences, and tracking data. Sessions automatically attach relevant cookies to subsequent requests. 
Manually exporting them is rarely necessary unless migrating to a different environment.",[29,1329,1331],{"id":1330},"login-form-automation","Login form automation",[14,1333,1334,1335,1337],{},"Identify the form action URL and required payload fields. Submit credentials via a ",[37,1336,93],{}," request through a session object. Verify success by checking for redirect URLs or authenticated dashboard elements.",[145,1339,1341],{"className":147,"code":1340,"language":149,"meta":150,"style":150},"import requests\n\ndef authenticated_session(login_url: str, credentials: dict) -> requests.Session:\n session = requests.Session()\n \n # Load initial cookies\n session.get(login_url)\n \n # Submit login form\n response = session.post(login_url, data=credentials)\n response.raise_for_status()\n \n # Verify authentication\n if \"dashboard\" in response.url or response.status_code == 200:\n return session\n else:\n raise ValueError(\"Authentication failed. Check credentials.\")\n",[37,1342,1343,1349,1353,1390,1404,1408,1414,1428,1432,1437,1466,1476,1480,1485,1523,1530,1537],{"__ignoreMap":150},[154,1344,1345,1347],{"class":156,"line":157},[154,1346,161],{"class":160},[154,1348,165],{"class":164},[154,1350,1351],{"class":156,"line":168},[154,1352,223],{"emptyLinePlaceholder":222},[154,1354,1355,1357,1360,1362,1365,1367,1369,1371,1374,1376,1378,1380,1382,1384,1386,1388],{"class":156,"line":176},[154,1356,230],{"class":229},[154,1358,1359],{"class":233}," authenticated_session",[154,1361,237],{"class":185},[154,1363,1364],{"class":240},"login_url",[154,1366,244],{"class":185},[154,1368,248],{"class":247},[154,1370,312],{"class":185},[154,1372,1373],{"class":240}," 
credentials",[154,1375,244],{"class":185},[154,1377,987],{"class":247},[154,1379,251],{"class":185},[154,1381,254],{"class":185},[154,1383,182],{"class":164},[154,1385,72],{"class":185},[154,1387,283],{"class":261},[154,1389,265],{"class":185},[154,1391,1392,1394,1396,1398,1400,1402],{"class":156,"line":196},[154,1393,271],{"class":164},[154,1395,275],{"class":274},[154,1397,182],{"class":164},[154,1399,72],{"class":185},[154,1401,283],{"class":282},[154,1403,286],{"class":185},[154,1405,1406],{"class":156,"line":219},[154,1407,402],{"class":164},[154,1409,1410],{"class":156,"line":226},[154,1411,1413],{"class":1412},"sutJx"," # Load initial cookies\n",[154,1415,1416,1418,1420,1422,1424,1426],{"class":156,"line":268},[154,1417,361],{"class":164},[154,1419,72],{"class":185},[154,1421,460],{"class":282},[154,1423,237],{"class":185},[154,1425,1364],{"class":282},[154,1427,487],{"class":185},[154,1429,1430],{"class":156,"line":289},[154,1431,402],{"class":164},[154,1433,1434],{"class":156,"line":358},[154,1435,1436],{"class":1412}," # Submit login form\n",[154,1438,1439,1441,1443,1445,1447,1450,1452,1454,1456,1459,1461,1464],{"class":156,"line":399},[154,1440,451],{"class":164},[154,1442,275],{"class":274},[154,1444,361],{"class":164},[154,1446,72],{"class":185},[154,1448,1449],{"class":282},"post",[154,1451,237],{"class":185},[154,1453,1364],{"class":282},[154,1455,312],{"class":185},[154,1457,1458],{"class":302}," data",[154,1460,275],{"class":274},[154,1462,1463],{"class":282},"credentials",[154,1465,487],{"class":185},[154,1467,1468,1470,1472,1474],{"class":156,"line":405},[154,1469,493],{"class":164},[154,1471,72],{"class":185},[154,1473,498],{"class":282},[154,1475,286],{"class":185},[154,1477,1478],{"class":156,"line":435},[154,1479,402],{"class":164},[154,1481,1482],{"class":156,"line":440},[154,1483,1484],{"class":1412}," # Verify 
authentication\n",[154,1486,1487,1489,1491,1494,1496,1499,1501,1503,1505,1508,1510,1512,1515,1518,1521],{"class":156,"line":448},[154,1488,806],{"class":160},[154,1490,424],{"class":371},[154,1492,1493],{"class":375},"dashboard",[154,1495,372],{"class":371},[154,1497,1498],{"class":274}," in",[154,1500,493],{"class":164},[154,1502,72],{"class":185},[154,1504,241],{"class":261},[154,1506,1507],{"class":274}," or",[154,1509,493],{"class":164},[154,1511,72],{"class":185},[154,1513,1514],{"class":261},"status_code",[154,1516,1517],{"class":274}," ==",[154,1519,1520],{"class":308}," 200",[154,1522,265],{"class":185},[154,1524,1525,1527],{"class":156,"line":490},[154,1526,506],{"class":160},[154,1528,1529],{"class":164}," session\n",[154,1531,1532,1535],{"class":156,"line":503},[154,1533,1534],{"class":160}," else",[154,1536,265],{"class":185},[154,1538,1539,1542,1545,1547,1549,1552,1554],{"class":156,"line":512},[154,1540,1541],{"class":160}," raise",[154,1543,1544],{"class":247}," ValueError",[154,1546,237],{"class":185},[154,1548,372],{"class":371},[154,1550,1551],{"class":375},"Authentication failed. Check credentials.",[154,1553,372],{"class":371},[154,1555,487],{"class":185},[14,1557,1558,1559,72],{},"For implementation details on stateful browsing, consult ",[68,1560,1562],{"href":1561},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fmanaging-cookies-and-sessions\u002F","Managing Cookies and Sessions",[21,1564,1566],{"id":1565},"_7-post-processing-and-data-storage","7. Post-Processing and Data Storage",[14,1568,1569],{},"Raw scraped data is rarely production-ready. It requires normalization, type casting, and quality checks before integration into downstream applications.",[29,1571,1573],{"id":1572},"removing-duplicates-and-nulls","Removing duplicates and nulls",[14,1575,1576,1577,1580,1581,1584],{},"Use Python sets or pandas ",[37,1578,1579],{},"drop_duplicates()"," to eliminate redundant records. 
Filter out ",[37,1582,1583],{},"None"," values or empty strings early in the pipeline to prevent downstream errors.",[29,1586,1588],{"id":1587},"schema-validation-with-pydantic","Schema validation with Pydantic",[14,1590,1591],{},"Pydantic enforces data types and required fields at runtime. Define models that match your expected output. Invalid records trigger clear validation errors instead of silent failures.",[29,1593,1595],{"id":1594},"exporting-to-csv-json-and-databases","Exporting to CSV, JSON, and databases",[14,1597,1598,1599,1602,1603,1606],{},"Serialize validated data using standard libraries. Write to CSV for spreadsheet compatibility, JSON for API consumption, or use ",[37,1600,1601],{},"sqlite3","\u002F",[37,1604,1605],{},"SQLAlchemy"," for relational storage. Always append incrementally to avoid overwrites.",[145,1608,1610],{"className":147,"code":1609,"language":149,"meta":150,"style":150},"from pydantic import BaseModel, ValidationError\nfrom typing import Optional\n\nclass Product(BaseModel):\n name: str\n price: float\n sku: Optional[str] = None\n\ndef validate_and_store(raw_data: list[dict]) -> list[Product]:\n validated = []\n for item in raw_data:\n try:\n product = Product(**item)\n validated.append(product)\n except ValidationError as e:\n print(f\"Skipping invalid record: {e}\")\n return validated\n",[37,1611,1612,1629,1641,1645,1661,1671,1681,1704,1708,1742,1751,1764,1770,1789,1805,1819,1840],{"__ignoreMap":150},[154,1613,1614,1616,1619,1621,1624,1626],{"class":156,"line":157},[154,1615,179],{"class":160},[154,1617,1618],{"class":164}," pydantic ",[154,1620,161],{"class":160},[154,1622,1623],{"class":164}," BaseModel",[154,1625,312],{"class":185},[154,1627,1628],{"class":164}," ValidationError\n",[154,1630,1631,1633,1636,1638],{"class":156,"line":168},[154,1632,179],{"class":160},[154,1634,1635],{"class":164}," typing ",[154,1637,161],{"class":160},[154,1639,1640],{"class":164}," 
Optional\n",[154,1642,1643],{"class":156,"line":176},[154,1644,223],{"emptyLinePlaceholder":222},[154,1646,1647,1650,1654,1656,1659],{"class":156,"line":196},[154,1648,1649],{"class":229},"class",[154,1651,1653],{"class":1652},"sbgvK"," Product",[154,1655,237],{"class":185},[154,1657,1658],{"class":1652},"BaseModel",[154,1660,747],{"class":185},[154,1662,1663,1666,1668],{"class":156,"line":219},[154,1664,1665],{"class":164}," name",[154,1667,244],{"class":185},[154,1669,1670],{"class":247}," str\n",[154,1672,1673,1676,1678],{"class":156,"line":226},[154,1674,1675],{"class":164}," price",[154,1677,244],{"class":185},[154,1679,1680],{"class":247}," float\n",[154,1682,1683,1686,1688,1691,1693,1696,1698,1701],{"class":156,"line":268},[154,1684,1685],{"class":164}," sku",[154,1687,244],{"class":185},[154,1689,1690],{"class":164}," Optional",[154,1692,330],{"class":185},[154,1694,1695],{"class":247},"str",[154,1697,1020],{"class":185},[154,1699,1700],{"class":274}," =",[154,1702,1703],{"class":858}," None\n",[154,1705,1706],{"class":156,"line":289},[154,1707,223],{"emptyLinePlaceholder":222},[154,1709,1710,1712,1715,1717,1720,1722,1724,1726,1728,1731,1733,1735,1737,1740],{"class":156,"line":358},[154,1711,230],{"class":229},[154,1713,1714],{"class":233}," validate_and_store",[154,1716,237],{"class":185},[154,1718,1719],{"class":240},"raw_data",[154,1721,244],{"class":185},[154,1723,669],{"class":164},[154,1725,330],{"class":185},[154,1727,674],{"class":247},[154,1729,1730],{"class":185},"])",[154,1732,254],{"class":185},[154,1734,669],{"class":164},[154,1736,330],{"class":185},[154,1738,1739],{"class":164},"Product",[154,1741,677],{"class":185},[154,1743,1744,1747,1749],{"class":156,"line":399},[154,1745,1746],{"class":164}," validated ",[154,1748,275],{"class":274},[154,1750,712],{"class":185},[154,1752,1753,1755,1757,1759,1762],{"class":156,"line":405},[154,1754,721],{"class":160},[154,1756,724],{"class":164},[154,1758,727],{"class":160},[154,1760,1761],{"class":164}," 
raw_data",[154,1763,265],{"class":185},[154,1765,1766,1768],{"class":156,"line":435},[154,1767,443],{"class":160},[154,1769,265],{"class":185},[154,1771,1772,1775,1777,1779,1781,1784,1787],{"class":156,"line":440},[154,1773,1774],{"class":164}," product ",[154,1776,275],{"class":274},[154,1778,1653],{"class":282},[154,1780,237],{"class":185},[154,1782,1783],{"class":274},"**",[154,1785,1786],{"class":282},"item",[154,1788,487],{"class":185},[154,1790,1791,1794,1796,1798,1800,1803],{"class":156,"line":448},[154,1792,1793],{"class":164}," validated",[154,1795,72],{"class":185},[154,1797,826],{"class":282},[154,1799,237],{"class":185},[154,1801,1802],{"class":282},"product",[154,1804,487],{"class":185},[154,1806,1807,1809,1812,1815,1817],{"class":156,"line":490},[154,1808,515],{"class":160},[154,1810,1811],{"class":164}," ValidationError ",[154,1813,1814],{"class":160},"as",[154,1816,533],{"class":164},[154,1818,265],{"class":185},[154,1820,1821,1823,1825,1827,1830,1832,1834,1836,1838],{"class":156,"line":503},[154,1822,542],{"class":541},[154,1824,237],{"class":185},[154,1826,547],{"class":229},[154,1828,1829],{"class":375},"\"Skipping invalid record: ",[154,1831,553],{"class":308},[154,1833,556],{"class":282},[154,1835,559],{"class":308},[154,1837,372],{"class":375},[154,1839,487],{"class":185},[154,1841,1842,1844],{"class":156,"line":512},[154,1843,506],{"class":160},[154,1845,1846],{"class":164}," validated\n",[14,1848,1849],{},"Building robust transformation workflows is the focus of Data Cleaning and Validation Pipelines.",[21,1851,1853],{"id":1852},"_8-ethical-guidelines-and-legal-compliance","8. Ethical Guidelines and Legal Compliance",[14,1855,1856],{},"Responsible scraping is non-negotiable for long-term project viability. 
Automation must balance data acquisition with server health and legal boundaries.",[29,1858,1860],{"id":1859},"respecting-robotstxt","Respecting robots.txt",[14,1862,86,1863,1866],{},[37,1864,1865],{},"robots.txt"," file specifies which paths crawlers may access. Always parse this file before deployment. Ignoring it violates webmaster guidelines and increases ban risk.",[29,1868,1870],{"id":1869},"implementing-polite-delays","Implementing polite delays",[14,1872,1873,1874,1877],{},"Aggressive request bursts degrade site performance for legitimate users. Add randomized delays between 2 and 5 seconds. Use asynchronous libraries like ",[37,1875,1876],{},"aiohttp"," only when paired with strict concurrency limits.",[29,1879,1881],{"id":1880},"copyright-and-data-usage-laws","Copyright and data usage laws",[14,1883,1884],{},"Publicly accessible data is not always free to use commercially. Respect intellectual property rights, avoid scraping personal information without consent, and review terms of service. When in doubt, seek explicit permission or legal counsel.",[21,1886,1888],{"id":1887},"common-pitfalls-to-avoid","Common Pitfalls to Avoid",[1890,1891,1892,1903,1909,1915,1927],"ul",{},[1893,1894,1895,1899,1900,1902],"li",{},[1896,1897,1898],"strong",{},"Ignoring rate limits and triggering IP bans:"," Always implement delays and exponential backoff. Monitor ",[37,1901,116],{}," status codes closely.",[1893,1904,1905,1908],{},[1896,1906,1907],{},"Hardcoding URLs instead of parsing dynamic pagination parameters:"," Build flexible URL generators that adapt to changing query strings or API endpoints.",[1893,1910,1911,1914],{},[1896,1912,1913],{},"Attempting to parse complex HTML structures with regex alone:"," Regex breaks easily on nested markup. 
Use DOM parsers for structural queries and reserve regex for inline text.",[1893,1916,1917,1920,1921,1923,1924,1926],{},[1896,1918,1919],{},"Failing to implement fallback logic for missing or malformed elements:"," Always check if selectors return ",[37,1922,1583],{}," before calling ",[37,1925,598],{}," or accessing attributes.",[1893,1928,1929,1932],{},[1896,1930,1931],{},"Neglecting to check robots.txt and site terms of service before deployment:"," Compliance prevents legal exposure and ensures sustainable data access.",[21,1934,1936],{"id":1935},"frequently-asked-questions","Frequently Asked Questions",[14,1938,1939,1942,1943,1945],{},[1896,1940,1941],{},"Is web scraping legal in Python?","\nWeb scraping is generally legal when applied to publicly available data, provided you respect copyright laws, avoid bypassing authentication without permission, and comply with a site's ",[37,1944,1865],{}," and terms of service. Always prioritize ethical scraping practices and consult legal counsel for sensitive or commercial use cases.",[14,1947,1948,1951],{},[1896,1949,1950],{},"Should I use BeautifulSoup or Scrapy for my project?","\nBeautifulSoup is ideal for beginners and lightweight scripts that parse static HTML pages. Scrapy is better suited for large-scale, production-grade crawlers requiring built-in concurrency, middleware pipelines, and automated request scheduling.",[14,1953,1954,1957,1958,1960],{},[1896,1955,1956],{},"How do I avoid getting blocked while scraping?","\nImplement respectful delays between requests, rotate user-agent strings, use session management to mimic real browsers, respect ",[37,1959,1865],{}," directives, and consider using residential proxies if scaling to enterprise levels.",[14,1962,1963,1966,1967,1969],{},[1896,1964,1965],{},"Can Python scrape JavaScript-rendered websites?","\nYes, but standard HTTP clients like ",[37,1968,61],{}," cannot execute JavaScript. 
For dynamic sites, use headless browser automation tools like Playwright or Selenium, or reverse-engineer the underlying API endpoints that populate the frontend data.",[1971,1972,1973],"style",{},"html pre.shiki code .sVHd0, html code.shiki .sVHd0{--shiki-light:#39ADB5;--shiki-light-font-style:italic;--shiki-default:#D73A49;--shiki-default-font-style:inherit;--shiki-dark:#F97583;--shiki-dark-font-style:inherit}html pre.shiki code .su5hD, html code.shiki .su5hD{--shiki-light:#90A4AE;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sP7_E, html code.shiki .sP7_E{--shiki-light:#39ADB5;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sbsja, html code.shiki .sbsja{--shiki-light:#9C3EDA;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .sGLFI, html code.shiki .sGLFI{--shiki-light:#6182B8;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sFwrP, html code.shiki .sFwrP{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#24292E;--shiki-default-font-style:inherit;--shiki-dark:#E1E4E8;--shiki-dark-font-style:inherit}html pre.shiki code .sZMiF, html code.shiki .sZMiF{--shiki-light:#E2931D;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .skxfh, html code.shiki .skxfh{--shiki-light:#E53935;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .smGrS, html code.shiki .smGrS{--shiki-light:#39ADB5;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .slqww, html code.shiki .slqww{--shiki-light:#6182B8;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .s99_P, html code.shiki .s99_P{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#E36209;--shiki-default-font-style:inherit;--shiki-dark:#FFAB70;--shiki-dark-font-style:inherit}html pre.shiki code .srdBf, html code.shiki .srdBf{--shiki-light:#F76D47;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sjJ54, html code.shiki 
.sjJ54{--shiki-light:#39ADB5;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .s_sjI, html code.shiki .s_sjI{--shiki-light:#91B859;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .sptTA, html code.shiki .sptTA{--shiki-light:#6182B8;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .s39Yj, html code.shiki .s39Yj{--shiki-light:#39ADB5;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .stzsN, html code.shiki .stzsN{--shiki-light:#91B859;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sQRbd, html code.shiki 
.sQRbd{--shiki-light:#91B859;--shiki-default:#032F62;--shiki-dark:#DBEDFF}html pre.shiki code .sjYin, html code.shiki .sjYin{--shiki-light:#90A4AE;--shiki-light-font-weight:inherit;--shiki-default:#22863A;--shiki-default-font-weight:bold;--shiki-dark:#85E89D;--shiki-dark-font-weight:bold}html pre.shiki code .sutJx, html code.shiki .sutJx{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#6A737D;--shiki-default-font-style:inherit;--shiki-dark:#6A737D;--shiki-dark-font-style:inherit}html pre.shiki code .sbgvK, html code.shiki .sbgvK{--shiki-light:#E2931D;--shiki-default:#6F42C1;--shiki-dark:#B392F0}",{"title":150,"searchDepth":168,"depth":168,"links":1975},[1976,1981,1986,1991,1996,2001,2006,2011,2016,2017],{"id":23,"depth":168,"text":24,"children":1977},[1978,1979,1980],{"id":31,"depth":176,"text":32},{"id":47,"depth":176,"text":48},{"id":54,"depth":176,"text":55},{"id":75,"depth":168,"text":76,"children":1982},[1983,1984,1985],{"id":82,"depth":176,"text":83},{"id":101,"depth":176,"text":102},{"id":128,"depth":176,"text":129},{"id":579,"depth":168,"text":580,"children":1987},[1988,1989,1990],{"id":586,"depth":176,"text":587},{"id":602,"depth":176,"text":603},{"id":609,"depth":176,"text":610},{"id":915,"depth":168,"text":916,"children":1992},[1993,1994,1995],{"id":922,"depth":176,"text":923},{"id":929,"depth":176,"text":930},{"id":936,"depth":176,"text":937},{"id":1259,"depth":168,"text":1260,"children":1997},[1998,1999,2000],{"id":1266,"depth":176,"text":1267},{"id":1280,"depth":176,"text":1281},{"id":1287,"depth":176,"text":1288},{"id":1301,"depth":168,"text":1302,"children":2002},[2003,2004,2005],{"id":1308,"depth":176,"text":1309},{"id":1323,"depth":176,"text":1324},{"id":1330,"depth":176,"text":1331},{"id":1565,"depth":168,"text":1566,"children":2007},[2008,2009,2010],{"id":1572,"depth":176,"text":1573},{"id":1587,"depth":176,"text":1588},{"id":1594,"depth":176,"text":1595},{"id":1852,"depth":168,"text":1853,"children":2012},[2013,2014,2015],{
"id":1859,"depth":176,"text":1860},{"id":1869,"depth":176,"text":1870},{"id":1880,"depth":176,"text":1881},{"id":1887,"depth":168,"text":1888},{"id":1935,"depth":168,"text":1936},"md",{},"\u002Fthe-complete-guide-to-python-web-scraping",{"title":5,"description":16},"the-complete-guide-to-python-web-scraping\u002Findex","sUTEUBd8KcS3ctKPZZp_ud7ys3RiWuYXFtlA4Em9Sr4",[2025,2075,2105],{"title":2026,"path":2027,"stem":2028,"children":2029,"page":-1},"Advanced Scraping Techniques Anti Bot Evasion","\u002Fadvanced-scraping-techniques-anti-bot-evasion","advanced-scraping-techniques-anti-bot-evasion",[2030,2033,2039,2051,2063],{"title":2031,"path":2027,"stem":2032},"Advanced Scraping Techniques & Anti-Bot Evasion","advanced-scraping-techniques-anti-bot-evasion\u002Findex",{"title":2034,"path":2035,"stem":2036,"children":2037},"Bypassing Cloudflare and Akamai Protections in Python","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fbypassing-cloudflare-and-akamai-protections","advanced-scraping-techniques-anti-bot-evasion\u002Fbypassing-cloudflare-and-akamai-protections\u002Findex",[2038],{"title":2034,"path":2035,"stem":2036},{"title":2040,"path":2041,"stem":2042,"children":2043,"page":-1},"Mastering Selenium for Dynamic Websites","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites","advanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Findex",[2044,2045],{"title":2040,"path":2041,"stem":2042},{"title":2046,"path":2047,"stem":2048,"children":2049},"How to Configure Selenium Stealth to Avoid 
Detection","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Fhow-to-configure-selenium-stealth-to-avoid-detection","advanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Fhow-to-configure-selenium-stealth-to-avoid-detection\u002Findex",[2050],{"title":2046,"path":2047,"stem":2048},{"title":2052,"path":2053,"stem":2054,"children":2055,"page":-1},"Rotating Proxies and Managing IP Blocks","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks","advanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Findex",[2056,2057],{"title":2052,"path":2053,"stem":2054},{"title":2058,"path":2059,"stem":2060,"children":2061},"Best Free and Paid Proxy Providers for Scraping: A Python Developer's Guide","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Fbest-free-and-paid-proxy-providers-for-scraping","advanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Fbest-free-and-paid-proxy-providers-for-scraping\u002Findex",[2062],{"title":2058,"path":2059,"stem":2060},{"title":2064,"path":2065,"stem":2066,"children":2067},"Using Playwright for Modern Web Automation","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation","advanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Findex",[2068,2069],{"title":2064,"path":2065,"stem":2066},{"title":2070,"path":2071,"stem":2072,"children":2073},"Playwright vs Selenium: Performance Benchmarks for Python 
Scrapers","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Fplaywright-vs-selenium-performance-benchmarks","advanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Fplaywright-vs-selenium-performance-benchmarks\u002Findex",[2074],{"title":2070,"path":2071,"stem":2072},{"title":2076,"path":2077,"stem":2078,"children":2079},"Legal, Ethical & Compliance in Web Scraping","\u002Flegal-ethical-compliance-in-web-scraping","legal-ethical-compliance-in-web-scraping\u002Findex",[2080,2081,2093],{"title":2076,"path":2077,"stem":2078},{"title":2082,"path":2083,"stem":2084,"children":2085,"page":-1},"Navigating Copyright and Fair Use Laws in Python Web Scraping","\u002Flegal-ethical-compliance-in-web-scraping\u002Fnavigating-copyright-and-fair-use-laws","legal-ethical-compliance-in-web-scraping\u002Fnavigating-copyright-and-fair-use-laws\u002Findex",[2086,2087],{"title":2082,"path":2083,"stem":2084},{"title":2088,"path":2089,"stem":2090,"children":2091},"How to Read and Interpret Robots.txt Files","\u002Flegal-ethical-compliance-in-web-scraping\u002Fnavigating-copyright-and-fair-use-laws\u002Fhow-to-read-and-interpret-robotstxt-files","legal-ethical-compliance-in-web-scraping\u002Fnavigating-copyright-and-fair-use-laws\u002Fhow-to-read-and-interpret-robotstxt-files\u002Findex",[2092],{"title":2088,"path":2089,"stem":2090},{"title":2094,"path":2095,"stem":2096,"children":2097},"Understanding Robots.txt and Sitemap Rules for Python Web Scraping","\u002Flegal-ethical-compliance-in-web-scraping\u002Funderstanding-robotstxt-and-sitemap-rules","legal-ethical-compliance-in-web-scraping\u002Funderstanding-robotstxt-and-sitemap-rules\u002Findex",[2098,2099],{"title":2094,"path":2095,"stem":2096},{"title":2100,"path":2101,"stem":2102,"children":2103},"Is Web Scraping Legal in the US and EU? 
A Python Developer’s Compliance Guide","\u002Flegal-ethical-compliance-in-web-scraping\u002Funderstanding-robotstxt-and-sitemap-rules\u002Fis-web-scraping-legal-in-the-us-and-eu","legal-ethical-compliance-in-web-scraping\u002Funderstanding-robotstxt-and-sitemap-rules\u002Fis-web-scraping-legal-in-the-us-and-eu\u002Findex",[2104],{"title":2100,"path":2101,"stem":2102},{"title":2106,"path":2020,"stem":12,"children":2107,"page":-1},"The Complete Guide To Python Web Scraping",[2108,2109,2121,2133,2139,2151,2162],{"title":5,"path":2020,"stem":2022},{"title":2110,"path":2111,"stem":2112,"children":2113,"page":-1},"Extracting Data with Regular Expressions in Python","\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions","the-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Findex",[2114,2115],{"title":2110,"path":2111,"stem":2112},{"title":2116,"path":2117,"stem":2118,"children":2119},"Fixing Common Unicode Errors in Python Scraping","\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Ffixing-common-unicode-errors-in-python-scraping","the-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Ffixing-common-unicode-errors-in-python-scraping\u002Findex",[2120],{"title":2116,"path":2117,"stem":2118},{"title":2122,"path":2123,"stem":2124,"children":2125,"page":-1},"Handling Pagination and Infinite Scroll in Python Web Scraping","\u002Fthe-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll","the-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Findex",[2126,2127],{"title":2122,"path":2123,"stem":2124},{"title":2128,"path":2129,"stem":2130,"children":2131},"How to Scrape a Static Website Without Getting 
Blocked","\u002Fthe-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Fhow-to-scrape-a-static-website-without-getting-blocked","the-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Fhow-to-scrape-a-static-website-without-getting-blocked\u002Findex",[2132],{"title":2128,"path":2129,"stem":2130},{"title":2134,"path":2135,"stem":2136,"children":2137},"Managing Cookies and Sessions in Python Web Scraping","\u002Fthe-complete-guide-to-python-web-scraping\u002Fmanaging-cookies-and-sessions","the-complete-guide-to-python-web-scraping\u002Fmanaging-cookies-and-sessions\u002Findex",[2138],{"title":2134,"path":2135,"stem":2136},{"title":2140,"path":2141,"stem":2142,"children":2143,"page":-1},"Parsing HTML with BeautifulSoup: A Practical Guide","\u002Fthe-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup","the-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Findex",[2144,2145],{"title":2140,"path":2141,"stem":2142},{"title":2146,"path":2147,"stem":2148,"children":2149},"BeautifulSoup vs LXML: Which Parser is Faster?","\u002Fthe-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Fbeautifulsoup-vs-lxml-which-parser-is-faster","the-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Fbeautifulsoup-vs-lxml-which-parser-is-faster\u002Findex",[2150],{"title":2146,"path":2147,"stem":2148},{"title":71,"path":2152,"stem":2153,"children":2154,"page":-1},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment","the-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Findex",[2155,2156],{"title":71,"path":2152,"stem":2153},{"title":2157,"path":2158,"stem":2159,"children":2160},"How to Install Python and Requests for 
Beginners","\u002Fthe-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Fhow-to-install-python-and-requests-for-beginners","the-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Fhow-to-install-python-and-requests-for-beginners\u002Findex",[2161],{"title":2157,"path":2158,"stem":2159},{"title":576,"path":2163,"stem":2164,"children":2165},"\u002Fthe-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses","the-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Findex",[2166,2167],{"title":576,"path":2163,"stem":2164},{"title":2168,"path":2169,"stem":2170,"children":2171},"Step-by-Step Guide to Extracting Tables from HTML","\u002Fthe-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Fstep-by-step-guide-to-extracting-tables-from-html","the-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Fstep-by-step-guide-to-extracting-tables-from-html\u002Findex",[2172],{"title":2168,"path":2169,"stem":2170},1777978431761]