[{"data":1,"prerenderedAt":1534},["ShallowReactive",2],{"page-\u002Fscaling-python-web-scrapers\u002Fstoring-and-exporting-scraped-data\u002F":3,"content-navigation":1381},{"id":4,"title":5,"body":6,"description":1374,"extension":1375,"meta":1376,"navigation":191,"path":1377,"seo":1378,"stem":1379,"__hash__":1380},"content\u002Fscaling-python-web-scrapers\u002Fstoring-and-exporting-scraped-data\u002Findex.md","Storing and Exporting Scraped Data",{"type":7,"value":8,"toc":1365},"minimark",[9,13,28,31,36,39,128,131,135,145,545,549,552,912,924,928,934,1147,1159,1163,1182,1289,1293,1327,1331,1337,1343,1355,1361],[10,11,5],"h1",{"id":12},"storing-and-exporting-scraped-data",[14,15,16,17,22,23,27],"p",{},"Extraction is only half of a scraping pipeline; the data still has to land somewhere clean, queryable, and durable. The wrong storage choice shows up later as duplicate records, corrupted exports, or a multi-hour crawl lost to a crash. This guide covers how to pick a storage format, validate records before writing, deduplicate, and persist incrementally so progress survives failures. 
It is the final stage of the ",[18,19,21],"a",{"href":20},"\u002Fscaling-python-web-scrapers\u002F","Scaling & Deploying Python Web Scrapers"," workflow and follows naturally from cleaning data in ",[18,24,26],{"href":25},"\u002Fthe-complete-guide-to-python-web-scraping\u002F","The Complete Guide to Python Web Scraping",".",[29,30],"diagram-storage-decision",{},[32,33,35],"h2",{"id":34},"choosing-a-storage-format","Choosing a Storage Format",[14,37,38],{},"Match the sink to the volume of data and how it will be consumed:",[40,41,42,58],"table",{},[43,44,45],"thead",{},[46,47,48,52,55],"tr",{},[49,50,51],"th",{},"Format",[49,53,54],{},"Best for",[49,56,57],{},"Trade-offs",[59,60,61,76,89,102,115],"tbody",{},[46,62,63,70,73],{},[64,65,66],"td",{},[67,68,69],"strong",{},"CSV",[64,71,72],{},"Small, flat datasets; spreadsheet hand-off",[64,74,75],{},"No nesting; type information is lost",[46,77,78,83,86],{},[64,79,80],{},[67,81,82],{},"JSON \u002F JSON Lines",[64,84,85],{},"Nested records; API consumption",[64,87,88],{},"Larger files; not columnar",[46,90,91,96,99],{},[64,92,93],{},[67,94,95],{},"SQLite",[64,97,98],{},"Local structured storage, dedup, queries",[64,100,101],{},"Single-writer; not for heavy concurrency",[46,103,104,109,112],{},[64,105,106],{},[67,107,108],{},"PostgreSQL",[64,110,111],{},"Large, concurrent, relational workloads",[64,113,114],{},"Requires a running server",[46,116,117,122,125],{},[64,118,119],{},[67,120,121],{},"Parquet",[64,123,124],{},"Large analytical datasets",[64,126,127],{},"Columnar; needs pandas\u002Fpyarrow to read",[14,129,130],{},"A useful rule of thumb: CSV or JSON Lines for one-off exports, SQLite for local projects that need querying and deduplication, and PostgreSQL or Parquet once data volume or analytics demands outgrow a single file.",[32,132,134],{"id":133},"validate-before-you-store","Validate Before You Store",[14,136,137,138,144],{},"Storing unvalidated scraped data poisons everything downstream. 
Define a schema and reject or flag malformed records at the boundary. ",[18,139,143],{"href":140,"rel":141},"https:\u002F\u002Fdocs.pydantic.dev\u002F",[142],"nofollow","Pydantic"," makes this concise and gives clear errors instead of silent corruption.",[146,147,152],"pre",{"className":148,"code":149,"language":150,"meta":151,"style":151},"language-python shiki shiki-themes material-theme-lighter github-light github-dark","from pydantic import BaseModel, ValidationError, field_validator\n\nclass Book(BaseModel):\n    title: str\n    price: float\n    in_stock: bool = True\n\n    @field_validator(\"price\", mode=\"before\")\n    @classmethod\n    def clean_price(cls, v):\n        return float(str(v).replace(\"£\", \"\").strip())\n\ndef validate(rows: list[dict]) -> list[Book]:\n    clean = []\n    for row in rows:\n        try:\n            clean.append(Book(**row))\n        except ValidationError as exc:\n            print(f\"Skipping invalid record: {exc}\")\n    return clean\n","python","",[153,154,155,186,193,213,226,237,256,261,303,311,333,381,386,427,438,456,464,490,507,536],"code",{"__ignoreMap":151},[156,157,160,164,168,171,174,178,181,183],"span",{"class":158,"line":159},"line",1,[156,161,163],{"class":162},"sVHd0","from",[156,165,167],{"class":166},"su5hD"," pydantic ",[156,169,170],{"class":162},"import",[156,172,173],{"class":166}," BaseModel",[156,175,177],{"class":176},"sP7_E",",",[156,179,180],{"class":166}," ValidationError",[156,182,177],{"class":176},[156,184,185],{"class":166}," field_validator\n",[156,187,189],{"class":158,"line":188},2,[156,190,192],{"emptyLinePlaceholder":191},true,"\n",[156,194,196,200,204,207,210],{"class":158,"line":195},3,[156,197,199],{"class":198},"sbsja","class",[156,201,203],{"class":202},"sbgvK"," Book",[156,205,206],{"class":176},"(",[156,208,209],{"class":202},"BaseModel",[156,211,212],{"class":176},"):\n",[156,214,216,219,222],{"class":158,"line":215},4,[156,217,218],{"class":166},"    
title",[156,220,221],{"class":176},":",[156,223,225],{"class":224},"sZMiF"," str\n",[156,227,229,232,234],{"class":158,"line":228},5,[156,230,231],{"class":166},"    price",[156,233,221],{"class":176},[156,235,236],{"class":224}," float\n",[156,238,240,243,245,248,252],{"class":158,"line":239},6,[156,241,242],{"class":166},"    in_stock",[156,244,221],{"class":176},[156,246,247],{"class":224}," bool",[156,249,251],{"class":250},"smGrS"," =",[156,253,255],{"class":254},"s39Yj"," True\n",[156,257,259],{"class":158,"line":258},7,[156,260,192],{"emptyLinePlaceholder":191},[156,262,264,268,272,274,278,282,284,286,290,293,295,298,300],{"class":158,"line":263},8,[156,265,267],{"class":266},"stp6e","    @",[156,269,271],{"class":270},"sGLFI","field_validator",[156,273,206],{"class":176},[156,275,277],{"class":276},"sjJ54","\"",[156,279,281],{"class":280},"s_sjI","price",[156,283,277],{"class":276},[156,285,177],{"class":176},[156,287,289],{"class":288},"s99_P"," mode",[156,291,292],{"class":250},"=",[156,294,277],{"class":276},[156,296,297],{"class":280},"before",[156,299,277],{"class":276},[156,301,302],{"class":176},")\n",[156,304,306,308],{"class":158,"line":305},9,[156,307,267],{"class":266},[156,309,310],{"class":224},"classmethod\n",[156,312,314,317,320,322,326,328,331],{"class":158,"line":313},10,[156,315,316],{"class":198},"    def",[156,318,319],{"class":270}," clean_price",[156,321,206],{"class":176},[156,323,325],{"class":324},"sFwrP","cls",[156,327,177],{"class":176},[156,329,330],{"class":324}," v",[156,332,212],{"class":176},[156,334,336,339,342,344,347,349,353,356,359,361,363,366,368,370,373,375,378],{"class":158,"line":335},11,[156,337,338],{"class":162},"        return",[156,340,341],{"class":224}," 
float",[156,343,206],{"class":176},[156,345,346],{"class":224},"str",[156,348,206],{"class":176},[156,350,352],{"class":351},"slqww","v",[156,354,355],{"class":176},").",[156,357,358],{"class":351},"replace",[156,360,206],{"class":176},[156,362,277],{"class":276},[156,364,365],{"class":280},"£",[156,367,277],{"class":276},[156,369,177],{"class":176},[156,371,372],{"class":276}," \"\"",[156,374,355],{"class":176},[156,376,377],{"class":351},"strip",[156,379,380],{"class":176},"())\n",[156,382,384],{"class":158,"line":383},12,[156,385,192],{"emptyLinePlaceholder":191},[156,387,389,392,395,397,400,402,405,408,411,414,417,419,421,424],{"class":158,"line":388},13,[156,390,391],{"class":198},"def",[156,393,394],{"class":270}," validate",[156,396,206],{"class":176},[156,398,399],{"class":324},"rows",[156,401,221],{"class":176},[156,403,404],{"class":166}," list",[156,406,407],{"class":176},"[",[156,409,410],{"class":224},"dict",[156,412,413],{"class":176},"])",[156,415,416],{"class":176}," ->",[156,418,404],{"class":166},[156,420,407],{"class":176},[156,422,423],{"class":166},"Book",[156,425,426],{"class":176},"]:\n",[156,428,430,433,435],{"class":158,"line":429},14,[156,431,432],{"class":166},"    clean ",[156,434,292],{"class":250},[156,436,437],{"class":176}," []\n",[156,439,441,444,447,450,453],{"class":158,"line":440},15,[156,442,443],{"class":162},"    for",[156,445,446],{"class":166}," row ",[156,448,449],{"class":162},"in",[156,451,452],{"class":166}," rows",[156,454,455],{"class":176},":\n",[156,457,459,462],{"class":158,"line":458},16,[156,460,461],{"class":162},"        try",[156,463,455],{"class":176},[156,465,467,470,472,475,477,479,481,484,487],{"class":158,"line":466},17,[156,468,469],{"class":166},"            
clean",[156,471,27],{"class":176},[156,473,474],{"class":351},"append",[156,476,206],{"class":176},[156,478,423],{"class":351},[156,480,206],{"class":176},[156,482,483],{"class":250},"**",[156,485,486],{"class":351},"row",[156,488,489],{"class":176},"))\n",[156,491,493,496,499,502,505],{"class":158,"line":492},18,[156,494,495],{"class":162},"        except",[156,497,498],{"class":166}," ValidationError ",[156,500,501],{"class":162},"as",[156,503,504],{"class":166}," exc",[156,506,455],{"class":176},[156,508,510,514,516,519,522,526,529,532,534],{"class":158,"line":509},19,[156,511,513],{"class":512},"sptTA","            print",[156,515,206],{"class":176},[156,517,518],{"class":198},"f",[156,520,521],{"class":280},"\"Skipping invalid record: ",[156,523,525],{"class":524},"srdBf","{",[156,527,528],{"class":351},"exc",[156,530,531],{"class":524},"}",[156,533,277],{"class":280},[156,535,302],{"class":176},[156,537,539,542],{"class":158,"line":538},20,[156,540,541],{"class":162},"    return",[156,543,544],{"class":166}," clean\n",[32,546,548],{"id":547},"writing-csv-and-json-lines","Writing CSV and JSON Lines",[14,550,551],{},"For flat data, stream rows to CSV with the standard library. 
JSON Lines (one JSON object per line) is preferable to a single JSON array for large datasets because it can be appended to and read line by line without loading the whole file.",[146,553,555],{"className":148,"code":554,"language":150,"meta":151,"style":151},"import csv, json\n\ndef to_csv(books: list[Book], path: str) -> None:\n    with open(path, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n        writer = csv.DictWriter(f, fieldnames=[\"title\", \"price\", \"in_stock\"])\n        writer.writeheader()\n        for b in books:\n            writer.writerow(b.model_dump())\n\ndef to_jsonl(books: list[Book], path: str) -> None:\n    with open(path, \"a\", encoding=\"utf-8\") as f:           # append-friendly\n        for b in books:\n            f.write(json.dumps(b.model_dump()) + \"\\n\")\n",[153,556,557,569,573,614,671,725,738,753,775,779,814,856,868],{"__ignoreMap":151},[156,558,559,561,564,566],{"class":158,"line":159},[156,560,170],{"class":162},[156,562,563],{"class":166}," csv",[156,565,177],{"class":176},[156,567,568],{"class":166}," json\n",[156,570,571],{"class":158,"line":188},[156,572,192],{"emptyLinePlaceholder":191},[156,574,575,577,580,582,585,587,589,591,593,596,599,601,604,607,609,612],{"class":158,"line":195},[156,576,391],{"class":198},[156,578,579],{"class":270}," to_csv",[156,581,206],{"class":176},[156,583,584],{"class":324},"books",[156,586,221],{"class":176},[156,588,404],{"class":166},[156,590,407],{"class":176},[156,592,423],{"class":166},[156,594,595],{"class":176},"],",[156,597,598],{"class":324}," path",[156,600,221],{"class":176},[156,602,603],{"class":224}," str",[156,605,606],{"class":176},")",[156,608,416],{"class":176},[156,610,611],{"class":254}," None",[156,613,455],{"class":176},[156,615,616,619,622,624,627,629,632,635,637,639,642,644,647,649,652,654,656,659,661,663,666,669],{"class":158,"line":215},[156,617,618],{"class":162},"    with",[156,620,621],{"class":512}," 
open",[156,623,206],{"class":176},[156,625,626],{"class":351},"path",[156,628,177],{"class":176},[156,630,631],{"class":276}," \"",[156,633,634],{"class":280},"w",[156,636,277],{"class":276},[156,638,177],{"class":176},[156,640,641],{"class":288}," newline",[156,643,292],{"class":250},[156,645,646],{"class":276},"\"\"",[156,648,177],{"class":176},[156,650,651],{"class":288}," encoding",[156,653,292],{"class":250},[156,655,277],{"class":276},[156,657,658],{"class":280},"utf-8",[156,660,277],{"class":276},[156,662,606],{"class":176},[156,664,665],{"class":162}," as",[156,667,668],{"class":166}," f",[156,670,455],{"class":176},[156,672,673,676,678,680,682,685,687,689,691,694,696,698,700,703,705,707,709,711,713,715,717,720,722],{"class":158,"line":228},[156,674,675],{"class":166},"        writer ",[156,677,292],{"class":250},[156,679,563],{"class":166},[156,681,27],{"class":176},[156,683,684],{"class":351},"DictWriter",[156,686,206],{"class":176},[156,688,518],{"class":351},[156,690,177],{"class":176},[156,692,693],{"class":288}," fieldnames",[156,695,292],{"class":250},[156,697,407],{"class":176},[156,699,277],{"class":276},[156,701,702],{"class":280},"title",[156,704,277],{"class":276},[156,706,177],{"class":176},[156,708,631],{"class":276},[156,710,281],{"class":280},[156,712,277],{"class":276},[156,714,177],{"class":176},[156,716,631],{"class":276},[156,718,719],{"class":280},"in_stock",[156,721,277],{"class":276},[156,723,724],{"class":176},"])\n",[156,726,727,730,732,735],{"class":158,"line":239},[156,728,729],{"class":166},"        writer",[156,731,27],{"class":176},[156,733,734],{"class":351},"writeheader",[156,736,737],{"class":176},"()\n",[156,739,740,743,746,748,751],{"class":158,"line":258},[156,741,742],{"class":162},"        for",[156,744,745],{"class":166}," b ",[156,747,449],{"class":162},[156,749,750],{"class":166}," books",[156,752,455],{"class":176},[156,754,755,758,760,763,765,768,770,773],{"class":158,"line":263},[156,756,757],{"class":166},"       
     writer",[156,759,27],{"class":176},[156,761,762],{"class":351},"writerow",[156,764,206],{"class":176},[156,766,767],{"class":351},"b",[156,769,27],{"class":176},[156,771,772],{"class":351},"model_dump",[156,774,380],{"class":176},[156,776,777],{"class":158,"line":305},[156,778,192],{"emptyLinePlaceholder":191},[156,780,781,783,786,788,790,792,794,796,798,800,802,804,806,808,810,812],{"class":158,"line":313},[156,782,391],{"class":198},[156,784,785],{"class":270}," to_jsonl",[156,787,206],{"class":176},[156,789,584],{"class":324},[156,791,221],{"class":176},[156,793,404],{"class":166},[156,795,407],{"class":176},[156,797,423],{"class":166},[156,799,595],{"class":176},[156,801,598],{"class":324},[156,803,221],{"class":176},[156,805,603],{"class":224},[156,807,606],{"class":176},[156,809,416],{"class":176},[156,811,611],{"class":254},[156,813,455],{"class":176},[156,815,816,818,820,822,824,826,828,830,832,834,836,838,840,842,844,846,848,850,852],{"class":158,"line":335},[156,817,618],{"class":162},[156,819,621],{"class":512},[156,821,206],{"class":176},[156,823,626],{"class":351},[156,825,177],{"class":176},[156,827,631],{"class":276},[156,829,18],{"class":280},[156,831,277],{"class":276},[156,833,177],{"class":176},[156,835,651],{"class":288},[156,837,292],{"class":250},[156,839,277],{"class":276},[156,841,658],{"class":280},[156,843,277],{"class":276},[156,845,606],{"class":176},[156,847,665],{"class":162},[156,849,668],{"class":166},[156,851,221],{"class":176},[156,853,855],{"class":854},"sutJx","           # append-friendly\n",[156,857,858,860,862,864,866],{"class":158,"line":383},[156,859,742],{"class":162},[156,861,745],{"class":166},[156,863,449],{"class":162},[156,865,750],{"class":166},[156,867,455],{"class":176},[156,869,870,873,875,878,880,883,885,888,890,892,894,896,899,902,904,908,910],{"class":158,"line":388},[156,871,872],{"class":166},"            
f",[156,874,27],{"class":176},[156,876,877],{"class":351},"write",[156,879,206],{"class":176},[156,881,882],{"class":351},"json",[156,884,27],{"class":176},[156,886,887],{"class":351},"dumps",[156,889,206],{"class":176},[156,891,767],{"class":351},[156,893,27],{"class":176},[156,895,772],{"class":351},[156,897,898],{"class":176},"())",[156,900,901],{"class":250}," +",[156,903,631],{"class":276},[156,905,907],{"class":906},"s_hVV","\\n",[156,909,277],{"class":276},[156,911,302],{"class":176},[14,913,914,915,918,919,923],{},"Always specify ",[153,916,917],{},"encoding=\"utf-8\""," — mismatched encodings are a frequent source of corruption. See ",[18,920,922],{"href":921},"\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Ffixing-common-unicode-errors-in-python-scraping\u002F","Fixing Common Unicode Errors in Python Scraping"," when text comes out garbled.",[32,925,927],{"id":926},"incremental-writes-and-deduplication-with-sqlite","Incremental Writes and Deduplication with SQLite",[14,929,930,931,27],{},"For any crawl longer than a few seconds, write as you go rather than accumulating everything in memory and dumping at the end — a crash should never cost the whole run. 
SQLite is ideal for this locally: it gives you durable, incremental inserts plus deduplication via a unique constraint and ",[153,932,933],{},"INSERT OR IGNORE",[146,935,937],{"className":148,"code":936,"language":150,"meta":151,"style":151},"import sqlite3\n\nconn = sqlite3.connect(\"books.db\")\nconn.execute(\"\"\"\n    CREATE TABLE IF NOT EXISTS books (\n        title TEXT,\n        price REAL,\n        in_stock INTEGER,\n        url TEXT UNIQUE          -- natural key prevents duplicates\n    )\n\"\"\")\n\ndef save(book: Book, url: str) -> None:\n    conn.execute(\n        \"INSERT OR IGNORE INTO books (title, price, in_stock, url) VALUES (?, ?, ?, ?)\",\n        (book.title, book.price, int(book.in_stock), url),\n    )\n    conn.commit()\n",[153,938,939,946,950,976,991,996,1001,1006,1011,1016,1021,1028,1032,1065,1077,1090,1132,1136],{"__ignoreMap":151},[156,940,941,943],{"class":158,"line":159},[156,942,170],{"class":162},[156,944,945],{"class":166}," sqlite3\n",[156,947,948],{"class":158,"line":188},[156,949,192],{"emptyLinePlaceholder":191},[156,951,952,955,957,960,962,965,967,969,972,974],{"class":158,"line":195},[156,953,954],{"class":166},"conn ",[156,956,292],{"class":250},[156,958,959],{"class":166}," sqlite3",[156,961,27],{"class":176},[156,963,964],{"class":351},"connect",[156,966,206],{"class":176},[156,968,277],{"class":276},[156,970,971],{"class":280},"books.db",[156,973,277],{"class":276},[156,975,302],{"class":176},[156,977,978,981,983,986,988],{"class":158,"line":215},[156,979,980],{"class":166},"conn",[156,982,27],{"class":176},[156,984,985],{"class":351},"execute",[156,987,206],{"class":176},[156,989,990],{"class":276},"\"\"\"\n",[156,992,993],{"class":158,"line":228},[156,994,995],{"class":280},"    CREATE TABLE IF NOT EXISTS books (\n",[156,997,998],{"class":158,"line":239},[156,999,1000],{"class":280},"        title TEXT,\n",[156,1002,1003],{"class":158,"line":258},[156,1004,1005],{"class":280},"        price 
REAL,\n",[156,1007,1008],{"class":158,"line":263},[156,1009,1010],{"class":280},"        in_stock INTEGER,\n",[156,1012,1013],{"class":158,"line":305},[156,1014,1015],{"class":280},"        url TEXT UNIQUE          -- natural key prevents duplicates\n",[156,1017,1018],{"class":158,"line":313},[156,1019,1020],{"class":280},"    )\n",[156,1022,1023,1026],{"class":158,"line":335},[156,1024,1025],{"class":276},"\"\"\"",[156,1027,302],{"class":176},[156,1029,1030],{"class":158,"line":383},[156,1031,192],{"emptyLinePlaceholder":191},[156,1033,1034,1036,1039,1041,1044,1046,1048,1050,1053,1055,1057,1059,1061,1063],{"class":158,"line":388},[156,1035,391],{"class":198},[156,1037,1038],{"class":270}," save",[156,1040,206],{"class":176},[156,1042,1043],{"class":324},"book",[156,1045,221],{"class":176},[156,1047,203],{"class":166},[156,1049,177],{"class":176},[156,1051,1052],{"class":324}," url",[156,1054,221],{"class":176},[156,1056,603],{"class":224},[156,1058,606],{"class":176},[156,1060,416],{"class":176},[156,1062,611],{"class":254},[156,1064,455],{"class":176},[156,1066,1067,1070,1072,1074],{"class":158,"line":429},[156,1068,1069],{"class":166},"    conn",[156,1071,27],{"class":176},[156,1073,985],{"class":351},[156,1075,1076],{"class":176},"(\n",[156,1078,1079,1082,1085,1087],{"class":158,"line":440},[156,1080,1081],{"class":276},"        \"",[156,1083,1084],{"class":280},"INSERT OR IGNORE INTO books (title, price, in_stock, url) VALUES (?, ?, ?, ?)",[156,1086,277],{"class":276},[156,1088,1089],{"class":176},",\n",[156,1091,1092,1095,1097,1099,1102,1104,1107,1109,1111,1113,1116,1118,1120,1122,1124,1127,1129],{"class":158,"line":458},[156,1093,1094],{"class":176},"        (",[156,1096,1043],{"class":351},[156,1098,27],{"class":176},[156,1100,702],{"class":1101},"skxfh",[156,1103,177],{"class":176},[156,1105,1106],{"class":351}," book",[156,1108,27],{"class":176},[156,1110,281],{"class":1101},[156,1112,177],{"class":176},[156,1114,1115],{"class":224}," 
int",[156,1117,206],{"class":176},[156,1119,1043],{"class":351},[156,1121,27],{"class":176},[156,1123,719],{"class":1101},[156,1125,1126],{"class":176},"),",[156,1128,1052],{"class":351},[156,1130,1131],{"class":176},"),\n",[156,1133,1134],{"class":158,"line":466},[156,1135,1020],{"class":176},[156,1137,1138,1140,1142,1145],{"class":158,"line":492},[156,1139,1069],{"class":166},[156,1141,27],{"class":176},[156,1143,1144],{"class":351},"commit",[156,1146,737],{"class":176},[14,1148,1149,1150,1153,1154,1158],{},"The ",[153,1151,1152],{},"UNIQUE"," column means re-running the crawl will not create duplicates, which is essential for resumable or scheduled jobs. In a ",[18,1155,1157],{"href":1156},"\u002Fscaling-python-web-scrapers\u002Fweb-scraping-with-scrapy\u002F","Scrapy project",", this same logic belongs in an item pipeline.",[32,1160,1162],{"id":1161},"scaling-up-postgresql-and-parquet","Scaling Up: PostgreSQL and Parquet",[14,1164,1165,1166,1169,1170,1173,1174,1177,1178,1181],{},"When data outgrows a single file or needs concurrent writers, move to PostgreSQL: use ",[153,1167,1168],{},"psycopg","'s ",[153,1171,1172],{},"executemany"," or ",[153,1175,1176],{},"COPY"," for efficient bulk inserts, and an ",[153,1179,1180],{},"ON CONFLICT DO NOTHING"," clause for deduplication. 
For analytical datasets measured in millions of rows, write Parquet with pandas or pyarrow: it is columnar, compressed, and dramatically faster to query than CSV.",[146,1183,1185],{"className":148,"code":1184,"language":150,"meta":151,"style":151},"import pandas as pd\n\ndf = pd.DataFrame(b.model_dump() for b in books)\ndf.drop_duplicates(subset=\"title\").to_parquet(\"books.parquet\", index=False)\n",[153,1186,1187,1199,1203,1240],{"__ignoreMap":151},[156,1188,1189,1191,1194,1196],{"class":158,"line":159},[156,1190,170],{"class":162},[156,1192,1193],{"class":166}," pandas ",[156,1195,501],{"class":162},[156,1197,1198],{"class":166}," pd\n",[156,1200,1201],{"class":158,"line":188},[156,1202,192],{"emptyLinePlaceholder":191},[156,1204,1205,1208,1210,1213,1215,1218,1220,1222,1224,1226,1229,1232,1234,1236,1238],{"class":158,"line":195},[156,1206,1207],{"class":166},"df ",[156,1209,292],{"class":250},[156,1211,1212],{"class":166}," pd",[156,1214,27],{"class":176},[156,1216,1217],{"class":351},"DataFrame",[156,1219,206],{"class":176},[156,1221,767],{"class":351},[156,1223,27],{"class":176},[156,1225,772],{"class":351},[156,1227,1228],{"class":176},"()",[156,1230,1231],{"class":162}," 
for",[156,1233,745],{"class":351},[156,1235,449],{"class":162},[156,1237,750],{"class":351},[156,1239,302],{"class":176},[156,1241,1242,1245,1247,1250,1252,1255,1257,1259,1261,1263,1265,1268,1270,1272,1275,1277,1279,1282,1284,1287],{"class":158,"line":215},[156,1243,1244],{"class":166},"df",[156,1246,27],{"class":176},[156,1248,1249],{"class":351},"drop_duplicates",[156,1251,206],{"class":176},[156,1253,1254],{"class":288},"subset",[156,1256,292],{"class":250},[156,1258,277],{"class":276},[156,1260,702],{"class":280},[156,1262,277],{"class":276},[156,1264,355],{"class":176},[156,1266,1267],{"class":351},"to_parquet",[156,1269,206],{"class":176},[156,1271,277],{"class":276},[156,1273,1274],{"class":280},"books.parquet",[156,1276,277],{"class":276},[156,1278,177],{"class":176},[156,1280,1281],{"class":288}," index",[156,1283,292],{"class":250},[156,1285,1286],{"class":254},"False",[156,1288,302],{"class":176},[32,1290,1292],{"id":1291},"common-mistakes-to-avoid","Common Mistakes to Avoid",[1294,1295,1296,1303,1309,1315,1321],"ul",{},[1297,1298,1299,1302],"li",{},[67,1300,1301],{},"Buffering everything in memory:"," accumulating a giant list and writing once means a crash loses all of it. Stream to disk or a database incrementally.",[1297,1304,1305,1308],{},[67,1306,1307],{},"No deduplication key:"," without a unique constraint, re-runs and overlapping pagination create duplicate rows. Pick a natural key (URL, ID).",[1297,1310,1311,1314],{},[67,1312,1313],{},"Skipping validation:"," unvalidated records corrupt analytics silently. 
Validate at the storage boundary.",[1297,1316,1317,1320],{},[67,1318,1319],{},"Wrong format for the scale:"," CSV for millions of rows is slow and lossy; Parquet or a database is the right tool.",[1297,1322,1323,1326],{},[67,1324,1325],{},"Ignoring encoding:"," always write UTF-8 explicitly to avoid mojibake in exported text.",[32,1328,1330],{"id":1329},"frequently-asked-questions","Frequently Asked Questions",[14,1332,1333,1336],{},[67,1334,1335],{},"CSV or JSON for scraped data?","\nUse CSV for flat, tabular data destined for spreadsheets. Use JSON (or JSON Lines) when records are nested or feed an API. JSON Lines is best for large, append-as-you-go exports.",[14,1338,1339,1342],{},[67,1340,1341],{},"When should I use a database instead of files?","\nOnce you need querying, deduplication, concurrent writes, or resumable crawls. SQLite covers local needs with zero setup; PostgreSQL handles large, concurrent, or relational workloads.",[14,1344,1345,1348,1349,1351,1352,1354],{},[67,1346,1347],{},"How do I avoid duplicate records?","\nDefine a unique key (such as the source URL or an ID) and let the database enforce it with a unique constraint plus ",[153,1350,933],{}," (SQLite) or ",[153,1353,1180],{}," (PostgreSQL).",[14,1356,1357,1360],{},[67,1358,1359],{},"What is Parquet and when should I use it?","\nParquet is a columnar, compressed file format optimized for analytics. 
Use it for large datasets you will query or load into data tools — it is far faster and smaller than CSV at scale.",[1362,1363,1364],"style",{},"html pre.shiki code .sVHd0, html code.shiki .sVHd0{--shiki-light:#39ADB5;--shiki-light-font-style:italic;--shiki-default:#D73A49;--shiki-default-font-style:inherit;--shiki-dark:#F97583;--shiki-dark-font-style:inherit}html pre.shiki code .su5hD, html code.shiki .su5hD{--shiki-light:#90A4AE;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sP7_E, html code.shiki .sP7_E{--shiki-light:#39ADB5;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sbsja, html code.shiki .sbsja{--shiki-light:#9C3EDA;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .sbgvK, html code.shiki .sbgvK{--shiki-light:#E2931D;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sZMiF, html code.shiki .sZMiF{--shiki-light:#E2931D;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .smGrS, html code.shiki .smGrS{--shiki-light:#39ADB5;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .s39Yj, html code.shiki .s39Yj{--shiki-light:#39ADB5;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .stp6e, html code.shiki .stp6e{--shiki-light:#39ADB5;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sGLFI, html code.shiki .sGLFI{--shiki-light:#6182B8;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sjJ54, html code.shiki .sjJ54{--shiki-light:#39ADB5;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .s_sjI, html code.shiki .s_sjI{--shiki-light:#91B859;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .s99_P, html code.shiki .s99_P{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#E36209;--shiki-default-font-style:inherit;--shiki-dark:#FFAB70;--shiki-dark-font-style:inherit}html pre.shiki code .sFwrP, html code.shiki 
.sFwrP{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#24292E;--shiki-default-font-style:inherit;--shiki-dark:#E1E4E8;--shiki-dark-font-style:inherit}html pre.shiki code .slqww, html code.shiki .slqww{--shiki-light:#6182B8;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sptTA, html code.shiki .sptTA{--shiki-light:#6182B8;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .srdBf, html code.shiki .srdBf{--shiki-light:#F76D47;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .sutJx, html code.shiki 
.sutJx{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#6A737D;--shiki-default-font-style:inherit;--shiki-dark:#6A737D;--shiki-dark-font-style:inherit}html pre.shiki code .s_hVV, html code.shiki .s_hVV{--shiki-light:#90A4AE;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .skxfh, html code.shiki .skxfh{--shiki-light:#E53935;--shiki-default:#24292E;--shiki-dark:#E1E4E8}",{"title":151,"searchDepth":188,"depth":188,"links":1366},[1367,1368,1369,1370,1371,1372,1373],{"id":34,"depth":188,"text":35},{"id":133,"depth":188,"text":134},{"id":547,"depth":188,"text":548},{"id":926,"depth":188,"text":927},{"id":1161,"depth":188,"text":1162},{"id":1291,"depth":188,"text":1292},{"id":1329,"depth":188,"text":1330},"Persist scraped data the right way — choosing between CSV, JSON, SQLite, PostgreSQL, and Parquet, plus schema validation, deduplication, and incremental writes.","md",{},"\u002Fscaling-python-web-scrapers\u002Fstoring-and-exporting-scraped-data",{"title":5,"description":1374},"scaling-python-web-scrapers\u002Fstoring-and-exporting-scraped-data\u002Findex","srNrgOhkbSG8ZdSki6c89uiyWTky-l_KB2nOzuTrb2M",[1382,1432,1460],{"title":1383,"path":1384,"stem":1385,"children":1386},"Advanced Scraping Techniques Anti Bot Evasion","\u002Fadvanced-scraping-techniques-anti-bot-evasion","advanced-scraping-techniques-anti-bot-evasion",[1387,1390,1396,1408,1420],{"title":1388,"path":1384,"stem":1389},"Advanced Python Scraping & Anti-Bot Evasion","advanced-scraping-techniques-anti-bot-evasion\u002Findex",{"title":1391,"path":1392,"stem":1393,"children":1394},"Bypass Cloudflare & Akamai with Python","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fbypassing-cloudflare-and-akamai-protections","advanced-scraping-techniques-anti-bot-evasion\u002Fbypassing-cloudflare-and-akamai-protections\u002Findex",[1395],{"title":1391,"path":1392,"stem":1393},{"title":1397,"path":1398,"stem":1399,"children":1400},"Mastering Selenium for Dynamic 
Websites","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites","advanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Findex",[1401,1402],{"title":1397,"path":1398,"stem":1399},{"title":1403,"path":1404,"stem":1405,"children":1406},"Python Selenium Stealth Setup Guide","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Fhow-to-configure-selenium-stealth-to-avoid-detection","advanced-scraping-techniques-anti-bot-evasion\u002Fmastering-selenium-for-dynamic-websites\u002Fhow-to-configure-selenium-stealth-to-avoid-detection\u002Findex",[1407],{"title":1403,"path":1404,"stem":1405},{"title":1409,"path":1410,"stem":1411,"children":1412},"Rotating Proxies & Managing IP Blocks","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks","advanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Findex",[1413,1414],{"title":1409,"path":1410,"stem":1411},{"title":1415,"path":1416,"stem":1417,"children":1418},"Best Proxy Providers for Python Scrapers","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Fbest-free-and-paid-proxy-providers-for-scraping","advanced-scraping-techniques-anti-bot-evasion\u002Frotating-proxies-and-managing-ip-blocks\u002Fbest-free-and-paid-proxy-providers-for-scraping\u002Findex",[1419],{"title":1415,"path":1416,"stem":1417},{"title":1421,"path":1422,"stem":1423,"children":1424},"Playwright for Python Web Automation","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation","advanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Findex",[1425,1426],{"title":1421,"path":1422,"stem":1423},{"title":1427,"path":1428,"stem":1429,"children":1430},"Playwright vs Selenium: Python 
Benchmarks","\u002Fadvanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Fplaywright-vs-selenium-performance-benchmarks","advanced-scraping-techniques-anti-bot-evasion\u002Fusing-playwright-for-modern-web-automation\u002Fplaywright-vs-selenium-performance-benchmarks\u002Findex",[1431],{"title":1427,"path":1428,"stem":1429},{"title":1433,"path":1434,"stem":1435,"children":1436},"Scaling Python Web Scrapers","\u002Fscaling-python-web-scrapers","scaling-python-web-scrapers",[1437,1439,1445,1448],{"title":21,"path":1434,"stem":1438},"scaling-python-web-scrapers\u002Findex",{"title":1440,"path":1441,"stem":1442,"children":1443},"Asynchronous Scraping with asyncio and HTTPX","\u002Fscaling-python-web-scrapers\u002Fasynchronous-scraping-with-asyncio-and-httpx","scaling-python-web-scrapers\u002Fasynchronous-scraping-with-asyncio-and-httpx\u002Findex",[1444],{"title":1440,"path":1441,"stem":1442},{"title":5,"path":1377,"stem":1379,"children":1446},[1447],{"title":5,"path":1377,"stem":1379},{"title":1449,"path":1450,"stem":1451,"children":1452},"Web Scraping with Scrapy","\u002Fscaling-python-web-scrapers\u002Fweb-scraping-with-scrapy","scaling-python-web-scrapers\u002Fweb-scraping-with-scrapy\u002Findex",[1453,1454],{"title":1449,"path":1450,"stem":1451},{"title":1455,"path":1456,"stem":1457,"children":1458},"Scrapy vs BeautifulSoup: Which to Use","\u002Fscaling-python-web-scrapers\u002Fweb-scraping-with-scrapy\u002Fscrapy-vs-beautifulsoup-which-to-use","scaling-python-web-scrapers\u002Fweb-scraping-with-scrapy\u002Fscrapy-vs-beautifulsoup-which-to-use\u002Findex",[1459],{"title":1455,"path":1456,"stem":1457},{"title":1461,"path":1462,"stem":1463,"children":1464},"The Complete Guide To Python Web Scraping","\u002Fthe-complete-guide-to-python-web-scraping","the-complete-guide-to-python-web-scraping",[1465,1468,1480,1492,1498,1510,1522],{"title":1466,"path":1462,"stem":1467},"The Complete Python Web Scraping 
Guide","the-complete-guide-to-python-web-scraping\u002Findex",{"title":1469,"path":1470,"stem":1471,"children":1472},"Regex Data Extraction in Python Scraping","\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions","the-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Findex",[1473,1474],{"title":1469,"path":1470,"stem":1471},{"title":1475,"path":1476,"stem":1477,"children":1478},"Fix Unicode Errors in Python Web Scraping","\u002Fthe-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Ffixing-common-unicode-errors-in-python-scraping","the-complete-guide-to-python-web-scraping\u002Fextracting-data-with-regular-expressions\u002Ffixing-common-unicode-errors-in-python-scraping\u002Findex",[1479],{"title":1475,"path":1476,"stem":1477},{"title":1481,"path":1482,"stem":1483,"children":1484},"Pagination & Infinite Scroll in Python","\u002Fthe-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll","the-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Findex",[1485,1486],{"title":1481,"path":1482,"stem":1483},{"title":1487,"path":1488,"stem":1489,"children":1490},"Scrape Static Sites Without Getting Blocked","\u002Fthe-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Fhow-to-scrape-a-static-website-without-getting-blocked","the-complete-guide-to-python-web-scraping\u002Fhandling-pagination-and-infinite-scroll\u002Fhow-to-scrape-a-static-website-without-getting-blocked\u002Findex",[1491],{"title":1487,"path":1488,"stem":1489},{"title":1493,"path":1494,"stem":1495,"children":1496},"Managing Cookies & Sessions in 
Python","\u002Fthe-complete-guide-to-python-web-scraping\u002Fmanaging-cookies-and-sessions","the-complete-guide-to-python-web-scraping\u002Fmanaging-cookies-and-sessions\u002Findex",[1497],{"title":1493,"path":1494,"stem":1495},{"title":1499,"path":1500,"stem":1501,"children":1502},"Parsing HTML with BeautifulSoup in Python","\u002Fthe-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup","the-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Findex",[1503,1504],{"title":1499,"path":1500,"stem":1501},{"title":1505,"path":1506,"stem":1507,"children":1508},"BeautifulSoup vs lxml Speed Comparison","\u002Fthe-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Fbeautifulsoup-vs-lxml-which-parser-is-faster","the-complete-guide-to-python-web-scraping\u002Fparsing-html-with-beautifulsoup\u002Fbeautifulsoup-vs-lxml-which-parser-is-faster\u002Findex",[1509],{"title":1505,"path":1506,"stem":1507},{"title":1511,"path":1512,"stem":1513,"children":1514},"Setting Up Your Python Scraping Environment","\u002Fthe-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment","the-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Findex",[1515,1516],{"title":1511,"path":1512,"stem":1513},{"title":1517,"path":1518,"stem":1519,"children":1520},"Install Python & Requests for Beginners","\u002Fthe-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Fhow-to-install-python-and-requests-for-beginners","the-complete-guide-to-python-web-scraping\u002Fsetting-up-your-python-scraping-environment\u002Fhow-to-install-python-and-requests-for-beginners\u002Findex",[1521],{"title":1517,"path":1518,"stem":1519},{"title":1523,"path":1524,"stem":1525,"children":1526},"HTTP Requests & Responses for 
Scrapers","\u002Fthe-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses","the-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Findex",[1527,1528],{"title":1523,"path":1524,"stem":1525},{"title":1529,"path":1530,"stem":1531,"children":1532},"Extract HTML Tables with Python","\u002Fthe-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Fstep-by-step-guide-to-extracting-tables-from-html","the-complete-guide-to-python-web-scraping\u002Funderstanding-http-requests-and-responses\u002Fstep-by-step-guide-to-extracting-tables-from-html\u002Findex",[1533],{"title":1529,"path":1530,"stem":1531},1781700487014]