This guide explains how garbage collection works in Python. You’ll learn the core ideas (reference counting and generational GC), explore the gc module, diagnose cyclic references, use weakref safely, and adopt practical patterns to keep memory usage healthy in real-world apps.
Garbage collection (GC) frees memory that your program no longer needs. Most of the time it “just works,” but understanding it helps you debug memory leaks, keep long-running services healthy, and tune performance.
In CPython (the most widely used Python implementation), every object has a reference count. When the count drops to zero, the object’s memory is reclaimed immediately.
import sys
x = [] # create a list object
print(sys.getrefcount(x)) # note: adds a temporary ref for the call
y = x # add another reference
print(sys.getrefcount(x))
del y # drop a reference
print(sys.getrefcount(x))
Pro: Fast reclamation when objects become unreachable.
Con: Pure reference counting can’t reclaim cycles (objects that reference each other but are otherwise unreachable).
To handle cycles, CPython layers a cyclic GC on top of reference counting. It periodically scans containers (lists, dicts, sets, instances, etc.) to find groups that only reference each other.
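CPython organizes container objects into three “generations.” Younger generations are collected more frequently.
The GC keeps a counter per generation and triggers a collection when a threshold is exceeded. For generation 0 the counter is the number of container allocations minus deallocations since the last collection; for the older generations it is the number of collections of the next-younger generation since they were last collected.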
| Generation | Typical Role | Collection Frequency |
|---|---|---|
| 0 | Newly allocated container objects | Most frequent |
| 1 | Objects that survived gen-0 collection | Less frequent |
| 2 | Long-lived survivors | Least frequent |
If objects survive a collection, they are promoted to an older generation. The assumption is that older objects are more likely to live longer, so collecting them less often saves work.
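The counters behind these generations can be watched directly. The sketch below is illustrative only: it assumes CPython, and the exact numbers it prints vary by version and by whatever else the interpreter has allocated. `keep` is a name introduced here for the example.

```python
import gc

gc.collect()                       # full collection so the counters start low
before = gc.get_count()[0]         # generation-0 counter
keep = [[] for _ in range(100)]    # allocate container objects and keep them alive
after = gc.get_count()[0]
print("gen-0 counter before/after:", before, after)

# Per-generation totals since the interpreter started
for gen, stats in enumerate(gc.get_stats()):
    print(f"gen {gen}: runs={stats['collections']}, freed={stats['collected']}")
```

On a typical run the generation-0 counter rises after the allocations, and generation 0 shows far more runs than generation 2.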
gc Module: Essential APIs
The gc module exposes controls and introspection for the cyclic collector.
import gc
# Is the cyclic GC enabled?
print("GC enabled:", gc.isenabled())
# Manually run a full collection (all generations)
unreachable = gc.collect()
print("Unreachable objects found:", unreachable)
# Current thresholds per generation (g0, g1, g2)
print("Thresholds:", gc.get_threshold())
# Current allocation/deallocation counters since last collection
print("Counts:", gc.get_count())
# Set custom thresholds: (gen0, gen1, gen2)
gc.set_threshold(700, 10, 10)
print("New thresholds:", gc.get_threshold())
# Temporarily disable GC in a tight loop to reduce overhead
gc.disable()
try:
    # ... create many short-lived objects ...
    pass
finally:
    gc.enable()  # always re-enable
import gc
class Node:
    def __init__(self, name):
        self.name = name
        self.ref = None
a = Node("A")
b = Node("B")
a.ref = b
b.ref = a # A cycle
del a, b # Drop our references; objects still reference each other
# Force a collection
unreachable = gc.collect()
print("Collected:", unreachable)
Without the cyclic GC, a and b would never be freed. The collector detects the cycle and reclaims it.
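To see that claim in action, here is a small sketch that switches the automatic collector off and uses a weak reference as a probe. It assumes CPython; `probe`, `alive_before`, and `alive_after` are names introduced for this illustration.

```python
import gc
import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.ref = None

was_enabled = gc.isenabled()
gc.disable()                      # no automatic cycle collection from here on
try:
    a, b = Node("A"), Node("B")
    a.ref, b.ref = b, a           # build the cycle
    probe = weakref.ref(a)        # lets us observe a without keeping it alive
    del a, b
    alive_before = probe() is not None   # reference counting alone cannot free it
    gc.collect()                  # an explicit collection still works while disabled
    alive_after = probe() is not None
finally:
    if was_enabled:
        gc.enable()

print("Alive after del:", alive_before)
print("Alive after gc.collect():", alive_after)
```

The cycle survives `del` because each object still holds a reference to the other; only the cyclic collector notices that the pair is unreachable.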
# Save all unreachable objects in gc.garbage for inspection
gc.set_debug(gc.DEBUG_SAVEALL)
# Create cycles, then collect
gc.collect()
# Inspect garbage
for obj in gc.garbage:
    print("Unreachable:", type(obj), getattr(obj, "__dict__", obj))
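Before Python 3.4, CPython could not safely finalize cycles containing objects with __del__ methods and left them in gc.garbage; you had to break those cycles yourself or avoid __del__. Since Python 3.4 (PEP 442), such cycles are collected normally, but the order in which their __del__ methods run is not guaranteed.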
__del__ Methods and Why They’re Tricky
__del__ is a destructor-like hook that runs when an object is reclaimed. It complicates cycle collection because Python cannot guarantee the order in which the __del__ methods of objects in the same cycle run, so one finalizer may see a partner that has already been finalized.
__del__ Example
import gc
class Resource:
    def __init__(self, name):
        self.name = name
        self.partner = None
    def __del__(self):
        # Runs in no guaranteed order when the object is part of a cycle
        print("Cleaning up", self.name)
x = Resource("x")
y = Resource("y")
x.partner = y
y.partner = x  # cycle
del x, y
gc.collect()  # Python 3.4+ finalizes and frees the cycle; older versions left it in gc.garbage
print("Garbage size:", len(gc.garbage))  # stays empty on 3.4+ unless DEBUG_SAVEALL is set
Safer alternative: prefer weakref.finalize for cleanup logic that doesn’t interfere with the collector (see next section).
weakref and weakref.finalize for Safer Cleanup
A weak reference does not increase an object’s reference count. This is useful when you need to refer to objects without preventing their collection, or to avoid forming cycles.
import weakref
class Expensive:
    pass
obj = Expensive()
r = weakref.ref(obj) # does not increment refcount
print("Alive?", r() is not None)
del obj
print("Alive after del?", r() is not None) # becomes None when collected
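A common use of weak references is a cache that must not keep its entries alive. The sketch below uses `weakref.WeakValueDictionary` from the standard library; the `Image` class and the `"logo.png"` key are hypothetical, and the immediate removal shown relies on CPython's reference counting.

```python
import weakref

class Image:
    def __init__(self, name):
        self.name = name

# Values are held weakly: an entry vanishes once its value is collected
cache = weakref.WeakValueDictionary()

img = Image("logo.png")
cache["logo.png"] = img
cached_before = "logo.png" in cache   # entry present while img is alive

del img                               # last strong reference dropped
cached_after = "logo.png" in cache    # entry removed along with the object

print("Cached before del:", cached_before)
print("Cached after del:", cached_after)
```

Because the cache never owns its values, it cannot be the reason an object outlives its real users.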
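weakref.finalize for Cleanup
import weakref
class Connection:
    def __init__(self):
        self.open = True
def release(label):
    # Must not reference the Connection itself, or it would never be collected
    print("Connection closed:", label)
c = Connection()
# Pass a plain function plus the data it needs. A bound method such as
# c.close would hold a strong reference to c and keep it alive indefinitely.
finalizer = weakref.finalize(c, release, "c")
del c  # in CPython the refcount hits zero here, so the finalizer runs immediately
print("Finalizer still pending?", finalizer.alive)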
Using finalize avoids __del__-related issues in cycles and gives you better control over cleanup timing.
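One caveat deserves a sketch: the callback and arguments given to `weakref.finalize` must not reference the object itself, and a bound method of the object always does. The comparison below assumes CPython; `release`, `closed`, `leaky`, and `safe` are names introduced for illustration.

```python
import gc
import weakref

class Connection:
    def __init__(self, label):
        self.label = label

    def close(self):
        print("closing", self.label)

closed = []

def release(label):
    # Receives only the data it needs, never the Connection itself
    closed.append(label)

# Pitfall: c1.close is a bound method, so the finalizer keeps c1 alive
c1 = Connection("bound")
leaky = weakref.finalize(c1, c1.close)
del c1
gc.collect()
leaky_alive = leaky.alive    # still pending: the object was never reclaimed

# Safer: a plain function plus plain data
c2 = Connection("plain")
safe = weakref.finalize(c2, release, c2.label)
del c2
safe_alive = safe.alive      # already ran when the refcount hit zero

leaky.detach()               # tidy up so the leaked object is released
print("bound-method finalizer pending:", leaky_alive)
print("plain-function finalizer pending:", safe_alive, "closed:", closed)
```

The first finalizer would only fire at interpreter exit, which is rarely the cleanup timing anyone wants.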
Spotting memory growth early is crucial for servers and batch jobs. Here are practical tools and patterns:
import gc
def snapshot():
    """Count live GC-tracked objects by type."""
    counts = {}
    for obj in gc.get_objects():
        t = type(obj)
        counts[t] = counts.get(t, 0) + 1
    return counts
before = snapshot()
# ... run workload ...
gc.collect()
after = snapshot()
# Show the ten types whose instance counts grew the most
for t in sorted(after, key=lambda k: after[k] - before.get(k, 0), reverse=True)[:10]:
    delta = after[t] - before.get(t, 0)
    if delta != 0:
        print(f"{t.__name__:<30s} Δ={delta:+d}")
import gc, time
for _ in range(5):
    print("Counts:", gc.get_count())  # (gen0, gen1, gen2)
    time.sleep(1)
Inspecting gc.garbage
import gc
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()
print("Garbage objects:", len(gc.garbage))
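With DEBUG_SAVEALL set, every unreachable object the collector finds is kept in gc.garbage instead of being freed, so a list that keeps growing points to code that keeps creating reference cycles. When you are done inspecting, reset the flag with gc.set_debug(0) and clear the list.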
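For ongoing monitoring, the standard library's `gc.callbacks` list offers a lighter-weight hook than saving every unreachable object. The following is a minimal sketch, assuming CPython; `on_gc` and `events` are names made up for the example, and a real service would send these numbers to its metrics system instead of printing them.

```python
import gc

events = []

def on_gc(phase, info):
    # phase is "start" or "stop"; info is a dict describing the collection
    if phase == "stop":
        events.append((info["generation"], info["collected"], info["uncollectable"]))

gc.callbacks.append(on_gc)
try:
    gc.collect()                  # force one collection so there is something to record
finally:
    gc.callbacks.remove(on_gc)    # always unregister the hook

for generation, collected, uncollectable in events:
    print(f"gen {generation}: collected={collected}, uncollectable={uncollectable}")
```

A steadily rising collected count in the oldest generation suggests that the workload keeps producing long-lived cycles.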
If your application creates many short-lived objects, you can raise the gen-0 threshold to reduce collection frequency and overhead:
import gc
old = gc.get_threshold()
gc.set_threshold(1200, 10, 10) # example: raise gen-0 threshold
print("Old:", old, "New:", gc.get_threshold())
For short, compute-heavy sections, temporarily disable the GC to reduce pauses:
import gc
gc.disable()
try:
    # tight loop creating lots of small objects
    data = [tuple(range(20)) for _ in range(1_000_000)]
finally:
    gc.enable()
Always re-enable GC. While it is disabled, reference cycles are never reclaimed, so leaving it off permanently lets memory grow unnoticed.
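The disable and re-enable pattern can be wrapped in a small context manager so that re-enabling cannot be forgotten. This is a sketch, not a standard-library facility: `gc_paused` is a hypothetical helper name, and it restores whatever state was in effect before the block.

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Disable the cyclic GC inside the block, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

gc.enable()                      # known starting state for the demonstration
with gc_paused():
    inside = gc.isenabled()      # collector is off inside the block
    data = [tuple(range(20)) for _ in range(10_000)]
after = gc.isenabled()           # collector is back on afterwards

print("Enabled inside block:", inside)
print("Enabled after block:", after)
```

Restoring the previous state, not enabling unconditionally, keeps the helper safe to nest or to use in code that already runs with the collector off.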
- Use weakref to avoid cycles.
- Avoid __del__ for critical resource cleanup. Use context managers or weakref.finalize.
- Use context managers (with) to deterministically release files, sockets, and locks.
- Call gc.collect() in batch jobs and at safe points in long-running services.
- Monitor gc.get_count() and process RSS (via psutil) in production for early signals of leaks.
Do all Python implementations collect garbage the same way?
No. This guide focuses on CPython. Other interpreters (like PyPy) use different GC strategies and heuristics.
Does del free memory immediately?
del drops a reference. If the reference count reaches zero, CPython reclaims the object immediately; objects caught in a reference cycle wait for the cyclic collector.
Should I call gc.collect() manually?
Usually not. It’s helpful for diagnostics, at the end of large batch stages, or before memory-sensitive tasks, but measure the impact.
Is __del__ bad?
Not inherently, but it complicates cycles and finalization order. Prefer context managers and weakref.finalize for safer cleanup.
import gc
class Node:
    def __init__(self, name):
        self.name = name
        self.next = None
def make_cycle():
    a = Node("a")
    b = Node("b")
    a.next = b
    b.next = a
    return a, b
gc.set_debug(gc.DEBUG_SAVEALL)
a, b = make_cycle()
# Drop strong references; only the cycle remains
del a, b
# Force collection and inspect results
unreachable = gc.collect()
print("Unreachable:", unreachable)
print("Garbage objects:", len(gc.garbage))
# Manually break the cycle if needed (example: when you still hold refs)
# for obj in list(gc.garbage):
#     if isinstance(obj, Node):
#         obj.next = None
# Turn off debugging and clear the garbage list once done inspecting
gc.set_debug(0)
gc.garbage.clear()
Cleanup with weakref.finalize
import weakref
def _release(name):
    # Module-level function: holds no reference to the TempFile
    print("Closed", name)
class TempFile:
    def __init__(self, name):
        self.name = name
        print("Opened", self.name)
        # Pass only the name, never self. A bound method such as
        # self.close would keep the object alive indefinitely.
        self._finalizer = weakref.finalize(self, _release, name)
    def close(self):
        # Explicit close; the finalizer runs at most once
        self._finalizer()
def use_tempfile():
    return TempFile("session.tmp")
t = use_tempfile()
# Drop the last strong reference; in CPython the finalizer runs right away
del t
from contextlib import contextmanager
@contextmanager
def resource(name):
    print("Acquired", name)
    try:
        yield
    finally:
        print("Released", name)
with resource("db-connection"):
    print("Do work with connection")
import gc
import time
def workload(n=500_000):
    # Allocate many small container objects
    data = []
    for i in range(n):
        data.append([i, i + 1])
    return data
def timed_run(thresholds):
    gc.set_threshold(*thresholds)
    start = time.perf_counter()
    data = workload()
    del data
    gc.collect()
    end = time.perf_counter()
    return end - start
baseline = gc.get_threshold()
print("Baseline thresholds:", baseline)
for t in [(700, 10, 10), (1200, 10, 10), (2000, 10, 10)]:
    dur = timed_run(t)
    print("Thresholds", t, "Duration:", round(dur, 3), "s")
# restore
gc.set_threshold(*baseline)
Python’s memory management blends immediate reference counting with a generational cyclic collector. Most applications never need manual intervention, but understanding the model pays off when debugging leaks, handling long-running services, or tuning performance. Reach first for context managers and weakref.finalize, monitor with gc diagnostics, and only tune thresholds when measurements justify it.


