3.1. Advanced Data Structures & Algorithms
Overview
Go beyond the basics with built-in and specialized data structures that make your code faster and more efficient for real-world tasks.
Introduction: Why Advanced Data Structures Matter
As your Python projects grow in complexity, the simple lists and dictionaries you learned as a beginner start to show their limitations. When you're processing thousands of records, building recommendation systems, or creating high-performance applications, choosing the right data structure can mean the difference between code that runs in milliseconds versus minutes.
In this chapter, we'll explore Python's advanced built-in data structures and dive into the algorithmic thinking that will make you a more effective developer. You'll learn not just *what* these structures do, but *when* and *why* to use them in real-world scenarios.
Tuples, Sets, and Frozensets
Tuples: Immutable Sequences
While you might already know tuples as "immutable lists," their real power lies in their specific use cases and performance characteristics.
# Basic tuple creation
coordinates = (10, 20)
rgb_color = (255, 128, 0)
# Tuples are great for structured data
person = ("Alice", 30, "Engineer")
name, age, job = person # Tuple unpacking
# Multiple assignment using tuples
x, y = 10, 20 # Actually creates a tuple (10, 20) then unpacks it
When to use tuples:
- Fixed-size collections: When you know exactly how many items you need
- Heterogeneous data: Storing different types together (like database records)
- Dictionary keys: Unlike lists, tuples can be dictionary keys because they're hashable
- Function returns: Returning multiple values from functions
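The last point deserves a quick illustration: a function can hand back several values at once by packing them into a tuple, which the caller unpacks on assignment. A minimal sketch (the min_max helper is hypothetical):

```python
def min_max(values):
    """Return the smallest and largest value as a (min, max) tuple."""
    return min(values), max(values)  # the comma packs these into a tuple

low, high = min_max([4, 1, 7, 3])  # tuple unpacking on the caller's side
print(low, high)  # 1 7
```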
# Tuples as dictionary keys
locations = {
(0, 0): "Origin",
(10, 20): "Point A",
(-5, 15): "Point B"
}
# Named tuples for better readability
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
Employee = namedtuple('Employee', ['name', 'age', 'department'])
# Usage
p1 = Point(10, 20)
print(p1.x, p1.y) # More readable than p1[0], p1[1]
emp = Employee("Alice", 30, "Engineering")
print(f"{emp.name} works in {emp.department}")
Sets: Fast Membership Testing
Sets are unordered collections of unique elements, optimized for membership testing and set operations.
# Creating sets
numbers = {1, 2, 3, 4, 5}
fruits = set(['apple', 'banana', 'orange'])
empty_set = set() # Note: {} creates an empty dict, not set
# Fast membership testing - O(1) average case
if 'apple' in fruits:  # Much faster than list membership testing
    print("Found apple!")
# Set operations
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}
print(set_a | set_b) # Union: {1, 2, 3, 4, 5, 6}
print(set_a & set_b) # Intersection: {3, 4}
print(set_a - set_b) # Difference: {1, 2}
print(set_a ^ set_b) # Symmetric difference: {1, 2, 5, 6}
Real-world use cases:
- Removing duplicates: unique_items = list(set(items)) (note: the original order is lost)
- Permission systems: Checking if user has required permissions
- Data analysis: Finding common elements between datasets
# Example: Finding common interests between users
alice_interests = {"python", "data_science", "machine_learning", "hiking"}
bob_interests = {"python", "web_development", "hiking", "photography"}
common_interests = alice_interests & bob_interests
print(f"Common interests: {common_interests}")
# Output: Common interests: {'python', 'hiking'}
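The permission-system idea above also maps directly onto set operators: issubset (written `<=`) answers "does the user hold every required permission?" in one expression. A sketch with illustrative permission names:

```python
required = {"read", "write"}
user_permissions = {"read", "write", "delete"}

# issubset (<=) checks that every required permission is present
if required <= user_permissions:
    print("Access granted")

# Any missing permissions fall out of a set difference
missing = required - user_permissions
print(missing)  # set() - nothing is missing here
```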
Frozensets: Immutable Sets
Frozensets are the immutable version of sets, useful when you need hashable collections.
# Creating frozensets
immutable_tags = frozenset(['python', 'programming', 'tutorial'])
# Can be used as dictionary keys
category_tags = {
frozenset(['python', 'programming']): "Python Development",
frozenset(['html', 'css', 'javascript']): "Web Development",
frozenset(['data', 'analysis', 'pandas']): "Data Science"
}
# Useful for caching expensive computations
cache = {}  # results keyed by frozenset of tags

def expensive_operation(tags):
    # Convert to frozenset so the tags can be used as a dictionary key
    frozen_tags = frozenset(tags)
    if frozen_tags in cache:
        return cache[frozen_tags]
    result = perform_computation(tags)  # assume some expensive function
    cache[frozen_tags] = result
    return result
The collections Module: Specialized Containers
Python's collections module provides specialized alternatives to the built-in containers, each optimized for a specific access pattern.
defaultdict: Handling Missing Keys
from collections import defaultdict
# Instead of this error-prone code:
regular_dict = {}
for item in ['apple', 'banana', 'apple', 'orange', 'banana']:
    if item in regular_dict:
        regular_dict[item] += 1
    else:
        regular_dict[item] = 1
# Use defaultdict:
counter = defaultdict(int)
for item in ['apple', 'banana', 'apple', 'orange', 'banana']:
    counter[item] += 1  # No KeyError - missing keys default to 0
print(dict(counter)) # {'apple': 2, 'banana': 2, 'orange': 1}
# Grouping data
from collections import defaultdict
students = [
    ('Alice', 'Math'), ('Bob', 'Science'), ('Charlie', 'Math'),
    ('Diana', 'Science'), ('Eve', 'Math')
]
by_subject = defaultdict(list)
for name, subject in students:
    by_subject[subject].append(name)
print(dict(by_subject))
# {'Math': ['Alice', 'Charlie', 'Eve'], 'Science': ['Bob', 'Diana']}
Counter: Counting Made Easy
from collections import Counter
# Counting elements
text = "hello world"
char_count = Counter(text)
print(char_count) # Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})
# Most common elements
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
word_count = Counter(words)
print(word_count.most_common(2)) # [('apple', 3), ('banana', 2)]
# Counter arithmetic
counter1 = Counter(['a', 'b', 'c', 'a'])
counter2 = Counter(['a', 'b', 'b', 'd'])
print(counter1 + counter2) # Counter({'a': 3, 'b': 3, 'c': 1, 'd': 1})
print(counter1 - counter2) # Counter({'a': 1, 'c': 1}) - zero and negative counts are dropped
deque: Double-ended Queues
from collections import deque
# Efficient operations at both ends
d = deque(['a', 'b', 'c'])
d.appendleft('x') # O(1) - much faster than list.insert(0, 'x')
d.append('y') # O(1)
print(d) # deque(['x', 'a', 'b', 'c', 'y'])
print(d.popleft()) # O(1) - 'x'
print(d.pop()) # O(1) - 'y'
# Rotating elements
d = deque(range(10))
d.rotate(3) # Rotate right by 3
print(d) # deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6])
# Useful for implementing queues and circular buffers
class CircularBuffer:
    def __init__(self, size):
        self.buffer = deque(maxlen=size)

    def add(self, item):
        self.buffer.append(item)  # Automatically discards the oldest item when full

    def get_recent(self, n):
        return list(self.buffer)[-n:]
OrderedDict: Preserving Insertion Order
Note: Since Python 3.7, regular dictionaries preserve insertion order, but OrderedDict still offers features a plain dict lacks.
from collections import OrderedDict
# Before Python 3.7, order wasn't guaranteed in regular dicts
ordered = OrderedDict([('first', 1), ('second', 2), ('third', 3)])
# Move to end
ordered.move_to_end('first')
print(ordered) # OrderedDict([('second', 2), ('third', 3), ('first', 1)])
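Another feature that still separates OrderedDict from a plain dict is equality: two OrderedDicts holding the same items in a different order compare unequal, while regular dicts ignore order. A quick check:

```python
from collections import OrderedDict

a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b)  # False - OrderedDict equality is order-sensitive

# Plain dicts with the same items always compare equal, whatever the order
print({'x': 1, 'y': 2} == {'y': 2, 'x': 1})  # True
```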
# Useful for LRU cache implementation
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # Mark as recently used
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # Evict the least recently used
        self.cache[key] = value
Heaps, Queues, and Priority Queues
Heaps with heapq
Python's heapq module provides heap operations on ordinary lists, implementing a binary min-heap.
import heapq
# Creating a heap
numbers = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.heapify(numbers) # Convert list to heap in-place
print(numbers) # [1, 1, 2, 3, 5, 9, 4, 6] - not sorted, but heap property maintained
# Heap operations
heapq.heappush(numbers, 0) # Add element
smallest = heapq.heappop(numbers) # Remove and return smallest
print(smallest) # 0
# Finding k largest/smallest elements
data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
print(heapq.nlargest(3, data)) # [9, 8, 7]
print(heapq.nsmallest(3, data)) # [0, 1, 2]
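Since heapq only implements a min-heap, the standard trick for max-heap behavior is to push negated values and negate again when popping. A minimal sketch:

```python
import heapq

values = [3, 1, 4, 1, 5, 9]
max_heap = [-v for v in values]  # negate so the largest value becomes the smallest
heapq.heapify(max_heap)

largest = -heapq.heappop(max_heap)  # negate back on the way out
print(largest)  # 9
```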
# Priority queue implementation
class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        # The index breaks ties so items with equal priority pop in FIFO order
        heapq.heappush(self._queue, (priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]  # Return just the item

    def is_empty(self):
        return len(self._queue) == 0
# Usage
pq = PriorityQueue()
pq.push('task1', 3)
pq.push('task2', 1) # Higher priority (lower number)
pq.push('task3', 2)
while not pq.is_empty():
    print(pq.pop())  # Outputs: task2, task3, task1
Queue Module for Threading
import queue
import threading
import time
# Thread-safe queues
q = queue.Queue() # FIFO queue
lq = queue.LifoQueue() # LIFO queue (stack)
pq = queue.PriorityQueue() # Priority queue
# Producer-consumer pattern
def producer(q):
    for i in range(5):
        q.put(f"item_{i}")
        time.sleep(0.1)

def consumer(q):
    while True:
        item = q.get()
        print(f"Processing {item}")
        time.sleep(0.2)
        q.task_done()
# Usage
work_queue = queue.Queue()
threading.Thread(target=producer, args=(work_queue,)).start()
threading.Thread(target=consumer, args=(work_queue,)).start()
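One caveat with the pattern above: the consumer loops forever, so the program never exits on its own. A common fix, sketched here, is to run the consumer as a daemon thread and block on q.join(), which returns once every put() has been matched by a task_done() call:

```python
import queue
import threading

q = queue.Queue()

def consumer():
    while True:
        item = q.get()
        # ... process item ...
        q.task_done()  # signal that this item is fully handled

# daemon=True lets the interpreter exit even though this loop never breaks
threading.Thread(target=consumer, daemon=True).start()

for i in range(5):
    q.put(f"item_{i}")

q.join()  # blocks until task_done() has been called once per put()
print("All items processed")
```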
Sorting and Searching Algorithms
Advanced Sorting Techniques
# Custom sorting with key functions
students = [
    {'name': 'Alice', 'grade': 85, 'age': 20},
    {'name': 'Bob', 'grade': 90, 'age': 22},
    {'name': 'Charlie', 'grade': 85, 'age': 19}
]
# Sort by multiple criteria
from operator import itemgetter
# Sort by grade, then age - note reverse=True flips BOTH keys, so both descend
students.sort(key=itemgetter('grade', 'age'), reverse=True)
print(students)
# For mixed directions, use a key function and negate the numeric field to flip it
def grade_desc_age_asc(student):
    return (-student['grade'], student['age'])  # Negative grade -> descending
students.sort(key=grade_desc_age_asc)
# Python's sort is stable, so sorting repeatedly - lowest-priority key first -
# composes multiple orderings
def multi_sort(data, *sort_keys):
    """Sort by multiple (key, reverse) pairs, listed in order of priority."""
    for key, reverse in reversed(sort_keys):
        data.sort(key=key, reverse=reverse)
    return data
# Usage
multi_sort(students,
           (lambda x: x['grade'], True),   # Primary: grade descending
           (lambda x: x['age'], False))    # Secondary: age ascending
Binary Search and Bisect
import bisect
# Binary search on sorted lists
sorted_list = [1, 3, 5, 7, 9, 11, 13, 15]
# Find insertion point
pos = bisect.bisect_left(sorted_list, 6)
print(f"Insert 6 at position {pos}") # 3
# Insert while maintaining sort order
bisect.insort(sorted_list, 6)
print(sorted_list) # [1, 3, 5, 6, 7, 9, 11, 13, 15]
# Practical example: Grade boundaries
def grade_lookup(score):
    breakpoints = [60, 70, 80, 90]
    grades = ['F', 'D', 'C', 'B', 'A']
    return grades[bisect.bisect(breakpoints, score)]
print(grade_lookup(85)) # 'B'
print(grade_lookup(95)) # 'A'
Big O Notation & Performance Considerations
Understanding algorithmic complexity is crucial for writing efficient code. Here's a practical guide to Big O notation and how it applies to Python data structures:
Time Complexity Comparison
| Operation           | List | Dict | Set  | Deque |
|---------------------|------|------|------|-------|
| Access by index     | O(1) | N/A  | N/A  | O(1)* |
| Search (membership) | O(n) | O(1) | O(1) | O(n)  |
| Insert at beginning | O(n) | N/A  | N/A  | O(1)  |
| Insert at end       | O(1) | N/A  | N/A  | O(1)  |
| Delete by value/key | O(n) | O(1) | O(1) | O(n)  |

Dict and set times are average case; *deque indexing is O(1) only near the ends and O(n) in the middle.
Practical Performance Examples
import time
import random
def time_operation(func, *args):
    start = time.time()
    result = func(*args)
    end = time.time()
    return result, end - start
# Comparing list vs set membership testing
data_list = list(range(100000))
data_set = set(data_list)
target = 99999
# List search - O(n)
_, list_time = time_operation(lambda: target in data_list)
# Set search - O(1)
_, set_time = time_operation(lambda: target in data_set)
print(f"List search: {list_time:.6f} seconds")
print(f"Set search: {set_time:.6f} seconds")
print(f"Set is {list_time/set_time:.0f}x faster")
# Deque vs list for front operations
from collections import deque
def append_front_list(n):
    result = []
    for i in range(n):
        result.insert(0, i)  # O(n) operation
    return result

def append_front_deque(n):
    result = deque()
    for i in range(n):
        result.appendleft(i)  # O(1) operation
    return result
# Test with smaller numbers to see the difference
n = 1000
_, list_time = time_operation(append_front_list, n)
_, deque_time = time_operation(append_front_deque, n)
print(f"List insert at front: {list_time:.6f} seconds")
print(f"Deque insert at front: {deque_time:.6f} seconds")
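For micro-benchmarks like these, time.time() gives only a rough picture; the standard timeit module uses a high-resolution clock and repeats the statement to average out noise. A minimal sketch of the same list-vs-set comparison:

```python
import timeit

# setup runs once; the statement is repeated `number` times
setup = "data = list(range(100000)); data_set = set(data)"

list_time = timeit.timeit("99999 in data", setup=setup, number=100)
set_time = timeit.timeit("99999 in data_set", setup=setup, number=100)

print(f"list membership: {list_time:.6f}s, set membership: {set_time:.6f}s")
```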
Memory Considerations
import sys
# Memory usage comparison
regular_list = [i for i in range(1000)]
tuple_data = tuple(regular_list)
set_data = set(regular_list)
print(f"List size: {sys.getsizeof(regular_list)} bytes")
print(f"Tuple size: {sys.getsizeof(tuple_data)} bytes")
print(f"Set size: {sys.getsizeof(set_data)} bytes")
# Generator for memory-efficient processing
def memory_efficient_processing(data):
    """Process large datasets without loading everything into memory."""
    for item in data:
        if item % 2 == 0:  # Some condition
            yield item * 2
# Instead of creating a large list
large_data = range(1000000) # This doesn't create a list in memory
processed = memory_efficient_processing(large_data)
# Process one item at a time
for item in processed:
    if item > 100:  # Stop early if needed
        break
    # ... process the item ...
Practical Application: Building a URL Shortener
Let's tie everything together with a practical example that uses multiple advanced data structures:
from collections import defaultdict, deque
import hashlib
import time
import heapq
class URLShortener:
    def __init__(self, max_cache_size=1000):
        self.urls = {}                          # short -> long URL mapping
        self.reverse_urls = {}                  # long -> short URL mapping
        self.access_count = defaultdict(int)    # URL access statistics
        self.recent_access = deque(maxlen=100)  # Recent access history
        self.cache = {}                         # Cache for frequent URLs
        self.max_cache_size = max_cache_size
        self.cache_access_times = {}            # For LRU eviction

    def _generate_short_url(self, long_url):
        """Generate a short code from a hash of the long URL."""
        hash_object = hashlib.md5(long_url.encode())
        return hash_object.hexdigest()[:8]

    def shorten(self, long_url):
        """Shorten a URL, reusing the existing code if one exists."""
        if long_url in self.reverse_urls:
            return self.reverse_urls[long_url]
        short_url = self._generate_short_url(long_url)
        self.urls[short_url] = long_url
        self.reverse_urls[long_url] = short_url
        return short_url

    def expand(self, short_url):
        """Expand a short URL, with caching."""
        current_time = time.time()
        # Check the cache first
        if short_url in self.cache:
            self.cache_access_times[short_url] = current_time
            self.access_count[short_url] += 1
            self.recent_access.append((short_url, current_time))
            return self.cache[short_url]
        # Fall back to main storage
        if short_url not in self.urls:
            return None
        long_url = self.urls[short_url]
        # Add to the cache, evicting the least recently used entry if full
        if len(self.cache) >= self.max_cache_size:
            lru_url = min(self.cache_access_times,
                          key=self.cache_access_times.get)
            del self.cache[lru_url]
            del self.cache_access_times[lru_url]
        self.cache[short_url] = long_url
        self.cache_access_times[short_url] = current_time
        self.access_count[short_url] += 1
        self.recent_access.append((short_url, current_time))
        return long_url

    def get_popular_urls(self, top_n=10):
        """Get the most accessed URLs using a heap."""
        return heapq.nlargest(top_n, self.access_count.items(),
                              key=lambda x: x[1])

    def get_recent_activity(self):
        """Get recent URL access activity."""
        return list(self.recent_access)
# Usage example
shortener = URLShortener()
# Shorten some URLs
short1 = shortener.shorten("https://www.example.com/very/long/url/path")
short2 = shortener.shorten("https://www.google.com")
print(f"Shortened URL: {short1}")
# Expand URLs (with caching)
expanded = shortener.expand(short1)
print(f"Expanded URL: {expanded}")
# Access statistics
print(f"Popular URLs: {shortener.get_popular_urls()}")
print(f"Recent activity: {shortener.get_recent_activity()}")
Key Takeaways
- Choose the right data structure for your specific use case - sets for membership testing, deques for double-ended operations, heaps for priority queues.
- Understand time complexity - O(1) operations scale much better than O(n) operations as your data grows.
- Use specialized collections from the collections module when built-in types don't fit your needs perfectly.
- Consider memory usage alongside time complexity - sometimes a slower algorithm uses significantly less memory.
- Combine structures effectively - real-world applications often benefit from using multiple data structures together, each optimized for different operations.
- Profile your code - measure actual performance rather than assuming. The most elegant solution isn't always the fastest.
By mastering these advanced data structures and understanding their performance characteristics, you'll be able to write Python code that scales efficiently and handles real-world data processing tasks with confidence.
Next, let's dive into generators, decorators, and context managers.