3.1. Advanced Data Structures & Algorithms

Overview

Go beyond the basics with built-in and specialized data structures that make your code faster and more efficient for real-world tasks.


Introduction: Why Advanced Data Structures Matter

As your Python projects grow in complexity, the simple lists and dictionaries you learned as a beginner start to show their limitations. When you're processing thousands of records, building recommendation systems, or creating high-performance applications, choosing the right data structure can mean the difference between code that runs in milliseconds versus minutes.

In this chapter, we'll explore Python's advanced built-in data structures and dive into algorithmic thinking that will make you a more effective developer. You'll learn not just what these structures do, but when and why to use them in real-world scenarios.


Tuples, Sets, and Frozensets

Tuples: Immutable Sequences

While you might already know tuples as "immutable lists," their real power lies in their specific use cases and performance characteristics.

# Basic tuple creation
coordinates = (10, 20)
rgb_color = (255, 128, 0)

# Tuples are great for structured data
person = ("Alice", 30, "Engineer")
name, age, job = person  # Tuple unpacking

# Multiple assignment using tuples
x, y = 10, 20  # Actually creates a tuple (10, 20) then unpacks it
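One of those performance characteristics is memory: a tuple of the same elements is smaller than the equivalent list, because lists over-allocate to support growth. A quick check (exact byte counts vary by Python version and platform):

```python
import sys

as_list = [10, 20, 30]
as_tuple = (10, 20, 30)

# Tuples have a fixed size; lists carry extra bookkeeping and spare capacity
print(f"list:  {sys.getsizeof(as_list)} bytes")
print(f"tuple: {sys.getsizeof(as_tuple)} bytes")
```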

When to use tuples:

# Tuples as dictionary keys
locations = {
    (0, 0): "Origin",
    (10, 20): "Point A",
    (-5, 15): "Point B"
}

# Named tuples for better readability
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
Employee = namedtuple('Employee', ['name', 'age', 'department'])

# Usage
p1 = Point(10, 20)
print(p1.x, p1.y)  # More readable than p1[0], p1[1]

emp = Employee("Alice", 30, "Engineering")
print(f"{emp.name} works in {emp.department}")

Sets: Fast Membership Testing

Sets are unordered collections of unique elements, optimized for membership testing and set operations.

# Creating sets
numbers = {1, 2, 3, 4, 5}
fruits = set(['apple', 'banana', 'orange'])
empty_set = set()  # Note: {} creates an empty dict, not set

# Fast membership testing - O(1) average case
if 'apple' in fruits:  # Much faster than list membership testing
    print("Found apple!")

# Set operations
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

print(set_a | set_b)  # Union: {1, 2, 3, 4, 5, 6}
print(set_a & set_b)  # Intersection: {3, 4}
print(set_a - set_b)  # Difference: {1, 2}
print(set_a ^ set_b)  # Symmetric difference: {1, 2, 5, 6}

Real-world use cases:

# Example: Finding common interests between users
alice_interests = {"python", "data_science", "machine_learning", "hiking"}
bob_interests = {"python", "web_development", "hiking", "photography"}

common_interests = alice_interests & bob_interests
print(f"Common interests: {common_interests}")
# Output (set display order may vary): {'python', 'hiking'}

Frozensets: Immutable Sets

Frozensets are the immutable version of sets, useful when you need hashable collections.

# Creating frozensets
immutable_tags = frozenset(['python', 'programming', 'tutorial'])

# Can be used as dictionary keys
category_tags = {
    frozenset(['python', 'programming']): "Python Development",
    frozenset(['html', 'css', 'javascript']): "Web Development",
    frozenset(['data', 'analysis', 'pandas']): "Data Science"
}

# Useful for caching expensive computations
cache = {}

def expensive_operation(tags):
    # Convert to frozenset so the argument is hashable and usable as a key
    frozen_tags = frozenset(tags)
    if frozen_tags in cache:
        return cache[frozen_tags]

    result = perform_computation(tags)  # Placeholder for the real work
    cache[frozen_tags] = result
    return result
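For this memoization pattern, the standard library's functools.lru_cache is usually the better tool; it works whenever the arguments are hashable, which is exactly where frozenset comes in. A minimal sketch with a made-up scoring function:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def tagged_score(tags):
    # tags must be hashable, so callers pass a frozenset rather than a list
    return sum(len(tag) for tag in tags)

print(tagged_score(frozenset({'python', 'tutorial'})))  # Computed on first call
print(tagged_score(frozenset({'tutorial', 'python'})))  # Equal frozenset: cache hit
```

Because two frozensets with the same elements are equal and hash identically, the second call is served from the cache regardless of element order.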

The collections Module: Specialized Containers

Python's collections module provides specialized alternatives to built-in containers, each optimized for specific use cases.

defaultdict: Handling Missing Keys

from collections import defaultdict

# Instead of this error-prone code:
regular_dict = {}
for item in ['apple', 'banana', 'apple', 'orange', 'banana']:
    if item in regular_dict:
        regular_dict[item] += 1
    else:
        regular_dict[item] = 1

# Use defaultdict:
counter = defaultdict(int)
for item in ['apple', 'banana', 'apple', 'orange', 'banana']:
    counter[item] += 1  # No KeyError, defaults to 0

print(dict(counter))  # {'apple': 2, 'banana': 2, 'orange': 1}

# Grouping data
from collections import defaultdict

students = [
    ('Alice', 'Math'), ('Bob', 'Science'), ('Charlie', 'Math'),
    ('Diana', 'Science'), ('Eve', 'Math')
]

by_subject = defaultdict(list)
for name, subject in students:
    by_subject[subject].append(name)

print(dict(by_subject))
# {'Math': ['Alice', 'Charlie', 'Eve'], 'Science': ['Bob', 'Diana']}

Counter: Counting Made Easy

from collections import Counter

# Counting elements
text = "hello world"
char_count = Counter(text)
print(char_count)  # Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})

# Most common elements
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
word_count = Counter(words)
print(word_count.most_common(2))  # [('apple', 3), ('banana', 2)]

# Counter arithmetic
counter1 = Counter(['a', 'b', 'c', 'a'])
counter2 = Counter(['a', 'b', 'b', 'd'])
print(counter1 + counter2)  # Counter({'a': 3, 'b': 3, 'c': 1, 'd': 1})
print(counter1 - counter2)  # Counter({'a': 1, 'c': 1}) - negative counts are dropped

deque: Double-ended Queues

from collections import deque

# Efficient operations at both ends
d = deque(['a', 'b', 'c'])
d.appendleft('x')  # O(1) - much faster than list.insert(0, 'x')
d.append('y')      # O(1)
print(d)  # deque(['x', 'a', 'b', 'c', 'y'])

print(d.popleft())  # O(1) - 'x'
print(d.pop())      # O(1) - 'y'

# Rotating elements
d = deque(range(10))
d.rotate(3)  # Rotate right by 3
print(d)  # deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6])

# Useful for implementing queues and circular buffers
class CircularBuffer:
    def __init__(self, size):
        self.buffer = deque(maxlen=size)

    def add(self, item):
        self.buffer.append(item)  # Automatically removes oldest if full

    def get_recent(self, n):
        return list(self.buffer)[-n:]

# Keep only the three most recent readings
buf = CircularBuffer(3)
for reading in (10, 20, 30, 40):
    buf.add(reading)  # 10 is evicted when the fourth reading arrives
print(buf.get_recent(2))  # [30, 40]

OrderedDict: Preserving Insertion Order

Note: As of Python 3.7, regular dictionaries maintain insertion order, but OrderedDict still has unique features, such as move_to_end() and order-sensitive equality comparison.

from collections import OrderedDict

# Before Python 3.7, order wasn't guaranteed in regular dicts
ordered = OrderedDict([('first', 1), ('second', 2), ('third', 3)])

# Move to end
ordered.move_to_end('first')
print(ordered)  # OrderedDict([('second', 2), ('third', 3), ('first', 1)])

# Useful for LRU cache implementation
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # Mark as recently used
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # Remove least recently used
        self.cache[key] = value

cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')         # 'a' is now the most recently used entry
cache.put('c', 3)      # Evicts 'b', the least recently used
print(cache.get('b'))  # None

Heaps, Queues, and Priority Queues

Heaps with heapq

Python's heapq module provides heap operations on regular lists, implementing a min-heap.

import heapq

# Creating a heap
numbers = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.heapify(numbers)  # Convert list to heap in-place
print(numbers)  # [1, 1, 2, 3, 5, 9, 4, 6] - not sorted, but heap property maintained

# Heap operations
heapq.heappush(numbers, 0)  # Add element
smallest = heapq.heappop(numbers)  # Remove and return smallest
print(smallest)  # 0

# Finding k largest/smallest elements
data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
print(heapq.nlargest(3, data))   # [9, 8, 7]
print(heapq.nsmallest(3, data))  # [0, 1, 2]

# Priority queue implementation
class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        heapq.heappush(self._queue, (priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]  # Return just the item

    def is_empty(self):
        return len(self._queue) == 0

# Usage
pq = PriorityQueue()
pq.push('task1', 3)
pq.push('task2', 1)  # Higher priority (lower number)
pq.push('task3', 2)

while not pq.is_empty():
    print(pq.pop())  # Outputs: task2, task3, task1
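heapq only implements a min-heap. The standard workaround for max-heap behavior, sketched here, is to push negated values and negate again when popping:

```python
import heapq

scores = [40, 95, 70, 10]

# Negate every value so the "smallest" item is actually the largest score
max_heap = [-s for s in scores]
heapq.heapify(max_heap)

# Pop the two largest scores, negating on the way out
print(-heapq.heappop(max_heap))  # 95
print(-heapq.heappop(max_heap))  # 70
```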

Queue Module for Threading

import queue
import threading
import time

# Thread-safe queues
q = queue.Queue()  # FIFO queue
lq = queue.LifoQueue()  # LIFO queue (stack)
pq = queue.PriorityQueue()  # Priority queue

# Producer-consumer pattern
def producer(q):
    for i in range(5):
        q.put(f"item_{i}")
        time.sleep(0.1)

def consumer(q):
    while True:
        item = q.get()
        print(f"Processing {item}")
        time.sleep(0.2)
        q.task_done()

# Usage: run the consumer as a daemon thread so the program can exit,
# then block on join() until every produced item has been processed
work_queue = queue.Queue()
threading.Thread(target=producer, args=(work_queue,)).start()
threading.Thread(target=consumer, args=(work_queue,), daemon=True).start()
work_queue.join()

Sorting and Searching Algorithms

Advanced Sorting Techniques

# Custom sorting with key functions
students = [
    {'name': 'Alice', 'grade': 85, 'age': 20},
    {'name': 'Bob', 'grade': 90, 'age': 22},
    {'name': 'Charlie', 'grade': 85, 'age': 19}
]

# Sort by multiple criteria
from operator import itemgetter

# Sort by grade, then by age - note that reverse=True flips the whole
# key, so both criteria end up descending
students.sort(key=itemgetter('grade', 'age'), reverse=True)
print(students)

# For mixed directions (grade descending, age ascending), use a key
# function that negates the numeric field you want reversed
def grade_then_age(student):
    return (-student['grade'], student['age'])

students.sort(key=grade_then_age)

# Stable sorting for complex scenarios: Python's sort is stable, so
# sorting by the lowest-priority key first and the highest-priority
# key last produces a correct multi-key ordering
def multi_sort(data, *sort_keys):
    """Sort by multiple (key, reverse) pairs in order of priority"""
    for key, reverse in reversed(sort_keys):
        data.sort(key=key, reverse=reverse)
    return data

# Usage
multi_sort(students, 
    (lambda x: x['grade'], True),   # Primary: grade descending
    (lambda x: x['age'], False)     # Secondary: age ascending
)

Binary Search and Bisect

import bisect

# Binary search on sorted lists
sorted_list = [1, 3, 5, 7, 9, 11, 13, 15]

# Find insertion point
pos = bisect.bisect_left(sorted_list, 6)
print(f"Insert 6 at position {pos}")  # 3

# Insert while maintaining sort order
bisect.insort(sorted_list, 6)
print(sorted_list)  # [1, 3, 5, 6, 7, 9, 11, 13, 15]

# Practical example: Grade boundaries
def grade_lookup(score):
    breakpoints = [60, 70, 80, 90]
    grades = ['F', 'D', 'C', 'B', 'A']
    return grades[bisect.bisect(breakpoints, score)]

print(grade_lookup(85))  # 'B'
print(grade_lookup(95))  # 'A'
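One subtlety worth knowing: when the target value occurs more than once, bisect_left and bisect_right (for which bisect is an alias) return different positions:

```python
import bisect

values = [1, 3, 5, 5, 5, 7]

# bisect_left returns the position of the first 5;
# bisect_right returns the position just past the last 5
print(bisect.bisect_left(values, 5))   # 2
print(bisect.bisect_right(values, 5))  # 5
```

Inserting at the bisect_left position places a new element before its duplicates; bisect_right places it after.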

Big O Notation & Performance Considerations

Understanding algorithmic complexity is crucial for writing efficient code. Here's a practical guide to Big O notation and how it applies to Python data structures:

Time Complexity Comparison

Operation             List   Dict   Set    Deque
Access by index       O(1)   N/A    N/A    O(n)
Search / membership   O(n)   O(1)   O(1)   O(n)
Insert at beginning   O(n)   N/A    N/A    O(1)
Insert at end         O(1)   N/A    N/A    O(1)
Delete by value/key   O(n)   O(1)   O(1)   O(n)

(Dict and set figures are average case for hash-based lookup by key. Deque indexing is O(1) at either end but O(n) in the middle.)

Practical Performance Examples

import time
import random

def time_operation(func, *args):
    start = time.perf_counter()  # perf_counter has much finer resolution than time.time
    result = func(*args)
    end = time.perf_counter()
    return result, end - start

# Comparing list vs set membership testing
data_list = list(range(100000))
data_set = set(data_list)
target = 99999

# List search - O(n)
_, list_time = time_operation(lambda: target in data_list)

# Set search - O(1)
_, set_time = time_operation(lambda: target in data_set)

print(f"List search: {list_time:.6f} seconds")
print(f"Set search: {set_time:.6f} seconds")
print(f"Set is {list_time/set_time:.0f}x faster")

# Deque vs list for front operations
from collections import deque

def append_front_list(n):
    result = []
    for i in range(n):
        result.insert(0, i)  # O(n) operation
    return result

def append_front_deque(n):
    result = deque()
    for i in range(n):
        result.appendleft(i)  # O(1) operation
    return result

# Test with smaller numbers to see the difference
n = 1000
_, list_time = time_operation(append_front_list, n)
_, deque_time = time_operation(append_front_deque, n)

print(f"List insert at front: {list_time:.6f} seconds")
print(f"Deque insert at front: {deque_time:.6f} seconds")

Memory Considerations

import sys

# Memory usage comparison
regular_list = [i for i in range(1000)]
tuple_data = tuple(regular_list)
set_data = set(regular_list)

print(f"List size: {sys.getsizeof(regular_list)} bytes")
print(f"Tuple size: {sys.getsizeof(tuple_data)} bytes")
print(f"Set size: {sys.getsizeof(set_data)} bytes")

# Generator for memory-efficient processing
def memory_efficient_processing(data):
    """Process large datasets without loading everything into memory"""
    for item in data:
        if item % 2 == 0:  # Some condition
            yield item * 2

# Instead of creating a large list
large_data = range(1000000)  # This doesn't create a list in memory
processed = memory_efficient_processing(large_data)

# Process one item at a time
for item in processed:
    if item > 100:  # Stop early if needed
        break
    # Process item
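The memory gap is easy to see by comparing a list comprehension against the equivalent generator expression, which stores only its iteration state:

```python
import sys

squares_list = [n * n for n in range(100000)]   # All 100,000 results in memory
squares_gen = (n * n for n in range(100000))    # Produces results on demand

print(f"List comprehension:   {sys.getsizeof(squares_list)} bytes")
print(f"Generator expression: {sys.getsizeof(squares_gen)} bytes")
```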

Practical Application: Building a URL Shortener

Let's tie everything together with a practical example that uses multiple advanced data structures:

from collections import defaultdict, deque
import hashlib
import time
import heapq

class URLShortener:
    def __init__(self, max_cache_size=1000):
        self.urls = {}  # short -> long URL mapping
        self.reverse_urls = {}  # long -> short URL mapping
        self.access_count = defaultdict(int)  # URL access statistics
        self.recent_access = deque(maxlen=100)  # Recent access history
        self.cache = {}  # LRU cache for frequent URLs
        self.max_cache_size = max_cache_size
        self.cache_access_times = {}  # For LRU implementation

    def _generate_short_url(self, long_url):
        """Generate a short key by hashing (md5 is fine here: not security-sensitive)"""
        hash_object = hashlib.md5(long_url.encode())
        return hash_object.hexdigest()[:8]

    def shorten(self, long_url):
        """Shorten a URL"""
        if long_url in self.reverse_urls:
            return self.reverse_urls[long_url]

        short_url = self._generate_short_url(long_url)
        self.urls[short_url] = long_url
        self.reverse_urls[long_url] = short_url
        return short_url

    def expand(self, short_url):
        """Expand a short URL with caching"""
        current_time = time.time()

        # Check cache first
        if short_url in self.cache:
            self.cache_access_times[short_url] = current_time
            self.access_count[short_url] += 1
            self.recent_access.append((short_url, current_time))
            return self.cache[short_url]

        # Get from main storage
        if short_url not in self.urls:
            return None

        long_url = self.urls[short_url]

        # Add to cache with LRU eviction
        if len(self.cache) >= self.max_cache_size:
            # Remove least recently used
            lru_url = min(self.cache_access_times.keys(), 
                         key=lambda k: self.cache_access_times[k])
            del self.cache[lru_url]
            del self.cache_access_times[lru_url]

        self.cache[short_url] = long_url
        self.cache_access_times[short_url] = current_time
        self.access_count[short_url] += 1
        self.recent_access.append((short_url, current_time))

        return long_url

    def get_popular_urls(self, top_n=10):
        """Get most popular URLs using heap"""
        return heapq.nlargest(top_n, self.access_count.items(), 
                             key=lambda x: x[1])

    def get_recent_activity(self):
        """Get recent URL access activity"""
        return list(self.recent_access)

# Usage example
shortener = URLShortener()

# Shorten some URLs
short1 = shortener.shorten("https://www.example.com/very/long/url/path")
short2 = shortener.shorten("https://www.google.com")

print(f"Shortened URL: {short1}")

# Expand URLs (with caching)
expanded = shortener.expand(short1)
print(f"Expanded URL: {expanded}")

# Access statistics
print(f"Popular URLs: {shortener.get_popular_urls()}")
print(f"Recent activity: {shortener.get_recent_activity()}")

Key Takeaways

  1. Choose the right data structure for your specific use case - sets for membership testing, deques for double-ended operations, heaps for priority queues.

  2. Understand time complexity - O(1) operations scale much better than O(n) operations as your data grows.

  3. Use specialized collections from the collections module when built-in types don't fit your needs perfectly.

  4. Consider memory usage alongside time complexity - sometimes a slower algorithm uses significantly less memory.

  5. Combine structures effectively - real-world applications often benefit from using multiple data structures together, each optimized for different operations.

  6. Profile your code - measure actual performance rather than assuming. The most elegant solution isn't always the fastest.
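For the last takeaway, the timeit module gives more reliable numbers than one-off time.time() measurements, since it runs the statement repeatedly. A small sketch repeating the earlier membership comparison:

```python
import timeit

data_list = list(range(100000))
data_set = set(data_list)

# Each lambda is executed 100 times; timeit returns the total elapsed seconds
list_time = timeit.timeit(lambda: 99999 in data_list, number=100)
set_time = timeit.timeit(lambda: 99999 in data_set, number=100)

print(f"list membership: {list_time:.6f}s, set membership: {set_time:.6f}s")
```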

By mastering these advanced data structures and understanding their performance characteristics, you'll be able to write Python code that scales efficiently and handles real-world data processing tasks with confidence.

Next, let's dive into generators, decorators and context managers.