Complete Feature Reference¶
This document provides a comprehensive reference for every feature extracted by JoyfulJay. Understanding these features is essential for building effective ML models for network traffic analysis.
Why These Features Matter¶
Network traffic analysis traditionally required decrypting traffic to understand its contents. JoyfulJay takes a different approach: it extracts behavioral features from encrypted traffic that reveal patterns without exposing content. These features capture:
- How fast packets are sent (timing)
- How large packets are (size)
- What protocols are negotiated (TLS, SSH, QUIC)
- How connections behave (TCP state, handshakes)
- What patterns emerge (bursts, padding, fingerprints)
These behavioral signals are remarkably consistent within application types and can distinguish encrypted Tor traffic from VPN traffic from regular HTTPS - all without decryption.
Feature Overview¶
JoyfulJay extracts 200+ features organized into 13 feature groups:
| Group | Features | Description |
|---|---|---|
| flow_meta | 27 | Flow identification, timing, packet/byte counts |
| timing | 35 | Inter-arrival times, bursts, idle periods |
| size | 39 | Packet lengths, payload statistics |
| tcp | 26 | TCP flags, handshake, connection state |
| tls | 23 | TLS version, ciphers, JA3/JA3S fingerprints |
| quic | 12 | QUIC version, connection IDs |
| ssh | 10 | SSH version, HASSH fingerprints |
| dns | 15 | DNS queries, response codes |
| fingerprint | 8 | Tor, VPN, DoH detection |
| entropy | 9 | Payload entropy, byte distribution |
| padding | 17 | Padding detection, obfuscation |
| connection | 20 | Graph-based network analysis |
Flow Metadata Features¶
The flow_meta group provides fundamental flow identification and statistics. These features form the foundation for understanding any network connection.
What is a "Flow"?¶
A flow represents a bidirectional network conversation between two endpoints. JoyfulJay identifies flows by the 5-tuple: source IP, source port, destination IP, destination port, and protocol. All packets matching this tuple are grouped into the same flow.
The initiator (source) is the endpoint that sent the first packet - typically the client. The responder (destination) is the endpoint that received the first packet - typically the server.
Identification Features¶
| Feature | Type | Description |
|---|---|---|
flow_id | str | Unique 32-character hash identifying this flow. Computed from the 5-tuple (IPs, ports, protocol). Useful for cross-referencing flows without exposing raw addresses. Only included if include_flow_id=True. |
src_ip | str | Source IP address (flow initiator). The endpoint that sent the first packet. Can be anonymized with anonymize_ips=True to produce a SHA-256 hash instead. |
dst_ip | str | Destination IP address (flow responder). The endpoint that received the first packet. Can be anonymized with anonymize_ips=True. |
src_port | int | Source port number (ephemeral port). Usually a high-numbered port (49152-65535) assigned by the OS. |
dst_port | int | Destination port number (service port). Indicates the service: 443=HTTPS, 80=HTTP, 22=SSH, 53=DNS, etc. |
dst_port_class | str | Service classification based on destination port. Values: "well_known" (0-1023), "registered" (1024-49151), "dynamic" (49152-65535). |
dst_port_class_num | int | Numeric port classification: 0=well_known, 1=registered, 2=dynamic. Useful for ML models. |
protocol | int | IP protocol number: 6=TCP, 17=UDP, 1=ICMP, 50=ESP (IPSec), 51=AH (IPSec). |
Example values:
src_ip: "192.168.1.100"
dst_ip: "142.250.189.46"
src_port: 52341
dst_port: 443
dst_port_class: "well_known"
protocol: 6 (TCP)
Timing Features¶
| Feature | Type | Description |
|---|---|---|
start_time | float | Unix timestamp (seconds since 1970-01-01) when the first packet was observed. |
end_time | float | Unix timestamp when the last packet was observed. |
duration | float | Flow duration in seconds (end_time - start_time). Short flows (<1s) often indicate DNS queries, handshakes, or failed connections. Long flows (>60s) indicate persistent connections like streaming or SSH sessions. |
time_first | float | Alias for start_time (Tranalyzer compatibility). |
time_last | float | Alias for end_time (Tranalyzer compatibility). |
What duration tells you: - < 0.1s: DNS query, failed connection, port scan - 0.1s - 1s: Short HTTP request, API call - 1s - 60s: Typical web browsing session - > 60s: Streaming video, SSH session, file download
Packet Count Features¶
| Feature | Type | Description |
|---|---|---|
total_packets | int | Total packets in the flow (both directions combined). |
packets_fwd | int | Packets from initiator to responder (client-to-server). Typically requests, uploads, commands. |
packets_bwd | int | Packets from responder to initiator (server-to-client). Typically responses, downloads, data. |
packets_ratio | float | Ratio of forward to backward packets (packets_fwd / packets_bwd). Values >1 indicate client-heavy traffic (uploads, commands); <1 indicate server-heavy (downloads, streaming). |
What packet ratios tell you: - ~1.0: Interactive session (SSH, chat) with balanced exchange - < 0.5: Download-heavy (streaming video, file download) - > 2.0: Upload-heavy (file upload, POST requests) - Very high (>10): Scanning, one-way traffic, or failed connection
Byte Count Features¶
| Feature | Type | Description |
|---|---|---|
total_bytes | int | Total bytes including all headers (Ethernet, IP, TCP/UDP, payload). |
bytes_fwd | int | Bytes sent by initiator. Includes protocol headers. |
bytes_bwd | int | Bytes sent by responder. Includes protocol headers. |
payload_bytes_fwd | int | Application-layer bytes sent by initiator. Excludes IP and TCP/UDP headers - just the actual data. |
payload_bytes_bwd | int | Application-layer bytes sent by responder. |
payload_bytes_total | int | Total application-layer bytes both directions. |
bytes_ratio | float | Ratio of forward to backward bytes (bytes_fwd / bytes_bwd). |
Understanding the difference: - total_bytes includes all overhead (headers, ACKs) - payload_bytes_total is just application data - A TLS handshake might have total_bytes=5000 but payload_bytes_total=3000 due to header overhead
Rate Features¶
| Feature | Type | Description |
|---|---|---|
packets_per_second | float | Average packet rate over the flow duration (total_packets / duration). High rates (>1000 pps) may indicate streaming or bulk transfer. Low rates (<10 pps) indicate interactive sessions. |
bytes_per_second | float | Average throughput in bytes/second (total_bytes / duration). |
avg_packet_size | float | Average packet size in bytes (total_bytes / total_packets). Small averages (~100-200) indicate interactive traffic; large averages (~1000-1400) indicate bulk transfers. |
Protocol Stack Features (Tranalyzer-compatible)¶
| Feature | Type | Description |
|---|---|---|
flow_stat | int | TCP connection state bitmap. Encodes what TCP events were observed. Bits: 0=SYN seen, 1=SYN-ACK seen, 2=FIN from initiator, 3=FIN from responder, 4=RST seen, 6=proper termination. |
num_hdrs | int | Number of protocol layers detected. Typically 2-3: Ethernet, IP, TCP/UDP. |
hdr_desc | str | Protocol stack description string like "ETH-IP-TCP" or "ETH-IP6-UDP". Indicates which protocols are present. |
Interpreting flow_stat: - 0x43 (binary 1000011) = SYN + SYN-ACK + proper termination = Normal complete TCP connection - 0x11 (binary 10001) = SYN + RST = Connection refused - 0x03 (binary 11) = SYN + SYN-ACK but no FIN/RST = Connection still active or timed out
Timing Features¶
The timing group captures the temporal behavior of network flows. These features are crucial for detecting application types and anomalies based on communication patterns.
Understanding Inter-Arrival Time (IAT)¶
Inter-Arrival Time (IAT) is the time between consecutive packets. It reveals the "rhythm" of communication:
- Very low IAT (< 1ms): Burst of data, likely bulk transfer
- Regular IAT (~10-50ms): Streaming video, audio
- Variable IAT: Interactive session, human-paced
- Long IAT (> 1s): Idle connection, keep-alive
IAT Statistics (Overall)¶
| Feature | Type | Description |
|---|---|---|
iat_min | float | Minimum IAT in seconds. Very small values (<1ms) indicate burst transmission or fast server response. |
iat_max | float | Maximum IAT in seconds. Large values indicate idle periods, timeouts, or user think time. |
iat_mean | float | Mean IAT. Characterizes the typical pace of communication. |
iat_std | float | Standard deviation of IAT. Low std indicates regular/constant timing (streaming); high std indicates bursty or variable patterns (browsing). |
iat_median | float | Median IAT (50th percentile). Less sensitive to outliers than mean - better represents "typical" timing. |
iat_sum | float | Sum of all IATs. Approximately equals flow duration. |
iat_p25 | float | 25th percentile IAT. Fast packets. |
iat_p75 | float | 75th percentile IAT. Slower packets. |
iat_p90 | float | 90th percentile IAT. Captures "slow" packet gaps. |
iat_p99 | float | 99th percentile IAT. Captures extreme delays. |
Typical patterns: - Video streaming: iat_mean ~30-40ms, iat_std low - Web browsing: iat_mean ~100-500ms, iat_std high - File download: iat_min very low (bursts), iat_max moderate
Directional IAT Statistics¶
Separate statistics for each direction reveal asymmetric behavior - essential for understanding client-server dynamics.
| Feature | Type | Description |
|---|---|---|
iat_fwd_min | float | Minimum IAT for forward (client-to-server) packets. |
iat_fwd_max | float | Maximum IAT for forward packets. |
iat_fwd_mean | float | Mean IAT for forward packets. High values indicate slow client typing or "think time". |
iat_fwd_std | float | Standard deviation of forward IAT. |
iat_fwd_median | float | Median forward IAT. |
iat_bwd_min | float | Minimum IAT for backward (server-to-client) packets. |
iat_bwd_max | float | Maximum IAT for backward packets. |
iat_bwd_mean | float | Mean IAT for backward packets. Low values indicate streaming server response. |
iat_bwd_std | float | Standard deviation of backward IAT. |
iat_bwd_median | float | Median backward IAT. |
Example interpretation: - SSH session: High iat_fwd_mean (human typing), low iat_bwd_mean (instant echo) - Video streaming: Low iat_bwd_mean (constant stream), high iat_fwd_mean (occasional ACKs)
Burstiness Features¶
Burstiness measures how irregular the traffic pattern is. The coefficient of variation (CV = std/mean) quantifies this.
| Feature | Type | Description |
|---|---|---|
burstiness_index | float | CV of IAT. Values near 0 = constant rate (streaming); values >1 = highly bursty (web browsing). |
burstiness_index_fwd | float | Burstiness of forward traffic. |
burstiness_index_bwd | float | Burstiness of backward traffic. |
Interpreting burstiness: - < 0.5: Very regular timing (constant bitrate streaming) - 0.5 - 1.0: Moderately variable (adaptive streaming) - > 1.0: Highly bursty (web requests, interactive) - > 2.0: Extremely variable (browsing with long pauses)
Burst and Idle Metrics¶
JoyfulJay identifies bursts (sequences of rapid packets) and idle periods (gaps between bursts).
| Feature | Type | Description |
|---|---|---|
burst_count | int | Number of distinct bursts in the flow. Many short bursts indicate interactive traffic. |
avg_burst_packets | float | Average packets per burst. Large bursts (>10 packets) indicate bulk transfers within each burst. |
avg_burst_duration | float | Average burst duration in seconds. |
max_burst_packets | int | Maximum packets in any single burst. |
idle_count | int | Number of idle periods (gaps exceeding threshold). |
avg_idle_duration | float | Average idle period duration in seconds. |
max_idle_duration | float | Maximum idle period. Very long idles may indicate keep-alive connections or user abandonment. |
first_response_time | float | Time from first packet to first response. Estimates RTT + server processing time. |
Burst threshold: Configurable via burst_threshold_ms (default: 50ms). Packets closer than this are considered part of the same burst.
Sequence Features (Optional)¶
When include_sequences=True, raw sequences are included for deep learning models:
| Feature | Type | Description |
|---|---|---|
iat_sequence | list[float] | Raw IAT sequence, padded/truncated to max_sequence_length. Useful for RNNs/LSTMs. |
timestamp_sequence | list[float] | Relative timestamps from flow start. |
SPLT Features (Optional)¶
When include_splt=True, the Sequence of Packet Lengths and Times format is included:
| Feature | Type | Description |
|---|---|---|
splt | list[tuple] | List of (length, IAT, direction) tuples. Direction: 1=forward, -1=backward. |
splt_lengths | list[int] | Just the packet lengths from SPLT. |
splt_times | list[float] | Just the IATs from SPLT. |
splt_directions | list[int] | Just the directions from SPLT. |
SPLT example:
# First 5 packets of a TLS handshake
[(512, 0.0, 1), # ClientHello, first packet (IAT=0)
(1400, 0.015, -1), # ServerHello, 15ms later
(100, 0.002, 1), # Client response, 2ms later
(1500, 0.001, -1), # Certificate, 1ms later
(50, 0.003, 1)] # Finished, 3ms later
Size Features¶
The size group analyzes packet size distributions. Size features are highly discriminative for application identification because different applications have characteristic packet sizes.
Why Size Matters¶
- DNS queries: Small packets (~100 bytes)
- HTTP requests: Medium packets (~500-1000 bytes)
- Bulk data transfer: Large packets (~1400-1500 bytes, near MTU)
- TLS handshakes: Specific sizes based on cipher suite
- Tor cells: Fixed 586-byte cells
Overall Packet Size Statistics¶
| Feature | Type | Description |
|---|---|---|
pkt_len_min | int | Minimum packet size. Often TCP ACK at ~40-60 bytes. |
pkt_len_max | int | Maximum packet size. Often near MTU (~1500) for bulk transfer. |
pkt_len_mean | float | Mean packet size. Small mean (<200) = interactive; large (>1000) = bulk transfer. |
pkt_len_std | float | Standard deviation. Low std = uniform sizes (padding/encryption). |
pkt_len_median | float | Median packet size. |
pkt_len_p25 | float | 25th percentile packet size. |
pkt_len_p75 | float | 75th percentile packet size. |
pkt_len_p90 | float | 90th percentile packet size. |
pkt_len_variance | float | Variance of packet sizes (std^2). |
Directional Size Statistics¶
| Feature | Type | Description |
|---|---|---|
pkt_len_fwd_min | int | Minimum forward packet size. |
pkt_len_fwd_max | int | Maximum forward packet size. |
pkt_len_fwd_mean | float | Mean forward packet size. Small = requests; large = uploads. |
pkt_len_fwd_std | float | Standard deviation of forward sizes. |
pkt_len_fwd_median | float | Median forward size. |
pkt_len_bwd_min | int | Minimum backward packet size. |
pkt_len_bwd_max | int | Maximum backward packet size. |
pkt_len_bwd_mean | float | Mean backward packet size. Large = downloads, responses. |
pkt_len_bwd_std | float | Standard deviation of backward sizes. |
pkt_len_bwd_median | float | Median backward size. |
Payload Statistics¶
Payload excludes protocol headers, measuring only application data.
| Feature | Type | Description |
|---|---|---|
payload_len_min | int | Minimum payload size (0 for pure ACKs). |
payload_len_max | int | Maximum payload size. |
payload_len_mean | float | Mean payload size. |
payload_len_std | float | Standard deviation of payload sizes. |
payload_len_fwd_mean | float | Mean forward payload size. |
payload_len_fwd_std | float | Std dev of forward payload. |
payload_len_bwd_mean | float | Mean backward payload size. |
payload_len_bwd_std | float | Std dev of backward payload. |
Payload Packet Counts¶
| Feature | Type | Description |
|---|---|---|
packets_with_payload | int | Packets containing application data (payload > 0). |
packets_with_payload_fwd | int | Forward packets with payload. |
packets_with_payload_bwd | int | Backward packets with payload. |
header_only_ratio | float | Fraction of packets without payload (pure ACKs, control packets). High ratio indicates acknowledgment-heavy flow. |
Distribution Analysis¶
| Feature | Type | Description |
|---|---|---|
dominant_pkt_size | int | Most common packet size (mode). |
dominant_pkt_ratio | float | Fraction of packets at dominant size. High values (>0.5) indicate fixed-size protocols like Tor. |
Layer 7 (Application) Statistics¶
L7 = Layer 7 = Application layer (payload only, no headers).
| Feature | Type | Description |
|---|---|---|
l7_bytes_fwd | int | Total application-layer bytes forward. |
l7_bytes_bwd | int | Total application-layer bytes backward. |
l7_bytes_total | int | Total application-layer bytes. |
l7_pkt_min_fwd | int | Minimum non-zero forward payload. |
l7_pkt_max_fwd | int | Maximum forward payload. |
l7_pkt_min_bwd | int | Minimum non-zero backward payload. |
l7_pkt_max_bwd | int | Maximum backward payload. |
Asymmetry Metrics¶
| Feature | Type | Description |
|---|---|---|
pkt_asymmetry | float | Packet count asymmetry: (fwd - bwd) / (fwd + bwd). Range -1 to +1. Positive = more client packets. |
byte_asymmetry | float | Byte count asymmetry. Positive = upload-heavy, negative = download-heavy. |
Asymmetry patterns: - ~0: Balanced exchange (chat, gaming) - < -0.5: Download-heavy (streaming, web) - > +0.5: Upload-heavy (backup, file upload)
TCP Features¶
The tcp group analyzes TCP-specific behavior. Essential for understanding connection health, detecting attacks, and identifying anomalies.
Why TCP Analysis Matters¶
TCP has a rich set of control flags that reveal connection state: - SYN: "I want to connect" - SYN-ACK: "OK, I accept" - ACK: "I received your data" - FIN: "I'm done, closing gracefully" - RST: "Abort! Something's wrong" - PSH: "Process this immediately"
Analyzing these flags reveals connection health, attack patterns, and application behavior.
Basic Detection¶
| Feature | Type | Description |
|---|---|---|
tcp_is_tcp | bool | True if this is a TCP flow (protocol=6). False for UDP, ICMP, etc. |
Flag Counts¶
| Feature | Type | Description |
|---|---|---|
tcp_syn_count | int | SYN packets (without ACK). Normal: 1. Multiple SYNs indicate retransmission or SYN flood attack. |
tcp_synack_count | int | SYN-ACK packets. Normal: 1. Should match successful connection acceptance. |
tcp_fin_count | int | FIN packets. Normal termination has 2 (one per direction). |
tcp_rst_count | int | RST packets. Any non-zero value indicates abnormal termination - connection refused, timeout, or error. |
tcp_ack_count | int | ACK packets. Most data packets have ACK flag set. |
tcp_psh_count | int | PSH (push) packets. Indicates application wants immediate delivery. High counts indicate interactive traffic. |
tcp_urg_count | int | URG (urgent) packets. Rarely used in modern applications. Non-zero may indicate legacy protocols or attacks. |
tcp_ece_count | int | ECN Echo. Indicates network congestion notification. |
tcp_cwr_count | int | Congestion Window Reduced. Response to ECE, indicates sender backed off. |
Flag Ratios¶
| Feature | Type | Description |
|---|---|---|
tcp_syn_ratio | float | Fraction of packets with SYN. Should be very low for normal flows. |
tcp_fin_ratio | float | Fraction of packets with FIN. |
tcp_rst_ratio | float | Fraction of packets with RST. High values indicate connection problems or attacks. |
tcp_psh_ratio | float | Fraction of packets with PSH. High values indicate interactive traffic. |
Connection State¶
| Feature | Type | Description |
|---|---|---|
tcp_initiator_syn | bool | True if initiator sent SYN (normal connection start). |
tcp_responder_synack | bool | True if responder sent SYN-ACK (server accepted connection). |
tcp_complete_handshake | bool | True if 3-way handshake completed (SYN + SYN-ACK seen). False indicates failed connection attempt. |
tcp_graceful_close | bool | True if both sides sent FIN (proper termination). |
tcp_reset_close | bool | True if connection was aborted via RST. |
Packet Categories¶
| Feature | Type | Description |
|---|---|---|
tcp_data_packets | int | Packets containing application data (payload > 0). |
tcp_ack_only_packets | int | Pure ACK packets (no data, no control flags). Used for flow control. |
Anomaly Detection¶
| Feature | Type | Description |
|---|---|---|
tcp_flags_anomaly | bool | True if anomalous flag combinations detected, indicating potential attacks: |
Detected anomalies: - Null scan: No flags set (used for OS fingerprinting) - XMAS scan: FIN+PSH+URG set (Christmas tree pattern) - SYN+FIN: Invalid combination (scan technique) - SYN+RST: Invalid combination (scan technique)
Aggregate Flags (Tranalyzer-compatible)¶
| Feature | Type | Description |
|---|---|---|
tcp_fstat | int | Flow status bitmap: bit 0=SYN, 1=SYN-ACK, 2=ACK beyond handshake, 3=FIN fwd, 4=FIN bwd, 5=RST, 6=data transferred, 7=anomaly. |
tcp_flags_agg | int | OR of all TCP flags seen in the flow. |
tcp_flags_fwd | int | OR of all TCP flags in forward direction. |
tcp_flags_bwd | int | OR of all TCP flags in backward direction. |
TLS Features¶
The tls group extracts metadata from TLS handshakes without decrypting traffic. This is essential for HTTPS analysis, certificate monitoring, and client fingerprinting.
How TLS Analysis Works¶
When a TLS connection starts, the client and server exchange unencrypted handshake messages that negotiate encryption parameters. JoyfulJay parses these messages to extract:
- ClientHello: Client's supported versions, ciphers, extensions
- ServerHello: Server's chosen parameters
- Certificate: Server's certificate chain
- Key Exchange: Cryptographic parameters
Detection and Version¶
| Feature | Type | Description |
|---|---|---|
tls_detected | bool | True if TLS handshake was detected. |
tls_version | int | TLS version as hex value: 0x0301=TLS 1.0, 0x0302=TLS 1.1, 0x0303=TLS 1.2, 0x0304=TLS 1.3. |
tls_version_str | str | Human-readable: "TLS 1.0", "TLS 1.1", "TLS 1.2", "TLS 1.3". |
Cipher Information¶
| Feature | Type | Description |
|---|---|---|
tls_cipher_suite | int | Selected cipher suite code from ServerHello. Identifies the encryption algorithm used. |
tls_cipher_count | int | Number of cipher suites offered in ClientHello. Browsers offer many (20+); specific apps offer few (3-5). Useful for fingerprinting. |
tls_extension_count | int | Number of TLS extensions. Part of the JA3 fingerprint. |
Server Name Indication (SNI)¶
| Feature | Type | Description |
|---|---|---|
tls_sni | str | Server Name Indication - the hostname the client is connecting to. Critical for identifying the destination website without decryption. Empty for IP-only connections. |
tls_alpn | str | Application-Layer Protocol Negotiation - the protocol being used. Common values: "h2" (HTTP/2), "http/1.1", "h3" (HTTP/3 over QUIC). |
SNI is extremely valuable: - Identifies which website the user is visiting - Essential for virtual hosting (multiple sites on one IP) - Used by firewalls for domain-based filtering
JA3 Fingerprinting¶
JA3 creates a fingerprint from the TLS ClientHello that identifies the client application. Different browsers, operating systems, and applications produce different JA3 hashes.
| Feature | Type | Description |
|---|---|---|
ja3_hash | str | MD5 hash of client fingerprint (32 hex characters). Same application typically produces the same JA3. |
ja3s_hash | str | JA3S - server fingerprint from ServerHello. |
JA3 composition:
JA3 applications: - Identify browsers vs. malware vs. bots - Detect TLS-based C2 communication - Fingerprint devices and applications - Detect anomalous client behavior
Certificate Metadata¶
| Feature | Type | Description |
|---|---|---|
tls_cert_count | int | Number of certificates in chain. 0 indicates session resumption (no certificate exchange needed). |
tls_cert_total_length | int | Total bytes of certificate data. Large values indicate long certificate chains. |
tls_cert_first_length | int | Size of the server's certificate (first in chain). |
tls_cert_chain_length | int | Certificate chain depth. |
Session Resumption¶
TLS session resumption allows clients to reconnect faster by reusing previous session parameters.
| Feature | Type | Description |
|---|---|---|
tls_session_id_len | int | Length of session ID. Non-zero indicates client requesting resumption. |
tls_session_ticket_ext | bool | True if session ticket extension present (TLS 1.2 resumption). |
tls_session_resumed | bool | True if session was actually resumed (no certificate exchange). |
tls_psk_ext | bool | True if Pre-Shared Key extension present (TLS 1.3 resumption). |
tls_early_data_ext | bool | True if 0-RTT early data extension present (TLS 1.3 optimization). |
Key Exchange¶
| Feature | Type | Description |
|---|---|---|
tls_key_exchange_group | int | Named group/curve ID. Common values: 0x001D=x25519, 0x0017=secp256r1 (P-256). |
tls_key_exchange_group_name | str | Human-readable name: "x25519", "secp256r1", "secp384r1", "ffdhe2048". |
tls_key_exchange_length | int | Key exchange data length in bits. |
tls_handshake_packets | int | Number of TLS handshake packets observed. |
QUIC Features¶
The quic group analyzes QUIC protocol traffic (HTTP/3). QUIC is Google's UDP-based protocol that combines TLS 1.3 with transport layer functionality.
| Feature | Type | Description |
|---|---|---|
quic_detected | bool | True if QUIC traffic detected. |
quic_version | int | QUIC version as 32-bit integer. |
quic_version_str | str | Human-readable: "QUIC v1", "QUIC v2", "QUIC draft-29". |
quic_dcid_len | int | Destination Connection ID length (bytes). Used for connection migration. |
quic_scid_len | int | Source Connection ID length (bytes). |
quic_pn_length | int | Packet Number length (1-4 bytes). |
quic_initial_packets | int | Count of Initial (handshake) packets. |
quic_0rtt_detected | bool | True if 0-RTT packets detected (session resumption). |
quic_retry_detected | bool | True if Retry packets detected (address validation). |
quic_spin_bit | bool | True if spin bit is being used for RTT measurement. |
quic_alpn | str | Application protocol: "h3", "h3-29", etc. |
quic_sni | str | Server Name Indication (if extracted from Initial packet). |
SSH Features¶
The ssh group analyzes SSH protocol traffic, extracting version information and HASSH fingerprints.
| Feature | Type | Description |
|---|---|---|
ssh_detected | bool | True if SSH traffic detected. |
ssh_version | str | SSH protocol version ("2.0"). |
ssh_client_software | str | Client software name (e.g., "OpenSSH_8.9", "PuTTY"). |
ssh_server_software | str | Server software name (e.g., "OpenSSH_9.1", "dropbear"). |
ssh_client_version | str | Client SSH version string. |
ssh_server_version | str | Server SSH version string. |
ssh_hassh | str | HASSH client fingerprint - MD5 hash of key exchange algorithms. Identifies SSH client implementations. |
ssh_hassh_server | str | HASSH server fingerprint. |
ssh_kex_packets | int | Number of key exchange packets. |
ssh_encrypted_packets | int | Number of encrypted data packets (after key exchange completes). |
HASSH is like JA3 for SSH - it fingerprints clients based on their cryptographic algorithm preferences.
Fingerprint Features¶
The fingerprint group detects specific traffic types like Tor, VPN, and DNS-over-HTTPS based on behavioral patterns.
Tor Detection¶
Tor traffic has distinctive characteristics: - Fixed-size cells (~586 bytes) - Low packet size variance - Specific timing patterns - Known ports (443, 9001, 9030)
| Feature | Type | Description |
|---|---|---|
likely_tor | bool | True if traffic pattern matches Tor. |
tor_confidence | float | Confidence score (0.0-1.0). |
VPN Detection¶
Different VPN protocols have unique signatures: - OpenVPN: UDP/TCP port 1194, specific header patterns - WireGuard: UDP port 51820, small handshake - IPSec: ESP (protocol 50), IKE (UDP 500/4500) - L2TP: UDP port 1701
| Feature | Type | Description |
|---|---|---|
likely_vpn | bool | True if VPN traffic detected. |
vpn_confidence | float | Confidence score (0.0-1.0). |
vpn_type | str | Detected type: "openvpn", "wireguard", "ipsec-esp", "ipsec-ah", "ipsec-ike", "l2tp", or empty. |
DNS-over-HTTPS (DoH) Detection¶
DoH encrypts DNS queries in HTTPS, making them harder to monitor. Detection uses: - Known DoH provider SNIs (dns.google, cloudflare-dns.com) - Short flow duration - Small payload sizes typical of DNS
| Feature | Type | Description |
|---|---|---|
likely_doh | bool | True if DNS-over-HTTPS detected. |
doh_confidence | float | Confidence score (0.0-1.0). |
Overall Classification¶
| Feature | Type | Description |
|---|---|---|
traffic_type | str | Classification: "tor", "vpn:wireguard", "vpn:openvpn", "doh", or "encrypted". |
Entropy Features¶
The entropy group analyzes payload randomness to distinguish encrypted, compressed, and plaintext traffic.
Understanding Entropy¶
Shannon entropy measures randomness on a scale of 0-8 bits per byte: - ~8 bits: Maximum entropy - encrypted or compressed data - ~5-7 bits: High entropy - mixed binary data - ~4-5 bits: Medium - text with structure - ~1-3 bits: Low - repetitive patterns, sparse data
| Feature | Type | Description |
|---|---|---|
entropy_payload | float | Shannon entropy of payload (0-8 bits/byte). ~8 = encrypted/compressed, ~4-6 = text. |
entropy_initiator | float | Entropy of initiator's payload. |
entropy_responder | float | Entropy of responder's payload. |
entropy_ratio | float | Ratio of initiator to responder entropy. |
byte_distribution_uniformity | float | How uniform the byte distribution is (0-1). 1 = perfectly uniform (encrypted). |
printable_ratio | float | Fraction of printable ASCII bytes (32-126). High values indicate text. |
null_ratio | float | Fraction of null bytes (0x00). Binary formats often have many nulls. |
high_byte_ratio | float | Fraction of bytes in range 128-255. High values indicate binary or encrypted data. |
payload_bytes_sampled | int | Number of bytes analyzed (limited by entropy_sample_bytes config). |
Padding Features¶
The padding group detects traffic obfuscation and padding techniques used to hide traffic patterns.
Size Distribution¶
| Feature | Type | Description |
|---|---|---|
pkt_size_variance | float | Variance of packet sizes. Low variance indicates fixed-size padding. |
pkt_size_cv | float | Coefficient of variation (std/mean). |
is_constant_size | bool | True if >95% of packets have the same size. |
dominant_size_mode | int | Most common packet size. |
dominant_size_ratio | float | Fraction at dominant size. |
unique_size_count | int | Number of unique sizes. Low = padding. |
size_entropy | float | Normalized entropy of size distribution. |
Timing Analysis¶
| Feature | Type | Description |
|---|---|---|
iat_variance | float | Variance of inter-arrival times. |
iat_cv | float | CV of IAT. Low values indicate constant-rate traffic shaping. |
is_constant_rate | bool | True if traffic is constant-rate (CV < 0.1). |
Tor Cell Detection¶
| Feature | Type | Description |
|---|---|---|
tor_cell_count | int | Packets matching Tor cell size (580-600 bytes). |
tor_cell_ratio | float | Fraction of Tor-cell-sized packets. |
is_tor_like | bool | True if pattern strongly suggests Tor. |
Burst Analysis¶
| Feature | Type | Description |
|---|---|---|
burst_padding_ratio | float | Overhead (header) bytes as fraction of burst. |
burst_overhead_bytes | int | Total header bytes in bursts. |
avg_burst_payload_efficiency | float | Payload/total ratio per burst. Low efficiency suggests padding. |
padding_score | float | Combined padding score (0-1). Higher = more likely padded/obfuscated. |
Connection Features¶
The connection group provides graph-based analysis across all flows. Requires: pip install joyfuljay[graphs]
Simple Metrics¶
| Feature | Type | Description |
|---|---|---|
conn_src_unique_dsts | int | Unique destinations from source. High = scanning behavior. |
conn_dst_unique_srcs | int | Unique sources to destination. High = server endpoint. |
conn_src_dst_flows | int | Flows between this exact pair. |
conn_src_port_flows | int | Flows to this destination port. |
conn_src_total_flows | int | Total outbound flows from source. |
conn_dst_total_flows | int | Total inbound flows to destination. |
conn_src_total_packets | int | Total packets sent by source. |
conn_src_total_bytes | int | Total bytes sent by source. |
conn_dst_total_packets | int | Total packets received by destination. |
conn_dst_total_bytes | int | Total bytes received by destination. |
conn_src_unique_ports | int | Unique destination ports from source. |
Graph Metrics (NetworkX)¶
| Feature | Type | Description |
|---|---|---|
conn_src_out_degree | int | Out-degree in connection graph. |
conn_dst_in_degree | int | In-degree in connection graph. |
conn_src_betweenness | float | Betweenness centrality (0-1). High = bridge/gateway node. |
conn_dst_betweenness | float | Betweenness of destination. |
conn_src_community | int | Community ID of source. |
conn_dst_community | int | Community ID of destination. |
conn_same_community | bool | True if same community. |
conn_src_clustering | float | Clustering coefficient of source. |
conn_dst_clustering | float | Clustering coefficient of destination. |
Feature Selection for ML¶
Traffic Classification (App Identification)¶
config = jj.Config(
features=["flow_meta", "timing", "size", "tls"],
bidirectional_split=True,
)
Encrypted Traffic Detection (Tor, VPN)¶
Anomaly Detection¶
config = jj.Config(
features=["timing", "size", "entropy", "tcp"],
include_raw_sequences=True,
)
Most Discriminative Features¶
Based on research and experiments:
- Timing:
iat_mean,iat_std,burstiness_index - Size:
pkt_len_mean,pkt_len_std,pkt_asymmetry - TLS:
ja3_hash,tls_sni,tls_cipher_count - Connection:
tcp_complete_handshake,duration
See Also¶
- Extractors Reference - Per-extractor documentation
- Configuration - All configuration options
- Tutorials - Step-by-step guides